LinuxLists.cc - [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-02-26 00:24:51

by Shiyang Ruan

Subject: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

This patchset is an attempt to add CoW support for fsdax, using XFS,
which has both the reflink and fsdax features, as an example.

Changes from V1:
- Factor some helper functions to simplify dax fault code
- Introduce iomap_apply2() for dax_dedupe_file_range_compare()
- Fix mistakes and other problems
- Rebased on v5.11

One of the key mechanisms that needs to be implemented in fsdax is CoW:
copy the data from srcmap before we actually write data to the
destination iomap. We only copy the range in which data won't be changed.

Another mechanism is range comparison. In the page cache case, readpage()
is used to load data from disk into the page cache so that it can be
compared. In the fsdax case, readpage() does not work, so we need
another way to compare data, with direct access support.
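
The comparison described above can be modeled in user space. This is a
minimal sketch, assuming each file range is already reachable through a
direct pointer (roughly what dax_direct_access() provides in the kernel).
struct range_view and ranges_equal() are hypothetical names used only for
illustration; they are not part of this patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a mapped extent: a NULL addr models a hole. */
struct range_view {
	const unsigned char *addr;	/* directly addressable data, or NULL */
	size_t len;			/* number of valid bytes at addr */
};

/*
 * Compare len bytes of two ranges without going through a page cache.
 * Two holes compare equal; a hole never equals mapped data, which is the
 * same rule the compare actor in patch 08 applies.
 */
static bool ranges_equal(const struct range_view *a,
			 const struct range_view *b, size_t len)
{
	if (!a->addr && !b->addr)
		return true;
	if (!a->addr || !b->addr)
		return false;

	/* The caller must not ask for more bytes than either view holds. */
	assert(len <= a->len && len <= b->len);
	return memcmp(a->addr, b->addr, len) == 0;
}
```

No data is loaded or copied before the comparison; the bytes are read in
place, which is the property the fsdax path relies on.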

With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.

Some of the patches are picked up from Goldwyn's patchset. I made some
changes to adapt them to this patchset.

(Rebased on v5.11)
==

Shiyang Ruan (10):
fsdax: Factor helpers to simplify dax fault code
fsdax: Factor helper: dax_fault_actor()
fsdax: Output address in dax_iomap_pfn() and rename it
fsdax: Introduce dax_iomap_cow_copy()
fsdax: Replace mmap entry in case of CoW
fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
iomap: Introduce iomap_apply2() for operations on two files
fsdax: Dedup file range to use a compare function
fs/xfs: Handle CoW for fsdax write() path
fs/xfs: Add dedupe support for fsdax

fs/dax.c | 532 +++++++++++++++++++++++++++--------------
fs/iomap/apply.c | 51 ++++
fs/iomap/buffered-io.c | 2 +-
fs/remap_range.c | 45 +++-
fs/xfs/xfs_bmap_util.c | 3 +-
fs/xfs/xfs_file.c | 29 ++-
fs/xfs/xfs_inode.c | 8 +-
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_iomap.c | 30 ++-
fs/xfs/xfs_iomap.h | 1 +
fs/xfs/xfs_iops.c | 11 +-
fs/xfs/xfs_reflink.c | 16 +-
include/linux/dax.h | 7 +-
include/linux/fs.h | 15 +-
include/linux/iomap.h | 7 +-
15 files changed, 550 insertions(+), 208 deletions(-)

--
2.30.1

2021-02-26 00:26:32

by Shiyang Ruan

Subject: [PATCH v2 02/10] fsdax: Factor helper: dax_fault_actor()

The core logic in the two dax page fault functions is similar. So, move
the logic into a common helper function. Also, to facilitate the
addition of new features, such as CoW, switch-case is no longer used to
handle different iomap types.
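
The order of checks that replaces the switch-case can be modeled as a
small pure function. This is only a sketch of the decision flow as it
stands in this patch (before the CoW changes later in the series); the
enum names and classify_fault() are hypothetical, and the kernel works
with vm_fault_t values rather than these labels.

```c
#include <stdbool.h>

/* Simplified model of the iomap types the fault path sees. */
enum map_type { MAP_HOLE, MAP_UNWRITTEN, MAP_MAPPED };

/* Hypothetical outcome labels for illustration. */
enum fault_action { ACT_LOAD_HOLE, ACT_INSERT_PFN, ACT_SIGBUS };

/*
 * Mirror the order of checks in dax_fault_actor():
 * 1. a read of a hole or unwritten extent maps a zero page;
 * 2. anything else that is not MAPPED is treated as an error;
 * 3. otherwise the pfn is looked up and inserted.
 */
static enum fault_action classify_fault(enum map_type type, bool write)
{
	if (!write && (type == MAP_UNWRITTEN || type == MAP_HOLE))
		return ACT_LOAD_HOLE;
	if (type != MAP_MAPPED)
		return ACT_SIGBUS;
	return ACT_INSERT_PFN;
}
```

Because the checks are sequential, a later patch can relax the second
condition (for example for shared extents) without touching the others.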

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 211 ++++++++++++++++++++++++++++++-------------------------
1 file changed, 117 insertions(+), 94 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7031e4302b13..9dea1572868e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1289,6 +1289,93 @@ static int dax_fault_cow_page(struct vm_fault *vmf, struct iomap *iomap,
return 0;
}

+static vm_fault_t dax_fault_insert_pfn(struct vm_fault *vmf, pfn_t pfn,
+ bool pmd, bool write)
+{
+ vm_fault_t ret;
+
+ if (!pmd) {
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long address = vmf->address;
+
+ if (write)
+ ret = vmf_insert_mixed_mkwrite(vma, address, pfn);
+ else
+ ret = vmf_insert_mixed(vma, address, pfn);
+ } else
+ ret = vmf_insert_pfn_pmd(vmf, pfn, write);
+
+ return ret;
+}
+
+#ifdef CONFIG_FS_DAX_PMD
+static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
+ struct iomap *iomap, void **entry);
+#else
+static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
+ struct iomap *iomap, void **entry)
+{
+ return VM_FAULT_FALLBACK;
+}
+#endif
+
+/**
+ * dax_fault_actor - Common actor to handle pfn insertion in PTE/PMD fault.
+ * @vmf: vm fault instance
+ * @pfnp: pfn to be returned
+ * @xas: the dax mapping tree of a file
+ * @entry: an unlocked dax entry to be inserted
+ * @pmd: distinguish whether it is a pmd fault
+ * @flags: iomap flags
+ * @iomap: from iomap_begin()
+ * @srcmap: from iomap_begin(), not equal to iomap if it is a CoW
+ */
+static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
+ struct xa_state *xas, void *entry, bool pmd, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
+ loff_t pos = (loff_t)xas->xa_offset << PAGE_SHIFT;
+ bool write = vmf->flags & FAULT_FLAG_WRITE;
+ bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
+ vm_fault_t ret = 0;
+ int err = 0;
+ pfn_t pfn;
+
+ /* if we are reading UNWRITTEN and HOLE, return a hole. */
+ if (!write &&
+ (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
+ if (!pmd)
+ return dax_load_hole(xas, mapping, &entry, vmf);
+ else
+ return dax_pmd_load_hole(xas, vmf, iomap, &entry);
+ }
+
+ if (iomap->type != IOMAP_MAPPED) {
+ WARN_ON_ONCE(1);
+ return VM_FAULT_SIGBUS;
+ }
+
+ err = dax_iomap_pfn(iomap, pos, size, &pfn);
+ if (err)
+ goto error_fault;
+
+ entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
+ write && !sync);
+
+ if (sync)
+ return dax_fault_synchronous_pfnp(pfnp, pfn);
+
+ ret = dax_fault_insert_pfn(vmf, pfn, pmd, write);
+
+error_fault:
+ if (err)
+ ret = dax_fault_return(err);
+
+ return ret;
+}
+
static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
int *iomap_errp, const struct iomap_ops *ops)
{
@@ -1296,17 +1383,14 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
struct address_space *mapping = vma->vm_file->f_mapping;
XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
struct inode *inode = mapping->host;
- unsigned long vaddr = vmf->address;
loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
struct iomap iomap = { .type = IOMAP_HOLE };
struct iomap srcmap = { .type = IOMAP_HOLE };
unsigned flags = IOMAP_FAULT;
int error, major = 0;
bool write = vmf->flags & FAULT_FLAG_WRITE;
- bool sync;
vm_fault_t ret = 0;
void *entry;
- pfn_t pfn;

trace_dax_pte_fault(inode, vmf, ret);
/*
@@ -1352,8 +1436,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
goto unlock_entry;
}
if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
- error = -EIO; /* fs corruption? */
- goto error_finish_iomap;
+ ret = VM_FAULT_SIGBUS; /* fs corruption? */
+ goto finish_iomap;
}

if (vmf->cow_page) {
@@ -1363,49 +1447,19 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
goto finish_iomap;
}

- sync = dax_fault_is_synchronous(flags, vma, &iomap);
-
- switch (iomap.type) {
- case IOMAP_MAPPED:
- if (iomap.flags & IOMAP_F_NEW) {
- count_vm_event(PGMAJFAULT);
- count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
- major = VM_FAULT_MAJOR;
- }
- error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
- if (error < 0)
- goto error_finish_iomap;
-
- entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
- 0, write && !sync);
-
- if (sync) {
- ret = dax_fault_synchronous_pfnp(pfnp, pfn);
- goto finish_iomap;
- }
- trace_dax_insert_mapping(inode, vmf, entry);
- if (write)
- ret = vmf_insert_mixed_mkwrite(vma, vaddr, pfn);
- else
- ret = vmf_insert_mixed(vma, vaddr, pfn);
-
+ ret = dax_fault_actor(vmf, pfnp, &xas, entry, false, flags,
+ &iomap, &srcmap);
+ if (ret == VM_FAULT_SIGBUS)
goto finish_iomap;
- case IOMAP_UNWRITTEN:
- case IOMAP_HOLE:
- if (!write) {
- ret = dax_load_hole(&xas, mapping, &entry, vmf);
- goto finish_iomap;
- }
- fallthrough;
- default:
- WARN_ON_ONCE(1);
- error = -EIO;
- break;
+
+ /* read/write MAPPED, CoW UNWRITTEN */
+ if (iomap.flags & IOMAP_F_NEW) {
+ count_vm_event(PGMAJFAULT);
+ count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+ major = VM_FAULT_MAJOR;
}

- error_finish_iomap:
- ret = dax_fault_return(error);
- finish_iomap:
+finish_iomap:
if (ops->iomap_end) {
int copied = PAGE_SIZE;

@@ -1419,9 +1473,9 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
*/
ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
}
- unlock_entry:
+unlock_entry:
dax_unlock_entry(&xas, entry);
- out:
+out:
trace_dax_pte_fault_done(inode, vmf, ret);
return ret | major;
}
@@ -1519,17 +1573,15 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
struct address_space *mapping = vma->vm_file->f_mapping;
XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_ORDER);
bool write = vmf->flags & FAULT_FLAG_WRITE;
- bool sync;
- unsigned int iomap_flags = (write ? IOMAP_WRITE : 0) | IOMAP_FAULT;
+ unsigned int flags = (write ? IOMAP_WRITE : 0) | IOMAP_FAULT;
struct inode *inode = mapping->host;
- vm_fault_t result = VM_FAULT_FALLBACK;
+ vm_fault_t ret = VM_FAULT_FALLBACK;
struct iomap iomap = { .type = IOMAP_HOLE };
struct iomap srcmap = { .type = IOMAP_HOLE };
pgoff_t max_pgoff;
void *entry;
loff_t pos;
int error;
- pfn_t pfn;

/*
* Check whether offset isn't beyond end of file now. Caller is
@@ -1541,7 +1593,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
trace_dax_pmd_fault(inode, vmf, max_pgoff, 0);

if (xas.xa_index >= max_pgoff) {
- result = VM_FAULT_SIGBUS;
+ ret = VM_FAULT_SIGBUS;
goto out;
}

@@ -1556,7 +1608,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
*/
entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
if (xa_is_internal(entry)) {
- result = xa_to_internal(entry);
+ ret = xa_to_internal(entry);
goto fallback;
}

@@ -1568,7 +1620,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
*/
if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd) &&
!pmd_devmap(*vmf->pmd)) {
- result = 0;
+ ret = 0;
goto unlock_entry;
}

@@ -1578,49 +1630,21 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
* to look up our filesystem block.
*/
pos = (loff_t)xas.xa_index << PAGE_SHIFT;
- error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap,
- &srcmap);
+ error = ops->iomap_begin(inode, pos, PMD_SIZE, flags, &iomap, &srcmap);
if (error)
goto unlock_entry;

if (iomap.offset + iomap.length < pos + PMD_SIZE)
goto finish_iomap;

- sync = dax_fault_is_synchronous(iomap_flags, vma, &iomap);
-
- switch (iomap.type) {
- case IOMAP_MAPPED:
- error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
- if (error < 0)
- goto finish_iomap;
+ ret = dax_fault_actor(vmf, pfnp, &xas, entry, true, flags,
+ &iomap, &srcmap);

- entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
- DAX_PMD, write && !sync);
-
- if (sync) {
- result = dax_fault_synchronous_pfnp(pfnp, pfn);
- goto finish_iomap;
- }
-
- trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, entry);
- result = vmf_insert_pfn_pmd(vmf, pfn, write);
- break;
- case IOMAP_UNWRITTEN:
- case IOMAP_HOLE:
- if (WARN_ON_ONCE(write))
- break;
- result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
- break;
- default:
- WARN_ON_ONCE(1);
- break;
- }
-
- finish_iomap:
+finish_iomap:
if (ops->iomap_end) {
int copied = PMD_SIZE;

- if (result == VM_FAULT_FALLBACK)
+ if (ret == VM_FAULT_FALLBACK)
copied = 0;
/*
* The fault is done by now and there's no way back (other
@@ -1628,19 +1652,18 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
* Just ignore error from ->iomap_end since we cannot do much
* with it.
*/
- ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
- &iomap);
+ ops->iomap_end(inode, pos, PMD_SIZE, copied, flags, &iomap);
}
- unlock_entry:
+unlock_entry:
dax_unlock_entry(&xas, entry);
- fallback:
- if (result == VM_FAULT_FALLBACK) {
+fallback:
+ if (ret == VM_FAULT_FALLBACK) {
split_huge_pmd(vma, vmf->pmd, vmf->address);
count_vm_event(THP_FAULT_FALLBACK);
}
out:
- trace_dax_pmd_fault_done(inode, vmf, max_pgoff, result);
- return result;
+ trace_dax_pmd_fault_done(inode, vmf, max_pgoff, ret);
+ return ret;
}
#else
static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
--
2.30.1

2021-02-26 00:26:36

by Shiyang Ruan

Subject: [PATCH v2 03/10] fsdax: Output address in dax_iomap_pfn() and rename it

Add an address output to dax_iomap_pfn() so that a memcpy() can be
performed in the CoW case. Since this function now outputs both the
address and the pfn, rename it to dax_iomap_direct_access().

Signed-off-by: Shiyang Ruan <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/dax.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9dea1572868e..1459ef4095fb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -997,8 +997,8 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
}

-static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
- pfn_t *pfnp)
+static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
+ void **kaddr, pfn_t *pfnp)
{
const sector_t sector = dax_iomap_sector(iomap, pos);
pgoff_t pgoff;
@@ -1010,11 +1010,13 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
return rc;
id = dax_read_lock();
length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
- NULL, pfnp);
+ kaddr, pfnp);
if (length < 0) {
rc = length;
goto out;
}
+ if (!pfnp)
+ goto out_check_addr;
rc = -EINVAL;
if (PFN_PHYS(length) < size)
goto out;
@@ -1024,6 +1026,12 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
if (length > 1 && !pfn_t_devmap(*pfnp))
goto out;
rc = 0;
+
+out_check_addr:
+ if (!kaddr)
+ goto out;
+ if (!*kaddr)
+ rc = -EFAULT;
out:
dax_read_unlock(id);
return rc;
@@ -1357,7 +1365,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
return VM_FAULT_SIGBUS;
}

- err = dax_iomap_pfn(iomap, pos, size, &pfn);
+ err = dax_iomap_direct_access(iomap, pos, size, NULL, &pfn);
if (err)
goto error_fault;

--
2.30.1

2021-02-26 00:27:07

by Shiyang Ruan

Subject: [PATCH v2 04/10] fsdax: Introduce dax_iomap_cow_copy()

In the case where the operation is a write and iomap is not equal
to srcmap after iomap_begin, we consider it a CoW operation.

The destination extent that iomap indicates is a newly allocated extent,
so the data needs to be copied from srcmap to it.
In theory, it is better to copy only the head and tail ranges that lie
outside the unaligned write area instead of copying the whole aligned
range. But in a dax page fault the range is always aligned, so
we have to copy the whole range in that case.
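
The head and tail arithmetic can be tried out in user space. This is a
minimal sketch of the edge-copy case only, assuming dst and src both point
at the start of the aligned block that contains pos; cow_copy_edges() is a
hypothetical name, and plain memcpy() stands in for the kernel's
copy_mc_to_kernel().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy only the bytes of an aligned block that a write of `length` bytes
 * at file offset `pos` will NOT overwrite. `align` must be a power of two.
 * A NULL `src` models a source with no data, so the edges are zero-filled.
 */
static void cow_copy_edges(unsigned char *dst, const unsigned char *src,
			   size_t pos, size_t length, size_t align)
{
	size_t head_off, tail_off, block_len;

	assert(align != 0 && (align & (align - 1)) == 0);

	head_off = pos & (align - 1);		/* bytes before the write */
	tail_off = head_off + length;		/* first byte after the write */
	block_len = (tail_off + align - 1) & ~(align - 1);

	/* Head: everything from the block start up to the write. */
	if (head_off) {
		if (src)
			memcpy(dst, src, head_off);
		else
			memset(dst, 0, head_off);
	}

	/* Tail: everything from the end of the write to the block end. */
	if (tail_off < block_len) {
		size_t tail_len = block_len - tail_off;

		if (src)
			memcpy(dst + tail_off, src + tail_off, tail_len);
		else
			memset(dst + tail_off, 0, tail_len);
	}
}
```

With an 8-byte alignment, a 3-byte write at offset 10 leaves bytes 0-1 and
5-7 of the block to be copied from the source. In the page fault case both
edges are empty because the range is aligned, which is why the patch
copies the whole range there instead.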

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 68 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1459ef4095fb..748dfb89fb41 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1037,6 +1037,53 @@ static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
return rc;
}

+/*
+ * Copy the head and tail part of the pages not included in the write but
+ * required for CoW, because pos/pos+length are not page aligned. But in dax
+ * page fault case, the range is page aligned, we need to copy the whole range
+ * of data. Use copy_edge to distinguish these cases.
+ */
+static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size,
+ struct iomap *srcmap, void *daddr, bool copy_edge)
+{
+ loff_t head_off = pos & (align_size - 1);
+ size_t size = ALIGN(head_off + length, align_size);
+ loff_t end = pos + length;
+ loff_t pg_end = round_up(end, align_size);
+ void *saddr = 0;
+ int ret = 0;
+
+ ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+ if (ret)
+ return ret;
+
+ if (!copy_edge) {
+ ret = copy_mc_to_kernel(daddr, saddr, length);
+ return ret;
+ }
+
+ /* Copy the head part of the page. Note: we pass offset as length. */
+ if (head_off) {
+ if (saddr)
+ ret = copy_mc_to_kernel(daddr, saddr, head_off);
+ else
+ memset(daddr, 0, head_off);
+ }
+ /* Copy the tail part of the range */
+ if (end < pg_end) {
+ loff_t tail_off = head_off + length;
+ loff_t tail_len = pg_end - end;
+
+ if (saddr)
+ ret = copy_mc_to_kernel(daddr + tail_off,
+ saddr + tail_off, tail_len);
+ else
+ memset(daddr + tail_off, 0, tail_len);
+ }
+
+ return ret;
+}
+
/*
* The user has performed a load from a hole in the file. Allocating a new
* page in the file would cause excessive storage usage for workloads with
@@ -1106,11 +1153,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct dax_device *dax_dev = iomap->dax_dev;
struct iov_iter *iter = data;
loff_t end = pos + length, done = 0;
+ bool write = iov_iter_rw(iter) == WRITE;
ssize_t ret = 0;
size_t xfer;
int id;

- if (iov_iter_rw(iter) == READ) {
+ if (!write) {
end = min(end, i_size_read(inode));
if (pos >= end)
return 0;
@@ -1119,7 +1167,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
return iov_iter_zero(min(length, end - pos), iter);
}

- if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+ if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
+ !(iomap->flags & IOMAP_F_SHARED)))
return -EIO;

/*
@@ -1158,6 +1207,13 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
break;
}

+ if (write && srcmap->addr != iomap->addr) {
+ ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
+ kaddr, true);
+ if (ret)
+ break;
+ }
+
map_len = PFN_PHYS(map_len);
kaddr += offset;
map_len -= offset;
@@ -1169,7 +1225,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
* validated via access_ok() in either vfs_read() or
* vfs_write(), depending on which operation we are doing.
*/
- if (iov_iter_rw(iter) == WRITE)
+ if (write)
xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
map_len, iter);
else
@@ -1350,6 +1406,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
vm_fault_t ret = 0;
int err = 0;
pfn_t pfn;
+ void *kaddr;

/* if we are reading UNWRITTEN and HOLE, return a hole. */
if (!write &&
@@ -1360,18 +1417,24 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
return dax_pmd_load_hole(xas, vmf, iomap, &entry);
}

- if (iomap->type != IOMAP_MAPPED) {
+ if (iomap->type != IOMAP_MAPPED && !(iomap->flags & IOMAP_F_SHARED)) {
WARN_ON_ONCE(1);
return VM_FAULT_SIGBUS;
}

- err = dax_iomap_direct_access(iomap, pos, size, NULL, &pfn);
+ err = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
if (err)
goto error_fault;

entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
write && !sync);

+ if (write && srcmap->addr != iomap->addr) {
+ err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr, false);
+ if (err)
+ goto error_fault;
+ }
+
if (sync)
return dax_fault_synchronous_pfnp(pfnp, pfn);

--
2.30.1

2021-02-26 00:28:39

by Shiyang Ruan

Subject: [PATCH v2 06/10] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero

Punching a hole in a reflinked file needs dax_iomap_cow_copy() too.
Otherwise, data in the unaligned area will be incorrect. So, add the
srcmap to dax_iomap_zero() and replace memset() with dax_iomap_cow_copy().

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 6 ++++--
fs/iomap/buffered-io.c | 2 +-
include/linux/dax.h | 3 ++-
3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ec4b733e0b59..4f6c6ba68e6f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1116,7 +1116,8 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
return ret;
}

-s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
+s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
+ struct iomap *srcmap)
{
sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
pgoff_t pgoff;
@@ -1146,7 +1147,8 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
}

if (!page_aligned) {
- memset(kaddr + offset, 0, size);
+ dax_iomap_cow_copy(offset, size, PAGE_SIZE, srcmap,
+ kaddr, true);
dax_flush(iomap->dax_dev, kaddr + offset, size);
}
dax_read_unlock(id);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 16a1e82e3aeb..d754b1f1a05d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -933,7 +933,7 @@ static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
s64 bytes;

if (IS_DAX(inode))
- bytes = dax_iomap_zero(pos, length, iomap);
+ bytes = dax_iomap_zero(pos, length, iomap, srcmap);
else
bytes = iomap_zero(inode, pos, length, iomap, srcmap);
if (bytes < 0)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..3275e01ed33d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -237,7 +237,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
-s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
+s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
+ struct iomap *srcmap);
static inline bool dax_mapping(struct address_space *mapping)
{
return mapping->host && IS_DAX(mapping->host);
--
2.30.1

2021-02-26 00:29:13

by Shiyang Ruan

Subject: [PATCH v2 07/10] iomap: Introduce iomap_apply2() for operations on two files

Some operations, such as comparing a range of data in two files under
fsdax mode, require nested iomap_begin()/iomap_end() on two files. Thus,
we introduce iomap_apply2() to accept arguments from two files and
iomap_actor2_t for actions on two files.
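
The nesting can be illustrated with a user-space model. This is a sketch
under stated assumptions, not the posted implementation: struct map_ops,
apply2() and the demo_* callbacks are hypothetical, the mappings
themselves are omitted, and this version pairs end() only with a
successful begin() and returns the actor's result, which is a design
choice of the sketch.

```c
#include <stddef.h>

/* Hypothetical, simplified counterpart of struct iomap_ops. */
struct map_ops {
	int (*begin)(void *file, long pos, long len);
	void (*end)(void *file, long pos, long len);
};

typedef long (*actor2_fn)(void *f1, long pos1, void *f2, long pos2, long len);

/*
 * Map a range in each of two files, run the actor while both mappings are
 * held, then release them in reverse order of acquisition.
 */
static long apply2(const struct map_ops *ops, void *f1, long pos1,
		   void *f2, long pos2, long len, actor2_fn actor)
{
	long ret;

	ret = ops->begin(f1, pos1, len);
	if (ret)
		return ret;

	ret = ops->begin(f2, pos2, len);
	if (ret)
		goto out_first;

	ret = actor(f1, pos1, f2, pos2, len);

	if (ops->end)
		ops->end(f2, pos2, len);
out_first:
	if (ops->end)
		ops->end(f1, pos1, len);
	return ret;
}

/* Demo callbacks: count how many mappings are currently held. */
static int open_maps;

static int demo_begin(void *file, long pos, long len)
{
	(void)pos;
	(void)len;
	if (!file)
		return -1;	/* model a failed lookup */
	open_maps++;
	return 0;
}

static void demo_end(void *file, long pos, long len)
{
	(void)file;
	(void)pos;
	(void)len;
	open_maps--;
}

static long demo_actor(void *f1, long pos1, void *f2, long pos2, long len)
{
	(void)f1;
	(void)pos1;
	(void)f2;
	(void)pos2;
	return open_maps == 2 ? len : -1;	/* both ranges must be mapped */
}

static const struct map_ops demo_ops = { demo_begin, demo_end };
```

The demo actor only succeeds while both mappings are held, and the counter
returns to zero on every path, including when the second begin() fails.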

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/iomap/apply.c | 51 +++++++++++++++++++++++++++++++++++++++++++
include/linux/iomap.h | 7 +++++-
2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 26ab6563181f..fd2f8bde5791 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -97,3 +97,54 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,

return written ? written : ret;
}
+
+loff_t
+iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
+ loff_t length, unsigned int flags, const struct iomap_ops *ops,
+ void *data, iomap_actor2_t actor)
+{
+ struct iomap smap = { .type = IOMAP_HOLE };
+ struct iomap dmap = { .type = IOMAP_HOLE };
+ loff_t written = 0, ret;
+
+ ret = ops->iomap_begin(ino1, pos1, length, 0, &smap, NULL);
+ if (ret)
+ goto out_src;
+ if (WARN_ON(smap.offset > pos1)) {
+ written = -EIO;
+ goto out_src;
+ }
+ if (WARN_ON(smap.length == 0)) {
+ written = -EIO;
+ goto out_src;
+ }
+
+ ret = ops->iomap_begin(ino2, pos2, length, 0, &dmap, NULL);
+ if (ret)
+ goto out_dest;
+ if (WARN_ON(dmap.offset > pos2)) {
+ written = -EIO;
+ goto out_dest;
+ }
+ if (WARN_ON(dmap.length == 0)) {
+ written = -EIO;
+ goto out_dest;
+ }
+
+ /* make sure extent length of two file is equal */
+ if (WARN_ON(smap.length != dmap.length)) {
+ written = -EIO;
+ goto out_dest;
+ }
+
+ written = actor(ino1, pos1, ino2, pos2, length, data, &smap, &dmap);
+
+out_dest:
+ if (ops->iomap_end)
+ ret = ops->iomap_end(ino2, pos2, length, 0, 0, &dmap);
+out_src:
+ if (ops->iomap_end)
+ ret = ops->iomap_end(ino1, pos1, length, 0, 0, &smap);
+
+ return ret;
+}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5bd3cac4df9c..913f98897a77 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -148,10 +148,15 @@ struct iomap_ops {
*/
typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
void *data, struct iomap *iomap, struct iomap *srcmap);
-
+typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1,
+ struct inode *ino2, loff_t pos2, loff_t len, void *data,
+ struct iomap *smap, struct iomap *dmap);
loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
unsigned flags, const struct iomap_ops *ops, void *data,
iomap_actor_t actor);
+loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2,
+ loff_t pos2, loff_t length, unsigned int flags,
+ const struct iomap_ops *ops, void *data, iomap_actor2_t actor);

ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
const struct iomap_ops *ops);
--
2.30.1

2021-02-26 00:29:24

by Shiyang Ruan

Subject: [PATCH v2 05/10] fsdax: Replace mmap entry in case of CoW

We replace the existing entry with the newly allocated one in case of CoW.
Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
entry as write-protected. This helps with snapshots, so that new write
page faults after a snapshot trigger a CoW.
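
How the fault path derives the insert flags can be shown in isolation.
This is a small sketch that follows the logic added to dax_fault_actor()
in the diff below; fault_insert_flags() and the IF_* macros are
hypothetical stand-ins for the kernel code and its DAX_IF_* macros.

```c
#include <stdbool.h>

#define IF_DIRTY (1u << 0)	/* stand-in for DAX_IF_DIRTY */
#define IF_COW   (1u << 1)	/* stand-in for DAX_IF_COW */

/*
 * Only write faults set flags: a non-synchronous write dirties the entry,
 * and a write to a shared extent requests replacement of the entry (CoW).
 */
static unsigned int fault_insert_flags(bool write, bool sync, bool shared)
{
	unsigned int flags = 0;

	if (write) {
		if (!sync)
			flags |= IF_DIRTY;
		if (shared)
			flags |= IF_COW;
	}
	return flags;
}
```

Packing both conditions into one bitmask is what lets dax_insert_entry()
drop its separate bool parameter.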

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 37 ++++++++++++++++++++++++++-----------
1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 748dfb89fb41..ec4b733e0b59 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -722,6 +722,9 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
return 0;
}

+#define DAX_IF_DIRTY (1 << 0)
+#define DAX_IF_COW (1 << 1)
+
/*
* By this point grab_mapping_entry() has ensured that we have a locked entry
* of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -729,16 +732,19 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
* already in the tree, we will skip the insertion and just dirty the PMD as
* appropriate.
*/
-static void *dax_insert_entry(struct xa_state *xas,
- struct address_space *mapping, struct vm_fault *vmf,
- void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+ void *entry, pfn_t pfn, unsigned long flags,
+ unsigned int insert_flags)
{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
void *new_entry = dax_make_entry(pfn, flags);
+ bool dirty = insert_flags & DAX_IF_DIRTY;
+ bool cow = insert_flags & DAX_IF_COW;

if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

- if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+ if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -750,7 +756,7 @@ static void *dax_insert_entry(struct xa_state *xas,

xas_reset(xas);
xas_lock_irq(xas);
- if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+ if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;

dax_disassociate_entry(entry, mapping, false);
@@ -774,6 +780,9 @@ static void *dax_insert_entry(struct xa_state *xas,
if (dirty)
xas_set_mark(xas, PAGECACHE_TAG_DIRTY);

+ if (cow)
+ xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
xas_unlock_irq(xas);
return entry;
}
@@ -1100,8 +1109,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
vm_fault_t ret;

- *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
- DAX_ZERO_PAGE, false);
+ *entry = dax_insert_entry(xas, vmf, *entry, pfn, DAX_ZERO_PAGE, 0);

ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
trace_dax_load_hole(inode, vmf, ret);
@@ -1403,6 +1411,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
loff_t pos = (loff_t)xas->xa_offset << PAGE_SHIFT;
bool write = vmf->flags & FAULT_FLAG_WRITE;
bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
+ unsigned int insert_flags = 0;
vm_fault_t ret = 0;
int err = 0;
pfn_t pfn;
@@ -1426,8 +1435,14 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
if (err)
goto error_fault;

- entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
- write && !sync);
+ if (write) {
+ if (!sync)
+ insert_flags |= DAX_IF_DIRTY;
+ if (iomap->flags & IOMAP_F_SHARED)
+ insert_flags |= DAX_IF_COW;
+ }
+
+ entry = dax_insert_entry(xas, vmf, entry, pfn, 0, insert_flags);

if (write && srcmap->addr != iomap->addr) {
err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr, false);
@@ -1571,8 +1586,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
goto fallback;

pfn = page_to_pfn_t(zero_page);
- *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
- DAX_PMD | DAX_ZERO_PAGE, false);
+ *entry = dax_insert_entry(xas, vmf, *entry, pfn,
+ DAX_PMD | DAX_ZERO_PAGE, 0);

if (arch_needs_pgtable_deposit()) {
pgtable = pte_alloc_one(vma->vm_mm);
--
2.30.1

2021-02-26 00:29:53

by Shiyang Ruan

Subject: [PATCH v2 08/10] fsdax: Dedup file range to use a compare function

With dax we cannot deal with readpage() etc. So, we create a dax
comparison function which is similar to vfs_dedupe_file_range_compare(),
and introduce dax_remap_file_range_prep() for filesystem use.
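
The choice between the two comparison functions can be sketched as a pure
function. It follows the condition added to the remap prep code in the
diff below; pick_compare_path() and enum compare_path are hypothetical
names used only for illustration.

```c
#include <errno.h>
#include <stdbool.h>

enum compare_path { CMP_PAGE_CACHE, CMP_DAX };

/*
 * Pick a comparison path for a dedupe request. Returns 0 and sets *path,
 * or -EINVAL when the two inodes disagree about DAX, or when both are DAX
 * but the filesystem did not supply iomap ops.
 */
static int pick_compare_path(bool src_dax, bool dst_dax, bool have_ops,
			     enum compare_path *path)
{
	if (!src_dax && !dst_dax) {
		*path = CMP_PAGE_CACHE;
		return 0;
	}
	if (src_dax && dst_dax && have_ops) {
		*path = CMP_DAX;
		return 0;
	}
	return -EINVAL;
}
```

Mixed DAX and non-DAX pairs are rejected rather than falling back to the
page cache path, since only one side could be read through readpage().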

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 51 ++++++++++++++++++++++++++++++++++++++++++++
fs/remap_range.c | 45 +++++++++++++++++++++++++++++++-------
fs/xfs/xfs_reflink.c | 9 ++++++--
include/linux/dax.h | 4 ++++
include/linux/fs.h | 15 +++++++++----
5 files changed, 110 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4f6c6ba68e6f..bdf2b5dfee01 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1856,3 +1856,54 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
return dax_insert_pfn_mkwrite(vmf, pfn, order);
}
EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
+ struct inode *ino2, loff_t pos2, loff_t len, void *data,
+ struct iomap *smap, struct iomap *dmap)
+{
+ void *saddr, *daddr;
+ bool *same = data;
+ int ret;
+
+ while (len) {
+ if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE)
+ goto next;
+
+ if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
+ *same = false;
+ break;
+ }
+
+ ret = dax_iomap_direct_access(smap, pos1,
+ ALIGN(pos1 + len, PAGE_SIZE), &saddr, NULL);
+ if (ret < 0)
+ return -EIO;
+
+ ret = dax_iomap_direct_access(dmap, pos2,
+ ALIGN(pos2 + len, PAGE_SIZE), &daddr, NULL);
+ if (ret < 0)
+ return -EIO;
+
+ *same = !memcmp(saddr, daddr, len);
+ if (!*same)
+ break;
+next:
+ len -= len;
+ }
+
+ return 0;
+}
+
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
+ const struct iomap_ops *ops)
+{
+ int id, ret = 0;
+
+ id = dax_read_lock();
+ ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops, is_same,
+ dax_range_compare_actor);
+ dax_read_unlock(id);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index 77dba3a49e65..9079390edaf3 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -14,6 +14,7 @@
#include <linux/compat.h>
#include <linux/mount.h>
#include <linux/fs.h>
+#include <linux/dax.h>
#include "internal.h"

#include <linux/uaccess.h>
@@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
* Compare extents of two files to see if they are the same.
* Caller must have locked both inodes to prevent write races.
*/
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
- struct inode *dest, loff_t destoff,
- loff_t len, bool *is_same)
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same)
{
loff_t src_poff;
loff_t dest_poff;
@@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
out_error:
return error;
}
+EXPORT_SYMBOL(vfs_dedupe_file_range_compare);

/*
* Check that the two inodes are eligible for cloning, the ranges make
@@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
* If there's an error, then the usual negative error code is returned.
* Otherwise returns 0 with *len set to the request length.
*/
-int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- loff_t *len, unsigned int remap_flags)
+static int
+__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops)
{
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
if (remap_flags & REMAP_FILE_DEDUP) {
bool is_same = false;

- ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
- inode_out, pos_out, *len, &is_same);
+ if (!IS_DAX(inode_in) && !IS_DAX(inode_out))
+ ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
+ inode_out, pos_out, *len, &is_same);
+ else if (IS_DAX(inode_in) && IS_DAX(inode_out) && ops)
+ ret = dax_dedupe_file_range_compare(inode_in, pos_in,
+ inode_out, pos_out, *len, &is_same,
+ ops);
+ else
+ return -EINVAL;
if (ret)
return ret;
if (!is_same)
@@ -370,6 +381,24 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,

return ret;
}
+
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops)
+{
+ return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags, ops);
+}
+EXPORT_SYMBOL(dax_remap_file_range_prep);
+
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags)
+{
+ return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags, NULL);
+}
EXPORT_SYMBOL(generic_remap_file_range_prep);

loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 6fa05fb78189..f5b3a3da36b7 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1308,8 +1308,13 @@ xfs_reflink_remap_prep(
if (IS_DAX(inode_in) || IS_DAX(inode_out))
goto out_unlock;

- ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags);
+ if (IS_DAX(inode_in))
+ ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags);
+ else
+ ret = dax_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags,
+ &xfs_read_iomap_ops);
if (ret || *len == 0)
goto out_unlock;

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3275e01ed33d..32e1c34349f2 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -239,6 +239,10 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
struct iomap *srcmap);
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same,
+ const struct iomap_ops *ops);
static inline bool dax_mapping(struct address_space *mapping)
{
return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..2e6ec5bdf82a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -68,6 +68,7 @@ struct fsverity_info;
struct fsverity_operations;
struct fs_context;
struct fs_parameter_spec;
+struct iomap_ops;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -1910,13 +1911,19 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
+typedef int (*compare_range_t)(struct inode *src, loff_t srcpos,
+ struct inode *dest, loff_t destpos,
+ loff_t len, bool *is_same);
extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags);
-extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- loff_t *count,
- unsigned int remap_flags);
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *count, unsigned int remap_flags);
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops);
extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
--
2.30.1

2021-02-26 00:31:16

by Shiyang Ruan

Subject: [PATCH v2 09/10] fs/xfs: Handle CoW for fsdax write() path

In fsdax mode, WRITE and ZERO operations on a shared extent need CoW to be
performed. After CoW, the newly allocated extents need to be remapped to the
file. So, add an iomap_end for the dax write ops to do the remapping work.

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 3 ++-
fs/xfs/xfs_file.c | 9 +++------
fs/xfs/xfs_iomap.c | 30 +++++++++++++++++++++++++++++-
fs/xfs/xfs_iomap.h | 1 +
fs/xfs/xfs_iops.c | 11 ++++++++---
fs/xfs/xfs_reflink.c | 2 ++
6 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 7371a7f7c652..65a8782b6378 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -977,7 +977,8 @@ xfs_free_file_space(
if (offset + len > XFS_ISIZE(ip))
len = XFS_ISIZE(ip) - offset;
error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
- &xfs_buffered_write_iomap_ops);
+ IS_DAX(VFS_I(ip)) ?
+ &xfs_dax_write_iomap_ops : &xfs_buffered_write_iomap_ops);
if (error)
return error;

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..1987d15eab61 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -623,11 +623,8 @@ xfs_file_dax_write(
count = iov_iter_count(from);

trace_xfs_file_dax_write(ip, count, pos);
- ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
- if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
- i_size_write(inode, iocb->ki_pos);
- error = xfs_setfilesize(ip, pos, ret);
- }
+ ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
+
out:
xfs_iunlock(ip, iolock);
if (error)
@@ -1250,7 +1247,7 @@ __xfs_filemap_fault(

ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
(write_fault && !vmf->cow_page) ?
- &xfs_direct_write_iomap_ops :
+ &xfs_dax_write_iomap_ops :
&xfs_read_iomap_ops);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 7b9ff824e82d..23c6f8c97047 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -771,7 +771,8 @@ xfs_direct_write_iomap_begin(

/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
- &lockmode, flags & IOMAP_DIRECT);
+ &lockmode,
+ flags & IOMAP_DIRECT || IS_DAX(inode));
if (error)
goto out_unlock;
if (shared)
@@ -850,6 +851,33 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
.iomap_begin = xfs_direct_write_iomap_begin,
};

+static int
+xfs_dax_write_iomap_end(
+ struct inode *inode,
+ loff_t pos,
+ loff_t length,
+ ssize_t written,
+ unsigned int flags,
+ struct iomap *iomap)
+{
+ int error = 0;
+ xfs_inode_t *ip = XFS_I(inode);
+
+ if (pos + written > i_size_read(inode)) {
+ i_size_write(inode, pos + written);
+ error = xfs_setfilesize(ip, pos, written);
+ }
+ if (xfs_is_cow_inode(ip))
+ error = xfs_reflink_end_cow(ip, pos, written);
+
+ return error;
+}
+
+const struct iomap_ops xfs_dax_write_iomap_ops = {
+ .iomap_begin = xfs_direct_write_iomap_begin,
+ .iomap_end = xfs_dax_write_iomap_end,
+};
+
static int
xfs_buffered_write_iomap_begin(
struct inode *inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 7d3703556d0e..a361c2f27cf3 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -42,6 +42,7 @@ xfs_aligned_fsb_count(

extern const struct iomap_ops xfs_buffered_write_iomap_ops;
extern const struct iomap_ops xfs_direct_write_iomap_ops;
+extern const struct iomap_ops xfs_dax_write_iomap_ops;
extern const struct iomap_ops xfs_read_iomap_ops;
extern const struct iomap_ops xfs_seek_iomap_ops;
extern const struct iomap_ops xfs_xattr_iomap_ops;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 67c8dc9de8aa..adf4467ab862 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -841,6 +841,7 @@ xfs_setattr_size(
int error;
uint lock_flags = 0;
bool did_zeroing = false;
+ const struct iomap_ops *ops;

ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
@@ -887,10 +888,15 @@ xfs_setattr_size(
* extension, or zeroing out the rest of the block on a downward
* truncate.
*/
+ if (IS_DAX(inode))
+ ops = &xfs_direct_write_iomap_ops;
+ else
+ ops = &xfs_buffered_write_iomap_ops;
+
if (newsize > oldsize) {
trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
error = iomap_zero_range(inode, oldsize, newsize - oldsize,
- &did_zeroing, &xfs_buffered_write_iomap_ops);
+ &did_zeroing, ops);
} else {
/*
* iomap won't detect a dirty page over an unwritten block (or a
@@ -902,8 +908,7 @@ xfs_setattr_size(
newsize);
if (error)
return error;
- error = iomap_truncate_page(inode, newsize, &did_zeroing,
- &xfs_buffered_write_iomap_ops);
+ error = iomap_truncate_page(inode, newsize, &did_zeroing, ops);
}

if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f5b3a3da36b7..dfe4e1912ff9 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1246,6 +1246,8 @@ xfs_reflink_zero_posteof(

trace_xfs_zero_eof(ip, isize, pos - isize);
return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
+ IS_DAX(VFS_I(ip)) ?
+ &xfs_dax_write_iomap_ops :
&xfs_buffered_write_iomap_ops);
}

--
2.30.1

2021-02-26 00:33:03

by Shiyang Ruan

Subject: [PATCH v2 10/10] fs/xfs: Add dedupe support for fsdax

Add xfs_break_two_dax_layouts() to break layouts for two dax files. Then
call the compare range function only when both files are DAX or neither is.

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/xfs/xfs_file.c | 20 ++++++++++++++++++++
fs/xfs/xfs_inode.c | 8 +++++++-
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_reflink.c | 5 +++--
4 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1987d15eab61..82467d08e3ce 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -784,6 +784,26 @@ xfs_break_dax_layouts(
0, 0, xfs_wait_dax_page(inode));
}

+int
+xfs_break_two_dax_layouts(
+ struct inode *src,
+ struct inode *dest)
+{
+ int error;
+ bool retry = false;
+
+retry:
+ error = xfs_break_dax_layouts(src, &retry);
+ if (error || retry)
+ goto retry;
+
+ error = xfs_break_dax_layouts(dest, &retry);
+ if (error || retry)
+ goto retry;
+
+ return error;
+}
+
int
xfs_break_layouts(
struct inode *inode,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b7352bc4c815..c11b11e59a83 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3651,8 +3651,10 @@ xfs_ilock2_io_mmap(
struct xfs_inode *ip2)
{
int ret;
+ struct inode *inode1 = VFS_I(ip1);
+ struct inode *inode2 = VFS_I(ip2);

- ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
+ ret = xfs_iolock_two_inodes_and_break_layout(inode1, inode2);
if (ret)
return ret;
if (ip1 == ip2)
@@ -3660,6 +3662,10 @@ xfs_ilock2_io_mmap(
else
xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL,
ip2, XFS_MMAPLOCK_EXCL);
+
+ if (IS_DAX(inode1) && IS_DAX(inode2))
+ ret = xfs_break_two_dax_layouts(inode1, inode2);
+
return 0;
}

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index eca333f5f715..9ed7a2895602 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -431,6 +431,7 @@ enum xfs_prealloc_flags {

int xfs_update_prealloc_flags(struct xfs_inode *ip,
enum xfs_prealloc_flags flags);
+int xfs_break_two_dax_layouts(struct inode *inode1, struct inode *inode2);
int xfs_break_layouts(struct inode *inode, uint *iolock,
enum layout_break_reason reason);

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index dfe4e1912ff9..9a6374550560 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -29,6 +29,7 @@
#include "xfs_iomap.h"
#include "xfs_sb.h"
#include "xfs_ag_resv.h"
+#include <linux/dax.h>

/*
* Copy on Write of Shared Blocks
@@ -1306,8 +1307,8 @@ xfs_reflink_remap_prep(
if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
goto out_unlock;

- /* Don't share DAX file data for now. */
- if (IS_DAX(inode_in) || IS_DAX(inode_out))
+ /* Don't share DAX file data with non-DAX file. */
+ if (IS_DAX(inode_in) != IS_DAX(inode_out))
goto out_unlock;

if (IS_DAX(inode_in))
--
2.30.1

2021-02-26 04:18:44

by Darrick J. Wong

Subject: Re: [PATCH v2 07/10] iomap: Introduce iomap_apply2() for operations on two files

On Fri, Feb 26, 2021 at 08:20:27AM +0800, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, requires nested iomap_open()/iomap_end() on two file. Thus,
> we introduce iomap_apply2() to accept arguments from two files and
> iomap_actor2_t for actions on two files.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> fs/iomap/apply.c | 51 +++++++++++++++++++++++++++++++++++++++++++
> include/linux/iomap.h | 7 +++++-
> 2 files changed, 57 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 26ab6563181f..fd2f8bde5791 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -97,3 +97,54 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>
> return written ? written : ret;
> }
> +
> +loff_t
> +iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
> + loff_t length, unsigned int flags, const struct iomap_ops *ops,
> + void *data, iomap_actor2_t actor)
> +{
> + struct iomap smap = { .type = IOMAP_HOLE };
> + struct iomap dmap = { .type = IOMAP_HOLE };
> + loff_t written = 0, ret;
> +
> + ret = ops->iomap_begin(ino1, pos1, length, 0, &smap, NULL);
> + if (ret)
> + goto out_src;
> + if (WARN_ON(smap.offset > pos1)) {
> + written = -EIO;
> + goto out_src;
> + }
> + if (WARN_ON(smap.length == 0)) {
> + written = -EIO;
> + goto out_src;
> + }
> +
> + ret = ops->iomap_begin(ino2, pos2, length, 0, &dmap, NULL);
> + if (ret)
> + goto out_dest;
> + if (WARN_ON(dmap.offset > pos2)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + if (WARN_ON(dmap.length == 0)) {
> + written = -EIO;
> + goto out_dest;
> + }
> +
> + /* make sure extent length of two file is equal */
> + if (WARN_ON(smap.length != dmap.length)) {

Why not set smap.length and dmap.length to min(smap.length, dmap.length) ?

--D

> + written = -EIO;
> + goto out_dest;
> + }
> +
> + written = actor(ino1, pos1, ino2, pos2, length, data, &smap, &dmap);
> +
> +out_dest:
> + if (ops->iomap_end)
> + ret = ops->iomap_end(ino2, pos2, length, 0, 0, &dmap);
> +out_src:
> + if (ops->iomap_end)
> + ret = ops->iomap_end(ino1, pos1, length, 0, 0, &smap);
> +
> + return ret;
> +}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5bd3cac4df9c..913f98897a77 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -148,10 +148,15 @@ struct iomap_ops {
> */
> typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
> void *data, struct iomap *iomap, struct iomap *srcmap);
> -
> +typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap);
> loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
> unsigned flags, const struct iomap_ops *ops, void *data,
> iomap_actor_t actor);
> +loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2,
> + loff_t pos2, loff_t length, unsigned int flags,
> + const struct iomap_ops *ops, void *data, iomap_actor2_t actor);
>
> ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
> const struct iomap_ops *ops);
> --
> 2.30.1
>
>
>

2021-02-26 08:24:11

by Shiyang Ruan

Subject: Re: [PATCH v2 07/10] iomap: Introduce iomap_apply2() for operations on two files

> On Fri, Feb 26, 2021 at 08:20:27AM +0800, Shiyang Ruan wrote:
> > Some operations, such as comparing a range of data in two files under
> > fsdax mode, requires nested iomap_open()/iomap_end() on two file. Thus,
> > we introduce iomap_apply2() to accept arguments from two files and
> > iomap_actor2_t for actions on two files.
> >
> > Signed-off-by: Shiyang Ruan <[email protected]>
> > ---
> > fs/iomap/apply.c | 51 +++++++++++++++++++++++++++++++++++++++++++
> > include/linux/iomap.h | 7 +++++-
> > 2 files changed, 57 insertions(+), 1 deletion(-)
> >
...
> > + ret = ops->iomap_begin(ino2, pos2, length, 0, &dmap, NULL);
> > + if (ret)
> > + goto out_dest;
> > + if (WARN_ON(dmap.offset > pos2)) {
> > + written = -EIO;
> > + goto out_dest;
> > + }
> > + if (WARN_ON(dmap.length == 0)) {
> > + written = -EIO;
> > + goto out_dest;
> > + }
> > +
> > + /* make sure extent length of two file is equal */
> > + if (WARN_ON(smap.length != dmap.length)) {
>
> Why not set smap.length and dmap.length to min(smap.length, dmap.length) ?
>

You are right. I misunderstood it. My bad.

I'll fix this patch and the next one, which calls this function.

--
Thanks,
Ruan Shiyang.

> --D
>
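
The suggestion above, clamping both mappings to the shorter one instead of
failing when their lengths differ, can be illustrated outside the kernel. This
is a minimal userspace sketch, not the iomap code: struct extent, min64() and
common_len() are hypothetical names, and the struct only mimics the offset and
length fields of struct iomap. Measuring the span from the current position to
the end of each mapping is also an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the offset/length fields of struct iomap. */
struct extent {
	int64_t offset;	/* file offset where the mapping starts */
	int64_t length;	/* number of bytes the mapping covers */
};

static int64_t min64(int64_t a, int64_t b)
{
	return a < b ? a : b;
}

/*
 * Bytes that can be processed in one step starting at pos1/pos2:
 * the requested length, clamped to what is left of each mapping.
 */
static int64_t common_len(const struct extent *smap, int64_t pos1,
			  const struct extent *dmap, int64_t pos2,
			  int64_t length)
{
	int64_t left1, left2;

	/* Same precondition the patch checks with WARN_ON(). */
	assert(smap->offset <= pos1 && dmap->offset <= pos2);

	left1 = smap->offset + smap->length - pos1;
	left2 = dmap->offset + dmap->length - pos2;
	return min64(length, min64(left1, left2));
}
```

A caller would process common_len() bytes, advance both positions by that
amount, and map again, so extents of unequal size never need to be rejected.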

2021-02-26 08:31:23

by Shiyang Ruan

Subject: [PATCH v2 08/10] fsdax: Dedup file range to use a compare function

With dax we cannot deal with readpage() etc. So, we create a dax
comparison function which is similar to
vfs_dedupe_file_range_compare().
Also introduce dax_remap_file_range_prep() for filesystem use.

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 56 ++++++++++++++++++++++++++++++++++++++++++++
fs/remap_range.c | 45 ++++++++++++++++++++++++++++-------
fs/xfs/xfs_reflink.c | 9 +++++--
include/linux/dax.h | 4 ++++
include/linux/fs.h | 15 ++++++++----
5 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4f6c6ba68e6f..dbb95f00b38b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1856,3 +1856,59 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
return dax_insert_pfn_mkwrite(vmf, pfn, order);
}
EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
+ struct inode *ino2, loff_t pos2, loff_t len, void *data,
+ struct iomap *smap, struct iomap *dmap)
+{
+ void *saddr, *daddr;
+ bool *same = data;
+ int ret;
+
+ if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
+ *same = true;
+ return len;
+ }
+
+ if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
+ *same = false;
+ return 0;
+ }
+
+ ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
+ &saddr, NULL);
+ if (ret < 0)
+ return -EIO;
+
+ ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
+ &daddr, NULL);
+ if (ret < 0)
+ return -EIO;
+
+ *same = !memcmp(saddr, daddr, len);
+ return len;
+}
+
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
+ const struct iomap_ops *ops)
+{
+ int id, ret = 0;
+
+ id = dax_read_lock();
+ while (len) {
+ ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
+ is_same, dax_range_compare_actor);
+ if (ret < 0 || !*is_same)
+ goto out;
+
+ len -= ret;
+ srcoff += ret;
+ destoff += ret;
+ }
+ ret = 0;
+out:
+ dax_read_unlock(id);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index 77dba3a49e65..9079390edaf3 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -14,6 +14,7 @@
#include <linux/compat.h>
#include <linux/mount.h>
#include <linux/fs.h>
+#include <linux/dax.h>
#include "internal.h"

#include <linux/uaccess.h>
@@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
* Compare extents of two files to see if they are the same.
* Caller must have locked both inodes to prevent write races.
*/
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
- struct inode *dest, loff_t destoff,
- loff_t len, bool *is_same)
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same)
{
loff_t src_poff;
loff_t dest_poff;
@@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
out_error:
return error;
}
+EXPORT_SYMBOL(vfs_dedupe_file_range_compare);

/*
* Check that the two inodes are eligible for cloning, the ranges make
@@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
* If there's an error, then the usual negative error code is returned.
* Otherwise returns 0 with *len set to the request length.
*/
-int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- loff_t *len, unsigned int remap_flags)
+static int
+__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops)
{
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
if (remap_flags & REMAP_FILE_DEDUP) {
bool is_same = false;

- ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
- inode_out, pos_out, *len, &is_same);
+ if (!IS_DAX(inode_in) && !IS_DAX(inode_out))
+ ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
+ inode_out, pos_out, *len, &is_same);
+ else if (IS_DAX(inode_in) && IS_DAX(inode_out) && ops)
+ ret = dax_dedupe_file_range_compare(inode_in, pos_in,
+ inode_out, pos_out, *len, &is_same,
+ ops);
+ else
+ return -EINVAL;
if (ret)
return ret;
if (!is_same)
@@ -370,6 +381,24 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,

return ret;
}
+
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops)
+{
+ return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags, ops);
+}
+EXPORT_SYMBOL(dax_remap_file_range_prep);
+
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags)
+{
+ return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags, NULL);
+}
EXPORT_SYMBOL(generic_remap_file_range_prep);

loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 6fa05fb78189..f5b3a3da36b7 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1308,8 +1308,13 @@ xfs_reflink_remap_prep(
if (IS_DAX(inode_in) || IS_DAX(inode_out))
goto out_unlock;

- ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags);
+ if (IS_DAX(inode_in))
+ ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags);
+ else
+ ret = dax_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags,
+ &xfs_read_iomap_ops);
if (ret || *len == 0)
goto out_unlock;

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3275e01ed33d..32e1c34349f2 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -239,6 +239,10 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
struct iomap *srcmap);
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same,
+ const struct iomap_ops *ops);
static inline bool dax_mapping(struct address_space *mapping)
{
return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..2e6ec5bdf82a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -68,6 +68,7 @@ struct fsverity_info;
struct fsverity_operations;
struct fs_context;
struct fs_parameter_spec;
+struct iomap_ops;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -1910,13 +1911,19 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
+typedef int (*compare_range_t)(struct inode *src, loff_t srcpos,
+ struct inode *dest, loff_t destpos,
+ loff_t len, bool *is_same);
extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags);
-extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- loff_t *count,
- unsigned int remap_flags);
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *count, unsigned int remap_flags);
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops);
extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
--
2.30.1
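
The loop in dax_dedupe_file_range_compare() walks the range one mapping at a
time and stops at the first difference. The same idea can be sketched in
userspace, assuming plain memory buffers in place of DAX-mapped extents.
CHUNK and range_compare() are hypothetical names, and a fixed chunk size
stands in for whatever length each mapping step would cover. The sketch
returns 0 whenever the comparison itself succeeded and leaves the verdict in
*is_same; that return convention is an assumption, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4096	/* hypothetical step size, standing in for one extent */

/*
 * Compare len bytes of a and b in CHUNK-sized steps. Returns 0 and
 * sets *is_same, stopping at the first chunk that differs.
 */
static int range_compare(const unsigned char *a, const unsigned char *b,
			 size_t len, bool *is_same)
{
	assert(is_same != NULL);

	*is_same = true;
	while (len) {
		size_t step = len < CHUNK ? len : CHUNK;

		if (memcmp(a, b, step) != 0) {
			*is_same = false;
			break;
		}
		a += step;
		b += step;
		len -= step;
	}
	return 0;
}
```

An empty range compares as identical here, since the loop body never runs.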

2021-02-26 08:31:33

by Shiyang Ruan

Subject: [PATCH v2 07/10] iomap: Introduce iomap_apply2() for operations on two files

Some operations, such as comparing a range of data in two files under
fsdax mode, require nested iomap_begin()/iomap_end() on two files. Thus,
we introduce iomap_apply2() to accept arguments from two files and
iomap_actor2_t for actions on two files.

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/iomap/apply.c | 56 +++++++++++++++++++++++++++++++++++++++++++
include/linux/iomap.h | 7 +++++-
2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 26ab6563181f..fbc38ce3d5b6 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -97,3 +97,59 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,

return written ? written : ret;
}
+
+loff_t
+iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
+ loff_t length, unsigned int flags, const struct iomap_ops *ops,
+ void *data, iomap_actor2_t actor)
+{
+ struct iomap smap = { .type = IOMAP_HOLE };
+ struct iomap dmap = { .type = IOMAP_HOLE };
+ loff_t written = 0, ret, ret2 = 0;
+ loff_t len1 = length, len2, min_len;
+
+ ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
+ if (ret)
+ goto out_src;
+ if (WARN_ON(smap.offset > pos1)) {
+ written = -EIO;
+ goto out_src;
+ }
+ if (WARN_ON(smap.length == 0)) {
+ written = -EIO;
+ goto out_src;
+ }
+ len2 = min_t(loff_t, len1, smap.length);
+
+ ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
+ if (ret)
+ goto out_dest;
+ if (WARN_ON(dmap.offset > pos2)) {
+ written = -EIO;
+ goto out_dest;
+ }
+ if (WARN_ON(dmap.length == 0)) {
+ written = -EIO;
+ goto out_dest;
+ }
+ min_len = min_t(loff_t, len2, dmap.length);
+
+ written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);
+
+out_dest:
+ if (ops->iomap_end)
+ ret2 = ops->iomap_end(ino2, pos2, len2,
+ written > 0 ? written : 0, flags, &dmap);
+out_src:
+ if (ops->iomap_end)
+ ret = ops->iomap_end(ino1, pos1, len1,
+ written > 0 ? written : 0, flags, &smap);
+
+ if (ret)
+ return written ? written : ret;
+
+ if (ret2)
+ return written ? written : ret2;
+
+ return written;
+}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5bd3cac4df9c..913f98897a77 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -148,10 +148,15 @@ struct iomap_ops {
*/
typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
void *data, struct iomap *iomap, struct iomap *srcmap);
-
+typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1,
+ struct inode *ino2, loff_t pos2, loff_t len, void *data,
+ struct iomap *smap, struct iomap *dmap);
loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
unsigned flags, const struct iomap_ops *ops, void *data,
iomap_actor_t actor);
+loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2,
+ loff_t pos2, loff_t length, unsigned int flags,
+ const struct iomap_ops *ops, void *data, iomap_actor2_t actor);

ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
const struct iomap_ops *ops);
--
2.30.1
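
The nesting that iomap_apply2() performs (map the first file, map the second,
run the actor, then unmap in reverse order) can be sketched in userspace. This
is an illustration under stated assumptions, not the kernel function: struct
map_ops, actor2_t, apply2() and the demo_* callbacks are hypothetical, files
are plain integers, and both callbacks are assumed to be non-NULL. Unlike the
posted patch, the sketch does not call end for a mapping whose begin failed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-callback interface, loosely modelled on iomap_ops. */
struct map_ops {
	int (*begin)(int file, long pos, long len);
	int (*end)(int file, long pos, long len, long done);
};

typedef long (*actor2_t)(int f1, long pos1, int f2, long pos2, long len);

/*
 * Map both files, run the actor, then unmap in reverse order.
 * Returns bytes processed, or the first negative error seen.
 */
static long apply2(int f1, long pos1, int f2, long pos2, long len,
		   const struct map_ops *ops, actor2_t actor)
{
	long done = 0;
	int err, err2;

	assert(ops && ops->begin && ops->end && actor);

	err = ops->begin(f1, pos1, len);
	if (err)
		return err;		/* nothing mapped, nothing to end */

	err = ops->begin(f2, pos2, len);
	if (err)
		goto out_src;		/* only the first mapping is live */

	done = actor(f1, pos1, f2, pos2, len);

	err = ops->end(f2, pos2, len, done > 0 ? done : 0);
out_src:
	err2 = ops->end(f1, pos1, len, done > 0 ? done : 0);
	if (done < 0)
		return done;
	if (err)
		return err;
	return err2 ? err2 : done;
}

/* Demo callbacks: count begin/end calls so the pairing can be checked. */
static int nr_begin, nr_end;

static int demo_begin(int file, long pos, long len)
{
	(void)pos; (void)len;
	if (file < 0)
		return -5;	/* pretend the mapping failed */
	nr_begin++;
	return 0;
}

static int demo_end(int file, long pos, long len, long done)
{
	(void)file; (void)pos; (void)len; (void)done;
	nr_end++;
	return 0;
}

static long demo_actor(int f1, long pos1, int f2, long pos2, long len)
{
	(void)f1; (void)pos1; (void)f2; (void)pos2;
	return len;	/* claim the whole range was processed */
}
```

The counters make it easy to confirm that every successful begin is matched
by exactly one end, including when the second mapping fails.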

2021-02-26 09:59:48

by Shiyang Ruan

Subject: Question about the "EXPERIMENTAL" tag for dax in XFS

Hi, guys

Besides this patchset, I'd like to confirm something about the "EXPERIMENTAL" tag for dax in XFS.

In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message when we mount a pmem device with the dax option, has existed for a while. It's a bit annoying when using the fsdax feature. So, my initial intention was to remove this tag, and I started to find and solve the problems which prevent it from being removed.

As discussed before, there are 3 main problems. The first one is "dax semantics", which has been resolved. The remaining two are "RMAP for fsdax" and "support dax reflink for filesystem", which I have been working on.

So, what I want to confirm is: does it mean that we can remove the "EXPERIMENTAL" tag when the remaining two problems are solved? Or are there other important problems that need to be fixed before removing it? If there are, could you please point them out to me?

Thank you.

--
Ruan Shiyang.

2021-02-26 19:08:49

by Darrick J. Wong

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Fri, Feb 26, 2021 at 09:45:45AM +0000, [email protected] wrote:
> Hi, guys
>
> Beside this patchset, I'd like to confirm something about the
> "EXPERIMENTAL" tag for dax in XFS.
>
> In XFS, the "EXPERIMENTAL" tag, which is reported in waring message
> when we mount a pmem device with dax option, has been existed for a
> while. It's a bit annoying when using fsdax feature. So, my initial
> intention was to remove this tag. And I started to find out and solve
> the problems which prevent it from being removed.
>
> As is talked before, there are 3 main problems. The first one is "dax
> semantics", which has been resolved. The rest two are "RMAP for
> fsdax" and "support dax reflink for filesystem", which I have been
> working on.

<nod>

> So, what I want to confirm is: does it means that we can remove the
> "EXPERIMENTAL" tag when the rest two problem are solved?

Yes. I'd keep the experimental tag for a cycle or two to make sure that
nothing new pops up, but otherwise the two patchsets you've sent close
those two big remaining gaps. Thank you for working on this!

> Or maybe there are other important problems need to be fixed before
> removing it? If there are, could you please show me that?

That remains to be seen through QA/validation, but I think that's it.

Granted, I still have to read through the two patchsets...

--D

>
> Thank you.
>
>
> --
> Ruan Shiyang.

2021-02-26 19:27:21

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong <[email protected]> wrote:
> That remains to be seen through QA/validation, but I think that's it.
>
> Granted, I still have to read through the two patchsets...

I've been meaning to circle back here as well.

My immediate concern is the issue Jason recently highlighted [1] with
respect to invalidating all dax mappings when / if the device is
ripped out from underneath the fs. I don't think that will collide
with Ruan's implementation, but it does need new communication from
driver to fs about removal events.

[1]: http://lore.kernel.org/r/CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@mail.gmail.com

2021-02-26 20:54:57

by Dave Chinner

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote:
> I've been meaning to circle back here as well.
>
> My immediate concern is the issue Jason recently highlighted [1] with
> respect to invalidating all dax mappings when / if the device is
> ripped out from underneath the fs. I don't think that will collide
> with Ruan's implementation, but it does need new communication from
> driver to fs about removal events.
>
> [1]: http://lore.kernel.org/r/CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@mail.gmail.com

Oh, yay.

The XFS shutdown code is centred around preventing new IO from being
issued - we don't actually do anything about DAX mappings because,
well, I don't think anyone on the filesystem side thought they had
to do anything special if pmem went away from under it.

My understanding -was- that the pmem removal invalidates
all the ptes currently mapped into CPU page tables that point at
the dax device across the system. The vmas that manage these
mappings are not really something the filesystem manages,
but a function of the mm subsystem. What the filesystem cares about
is that it gets page faults triggered when a change of state occurs
so that it can remap the page to its backing store correctly.

IOWs, all the mm subsystem needs to do when pmem goes away is clear the
CPU ptes, because then when userspace tries to access the
mapped DAX pages we get a new page fault. In processing the fault, the
filesystem will try to get direct access to the pmem from the block
device. This will get an ENODEV error from the block device
because the backing store (pmem) has been unplugged and is no longer
there...

AFAICT, as long as pmem removal invalidates all the active ptes that
point at the pmem being removed, the filesystem doesn't need to
care about device removal at all, DAX or no DAX...
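
To make the flow above concrete, here is a minimal userspace sketch, assuming the ptes really are cleared on removal. It is not kernel code: `model_dev`, `model_pte` and the `model_*` functions are hypothetical stand-ins, and the real fault path is far more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_dev { bool present; };   /* the pmem backing store */
struct model_pte { bool valid; };     /* one user mapping of a DAX page */

/* Models the block device refusing direct access once pmem is gone. */
static int model_direct_access(const struct model_dev *dev)
{
	assert(dev != NULL);
	return dev->present ? 0 : -ENODEV;
}

/* Models device removal: clear every pte that points at the device. */
static void model_remove_dev(struct model_dev *dev,
			     struct model_pte *ptes, size_t n)
{
	assert(dev != NULL && ptes != NULL);
	dev->present = false;
	for (size_t i = 0; i < n; i++)
		ptes[i].valid = false;
}

/*
 * Models a userspace access. A valid pte is used directly. A cleared pte
 * takes a fault, and the fault handler asks the device for direct access;
 * a failure here is what userspace would see as SIGBUS.
 */
static int model_access(const struct model_dev *dev, struct model_pte *pte)
{
	int err;

	assert(pte != NULL);
	if (pte->valid)
		return 0;
	err = model_direct_access(dev);
	if (err)
		return err;
	pte->valid = true;
	return 0;
}
```

In this model the first access after `model_remove_dev()` returns -ENODEV without the filesystem having been told about the removal, which is the behaviour the argument above relies on.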

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-02-26 21:02:15

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner <[email protected]> wrote:
>
> AFAICT, as long as pmem removal invalidates all the active ptes that
> point at the pmem being removed, the filesystem doesn't need to
> care about device removal at all, DAX or no DAX...

How would the pmem removal do that without walking all the active
inodes in the fs at the time of shutdown and calling
unmap_mapping_range(inode->i_mapping, 0, 0, 1)?

The core-mm does tear down the ptes in the direct map, but user
mappings to pmem are not torn down, afaics, in xfs_do_force_shutdown().

2021-02-26 21:31:47

by Dave Chinner

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> How would the pmem removal do that without walking all the active
> inodes in the fs at the time of shutdown and call
> unmap_mapping_range(inode->i_mapping, 0, 0, 1)?

Which then immediately ends up back at the vmas that manage the ptes
to unmap them.

Isn't finding the vma(s) that map a specific memory range exactly
what the rmap code in the mm subsystem is supposed to address?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-02-26 22:44:59

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <[email protected]> wrote:
>
> On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > How would the pmem removal do that without walking all the active
> > inodes in the fs at the time of shutdown and call
> > unmap_mapping_range(inode->i_mapping, 0, 0, 1)?
>
> Which then immediately ends up back at the vmas that manage the ptes
> to unmap them.
>
> Isn't finding the vma(s) that map a specific memory range exactly
> what the rmap code in the mm subsystem is supposed to address?

rmap can only look up vmas from a virt address relative to a given
mm_struct. The driver has neither the list of mm_struct objects nor
virt addresses to do a lookup. All it knows is that someone might have
mapped pages through the fsdax interface.

To me this looks like a notifier that fires from memunmap_pages()
after dev_pagemap_kill() to notify any block_device associated with
that dev_pagemap to say that any dax mappings arranged through this
block_device are now invalid. The reason to do this after
dev_pagemap_kill() is so that any new mapping attempts that are racing
the removal will be blocked.

The receiver of that notification needs to go from a block_device to a
superblock that has mapped inodes and walk ->s_inodes triggering the
unmap/invalidation.
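
As a rough illustration of the ordering proposed above, the following userspace sketch marks the device dead before walking the inode list, so that a racing mapping attempt fails rather than slipping in behind the walk. Every name here (`model_bdev`, `model_sb`, `model_inode`, `model_try_map`, `model_notify_removal`) is hypothetical; this is not an existing kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these are kernel structures. */
struct model_inode {
	bool dax_mapped;              /* inode currently has user DAX mappings */
	struct model_inode *next;     /* stands in for the sb->s_inodes list */
};

struct model_sb {
	struct model_inode *inodes;
};

struct model_bdev {
	struct model_sb *sb;          /* superblock mounted on this device */
	bool dead;                    /* set once the pagemap has been killed */
};

/* New mapping attempts fail once the device is marked dead. */
static int model_try_map(const struct model_bdev *bdev, struct model_inode *ip)
{
	assert(bdev != NULL && ip != NULL);
	if (bdev->dead)
		return -ENODEV;
	ip->dax_mapped = true;
	return 0;
}

/*
 * Removal notification: mark the device dead first so racing mappers are
 * blocked, then walk every inode and zap its mappings. Returns the number
 * of inodes that were zapped.
 */
static size_t model_notify_removal(struct model_bdev *bdev)
{
	size_t zapped = 0;

	assert(bdev != NULL && bdev->sb != NULL);
	bdev->dead = true;
	for (struct model_inode *ip = bdev->sb->inodes; ip; ip = ip->next) {
		if (!ip->dax_mapped)
			continue;
		/* Stands in for unmap_mapping_range(inode->i_mapping, 0, 0, 1). */
		ip->dax_mapped = false;
		zapped++;
	}
	return zapped;
}
```

The cost of this approach is the full inode walk: the loop visits every inode on the list, mapped or not.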

2021-02-27 22:38:43

by Dave Chinner

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <[email protected]> wrote:
> > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > How would the pmem removal do that without walking all the active
> > > inodes in the fs at the time of shutdown and call
> > > unmap_mapping_range(inode->i_mapping, 0, 0, 1)?
> >
> > Which then immediately ends up back at the vmas that manage the ptes
> > to unmap them.
> >
> > Isn't finding the vma(s) that map a specific memory range exactly
> > what the rmap code in the mm subsystem is supposed to address?
>
> rmap can lookup only vmas from a virt address relative to a given
> mm_struct. The driver has neither the list of mm_struct objects nor
> virt addresses to do a lookup. All it knows is that someone might have
> mapped pages through the fsdax interface.

So there's no physical addr to vma translation in the mm subsystem
at all?

That doesn't make sense. We do exactly this for hwpoison for DAX
mappings. While we don't look at ptes, we get a pfn, check if it
points to the PMEM that is being removed, grab the page it points
to, map that to the relevant struct page, run collect_procs() on
that page, then kill the user processes that map that page.

So why can't we walk the ptes, check the physical pages that they
map to and, if they map to a pmem page, go poison that
page, which kills any user process that maps it?

i.e. I can't see how unexpected pmem device unplug is any different
to an MCE delivering a hwpoison event to a DAX mapped page. Both
indicate a physical address range now contains invalid data and the
filesystem has to take the same action...

IOWs, we could just call ->corrupted_range(0, EOD) here to tell the
filesystem the entire device went away. Then the filesystem can deal
with this however it needs to. However, it would be more efficient
from an invalidation POV to just call it on the pages that have
currently active ptes because once the block device is dead
new page faults on DAX mappings will get a SIGBUS naturally.

> To me this looks like a notifier that fires from memunmap_pages()
> after dev_pagemap_kill() to notify any block_device associated with
> that dev_pagemap() to say that any dax mappings arranged through this
> block_device are now invalid. The reason to do this after
> dev_pagemap_kill() is so that any new mapping attempts that are racing
> the removal will be blocked.

I don't see why this needs a unique notifier. At the filesystem
level, we want a single interface that tells us "something bad
happened to the block device", not a proliferation of similar but
subtly different "bad thing X happened to block device" interfaces
that are unique to specific physical device drivers...

> The receiver of that notification needs to go from a block_device to a
> superblock that has mapped inodes and walk ->sb_inodes triggering the
> unmap/invalidation.

Not necessarily.

What if the filesystem is managing mirrored data across multiple
devices and this device is only one leg of the mirror? Or that the
pmem was used by the RT device in XFS and the data/log devices are
still fine? What if the pmem is just being used as a cache tier, and
no data was actually lost?

IOWs, what needs to happen at this point is very filesystem
specific. Assuming that "device unplug == filesystem dead" is not
correct, nor is specifying a generic action that assumes the
filesystem is dead because a device it is using went away.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-02-27 23:47:06

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner <[email protected]> wrote:
>
> On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > rmap can lookup only vmas from a virt address relative to a given
> > mm_struct. The driver has neither the list of mm_struct objects nor
> > virt addresses to do a lookup. All it knows is that someone might have
> > mapped pages through the fsdax interface.
>
> So there's no physical addr to vma translation in the mm subsystem
> at all?
>
> That doesn't make sense. We do exactly this for hwpoison for DAX
> mappings. While we don't look at ptes, we get a pfn,

True, hwpoison does get a known failing pfn and then uses page->mapping
to get the 'struct address_space' to do the unmap. I discounted that
approach from the outset because it would mean walking every pfn in a
multi-terabyte device just in case one is mapped at device shutdown.
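
To give a sense of the scale involved, here is a small arithmetic sketch. The helper `model_pfn_count()` is hypothetical, and the 4 KiB page size used in the note below is an assumption; larger page sizes reduce the count accordingly.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of page frames that a walk over the whole device would have to
 * visit. Hypothetical helper for illustration only.
 */
static uint64_t model_pfn_count(uint64_t dev_bytes, uint64_t page_size)
{
	assert(page_size != 0);
	return dev_bytes / page_size;
}
```

With 4 KiB pages, `model_pfn_count(1ULL << 40, 4096)` is 268435456 (2^28), so each TiB of pmem adds roughly a quarter of a billion pfns to such a walk.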

> it points to, check if it points to the PMEM that is being removed,
> grab the page it points to, map that to the relevant struct page,
> run collect_procs() on that page, then kill the user processes that
> map that page.
>
> So why can't we walk the ptes, check the physical pages that they
> map to and if they map to a pmem page we go poison that
> page and that kills any user process that maps it.
>
> i.e. I can't see how unexpected pmem device unplug is any different
> to an MCE delivering a hwpoison event to a DAX mapped page.

I guess the tradeoff is walking a long list of inodes vs walking a
large array of pages.

There are likely always more pages than inodes, but perhaps it's more
efficient to walk the 'struct page' array than sb->s_inodes?

> Both
> indicate a physical address range now contains invalid data and the
> filesystem has to take the same action...
>
> IOWs, we could just call ->corrupted_range(0, EOD) here to tell the
> filesystem the entire device went away. Then the filesystem deal
> with this however it needs to. However, it would be more efficient
> from an invalidation POV to just call it on the pages that have
> currently active ptes because once the block device is dead
> new page faults on DAX mappings will get a SIGBUS naturally.

There is no efficient way to look up "currently active ptes" relative
to a physical pfn range.

SIGBUS will happen naturally either way. I don't think the hwpoison
signal with the extra BUS_MCEERR_* info is appropriate, given that it
indicates data loss, whereas a device being unplugged only takes the
data offline.

>
> > To me this looks like a notifier that fires from memunmap_pages()
> > after dev_pagemap_kill() to notify any block_device associated with
> > that dev_pagemap() to say that any dax mappings arranged through this
> > block_device are now invalid. The reason to do this after
> > dev_pagemap_kill() is so that any new mapping attempts that are racing
> > the removal will be blocked.
>
> I don't see why this needs a unique notifier. At the filesystem
> level, we want a single interface that tells us "something bad
> happened to the block device", not a proliferation of similar but
> subtly different "bad thing X happened to block device" interfaces
> that are unique to specific physical device drivers...
>
> > The receiver of that notification needs to go from a block_device to a
> > superblock that has mapped inodes and walk ->sb_inodes triggering the
> > unmap/invalidation.
>
> Not necessarily.
>
> What if the filesystem is managing mirrored data across multiple
> devices and this device is only one leg of the mirror?

I can see DAX mapping for read access to one leg of the mirror. The
unplug would fire zap_pte for all the inodes with DAX mappings for
that fs. The filesystem is still free at that point to wait for the next
user access, take a refault, and re-establish the mapping to another
leg of the mirror.
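
A minimal userspace sketch of that zap-then-refault recovery, assuming a two-leg mirror. The `model_mirror` structure and both functions are hypothetical and only illustrate the idea that the zap itself takes no permanent action.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LEGS 2
#define MODEL_UNMAPPED (-1)

/* Hypothetical model of one file mirrored across two devices. */
struct model_mirror {
	bool leg_present[MODEL_LEGS];
	int mapped_leg;   /* leg the user mapping points at, or MODEL_UNMAPPED */
};

/* Unplug one leg: the device goes away and the user mapping is zapped. */
static void model_unplug(struct model_mirror *m, int leg)
{
	assert(m != NULL && leg >= 0 && leg < MODEL_LEGS);
	m->leg_present[leg] = false;
	m->mapped_leg = MODEL_UNMAPPED;
}

/* Next user access refaults: map any surviving leg, else fail. */
static int model_refault(struct model_mirror *m)
{
	assert(m != NULL);
	for (int leg = 0; leg < MODEL_LEGS; leg++) {
		if (m->leg_present[leg]) {
			m->mapped_leg = leg;
			return 0;
		}
	}
	return -ENODEV;
}
```

Only when no leg survives does the refault fail, which is the point at which userspace would see an error.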

> Or that the
> pmem was used by the RT device in XFS and the data/log devices are
> still fine?

I was assuming that the callback would only be triggered for a dax
device as the data device. So xfs_open_devices() would register
mp->m_super for dax_rtdev.

> What if the pmem is just being used as a cache tier, and
> no data was actually lost?

That's fine: the cache mapping is zapped and the refault figures out what
to do. If anything, these questions are a reason not to use
->corrupted_range() for this, because recovery can happen at refault vs
taking permanent action on a data loss event.

>
> IOWs, what needs to happen at this point is very filesystem
> specific. Assuming that "device unplug == filesystem dead" is not
> correct, nor is specifying a generic action that assumes the
> filesystem is dead because a device it is using went away.

Ok, I think I sent this discussion in the wrong direction by implying any
mapping of this action to a "filesystem dead" event. It's just a "zap
all ptes" event and upper layers recover from there.

2021-02-28 22:56:47

by Dave Chinner

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner <[email protected]> wrote:
> > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <[email protected]> wrote:
> > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > it points to, check if it points to the PMEM that is being removed,
> > grab the page it points to, map that to the relevant struct page,
> > run collect_procs() on that page, then kill the user processes that
> > map that page.
> >
> > So why can't we walk the ptes, check the physical pages that they
> > map to and if they map to a pmem page we go poison that
> > page and that kills any user process that maps it.
> >
> > i.e. I can't see how unexpected pmem device unplug is any different
> > to an MCE delivering a hwpoison event to a DAX mapped page.
>
> I guess the tradeoff is walking a long list of inodes vs walking a
> large array of pages.

Not really. You're assuming all a filesystem has to do is invalidate
everything if a device goes away, and that's not true. Finding if an
inode has a mapping that spans a specific device in a multi-device
filesystem can be a lot more complex than that. Just walking inodes
is easy - determining which inodes need invalidation is the hard
part.

That's where ->corrupt_range() comes in - the filesystem is already
set up to do reverse mapping from physical range to inode(s)
offsets...

> There's likely always more pages than inodes, but perhaps it's more
> efficient to walk the 'struct page' array than sb->s_inodes?

I really don't see why you seem to be telling us that invalidation is an
either/or choice. There's more ways to convert physical block
address -> inode file offset and mapping index than brute force
inode cache walks....

.....

> > IOWs, what needs to happen at this point is very filesystem
> > specific. Assuming that "device unplug == filesystem dead" is not
> > correct, nor is specifying a generic action that assumes the
> > filesystem is dead because a device it is using went away.
>
> Ok, I think I set this discussion in the wrong direction implying any
> mapping of this action to a "filesystem dead" event. It's just a "zap
> all ptes" event and upper layers recover from there.

Yes, that's exactly what ->corrupt_range() is intended for. It
allows the filesystem to lock out access to the bad range
and then recover the data. Or metadata, if that's where the bad
range lands. If that recovery fails, it can then report a data
loss/filesystem shutdown event to userspace and kill user procs that
span the bad range...

FWIW, is this notification going to occur before or after the device
has been physically unplugged? i.e. what do we do about the
time-of-unplug-to-time-of-invalidation window where userspace can
still attempt to access the missing pmem through the
not-yet-invalidated ptes? It may not be likely that people just yank
pmem nvdimms out of machines, but with NVMe persistent memory
spaces, there's every chance that someone pulls the wrong device...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-03-01 07:41:55

by Yasunori Gotou (Fujitsu)

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

Hello, Dan-san,

On 2021/02/27 4:24, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong <[email protected]> wrote:
>>
>> On Fri, Feb 26, 2021 at 09:45:45AM +0000, [email protected] wrote:
>>> Hi, guys
>>>
>>> Besides this patchset, I'd like to confirm something about the
>>> "EXPERIMENTAL" tag for dax in XFS.
>>>
>>> In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message
>>> when we mount a pmem device with the dax option, has existed for a
>>> while. It's a bit annoying when using fsdax feature. So, my initial
>>> intention was to remove this tag. And I started to find out and solve
>>> the problems which prevent it from being removed.
>>>
>>> As discussed before, there are 3 main problems. The first one is "dax
>>> semantics", which has been resolved. The remaining two are "RMAP for
>>> fsdax" and "support dax reflink for filesystem", which I have been
>>> working on.
>>
>> <nod>
>>
>>> So, what I want to confirm is: does it mean that we can remove the
>>> "EXPERIMENTAL" tag when the remaining two problems are solved?
>>
>> Yes. I'd keep the experimental tag for a cycle or two to make sure that
>> nothing new pops up, but otherwise the two patchsets you've sent close
>> those two big remaining gaps. Thank you for working on this!
>>
>>> Or maybe there are other important problems that need to be fixed before
>>> removing it? If there are, could you please show me that?
>>
>> That remains to be seen through QA/validation, but I think that's it.
>>
>> Granted, I still have to read through the two patchsets...
>
> I've been meaning to circle back here as well.
>
> My immediate concern is the issue Jason recently highlighted [1] with
> respect to invalidating all dax mappings when / if the device is
> ripped out from underneath the fs. I don't think that will collide
> with Ruan's implementation, but it does need new communication from
> driver to fs about removal events.
>
> [1]: http://lore.kernel.org/r/CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@mail.gmail.com
>

I'm not yet sure why there is a race condition between the unbinding
operation and accessing an mmapped file on filesystem dax.

This may be a silly question, but could you tell me why the "unbinding"
operation of a namespace which is mounted by filesystem dax must be
allowed?
If "unbinding" is rejected when the filesystem is mounted with dax
enabled, what is the inconvenience?

I can imagine that if a device like a USB memory stick is removed by
surprise, the kernel/filesystem needs to reject writeback at that time
and discard the page cache. So I can understand that the unbinding
operation is essential for such a case.
But I don't know why a PMEM device/namespace allows the unbinding
operation as if it were a surprise removal event.

Thanks,

--
Yasunori Goto

2021-03-02 17:02:45

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner <[email protected]> wrote:
>
> On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner <[email protected]> wrote:
> > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <[email protected]> wrote:
> > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > it points to, check if it points to the PMEM that is being removed,
> > > grab the page it points to, map that to the relevant struct page,
> > > run collect_procs() on that page, then kill the user processes that
> > > map that page.
> > >
> > > So why can't we walk the ptes, check the physical pages that they
> > > map to and if they map to a pmem page we go poison that
> > > page and that kills any user process that maps it.
> > >
> > > i.e. I can't see how unexpected pmem device unplug is any different
> > > to an MCE delivering a hwpoison event to a DAX mapped page.
> >
> > I guess the tradeoff is walking a long list of inodes vs walking a
> > large array of pages.
>
> Not really. You're assuming all a filesystem has to do is invalidate
> everything if a device goes away, and that's not true. Finding if an
> inode has a mapping that spans a specific device in a multi-device
> filesystem can be a lot more complex than that. Just walking inodes
> is easy - determining which inodes need invalidation is the hard
> part.

That inode-to-device level of specificity is not needed for the same
reason that drop_caches does not need to be specific. If the wrong
page is unmapped a re-fault will bring it back, and re-fault will fail
for the pages that are successfully removed.

> That's where ->corrupt_range() comes in - the filesystem is already
> set up to do reverse mapping from physical range to inode(s)
> offsets...

Sure, but what is the need to get to that level of specificity with
the filesystem for something that should rarely happen in the course
of normal operation outside of a mistake?

>
> > There's likely always more pages than inodes, but perhaps it's more
> > efficient to walk the 'struct page' array than sb->s_inodes?
>
> I really don't see why you seem to be telling us that invalidation is an
> either/or choice. There's more ways to convert physical block
> address -> inode file offset and mapping index than brute force
> inode cache walks....

Yes, but I was trying to map it to an existing mechanism and the
internals of drop_pagecache_sb() are, in coarse terms, close to what
needs to happen here.

>
> .....
>
> > > IOWs, what needs to happen at this point is very filesystem
> > > specific. Assuming that "device unplug == filesystem dead" is not
> > > correct, nor is specifying a generic action that assumes the
> > > filesystem is dead because a device it is using went away.
> >
> > Ok, I think I set this discussion in the wrong direction implying any
> > mapping of this action to a "filesystem dead" event. It's just a "zap
> > all ptes" event and upper layers recover from there.
>
> Yes, that's exactly what ->corrupt_range() is intended for. It
> allows the filesystem to lock out access to the bad range
> and then recover the data. Or metadata, if that's where the bad
> range lands. If that recovery fails, it can then report a data
> loss/filesystem shutdown event to userspace and kill user procs that
> span the bad range...
>
> FWIW, is this notification going to occur before or after the device
> has been physically unplugged?

Before. This will be operations that happen in the pmem driver
->remove() callback.

> i.e. what do we do about the
> time-of-unplug-to-time-of-invalidation window where userspace can
> still attempt to access the missing pmem through the
> not-yet-invalidated ptes? It may not be likely that people just yank
> pmem nvdimms out of machines, but with NVMe persistent memory
> spaces, there's every chance that someone pulls the wrong device...

The physical removal aspect is only theoretical today. While the pmem
driver has a ->remove() path, that's purely a software unbind
operation. That said the vulnerability window today is if a process
acquires a dax mapping, the pmem device hosting that filesystem goes
through an unbind / bind cycle, and then a new filesystem is created /
mounted. That old pte may be able to access data that is outside its
intended protection domain.

Going forward, for buses like CXL, there will be a managed physical
remove operation via PCIE native hotplug. The flow there is that the
PCIE hotplug driver will notify the OS of a pending removal, trigger
->remove() on the pmem driver, and then notify the technician (slot
status LED) that the card is safe to pull.

2021-03-02 18:14:57

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner <[email protected]> wrote:
>
> On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner <[email protected]> wrote:
> > >
> > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner <[email protected]> wrote:
> > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <[email protected]> wrote:
> > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > > it points to, check if it points to the PMEM that is being removed,
> > > > > grab the page it points to, map that to the relevant struct page,
> > > > > run collect_procs() on that page, then kill the user processes that
> > > > > map that page.
> > > > >
> > > > > > So why can't we walk the ptes, check the physical pages that they
> > > > > map to and if they map to a pmem page we go poison that
> > > > > page and that kills any user process that maps it.
> > > > >
> > > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > > >
> > > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > > large array of pages.
> > >
> > > Not really. You're assuming all a filesystem has to do is invalidate
> > > everything if a device goes away, and that's not true. Finding if an
> > > inode has a mapping that spans a specific device in a multi-device
> > > filesystem can be a lot more complex than that. Just walking inodes
> > > > is easy - determining which inodes need invalidation is the hard
> > > part.
> >
> > That inode-to-device level of specificity is not needed for the same
> > reason that drop_caches does not need to be specific. If the wrong
> > page is unmapped a re-fault will bring it back, and re-fault will fail
> > for the pages that are successfully removed.
> >
> > > That's where ->corrupt_range() comes in - the filesystem is already
> > > set up to do reverse mapping from physical range to inode(s)
> > > offsets...
> >
> > Sure, but what is the need to get to that level of specificity with
> > the filesystem for something that should rarely happen in the course
> > of normal operation outside of a mistake?
>
> Dan, you made this mistake with the hwpoisoning code that we're
> trying to fix here. Hard coding a 1:1 physical address to
> inode/offset into the DAX mapping was a bad mistake. It's also one
> that should never have occurred because it's *obviously wrong* to
> filesystem developers and has been for a long time.

I admit that mistake. The traditional memory error handling model
assumptions around page->mapping were broken by DAX, I'm not trying to
repeat that mistake. I feel we're talking past each other on the
discussion of the proposals.

> Now we have the filesystem people providing a mechanism for the pmem
> devices to tell the filesystems about physical device failures so
> they can handle such failures correctly themselves. Having the
> device go away unexpectedly from underneath a mounted and active
> filesystem is a *device failure*, not an "unplug event".

It's the same difference to the physical page, all mappings to that
page need to be torn down. I'm happy to call an fs callback and let
each filesystem do what it wants with a "every pfn in this dax device
needs to be unmapped".

I'm looking at the ->corrupted_range() patches trying to map it to
this use case and I don't see how, for example a realtime-xfs over DM
over multiple PMEM gets the notification to the right place.
bd_corrupted_range() uses get_super() which gets the wrong answer for
both realtime-xfs and DM.

I'd flip that arrangement around and have the FS tell the block device
"if something happens to you, here is the super_block to notify". So
to me this looks like a fs_dax_register_super() helper that plumbs the
superblock through an arbitrary stack of block devices to the leaf
block-device that might want to send a notification up when a global
unmap operation needs to be performed.

I naively think that "for_each_inode()
unmap_mapping_range(&inode->i_mapping)" is sufficient as a generic
implementation; that does not preclude XFS from overriding that generic
implementation and handling it directly if it so chooses.
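
To make the shape of that proposal concrete, here is a rough userspace
sketch, not kernel code. It models the superblock being plumbed down to
the leaf device and the leaf calling back up, with a generic "unmap
everything" fallback that a filesystem may override. Every name below
(model_super, model_bdev, model_register_super() and so on) is
hypothetical, and the device stack is simplified to a single chain,
whereas DM over multiple PMEM devices would fan out to several leaves.

```c
/*
 * Userspace model only. Every struct and function name here is
 * hypothetical; none of them is an existing kernel interface.
 */
#include <assert.h>
#include <stddef.h>

struct model_inode {
	int dax_mapped;               /* nonzero while user ptes may exist */
	struct model_inode *next;     /* next inode on the superblock list */
};

struct model_super {
	struct model_inode *inodes;   /* stand-in for sb->s_inodes */
	/* Optional filesystem override; NULL selects the generic walk. */
	int (*unmap_all)(struct model_super *sb);
};

struct model_bdev {
	struct model_bdev *lower;     /* next device down, NULL at the leaf */
	struct model_super *sb;       /* superblock to notify (leaf only) */
};

/* Plumb the superblock down a single chain of devices to the leaf. */
static struct model_bdev *model_register_super(struct model_bdev *top,
					       struct model_super *sb)
{
	assert(top != NULL);
	while (top->lower)
		top = top->lower;
	top->sb = sb;
	return top;
}

/* Generic fallback: drop the mappings of every inode on the list. */
static int model_generic_unmap_all(struct model_super *sb)
{
	struct model_inode *inode;
	int zapped = 0;

	for (inode = sb->inodes; inode; inode = inode->next) {
		if (inode->dax_mapped) {
			/* stand-in for unmap_mapping_range() on this inode */
			inode->dax_mapped = 0;
			zapped++;
		}
	}
	return zapped;
}

/*
 * Leaf driver side: notify whoever registered. Returns the number of
 * inodes unmapped, or -1 if no superblock was registered.
 */
static int model_leaf_unmap_notify(struct model_bdev *leaf)
{
	struct model_super *sb = leaf->sb;

	if (!sb)
		return -1;
	if (sb->unmap_all)
		return sb->unmap_all(sb);
	return model_generic_unmap_all(sb);
}
```

The point of the override pointer is only that the generic walk is a
default, so a filesystem with better information can replace it.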

> The mistake you made was not understanding how filesystems work,
> nor actually asking filesystem developers what they actually needed.

You're going too far here, but that's off topic.

> You're doing the same thing here - you're telling us what you think
> the solution filesystems need is.

No, I'm not, I'm trying to understand tradeoffs. I apologize if this
is coming across as not listening.

> Please listen when we say "that is
> not sufficient" because we don't want to be backed into a corner
> that we have to fix ourselves again before we can enable some basic
> filesystem functionality that we should have been able to support on
> DAX from the start...

That's some revisionist interpretation of how the discovery of the
reflink+dax+memory-error-handling collision went down.

The whole point of this discussion is to determine if
->corrupted_range() is suitable for this notification, and looking at
the code as is, it isn't. Perhaps you have a different implementation
of ->corrupted_range() in mind that allows this to be plumbed
correctly?

>
> > > > There's likely always more pages than inodes, but perhaps it's more
> > > > efficient to walk the 'struct page' array than sb->s_inodes?
> > >
> > > I really don't see why you seem to be telling us that invalidation is an
> > > either/or choice. There's more ways to convert physical block
> > > address -> inode file offset and mapping index than brute force
> > > inode cache walks....
> >
> > Yes, but I was trying to map it to an existing mechanism and the
> > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > needs to happen here.
>
> No.
>
> drop_pagecache_sb() is not a relevant model for telling a filesystem
> that the block device underneath has gone away,

Like I said I'm not trying to communicate "device has gone away", only
"unmap all dax pages". If you want those to be one and the same
mechanism I'm listening, but like I said it was my mistake for tying
global unmap to device-gone, they need not be the same given
filesystems have not historically been notified proactively of device
removal.

> nor for a device to
> ensure that access protections that *are managed by the filesystem*
> are enforced/revoked sanely.

Yes, if the fs needs / wants to do more than the generic need of unmap
all dax it's free to override the generic implementation.

> drop_pagecache_sb() is a brute-force model for invalidating user
> data mappings that the filesystem performs in response to such a
> notification. It only needs this brute-force approach if it has no
> other way to find active DAX mappings across the range of the device
> that has gone away.

Ok.

> But this model doesn't work for direct mapped metadata, journals or
> any other internal direct filesystem mappings that aren't referenced
> by inodes that the filesystem might be using. The filesystem still
> needs to invalidate all those mappings and prevent further access to
> them, even from within the kernel itself.

Agree. If the filesystem was using DAX techniques for metadata it
would want to know before the direct-map is torn down. No argument
there.

> Filesystems are way more complex than pure DAX devices, and hence
> handle errors and failure events differently. Unlike DAX devices, we
> have both internal and external references to the DAX device, and we
> can have both external and internal direct maps. Invalidating user
> data mappings is all dax devices need to do on unplug, but for
> filesystems it is only a small part of what we have to do when a
> range of a device goes bad.
>
> IOWs, there is no "one size fits all" approach that works for all
> filesystems, nor is there one strategy that is optimal for all
> filesystems. Failure handling in a filesystem is almost always
> filesystem specific...

Point taken, if a filesystem is not using the block-layer for metadata
I/O and using DAX techniques directly it needs this event too
otherwise it will crash vs report failed operations...
->corrupted_range() does not offer the correct plumbing for that
today.

There's an additional problem this brings to mind. Device-mapper
targets like dm-writecache need this notification as well because it
is using direct physical page access via the linear map and may crash
like the filesystem if the mm-direct-map is torn down from underneath
it.

> > > > Ok, I think I set this discussion in the wrong direction implying any
> > > > mapping of this action to a "filesystem dead" event. It's just a "zap
> > > > all ptes" event and upper layers recover from there.
> > >
> > > Yes, that's exactly what ->corrupt_range() is intended for. It
> > > allows the filesystem to lock out access to the bad range
> > > and then recover the data. Or metadata, if that's where the bad
> > > range lands. If that recovery fails, it can then report a data
> > > loss/filesystem shutdown event to userspace and kill user procs that
> > > span the bad range...
> > >
> > > FWIW, is this notification going to occur before or after the device
> > > has been physically unplugged?
> >
> > Before. This will be operations that happen in the pmem driver
> > ->remove() callback.
> >
> > > i.e. what do we do about the
> > > time-of-unplug-to-time-of-invalidation window where userspace can
> > > still attempt to access the missing pmem through the
> > > not-yet-invalidated ptes? It may not be likely that people just yank
> > > pmem nvdimms out of machines, but with NVMe persistent memory
> > > spaces, there's every chance that someone pulls the wrong device...
> >
> > The physical removal aspect is only theoretical today.
>
> For actual pmem, maybe. But hotplug RAM is a thing; big numa
> machines that can hotplug nodes into their fabric and so have CPUs
> and memory able to come and go from a live machine. It's not a small
> stretch to extend that to having PMEM in those nodes. So it's a
> practical design concern right now, even ignoring that NVMe is
> hotplug....

Memory hotplug today requires the memory-device to be offlined before
the memory is unplugged and the core-mm has a chance to say "no" if it
sees even one page with an elevated reference. Block-devices in
contrast have no option to say "no" to being unplugged / ->remove()
triggered.

> > While the pmem
> > driver has a ->remove() path, that's purely a software unbind
> > operation. That said the vulnerability window today is if a process
> > acquires a dax mapping, the pmem device hosting that filesystem goes
> > through an unbind / bind cycle, and then a new filesystem is created /
> > mounted. That old pte may be able to access data that is outside its
> > intended protection domain.
>
> So what is being done to prevent stale DAX mappings from being
> leaked this way right now, seeing as the leak you mention here
> doesn't appear in any way to be filesystem related?

For device-dax where there is only one inode->i_mapping to deal with,
one unmap_mapping_range() call is performed in the device shutdown
path. For filesystem-dax only the direct-map is torn down.

The user mapping teardown gap is why I'm coming at this elephant from
the user mapping perspective and not necessarily the "what does the
filesystem want to do about device removal" perspective.

> > Going forward, for buses like CXL, there will be a managed physical
> > remove operation via PCIE native hotplug. The flow there is that the
> > PCIE hotplug driver will notify the OS of a pending removal, trigger
> > ->remove() on the pmem driver, and then notify the technician (slot
> > status LED) that the card is safe to pull.
>
> That doesn't protect against pulling the wrong device, or having
> someone pull the device without first running an admin command that
> makes systems using DAX safe to pull the device....

Of course not, at some point surprise removal can't be compensated.
There are hardware mechanisms to try to contain mistakes, but those
can only go so far...

> And once you take into account that "pulling the wrong device" can
> happen, how does the filesystem tell the difference between a
> device being pulled and a drive cage just dying and so the drives
> just disappear from the system? How are these accidental vs real
> failures any different from the perspective of a filesystem mounted
> on that device?

Not even the device driver can tell you that. The Linux driver model
has no way to communicate why ->remove() is being called, it only
knows that it needs to revoke everything that was handed out since
->probe().

> And then there is the elephant in the room: if there's a "human in
> the loop" step needed to hot unplug a pmem device safely, then
> why the hell is the filesystem on that device still mounted and the
> DAX applications still running?

This goes back to Yasunori's question, can't ->remove() just be
blocked when the filesystem is mounted? The answer is similar to
asking the filesystem to allow DAX RDMA pages to be pinned
indefinitely and lock-out the filesystem from making any extent-map
changes. If the admin wants the device disabled while the filesystem
is mounted the kernel should do everything it can to honor that
request safely.

> This just makes no sense at all from an operations perspective - if
> you know that you are about to do an unplug that will result in all
> your DAX apps and filesystems being killed (i.e. fatal production
> environment failure) then why haven't they all been stopped by the
> admin before the device unplug is done? Why does this "human in the
> loop" admin task require the applications and filesystems to handle
> this without warning and have to treat it as a "device failure"
> event when this can all be avoided for normal, scheduled, controlled
> unplug operations? The "unexpected unplug" is a catastrophic failure
> event which may have severe side effects on system operation and
> stability. Why would you design an unplug process that does not
> start with a clean, controlled shutdown process from the top down?
> If we make the assumption that planned unplugs are well planned,
> organised and scheduled, then the only thing that an unplug event
> needs to mean to a filesystem is "catastrophic device failure has
> occurred".

There is a difference between the kernel saying "don't do that, bad
things will happen" and "you can't do that the entire system will
crash / security promises will be violated".

git grep -n suppress_bind_attr drivers/ata/ drivers/scsi/ drivers/nvme/

There are no block-device providers that I can find on a quick search
that forbid triggering ->remove() on the driver if a filesystem is
mounted. pmem is not the first block device driver to present this
problem.

> So from a system level, the way you are describing the way hot
> unplug events are supposed to occur and work looks completely
> screwed up to me. Exactly what use case do you have for pmem device
> hot-unplug from under a live filesystem that isn't considered a
> *catastrophic runtime device failure* by the filesystem?

I'm coming at this from the perspective of it historically always
being possible for a block-device to be ripped out from underneath a
filesystem. I seem to be just the messenger conveying that bad news.
What's different now is that DAX has expanded what was previously
constrained to something the block layer could handle with a BLK_STS_*
return value for new I/O to a live pte that needs to be torn down, not
a page cache page that can live on indefinitely.

2021-03-02 18:46:19

by Darrick J. Wong

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner <[email protected]> wrote:
> >
> > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner <[email protected]> wrote:
> > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <[email protected]> wrote:
> > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > it points to, check if it points to the PMEM that is being removed,
> > > > grab the page it points to, map that to the relevant struct page,
> > > > run collect_procs() on that page, then kill the user processes that
> > > > map that page.
> > > >
> > > > So why can't we walk the ptes, check the physical pages that they
> > > > map to and if they map to a pmem page we go poison that
> > > > page and that kills any user process that maps it.
> > > >
> > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > >
> > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > large array of pages.
> >
> > Not really. You're assuming all a filesystem has to do is invalidate
> > everything if a device goes away, and that's not true. Finding if an
> > inode has a mapping that spans a specific device in a multi-device
> > filesystem can be a lot more complex than that. Just walking inodes
> > is easy - determining which inodes need invalidation is the hard
> > part.
>
> That inode-to-device level of specificity is not needed for the same
> reason that drop_caches does not need to be specific. If the wrong
> page is unmapped a re-fault will bring it back, and re-fault will fail
> for the pages that are successfully removed.
>
> > That's where ->corrupt_range() comes in - the filesystem is already
> > set up to do reverse mapping from physical range to inode(s)
> > offsets...
>
> Sure, but what is the need to get to that level of specificity with
> the filesystem for something that should rarely happen in the course
> of normal operation outside of a mistake?

I can't tell if we're conflating the "a bunch of your pmem went bad"
case with the "all your dimms fell out of the machine" case.

If, say, a single cacheline's worth of pmem goes bad on a node with 2TB
of pmem, I certainly want that level of specificity. Just notify the
users of the dead piece, don't flush the whole machine down the drain.

> > > There's likely always more pages than inodes, but perhaps it's more
> > > efficient to walk the 'struct page' array than sb->s_inodes?
> >
> > I really don't see why you seem to be telling us that invalidation is an
> > either/or choice. There's more ways to convert physical block
> > address -> inode file offset and mapping index than brute force
> > inode cache walks....
>
> Yes, but I was trying to map it to an existing mechanism and the
> internals of drop_pagecache_sb() are, in coarse terms, close to what
> needs to happen here.

Yes. XFS (with rmap enabled) can do all the iteration and walking in
that function except for the invalidate_mapping_* call itself. The goal
of this series is first to wire up a callback within both the block and
pmem subsystems so that they can take notifications and reverse-map them
through the storage stack until they reach an fs superblock.

Once the information has reached XFS, it can use its own reverse
mappings to figure out which pages of which inodes are now targeted.
The future of DAX hw error handling can be that you throw the spitwad at
us, and it's our problem to distill that into mm invalidation calls.
XFS' reverse mapping data is indexed by storage location and isn't
sharded by address_space, so (except for the DIMMs falling out), we
don't need to walk the entire inode list or scan the entire mapping.
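
As a rough illustration of why a location-indexed reverse map avoids the
full inode walk, here is a userspace sketch, not XFS code. It assumes a
flat array of records sorted by physical start block and not overlapping
each other (real rmap data with shared extents is more involved), and
all names in it (model_rmap, model_hit, model_rmap_query()) are made up
for the example.

```c
/*
 * Userspace model only; names are hypothetical and this is not how
 * the XFS rmap btree is actually laid out.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One reverse-mapping record: a physical extent and its owner. */
struct model_rmap {
	uint64_t start;    /* first physical block of the extent */
	uint64_t len;      /* extent length in blocks */
	uint64_t ino;      /* owning inode number */
	uint64_t offset;   /* file offset (in blocks) of the extent start */
};

/* One affected file range, ready to hand to an invalidation call. */
struct model_hit {
	uint64_t ino;
	uint64_t file_off;
	uint64_t nblocks;
};

/*
 * Find the file ranges backed by [bad_start, bad_start + bad_len).
 * recs must be sorted by start and non-overlapping. Stores at most
 * max_out hits and returns how many were stored.
 */
static size_t model_rmap_query(const struct model_rmap *recs, size_t nrecs,
			       uint64_t bad_start, uint64_t bad_len,
			       struct model_hit *out, size_t max_out)
{
	uint64_t bad_end = bad_start + bad_len;
	size_t lo = 0, hi = nrecs, hits = 0;

	assert(bad_len > 0);

	/* Binary search for the first record ending after bad_start. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (recs[mid].start + recs[mid].len <= bad_start)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* Emit the clipped overlap of each record that starts in range. */
	for (; lo < nrecs && recs[lo].start < bad_end && hits < max_out; lo++) {
		uint64_t rec_end = recs[lo].start + recs[lo].len;
		uint64_t s = recs[lo].start > bad_start ? recs[lo].start : bad_start;
		uint64_t e = rec_end < bad_end ? rec_end : bad_end;

		out[hits].ino = recs[lo].ino;
		out[hits].file_off = recs[lo].offset + (s - recs[lo].start);
		out[hits].nblocks = e - s;
		hits++;
	}
	return hits;
}
```

Only the records that overlap the bad range are touched, so the cost
scales with the size of the failure rather than with the number of
inodes in the filesystem.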

Between XFS and DAX and mm, the mm already has the invalidation calls,
xfs already has the distiller, and so all we need is that first bit.
The current mm code doesn't fully solve the problem, nor does it need
to, since it handles DRAM errors acceptably* already.

* Actually, the hwpoison code should _also_ be calling ->corrupted_range
when DRAM goes bad so that we can detect metadata failures and either
reload the buffer or (if it was dirty) shut down.

> >
> > .....
> >
> > > > IOWs, what needs to happen at this point is very filesystem
> > > > specific. Assuming that "device unplug == filesystem dead" is not
> > > > correct, nor is specifying a generic action that assumes the
> > > > filesystem is dead because a device it is using went away.
> > >
> > > Ok, I think I set this discussion in the wrong direction implying any
> > > mapping of this action to a "filesystem dead" event. It's just a "zap
> > > all ptes" event and upper layers recover from there.
> >
> > Yes, that's exactly what ->corrupt_range() is intended for. It
> > allows the filesystem to lock out access to the bad range
> > and then recover the data. Or metadata, if that's where the bad
> > range lands. If that recovery fails, it can then report a data
> > loss/filesystem shutdown event to userspace and kill user procs that
> > span the bad range...
> >
> > FWIW, is this notification going to occur before or after the device
> > has been physically unplugged?
>
> Before. This will be operations that happen in the pmem driver
> ->remove() callback.
>
> > i.e. what do we do about the
> > time-of-unplug-to-time-of-invalidation window where userspace can
> > still attempt to access the missing pmem through the
> > not-yet-invalidated ptes? It may not be likely that people just yank
> > pmem nvdimms out of machines, but with NVMe persistent memory
> > spaces, there's every chance that someone pulls the wrong device...
>
> The physical removal aspect is only theoretical today. While the pmem
> driver has a ->remove() path, that's purely a software unbind
> operation. That said the vulnerability window today is if a process
> acquires a dax mapping, the pmem device hosting that filesystem goes
> through an unbind / bind cycle, and then a new filesystem is created /
> mounted. That old pte may be able to access data that is outside its
> intended protection domain.
>
> Going forward, for buses like CXL, there will be a managed physical
> remove operation via PCIE native hotplug. The flow there is that the
> PCIE hotplug driver will notify the OS of a pending removal, trigger
> ->remove() on the pmem driver, and then notify the technician (slot
> status LED) that the card is safe to pull.

Well, that's a relief. Can we cancel longterm RDMA leases now too?
<duck>

--D

2021-03-02 18:48:41

by Dave Chinner

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Mon, Mar 01, 2021 at 04:32:36PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner <[email protected]> wrote:
> > Now we have the filesystem people providing a mechanism for the pmem
> > devices to tell the filesystems about physical device failures so
> > they can handle such failures correctly themselves. Having the
> > device go away unexpectedly from underneath a mounted and active
> > filesystem is a *device failure*, not an "unplug event".
>
> It's the same difference to the physical page, all mappings to that
> page need to be torn down. I'm happy to call an fs callback and let
> each filesystem do what it wants with a "every pfn in this dax device
> needs to be unmapped".

You keep talking like this is something specific to a DAX device.
It isn't - the filesystem needs to take specific actions if any type
of block device reports that it has a corrupted range, not just DAX.
A DAX device simply adds "and invalidate direct mappings" to the
list of stuff that needs to be done.

And as far as a filesystem is concerned, there is no difference
between "this 4kB range is bad" and "the range of this entire device
is bad". We have to do the same things in both situations.

> I'm looking at the ->corrupted_range() patches trying to map it to
> this use case and I don't see how, for example a realtime-xfs over DM
> over multiple PMEM gets the notification to the right place.
> bd_corrupted_range() uses get_super() which gets the wrong answer for
> both realtime-xfs and DM.

I'm not sure I follow your logic. What is generating the wrong
answer?

We already have infrastructure for the block device to look up the
superblock mounted on top of it, and DM already uses that for things
like "dmsetup suspend" to freeze the filesystem before it does
something. This "superblock lookup" only occurs for the top level
DM device, not for the component pmem devices that make up the DM
device.

IOWs, if there's a DM device that maps multiple pmem devices, then
it should be stacking the bd_corrupted_range() callbacks to map the
physical device range to the range in the higher level DM device
that it belongs to. This mapping of ranges is what DM exists to do -
the filesystem has no clue about what devices make up a DM device,
so the DM device *must* translate ranges on component devices into
the corresponding ranges of the LBA space it exposes to the
filesystem.
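
The range translation described above can be sketched in user space as
follows. This is only an illustration under stated assumptions: LinearTarget
and translate_up() are hypothetical names modelling a simple linear mapping
table, not the device-mapper implementation or the proposed
bd_corrupted_range() code.

```python
from typing import NamedTuple


class LinearTarget(NamedTuple):
    """Hypothetical entry in a stacked device's mapping table."""
    dm_start: int   # first sector of this target in the stacked device's LBA space
    length: int     # number of sectors covered by the target
    dev: str        # name of the component device
    dev_start: int  # first sector used on the component device


def translate_up(table, dev, bad_start, bad_len):
    """Map a bad range on a component device into the stacked device's LBA space.

    Returns a list of (start, length) ranges as the filesystem would see them.
    """
    bad_end = bad_start + bad_len
    ranges = []
    for t in table:
        if t.dev != dev:
            continue
        # Clip the bad range to the part of the component this target uses.
        lo = max(bad_start, t.dev_start)
        hi = min(bad_end, t.dev_start + t.length)
        if lo < hi:
            ranges.append((t.dm_start + (lo - t.dev_start), hi - lo))
    return ranges
```

In this model the filesystem never sees component device names. It only
receives ranges in the address space of the device it is mounted on, and a
second stacked layer could apply the same translation again before passing
the ranges further up.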

> I'd flip that arrangement around and have the FS tell the block device
> "if something happens to you, here is the super_block to notify".

We already have a mechanism for this that the block device calls:
get_active_super(bdev). There can be only one superblock per block
device - the superblock has exclusive ownership of the block device
while the filesystem is mounted.

get_active_super() returns the superblock that sits on top of the
bdev with an active reference, allowing the caller to safely access
and operate on the superblock without having to worry about the
superblock going away in the middle of whatever operation the block
device needs to perform.
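
The ownership rule described here (one superblock per block device, looked up
with an active reference held) can be modelled with a small user-space toy.
This is not the kernel's get_active_super() implementation: Superblock,
Registry, and their methods are hypothetical names, and the sketch ignores
locking entirely.

```python
class Superblock:
    """Toy superblock; the mount itself holds one active reference."""

    def __init__(self, fs_name):
        self.fs_name = fs_name
        self.active = 1


class Registry:
    """Toy map from block device name to the single superblock mounted on it."""

    def __init__(self):
        self._by_bdev = {}

    def mount(self, bdev, fs_name):
        # Exclusive ownership: a second mount on the same device is refused.
        if bdev in self._by_bdev:
            raise RuntimeError(f"{bdev} is busy")
        sb = Superblock(fs_name)
        self._by_bdev[bdev] = sb
        return sb

    def get_active(self, bdev):
        """Return the superblock on bdev with an extra active reference, or None."""
        sb = self._by_bdev.get(bdev)
        if sb is None or sb.active == 0:
            return None
        sb.active += 1
        return sb

    def put(self, sb):
        """Drop a reference taken by get_active()."""
        sb.active -= 1
```

A caller that holds the extra reference can deliver a notification to the
superblock and then drop the reference, without the superblock disappearing
in between.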

If this isn't working, then existing storage stack functionality
doesn't work as it should and this needs fixing independently of
the PMEM/DAX stuff we are talking about here.

> So
> to me this looks like a fs_dax_register_super() helper that plumbs the
> superblock through an arbitrary stack of block devices to the leaf
> block-device that might want to send a notification up when a global
> unmap operation needs to be performed.

No, this is just wrong. The filesystem has no clue what block device
is at the leaf level of a block device stack, nor what LBA block
range represents that device within the address space the stacked
block devices present to the filesystem.

> > Please listen when we say "that is
> > not sufficient" because we don't want to be backed into a corner
> > that we have to fix ourselves again before we can enable some basic
> > filesystem functionality that we should have been able to support on
> > DAX from the start...
>
> That's some revisionist interpretation of how the discovery of the
> reflink+dax+memory-error-handling collision went down.
>
> The whole point of this discussion is to determine if
> ->corrupted_range() is suitable for this notification, and looking at
> the code as is, it isn't. Perhaps you have a different implementation
> of ->corrupted_range() in mind that allows this to be plumbed
> correctly?

So rather than try to make the generic ->corrupted_range
infrastructure be able to report "DAX range is invalid" (which is
the very definition of a corrupted block device range!), you want
to introduce a DAX specific notification to tell us that a range in
the block device contains invalid/corrupt data?

We're talking about a patchset that is in development. The proposed
notification path is supposed to be generic and *not PMEM specific*,
and is intended to handle exactly the use case that you raised.
The implementation may not be perfect yet, so rather than trying to
say "we need something different that does the same thing", work to
ensure that the proposed -generic infrastructure- can pass the
information you want to pass to the filesystem.

> > > Yes, but I was trying to map it to an existing mechanism and the
> > > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > > needs to happen here.
> >
> > No.
> >
> > drop_pagecache_sb() is not a relevant model for telling a filesystem
> > that the block device underneath has gone away,
>
> Like I said I'm not trying to communicate "device has gone away", only
> "unmap all dax pages".

That is the wrong thing to be communicating. If the device has gone
away, the filesystem needs to know that the device has gone away,
not that it should just unmap DAX pages.

> If you want those to be one and the same
> mechanism I'm listening, but like I said it was my mistake for tying
> global unmap to device-gone, they need not be the same given
> filesystems have not historically been notified proactively of device
> removal.

What other circumstance is there for the device driver punching
through block device layers to tell the filesystem it should "unmap
all dax pages"? And if we get such an event, what does that mean for
any of the other filesystem data/metadata in that range?

You are still trying to tell the filesystem what action it must take
based on what went wrong at the device driver level, not
communicating what error just occurred to the device. The filesystem
needs to know about the error that occurred, not what some device
thinks the filesystem should do when the device detects an error.

> > Filesystems are way more complex than pure DAX devices, and hence
> > handle errors and failure events differently. Unlike DAX devices, we
> > have both internal and external references to the DAX device, and we
> > can have both external and internal direct maps. Invalidating user
> > data mappings is all dax devices need to do on unplug, but for
> > filesystems it is only a small part of what we have to do when a
> > range of a device goes bad.
> >
> > IOWs, there is no "one size fits all" approach that works for all
> > filesystems, nor is there one strategy that is optimal for all
> > filesystems. Failure handling in a filesystem is almost always
> > filesystem specific...
>
> Point taken, if a filesystem is not using the block-layer for metadata
> I/O and using DAX techniques directly it needs this event too
> otherwise it will crash vs report failed operations...
> ->corrupted_range() does not offer the correct plumbing for that
> today.
>
> There's an additional problem this brings to mind. Device-mapper
> targets like dm-writecache need this notification as well because it
> is using direct physical page access via the linear map and may crash
> like the filesystem if the mm-direct-map is torn down from underneath
> it.

Yes, dm gets the notification by the ->corrupted_range() callback
from its underlying device(s). It can then do what it needs to map
the range and pass that error on to the filesystem. Fundamentally,
though, if the range is mapped into userspace and it goes away, the
user has lost data and there's nothing DM can do to recover it so
all it can do is pass the corruption up the stack to the next layer
(either another block device or the filesystem).

> > For actual pmem, maybe. But hotplug RAM is a thing; big numa
> > machines that can hotplug nodes into their fabric and so have CPUs
> > and memory able to come and go from a live machine. It's not a small
> > stretch to extend that to having PMEM in those nodes. So it's a
> > practical design concern right now, even ignoring that NVMe is
> > hotplug....
>
> Memory hotplug today requires the memory-device to be offlined before
> the memory is unplugged and the core-mm has a chance to say "no" if it
> sees even one page with an elevated reference. Block-devices in
> contrast have no option to say "no" to being unplugged / ->remove()
> triggered.

Yes, I know that. That's my whole point - NVMe persistent regions
mean that DAX filesystems will have to handle the latter case, and
that it looks no different to normal block device failure to the
filesystem. ->corrupted_range is exactly how these events are
intended to be sent up the storage stack to the filesystem, so why
should PMEM be handled any different?

> > > While the pmem
> > > driver has a ->remove() path, that's purely a software unbind
> > > operation. That said the vulnerability window today is if a process
> > > acquires a dax mapping, the pmem device hosting that filesystem goes
> > > through an unbind / bind cycle, and then a new filesystem is created /
> > > mounted. That old pte may be able to access data that is outside its
> > > intended protection domain.
> >
> > So what is being done to prevent stale DAX mappings from being
> > leaked this way right now, seeing as the leak you mention here
> > doesn't appear in any way to be filesystem related?
>
> For device-dax where there is only one inode->i_mapping to deal with,
> one unmap_mapping_range() call is performed in the device shutdown
> path. For filesystem-dax only the direct-map is torn down.
>
> The user mapping teardown gap is why I'm coming at this elephant from
> the user mapping perspective and not necessarily the "what does the
> filesystem want to do about device removal" perspective.

But that doesn't help avoid the "user mapping teardown gap" at all -
that gap only gets bigger when you add a filesystem into the picture
because now we have tens to hundreds of millions of cached inodes to
walk and invalidate mappings on.

Closing this gap requires brute force purging the CPU ptes the
moment an unexpected DAX device unplug occurs. There is no other way
to do it quickly, and just waiting until the filesystem can unmap it
only increases the gap between the ptes becoming invalid and them
getting invalidated.

> > And once you take into account that "pulling the wrong device" can
> > happen, how does the filesystem tell the difference between a
> > device being pulled and a drive cage just dying and so the drive
> > just disappears from the system? How are these accidental vs real
> > failures any different from the perspective of a filesystem mounted
> > on that device?
>
> Not even the device driver can tell you that.

Exactly my point. As there is no difference between unplug and
device failure from a filesystem perspective, the communication
should come through a single "device failure" interface, not some
special DAX-specific notification path that you are advocating for.

> This goes back to Yasunori's question, can't ->remove() just be
> blocked when the filesystem is mounted? The answer is similar to
> asking the filesystem to allow DAX RDMA pages to be pinned
> indefinitely and lock-out the filesystem from making any extent-map
> changes. If the admin wants the device disabled while the filesystem
> is mounted the kernel should do everything it can to honor that
> request safely.

Sure, but the end effect of this is that the filesystem sees that
the -device has failed- and there is no need for DAX devices to
require some special "invalidate all mappings" notification when a
"device just failed" notification tells the filesystem the same
thing and a whole lot more....

> > This just makes no sense at all from an operations perspective - if
> > you know that you are about to do an unplug that will result in all
> > your DAX apps and filesystems being killed (i.e. fatal production
> > environment failure) then why haven't they all been stopped by the
> > admin before the device unplug is done? Why does this "human in the
> > loop" admin task require the applications and filesystems to handle
> > this without warning and have to treat it as a "device failure"
> > event when this can all be avoided for normal, scheduled, controlled
> > unplug operations? The "unexpected unplug" is a catastrophic failure
> > event which may have severe side effects on system operation and
> > stability. Why would you design an unplug process that does not
> > start with a clean, a controlled shutdown process from the top down?
> > If we make the assumption that planned unplugs are well planned,
> > organised and scheduled, then the only thing that an unplug event
> > needs to mean to a filesystem is "catastrophic device failure has
> > occurred".
>
> There is a difference between the kernel saying "don't do that, bad
> things will happen" and "you can't do that the entire system will
> crash / security promises will be violated".
>
> git grep -n suppress_bind_attr drivers/ata/ drivers/scsi/ drivers/nvme/
>
> There are no block-device providers that I can find on a quick search
> that forbid triggering ->remove() on the driver if a filesystem is
> mounted. pmem is not the first block device driver to present this
> problem.

Yes, that's because, as you point out, pmem has unique
characteristics - DAX - that absolutely require us to handle storage
failures in this way. No other type of device requires the filesystem
to directly arbitrate userspace access to the device, and so we've
been able to get away with having the block device return EIO or
ENODEV when we try to do IO and handling the problem that way.

But we still have been wanting ENODEV notification from block
devices when they are unexpectedly unplugged, and have been wanting
that functionality for at least the last decade, if not longer.
Filesystem shutdown on device removal should be instantaneous because
device removal for most filesystems is an unrecoverable error and
delaying the shutdown until a fatal IO error occurs in the
filesystem benefits no-one.

And now, we can't even get reliable IO error reporting, because DAX.

That's the problem that this set of ->corrupted_range callbacks is
supposed to solve - it's generic enough that we can plumb
ata/scsi/nvme layers into it as well as PMEM, and the filesystem
will now get device failure notifications from all types of device
drivers and block devices.

We do not need a DAX specific mechanism to tell us "DAX device
gone", we need a generic block device interface that tells us "range
of block device is gone".

The reason that the block device is gone is irrelevant to the
filesystem. The type of block device is also irrelevant. If the
filesystem isn't using DAX (e.g. dax=never mount option) even when
it is on a DAX capable device, then it just doesn't care about
invalidating DAX mappings because it has none. But it still may care
about shutting down the filesystem because the block device is gone.

This is why we need to communicate what error occurred, not what
action a device driver thinks needs to be taken. The error is
important to the filesystem, the action might be completely
irrelevant. And, as we know now, shutdown on DAX enabled filesystems
needs to imply DAX mapping invalidation in all cases, not just when
the device disappears from under the filesystem.
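
To make the "report the error, let the filesystem pick the action" split
concrete, here is a deliberately simplified sketch. Everything in it is an
assumption for illustration: the dictionary layout, the action names, and the
policy itself (shut down if metadata or the whole device is hit, otherwise
report data loss) are hypothetical and are not taken from the patchset.

```python
def handle_corrupted_range(fs, start, length):
    """Decide what a toy filesystem does about a reported bad range.

    The notification carries only the error (which range is bad). The actions
    are chosen here, from filesystem state the device driver cannot see.
    """
    end = start + length
    actions = []

    # Only a filesystem that actually has direct mappings needs to zap them.
    if fs["dax"]:
        actions.append("invalidate_dax_mappings")

    hits_metadata = any(
        m_start < end and start < m_start + m_len
        for m_start, m_len in fs["metadata_extents"]
    )
    whole_device = start == 0 and length >= fs["device_size"]

    # Hypothetical policy: metadata loss or a vanished device is fatal,
    # anything else is reported as data loss in the affected files.
    if whole_device or hits_metadata:
        actions.append("shutdown")
    else:
        actions.append("report_data_loss")
    return actions
```

The same notification therefore produces different results for a DAX mount
and a dax=never mount on the same device, which is the reason for passing up
the error and not a prescribed action.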

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-03-02 18:54:07

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner <[email protected]> wrote:
[..]
> We do not need a DAX specific mechanism to tell us "DAX device
> gone", we need a generic block device interface that tells us "range
> of block device is gone".

This is the crux of the disagreement. The block_device is going away
*and* the dax_device is going away. The dax_device removal implies one
set of actions (direct accessed pfns invalid) the block device removal
implies another (block layer sector access offline). corrupted_range
is blurring the notification for 2 different failure domains. Look at
the nascent idea to mount a filesystem on dax sans a block device.
Look at the existing plumbing for DM to map dax_operations through a
device stack. Look at the pushback Ruan got for adding a new
block_device operation for corrupted_range().

> The reason that the block device is gone is irrelevant to the
> filesystem. The type of block device is also irrelevant. If the
> filesystem isn't using DAX (e.g. dax=never mount option) even when
> it is on a DAX capable device, then it just doesn't care about
> invalidating DAX mappings because it has none. But it still may care
> about shutting down the filesystem because the block device is gone.

Sure, let's have a discussion about a block_device gone notification,
and a dax_device gone notification.

> This is why we need to communicate what error occurred, not what
> action a device driver thinks needs to be taken.

The driver is only an event producer in this model, whatever the
consumer does at the other end is not its concern. There may be a
generic consumer and a filesystem specific consumer.

> The error is
> important to the filesystem, the action might be completely
> irrelevant. And, as we know now, shutdown on DAX enabled filesystems
> needs to imply DAX mapping invalidation in all cases, not just when
> the device disappears from under the filesystem.

Sure.

2021-03-02 19:40:03

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Mon, Mar 1, 2021 at 9:38 PM Dave Chinner <[email protected]> wrote:
>
> On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> > On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner <[email protected]> wrote:
> > [..]
> > > We do not need a DAX specific mechanism to tell us "DAX device
> > > gone", we need a generic block device interface that tells us "range
> > > of block device is gone".
> >
> > This is the crux of the disagreement. The block_device is going away
> > *and* the dax_device is going away.
>
> No, that is not the disagreement I have with what you are saying.
> You still haven't understood that it's even more basic and generic
> than devices going away. At the simplest form, all the filesystem
> wants is to be notified when *unrecoverable media errors*
> occur in the persistent storage that underlies the filesystem.
>
> The filesystem does not care what that media is built from - PMEM,
> flash, corroded spinning disks, MRAM, or any other persistent media
> you can think of. It just doesn't matter.
>
> What we care about is that the contents of a *specific LBA range* no
> longer contain *valid data*. IOWs, the data in that range of the
> block device has been lost, cannot be retrieved and/or cannot be
> written to any more.
>
> PMEM taking a MCE because ECC tripped is a media error because data
> is lost and inaccessible until recovery actions are taken.
>
> MD RAID failing a scrub is a media error and data is lost and
> unrecoverable at that layer.
>
> A device disappearing is a media error because the storage media is
> now permanently inaccessible to the higher layers.
>
> This "media error" categorisation is a fundamental property of
> persistent storage and, as such, is a property of the block devices
> used to access said persistent storage.
>
> That's the disagreement here - that you and Christoph are saying
> ->corrupted_range is not a block device property because only a
> pmem/DAX device currently generates it.
>
> You both seem to be NACKing a generic interface because it's only
> implemented for the first subsystem that needs it. AFAICT, you
> either don't understand or are completely ignoring the architectural
> need for it to be provided across the rest of the storage stack that
> *block device based filesystems depend on*.

No I'm NAKing it because it's the wrong interface. See my 'struct
badblocks' argument in the reply to Darrick. That 'struct badblocks'
infrastructure arose from MD and is shared with PMEM.

>
> Sure, there might be dax device based filesystems around the corner.
> They just require a different pmem device ->corrupted_range callout
> to implement the notification - one that directs to the dax device
> rather than the block device. That's simple and trivial to
> implement, but such functionality for DAX devices does not replace
> the need for the same generic functionality to be provided across a
> *range of different block devices* as required by *block device
> based filesystems*.
>
> And that's fundamentally the problem. XFS is block device based, not
> DAX device based. We require errors to be reported through block
> device mechanisms. fs-dax does not change this - it is based on pmem
> being presented primarily as a block device to the block device
> based filesystems and only secondarily as a dax device. Hence if it
> can be trivially implemented as a block device interface, that's
> where it should go, because then all the other block devices that
> the filesystem runs on can provide the same functionality for similar
> media error events....

Sure, use 'struct badblocks' not struct block_device and
block_device_operations.
>
> > The dax_device removal implies one
> > set of actions (direct accessed pfns invalid) the block device removal
> > implies another (block layer sector access offline).
>
> There you go again, saying DAX requires an action, while the block
> device notification is a -state change- (i.e. goes offline).

There you go reacting to the least generous interpretation of what I said.

s/pfns invalid/pfns offline/

>
> This is exactly what I said was wrong in my last email.
>
> > corrupted_range
> > is blurring the notification for 2 different failure domains. Look at
> > the nascent idea to mount a filesystem on dax sans a block device.
> > Look at the existing plumbing for DM to map dax_operations through a
> > device stack.
>
> Ummm, it just maps the direct_access call to the underlying device
> and calls its ->direct_access method. All it's doing is LBA
> mapping. That's all it needs to do for ->corrupted_range, too.
> I have no clue why you think this is a problem for error
> notification...
>
> > Look at the pushback Ruan got for adding a new
> > block_device operation for corrupted_range().
>
> one person said "no". That's hardly pushback. Especially as I think
> Christoph's objection about this being dax specific functionality
> is simply wrong, as per above.

It's not wrong when we have a perfectly suitable object for sector
based error notification and when we're trying to disentangle 'struct
block_device' from 'struct dax_device'.

2021-03-04 05:25:53

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Sun, Feb 28, 2021 at 11:27 PM Yasunori Goto <[email protected]> wrote:
>
> Hello, Dan-san,
>
> On 2021/02/27 4:24, Dan Williams wrote:
> > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong <[email protected]> wrote:
> >>
> >> On Fri, Feb 26, 2021 at 09:45:45AM +0000, [email protected] wrote:
> >>> Hi, guys
> >>>
> >>> Beside this patchset, I'd like to confirm something about the
> >>> "EXPERIMENTAL" tag for dax in XFS.
> >>>
> >>> In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message
> >>> when we mount a pmem device with the dax option, has existed for a
> >>> while. It's a bit annoying when using fsdax feature. So, my initial
> >>> intention was to remove this tag. And I started to find out and solve
> >>> the problems which prevent it from being removed.
> >>>
> >>> As discussed before, there are 3 main problems. The first one is "dax
> >>> semantics", which has been resolved. The remaining two are "RMAP for
> >>> fsdax" and "support dax reflink for filesystem", which I have been
> >>> working on.
> >>
> >> <nod>
> >>
> >>> So, what I want to confirm is: does it mean that we can remove the
> >>> "EXPERIMENTAL" tag when the remaining two problems are solved?
> >>
> >> Yes. I'd keep the experimental tag for a cycle or two to make sure that
> >> nothing new pops up, but otherwise the two patchsets you've sent close
> >> those two big remaining gaps. Thank you for working on this!
> >>
> >>> Or maybe there are other important problems that need to be fixed before
> >>> removing it? If there are, could you please show me that?
> >>
> >> That remains to be seen through QA/validation, but I think that's it.
> >>
> >> Granted, I still have to read through the two patchsets...
> >
> > I've been meaning to circle back here as well.
> >
> > My immediate concern is the issue Jason recently highlighted [1] with
> > respect to invalidating all dax mappings when / if the device is
> > ripped out from underneath the fs. I don't think that will collide
> > with Ruan's implementation, but it does need new communication from
> > driver to fs about removal events.
> >
> > [1]: http://lore.kernel.org/r/CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@mail.gmail.com
> >
>
> I'm not sure why there is a race condition between unbinding operation
> and accessing mmaped file on filesystem dax yet.
>
> Maybe a silly question, but could you tell me why the "unbinding"
> operation of the namespace which is mounted by filesystem dax must be
> allowed?

The unbind operation is used to switch the mode of a namespace between
fsdax and devdax. There is no way to fail unbind. At most it can be
delayed for a short while to perform cleanup, but it can't be blocked
indefinitely. There is the option to specify 'suppress_bind_attrs' to
the driver to preclude software triggered device removal, but that
would disable mode changes for the device.

> If "unbinding" is rejected when the filesystem is mounted with dax
> enabled, what is the inconvenience?

It would be interesting (read difficult) to introduce the concept of
dynamic 'suppress_bind_attrs'. Today the decision is static at driver
registration time, not in response to how the device is being used.

I think global invalidation of all inodes that might be affected by a
dax-capable device being ripped away from the filesystem is sufficient
and avoids per-fs enabling, but I'm willing to be convinced that
->corrupted_range() is the proper vehicle for this.

>
> I can imagine if a device like usb memory stick is removed surprisingly,
> kernel/filesystem need to reject writeback at the time, and discard page
> cache. Then, I can understand that unbinding operation is essential for
> such case.

For usb the system is protected by the fact that all future block-i/o
submissions to the old block-device will fail, and a new usb-device
being plugged in will get a new block-device. I.e. the old security
model is invalidated / all holes are closed by blk_cleanup_queue().

> But I don't know why PMEM device/namespace allows unbinding operation
> like surprising removal event.

DAX hands direct mappings to physical pages, if the security model
fronting those physical pages changes the mappings attained via the
old security model need to be invalidated. blk_cleanup_queue() does
not invalidate DAX mappings.

The practical value of fixing that hole is small given that physical
unplug is not implemented for NVDIMMs today, but the get_user_pages()
path can be optimized if this invalidation is implemented, and future
hotplug-capable NVDIMM buses like CXL will need this.

2021-03-04 05:28:34

by Dave Chinner

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner <[email protected]> wrote:
> >
> > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner <[email protected]> wrote:
> > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <[email protected]> wrote:
> > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > it points to, check if it points to the PMEM that is being removed,
> > > > grab the page it points to, map that to the relevant struct page,
> > > > run collect_procs() on that page, then kill the user processes that
> > > > map that page.
> > > >
> > > > So why can't we walk the ptes, check the physical pages that they
> > > > map to and if they map to a pmem page we go poison that
> > > > page and that kills any user process that maps it.
> > > >
> > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > >
> > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > large array of pages.
> >
> > Not really. You're assuming all a filesystem has to do is invalidate
> > everything if a device goes away, and that's not true. Finding if an
> > inode has a mapping that spans a specific device in a multi-device
> > filesystem can be a lot more complex than that. Just walking inodes
> > is easy - determining which inodes need invalidation is the hard
> > part.
>
> That inode-to-device level of specificity is not needed for the same
> reason that drop_caches does not need to be specific. If the wrong
> page is unmapped a re-fault will bring it back, and re-fault will fail
> for the pages that are successfully removed.
>
> > That's where ->corrupt_range() comes in - the filesystem is already
> > set up to do reverse mapping from physical range to inode(s)
> > offsets...
>
> Sure, but what is the need to get to that level of specificity with
> the filesystem for something that should rarely happen in the course
> of normal operation outside of a mistake?

Dan, you made this mistake with the hwpoisoning code, and we're
trying to fix that here. Hard coding a 1:1 physical address to
inode/offset into the DAX mapping was a bad mistake. It's also one
that should never have occurred because it's *obviously wrong* to
filesystem developers and has been for a long time.

Now we have the filesystem people providing a mechanism for the pmem
devices to tell the filesystems about physical device failures so
they can handle such failures correctly themselves. Having the
device go away unexpectedly from underneath a mounted and active
filesystem is a *device failure*, not an "unplug event".

The mistake you made was not understanding how filesystems work,
nor actually asking filesystem developers what they actually needed.
You're doing the same thing here - you're telling us what you think
the solution filesystems need is. Please listen when we say "that is
not sufficient" because we don't want to be backed into a corner
that we have to fix ourselves again before we can enable some basic
filesystem functionality that we should have been able to support on
DAX from the start...

> > > There's likely always more pages than inodes, but perhaps it's more
> > > efficient to walk the 'struct page' array than sb->s_inodes?
> >
> > I really don't see you seem to be telling us that invalidation is an
> > either/or choice. There's more ways to convert physical block
> > address -> inode file offset and mapping index than brute force
> > inode cache walks....
>
> Yes, but I was trying to map it to an existing mechanism and the
> internals of drop_pagecache_sb() are, in coarse terms, close to what
> needs to happen here.

No.

drop_pagecache_sb() is not a relevant model for telling a filesystem
that the block device underneath has gone away, nor for a device to
ensure that access protections that *are managed by the filesystem*
are enforced/revoked sanely.

drop_pagecache_sb() is a brute-force model for invalidating user
data mappings that the filesystem performs in response to such a
notification. It only needs this brute-force approach if it has no
other way to find active DAX mappings across the range of the device
that has gone away.

But this model doesn't work for direct mapped metadata, journals or
any other internal direct filesystem mappings that aren't referenced
by inodes that the filesystem might be using. The filesystem still
needs to invalidate all those mappings and prevent further access to
them, even from within the kernel itself.

Filesystems are way more complex than pure DAX devices, and hence
handle errors and failure events differently. Unlike DAX devices, we
have both internal and external references to the DAX device, and we
can have both external and internal direct maps. Invalidating user
data mappings is all dax devices need to do on unplug, but for
filesystems it is only a small part of what we have to do when a
range of a device goes bad.

IOWs, there is no "one size fits all" approach that works for all
filesystems, nor is there one strategy that is optimal for all
filesystems. Failure handling in a filesystem is almost always
filesystem specific...

> > > Ok, I think I set this discussion in the wrong direction implying any
> > > mapping of this action to a "filesystem dead" event. It's just a "zap
> > > all ptes" event and upper layers recover from there.
> >
> > Yes, that's exactly what ->corrupt_range() is intended for. It
> > allows the filesystem to lock out access to the bad range
> > and then recover the data. Or metadata, if that's where the bad
> > range lands. If that recovery fails, it can then report a data
> > loss/filesystem shutdown event to userspace and kill user procs that
> > span the bad range...
> >
> > FWIW, is this notification going to occur before or after the device
> > has been physically unplugged?
>
> Before. This will be operations that happen in the pmem driver
> ->remove() callback.
>
> > i.e. what do we do about the
> > time-of-unplug-to-time-of-invalidation window where userspace can
> > still attempt to access the missing pmem though the
> > not-yet-invalidated ptes? It may not be likely that people just yank
> > pmem nvdimms out of machines, but with NVMe persistent memory
> > spaces, there's every chance that someone pulls the wrong device...
>
> The physical removal aspect is only theoretical today.

For actual pmem, maybe. But hotplug RAM is a thing; big numa
machines that can hotplug nodes into their fabric and so have CPUs
and memory able to come and go from a live machine. It's not a small
stretch to extend that to having PMEM in those nodes. So it's a
practical design concern right now, even ignoring that NVMe is
hotplug....

> While the pmem
> driver has a ->remove() path that's purely a software unbind
> operation. That said the vulnerability window today is if a process
> acquires a dax mapping, the pmem device hosting that filesystem goes
> through an unbind / bind cycle, and then a new filesystem is created /
> mounted. That old pte may be able to access data that is outside its
> intended protection domain.

So what is being done to prevent stale DAX mappings from being
leaked this way right now, seeing as the leak you mention here
doesn't appear in any way to be filesystem related?

> Going forward, for buses like CXL, there will be a managed physical
> remove operation via PCIE native hotplug. The flow there is that the
> PCIE hotplug driver will notify the OS of a pending removal, trigger
> ->remove() on the pmem driver, and then notify the technician (slot
> status LED) that the card is safe to pull.

That doesn't protect against pulling the wrong device, or having
someone pull the device without first running an admin command that
makes systems using DAX safe to pull the device....

And once you take into account that "pulling the wrong device" can
happen, how does the filesystem tell the difference between a
device being pulled and a drive cage just dying so that the drives
just disappear from the system? How are these accidental vs real
failures any different from the perspective of a filesystem mounted
on that device?

And then there is the elephant in the room: if there's a "human in
the loop" step needed to hot unplug a pmem device safely, then
why the hell is the filesystem on that device still mounted and the
DAX applications still running?

This just makes no sense at all from an operations perspective - if
you know that you are about to do an unplug that will result in all
your DAX apps and filesystems being killed (i.e. fatal production
environment failure) then why haven't they all been stopped by the
admin before the device unplug is done? Why does this "human in the
loop" admin task require the applications and filesystems to handle
this without warning and have to treat it as a "device failure"
event when this can all be avoided for normal, scheduled, controlled
unplug operations? The "unexpected unplug" is a catastrophic failure
event which may have severe side effects on system operation and
stability. Why would you design an unplug process that does not
start with a clean, controlled shutdown process from the top down?
If we make the assumption that planned unplugs are well planned,
organised and scheduled, then the only thing that an unplug event
needs to mean to a filesystem is "catastrophic device failure has
occurred".

So from a system level, the way you are describing the way hot
unplug events are supposed to occur and work looks completely
screwed up to me. Exactly what use case do you have for pmem device
hot-unplug from under a live filesystem that isn't considered a
*catastrophic runtime device failure* by the filesystem?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-03-04 05:48:05

by Dave Chinner

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner <[email protected]> wrote:
> [..]
> > We do not need a DAX specific mechanism to tell us "DAX device
> > gone", we need a generic block device interface that tells us "range
> > of block device is gone".
>
> This is the crux of the disagreement. The block_device is going away
> *and* the dax_device is going away.

No, that is not the disagreement I have with what you are saying.
You still haven't understood that it's even more basic and generic
than devices going away. At the simplest form, all the filesystem
wants is to be notified when *unrecoverable media errors*
occur in the persistent storage that underlies the filesystem.

The filesystem does not care what that media is built from - PMEM,
flash, corroded spinning disks, MRAM, or any other persistent media
you can think of. It just doesn't matter.

What we care about is that the contents of a *specific LBA range* no
longer contain *valid data*. IOWs, the data in that range of the
block device has been lost, cannot be retrieved and/or cannot be
written to any more.

PMEM taking a MCE because ECC tripped is a media error because data
is lost and inaccessible until recovery actions are taken.

MD RAID failing a scrub is a media error and data is lost and
unrecoverable at that layer.

A device disappearing is a media error because the storage media is
now permanently inaccessible to the higher layers.

This "media error" categorisation is a fundamental property of
persistent storage and, as such, is a property of the block devices
used to access said persistent storage.

That's the disagreement here - that you and Christoph are saying
->corrupted_range is not a block device property because only a
pmem/DAX device currently generates it.

You both seem to be NACKing a generic interface because it's only
implemented for the first subsystem that needs it. AFAICT, you
either don't understand or are completely ignoring the architectural
need for it to be provided across the rest of the storage stack that
*block device based filesystems depend on*.

Sure, there might be dax device based filesystems around the corner.
They just require a different pmem device ->corrupted_range callout
to implement the notification - one that directs to the dax device
rather than the block device. That's simple and trivial to
implement, but such functionality for DAX devices does not replace
the need for the same generic functionality to be provided across a
*range of different block devices* as required by *block device
based filesystems*.

And that's fundamentally the problem. XFS is block device based, not
DAX device based. We require errors to be reported through block
device mechanisms. fs-dax does not change this - it is based on pmem
being presented primarily as a block device to the block device
based filesystems and only secondarily as a dax device. Hence if it
can be trivially implemented as a block device interface, that's
where it should go, because then all the other block devices that
the filesystem runs on can provide the same functionality for similar
media error events....

> The dax_device removal implies one
> set of actions (direct accessed pfns invalid) the block device removal
> implies another (block layer sector access offline).

There you go again, saying DAX requires an action, while the block
device notification is a -state change- (i.e. goes offline).

This is exactly what I said was wrong in my last email.

> corrupted_range
> is blurring the notification for 2 different failure domains. Look at
> the nascent idea to mount a filesystem on dax sans a block device.
> Look at the existing plumbing for DM to map dax_operations through a
> device stack.

Ummm, it just maps the direct_access call to the underlying device
and calls its ->direct_access method. All it's doing is LBA
mapping. That's all it needs to do for ->corrupted_range, too.
I have no clue why you think this is a problem for error
notification...
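
A rough userspace sketch of that LBA translation for error ranges,
assuming a single linear target. The struct and function names are
hypothetical and are not dm's API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical linear target: upper sectors
 * [upper_start, upper_start + len) map 1:1 onto lower sectors
 * [lower_start, lower_start + len).
 */
struct linear_map {
	uint64_t upper_start;
	uint64_t lower_start;
	uint64_t len;
};

struct sector_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Translate a bad range reported by the lower device into the upper
 * device's sector space, clipping it to the mapped window.
 * Returns false if the bad range does not touch this target at all.
 */
bool remap_bad_range_up(const struct linear_map *map,
			struct sector_range lower_bad,
			struct sector_range *upper_bad)
{
	uint64_t map_end, bad_end, start, end;

	assert(map && upper_bad);

	map_end = map->lower_start + map->len;
	bad_end = lower_bad.start + lower_bad.len;
	start = lower_bad.start > map->lower_start ?
			lower_bad.start : map->lower_start;
	end = bad_end < map_end ? bad_end : map_end;

	if (start >= end)
		return false;

	upper_bad->start = map->upper_start + (start - map->lower_start);
	upper_bad->len = end - start;
	return true;
}
```

Each layer in a stack would apply its own translation and hand the
result to the layer above; only the offset arithmetic differs from the
forward direct_access mapping.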

> Look at the pushback Ruan got for adding a new
> block_device operation for corrupted_range().

One person said "no". That's hardly pushback. Especially as I think
Christoph's objection about this being dax specific functionality
is simply wrong, as per above.

> > This is why we need to communicate what error occurred, not what
> > action a device driver thinks needs to be taken.
>
> The driver is only an event producer in this model, whatever the
> consumer does at the other end is not its concern. There may be a
> generic consumer and a filesystem specific consumer.

<sigh>

That's why these are all ops functions that can provide multiple
implementations to different device types. So that when we get a new
use case, the ops function structure can be replaced with one that
directs the notification to the new user instead of to the existing
one. It's a design pattern we use all over the kernel code.
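
As a minimal userspace sketch of that pattern: the callback name
corrupted_range follows this thread, but the signature, the structs
and both consumers below are assumptions made up for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct media_dev;

/* Hypothetical ops table: one callback slot, many possible consumers. */
struct media_ops {
	int (*corrupted_range)(struct media_dev *dev,
			       uint64_t off, uint64_t len);
};

struct media_dev {
	const struct media_ops *ops;
	void *holder;		/* private data of whoever owns the device */
};

/* Stand-in for filesystem state that records what it was told. */
struct fs_state {
	uint64_t last_off;
	uint64_t last_len;
	int events;
	bool via_dax;
};

/* Consumer 1: a block device based filesystem. */
int blkdev_fs_corrupted_range(struct media_dev *dev,
			      uint64_t off, uint64_t len)
{
	struct fs_state *fs = dev->holder;

	fs->last_off = off;
	fs->last_len = len;
	fs->events++;
	fs->via_dax = false;
	return 0;
}

/* Consumer 2: a (future) dax device based filesystem. */
int daxdev_fs_corrupted_range(struct media_dev *dev,
			      uint64_t off, uint64_t len)
{
	struct fs_state *fs = dev->holder;

	fs->last_off = off;
	fs->last_len = len;
	fs->events++;
	fs->via_dax = true;
	return 0;
}

const struct media_ops blkdev_fs_ops = {
	.corrupted_range = blkdev_fs_corrupted_range,
};

const struct media_ops daxdev_fs_ops = {
	.corrupted_range = daxdev_fs_corrupted_range,
};

/* Producer side: report what happened, not what the consumer should do. */
int media_notify_corruption(struct media_dev *dev,
			    uint64_t off, uint64_t len)
{
	assert(dev != NULL);

	if (!dev->ops || !dev->ops->corrupted_range)
		return -EOPNOTSUPP;
	return dev->ops->corrupted_range(dev, off, len);
}
```

Swapping dev->ops redirects the same notification to a different
consumer without the producer changing at all.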

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-03-04 05:49:14

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong <[email protected]> wrote:
>
> On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner <[email protected]> wrote:
> > >
> > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner <[email protected]> wrote:
> > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <[email protected]> wrote:
> > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > > it points to, check if it points to the PMEM that is being removed,
> > > > > grab the page it points to, map that to the relevant struct page,
> > > > > run collect_procs() on that page, then kill the user processes that
> > > > > map that page.
> > > > >
> > > > > So why can't we walk the ptescheck the physical pages that they
> > > > > map to and if they map to a pmem page we go poison that
> > > > > page and that kills any user process that maps it.
> > > > >
> > > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > > >
> > > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > > large array of pages.
> > >
> > > Not really. You're assuming all a filesystem has to do is invalidate
> > > everything if a device goes away, and that's not true. Finding if an
> > > inode has a mapping that spans a specific device in a multi-device
> > > filesystem can be a lot more complex than that. Just walking inodes
> > > is easy - determining whihc inodes need invalidation is the hard
> > > part.
> >
> > That inode-to-device level of specificity is not needed for the same
> > reason that drop_caches does not need to be specific. If the wrong
> > page is unmapped a re-fault will bring it back, and re-fault will fail
> > for the pages that are successfully removed.
> >
> > > That's where ->corrupt_range() comes in - the filesystem is already
> > > set up to do reverse mapping from physical range to inode(s)
> > > offsets...
> >
> > Sure, but what is the need to get to that level of specificity with
> > the filesystem for something that should rarely happen in the course
> > of normal operation outside of a mistake?
>
> I can't tell if we're conflating the "a bunch of your pmem went bad"
> case with the "all your dimms fell out of the machine" case.

From the pmem driver perspective it has the media scanning to find
some small handful of cachelines that have gone bad, and it has the
driver ->remove() callback to tell it a bunch of pmem is now offline.
The NVDIMM device "range has gone bad" mechanism has no way to
communicate multiple terabytes have gone bad at once.

In fact I think the distinction is important that ->remove() is not
treated as ->corrupted_range() because I expect the level of freakout
is much worse for a "your storage is offline" notification vs "your
storage is corrupted" notification.

> If, say, a single cacheline's worth of pmem goes bad on a node with 2TB
> of pmem, I certainly want that level of specificity. Just notify the
> users of the dead piece, don't flush the whole machine down the drain.

Right, something like corrupted_range() is there to say, "keep going
upper layers, but note that this handful of sectors now has
indeterminate data and will return -EIO on access until repaired". The
repair for device-offline is device-online.

>
> > > > There's likely always more pages than inodes, but perhaps it's more
> > > > efficient to walk the 'struct page' array than sb->s_inodes?
> > >
> > > I really don't see you seem to be telling us that invalidation is an
> > > either/or choice. There's more ways to convert physical block
> > > address -> inode file offset and mapping index than brute force
> > > inode cache walks....
> >
> > Yes, but I was trying to map it to an existing mechanism and the
> > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > needs to happen here.
>
> Yes. XFS (with rmap enabled) can do all the iteration and walking in
> that function except for the invalidate_mapping_* call itself. The goal
> of this series is first to wire up a callback within both the block and
> pmem subsystems so that they can take notifications and reverse-map them
> through the storage stack until they reach an fs superblock.

I'm chuckling because this "reverse map all the way up the block
layer" is the opposite of what Dave said at the first reaction to my
proposal, "can't the mm map pfns to fs inode address_spaces?".

I think dax unmap is distinct from corrupted_range() precisely because
they are events happening in two different domains, block device
sectors vs dax device pfns.

Let's step back. I think a chain of ->corrupted_range() callbacks up
the block stack terminating in the filesystem with dax implications
tacked on is the wrong abstraction. Why not use the existing generic
object for communicating bad sector ranges, 'struct badblocks'?

Today whenever the pmem driver receives new corrupted range
notification from the lower level nvdimm
infrastructure(nd_pmem_notify) it updates the 'badblocks' instance
associated with the pmem gendisk and then notifies userspace that
there are new badblocks. This seems a perfect place to signal an upper
level stacked block device that may also be watching disk->bb. Then
each gendisk in a stacked topology is responsible for watching the
badblock notifications of the next level and storing a remapped
instance of those blocks until ultimately the filesystem mounted on
the top-level block device is responsible for registering for those
top-level disk->bb events.

The device gone notification does not map cleanly onto 'struct badblocks'.

If an upper level agent really cared about knowing about ->remove()
events before they happened it could maybe do something like:

dev = disk_to_dev(bdev->bd_disk)->parent;
bus_register_notifier(dev->bus, &disk_host_device_notifier_block);

...where it's trying to watch for events that will trigger the driver
->remove() callback on the device hosting a disk.

I still don't think that solves the need for a separate mechanism for
global dax_device pte invalidation.

I think that global dax_device invalidation needs new kernel
infrastructure to allow internal users, like dm-writecache and future
filesystems using dax for metadata, to take a fault when pmem is
offlined. They can't use the direct-map because the direct-map can't
fault, and they can't indefinitely pin metadata pages because that
blocks ->remove() from being guaranteed of forward progress.

Then an invalidation event is indeed a walk of address_space like
objects where some are fs-inodes and some are kernel-mode dax-users,
and that remains independent from remove events and badblocks
notifications because they are independent objects and events.

In contrast I think calling something like soft_offline_page() a pfn
at a time over terabytes will take forever especially when that event
need not fire if the dax_device is not mounted.

> Once the information has reached XFS, it can use its own reverse
> mappings to figure out which pages of which inodes are now targetted.

It has its own sector based reverse mappings, it does not have pfn reverse map.
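
For reference, a sector-indexed reverse lookup of the kind being
discussed could be sketched in userspace as below. This is only an
illustration: the record layout and names are hypothetical, a sorted
array stands in for the real rmap btree, and all units are sectors:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reverse-map record, sorted by start sector. */
struct rmap_rec {
	uint64_t start;		/* first device sector of the extent */
	uint64_t len;		/* extent length in sectors */
	uint64_t owner;		/* owning inode number */
	uint64_t offset;	/* file offset of the extent, in sectors */
};

struct rmap_hit {
	uint64_t owner;
	uint64_t offset;
	uint64_t len;
};

/*
 * Find every (inode, file offset, length) touched by the bad device
 * range [bad_start, bad_start + bad_len). Returns the number of hits
 * written to hits[], at most max_hits.
 */
size_t rmap_query(const struct rmap_rec *recs, size_t nrecs,
		  uint64_t bad_start, uint64_t bad_len,
		  struct rmap_hit *hits, size_t max_hits)
{
	uint64_t bad_end = bad_start + bad_len;
	size_t n = 0;

	assert(recs || !nrecs);

	for (size_t i = 0; i < nrecs && n < max_hits; i++) {
		uint64_t rec_end = recs[i].start + recs[i].len;
		uint64_t start, end;

		if (recs[i].start >= bad_end)
			break;		/* sorted: nothing further overlaps */
		if (rec_end <= bad_start)
			continue;

		start = recs[i].start > bad_start ? recs[i].start : bad_start;
		end = rec_end < bad_end ? rec_end : bad_end;

		hits[n].owner = recs[i].owner;
		hits[n].offset = recs[i].offset + (start - recs[i].start);
		hits[n].len = end - start;
		n++;
	}
	return n;
}
```

The input here is a sector range, so a pfn-based event would first
have to be converted into sectors before a lookup like this applies.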

> The future of DAX hw error handling can be that you throw the spitwad at
> us, and it's our problem to distill that into mm invalidation calls.
> XFS' reverse mapping data is indexed by storage location and isn't
> sharded by address_space, so (except for the DIMMs falling out), we
> don't need to walk the entire inode list or scan the entire mapping.

->remove() is effectively all the DIMMs falling out for all XFS knows.

> Between XFS and DAX and mm, the mm already has the invalidation calls,
> xfs already has the distiller, and so all we need is that first bit.
> The current mm code doesn't fully solve the problem, nor does it need
> to, since it handles DRAM errors acceptably* already.
>
> * Actually, the hwpoison code should _also_ be calling ->corrupted_range
> when DRAM goes bad so that we can detect metadata failures and either
> reload the buffer or (if it was dirty) shut down.
[..]
> > Going forward, for buses like CXL, there will be a managed physical
> > remove operation via PCIE native hotplug. The flow there is that the
> > PCIE hotplug driver will notify the OS of a pending removal, trigger
> > ->remove() on the pmem driver, and then notify the technician (slot
> > status LED) that the card is safe to pull.
>
> Well, that's a relief. Can we cancel longterm RDMA leases now too?
> <duck>

Yes, all problems can be solved with more blinky lights.

2021-03-04 06:00:17

by Dave Chinner

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong <[email protected]> wrote:
> > > > I really don't see you seem to be telling us that invalidation is an
> > > > either/or choice. There's more ways to convert physical block
> > > > address -> inode file offset and mapping index than brute force
> > > > inode cache walks....
> > >
> > > Yes, but I was trying to map it to an existing mechanism and the
> > > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > > needs to happen here.
> >
> > Yes. XFS (with rmap enabled) can do all the iteration and walking in
> > that function except for the invalidate_mapping_* call itself. The goal
> > of this series is first to wire up a callback within both the block and
> > pmem subsystems so that they can take notifications and reverse-map them
> > through the storage stack until they reach an fs superblock.
>
> I'm chuckling because this "reverse map all the way up the block
> layer" is the opposite of what Dave said at the first reaction to my
> proposal, "can't the mm map pfns to fs inode address_spaces?".

Ah, no, I never said that the filesystem can't do reverse maps. I
was asking if the mm could directly (brute-force) invalidate PTEs
pointing at physical pmem ranges without needing to walk the inode
mappings. That would be far more efficient if it could be done....

> Today whenever the pmem driver receives new corrupted range
> notification from the lower level nvdimm
> infrastructure(nd_pmem_notify) it updates the 'badblocks' instance
> associated with the pmem gendisk and then notifies userspace that
> there are new badblocks. This seems a perfect place to signal an upper
> level stacked block device that may also be watching disk->bb. Then
> each gendisk in a stacked topology is responsible for watching the
> badblock notifications of the next level and storing a remapped
> instance of those blocks until ultimately the filesystem mounted on
> the top-level block device is responsible for registering for those
> top-level disk->bb events.
>
> The device gone notification does not map cleanly onto 'struct badblocks'.

Filesystems are not allowed to interact with the gendisk
infrastructure - that's for supporting the device side of a block
device. It's a layering violation, and many a filesystem developer
has been shouted at for trying to do this. At most we can peek
through it to query functionality support from the request queue,
but otherwise filesystems do not interact with anything under
bdev->bd_disk.

As it is, badblocks are used by devices to manage internal state.
e.g. md for recording stripes that need recovery if the system
crashes while they are being written out.

> If an upper level agent really cared about knowing about ->remove()
> events before they happened it could maybe do something like:
>
> dev = disk_to_dev(bdev->bd_disk)->parent;
> bus_register_notifier(dev->bus. &disk_host_device_notifier_block)

Yeah, that's exactly the sort of thing that filesystems have been
aggressively discouraged from doing for years.

Part of the reason for this is that gendisk based mechanisms are not
very good for stacked device error reporting. Part of the problem
here is that every layer of the stacked device has to hook the
notifier of the block devices underneath it, then translate the
event to match the upper block device map, then regenerate the
notification for the next layer up. This isn't an efficient way to
pass a notification through a series of stacked devices and it is
messy and cumbersome to maintain.

It can be effective for getting notifications to userspace about
something that happens to a specific block device. But userspace
still ends up having to solve the "what does this error resolve to"
problem. i.e. Userspace still needs to map that notification to a
filesystem, and for data loss events map it to objects within the
filesystem, which can be extremely expensive to do from userspace.

This is exactly the sort of userspace error reporting mess that
various projects have asked us to try to fix. Plumbing errors
internally through the kernel up to the filesystem where the
filesystem can point directly to the user data that is affected is a
simple, effective solution to the problem. Especially if we then
have a generic error notification mechanism for filesystems to emit
errors to registered userspace watchers...

> I still don't think that solves the need for a separate mechanism for
> global dax_device pte invalidation.

It's just another type of media error because.....

> I think that global dax_device invalidation needs new kernel
> infrastructure to allow internal users, like dm-writecache and future
> filesystems using dax for metadata, to take a fault when pmem is
> offlined.

.... if userspace has directly mapped into the cache, and the cache
storage goes away, the userspace app has to be killed because we
have no idea if the device going away has caused data loss or not.
IOWs, if userspace writes direct to the cache device and it hasn't
been written back to other storage when it gets yanked, we have just
caused data corruption to occur.

At minimum, we now have to tell the filesystem that the dirty data
in the cache is now bad, and direct map applications that map those
dirty ranges need to be killed because their backing store is no
longer valid nor does the backup copy contain the data they last
wrote. Nor is it accessible by direct access, which is going to be
interesting because dynamically changing dax to non-dax access can't
be done without forcibly kicking the inode out of the cache. That
requires all references to the inode to go away. And that means the
event really has to go up to the filesystem.

But I think the biggest piece of the puzzle that you haven't grokked
here is that the dm cache device isn't a linear map - it's made up of
random ranges from the underlying devices. Hence the "remove" of a dm
cache device turns into a huge number of small, sparse corrupt
ranges, not a single linear device remove event.

IOWs, device unplug/remove events are not just simple "pass it on"
events in a stacked storage setup. There can be non-trivial mappings
through the layers, and device disappearance may in fact manifest to
the user as data corruption rather than causing data to be
inaccessible.

Hence "remove" notifications just don't work in the storage stack.
They need to be translated to block ranges going bad (i.e. media
errors), and reported to higher layers as bad ranges, not as device
removal.
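
To make the "sparse ranges" point concrete, here is a small userspace
sketch. The chunk table, its fields and the function name are
hypothetical and much simpler than any real dm cache metadata; it
assumes the table is sorted by upper_chunk and that only dirty chunks
on the lost device count as lost data:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache mapping entry for one chunk of the upper device. */
struct cache_chunk {
	uint64_t upper_chunk;	/* chunk index in the upper device */
	int cache_dev;		/* which cache device holds the chunk */
	bool dirty;		/* not yet written back to the origin */
};

struct sector_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Turn "cache device lost_dev went away" into the list of upper-device
 * sector ranges whose data is now bad. Adjacent chunks are merged.
 * Returns the number of ranges written to out[], at most max_out.
 */
size_t cache_dev_lost(const struct cache_chunk *tbl, size_t n,
		      int lost_dev, uint64_t chunk_sectors,
		      struct sector_range *out, size_t max_out)
{
	size_t nout = 0;

	assert(chunk_sectors > 0);

	for (size_t i = 0; i < n; i++) {
		uint64_t start;

		if (tbl[i].cache_dev != lost_dev || !tbl[i].dirty)
			continue;

		start = tbl[i].upper_chunk * chunk_sectors;
		if (nout && out[nout - 1].start + out[nout - 1].len == start) {
			out[nout - 1].len += chunk_sectors;
			continue;
		}
		if (nout == max_out)
			break;
		out[nout].start = start;
		out[nout].len = chunk_sectors;
		nout++;
	}
	return nout;
}
```

One device removal at the bottom becomes several disjoint bad ranges
at the top, which is the shape a range-based notification can carry
and a single "device removed" event cannot.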

The same goes for DAX devices. The moment they can be placed in
storage stacks in non-trivial configurations and/or used as cache
devices that can be directly accessed over traditional block
devices, we end up with error conditions that can only be mapped as
ranges of blocks that have gone bad.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-03-04 09:05:47

Subject: Re: [PATCH v2 08/10] fsdax: Dedup file range to use a compare function

On Fri, 2021-02-26 at 08:20 +0800, Shiyang Ruan wrote:
> With dax we cannot deal with readpage() etc. So, we create a dax
> comparison funciton which is similar with
> vfs_dedupe_file_range_compare().
> And introduce dax_remap_file_range_prep() for filesystem use.
[]
> diff --git a/fs/dax.c b/fs/dax.c
[]
> @@ -1856,3 +1856,54 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>   return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap)
> +{
> + void *saddr, *daddr;
> + bool *same = data;
> + int ret;
> +
> + while (len) {
> + if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE)
> + goto next;
> +
> + if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
> + *same = false;
> + break;
> + }
> +
> + ret = dax_iomap_direct_access(smap, pos1,
> + ALIGN(pos1 + len, PAGE_SIZE), &saddr, NULL);
> + if (ret < 0)
> + return -EIO;
> +
> + ret = dax_iomap_direct_access(dmap, pos2,
> + ALIGN(pos2 + len, PAGE_SIZE), &daddr, NULL);
> + if (ret < 0)
> + return -EIO;
> +
> + *same = !memcmp(saddr, daddr, len);
> + if (!*same)
> + break;
> +next:
> + len -= len;
> + }
> +
> + return 0;
> +}

This code looks needlessly complex.

len is only ever changed by "len -= len" at the bottom of the loop,
so the body runs at most once and the while loop itself looks
unnecessary. Is there some missing partial decrement of len or some
other reason to use a while loop?

Is dax_iomap_direct_access some ugly macro that modifies a hidden len?

Why not remove the while loop and use straightforward code without
unnecessary indentation?

{
	void *saddr;
	void *daddr;
	bool *same = data;
	int ret;

	if (!len ||
	    (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE))
		return 0;

	if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
		*same = false;
		return 0;
	}

	ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
				      &saddr, NULL);
	if (ret < 0)
		return -EIO;

	ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
				      &daddr, NULL);
	if (ret < 0)
		return -EIO;

	*same = !memcmp(saddr, daddr, len);

	return 0;
}
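
The control flow above can be exercised outside the kernel with a
small model. Everything here is a stand-in: struct extent replaces
struct iomap, a NULL addr models a failed direct access, and the
caller is assumed to initialise *same to true before the call:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the iomap type field. */
enum extent_type { EXTENT_HOLE, EXTENT_MAPPED };

struct extent {
	enum extent_type type;
	const void *addr;	/* directly accessible data, NULL on failure */
};

/*
 * Userspace model of the loop-free compare: two holes compare equal,
 * a hole against data does not, otherwise compare the bytes.
 */
int range_compare(const struct extent *src, const struct extent *dst,
		  size_t len, bool *same)
{
	assert(same != NULL);

	if (!len ||
	    (src->type == EXTENT_HOLE && dst->type == EXTENT_HOLE))
		return 0;	/* leaves *same as the caller set it */

	if (src->type == EXTENT_HOLE || dst->type == EXTENT_HOLE) {
		*same = false;
		return 0;
	}

	if (!src->addr || !dst->addr)
		return -EIO;

	*same = !memcmp(src->addr, dst->addr, len);
	return 0;
}
```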

I didn't look at the rest.

2021-03-04 09:07:23

by Shiyang Ruan

Subject: Re: [PATCH v2 05/10] fsdax: Replace mmap entry in case of CoW

> >
> > if (dirty)
> > __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> I still think the __mark_inode_dirty should just be moved into the one
> caller that needs it.

I found that the dirty flag is used again in the next few lines, so I
kept this call inside the function. If I moved it outside, the dirty
flag would have to be passed along as well.

@@ -774,6 +780,9 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dirty)
 		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);

+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
 	xas_unlock_irq(xas);
 	return entry;
 }

So, may I ask what your purpose is for doing it that way?

--
Thanks,
Ruan Shiyang.

2021-03-04 09:49:12

by Christoph Hellwig

Subject: Re: [PATCH v2 04/10] fsdax: Introduce dax_iomap_cow_copy()

On Fri, Feb 26, 2021 at 08:20:24AM +0800, Shiyang Ruan wrote:
> + if (!copy_edge) {
> + ret = copy_mc_to_kernel(daddr, saddr, length);
> + return ret;

No need for the ret variable here, this can be:

if (!copy_edge)
return copy_mc_to_kernel(daddr, saddr, length);

Otherwise looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2021-03-04 09:53:05

by Christoph Hellwig

Subject: Re: [PATCH v2 02/10] fsdax: Factor helper: dax_fault_actor()

On Fri, Feb 26, 2021 at 08:20:22AM +0800, Shiyang Ruan wrote:
> The core logic in the two dax page fault functions is similar. So, move
> the logic into a common helper function. Also, to facilitate the
> addition of new features, such as CoW, switch-case is no longer used to
> handle different iomap types.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> fs/dax.c | 211 ++++++++++++++++++++++++++++++-------------------------
> 1 file changed, 117 insertions(+), 94 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 7031e4302b13..9dea1572868e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1289,6 +1289,93 @@ static int dax_fault_cow_page(struct vm_fault *vmf, struct iomap *iomap,
> return 0;
> }
>
> +static vm_fault_t dax_fault_insert_pfn(struct vm_fault *vmf, pfn_t pfn,
> + bool pmd, bool write)
> +{
> + vm_fault_t ret;
> +
> + if (!pmd) {
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long address = vmf->address;
> +
> + if (write)
> + ret = vmf_insert_mixed_mkwrite(vma, address, pfn);
> + else
> + ret = vmf_insert_mixed(vma, address, pfn);
> + } else
> + ret = vmf_insert_pfn_pmd(vmf, pfn, write);

What about simplifying this a little bit more, something like:

if (pmd)
return vmf_insert_pfn_pmd(vmf, pfn, write);

if (write)
return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
return vmf_insert_mixed(vmf->vma, vmf->address, pfn);

also given that this only has a single user, why not keep open coding
it in the caller?

> +#ifdef CONFIG_FS_DAX_PMD
> +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> + struct iomap *iomap, void **entry);
> +#else
> +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> + struct iomap *iomap, void **entry)
> +{
> + return VM_FAULT_FALLBACK;
> +}
> +#endif

Can we try to avoid the forward declaration? Also is there a reason
dax_pmd_load_hole does not compile for the !CONFIG_FS_DAX_PMD case?
If it compiles fine we can just rely on IS_ENABLED() based dead code
elimination entirely.
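
A minimal userspace sketch of the dead-code-elimination pattern suggested here. DEMO_PMD_ENABLED is a hypothetical stand-in for IS_ENABLED(CONFIG_FS_DAX_PMD), and the demo_* functions and enum values are illustrative only, not the real fs/dax.c helpers:

```c
#include <stdbool.h>

/* Hypothetical stand-in for IS_ENABLED(CONFIG_FS_DAX_PMD) in this demo. */
#define DEMO_PMD_ENABLED 0

enum demo_result { DEMO_PTE_HOLE, DEMO_PMD_HOLE, DEMO_FALLBACK };

/* Always compiled, so it is type-checked in both configurations. */
static enum demo_result demo_pmd_load_hole(void)
{
	return DEMO_PMD_HOLE;
}

static enum demo_result demo_load_hole(bool pmd)
{
	if (!pmd)
		return DEMO_PTE_HOLE;
	if (!DEMO_PMD_ENABLED)
		return DEMO_FALLBACK;	/* constant condition: the call below is dead code */
	return demo_pmd_load_hole();
}
```

This only works if the PMD helper compiles without the option set, which is the open question in the paragraph above. In exchange, no #ifdef stub or forward declaration is needed.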

> + /* if we are reading UNWRITTEN and HOLE, return a hole. */
> + if (!write &&
> + (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
> + if (!pmd)
> + return dax_load_hole(xas, mapping, &entry, vmf);
> + else
> + return dax_pmd_load_hole(xas, vmf, iomap, &entry);
> + }
> +
> + if (iomap->type != IOMAP_MAPPED) {
> + WARN_ON_ONCE(1);
> + return VM_FAULT_SIGBUS;
> + }

Nit: I'd use a switch statement here for clarity:

switch (iomap->type) {
case IOMAP_MAPPED:
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (!write) {
if (!pmd)
return dax_load_hole(xas, mapping, &entry, vmf);
return dax_pmd_load_hole(xas, vmf, iomap, &entry);
}
break;
default:
WARN_ON_ONCE(1);
return VM_FAULT_SIGBUS;
}

> + err = dax_iomap_pfn(iomap, pos, size, &pfn);
> + if (err)
> + goto error_fault;
> +
> + entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
> + write && !sync);
> +
> + if (sync)
> + return dax_fault_synchronous_pfnp(pfnp, pfn);
> +
> + ret = dax_fault_insert_pfn(vmf, pfn, pmd, write);
> +
> +error_fault:
> + if (err)
> + ret = dax_fault_return(err);
> +
> + return ret;

It seems like the only place that sets err is the dax_iomap_pfn case
above. So I'd move the dax_fault_return there, which then allows a direct
return for everyone else, including the open coded version of
dax_fault_insert_pfn.
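
A rough userspace sketch of the control flow described above, assuming the pfn lookup is the only step that can produce an errno. All demo_* names and enum values are hypothetical stand-ins, not the real fs/dax.c functions:

```c
#include <stdbool.h>

enum demo_fault {
	DEMO_FAULT_SIGBUS,
	DEMO_FAULT_NEEDDSYNC,
	DEMO_FAULT_PMD,
	DEMO_FAULT_PTE_WRITE,
	DEMO_FAULT_PTE_READ,
};

/* Stand-in for dax_fault_return(): map an errno to a fault code. */
static enum demo_fault demo_fault_return(int err)
{
	(void)err;
	return DEMO_FAULT_SIGBUS;
}

static enum demo_fault demo_fault_actor(int pfn_err, bool sync, bool pmd,
					bool write)
{
	if (pfn_err)
		return demo_fault_return(pfn_err);	/* only place an errno appears */
	if (sync)
		return DEMO_FAULT_NEEDDSYNC;

	/* open-coded insert, each branch returning directly */
	if (pmd)
		return DEMO_FAULT_PMD;
	if (write)
		return DEMO_FAULT_PTE_WRITE;
	return DEMO_FAULT_PTE_READ;
}
```

With the error conversion next to its only source, the error_fault label and the shared ret variable are no longer needed.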

I really like where this is going!

2021-03-04 14:26:28

by Shiyang Ruan

Subject: Re: [PATCH v2 09/10] fs/xfs: Handle CoW for fsdax write() path

> On Wed, Mar 03, 2021 at 09:57:48AM +0000, [email protected] wrote:
> > > What is the advantage of the iomap_end handler here? It adds another
> > > indirect function call to the fast path, so if we can avoid it, I'd
> > > rather do that.
> >
> > This code was in xfs_file_dax_write(). I moved it into iomap_end
> > because the mmapped CoW path needs it.
> >
> > I know this is not so good, but I could not find a better way. Do you
> > have any ideas?
> mmaped copy is the copy_edge case? Maybe just use different iomap_ops for
> that case vs plain write?

No, I mean that mmapped CoW needs a call to xfs_reflink_end_cow() to make sure
the new extent is correctly remapped into the file. Otherwise, the file will
still refer to the extent that srcmap points to.

We can call it from xfs_file_dax_write(), but in the mmap path there is nowhere
to call it except iomap_end.

--
Thanks,
Ruan Shiyang.

2021-03-04 15:03:30

by Shiyang Ruan

Subject: [RESEND PATCH v2.1 07/10] iomap: Introduce iomap_apply2() for operations on two files

Some operations, such as comparing a range of data in two files under
fsdax mode, require nested iomap_begin()/iomap_end() calls on two files.
Thus, we introduce iomap_apply2() to accept arguments from two files and
iomap_actor2_t for actions on two files.

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/iomap/apply.c | 56 +++++++++++++++++++++++++++++++++++++++++++
include/linux/iomap.h | 7 +++++-
2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 26ab6563181f..fbc38ce3d5b6 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -97,3 +97,59 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,

return written ? written : ret;
}
+
+loff_t
+iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
+ loff_t length, unsigned int flags, const struct iomap_ops *ops,
+ void *data, iomap_actor2_t actor)
+{
+ struct iomap smap = { .type = IOMAP_HOLE };
+ struct iomap dmap = { .type = IOMAP_HOLE };
+ loff_t written = 0, ret, ret2 = 0;
+ loff_t len1 = length, len2, min_len;
+
+ ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
+ if (ret)
+ goto out_src;
+ if (WARN_ON(smap.offset > pos1)) {
+ written = -EIO;
+ goto out_src;
+ }
+ if (WARN_ON(smap.length == 0)) {
+ written = -EIO;
+ goto out_src;
+ }
+ len2 = min_t(loff_t, len1, smap.length);
+
+ ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
+ if (ret)
+ goto out_dest;
+ if (WARN_ON(dmap.offset > pos2)) {
+ written = -EIO;
+ goto out_dest;
+ }
+ if (WARN_ON(dmap.length == 0)) {
+ written = -EIO;
+ goto out_dest;
+ }
+ min_len = min_t(loff_t, len2, dmap.length);
+
+ written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);
+
+out_dest:
+ if (ops->iomap_end)
+ ret2 = ops->iomap_end(ino2, pos2, len2,
+ written > 0 ? written : 0, flags, &dmap);
+out_src:
+ if (ops->iomap_end)
+ ret = ops->iomap_end(ino1, pos1, len1,
+ written > 0 ? written : 0, flags, &smap);
+
+ if (ret)
+ return written ? written : ret;
+
+ if (ret2)
+ return written ? written : ret2;
+
+ return written;
+}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5bd3cac4df9c..913f98897a77 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -148,10 +148,15 @@ struct iomap_ops {
*/
typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
void *data, struct iomap *iomap, struct iomap *srcmap);
-
+typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1,
+ struct inode *ino2, loff_t pos2, loff_t len, void *data,
+ struct iomap *smap, struct iomap *dmap);
loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
unsigned flags, const struct iomap_ops *ops, void *data,
iomap_actor_t actor);
+loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2,
+ loff_t pos2, loff_t length, unsigned int flags,
+ const struct iomap_ops *ops, void *data, iomap_actor2_t actor);

ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
const struct iomap_ops *ops);
--
2.30.1
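
As a reading aid for the patch above, here is a small userspace sketch of how the length handed to the actor is narrowed by the two min_t() steps. struct demo_extent and demo_actor_len() are hypothetical names and only the length arithmetic is modelled; the iomap_begin()/iomap_end() calls and the WARN_ON() checks are left out:

```c
/* Hypothetical, simplified stand-in for the fields of struct iomap used here. */
struct demo_extent {
	long long offset;
	long long length;
};

static long long demo_min(long long a, long long b)
{
	return a < b ? a : b;
}

/* Mirrors: len2 = min(len1, smap.length); min_len = min(len2, dmap.length). */
static long long demo_actor_len(long long length,
				const struct demo_extent *smap,
				const struct demo_extent *dmap)
{
	long long len2 = demo_min(length, smap->length);

	return demo_min(len2, dmap->length);
}
```

The actor therefore never sees more bytes than the shorter of the two mappings reports, and a caller that wants to cover a longer range has to loop.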

2021-03-04 20:27:27

by Dan Williams

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Mon, Mar 1, 2021 at 11:57 PM Dave Chinner <[email protected]> wrote:
>
> On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> > On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong <[email protected]> wrote:
> > > > > I really don't see why you seem to be telling us that invalidation is an
> > > > > either/or choice. There's more ways to convert physical block
> > > > > address -> inode file offset and mapping index than brute force
> > > > > inode cache walks....
> > > >
> > > > Yes, but I was trying to map it to an existing mechanism and the
> > > > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > > > needs to happen here.
> > >
> > > Yes. XFS (with rmap enabled) can do all the iteration and walking in
> > > that function except for the invalidate_mapping_* call itself. The goal
> > > of this series is first to wire up a callback within both the block and
> > > pmem subsystems so that they can take notifications and reverse-map them
> > > through the storage stack until they reach an fs superblock.
> >
> > I'm chuckling because this "reverse map all the way up the block
> > layer" is the opposite of what Dave said at the first reaction to my
> > proposal, "can't the mm map pfns to fs inode address_spaces?".
>
> Ah, no, I never said that the filesystem can't do reverse maps. I
> was asking if the mm could directly (brute-force) invalidate PTEs
> pointing at physical pmem ranges without needing walk the inode
> mappings. That would be far more efficient if it could be done....
>
> > Today whenever the pmem driver receives new corrupted range
> > notification from the lower level nvdimm
> > infrastructure(nd_pmem_notify) it updates the 'badblocks' instance
> > associated with the pmem gendisk and then notifies userspace that
> > there are new badblocks. This seems a perfect place to signal an upper
> > level stacked block device that may also be watching disk->bb. Then
> > each gendisk in a stacked topology is responsible for watching the
> > badblock notifications of the next level and storing a remapped
> > instance of those blocks until ultimately the filesystem mounted on
> > the top-level block device is responsible for registering for those
> > top-level disk->bb events.
> >
> > The device gone notification does not map cleanly onto 'struct badblocks'.
>
> Filesystems are not allowed to interact with the gendisk
> infrastructure - that's for supporting the device side of a block
> device. It's a layering violation, and many a filesystem developer
> has been shouted at for trying to do this. At most we can peek
> through it to query functionality support from the request queue,
> but otherwise filesystems do not interact with anything under
> bdev->bd_disk.

So let's add an API that allows the querying of badblocks by bdev and
let the block core handle the bd_disk interaction. I see other block
functionality like blk-integrity reaching through gendisk. The fs need
not interact with the gendisk directly.

>
> As it is, badblocks are used by devices to manage internal state.
> e.g. md for recording stripes that need recovery if the system
> crashes while they are being written out.

I know, I was there when it was invented which is why it was
top-of-mind when pmem had a need to communicate badblocks. Other block
drivers have threatened to use it for badblocks tracking, but none of
those have carried through on that initial interest.

>
> > If an upper level agent really cared about knowing about ->remove()
> > events before they happened it could maybe do something like:
> >
> > dev = disk_to_dev(bdev->bd_disk)->parent;
> > bus_register_notifier(dev->bus, &disk_host_device_notifier_block)
>
> Yeah, that's exactly the sort of thing that filesystems have been
> aggressively discouraged from doing for years.

Yup, it's a layering violation.

> Part of the reason for this is that gendisk based mechanisms are not
> very good for stacked device error reporting. Part of the problem
> here is that every layer of the stacked device has to hook the
> notifier of the block devices underneath it, then translate the
> event to match the upper block device map, then regenerate the
> notification for the next layer up. This isn't an efficient way to
> pass a notification through a series of stacked devices and it is
> messy and cumbersome to maintain.

It's been messy and cumbersome to route new infrastructure through DM
every time a new dax_operation arrives. The corrupted_range() routing
has the same burden. The advantage of badblocks over corrupted_range()
is that it solves the "what if I miss a notification" problem. Each
layer of the stack maintains its sector translation of the next level
errors.

> It can be effective for getting notifications to userspace about
> something that happens to a specific block device.

No, it's not block device specific, it's stuck at the disk level. The
user notification aspect was added for pmem at the disk layer because
IIRC it was NAKd to add it to the block_device itself.

>
> But userspace
> still ends up having to solve the "what does this error resolve to"
> problem. i.e. Userspace still needs to map that notification to a
> filesystem, and for data loss events map it to objects within the
> filesystem, which can be extremely expensive to do from userspace.

Expensive and vulnerable to TOCTOU, this has been the motivation for
filesystem native awareness of these errors from the beginning.

> This is exactly the sort of userspace error reporting mess that
> various projects have asked us to try to fix. Plumbing errors
> internally through the kernel up to the filesystem where the
> filesystem can point directly to the user data that is affected is a
> simple, effective solution to the problem. Especially if we then
> have a generic error notification mechanism for filesystems to emit
> errors to registered userspace watchers...

Agree, that's the dream worth pursuing.

>
> > I still don't think that solves the need for a separate mechanism for
> > global dax_device pte invalidation.
>
> It's just another type of media error because.....
>
> > I think that global dax_device invalidation needs new kernel
> > infrastructure to allow internal users, like dm-writecache and future
> > filesystems using dax for metadata, to take a fault when pmem is
> > offlined.
>
> .... if userspace has directly mapped into the cache, and the cache
> storage goes away, the userspace app has to be killed because we
> have no idea if the device going away has caused data loss or not.
> IOWs, if userspace writes direct to the cache device and it hasn't
> been written back to other storage when it gets yanked, we have just
> caused data corruption to occur.

If userspace has it direct mapped dirty in the cache when the remove
fires, there is no opportunity to flush the cache. Just as there is no
opportunity today with non-DAX and the page cache. The block-queue
will be invalidated and any dirty in page cache is stranded.

> At minimum, we now have to tell the filesystem that the dirty data
> in the cache is now bad, and direct map applications that map those
> dirty ranges need to be killed because their backing store is no
> longer valid nor does the backup copy contain the data they last
> wrote. Nor is it accessible by direct access, which is going to be
> interesting because dynamically changing dax to non-dax access can't
> be done without forcibly kicking the inode out of the cache. That
> requires all references to the inode to go away. And that means the
> event really has to go up to the filesystem.
>
> But I think the biggest piece of the puzzle that you haven't grokked
> here is that the dm cache device isn't a linear map - it's made up of
> random ranges from the underlying devices. Hence the "remove" of a dm
> cache device turns into a huge number of small, sparse corrupt
> ranges, not a single linear device remove event.

I am aware that DM is non-linear. The other non-linearity is sector-to-pfn.

> IOWs, device unplug/remove events are not just simple "pass it on"
> events in a stacked storage setup. There can be non-trivial mappings
> through the layers, and device disappearance may in fact manifest to
> the user as data corruption rather than causing data to be
> inaccessible.

Even MD does not rely on component device notifications for failure
notifications, it waits for write-errors, and yes losing a component
of a raid0 is more than a data offline event.

> Hence "remove" notifications just don't work in the storage stack.
> They need to be translated to block ranges going bad (i.e. media
> errors), and reported to higher layers as bad ranges, not as device
> removal.

Yes, the generic top-level remove event is pretty much useless for
both the dax pte invalidation and lba range offline notification. I'm
distinguishing that from knock on events that fire in response to
->remove() triggering on the disk driver which seems to be where you
are at as well with the idea to trigger ->corrupted_range(0, EOD) from
->remove().

There's 2 ways to view the "filesystems have wanted proactive
notification of remove events from storage for a long time". There's
either enough pent up demand to convince all parties to come to the
table and get something done, or there's too much momentum with the
status quo to overcome.

I do not think it is fair to ask Ruan to solve a problem with brand
new plumbing that the Linux storage community has not seen fit to
address for a decade. Not when disk->bb is already plumbed without
anyone complaining about it.

> The same goes for DAX devices. The moment they can be placed in
> storage stacks in non-trivial configurations and/or used as cache
> devices that can be directly accessed over traditional block
> devices, we end up with error conditions that can only be mapped as
> ranges of blocks that have gone bad.

I see plumbing corrupted_range() and using it to communicate removal
in addition to badblocks in addition to bad pfns as a revolutionary
change. A reuse of disk->bb for communicating poison sector discovery
events up the stack and a separate facility to invalidate dax devices
as evolutionary. The evolutionary change does not preclude the
eventual revolutionary change, but it has a better chance of making
forward progress in the near term.

2021-03-04 21:45:24

by Christoph Hellwig

Subject: Re: [PATCH v2 05/10] fsdax: Replace mmap entry in case of CoW

On Wed, Mar 03, 2021 at 09:41:54AM +0000, [email protected] wrote:
>
> > >
> > > if (dirty)
> > > __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> >
> > I still think the __mark_inode_dirty should just be moved into the one
> > caller that needs it.
>
> I found that the dirty flag is also used a few lines later, so I kept
> the call inside this function. If I moved it outside, the dirty flag would
> still have to be passed in as well.
>
> @@ -774,6 +780,9 @@ static void *dax_insert_entry(struct xa_state *xas,
> if (dirty)
> xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>
> + if (cow)
> + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
> xas_unlock_irq(xas);
> return entry;
> }
>
>
> So, may I ask what your purpose is in doing it that way?

Oh, true. We can't just move that out as the xas needs to stay
locked.

2021-03-04 21:45:24

by Christoph Hellwig

Subject: Re: [PATCH v2 09/10] fs/xfs: Handle CoW for fsdax write() path

On Fri, Feb 26, 2021 at 08:20:29AM +0800, Shiyang Ruan wrote:
> error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> - &xfs_buffered_write_iomap_ops);
> + IS_DAX(VFS_I(ip)) ?
> + &xfs_dax_write_iomap_ops : &xfs_buffered_write_iomap_ops);

Please add a xfs_zero_range helper that picks the right iomap_ops
instead of open coding this in a few places.

> +static int
> +xfs_dax_write_iomap_end(
> + struct inode *inode,
> + loff_t pos,
> + loff_t length,
> + ssize_t written,
> + unsigned int flags,
> + struct iomap *iomap)
> +{
> + int error = 0;
> + xfs_inode_t *ip = XFS_I(inode);
> +
> + if (pos + written > i_size_read(inode)) {
> + i_size_write(inode, pos + written);
> + error = xfs_setfilesize(ip, pos, written);
> + }
> + if (xfs_is_cow_inode(ip))
> + error = xfs_reflink_end_cow(ip, pos, written);
> +
> + return error;

What is the advantage of the iomap_end handler here? It adds another
indirect function call to the fast path, so if we can avoid it, I'd
rather do that.

Also, shouldn't we cancel the COW rather than finishing it when setting
the file size fails?
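
To make the ordering question concrete: in the quoted hunk, the result of xfs_setfilesize() is overwritten by xfs_reflink_end_cow() whenever the inode is a CoW inode. A minimal userspace sketch of the alternative ordering asked about here, with hypothetical demo_* names in place of the XFS helpers:

```c
#include <stdbool.h>

enum demo_cow_action { DEMO_COW_NONE, DEMO_COW_END, DEMO_COW_CANCEL };

struct demo_result {
	int error;			/* errno-style result returned to the caller */
	enum demo_cow_action action;	/* what was done with the staged CoW extents */
};

/*
 * setsize_err: result of the (stubbed) file size update.
 * end_cow_err: result the (stubbed) CoW completion would return.
 */
static struct demo_result demo_write_end(int setsize_err, bool is_cow,
					 int end_cow_err)
{
	struct demo_result res = { .error = setsize_err, .action = DEMO_COW_NONE };

	if (!is_cow)
		return res;
	if (res.error) {
		res.action = DEMO_COW_CANCEL;	/* drop staged extents, keep the first error */
		return res;
	}
	res.action = DEMO_COW_END;
	res.error = end_cow_err;
	return res;
}
```

The point of the sketch is only that the size-update error is preserved and decides between finishing and cancelling; how the cancel would be done in XFS is not shown.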

2021-03-04 21:50:23

by Christoph Hellwig

Subject: Re: [PATCH v2 05/10] fsdax: Replace mmap entry in case of CoW

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2021-03-04 22:05:35

by Christoph Hellwig

Subject: Re: [PATCH v2 09/10] fs/xfs: Handle CoW for fsdax write() path

On Wed, Mar 03, 2021 at 09:57:48AM +0000, [email protected] wrote:
> > What is the advantage of the iomap_end handler here? It adds another
> > indirect function call to the fast path, so if we can avoid it, I'd
> > rather do that.
>
> This code was in xfs_file_dax_write(). I moved it into iomap_end
> because the mmapped CoW path needs it.
>
> I know this is not so good, but I could not find a better way. Do you
> have any ideas?

mmaped copy is the copy_edge case? Maybe just use different iomap_ops for
that case vs plain write?

2021-03-04 23:12:28

by Christoph Hellwig

Subject: Re: [PATCH v2 05/10] fsdax: Replace mmap entry in case of CoW

On Fri, Feb 26, 2021 at 08:20:25AM +0800, Shiyang Ruan wrote:
> We replace the existing entry with the newly allocated one in case of CoW.
> Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
> entry as write-protected. This helps with snapshots, so that new write
> page faults after a snapshot trigger a CoW.
>
> Signed-off-by: Goldwyn Rodrigues <[email protected]>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> fs/dax.c | 37 ++++++++++++++++++++++++++-----------
> 1 file changed, 26 insertions(+), 11 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 748dfb89fb41..ec4b733e0b59 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -722,6 +722,9 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
> return 0;
> }
>
> +#define DAX_IF_DIRTY (1 << 0)
> +#define DAX_IF_COW (1 << 1)
> +
> /*
> * By this point grab_mapping_entry() has ensured that we have a locked entry
> * of the appropriate size so we don't have to worry about downgrading PMDs to
> @@ -729,16 +732,19 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
> * already in the tree, we will skip the insertion and just dirty the PMD as
> * appropriate.
> */
> -static void *dax_insert_entry(struct xa_state *xas,
> - struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> +static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> + void *entry, pfn_t pfn, unsigned long flags,
> + unsigned int insert_flags)
> {
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;
>
> if (dirty)
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

I still think the __mark_inode_dirty should just be moved into the one
caller that needs it.

2021-03-04 23:12:28

by Christoph Hellwig

Subject: Re: [PATCH v2 06/10] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero

On Fri, Feb 26, 2021 at 08:20:26AM +0800, Shiyang Ruan wrote:
> Punching a hole in a reflinked file needs dax_copy_edge() too. Otherwise,
> data in the unaligned area will not be correct. So, add the srcmap to
> dax_iomap_zero() and replace memset() with dax_copy_edge().
>
> Signed-off-by: Shiyang Ruan <[email protected]>

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2021-03-04 23:13:48

by Shiyang Ruan

Subject: Re: [PATCH v2 09/10] fs/xfs: Handle CoW for fsdax write() path

>
> On Fri, Feb 26, 2021 at 08:20:29AM +0800, Shiyang Ruan wrote:
> > error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> > - &xfs_buffered_write_iomap_ops);
> > + IS_DAX(VFS_I(ip)) ?
> > + &xfs_dax_write_iomap_ops : &xfs_buffered_write_iomap_ops);
>
> Please add a xfs_zero_range helper that picks the right iomap_ops
> instead of open coding this in a few places.

OK. I'll add it.
>
> > +static int
> > +xfs_dax_write_iomap_end(
> > + struct inode *inode,
> > + loff_t pos,
> > + loff_t length,
> > + ssize_t written,
> > + unsigned int flags,
> > + struct iomap *iomap)
> > +{
> > + int error = 0;
> > + xfs_inode_t *ip = XFS_I(inode);
> > +
> > + if (pos + written > i_size_read(inode)) {
> > + i_size_write(inode, pos + written);
> > + error = xfs_setfilesize(ip, pos, written);
> > + }
> > + if (xfs_is_cow_inode(ip))
> > + error = xfs_reflink_end_cow(ip, pos, written);
> > +
> > + return error;
>
> What is the advantage of the iomap_end handler here? It adds another
> indirect function call to the fast path, so if we can avoid it, I'd
> rather do that.

This code was in xfs_file_dax_write(). I moved it into iomap_end
because the mmapped CoW path needs it.

I know this is not so good, but I could not find a better way. Do you
have any ideas?

>
> Also, shouldn't we cancel the COW rather than finishing it when setting
> the file size fails?
>

I did forget about this part. Thanks for pointing it out.

--
Thanks,
Ruan Shiyang.

2021-03-05 00:02:28

by Shiyang Ruan

Subject: [RESEND PATCH v2.1 08/10] fsdax: Dedup file range to use a compare function

With dax we cannot deal with readpage() etc. So, we create a dax
comparison function which is similar to
vfs_dedupe_file_range_compare(), and introduce
dax_remap_file_range_prep() for filesystem use.

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 56 ++++++++++++++++++++++++++++++++++++++++++++
fs/remap_range.c | 45 ++++++++++++++++++++++++++++-------
fs/xfs/xfs_reflink.c | 9 +++++--
include/linux/dax.h | 4 ++++
include/linux/fs.h | 15 ++++++++----
5 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4f6c6ba68e6f..dbb95f00b38b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1856,3 +1856,59 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
return dax_insert_pfn_mkwrite(vmf, pfn, order);
}
EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
+ struct inode *ino2, loff_t pos2, loff_t len, void *data,
+ struct iomap *smap, struct iomap *dmap)
+{
+ void *saddr, *daddr;
+ bool *same = data;
+ int ret;
+
+ if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
+ *same = true;
+ return len;
+ }
+
+ if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
+ *same = false;
+ return 0;
+ }
+
+ ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
+ &saddr, NULL);
+ if (ret < 0)
+ return -EIO;
+
+ ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
+ &daddr, NULL);
+ if (ret < 0)
+ return -EIO;
+
+ *same = !memcmp(saddr, daddr, len);
+ return len;
+}
+
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
+ const struct iomap_ops *ops)
+{
+ int id, ret = 0;
+
+ id = dax_read_lock();
+ while (len) {
+ ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
+ is_same, dax_range_compare_actor);
+ if (ret < 0 || !*is_same)
+ goto out;
+
+ len -= ret;
+ srcoff += ret;
+ destoff += ret;
+ }
+ ret = 0;
+out:
+ dax_read_unlock(id);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index 77dba3a49e65..9079390edaf3 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -14,6 +14,7 @@
#include <linux/compat.h>
#include <linux/mount.h>
#include <linux/fs.h>
+#include <linux/dax.h>
#include "internal.h"

#include <linux/uaccess.h>
@@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
* Compare extents of two files to see if they are the same.
* Caller must have locked both inodes to prevent write races.
*/
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
- struct inode *dest, loff_t destoff,
- loff_t len, bool *is_same)
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same)
{
loff_t src_poff;
loff_t dest_poff;
@@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
out_error:
return error;
}
+EXPORT_SYMBOL(vfs_dedupe_file_range_compare);

/*
* Check that the two inodes are eligible for cloning, the ranges make
@@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
* If there's an error, then the usual negative error code is returned.
* Otherwise returns 0 with *len set to the request length.
*/
-int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- loff_t *len, unsigned int remap_flags)
+static int
+__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops)
{
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
if (remap_flags & REMAP_FILE_DEDUP) {
bool is_same = false;

- ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
- inode_out, pos_out, *len, &is_same);
+ if (!IS_DAX(inode_in) && !IS_DAX(inode_out))
+ ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
+ inode_out, pos_out, *len, &is_same);
+ else if (IS_DAX(inode_in) && IS_DAX(inode_out) && ops)
+ ret = dax_dedupe_file_range_compare(inode_in, pos_in,
+ inode_out, pos_out, *len, &is_same,
+ ops);
+ else
+ return -EINVAL;
if (ret)
return ret;
if (!is_same)
@@ -370,6 +381,24 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,

return ret;
}
+
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops)
+{
+ return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags, ops);
+}
+EXPORT_SYMBOL(dax_remap_file_range_prep);
+
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags)
+{
+ return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags, NULL);
+}
EXPORT_SYMBOL(generic_remap_file_range_prep);

loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 6fa05fb78189..f5b3a3da36b7 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1308,8 +1308,13 @@ xfs_reflink_remap_prep(
if (IS_DAX(inode_in) || IS_DAX(inode_out))
goto out_unlock;

- ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags);
+ if (!IS_DAX(inode_in))
+ ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags);
+ else
+ ret = dax_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags,
+ &xfs_read_iomap_ops);
if (ret || *len == 0)
goto out_unlock;

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3275e01ed33d..32e1c34349f2 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -239,6 +239,10 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
struct iomap *srcmap);
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same,
+ const struct iomap_ops *ops);
static inline bool dax_mapping(struct address_space *mapping)
{
return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..2e6ec5bdf82a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -68,6 +68,7 @@ struct fsverity_info;
struct fsverity_operations;
struct fs_context;
struct fs_parameter_spec;
+struct iomap_ops;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -1910,13 +1911,19 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
+typedef int (*compare_range_t)(struct inode *src, loff_t srcpos,
+ struct inode *dest, loff_t destpos,
+ loff_t len, bool *is_same);
extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags);
-extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- loff_t *count,
- unsigned int remap_flags);
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *count, unsigned int remap_flags);
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops);
extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
--
2.30.1
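
The three-way branch that the fs/remap_range.c hunk adds to the dedupe path can be modelled in userspace as below. This is only a sketch under stated assumptions: `demo_inode`, `demo_compare()` and `demo_dedupe_compare()` are hypothetical names, a boolean `have_ops` stands in for the `const struct iomap_ops *ops` argument, and a plain `memcmp()` stands in for both vfs_dedupe_file_range_compare() and dax_dedupe_file_range_compare().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an inode: a DAX flag plus the file contents. */
struct demo_inode {
	bool is_dax;
	const unsigned char *data;
	size_t size;
};

/* Shared helper: byte-wise comparison of two in-bounds ranges. */
static int demo_compare(const struct demo_inode *a, size_t pos_a,
			const struct demo_inode *b, size_t pos_b,
			size_t len, bool *is_same)
{
	if (pos_a > a->size || len > a->size - pos_a ||
	    pos_b > b->size || len > b->size - pos_b)
		return -EINVAL;
	*is_same = memcmp(a->data + pos_a, b->data + pos_b, len) == 0;
	return 0;
}

/*
 * Mirrors the branch in the hunk: both inodes non-DAX, both inodes DAX
 * with ops supplied, or anything else (rejected with -EINVAL).
 */
static int demo_dedupe_compare(const struct demo_inode *in, size_t pos_in,
			       const struct demo_inode *out, size_t pos_out,
			       size_t len, bool have_ops, bool *is_same)
{
	if (!in->is_dax && !out->is_dax)
		/* models the page cache comparison */
		return demo_compare(in, pos_in, out, pos_out, len, is_same);
	if (in->is_dax && out->is_dax && have_ops)
		/* models the direct access comparison */
		return demo_compare(in, pos_in, out, pos_out, len, is_same);
	return -EINVAL;
}
```

In this model, a mixed DAX/non-DAX pair and a DAX pair without ops are both rejected, which is what the `else return -EINVAL;` arm in the hunk does.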

2021-03-05 01:03:00

by Darrick J. Wong

[permalink] [raw]

Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS

On Tue, Mar 02, 2021 at 09:49:30AM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 11:57 PM Dave Chinner wrote:
> >
> > On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> > > On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong wrote:
> > > > > > I really don't see why you seem to be telling us that invalidation is an
> > > > > > either/or choice. There's more ways to convert physical block
> > > > > > address -> inode file offset and mapping index than brute force
> > > > > > inode cache walks....
> > > > >
> > > > > Yes, but I was trying to map it to an existing mechanism and the
> > > > > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > > > > needs to happen here.
> > > >
> > > > Yes. XFS (with rmap enabled) can do all the iteration and walking in
> > > > that function except for the invalidate_mapping_* call itself. The goal
> > > > of this series is first to wire up a callback within both the block and
> > > > pmem subsystems so that they can take notifications and reverse-map them
> > > > through the storage stack until they reach an fs superblock.
> > >
> > > I'm chuckling because this "reverse map all the way up the block
> > > layer" is the opposite of what Dave said at the first reaction to my
> > > proposal, "can't the mm map pfns to fs inode address_spaces?".
> >
> > Ah, no, I never said that the filesystem can't do reverse maps. I
> > was asking if the mm could directly (brute-force) invalidate PTEs
> > pointing at physical pmem ranges without needing to walk the inode
> > mappings. That would be far more efficient if it could be done....

So, uh, /can/ the kernel brute-force invalidate PTEs when the pmem
driver says that something died? Part of what's keeping me from putting
together a coherent vision for how this would work is my relative
unfamiliarity with all things mm/.

> > > Today whenever the pmem driver receives new corrupted range
> > > notification from the lower level nvdimm
> > > infrastructure(nd_pmem_notify) it updates the 'badblocks' instance
> > > associated with the pmem gendisk and then notifies userspace that
> > > there are new badblocks. This seems a perfect place to signal an upper
> > > level stacked block device that may also be watching disk->bb. Then
> > > each gendisk in a stacked topology is responsible for watching the
> > > badblock notifications of the next level and storing a remapped
> > > instance of those blocks until ultimately the filesystem mounted on
> > > the top-level block device is responsible for registering for those
> > > top-level disk->bb events.
> > >
> > > The device gone notification does not map cleanly onto 'struct badblocks'.
> >
> > Filesystems are not allowed to interact with the gendisk
> > infrastructure - that's for supporting the device side of a block
> > device. It's a layering violation, and many a filesystem developer
> > has been shouted at for trying to do this. At most we can peek
> > through it to query functionality support from the request queue,
> > but otherwise filesystems do not interact with anything under
> > bdev->bd_disk.
>
> So let's add an API that allows the querying of badblocks by bdev and
> let the block core handle the bd_disk interaction. I see other block
> functionality like blk-integrity reaching through gendisk. The fs need
> not interact with the gendisk directly.

(I thought it was ok for block code to fiddle with other block
internals, and it's filesystems messing with block internals that was
prohibited?)

> > As it is, badblocks are used by devices to manage internal state.
> > e.g. md for recording stripes that need recovery if the system
> > crashes while they are being written out.
>
> I know, I was there when it was invented which is why it was
> top-of-mind when pmem had a need to communicate badblocks. Other block
> drivers have threatened to use it for badblocks tracking, but none of
> those have carried through on that initial interest.

I hadn't realized that badblocks was bolted onto gendisk nowadays, I
mistakenly thought it was still something internal to md.

Looking over badblocks, I see a major drawback in that it can only
remember a single page's worth of badblocks records.

> > > If an upper level agent really cared about knowing about ->remove()
> > > events before they happened it could maybe do something like:
> > >
> > > dev = disk_to_dev(bdev->bd_disk)->parent;
> > > bus_register_notifier(dev->bus, &disk_host_device_notifier_block)
> >
> > Yeah, that's exactly the sort of thing that filesystems have been
> > aggressively discouraged from doing for years.
>
> Yup, it's a layering violation.
>
> > Part of the reason for this is that gendisk based mechanisms are not
> > very good for stacked device error reporting. Part of the problem
> > here is that every layer of the stacked device has to hook the
> > notifier of the block devices underneath it, then translate the
> > event to match the upper block device map, then regenerate the
> > notification for the next layer up. This isn't an efficient way to
> > pass a notification through a series of stacked devices and it is
> > messy and cumbersome to maintain.
>
> It's been messy and cumbersome to route new infrastructure through DM
> every time a new dax_operation arrives. The corrupted_range() routing
> has the same burden. The advantage of badblocks over corrupted_range()
> is that it solves the "what if I miss a notification" problem. Each
> layer of the stack maintains its sector translation of the next level
> errors.

Oh. Hum. This changes my interpretation of what you're advocating.

If I'm understanding you correctly, I think you want to handle pmem
persistence errors (aka "I lost this cache line") by ... what? The pmem
driver marks the appropriate range in the block_device/dax_device's
badblocks list, invalidates the page tables to force fs page faults, and
the next time the fs tries to access that pmem (either via bios or by
creating a direct map) the lower level storage driver will see the
badblocks entry and fail the IO / decline the mapping?
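
The badblocks-based flow outlined in the paragraph above can be pictured with a small userspace model. This is a sketch only, not the kernel's badblocks implementation: `demo_badblocks`, `demo_mark_bad()` and `demo_check_io()` are hypothetical names, and the fixed `DEMO_MAX_BAD` capacity is an assumption that loosely echoes the limited record space noted earlier in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_BAD 4	/* hypothetical fixed capacity of the table */

struct demo_bad_range {
	unsigned long long start;	/* first bad sector */
	unsigned long long len;		/* number of bad sectors */
};

struct demo_badblocks {
	struct demo_bad_range r[DEMO_MAX_BAD];
	size_t count;
};

/* Lower layer records a newly discovered bad range. */
static int demo_mark_bad(struct demo_badblocks *bb,
			 unsigned long long start, unsigned long long len)
{
	if (bb->count == DEMO_MAX_BAD)
		return -ENOSPC;	/* table full: the record would be lost */
	bb->r[bb->count].start = start;
	bb->r[bb->count].len = len;
	bb->count++;
	return 0;
}

/* On the next access, fail the IO if it overlaps any recorded range. */
static int demo_check_io(const struct demo_badblocks *bb,
			 unsigned long long start, unsigned long long len)
{
	size_t i;

	for (i = 0; i < bb->count; i++) {
		const struct demo_bad_range *r = &bb->r[i];

		if (start < r->start + r->len && r->start < start + len)
			return -EIO;
	}
	return 0;
}
```

Note that in this model the filesystem only learns about a bad range when it next touches it; nothing is pushed upwards, which is the contrast with a ->memory_failure style callback.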

<shrug> I dunno, does that even make sense? I thought it was pretty
easy for the kernel to invalidate a mapping to force a page fault, since
we (xfs) do that to the regular page cache all the time.

Assuming I understood that part correctly, why is it objectionable to
ask for the one extra step where pmem steps through the dax_device to
call the filesystem ->memory_failure handler? There's no pmem-mapper
layer (yet) so making this piece happen should be relatively simple
since it doesn't require translating through multiple layers of dm,
right?

Also, does your mental model of storage device error reporting center
around lower layers setting badblocks ranges and then poking filesystems
to call down into badblocks to find out what's bad? Versus lower layers
calling filesystems with the bad ranges directly? Or are you trying to
omit as much fs involvement as possible?

(I'll address invalidating dax devices a little further down)

> > It can be effective for getting notifications to userspace about
> > something that happens to a specific block device.
>
> No, it's not block device specific, it's stuck at the disk level. The
> user notification aspect was added for pmem at the disk layer because
> IIRC it was NAKd to add it to the block_device itself.
>
> >
> > But the userspace
> > still ends up having to solve the "what does this error resolve to"
> > problem. i.e. Userspace still needs to map that notification to a
> > filesystem, and for data loss events map it to objects within the
> > filesystem, which can be extremely expensive to do from userspace.
>
> Expensive and vulnerable to TOCTOU, this has been the motivation for
> filesystem native awareness of these errors from the beginning.
>
> > This is exactly the sort of userspace error reporting mess that
> > various projects have asked us to try to fix. Plumbing errors
> > internally through the kernel up to the filesystem where the
> > filesystem can point directly to the user data that is affected is a
> > simple, effective solution to the problem. Especially if we then
> > have a generic error notification mechanism for filesystems to emit
> > errors to registered userspace watchers...
>
> Agree, that's the dream worth pursuing.

(Agree, the error reporting story is still a mess.)

> >
> > > I still don't think that solves the need for a separate mechanism for
> > > global dax_device pte invalidation.
> >
> > It's just another type of media error because.....
> >
> > > I think that global dax_device invalidation needs new kernel
> > > infrastructure to allow internal users, like dm-writecache and future
> > > filesystems using dax for metadata, to take a fault when pmem is
> > > offlined.
> >
> > .... if userspace has directly mapped into the cache, and the cache
> > storage goes away, the userspace app has to be killed because we
> > have no idea if the device going away has caused data loss or not.
> > IOWs, if userspace writes direct to the cache device and it hasn't
> > been written back to other storage when it gets yanked, we have just
> > caused data corruption to occur.
>
> If userspace has it direct mapped dirty in the cache when the remove
> fires, there is no opportunity to flush the cache. Just as there is no
> opportunity today with non-DAX and the page cache. The block-queue
> will be invalidated and any dirty in page cache is stranded.

So this is the "dax device invalidation" case that you also mention
below. How differently would you handle this case from the persistence
error case I outlined above? It sounds like in this case all the mm can
really do is invalidate the active page table mappings and set some
"totally offline" state in the dax/block_device badblocks so that all
future io requests are declined?

Do I understand that correctly?

If so, then I guess my next question is about the coordinated
pre-removal step that I think you mentioned in connection with something
named "CXL"? If someone /requests/ the removal of a chunk of pmem,
would you propagate that request far enough up the storage chain so that
a mounted filesystem could reject the removal attempt?

> > At minimum, we now have to tell the filesystem that the dirty data
> > in the cache is now bad, and direct map applications that map those
> > dirty ranges need to be killed because their backing store is no
> > longer valid nor does the backup copy contain the data they last
> > wrote. Nor is it accessible by direct access, which is going to be
> > interesting because dynamically changing dax to non-dax access can't
> > be done without forcibly kicking the inode out of the cache. That
> > requires all references to the inode to go away. And that means the
> > event really has to go up to the filesystem.
> >
> > But I think the biggest piece of the puzzle that you haven't grokked
> > here is that the dm cache device isn't a linear map - it's made up of
> > random ranges from the underlying devices. Hence the "remove" of a dm
> > cache device turns into a huge number of small, sparse corrupt
> > ranges, not a single linear device remove event.
>
> I am aware that DM is non-linear. The other non-linearity is sector-to-pfn.
>
> > IOWs, device unplug/remove events are not just simple "pass it on"
> > events in a stacked storage setup. There can be non-trivial mappings
> > through the layers, and device disappearance may in fact manifest to
> > the user as data corruption rather than causing data to be
> > inaccessible.
>
> Even MD does not rely on component device notifications for failure
> notifications, it waits for write-errors, and yes losing a component
> of a raid0 is more than a data offline event.
>
> > Hence "remove" notifications just don't work in the storage stack.
> > They need to be translated to block ranges going bad (i.e. media
> > errors), and reported to higher layers as bad ranges, not as device
> > removal.
>
> Yes, the generic top-level remove event is pretty much useless for
> both the dax pte invalidation and lba range offline notification. I'm
> distinguishing that from knock on events that fire in response to
> ->remove() triggering on the disk driver which seems to be where you
> are at as well with the idea to trigger ->corrupted_range(0, EOD) from
> ->remove().
>
> There's 2 ways to view the "filesystems have wanted proactive
> notification of remove events from storage for a long time". There's
> either enough pent up demand to convince all parties to come to the
> table and get something done, or there's too much momentum with the
> status quo to overcome.

Don't forget my cynical product manager view: "Here's a good opportunity
to get the basics of this revolutionary change plumbed in while upper
management is still hot enough about pmem to spend engineer time". :P

> I do not think it is fair to ask Ruan to solve a problem with brand
> new plumbing that the Linux storage community has not seen fit to
> address for a decade.

Nevertheless, he's more or less built it now. Honestly I'm pleased to
see him pushing this forward exactly /because/ nobody has seen fit to
address this for so long.

The part where we plumb notifications upwards through the storage stack
is indeed revolutionary. However, I /do/ think it's fair to ask Ruan to
make a revolutionary change as part of adapting to recent revolutionary
changes in storage hardware.

(At the very least I think it soul-crushing to toss out Ruan's work
now that he's at least gotten the proof of concept running... but Ruan
is in the best place to say that)

> Not when disk->bb is already plumbed without anyone complaining about
> it.

...or noticing it was there, as was the case here. :/

> > The same goes for DAX devices. The moment they can be placed in
> > storage stacks in non-trivial configurations and/or used as cache
> > devices that can be directly accessed over traditional block
> > devices, we end up with error conditions that can only be mapped as
> > ranges of blocks that have gone bad.
>
> I see plumbing corrupted_range() and using it to communicate removal
> in addition to badblocks in addition to bad pfns as a revolutionary
> change. A reuse of disk->bb for communicating poison sector discovery
> events up the stack and a separate facility to invalidate dax devices
> as evolutionary. The evolutionary change does not preclude the
> eventual revolutionary change, but it has a better chance of making
> forward progress in the near term.

And I want both. :)

But I'll end this email here to make sure I've understood what you're
going for, Dan, before working on a reply.

Hopefully it doesn't take 2 days to roundtrip a reply email like the
last week of utter vger frustration. :(

--D

2021-03-09 06:41:01

by Xiaoguang Wang

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

Hi,

First, thanks for your patchset.
I'd like to know whether it passes fstests. Thanks.

Regards,
Xiaoguang Wang

> This patchset is an attempt to add CoW support for fsdax, taking XFS,
> which has both the reflink and fsdax features, as an example.
>
> Changes from V1:
> - Factor some helper functions to simplify dax fault code
> - Introduce iomap_apply2() for dax_dedupe_file_range_compare()
> - Fix mistakes and other problems
> - Rebased on v5.11
>
> One of the key mechanisms that needs to be implemented in fsdax is CoW: copy
> the data from srcmap before we actually write data to the destination
> iomap, and copy only the range in which data won't be changed.
>
> Another mechanism is range comparison. In the page cache case, readpage()
> is used to load data from disk into the page cache so that it can be
> compared. In the fsdax case, readpage() does not work, so we need
> another way to compare data with direct access support.
>
> With the two mechanisms implemented in fsdax, we are able to make reflink
> and fsdax work together in XFS.
>
>
> Some of the patches are picked up from Goldwyn's patchset. I made some
> changes to adapt to this patchset.
>
> (Rebased on v5.11)
> ==
>
> Shiyang Ruan (10):
> fsdax: Factor helpers to simplify dax fault code
> fsdax: Factor helper: dax_fault_actor()
> fsdax: Output address in dax_iomap_pfn() and rename it
> fsdax: Introduce dax_iomap_cow_copy()
> fsdax: Replace mmap entry in case of CoW
> fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
> iomap: Introduce iomap_apply2() for operations on two files
> fsdax: Dedup file range to use a compare function
> fs/xfs: Handle CoW for fsdax write() path
> fs/xfs: Add dedupe support for fsdax
>
> fs/dax.c | 532 +++++++++++++++++++++++++++--------------
> fs/iomap/apply.c | 51 ++++
> fs/iomap/buffered-io.c | 2 +-
> fs/remap_range.c | 45 +++-
> fs/xfs/xfs_bmap_util.c | 3 +-
> fs/xfs/xfs_file.c | 29 ++-
> fs/xfs/xfs_inode.c | 8 +-
> fs/xfs/xfs_inode.h | 1 +
> fs/xfs/xfs_iomap.c | 30 ++-
> fs/xfs/xfs_iomap.h | 1 +
> fs/xfs/xfs_iops.c | 11 +-
> fs/xfs/xfs_reflink.c | 16 +-
> include/linux/dax.h | 7 +-
> include/linux/fs.h | 15 +-
> include/linux/iomap.h | 7 +-
> 15 files changed, 550 insertions(+), 208 deletions(-)
>

2021-03-09 16:20:27

by Goldwyn Rodrigues

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

Hi Shiyang,

Thanks for picking up this work.

On 8:20 26/02, Shiyang Ruan wrote:
> This patchset is an attempt to add CoW support for fsdax, taking XFS,
> which has both the reflink and fsdax features, as an example.

How does this work for the read sequence of two different files
mapped to the same extent, both residing in DAX?

If two different files read the same shared extent, which file
would the resulting page->mapping->host point to?

This problem is listed as a TODO over dax_associate_entry() and is
still not fixed.

<snip>

--
Goldwyn

2021-03-10 01:28:30

by Shiyang Ruan

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

>
> Hi Shiyang,
>
> Thanks for picking up this work.
>
> On 8:20 26/02, Shiyang Ruan wrote:
> > This patchset is an attempt to add CoW support for fsdax, taking XFS,
> > which has both the reflink and fsdax features, as an example.
>
> How does this work for the read sequence of two different files
> mapped to the same extent, both residing in DAX?
>
> If two different files read the same shared extent, which file
> would the resulting page->mapping->host point to?
>
> This problem is listed as a TODO over dax_associate_entry() and is
> still not fixed.

I have posted another patchset, which I called "fix dax-rmap"[1]. It is an
attempt to solve this problem, but it is still under discussion.

[1] https://lkml.org/lkml/2021/2/8/347

--
Thanks,
Ruan Shiyang.

>
> <snip>
>
> --
> Goldwyn

2021-03-10 12:32:56

by Neal Gompa

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On Thu, Feb 25, 2021 at 7:23 PM Shiyang Ruan wrote:
>
> This patchset is an attempt to add CoW support for fsdax, taking XFS,
> which has both the reflink and fsdax features, as an example.
>
> Changes from V1:
> - Factor some helper functions to simplify dax fault code
> - Introduce iomap_apply2() for dax_dedupe_file_range_compare()
> - Fix mistakes and other problems
> - Rebased on v5.11
>
> One of the key mechanisms that needs to be implemented in fsdax is CoW: copy
> the data from srcmap before we actually write data to the destination
> iomap, and copy only the range in which data won't be changed.
>
> Another mechanism is range comparison. In the page cache case, readpage()
> is used to load data from disk into the page cache so that it can be
> compared. In the fsdax case, readpage() does not work, so we need
> another way to compare data with direct access support.
>
> With the two mechanisms implemented in fsdax, we are able to make reflink
> and fsdax work together in XFS.
>
>
> Some of the patches are picked up from Goldwyn's patchset. I made some
> changes to adapt to this patchset.
>
> (Rebased on v5.11)

Forgive my ignorance, but is there a reason why this isn't wired up to
Btrfs at the same time? It seems weird to me that adding a feature
like DAX to work with CoW filesystems is not being wired into *the*
CoW filesystem in the Linux kernel that fully takes advantage of
copy-on-write. I'm aware that XFS supports reflinks and does some
datacow stuff, but I don't know if I would consider XFS integration
sufficient for integrating this feature now, especially if it's
possible that the design might not work with Btrfs (I hadn't seen any
feedback from Btrfs developers, though given how much email there is
here, it's entirely possible that I missed it).

--
真実はいつも一つ！/ Always, there's only one truth!

2021-03-10 13:05:17

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> Forgive my ignorance, but is there a reason why this isn't wired up to
> Btrfs at the same time? It seems weird to me that adding a feature

btrfs doesn't support DAX. only ext2, ext4, XFS and FUSE have DAX support.

If you think about it, btrfs and DAX are diametrically opposite things.
DAX is about giving raw access to the hardware. btrfs is about offering
extra value (RAID, checksums, ...), none of which can be done if the
filesystem isn't in the read/write path.

That's why there's no DAX support in btrfs. If you want DAX, you have
to give up all the features you like in btrfs. So you may as well use
a different filesystem.

2021-03-10 13:38:24

by Neal Gompa

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On Wed, Mar 10, 2021 at 8:02 AM Matthew Wilcox wrote:
>
> On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > Forgive my ignorance, but is there a reason why this isn't wired up to
> > Btrfs at the same time? It seems weird to me that adding a feature
>
> btrfs doesn't support DAX. only ext2, ext4, XFS and FUSE have DAX support.
>
> If you think about it, btrfs and DAX are diametrically opposite things.
> DAX is about giving raw access to the hardware. btrfs is about offering
> extra value (RAID, checksums, ...), none of which can be done if the
> filesystem isn't in the read/write path.
>
> That's why there's no DAX support in btrfs. If you want DAX, you have
> to give up all the features you like in btrfs. So you may as well use
> a different filesystem.

So does that mean that DAX is incompatible with those filesystems when
layered on DM (e.g. through LVM)?

Also, based on what you're saying, that means that DAX'd resources
would not be able to use reflinks on XFS, right? That'd put it in
similar territory as swap files on Btrfs, I would think.

--
真実はいつも一つ！/ Always, there's only one truth!

2021-03-10 13:57:30

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On Wed, Mar 10, 2021 at 08:36:06AM -0500, Neal Gompa wrote:
> On Wed, Mar 10, 2021 at 8:02 AM Matthew Wilcox wrote:
> >
> > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > Btrfs at the same time? It seems weird to me that adding a feature
> >
> > btrfs doesn't support DAX. only ext2, ext4, XFS and FUSE have DAX support.
> >
> > If you think about it, btrfs and DAX are diametrically opposite things.
> > DAX is about giving raw access to the hardware. btrfs is about offering
> > extra value (RAID, checksums, ...), none of which can be done if the
> > filesystem isn't in the read/write path.
> >
> > That's why there's no DAX support in btrfs. If you want DAX, you have
> > to give up all the features you like in btrfs. So you may as well use
> > a different filesystem.
>
> So does that mean that DAX is incompatible with those filesystems when
> layered on DM (e.g. through LVM)?

Yes. It might be possible to work through RAID-0 or read-only through
RAID-1, but I'm not sure anybody's bothered to do that work.

> Also, based on what you're saying, that means that DAX'd resources
> would not be able to use reflinks on XFS, right? That'd put it in
> similar territory as swap files on Btrfs, I would think.

You can use DAX with reflinks because the CPU can do read-only mmaps.
On a write fault, we break the reflink, copy the data and put in a
writable PTE.
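
The write-fault sequence described above (break the sharing, copy the data, install a writable mapping) can be illustrated with a toy userspace model. Everything here is a hypothetical stand-in, not the fsdax implementation: `demo_extent` plays the role of a shared extent with a reference count, `demo_mapping` plays the role of a PTE, and `demo_write_fault()` is an invented name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BLOCK_SIZE 16

/* Hypothetical shared extent: data plus a count of files referencing it. */
struct demo_extent {
	int refcount;
	unsigned char data[DEMO_BLOCK_SIZE];
};

/* Hypothetical per-file mapping, standing in for a PTE. */
struct demo_mapping {
	struct demo_extent *extent;
	bool writable;
};

static struct demo_extent *demo_extent_new(const char *s)
{
	struct demo_extent *e = calloc(1, sizeof(*e));
	size_t n = strlen(s);

	if (!e)
		return NULL;
	if (n > DEMO_BLOCK_SIZE)
		n = DEMO_BLOCK_SIZE;
	memcpy(e->data, s, n);
	e->refcount = 1;
	return e;
}

/* Share src's extent with dst; both sides become read-only. */
static void demo_reflink(struct demo_mapping *dst, struct demo_mapping *src)
{
	dst->extent = src->extent;
	dst->extent->refcount++;
	dst->writable = false;
	src->writable = false;
}

/* Write fault: if the extent is shared, copy it first, then allow writes. */
static int demo_write_fault(struct demo_mapping *m)
{
	if (m->extent->refcount > 1) {
		struct demo_extent *copy = malloc(sizeof(*copy));

		if (!copy)
			return -1;
		memcpy(copy->data, m->extent->data, DEMO_BLOCK_SIZE);
		copy->refcount = 1;
		m->extent->refcount--;	/* drop our share of the old extent */
		m->extent = copy;
	}
	m->writable = true;
	return 0;
}
```

Reads never trigger the copy in this model; only the first write to a shared extent pays for it, and the other file keeps seeing the original data.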

2021-03-10 14:23:47

by Goldwyn Rodrigues

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On 13:02 10/03, Matthew Wilcox wrote:
> On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > Forgive my ignorance, but is there a reason why this isn't wired up to
> > Btrfs at the same time? It seems weird to me that adding a feature
>
> btrfs doesn't support DAX. only ext2, ext4, XFS and FUSE have DAX support.
>
> If you think about it, btrfs and DAX are diametrically opposite things.
> DAX is about giving raw access to the hardware. btrfs is about offering
> extra value (RAID, checksums, ...), none of which can be done if the
> filesystem isn't in the read/write path.
>
> That's why there's no DAX support in btrfs. If you want DAX, you have
> to give up all the features you like in btrfs. So you may as well use
> a different filesystem.

DAX on btrfs has been attempted[1]. Of course, we could not
have checksums or multi-device with it. However, we got stuck on
associating a shared extent on the same page mapping: basically the
TODO above dax_associate_entry().

Shiyang has proposed a way to disassociate existing mapping, but I
don't think that is the best solution. DAX for CoW will not work until
we have a way of mapping a page to multiple inodes (page->mapping),
which will convert a 1-N inode-page mapping to M-N inode-page mapping.

[1] https://lore.kernel.org/linux-btrfs/[email protected]/

--
Goldwyn

2021-03-10 14:28:59

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> On 13:02 10/03, Matthew Wilcox wrote:
> > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > Btrfs at the same time? It seems weird to me that adding a feature
> >
> > btrfs doesn't support DAX. only ext2, ext4, XFS and FUSE have DAX support.
> >
> > If you think about it, btrfs and DAX are diametrically opposite things.
> > DAX is about giving raw access to the hardware. btrfs is about offering
> > extra value (RAID, checksums, ...), none of which can be done if the
> > filesystem isn't in the read/write path.
> >
> > That's why there's no DAX support in btrfs. If you want DAX, you have
> > to give up all the features you like in btrfs. So you may as well use
> > a different filesystem.
>
> DAX on btrfs has been attempted[1]. Of course, we could not

But why? A completeness fetish? I don't understand why you decided
to do this work.

> have checksums or multi-device with it. However, got stuck on
> associating a shared extent on the same page mapping: basically the
> TODO above dax_associate_entry().
>
> Shiyang has proposed a way to disassociate existing mapping, but I
> don't think that is the best solution. DAX for CoW will not work until
> we have a way of mapping a page to multiple inodes (page->mapping),
> which will convert a 1-N inode-page mapping to M-N inode-page mapping.

If you're still thinking in terms of pages, you're doing DAX wrong.
DAX should work without a struct page.

2021-03-10 17:07:34

by Goldwyn Rodrigues

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On 14:26 10/03, Matthew Wilcox wrote:
> On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > On 13:02 10/03, Matthew Wilcox wrote:
> > > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > > Btrfs at the same time? It seems weird to me that adding a feature
> > >
> > > btrfs doesn't support DAX. only ext2, ext4, XFS and FUSE have DAX support.
> > >
> > > If you think about it, btrfs and DAX are diametrically opposite things.
> > > DAX is about giving raw access to the hardware. btrfs is about offering
> > > extra value (RAID, checksums, ...), none of which can be done if the
> > > filesystem isn't in the read/write path.
> > >
> > > That's why there's no DAX support in btrfs. If you want DAX, you have
> > > to give up all the features you like in btrfs. So you may as well use
> > > a different filesystem.
> >
> > DAX on btrfs has been attempted[1]. Of course, we could not
>
> But why? A completeness fetish? I don't understand why you decided
> to do this work.

If only I had a penny every time I heard "why would you want to do that?"

>
> > have checksums or multi-device with it. However, got stuck on
> > associating a shared extent on the same page mapping: basically the
> > TODO above dax_associate_entry().
> >
> > Shiyang has proposed a way to disassociate existing mapping, but I
> > don't think that is the best solution. DAX for CoW will not work until
> > we have a way of mapping a page to multiple inodes (page->mapping),
> > which will convert a 1-N inode-page mapping to M-N inode-page mapping.
>
> If you're still thinking in terms of pages, you're doing DAX wrong.
> DAX should work without a struct page.

Not pages specifically, but mappings.
fsdax needs the mappings during the page fault and it breaks in case both
files fault on the same shared extent.

For Reference: WARN_ON_ONCE(page->mapping && page->mapping != mapping)
in dax_disassociate_entry().
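
A toy model of why a single back-pointer breaks down for shared extents (a sketch with hypothetical names; `demo_page` and `demo_associate()` are not kernel code, and the error return merely stands in for the warning quoted above):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page with room for exactly one owner, like page->mapping. */
struct demo_page {
	const void *mapping;
};

/*
 * Associate a page with a file's mapping. Returns -1 when the page is
 * already owned by a different mapping, which is the situation the
 * WARN_ON_ONCE condition above describes.
 */
static int demo_associate(struct demo_page *page, const void *mapping)
{
	if (page->mapping && page->mapping != mapping)
		return -1;
	page->mapping = mapping;
	return 0;
}
```

With two files faulting on the same shared extent, the second call fails because the single pointer is already taken by the first file.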

--
Goldwyn

2021-03-11 00:57:27

by Dan Williams

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On Wed, Mar 10, 2021 at 6:27 AM Matthew Wilcox wrote:
>
> On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > On 13:02 10/03, Matthew Wilcox wrote:
> > > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > > Btrfs at the same time? It seems weird to me that adding a feature
> > >
> > > btrfs doesn't support DAX. only ext2, ext4, XFS and FUSE have DAX support.
> > >
> > > If you think about it, btrfs and DAX are diametrically opposite things.
> > > DAX is about giving raw access to the hardware. btrfs is about offering
> > > extra value (RAID, checksums, ...), none of which can be done if the
> > > filesystem isn't in the read/write path.
> > >
> > > That's why there's no DAX support in btrfs. If you want DAX, you have
> > > to give up all the features you like in btrfs. So you may as well use
> > > a different filesystem.
> >
> > DAX on btrfs has been attempted[1]. Of course, we could not
>
> But why? A completeness fetish? I don't understand why you decided
> to do this work.

Isn't DAX useful for pagecache minimization on read even if it is
awkward for a copy-on-write fs?

Seems it would be a useful case to have COW'd VM images on BTRFS that
don't need superfluous page cache allocations.

2021-03-11 08:29:14

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On Wed, Mar 10, 2021 at 7:53 PM Dan Williams wrote:
>
> On Wed, Mar 10, 2021 at 6:27 AM Matthew Wilcox wrote:
> >
> > On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > > On 13:02 10/03, Matthew Wilcox wrote:
> > > > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > > > Btrfs at the same time? It seems weird to me that adding a feature
> > > >
> > > > btrfs doesn't support DAX. only ext2, ext4, XFS and FUSE have DAX support.
> > > >
> > > > If you think about it, btrfs and DAX are diametrically opposite things.
> > > > DAX is about giving raw access to the hardware. btrfs is about offering
> > > > extra value (RAID, checksums, ...), none of which can be done if the
> > > > filesystem isn't in the read/write path.
> > > >
> > > > That's why there's no DAX support in btrfs. If you want DAX, you have
> > > > to give up all the features you like in btrfs. So you may as well use
> > > > a different filesystem.
> > >
> > > DAX on btrfs has been attempted[1]. Of course, we could not
> >
> > But why? A completeness fetish? I don't understand why you decided
> > to do this work.
>
> Isn't DAX useful for pagecache minimization on read even if it is
> awkward for a copy-on-write fs?
>
> Seems it would be a useful case to have COW'd VM images on BTRFS that
> don't need superfluous page cache allocations.

I could also see this being useful for databases (and maybe even swap
files!) on Btrfs, if I'm understanding this feature correctly.

--
真実はいつも一つ！/ Always, there's only one truth!

2021-03-11 12:31:54

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RESEND PATCH v2.1 07/10] iomap: Introduce iomap_apply2() for operations on two files

On Thu, Mar 04, 2021 at 01:41:42PM +0800, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, require nested iomap_begin()/iomap_end() calls on two files.
> Thus, we introduce iomap_apply2() to accept arguments from two files and
> iomap_actor2_t for actions on two files.

I still wonder if adding the iter based iomap API that willy proposed
would be a better fit here. In that case we might not even need
a special API for the double iteration.
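
For readers following along, the double iteration under discussion can be modelled in userspace as below. This is only a sketch under stated assumptions, not the iomap API: `struct extent`, `lookup()` and `range_is_same()` are hypothetical names, each file is modelled as an array of mapped extents with a directly addressable buffer, and a hole is simply treated as a mismatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one mapped extent: file offset, length, data address. */
struct extent {
	size_t off;
	size_t len;
	const char *addr;
};

/* Find the extent covering 'pos'; NULL means a hole or out of range. */
const struct extent *lookup(const struct extent *map, size_t n, size_t pos)
{
	for (size_t i = 0; i < n; i++)
		if (pos >= map[i].off && pos - map[i].off < map[i].len)
			return &map[i];
	return NULL;
}

/*
 * Compare 'len' bytes at offset 'pa' of file a with 'len' bytes at offset
 * 'pb' of file b. Each step covers the largest range in which both files
 * stay inside a single extent, so the two mappings advance in lock-step.
 */
bool range_is_same(const struct extent *a, size_t na, size_t pa,
		   const struct extent *b, size_t nb, size_t pb, size_t len)
{
	while (len > 0) {
		const struct extent *ea = lookup(a, na, pa);
		const struct extent *eb = lookup(b, nb, pb);

		if (!ea || !eb)
			return false;	/* assumption: holes never match */

		size_t left_a = ea->len - (pa - ea->off);
		size_t left_b = eb->len - (pb - eb->off);
		size_t step = left_a < left_b ? left_a : left_b;

		if (step > len)
			step = len;
		if (memcmp(ea->addr + (pa - ea->off),
			   eb->addr + (pb - eb->off), step) != 0)
			return false;

		pa += step;
		pb += step;
		len -= step;
	}
	return true;
}
```

The point of the model is that extent boundaries in the two files need not line up, so each step is bounded by whichever mapping ends first.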

2021-03-12 09:03:04

by Shiyang Ruan

[permalink] [raw]

Subject: RE: [PATCH v2 02/10] fsdax: Factor helper: dax_fault_actor()

> > +	/* if we are reading UNWRITTEN and HOLE, return a hole. */
> > +	if (!write &&
> > +	    (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
> > +		if (!pmd)
> > +			return dax_load_hole(xas, mapping, &entry, vmf);
> > +		else
> > +			return dax_pmd_load_hole(xas, vmf, iomap, &entry);
> > +	}
> > +
> > +	if (iomap->type != IOMAP_MAPPED) {
> > +		WARN_ON_ONCE(1);
> > +		return VM_FAULT_SIGBUS;
> > +	}
>
> Nit: I'd use a switch statement here for clarity:
>
> 	switch (iomap->type) {
> 	case IOMAP_MAPPED:
> 		break;
> 	case IOMAP_UNWRITTEN:
> 	case IOMAP_HOLE:
> 		if (!write) {
> 			if (!pmd)
> 				return dax_load_hole(xas, mapping, &entry, vmf);
> 			return dax_pmd_load_hole(xas, vmf, iomap, &entry);
> 		}
> 		break;
> 	default:
> 		WARN_ON_ONCE(1);
> 		return VM_FAULT_SIGBUS;
> 	}
>
Hi, Christoph

I did not use a switch-case here because I would still have to introduce
a 'goto' for CoW. (A write to an IOMAP_UNWRITTEN extent where the two
iomaps differ indicates a CoW operation, which then has to jump to the
IOMAP_MAPPED branch to do the data copy and pfn insertion.) You said the
'goto' makes the code convoluted, so I avoided it and refactored this
part into a series of if-else branches, which look similar to the ones in
dax_iomap_actor(). So, what's your opinion now?

--
Thanks,
Ruan Shiyang.

>
> > +	err = dax_iomap_pfn(iomap, pos, size, &pfn);
> > +	if (err)
> > +		goto error_fault;
> > +
> > +	entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
> > +				 write && !sync);
> > +
> > +	if (sync)
> > +		return dax_fault_synchronous_pfnp(pfnp, pfn);
> > +
> > +	ret = dax_fault_insert_pfn(vmf, pfn, pmd, write);
> > +
> > +error_fault:
> > +	if (err)
> > +		ret = dax_fault_return(err);
> > +
> > +	return ret;
>
> It seems like the only place that sets err is the dax_iomap_pfn case above. So
> I'd move the dax_fault_return there, which then allows a direct return for
> everyone else, including the open coded version of dax_fault_insert_pfn.
>
> I really like where this is going!

2021-03-13 13:21:16

by Adam Borowski

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On Wed, Mar 10, 2021 at 02:26:43PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > DAX on btrfs has been attempted[1]. Of course, we could not
>
> But why? A completeness fetish? I don't understand why you decided
> to do this work.

* xfs can snapshot only single files, btrfs entire subvolumes
* btrfs-send|receive
* enumeration of changed parts of a file

Meow!
--
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢠⠒⠀⣿⡁ productivity. You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so. I recommend Skepticism
⠈⠳⣄⠀⠀⠀⠀ (funeral doom metal).

2021-03-13 16:26:27

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On Sat, Mar 13, 2021 at 8:09 AM Adam Borowski wrote:
>
> On Wed, Mar 10, 2021 at 02:26:43PM +0000, Matthew Wilcox wrote:
> > On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > > DAX on btrfs has been attempted[1]. Of course, we could not
> >
> > But why? A completeness fetish? I don't understand why you decided
> > to do this work.
>
> * xfs can snapshot only single files, btrfs entire subvolumes
> * btrfs-send|receive
> * enumeration of changed parts of a file
>

XFS cannot do snapshots since it lacks metadata COW. XFS reflinking is
primarily for space efficiency.

--
真実はいつも一つ！/ Always, there's only one truth!

2021-03-13 22:09:20

by Adam Borowski

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

On Sat, Mar 13, 2021 at 11:24:00AM -0500, Neal Gompa wrote:
> On Sat, Mar 13, 2021 at 8:09 AM Adam Borowski wrote:
> >
> > On Wed, Mar 10, 2021 at 02:26:43PM +0000, Matthew Wilcox wrote:
> > > On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > > > DAX on btrfs has been attempted[1]. Of course, we could not
> > >
> > > But why? A completeness fetish? I don't understand why you decided
> > > to do this work.
> >
> > * xfs can snapshot only single files, btrfs entire subvolumes
> > * btrfs-send|receive
> > * enumeration of changed parts of a file
>
> XFS cannot do snapshots since it lacks metadata COW. XFS reflinking is
> primarily for space efficiency.

A reflink is a single-file snapshot.

My work team really wants this very patchset -- reflinks on DAX allow
backups and/or checkpointing, without stopping the world (there's a single
file, "pool", here).

Besides, you can still get poor-man's whole-subvolume(/directory)
snapshots by manually walking the tree and reflinking everything.
That's not atomic -- but rsync isn't atomic either. That's enough for
eg. dnf/dpkg purposes.
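
As a rough illustration of the per-file step, a reflink can be requested from userspace with the FICLONE ioctl. The sketch below assumes a Linux system with <linux/fs.h> and a filesystem that supports reflinks; `reflink_file()` is a hypothetical helper name, and the tree walk itself (calling it for every regular file) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>	/* FICLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Make 'dst' share all of its data blocks with 'src' (a reflink copy).
 * Returns 0 on success or a negative errno value. On filesystems without
 * reflink support the ioctl fails, typically with EOPNOTSUPP or EINVAL.
 */
int reflink_file(const char *src, const char *dst)
{
	int ret = 0;
	int sfd, dfd;

	assert(src && dst);

	sfd = open(src, O_RDONLY);
	if (sfd < 0)
		return -errno;

	dfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (dfd < 0) {
		ret = -errno;
		close(sfd);
		return ret;
	}

	/* Ask the filesystem to clone src into dst without copying data. */
	if (ioctl(dfd, FICLONE, sfd) < 0)
		ret = -errno;

	close(dfd);
	close(sfd);
	return ret;
}
```

Because every file is cloned separately, files that change during the walk are captured at different points in time, which is the non-atomicity mentioned above.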

Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ NOBODY expects the Spanish Inquisition!
⠈⠳⣄⠀⠀⠀⠀