2021-08-16 06:05:14

by Shiyang Ruan

Subject: [PATCH v7 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax

This patchset attempts to add CoW support for fsdax, taking XFS,
which has both the reflink and fsdax features, as an example.

Changes from V6:
- Rebased on switch iomap to an iterator model v2[1]
- Replace iomap_apply2() with iomap_iter2() because of the new model
- Introduce dax_iomap_ops and dax_iomap_ops->actor_end() to handle the end
of CoW before iomap_ops->iomap_end()

One of the key mechanisms that needs to be implemented in fsdax is CoW:
copy the data from the srcmap before we actually write data to the
destination iomap. We copy only the ranges whose data will not be
overwritten by the new write.

Another mechanism is range comparison. In the page cache case,
readpage() is used to load on-disk data into the page cache so that it
can be compared. In the fsdax case, readpage() does not work, so we
need another way to compare data, with direct access support.

With these two mechanisms implemented in fsdax, we are able to make
reflink and fsdax work together in XFS.

(Rebased on v5.14-rc4 and Christoph's "switch iomap to an iterator model v2"[1])
[1]: https://lore.kernel.org/linux-xfs/[email protected]/
==

Shiyang Ruan (8):
fsdax: Output address in dax_iomap_pfn() and rename it
fsdax: Introduce dax_iomap_cow_copy()
fsdax: Replace mmap entry in case of CoW
fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
iomap: Introduce iomap_iter2 for two files
fsdax: Dedup file range to use a compare function
fsdax: Introduce dax_iomap_ops for end of reflink
fs/xfs: Add dax dedupe support

fs/dax.c | 307 ++++++++++++++++++++++++++++++++++++-----
fs/ext2/ext2.h | 3 +
fs/ext2/file.c | 6 +-
fs/ext2/inode.c | 11 +-
fs/ext4/ext4.h | 3 +
fs/ext4/file.c | 6 +-
fs/ext4/inode.c | 13 +-
fs/iomap/buffered-io.c | 7 +-
fs/iomap/core.c | 51 +++++++
fs/remap_range.c | 39 ++++--
fs/xfs/xfs_bmap_util.c | 3 +-
fs/xfs/xfs_file.c | 10 +-
fs/xfs/xfs_inode.c | 57 ++++++++
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_iomap.c | 36 ++++-
fs/xfs/xfs_iomap.h | 33 +++++
fs/xfs/xfs_iops.c | 7 +-
fs/xfs/xfs_reflink.c | 15 +-
include/linux/dax.h | 32 ++++-
include/linux/fs.h | 12 +-
include/linux/iomap.h | 3 +
21 files changed, 570 insertions(+), 85 deletions(-)

--
2.32.0




2021-08-16 06:05:19

by Shiyang Ruan

Subject: [PATCH v7 1/8] fsdax: Output address in dax_iomap_pfn() and rename it

Add an address output to dax_iomap_pfn() in order to perform a memcpy()
in the CoW case. Since this function now outputs both the address and
the pfn, rename it to dax_iomap_direct_access().
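
The calling convention described above — either output may be skipped by
passing NULL — can be sketched in plain userspace C. The names and the
backing array below are illustrative stand-ins, not the kernel code:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace sketch of the dax_iomap_direct_access() convention:
 * each output parameter is optional, and the callee fills only the
 * outputs the caller asked for.
 */
static char backing[4096];	/* stands in for the pmem mapping */

static int sketch_direct_access(size_t off, void **kaddr, unsigned long *pfn)
{
	if (off >= sizeof(backing))
		return -1;
	if (kaddr)
		*kaddr = backing + off;	/* address output, new in this patch */
	if (pfn)
		*pfn = off / 4096;	/* pfn output, as before */
	return 0;
}
```

A fault path would pass only `pfn` (as the patched caller does), while the
CoW copy path passes only `kaddr`.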

Signed-off-by: Shiyang Ruan <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/dax.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 118c9e2923f5..9fb6218f42be 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1010,8 +1010,8 @@ static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
}

-static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
- pfn_t *pfnp)
+static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+ size_t size, void **kaddr, pfn_t *pfnp)
{
const sector_t sector = dax_iomap_sector(iomap, pos);
pgoff_t pgoff;
@@ -1023,11 +1023,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
return rc;
id = dax_read_lock();
length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
- NULL, pfnp);
+ kaddr, pfnp);
if (length < 0) {
rc = length;
goto out;
}
+ if (!pfnp)
+ goto out_check_addr;
rc = -EINVAL;
if (PFN_PHYS(length) < size)
goto out;
@@ -1037,6 +1039,12 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
if (length > 1 && !pfn_t_devmap(*pfnp))
goto out;
rc = 0;
+
+out_check_addr:
+ if (!kaddr)
+ goto out;
+ if (!*kaddr)
+ rc = -EFAULT;
out:
dax_read_unlock(id);
return rc;
@@ -1401,7 +1409,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
}

- err = dax_iomap_pfn(&iter->iomap, pos, size, &pfn);
+ err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
if (err)
return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);

--
2.32.0



2021-08-16 06:05:39

by Shiyang Ruan

Subject: [PATCH v7 4/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero

Punching a hole in a reflinked file needs dax_iomap_cow_copy() too.
Otherwise, data in the unaligned area will not be correct. So, add the
srcmap to dax_iomap_zero() and replace the memset() with
dax_iomap_cow_copy().
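
The copy-then-zero flow for a sub-page hole punch can be modeled in
userspace C. This is an illustrative sketch of the logic only (names,
the page size constant, and the `cow` flag standing in for the
`iomap->addr != srcmap->addr` check are not the kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_PAGE 4096UL	/* stands in for PAGE_SIZE */

/*
 * Model of dax_iomap_zero() after this patch: a page-aligned range is
 * zeroed outright; a sub-page range on a shared (CoW) extent must first
 * copy the source page so the bytes around the hole stay correct, and
 * only then zero the requested sub-range.
 */
static void sk_zero(size_t pos, size_t len, int cow,
		    const char *src, char *page)
{
	size_t off = pos & (SK_PAGE - 1);

	if (off == 0 && len >= SK_PAGE) {	/* page aligned: plain zero */
		memset(page, 0, SK_PAGE);
		return;
	}
	if (cow)				/* new extent: copy data over */
		memcpy(page, src, SK_PAGE);
	memset(page + off, 0, len);		/* then punch the sub-range */
}
```

Without the copy step, the bytes outside `[pos, pos + len)` in the newly
allocated extent would be stale, which is the bug this patch fixes.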

Signed-off-by: Shiyang Ruan <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/dax.c | 25 +++++++++++++++----------
fs/iomap/buffered-io.c | 4 ++--
include/linux/dax.h | 3 ++-
3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index e49ba68cc7e4..91ceb518f66a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1198,7 +1198,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
}
#endif /* CONFIG_FS_DAX_PMD */

-s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
+s64 dax_iomap_zero(loff_t pos, u64 length, const struct iomap *iomap,
+ const struct iomap *srcmap)
{
sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
pgoff_t pgoff;
@@ -1220,19 +1221,23 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)

if (page_aligned)
rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
- else
+ else {
rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
- if (rc < 0) {
- dax_read_unlock(id);
- return rc;
- }
-
- if (!page_aligned) {
- memset(kaddr + offset, 0, size);
+ if (rc < 0)
+ goto out;
+ if (iomap->addr != srcmap->addr) {
+ rc = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
+ kaddr);
+ if (rc < 0)
+ goto out;
+ } else
+ memset(kaddr + offset, 0, size);
dax_flush(iomap->dax_dev, kaddr + offset, size);
}
+
+out:
dax_read_unlock(id);
- return size;
+ return rc < 0 ? rc : size;
}

static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 71b4806266d7..6e8d40877d01 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -889,7 +889,7 @@ static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)

static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
{
- struct iomap *iomap = &iter->iomap;
+ const struct iomap *iomap = &iter->iomap;
const struct iomap *srcmap = iomap_iter_srcmap(iter);
loff_t pos = iter->pos;
loff_t length = iomap_length(iter);
@@ -903,7 +903,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
s64 bytes;

if (IS_DAX(iter->inode))
- bytes = dax_iomap_zero(pos, length, iomap);
+ bytes = dax_iomap_zero(pos, length, iomap, srcmap);
else
bytes = __iomap_zero_iter(iter, pos, length);
if (bytes < 0)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..c63559605369 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -237,7 +237,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
-s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
+s64 dax_iomap_zero(loff_t pos, u64 length, const struct iomap *iomap,
+ const struct iomap *srcmap);
static inline bool dax_mapping(struct address_space *mapping)
{
return mapping->host && IS_DAX(mapping->host);
--
2.32.0



2021-08-16 06:05:43

by Shiyang Ruan

Subject: [PATCH v7 6/8] fsdax: Dedup file range to use a compare function

With dax we cannot use readpage() and friends, so we create a dax
comparison function similar to vfs_dedupe_file_range_compare(), and
introduce dax_remap_file_range_prep() for filesystem use.
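
The hole-aware comparison at the heart of dax_range_compare_iter() can be
sketched in userspace C. The enum and names below are illustrative, and
the direct-access addresses are modeled as plain pointers:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sk_type { SK_HOLE, SK_MAPPED };	/* stands in for iomap->type */

/*
 * Model of the fsdax dedupe compare: two holes trivially match, a hole
 * against mapped data trivially differs, and two mapped ranges are
 * compared in place with memcmp() — no page cache involved.
 */
static void sk_compare(enum sk_type t1, const void *a1,
		       enum sk_type t2, const void *a2,
		       size_t len, int *same)
{
	if (t1 == SK_HOLE && t2 == SK_HOLE) {
		*same = 1;
		return;
	}
	if (t1 == SK_HOLE || t2 == SK_HOLE) {
		*same = 0;
		return;
	}
	*same = memcmp(a1, a2, len) == 0;	/* direct in-place compare */
}
```

The real iterator additionally clamps `len` to the shorter of the two
mappings and loops until the requested range is covered.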

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/dax.c | 72 ++++++++++++++++++++++++++++++++++++++++++++
fs/remap_range.c | 39 +++++++++++++++++++-----
fs/xfs/xfs_reflink.c | 8 +++--
include/linux/dax.h | 8 +++++
include/linux/fs.h | 12 +++++---
5 files changed, 125 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 91ceb518f66a..74dd918cff1f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1823,3 +1823,75 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
return dax_insert_pfn_mkwrite(vmf, pfn, order);
}
EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t dax_range_compare_iter(struct iomap_iter *it_src,
+ struct iomap_iter *it_dest, bool *same)
+{
+ const struct iomap *smap = &it_src->iomap;
+ const struct iomap *dmap = &it_dest->iomap;
+ loff_t pos1 = it_src->pos, pos2 = it_dest->pos;
+ loff_t len = min(smap->length, dmap->length);
+ void *saddr, *daddr;
+ int ret;
+
+ if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
+ *same = true;
+ return len;
+ }
+
+ if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
+ *same = false;
+ return 0;
+ }
+
+ ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
+ &saddr, NULL);
+ if (ret < 0)
+ return -EIO;
+
+ ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
+ &daddr, NULL);
+ if (ret < 0)
+ return -EIO;
+
+ *same = !memcmp(saddr, daddr, len);
+ if (!*same)
+ return 0;
+ return len;
+}
+
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff, loff_t len, bool *same,
+ const struct iomap_ops *ops)
+{
+ struct iomap_iter it_src = {
+ .inode = src,
+ .pos = srcoff,
+ .len = len,
+ };
+ struct iomap_iter it_dest = {
+ .inode = dest,
+ .pos = destoff,
+ .len = len,
+ };
+ int id, ret;
+
+ id = dax_read_lock();
+ while ((ret = iomap_iter2(&it_src, &it_dest, ops)) > 0) {
+ it_src.processed = it_dest.processed =
+ dax_range_compare_iter(&it_src, &it_dest, same);
+ }
+ dax_read_unlock(id);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
+
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops)
+{
+ return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags, ops);
+}
+EXPORT_SYMBOL(dax_remap_file_range_prep);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index e4a5fdd7ad7b..b68a29562902 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -14,6 +14,7 @@
#include <linux/compat.h>
#include <linux/mount.h>
#include <linux/fs.h>
+#include <linux/dax.h>
#include "internal.h"

#include <linux/uaccess.h>
@@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
* Compare extents of two files to see if they are the same.
* Caller must have locked both inodes to prevent write races.
*/
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
- struct inode *dest, loff_t destoff,
- loff_t len, bool *is_same)
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same)
{
loff_t src_poff;
loff_t dest_poff;
@@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
out_error:
return error;
}
+EXPORT_SYMBOL(vfs_dedupe_file_range_compare);

/*
* Check that the two inodes are eligible for cloning, the ranges make
@@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
* If there's an error, then the usual negative error code is returned.
* Otherwise returns 0 with *len set to the request length.
*/
-int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- loff_t *len, unsigned int remap_flags)
+int
+__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *dax_read_ops)
{
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -351,8 +355,18 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
if (remap_flags & REMAP_FILE_DEDUP) {
bool is_same = false;

- ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
- inode_out, pos_out, *len, &is_same);
+ if (*len == 0)
+ return 0;
+
+ if (!IS_DAX(inode_in))
+ ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
+ inode_out, pos_out, *len, &is_same);
+ else if (dax_read_ops)
+ ret = dax_dedupe_file_range_compare(inode_in, pos_in,
+ inode_out, pos_out, *len, &is_same,
+ dax_read_ops);
+ else
+ return -EINVAL;
if (ret)
return ret;
if (!is_same)
@@ -370,6 +384,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,

return ret;
}
+EXPORT_SYMBOL(__generic_remap_file_range_prep);
+
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags)
+{
+ return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags, NULL);
+}
EXPORT_SYMBOL(generic_remap_file_range_prep);

loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index c256104772cb..28effe537d07 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1332,8 +1332,12 @@ xfs_reflink_remap_prep(
if (IS_DAX(inode_in) || IS_DAX(inode_out))
goto out_unlock;

- ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags);
+ if (!IS_DAX(inode_in))
+ ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags);
+ else
+ ret = dax_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags, &xfs_read_iomap_ops);
if (ret || *len == 0)
goto out_unlock;

diff --git a/include/linux/dax.h b/include/linux/dax.h
index c63559605369..8ecd125434ef 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -239,6 +239,14 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
s64 dax_iomap_zero(loff_t pos, u64 length, const struct iomap *iomap,
const struct iomap *srcmap);
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same,
+ const struct iomap_ops *ops);
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *ops);
static inline bool dax_mapping(struct address_space *mapping)
{
return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 640574294216..3e9922aac200 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -71,6 +71,7 @@ struct fsverity_operations;
struct fs_context;
struct fs_parameter_spec;
struct fileattr;
+struct iomap_ops;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -2126,10 +2127,13 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags);
-extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- loff_t *count,
- unsigned int remap_flags);
+int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *len, unsigned int remap_flags,
+ const struct iomap_ops *dax_read_ops);
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t *count, unsigned int remap_flags);
extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
--
2.32.0



2021-08-16 06:06:03

by Shiyang Ruan

Subject: [PATCH v7 2/8] fsdax: Introduce dax_iomap_cow_copy()

In the case where the iomap is a write operation and iomap is not equal
to srcmap after iomap_begin, we consider it is a CoW operation.

The destance extent which iomap indicated is new allocated extent.
So, it is needed to copy the data from srcmap to new allocated extent.
In theory, it is better to copy the head and tail ranges which is
outside of the non-aligned area instead of copying the whole aligned
range. But in dax page fault, it will always be an aligned range. So,
we have to copy the whole range in this case.
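
The head/tail copy described above can be modeled in userspace C. The
page-size constant, buffer layout, and names are illustrative stand-ins
for the kernel's dax_iomap_cow_copy(), with `memcpy()` standing in for
`copy_mc_to_kernel()`:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_ALIGN 4096UL	/* stands in for PAGE_SIZE (or PMD_SIZE) */

/*
 * Model of the CoW copy: for an unaligned write of @len bytes at @pos,
 * copy from @src to @dst only the head ([page start, pos)) and tail
 * ([pos + len, page end)) that the write itself will not overwrite.
 * A fully aligned range (the fault path) copies everything.
 */
static void sk_cow_copy(size_t pos, size_t len, const char *src, char *dst)
{
	size_t head_off = pos & (SK_ALIGN - 1);
	size_t end = pos + len;
	size_t pg_end = (end + SK_ALIGN - 1) & ~(SK_ALIGN - 1);

	if (head_off == 0 && end == pg_end) {
		memcpy(dst, src, len);		/* fault path: whole range */
		return;
	}
	if (head_off)				/* head, up to the write */
		memcpy(dst, src, head_off);
	if (end < pg_end)			/* tail, after the write */
		memcpy(dst + head_off + len, src + head_off + len,
		       pg_end - end);
}
```

Here `src` and `dst` point at the aligned start of the affected range, so
the region `[head_off, head_off + len)` is deliberately left untouched for
the caller's own write.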

Signed-off-by: Shiyang Ruan <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/dax.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 84 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9fb6218f42be..697a7b7bb96f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1050,6 +1050,61 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
return rc;
}

+/**
+ * dax_iomap_cow_copy(): Copy the data from source to destination before write.
+ * @pos: address to do copy from.
+ * @length: size of copy operation.
+ * @align_size: aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
+ * @srcmap: iomap srcmap
+ * @daddr: destination address to copy to.
+ *
+ * This can be called from two places. Either during DAX write fault, to copy
+ * the length size data to daddr. Or, while doing normal DAX write operation,
+ * dax_iomap_actor() might call this to do the copy of either start or end
+ * unaligned address. In this case the rest of the copy of aligned ranges is
+ * taken care by dax_iomap_actor() itself.
+ * Also, note DAX fault will always result in aligned pos and pos + length.
+ */
+static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
+ const struct iomap *srcmap, void *daddr)
+{
+ loff_t head_off = pos & (align_size - 1);
+ size_t size = ALIGN(head_off + length, align_size);
+ loff_t end = pos + length;
+ loff_t pg_end = round_up(end, align_size);
+ bool copy_all = head_off == 0 && end == pg_end;
+ void *saddr = 0;
+ int ret = 0;
+
+ ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+ if (ret)
+ return ret;
+
+ if (copy_all) {
+ ret = copy_mc_to_kernel(daddr, saddr, length);
+ return ret ? -EIO : 0;
+ }
+
+ /* Copy the head part of the range. Note: we pass offset as length. */
+ if (head_off) {
+ ret = copy_mc_to_kernel(daddr, saddr, head_off);
+ if (ret)
+ return -EIO;
+ }
+
+ /* Copy the tail part of the range */
+ if (end < pg_end) {
+ loff_t tail_off = head_off + length;
+ loff_t tail_len = pg_end - end;
+
+ ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
+ tail_len);
+ if (ret)
+ return -EIO;
+ }
+ return 0;
+}
+
/*
* The user has performed a load from a hole in the file. Allocating a new
* page in the file would cause excessive storage usage for workloads with
@@ -1175,16 +1230,18 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
struct iov_iter *iter)
{
const struct iomap *iomap = &iomi->iomap;
+ const struct iomap *srcmap = &iomi->srcmap;
loff_t length = iomap_length(iomi);
loff_t pos = iomi->pos;
struct block_device *bdev = iomap->bdev;
struct dax_device *dax_dev = iomap->dax_dev;
loff_t end = pos + length, done = 0;
+ bool write = iov_iter_rw(iter) == WRITE;
ssize_t ret = 0;
size_t xfer;
int id;

- if (iov_iter_rw(iter) == READ) {
+ if (!write) {
end = min(end, i_size_read(iomi->inode));
if (pos >= end)
return 0;
@@ -1193,7 +1250,12 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
return iov_iter_zero(min(length, end - pos), iter);
}

- if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+ /*
+ * In DAX mode, we allow either pure overwrites of written extents, or
+ * writes to unwritten extents as part of a copy-on-write operation.
+ */
+ if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
+ !(iomap->flags & IOMAP_F_SHARED)))
return -EIO;

/*
@@ -1232,6 +1294,14 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
break;
}

+ if (write &&
+ srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+ ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
+ kaddr);
+ if (ret)
+ break;
+ }
+
map_len = PFN_PHYS(map_len);
kaddr += offset;
map_len -= offset;
@@ -1243,7 +1313,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
* validated via access_ok() in either vfs_read() or
* vfs_write(), depending on which operation we are doing.
*/
- if (iov_iter_rw(iter) == WRITE)
+ if (write)
xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
map_len, iter);
else
@@ -1385,6 +1455,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
const struct iomap *iomap = &iter->iomap;
+ const struct iomap *srcmap = &iter->srcmap;
size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -1392,6 +1463,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
unsigned long entry_flags = pmd ? DAX_PMD : 0;
int err = 0;
pfn_t pfn;
+ void *kaddr;

if (!pmd && vmf->cow_page)
return dax_fault_cow_page(vmf, iter);
@@ -1404,18 +1476,25 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
return dax_pmd_load_hole(xas, vmf, iomap, entry);
}

- if (iomap->type != IOMAP_MAPPED) {
+ if (iomap->type != IOMAP_MAPPED && !(iomap->flags & IOMAP_F_SHARED)) {
WARN_ON_ONCE(1);
return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
}

- err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
+ err = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
if (err)
return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);

*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
write && !sync);

+ if (write &&
+ srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+ err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr);
+ if (err)
+ return dax_fault_return(err);
+ }
+
if (sync)
return dax_fault_synchronous_pfnp(pfnp, pfn);

--
2.32.0



2021-08-16 06:06:04

by Shiyang Ruan

Subject: [PATCH v7 5/8] iomap: Introduce iomap_iter2 for two files

Some operations, such as comparing a range of data in two files under
fsdax mode, require nested iomap_begin()/iomap_end() calls on two
files. Thus, we introduce iomap_iter2(), which accepts two iterators
and operates on the two files in lock step.
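
The lock-step, two-cursor iteration pattern can be sketched in userspace
C. This simplified model advances both cursors by the smaller remaining
length each step; the real iomap_iter2() additionally calls
iomap_begin()/iomap_end() around every step (names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct iomap_iter: position plus bytes left. */
struct sk_iter {
	size_t pos;
	size_t len;
};

/*
 * One lock-step advance over two ranges: process min(len1, len2) bytes
 * and move both cursors. Returns 0 when either range is exhausted,
 * mirroring iomap_iter2() returning <= 0 to end the caller's loop.
 */
static size_t sk_iter2_step(struct sk_iter *i1, struct sk_iter *i2)
{
	size_t n = i1->len < i2->len ? i1->len : i2->len;

	if (n == 0)
		return 0;
	i1->pos += n; i1->len -= n;
	i2->pos += n; i2->len -= n;
	return n;
}
```

Callers such as dax_dedupe_file_range_compare() loop on this step,
processing one pair of mappings per iteration.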

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/iomap/core.c | 51 +++++++++++++++++++++++++++++++++++++++++++
include/linux/iomap.h | 2 ++
2 files changed, 53 insertions(+)

diff --git a/fs/iomap/core.c b/fs/iomap/core.c
index 89a87a1654e8..214538a25110 100644
--- a/fs/iomap/core.c
+++ b/fs/iomap/core.c
@@ -77,3 +77,54 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
iomap_iter_done(iter);
return 1;
}
+
+/**
+ * iomap_iter2 - iterate over ranges in two files
+ * @iter1: the first iteration structure
+ * @iter2: the second iteration structure
+ * @ops: iomap ops provided by the file system
+ *
+ * Iterate two files once.
+ */
+int iomap_iter2(struct iomap_iter *iter1, struct iomap_iter *iter2,
+ const struct iomap_ops *ops)
+{
+ int ret;
+
+ if (iter2->iomap.length && ops->iomap_end) {
+ ret = ops->iomap_end(iter2->inode, iter2->pos, iomap_length(iter2),
+ iter2->processed > 0 ? iter2->processed : 0,
+ iter2->flags, &iter2->iomap);
+ if (ret < 0 && !iter2->processed)
+ return ret;
+ }
+
+ if (iter1->iomap.length && ops->iomap_end) {
+ ret = ops->iomap_end(iter1->inode, iter1->pos, iomap_length(iter1),
+ iter1->processed > 0 ? iter1->processed : 0,
+ iter1->flags, &iter1->iomap);
+ if (ret < 0 && !iter1->processed)
+ return ret;
+ }
+
+ trace_iomap_iter(iter1, ops, _RET_IP_);
+ ret = iomap_iter_advance(iter1);
+ if (ret <= 0)
+ return ret;
+ ret = iomap_iter_advance(iter2);
+ if (ret <= 0)
+ return ret;
+
+ ret = ops->iomap_begin(iter1->inode, iter1->pos, iter1->len, iter1->flags,
+ &iter1->iomap, &iter1->srcmap);
+ if (ret < 0)
+ return ret;
+ iomap_iter_done(iter1);
+
+ ret = ops->iomap_begin(iter2->inode, iter2->pos, iter2->len, iter2->flags,
+ &iter2->iomap, &iter2->srcmap);
+ if (ret < 0)
+ return ret;
+ iomap_iter_done(iter2);
+ return 1;
+}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 24f8489583ca..ebef060c65cd 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -186,6 +186,8 @@ struct iomap_iter {
};

int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops);
+int iomap_iter2(struct iomap_iter *iter, struct iomap_iter *iter2,
+ const struct iomap_ops *ops);

/**
* iomap_length - length of the current iomap iteration
--
2.32.0



2021-08-16 06:06:23

by Shiyang Ruan

Subject: [PATCH v7 7/8] fsdax: Introduce dax_iomap_ops for end of reflink

After writing data, reflink requires end operations to remap the newly
allocated extents. The current ->iomap_end() ignores the error code
returned from ->actor(), so we introduce dax_iomap_ops and change the
dax_iomap_*() interfaces to do this job.

- dax_iomap_ops contains the original struct iomap_ops and the fsdax
specific ->actor_end(), which handles the end operations of reflink
- also introduce dax specific zero_range and truncate_page helpers
- create new dax_iomap_ops for ext2 and ext4
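
The layering — the generic ops embedded inside the dax ops, with
->actor_end() run before the generic ->iomap_end() — can be sketched in
userspace C. All names below are illustrative; the callbacks just record
the order they ran in:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for struct iomap_ops and the new struct dax_iomap_ops. */
struct sk_iomap_ops {
	int (*iomap_end)(void);
};

struct sk_dax_iomap_ops {
	struct sk_iomap_ops iomap_ops;	/* generic ops, embedded */
	int (*actor_end)(void);		/* fsdax-specific end hook */
};

static int sk_seq, sk_actor_seq, sk_end_seq;
static int sk_actor_end_cb(void) { sk_actor_seq = ++sk_seq; return 0; }
static int sk_iomap_end_cb(void) { sk_end_seq  = ++sk_seq; return 0; }

static const struct sk_dax_iomap_ops sk_ops = {
	.iomap_ops = { .iomap_end = sk_iomap_end_cb },
	.actor_end = sk_actor_end_cb,
};

/*
 * Model of __dax_iomap_iter()'s ordering: the reflink end-of-CoW hook
 * runs first and may fail the iteration; the generic end runs after.
 */
static int sk_end_iteration(const struct sk_dax_iomap_ops *ops)
{
	int ret;

	if (ops->actor_end) {
		ret = ops->actor_end();
		if (ret < 0)
			return ret;
	}
	if (ops->iomap_ops.iomap_end)
		return ops->iomap_ops.iomap_end();
	return 0;
}
```

A filesystem that needs no end-of-CoW work simply leaves ->actor_end()
NULL and gets the old behavior.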

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 68 +++++++++++++++++++++++++++++++++++++-----
fs/ext2/ext2.h | 3 ++
fs/ext2/file.c | 6 ++--
fs/ext2/inode.c | 11 +++++--
fs/ext4/ext4.h | 3 ++
fs/ext4/file.c | 6 ++--
fs/ext4/inode.c | 13 ++++++--
fs/iomap/buffered-io.c | 3 +-
fs/xfs/xfs_bmap_util.c | 3 +-
fs/xfs/xfs_file.c | 8 ++---
fs/xfs/xfs_iomap.c | 36 +++++++++++++++++++++-
fs/xfs/xfs_iomap.h | 33 ++++++++++++++++++++
fs/xfs/xfs_iops.c | 7 ++---
fs/xfs/xfs_reflink.c | 3 +-
include/linux/dax.h | 21 ++++++++++---
include/linux/iomap.h | 1 +
16 files changed, 189 insertions(+), 36 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 74dd918cff1f..0e0536765a7e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1348,11 +1348,30 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
return done ? done : ret;
}

+static inline int
+__dax_iomap_iter(struct iomap_iter *iter, const struct dax_iomap_ops *ops)
+{
+ int ret;
+
+ /*
+ * Call dax_iomap_ops->actor_end() before iomap_ops->iomap_end() in
+ * each iteration.
+ */
+ if (iter->iomap.length && ops->actor_end) {
+ ret = ops->actor_end(iter->inode, iter->pos, iter->len,
+ iter->processed);
+ if (ret < 0)
+ return ret;
+ }
+
+ return iomap_iter(iter, &ops->iomap_ops);
+}
+
/**
* dax_iomap_rw - Perform I/O to a DAX file
* @iocb: The control block for this I/O
* @iter: The addresses to do I/O from or to
- * @ops: iomap ops passed from the file system
+ * @ops: dax iomap ops passed from the file system
*
* This function performs read and write operations to directly mapped
* persistent memory. The callers needs to take care of read/write exclusion
@@ -1360,7 +1379,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
*/
ssize_t
dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
- const struct iomap_ops *ops)
+ const struct dax_iomap_ops *ops)
{
struct iomap_iter iomi = {
.inode = iocb->ki_filp->f_mapping->host,
@@ -1380,7 +1399,7 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
if (iocb->ki_flags & IOCB_NOWAIT)
iomi.flags |= IOMAP_NOWAIT;

- while ((ret = iomap_iter(&iomi, ops)) > 0)
+ while ((ret = __dax_iomap_iter(&iomi, ops)) > 0)
iomi.processed = dax_iomap_iter(&iomi, iter);

done = iomi.pos - iocb->ki_pos;
@@ -1389,6 +1408,39 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
}
EXPORT_SYMBOL_GPL(dax_iomap_rw);

+int
+dax_iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
+ bool *did_zero, const struct dax_iomap_ops *ops)
+{
+ struct iomap_iter iomi = {
+ .inode = inode,
+ .pos = pos,
+ .len = len,
+ .flags = IOMAP_ZERO,
+ };
+ int ret;
+
+ while ((ret = __dax_iomap_iter(&iomi, ops)) > 0)
+ iomi.processed = iomap_zero_iter(&iomi, did_zero);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(dax_iomap_zero_range);
+
+int
+dax_iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
+ const struct dax_iomap_ops *ops)
+{
+ unsigned int blocksize = i_blocksize(inode);
+ unsigned int off = pos & (blocksize - 1);
+
+ /* Block boundary? Nothing to do */
+ if (!off)
+ return 0;
+ return dax_iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
+}
+EXPORT_SYMBOL_GPL(dax_iomap_truncate_page);
+
static vm_fault_t dax_fault_return(int error)
{
if (error == 0)
@@ -1531,7 +1583,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
}

static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
- int *iomap_errp, const struct iomap_ops *ops)
+ int *iomap_errp, const struct dax_iomap_ops *ops)
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
@@ -1576,7 +1628,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
goto unlock_entry;
}

- while ((error = iomap_iter(&iter, ops)) > 0) {
+ while ((error = __dax_iomap_iter(&iter, ops)) > 0) {
if (WARN_ON_ONCE(iomap_length(&iter) < PAGE_SIZE)) {
iter.processed = -EIO; /* fs corruption? */
continue;
@@ -1641,7 +1693,7 @@ static bool dax_fault_check_fallback(struct vm_fault *vmf, struct xa_state *xas,
}

static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
- const struct iomap_ops *ops)
+ const struct dax_iomap_ops *ops)
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_ORDER);
@@ -1700,7 +1752,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
}

iter.pos = (loff_t)xas.xa_index << PAGE_SHIFT;
- while ((error = iomap_iter(&iter, ops)) > 0) {
+ while ((error = __dax_iomap_iter(&iter, ops)) > 0) {
if (iomap_length(&iter) < PMD_SIZE)
continue; /* actually breaks out of the loop */

@@ -1742,7 +1794,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
* successfully.
*/
vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
- pfn_t *pfnp, int *iomap_errp, const struct iomap_ops *ops)
+ pfn_t *pfnp, int *iomap_errp, const struct dax_iomap_ops *ops)
{
switch (pe_size) {
case PE_SIZE_PTE:
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index e512630cb63e..b04f3e2fe11f 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -807,6 +807,9 @@ extern void ext2_set_file_ops(struct inode *inode);
extern const struct address_space_operations ext2_aops;
extern const struct address_space_operations ext2_nobh_aops;
extern const struct iomap_ops ext2_iomap_ops;
+#ifdef CONFIG_FS_DAX
+extern const struct dax_iomap_ops ext2_dax_iomap_ops;
+#endif

/* namei.c */
extern const struct inode_operations ext2_dir_inode_operations;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index f98466acc672..d5dd82111128 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -39,7 +39,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
return 0; /* skip atime */

inode_lock_shared(inode);
- ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
+ ret = dax_iomap_rw(iocb, to, &ext2_dax_iomap_ops);
inode_unlock_shared(inode);

file_accessed(iocb->ki_filp);
@@ -63,7 +63,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret)
goto out_unlock;

- ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
+ ret = dax_iomap_rw(iocb, from, &ext2_dax_iomap_ops);
if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
i_size_write(inode, iocb->ki_pos);
mark_inode_dirty(inode);
@@ -102,7 +102,7 @@ static vm_fault_t ext2_dax_fault(struct vm_fault *vmf)
}
down_read(&ei->dax_sem);

- ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, NULL, &ext2_iomap_ops);
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, NULL, &ext2_dax_iomap_ops);

up_read(&ei->dax_sem);
if (write)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index dadb121beb22..deb1a117aa74 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -852,6 +852,13 @@ const struct iomap_ops ext2_iomap_ops = {
.iomap_begin = ext2_iomap_begin,
.iomap_end = ext2_iomap_end,
};
+
+const struct dax_iomap_ops ext2_dax_iomap_ops = {
+ .iomap_ops = {
+ .iomap_begin = ext2_iomap_begin,
+ .iomap_end = ext2_iomap_end,
+ },
+};
#else
/* Define empty ops for !CONFIG_FS_DAX case to avoid ugly ifdefs */
const struct iomap_ops ext2_iomap_ops;
@@ -1296,9 +1303,9 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
inode_dio_wait(inode);

if (IS_DAX(inode)) {
- error = iomap_zero_range(inode, newsize,
+ error = dax_iomap_zero_range(inode, newsize,
PAGE_ALIGN(newsize) - newsize, NULL,
- &ext2_iomap_ops);
+ &ext2_dax_iomap_ops);
} else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3c51e243450d..59829d63cb89 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3787,6 +3787,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
}

extern const struct iomap_ops ext4_iomap_ops;
+#ifdef CONFIG_FS_DAX
+extern const struct dax_iomap_ops ext4_dax_iomap_ops;
+#endif
extern const struct iomap_ops ext4_iomap_overwrite_ops;
extern const struct iomap_ops ext4_iomap_report_ops;

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 816dedcbd541..a7a3497429ca 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -102,7 +102,7 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
/* Fallback to buffered IO in case we cannot support DAX */
return generic_file_read_iter(iocb, to);
}
- ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
+ ret = dax_iomap_rw(iocb, to, &ext4_dax_iomap_ops);
inode_unlock_shared(inode);

file_accessed(iocb->ki_filp);
@@ -650,7 +650,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
ext4_journal_stop(handle);
}

- ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+ ret = dax_iomap_rw(iocb, from, &ext4_dax_iomap_ops);

if (extend)
ret = ext4_handle_inode_extension(inode, offset, ret, count);
@@ -721,7 +721,7 @@ static vm_fault_t ext4_dax_huge_fault(struct vm_fault *vmf,
} else {
down_read(&EXT4_I(inode)->i_mmap_sem);
}
- result = dax_iomap_fault(vmf, pe_size, &pfn, &error, &ext4_iomap_ops);
+ result = dax_iomap_fault(vmf, pe_size, &pfn, &error, &ext4_dax_iomap_ops);
if (write) {
ext4_journal_stop(handle);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d8de607849df..c0ac5e1384e6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3523,6 +3523,15 @@ const struct iomap_ops ext4_iomap_ops = {
.iomap_end = ext4_iomap_end,
};

+#ifdef CONFIG_FS_DAX
+const struct dax_iomap_ops ext4_dax_iomap_ops = {
+ .iomap_ops = {
+ .iomap_begin = ext4_iomap_begin,
+ .iomap_end = ext4_iomap_end,
+ },
+};
+#endif
+
const struct iomap_ops ext4_iomap_overwrite_ops = {
.iomap_begin = ext4_iomap_overwrite_begin,
.iomap_end = ext4_iomap_end,
@@ -3840,8 +3849,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
length = max;

if (IS_DAX(inode)) {
- return iomap_zero_range(inode, from, length, NULL,
- &ext4_iomap_ops);
+ return dax_iomap_zero_range(inode, from, length, NULL,
+ &ext4_dax_iomap_ops);
}
return __ext4_block_zero_page_range(handle, mapping, from, length);
}
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6e8d40877d01..6341a1328def 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -887,7 +887,7 @@ static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)
return iomap_write_end(iter, pos, bytes, bytes, page);
}

-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
+loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
{
const struct iomap *iomap = &iter->iomap;
const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -918,6 +918,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)

return written;
}
+EXPORT_SYMBOL_GPL(iomap_zero_iter);

int
iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 213a97a921bb..f1b7a2637a1d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1009,8 +1009,7 @@ xfs_free_file_space(
return 0;
if (offset + len > XFS_ISIZE(ip))
len = XFS_ISIZE(ip) - offset;
- error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
- &xfs_buffered_write_iomap_ops);
+ error = xfs_iomap_zero_range(ip, offset, len, NULL);
if (error)
return error;

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cc3cfb12df53..226e3fbaf405 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -281,7 +281,7 @@ xfs_file_dax_read(
ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
if (ret)
return ret;
- ret = dax_iomap_rw(iocb, to, &xfs_read_iomap_ops);
+ ret = dax_iomap_rw(iocb, to, &xfs_dax_read_iomap_ops);
xfs_iunlock(ip, XFS_IOLOCK_SHARED);

file_accessed(iocb->ki_filp);
@@ -704,7 +704,7 @@ xfs_file_dax_write(
pos = iocb->ki_pos;

trace_xfs_file_dax_write(iocb, from);
- ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
+ ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
i_size_write(inode, iocb->ki_pos);
error = xfs_setfilesize(ip, pos, ret);
@@ -1329,8 +1329,8 @@ __xfs_filemap_fault(

ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
(write_fault && !vmf->cow_page) ?
- &xfs_direct_write_iomap_ops :
- &xfs_read_iomap_ops);
+ &xfs_dax_write_iomap_ops :
+ &xfs_dax_read_iomap_ops);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
} else {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index d8cd2583dedb..4b0f405304ce 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(

/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
- &lockmode, flags & IOMAP_DIRECT);
+ &lockmode,
+ (flags & IOMAP_DIRECT) || IS_DAX(inode));
if (error)
goto out_unlock;
if (shared)
@@ -854,6 +855,33 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
.iomap_begin = xfs_direct_write_iomap_begin,
};

+static int
+xfs_dax_write_iomap_actor_end(
+ struct inode *inode,
+ loff_t pos,
+ loff_t length,
+ ssize_t written)
+{
+ int error = 0;
+ struct xfs_inode *ip = XFS_I(inode);
+ bool cow = xfs_is_cow_inode(ip);
+
+ if (cow) {
+ if (written <= 0)
+ xfs_reflink_cancel_cow_range(ip, pos, length, true);
+ else
+ error = xfs_reflink_end_cow(ip, pos, written);
+ }
+ return error ?: written;
+}
+
+const struct dax_iomap_ops xfs_dax_write_iomap_ops = {
+ .iomap_ops = {
+ .iomap_begin = xfs_direct_write_iomap_begin,
+ },
+ .actor_end = xfs_dax_write_iomap_actor_end,
+};
+
static int
xfs_buffered_write_iomap_begin(
struct inode *inode,
@@ -1184,6 +1212,12 @@ const struct iomap_ops xfs_read_iomap_ops = {
.iomap_begin = xfs_read_iomap_begin,
};

+const struct dax_iomap_ops xfs_dax_read_iomap_ops = {
+ .iomap_ops = {
+ .iomap_begin = xfs_read_iomap_begin,
+ },
+};
+
static int
xfs_seek_iomap_begin(
struct inode *inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 7d3703556d0e..5eacb5d8ca88 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -45,5 +45,38 @@ extern const struct iomap_ops xfs_direct_write_iomap_ops;
extern const struct iomap_ops xfs_read_iomap_ops;
extern const struct iomap_ops xfs_seek_iomap_ops;
extern const struct iomap_ops xfs_xattr_iomap_ops;
+extern const struct dax_iomap_ops xfs_dax_write_iomap_ops;
+extern const struct dax_iomap_ops xfs_dax_read_iomap_ops;
+
+static inline int
+xfs_iomap_zero_range(
+ struct xfs_inode *ip,
+ loff_t pos,
+ loff_t len,
+ bool *did_zero)
+{
+ struct inode *inode = VFS_I(ip);
+
+ return IS_DAX(inode)
+ ? dax_iomap_zero_range(inode, pos, len, did_zero,
+ &xfs_dax_write_iomap_ops)
+ : iomap_zero_range(inode, pos, len, did_zero,
+ &xfs_buffered_write_iomap_ops);
+}
+
+static inline int
+xfs_iomap_truncate_page(
+ struct xfs_inode *ip,
+ loff_t pos,
+ bool *did_zero)
+{
+ struct inode *inode = VFS_I(ip);
+
+ return IS_DAX(inode)
+ ? dax_iomap_truncate_page(inode, pos, did_zero,
+ &xfs_dax_write_iomap_ops)
+ : iomap_truncate_page(inode, pos, did_zero,
+ &xfs_buffered_write_iomap_ops);
+}

#endif /* __XFS_IOMAP_H__*/
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 93c082db04b7..0380f6942bc0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -911,8 +911,8 @@ xfs_setattr_size(
*/
if (newsize > oldsize) {
trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
- error = iomap_zero_range(inode, oldsize, newsize - oldsize,
- &did_zeroing, &xfs_buffered_write_iomap_ops);
+ error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
+ &did_zeroing);
} else {
/*
* iomap won't detect a dirty page over an unwritten block (or a
@@ -924,8 +924,7 @@ xfs_setattr_size(
newsize);
if (error)
return error;
- error = iomap_truncate_page(inode, newsize, &did_zeroing,
- &xfs_buffered_write_iomap_ops);
+ error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
}

if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 28effe537d07..13e461cf2055 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1269,8 +1269,7 @@ xfs_reflink_zero_posteof(
return 0;

trace_xfs_zero_eof(ip, isize, pos - isize);
- return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
- &xfs_buffered_write_iomap_ops);
+ return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
}

/*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8ecd125434ef..aff5617578ba 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -3,6 +3,7 @@
#define _LINUX_DAX_H

#include <linux/fs.h>
+#include <linux/iomap.h>
#include <linux/mm.h>
#include <linux/radix-tree.h>

@@ -11,8 +12,6 @@

typedef unsigned long dax_entry_t;

-struct iomap_ops;
-struct iomap;
struct dax_device;
struct dax_operations {
/*
@@ -38,6 +37,16 @@ struct dax_operations {
int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
};

+struct dax_iomap_ops {
+ /* the original iomap ops */
+ struct iomap_ops iomap_ops;
+ /*
+ * actor_end: accept error code returned from ->actor(), deal with it
+ * before ->iomap_end()
+ */
+ int (*actor_end)(struct inode *, loff_t, loff_t, ssize_t);
+};
+
extern struct attribute_group dax_attribute_group;

#if IS_ENABLED(CONFIG_DAX)
@@ -229,14 +238,18 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);

ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
- const struct iomap_ops *ops);
+ const struct dax_iomap_ops *ops);
vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
- pfn_t *pfnp, int *errp, const struct iomap_ops *ops);
+ pfn_t *pfnp, int *errp, const struct dax_iomap_ops *ops);
vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
enum page_entry_size pe_size, pfn_t pfn);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
+int dax_iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
+ bool *did_zero, const struct dax_iomap_ops *ops);
+int dax_iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
+ const struct dax_iomap_ops *ops);
s64 dax_iomap_zero(loff_t pos, u64 length, const struct iomap *iomap,
const struct iomap *srcmap);
int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index ebef060c65cd..8cdb94e749d9 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -237,6 +237,7 @@ int iomap_migrate_page(struct address_space *mapping, struct page *newpage,
#endif
int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
const struct iomap_ops *ops);
+loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero);
int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
bool *did_zero, const struct iomap_ops *ops);
int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
--
2.32.0
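For illustration, the ordering this series relies on — `->actor_end()` handles the end of CoW before `->iomap_end()` runs, and its return value replaces the actor's — can be modeled in user space. Everything below (`model_dax_ops`, `model_dax_iterate`, `cancel_or_commit`) is a hypothetical sketch of that dispatch, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for dax_iomap_ops: only the end-of-actor hook. */
struct model_dax_ops {
	int (*actor_end)(long pos, long length, long written);
};

/* One iteration step: the actor result is fed to ->actor_end() (when
 * present) before ->iomap_end() would run, so a CoW remap can be
 * committed or cancelled based on how much was actually written. */
static long model_dax_iterate(const struct model_dax_ops *ops,
			      long pos, long length, long actor_ret)
{
	long ret = actor_ret;	/* result of ->actor() */

	if (ops->actor_end)
		ret = ops->actor_end(pos, length, ret);
	/* ->iomap_end() would run here and must not clobber 'ret' */
	return ret;
}

/* Mirrors the shape of xfs_dax_write_iomap_actor_end(): cancel the CoW
 * range on failure (surface an error), commit and pass 'written' through
 * otherwise. -5 models -EIO. */
static int cancel_or_commit(long pos, long length, long written)
{
	(void)pos;
	(void)length;
	return written <= 0 ? -5 : (int)written;
}
```

An op table without `.actor_end` behaves exactly like a plain `iomap_ops`, which is why the non-reflink callers in the series can keep their existing semantics.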



2021-08-16 06:06:25

by Shiyang Ruan

Subject: [PATCH v7 8/8] fs/xfs: Add dax dedupe support

Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for DAX files
that are going to be deduped. After that, call the compare-range function
only when the two files are either both DAX or both non-DAX.

Signed-off-by: Shiyang Ruan <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/xfs/xfs_file.c | 2 +-
fs/xfs/xfs_inode.c | 57 ++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_reflink.c | 4 ++--
4 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 226e3fbaf405..4765abfe777e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -846,7 +846,7 @@ xfs_wait_dax_page(
xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
}

-static int
+int
xfs_break_dax_layouts(
struct inode *inode,
bool *retry)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 990b72ae3635..4b44d9d1e42a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3727,6 +3727,59 @@ xfs_iolock_two_inodes_and_break_layout(
return 0;
}

+static int
+xfs_mmaplock_two_inodes_and_break_dax_layout(
+ struct xfs_inode *ip1,
+ struct xfs_inode *ip2)
+{
+ int error, attempts = 0;
+ bool retry;
+ struct page *page;
+ struct xfs_log_item *lp;
+
+ if (ip1->i_ino > ip2->i_ino)
+ swap(ip1, ip2);
+
+again:
+ retry = false;
+ /* Lock the first inode */
+ xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+ error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
+ if (error || retry) {
+ xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+ goto again;
+ }
+
+ if (ip1 == ip2)
+ return 0;
+
+ /* Nested lock the second inode */
+ lp = &ip1->i_itemp->ili_item;
+ if (lp && test_bit(XFS_LI_IN_AIL, &lp->li_flags)) {
+ if (!xfs_ilock_nowait(ip2,
+ xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
+ xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+ if ((++attempts % 5) == 0)
+ delay(1); /* Don't just spin the CPU */
+ goto again;
+ }
+ } else
+ xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
+ /*
+ * We cannot use xfs_break_dax_layouts() directly here because it may
+ * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
+ * for this nested lock case.
+ */
+ page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
+ if (page && page_ref_count(page) != 1) {
+ xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+ xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+ goto again;
+ }
+
+ return 0;
+}
+
/*
* Lock two inodes so that userspace cannot initiate I/O via file syscalls or
* mmap activity.
@@ -3741,6 +3794,10 @@ xfs_ilock2_io_mmap(
ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
if (ret)
return ret;
+
+ if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2)))
+ return xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
+
if (ip1 == ip2)
xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
else
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 4b6703dbffb8..f1547330b087 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -456,6 +456,7 @@ enum xfs_prealloc_flags {

int xfs_update_prealloc_flags(struct xfs_inode *ip,
enum xfs_prealloc_flags flags);
+int xfs_break_dax_layouts(struct inode *inode, bool *retry);
int xfs_break_layouts(struct inode *inode, uint *iolock,
enum layout_break_reason reason);

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 13e461cf2055..86c737c2baeb 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1327,8 +1327,8 @@ xfs_reflink_remap_prep(
if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
goto out_unlock;

- /* Don't share DAX file data for now. */
- if (IS_DAX(inode_in) || IS_DAX(inode_out))
+ /* Don't share DAX file data with non-DAX file. */
+ if (IS_DAX(inode_in) != IS_DAX(inode_out))
goto out_unlock;

if (!IS_DAX(inode_in))
--
2.32.0
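The lock-ordering rule in xfs_mmaplock_two_inodes_and_break_dax_layout() — always take the lower-numbered inode first, and lock a self-dedupe pair only once — can be sketched in user space. The names below (`model_inode`, `model_lock_order`, `model_locks_needed`) are illustrative, not XFS API:

```c
#include <assert.h>

/* Minimal stand-in for an inode: only the inode number matters here. */
struct model_inode {
	unsigned long ino;
};

/* Mirrors the swap() at the top of the function: the inode with the
 * lower inode number is always locked first, which gives a global lock
 * order and avoids ABBA deadlocks between concurrent dedupe requests. */
static void model_lock_order(struct model_inode **ip1,
			     struct model_inode **ip2)
{
	if ((*ip1)->ino > (*ip2)->ino) {
		struct model_inode *tmp = *ip1;

		*ip1 = *ip2;
		*ip2 = tmp;
	}
}

/* How many MMAPLOCKs the dedupe path would take for a pair of inodes:
 * deduping a file against itself takes the lock only once. */
static int model_locks_needed(struct model_inode *a, struct model_inode *b)
{
	model_lock_order(&a, &b);
	return a == b ? 1 : 2;
}
```

The real function additionally retries with a trylock and a small delay when the second inode is pinned in the AIL, but the ordering invariant above is what makes the nested `xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1)` acquisition safe.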



2021-08-16 06:07:08

by Shiyang Ruan

Subject: [PATCH v7 3/8] fsdax: Replace mmap entry in case of CoW

Replace the existing entry with the newly allocated one in the CoW case.
Also, mark the entry as PAGECACHE_TAG_TOWRITE so that writeback marks
this entry as write-protected. This helps with snapshots: new write
page faults after a snapshot trigger a CoW.

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/dax.c | 39 ++++++++++++++++++++++++++++-----------
1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 697a7b7bb96f..e49ba68cc7e4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -734,6 +734,10 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
return 0;
}

+/* DAX Insert Flag: The state of the entry we insert */
+#define DAX_IF_DIRTY (1 << 0)
+#define DAX_IF_COW (1 << 1)
+
/*
* By this point grab_mapping_entry() has ensured that we have a locked entry
* of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -741,16 +745,19 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
* already in the tree, we will skip the insertion and just dirty the PMD as
* appropriate.
*/
-static void *dax_insert_entry(struct xa_state *xas,
- struct address_space *mapping, struct vm_fault *vmf,
- void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+ void *entry, pfn_t pfn, unsigned long flags,
+ unsigned int insert_flags)
{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
void *new_entry = dax_make_entry(pfn, flags);
+ bool dirty = insert_flags & DAX_IF_DIRTY;
+ bool cow = insert_flags & DAX_IF_COW;

if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

- if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+ if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -762,7 +769,7 @@ static void *dax_insert_entry(struct xa_state *xas,

xas_reset(xas);
xas_lock_irq(xas);
- if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+ if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;

dax_disassociate_entry(entry, mapping, false);
@@ -786,6 +793,9 @@ static void *dax_insert_entry(struct xa_state *xas,
if (dirty)
xas_set_mark(xas, PAGECACHE_TAG_DIRTY);

+ if (cow)
+ xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
xas_unlock_irq(xas);
return entry;
}
@@ -1121,8 +1131,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
vm_fault_t ret;

- *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
- DAX_ZERO_PAGE, false);
+ *entry = dax_insert_entry(xas, vmf, *entry, pfn, DAX_ZERO_PAGE, 0);

ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
trace_dax_load_hole(inode, vmf, ret);
@@ -1149,8 +1158,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
goto fallback;

pfn = page_to_pfn_t(zero_page);
- *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
- DAX_PMD | DAX_ZERO_PAGE, false);
+ *entry = dax_insert_entry(xas, vmf, *entry, pfn,
+ DAX_PMD | DAX_ZERO_PAGE, 0);

if (arch_needs_pgtable_deposit()) {
pgtable = pte_alloc_one(vma->vm_mm);
@@ -1461,6 +1470,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
bool write = vmf->flags & FAULT_FLAG_WRITE;
bool sync = dax_fault_is_synchronous(iter->flags, vmf->vma, iomap);
unsigned long entry_flags = pmd ? DAX_PMD : 0;
+ unsigned int insert_flags = 0;
int err = 0;
pfn_t pfn;
void *kaddr;
@@ -1485,8 +1495,15 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
if (err)
return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);

- *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
- write && !sync);
+ if (write) {
+ if (!sync)
+ insert_flags |= DAX_IF_DIRTY;
+ if (iomap->flags & IOMAP_F_SHARED)
+ insert_flags |= DAX_IF_COW;
+ }
+
+ *entry = dax_insert_entry(xas, vmf, *entry, pfn, entry_flags,
+ insert_flags);

if (write &&
srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
--
2.32.0
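The insert_flags derivation this patch adds to dax_fault_iter() can be checked in isolation. Below is a user-space sketch of that logic; `IOMAP_F_SHARED_MODEL` stands in for the kernel's IOMAP_F_SHARED, and the flag values match the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Same values as the patch's DAX insert flags. */
#define DAX_IF_DIRTY		(1 << 0)
#define DAX_IF_COW		(1 << 1)

/* Stand-in for the kernel's IOMAP_F_SHARED iomap flag. */
#define IOMAP_F_SHARED_MODEL	(1 << 2)

/* Mirrors the logic added to dax_fault_iter(): the entry is dirtied
 * only for non-synchronous writes (sync faults dirty it later, after
 * fdatasync), and it is marked CoW only when the iomap is shared. */
static unsigned int dax_insert_flags(bool write, bool sync,
				     unsigned int iomap_flags)
{
	unsigned int insert_flags = 0;

	if (write) {
		if (!sync)
			insert_flags |= DAX_IF_DIRTY;
		if (iomap_flags & IOMAP_F_SHARED_MODEL)
			insert_flags |= DAX_IF_COW;
	}
	return insert_flags;
}
```

Read faults always produce an empty flag set, which is why dax_load_hole() and dax_pmd_load_hole() simply pass 0 after this change.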



2021-08-18 21:03:07

by Dan Williams

Subject: Re: [PATCH v7 1/8] fsdax: Output address in dax_iomap_pfn() and rename it

On Sun, Aug 15, 2021 at 11:04 PM Shiyang Ruan <[email protected]> wrote:
>
> Add address output in dax_iomap_pfn() in order to perform a memcpy() in
> CoW case. Since this function both output address and pfn, rename it to
> dax_iomap_direct_access().
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Ritesh Harjani <[email protected]>

Looks good to me:

Reviewed-by: Dan Williams <[email protected]>

2021-08-19 22:37:44

by Dan Williams

Subject: Re: [PATCH v7 2/8] fsdax: Introduce dax_iomap_cow_copy()

On Sun, Aug 15, 2021 at 11:04 PM Shiyang Ruan <[email protected]> wrote:
>
> In the case where the iomap is a write operation and iomap is not equal
> to srcmap after iomap_begin, we consider it is a CoW operation.
>
> The destance extent which iomap indicated is new allocated extent.

s/destance/destination/

That sentence is still hard to grok though, did it mean to say:

"In this case, the destination (iomap->addr) points to a newly
allocated extent."

> So, it is needed to copy the data from srcmap to new allocated extent.
> In theory, it is better to copy the head and tail ranges which is
> outside of the non-aligned area instead of copying the whole aligned
> range. But in dax page fault, it will always be an aligned range. So,
> we have to copy the whole range in this case.

s/we have to copy/copy/

>
> Signed-off-by: Shiyang Ruan <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Darrick J. Wong <[email protected]>
> ---
> fs/dax.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 84 insertions(+), 5 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 9fb6218f42be..697a7b7bb96f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1050,6 +1050,61 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
> return rc;
> }
>
> +/**
> + * dax_iomap_cow_copy(): Copy the data from source to destination before write.

s/():/() -/

...to be kernel-doc compliant

> + * @pos: address to do copy from.
> + * @length: size of copy operation.
> + * @align_size: aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
> + * @srcmap: iomap srcmap
> + * @daddr: destination address to copy to.
> + *
> + * This can be called from two places. Either during DAX write fault, to copy
> + * the length size data to daddr. Or, while doing normal DAX write operation,
> + * dax_iomap_actor() might call this to do the copy of either start or end
> + * unaligned address. In this case the rest of the copy of aligned ranges is

Which "this", the latter, or the former? Looks like the latter.

"In the latter case the rest of the copy..."

> + * taken care by dax_iomap_actor() itself.
> + * Also, note DAX fault will always result in aligned pos and pos + length.

Perhaps drop this sentence and just say:

"Either during DAX write fault (page aligned), to copy..."

...in that earlier sentence so this comment flows better.

> + */
> +static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
> + const struct iomap *srcmap, void *daddr)
> +{
> + loff_t head_off = pos & (align_size - 1);
> + size_t size = ALIGN(head_off + length, align_size);
> + loff_t end = pos + length;
> + loff_t pg_end = round_up(end, align_size);
> + bool copy_all = head_off == 0 && end == pg_end;
> + void *saddr = 0;
> + int ret = 0;
> +
> + ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
> + if (ret)
> + return ret;
> +
> + if (copy_all) {
> + ret = copy_mc_to_kernel(daddr, saddr, length);
> + return ret ? -EIO : 0;
> + }
> +
> + /* Copy the head part of the range. Note: we pass offset as length. */

I've re-read this a few times and this comment is not helping, why is
the offset used as the copy length?

> + if (head_off) {
> + ret = copy_mc_to_kernel(daddr, saddr, head_off);
> + if (ret)
> + return -EIO;
> + }
> +
> + /* Copy the tail part of the range */
> + if (end < pg_end) {
> + loff_t tail_off = head_off + length;
> + loff_t tail_len = pg_end - end;
> +
> + ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
> + tail_len);
> + if (ret)
> + return -EIO;
> + }
> + return 0;
> +}
> +
> /*
> * The user has performed a load from a hole in the file. Allocating a new
> * page in the file would cause excessive storage usage for workloads with
> @@ -1175,16 +1230,18 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> struct iov_iter *iter)
> {
> const struct iomap *iomap = &iomi->iomap;
> + const struct iomap *srcmap = &iomi->srcmap;
> loff_t length = iomap_length(iomi);
> loff_t pos = iomi->pos;
> struct block_device *bdev = iomap->bdev;
> struct dax_device *dax_dev = iomap->dax_dev;
> loff_t end = pos + length, done = 0;
> + bool write = iov_iter_rw(iter) == WRITE;
> ssize_t ret = 0;
> size_t xfer;
> int id;
>
> - if (iov_iter_rw(iter) == READ) {
> + if (!write) {
> end = min(end, i_size_read(iomi->inode));
> if (pos >= end)
> return 0;
> @@ -1193,7 +1250,12 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> return iov_iter_zero(min(length, end - pos), iter);
> }
>
> - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> + /*
> + * In DAX mode, we allow either pure overwrites of written extents, or

s/we allow/enforce/

> + * writes to unwritten extents as part of a copy-on-write operation.
> + */
> + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
> + !(iomap->flags & IOMAP_F_SHARED)))
> return -EIO;
>
> /*
> @@ -1232,6 +1294,14 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> break;
> }
>
> + if (write &&
> + srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
> + ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
> + kaddr);
> + if (ret)
> + break;
> + }
> +
> map_len = PFN_PHYS(map_len);
> kaddr += offset;
> map_len -= offset;
> @@ -1243,7 +1313,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> * validated via access_ok() in either vfs_read() or
> * vfs_write(), depending on which operation we are doing.
> */
> - if (iov_iter_rw(iter) == WRITE)
> + if (write)
> xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
> map_len, iter);
> else
> @@ -1385,6 +1455,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> {
> struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> const struct iomap *iomap = &iter->iomap;
> + const struct iomap *srcmap = &iter->srcmap;
> size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
> loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
> bool write = vmf->flags & FAULT_FLAG_WRITE;
> @@ -1392,6 +1463,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> unsigned long entry_flags = pmd ? DAX_PMD : 0;
> int err = 0;
> pfn_t pfn;
> + void *kaddr;
>
> if (!pmd && vmf->cow_page)
> return dax_fault_cow_page(vmf, iter);
> @@ -1404,18 +1476,25 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> return dax_pmd_load_hole(xas, vmf, iomap, entry);
> }
>
> - if (iomap->type != IOMAP_MAPPED) {
> + if (iomap->type != IOMAP_MAPPED && !(iomap->flags & IOMAP_F_SHARED)) {
> WARN_ON_ONCE(1);
> return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
> }
>
> - err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
> + err = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
> if (err)
> return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
>
> *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
> write && !sync);
>
> + if (write &&
> + srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
> + err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr);
> + if (err)
> + return dax_fault_return(err);
> + }
> +
> if (sync)
> return dax_fault_synchronous_pfnp(pfnp, pfn);
>
> --
> 2.32.0
>
>
>

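One way to see why the offset doubles as the head copy length: `daddr` points at the start of the aligned block, so the head copy covers [block start, pos), which is exactly `head_off` bytes. The range math from the patch can be modeled in user space (illustrative names; a sketch of the patch's arithmetic, not the kernel function):

```c
#include <assert.h>
#include <stdbool.h>

#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((unsigned long)(a) - 1))

/* The three sub-ranges dax_iomap_cow_copy() distinguishes inside one
 * aligned block: an optional head copy before pos, the caller-written
 * middle, and an optional tail copy after pos + length. */
struct cow_ranges {
	bool copy_all;			/* whole block copied in one go */
	unsigned long head_len;		/* bytes copied before pos */
	unsigned long tail_off;		/* tail copy offset within block */
	unsigned long tail_len;		/* bytes copied after pos+length */
};

static struct cow_ranges cow_copy_ranges(unsigned long pos,
					 unsigned long length,
					 unsigned long align_size)
{
	struct cow_ranges r;
	unsigned long head_off = pos & (align_size - 1);
	unsigned long end = pos + length;
	unsigned long pg_end = ALIGN_UP(end, align_size);

	r.copy_all = head_off == 0 && end == pg_end;
	/* head copy runs from block start up to pos: head_off bytes */
	r.head_len = head_off;
	r.tail_off = head_off + length;
	r.tail_len = end < pg_end ? pg_end - end : 0;
	return r;
}
```

For any unaligned write, head_len + length + tail_len adds up to a whole number of aligned blocks, so the destination extent ends up fully populated.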
2021-08-19 22:56:23

by Dan Williams

Subject: Re: [PATCH v7 3/8] fsdax: Replace mmap entry in case of CoW

On Sun, Aug 15, 2021 at 11:05 PM Shiyang Ruan <[email protected]> wrote:
>
> We replace the existing entry to the newly allocated one in case of CoW.
> Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
> entry as writeprotected. This helps us snapshots so new write
> pagefaults after snapshots trigger a CoW.
>
> Signed-off-by: Goldwyn Rodrigues <[email protected]>
> Signed-off-by: Shiyang Ruan <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Ritesh Harjani <[email protected]>
> Reviewed-by: Darrick J. Wong <[email protected]>
> ---
> fs/dax.c | 39 ++++++++++++++++++++++++++++-----------
> 1 file changed, 28 insertions(+), 11 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 697a7b7bb96f..e49ba68cc7e4 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -734,6 +734,10 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
> return 0;
> }
>
> +/* DAX Insert Flag: The state of the entry we insert */
> +#define DAX_IF_DIRTY (1 << 0)
> +#define DAX_IF_COW (1 << 1)
> +
> /*
> * By this point grab_mapping_entry() has ensured that we have a locked entry
> * of the appropriate size so we don't have to worry about downgrading PMDs to
> @@ -741,16 +745,19 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
> * already in the tree, we will skip the insertion and just dirty the PMD as
> * appropriate.
> */
> -static void *dax_insert_entry(struct xa_state *xas,
> - struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> +static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> + void *entry, pfn_t pfn, unsigned long flags,
> + unsigned int insert_flags)

I'm late, so feel free to ignore this style feedback, but what about
changing the signature to:

static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
		const struct iomap_iter *iter, void *entry, pfn_t pfn,
		unsigned long flags)


> {
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;

...and then calculate these flags from the source data. I'm just
reacting to "yet more flags".

So, take it or leave it,

Reviewed-by: Dan Williams <[email protected]>

2021-08-20 02:43:03

by Dan Williams

Subject: Re: [PATCH v7 4/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero

On Sun, Aug 15, 2021 at 11:04 PM Shiyang Ruan <[email protected]> wrote:
>
> Punch hole on a reflinked file needs dax_iomap_cow_copy() too.
> Otherwise, data in not aligned area will be not correct. So, add the
> srcmap to dax_iomap_zero() and replace memset() as dax_iomap_cow_copy().
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> Reviewed-by: Ritesh Harjani <[email protected]>
> Reviewed-by: Darrick J. Wong <[email protected]>
> ---
> fs/dax.c | 25 +++++++++++++++----------
> fs/iomap/buffered-io.c | 4 ++--
> include/linux/dax.h | 3 ++-
> 3 files changed, 19 insertions(+), 13 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index e49ba68cc7e4..91ceb518f66a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1198,7 +1198,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> }
> #endif /* CONFIG_FS_DAX_PMD */
>
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +s64 dax_iomap_zero(loff_t pos, u64 length, const struct iomap *iomap,
> + const struct iomap *srcmap)
> {
> sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
> pgoff_t pgoff;
> @@ -1220,19 +1221,23 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>
> if (page_aligned)
> rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
> - else
> + else {
> rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
> - if (rc < 0) {
> - dax_read_unlock(id);
> - return rc;
> - }
> -
> - if (!page_aligned) {
> - memset(kaddr + offset, 0, size);
> + if (rc < 0)
> + goto out;
> + if (iomap->addr != srcmap->addr) {
> + rc = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
> + kaddr);

Apologies, I'm confused, why is it ok to skip zeroing here?

2021-08-20 03:03:35

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 7/8] fsdax: Introduce dax_iomap_ops for end of reflink

On Sun, Aug 15, 2021 at 11:05 PM Shiyang Ruan <[email protected]> wrote:
>
> After writing data, reflink requires end operations to remap those newly
> allocated extents. The current ->iomap_end() ignores the error code
> returned from ->actor(), so we introduce dax_iomap_ops and change
> the dax_iomap_*() interfaces to do this job.
>
> - the dax_iomap_ops contains the original struct iomap_ops and fsdax
> specific ->actor_end(), which is for the end operations of reflink
> - also introduce dax specific zero_range, truncate_page
> - create new dax_iomap_ops for ext2 and ext4
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> fs/dax.c | 68 +++++++++++++++++++++++++++++++++++++-----
> fs/ext2/ext2.h | 3 ++
> fs/ext2/file.c | 6 ++--
> fs/ext2/inode.c | 11 +++++--
> fs/ext4/ext4.h | 3 ++
> fs/ext4/file.c | 6 ++--
> fs/ext4/inode.c | 13 ++++++--
> fs/iomap/buffered-io.c | 3 +-
> fs/xfs/xfs_bmap_util.c | 3 +-
> fs/xfs/xfs_file.c | 8 ++---
> fs/xfs/xfs_iomap.c | 36 +++++++++++++++++++++-
> fs/xfs/xfs_iomap.h | 33 ++++++++++++++++++++
> fs/xfs/xfs_iops.c | 7 ++---
> fs/xfs/xfs_reflink.c | 3 +-
> include/linux/dax.h | 21 ++++++++++---
> include/linux/iomap.h | 1 +
> 16 files changed, 189 insertions(+), 36 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 74dd918cff1f..0e0536765a7e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1348,11 +1348,30 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> return done ? done : ret;
> }
>
> +static inline int
> +__dax_iomap_iter(struct iomap_iter *iter, const struct dax_iomap_ops *ops)
> +{
> + int ret;
> +
> + /*
> + * Call dax_iomap_ops->actor_end() before iomap_ops->iomap_end() in
> + * each iteration.
> + */
> + if (iter->iomap.length && ops->actor_end) {
> + ret = ops->actor_end(iter->inode, iter->pos, iter->len,
> + iter->processed);
> + if (ret < 0)
> + return ret;
> + }
> +
> + return iomap_iter(iter, &ops->iomap_ops);

This reorganization looks needlessly noisy. Why not require the
iomap_end operation to perform the actor_end work. I.e. why can't
xfs_dax_write_iomap_actor_end() just be the passed in iomap_end? I am
not seeing where the ->iomap_end() result is ignored?

2021-08-20 03:10:42

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 8/8] fs/xfs: Add dax dedupe support

On Sun, Aug 15, 2021 at 11:05 PM Shiyang Ruan <[email protected]> wrote:
>
> Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for DAX files
> that are going to be deduped. After that, call the compare range function
> only when both files are DAX or both are non-DAX.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> Reviewed-by: Darrick J. Wong <[email protected]>
> ---
> fs/xfs/xfs_file.c | 2 +-
> fs/xfs/xfs_inode.c | 57 ++++++++++++++++++++++++++++++++++++++++++++
> fs/xfs/xfs_inode.h | 1 +
> fs/xfs/xfs_reflink.c | 4 ++--
> 4 files changed, 61 insertions(+), 3 deletions(-)
[..]
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 13e461cf2055..86c737c2baeb 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1327,8 +1327,8 @@ xfs_reflink_remap_prep(
> if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
> goto out_unlock;
>
> - /* Don't share DAX file data for now. */
> - if (IS_DAX(inode_in) || IS_DAX(inode_out))
> + /* Don't share DAX file data with non-DAX file. */
> + if (IS_DAX(inode_in) != IS_DAX(inode_out))
> goto out_unlock;

What if you have 2 DAX inodes sharing data and one is flipped to
non-DAX? Does that operation need to first go undo all sharing?

2021-08-20 15:20:37

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 7/8] fsdax: Introduce dax_iomap_ops for end of reflink

On Thu, Aug 19, 2021 at 11:13 PM ruansy.fnst <[email protected]> wrote:
>
>
>
> > On 2021/8/20 11:01 AM, Dan Williams wrote:
> > On Sun, Aug 15, 2021 at 11:05 PM Shiyang Ruan <[email protected]> wrote:
> >>
> >> After writing data, reflink requires end operations to remap those new
> >> allocated extents. The current ->iomap_end() ignores the error code
> >> returned from ->actor(), so we introduce this dax_iomap_ops and change
> >> the dax_iomap_*() interfaces to do this job.
> >>
> >> - the dax_iomap_ops contains the original struct iomap_ops and fsdax
> >> specific ->actor_end(), which is for the end operations of reflink
> >> - also introduce dax specific zero_range, truncate_page
> >> - create new dax_iomap_ops for ext2 and ext4
> >>
> >> Signed-off-by: Shiyang Ruan <[email protected]>
> >> ---
> >> fs/dax.c | 68 +++++++++++++++++++++++++++++++++++++-----
> >> fs/ext2/ext2.h | 3 ++
> >> fs/ext2/file.c | 6 ++--
> >> fs/ext2/inode.c | 11 +++++--
> >> fs/ext4/ext4.h | 3 ++
> >> fs/ext4/file.c | 6 ++--
> >> fs/ext4/inode.c | 13 ++++++--
> >> fs/iomap/buffered-io.c | 3 +-
> >> fs/xfs/xfs_bmap_util.c | 3 +-
> >> fs/xfs/xfs_file.c | 8 ++---
> >> fs/xfs/xfs_iomap.c | 36 +++++++++++++++++++++-
> >> fs/xfs/xfs_iomap.h | 33 ++++++++++++++++++++
> >> fs/xfs/xfs_iops.c | 7 ++---
> >> fs/xfs/xfs_reflink.c | 3 +-
> >> include/linux/dax.h | 21 ++++++++++---
> >> include/linux/iomap.h | 1 +
> >> 16 files changed, 189 insertions(+), 36 deletions(-)
> >>
> >> diff --git a/fs/dax.c b/fs/dax.c
> >> index 74dd918cff1f..0e0536765a7e 100644
> >> --- a/fs/dax.c
> >> +++ b/fs/dax.c
> >> @@ -1348,11 +1348,30 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> >> return done ? done : ret;
> >> }
> >>
> >> +static inline int
> >> +__dax_iomap_iter(struct iomap_iter *iter, const struct dax_iomap_ops *ops)
> >> +{
> >> + int ret;
> >> +
> >> + /*
> >> + * Call dax_iomap_ops->actor_end() before iomap_ops->iomap_end() in
> >> + * each iteration.
> >> + */
> >> + if (iter->iomap.length && ops->actor_end) {
> >> + ret = ops->actor_end(iter->inode, iter->pos, iter->len,
> >> + iter->processed);
> >> + if (ret < 0)
> >> + return ret;
> >> + }
> >> +
> >> + return iomap_iter(iter, &ops->iomap_ops);
> >
> > This reorganization looks needlessly noisy. Why not require the
> > iomap_end operation to perform the actor_end work. I.e. why can't
> > xfs_dax_write_iomap_actor_end() just be the passed in iomap_end? I am
> > not seeing where the ->iomap_end() result is ignored?
> >
>
> The V6 patch[1] was done this way.
> [1]https://lore.kernel.org/linux-xfs/20210526005159.GF202144@locust/T/#m79a66a928da2d089e2458c1a97c0516dbfde2f7f
>
> But Darrick reminded me that ->iomap_end() will always take zero or
> positive 'written' because iomap_apply() handles this argument.
>
> ```
> if (ops->iomap_end) {
> ret = ops->iomap_end(inode, pos, length,
> written > 0 ? written : 0,
> flags, &iomap);
> }
> ```
>
> So, we cannot get the actual return code from CoW in ->actor(), and as a
> result, we cannot handle the xfs end_cow correctly in ->iomap_end().
> That's where the result of CoW was ignored.

Ah, thank you for the explanation.

However, this still seems like too much code thrash just to get back
to the original value of iter->processed. I notice you are talking
about iomap_apply(), but that routine is now gone in Darrick's latest
iomap-for-next branch. Instead iomap_iter() does this:

if (iter->iomap.length && ops->iomap_end) {
ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
iter->processed > 0 ? iter->processed : 0,
iter->flags, &iter->iomap);
if (ret < 0 && !iter->processed)
return ret;
}


I notice that the @iomap argument to ->iomap_end() is reliably coming
from @iter. So you could do the following in your iomap_end()
callback:

struct iomap_iter *iter = container_of(iomap, typeof(*iter), iomap);
struct xfs_inode *ip = XFS_I(inode);
ssize_t written = iter->processed;
bool cow = xfs_is_cow_inode(ip);

if (cow) {
if (written <= 0)
xfs_reflink_cancel_cow_range(ip, pos, length, true)
}

2021-08-23 12:59:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] fsdax: Replace mmap entry in case of CoW

On Thu, Aug 19, 2021 at 03:54:01PM -0700, Dan Williams wrote:
>
> static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> const struct iomap_iter *iter, void
> *entry, pfn_t pfn,
> unsigned long flags)
>
>
> > {
> > + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> > void *new_entry = dax_make_entry(pfn, flags);
> > + bool dirty = insert_flags & DAX_IF_DIRTY;
> > + bool cow = insert_flags & DAX_IF_COW;
>
> ...and then calculate these flags from the source data. I'm just
> reacting to "yet more flags".

Except for the overly long line above, that seems like a good idea.
The iomap_iter didn't exist for most of the time this patch has been
around.

2021-08-23 13:00:08

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] iomap: Introduce iomap_iter2 for two files

On Mon, Aug 16, 2021 at 02:03:56PM +0800, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, require nested iomap_begin()/iomap_end() calls on two files.
> Thus, we introduce iomap_iter2(), which accepts two iterators and
> operates on both files.

We really should not need both. We should be able to be in two
nested loops, just operating on the part that both cover (for which
a helper might be useful).

2021-08-23 13:05:43

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 7/8] fsdax: Introduce dax_iomap_ops for end of reflink

On Fri, Aug 20, 2021 at 08:18:44AM -0700, Dan Williams wrote:
> I notice that the @iomap argument to ->iomap_end() is reliably coming
> from @iter. So you could do the following in your iomap_end()
> callback:
>
> struct iomap_iter *iter = container_of(iomap, typeof(*iter), iomap);
> struct xfs_inode *ip = XFS_I(inode);
> ssize_t written = iter->processed;
> bool cow = xfs_is_cow_inode(ip);
>
> if (cow) {
> if (written <= 0)
> xfs_reflink_cancel_cow_range(ip, pos, length, true)
> }

I think this might be ok for now (with a big comment). Willy's original
iomap iter series replaced the iomap_begin and iomap_end with a single
next callback that takes the iomap_iter, which would solve this issue.
My plan is to look into a series that implements this and sees if this
is indeed a net win.

2021-08-23 13:18:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 6/8] fsdax: Dedup file range to use a compare function

On Mon, Aug 16, 2021 at 02:03:57PM +0800, Shiyang Ruan wrote:
> + id = dax_read_lock();
> + while ((ret = iomap_iter2(&it_src, &it_dest, ops)) > 0) {
> + it_src.processed = it_dest.processed =
> + dax_range_compare_iter(&it_src, &it_dest, same);
> + }
> + dax_read_unlock(id);

I think it would be better to move the DAX locking into
dax_range_compare_iter to avoid very long hold times.

> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);

No need for the export.

> +EXPORT_SYMBOL(dax_remap_file_range_prep);

EXPORT_SYMBOL_GPL, please.

Attached is a totally untested patch that just has two levels of
iterations instead of the new iter2 helper:


diff --git a/fs/dax.c b/fs/dax.c
index 0e0536765a7efc..2b65471785290d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1885,6 +1885,7 @@ static loff_t dax_range_compare_iter(struct iomap_iter *it_src,
loff_t len = min(smap->length, dmap->length);
void *saddr, *daddr;
int ret;
+ int id;

if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
*same = true;
@@ -1896,47 +1897,56 @@ static loff_t dax_range_compare_iter(struct iomap_iter *it_src,
return 0;
}

+ id = dax_read_lock();
ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
&saddr, NULL);
if (ret < 0)
- return -EIO;
+ goto out_unlock;

ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
&daddr, NULL);
if (ret < 0)
- return -EIO;
+ goto out_unlock;

*same = !memcmp(saddr, daddr, len);
if (!*same)
- return 0;
+ len = 0;
+ dax_read_unlock(id);
return len;
+
+out_unlock:
+ dax_read_unlock(id);
+ return -EIO;
}

int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
- struct inode *dest, loff_t destoff, loff_t len, bool *same,
+ struct inode *dst, loff_t dstoff, loff_t len, bool *same,
const struct iomap_ops *ops)
{
- struct iomap_iter it_src = {
+ struct iomap_iter src_iter = {
.inode = src,
.pos = srcoff,
.len = len,
};
- struct iomap_iter it_dest = {
- .inode = dest,
- .pos = destoff,
+ struct iomap_iter dst_iter = {
+ .inode = dst,
+ .pos = dstoff,
.len = len,
};
- int id, ret;
+ int ret;

- id = dax_read_lock();
- while ((ret = iomap_iter2(&it_src, &it_dest, ops)) > 0) {
- it_src.processed = it_dest.processed =
- dax_range_compare_iter(&it_src, &it_dest, same);
+ while ((ret = iomap_iter(&src_iter, ops)) > 0) {
+ while ((ret = iomap_iter(&dst_iter, ops)) > 0) {
+ dst_iter.processed = dax_range_compare_iter(&src_iter,
+ &dst_iter, same);
+ if (dst_iter.processed > 0)
+ src_iter.processed += dst_iter.processed;
+ else if (!src_iter.processed)
+ src_iter.processed = dst_iter.processed;
+ }
}
- dax_read_unlock(id);
return ret;
}
-EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);

int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
@@ -1946,4 +1956,4 @@ int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
return __generic_remap_file_range_prep(file_in, pos_in, file_out,
pos_out, len, remap_flags, ops);
}
-EXPORT_SYMBOL(dax_remap_file_range_prep);
+EXPORT_SYMBOL_GPL(dax_remap_file_range_prep);

2021-08-27 03:23:39

by Shiyang Ruan

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] fsdax: Replace mmap entry in case of CoW



On 2021/8/23 20:57, Christoph Hellwig wrote:
> On Thu, Aug 19, 2021 at 03:54:01PM -0700, Dan Williams wrote:
>>
>> static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
>> const struct iomap_iter *iter, void
>> *entry, pfn_t pfn,
>> unsigned long flags)
>>
>>
>>> {
>>> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>>> void *new_entry = dax_make_entry(pfn, flags);
>>> + bool dirty = insert_flags & DAX_IF_DIRTY;
>>> + bool cow = insert_flags & DAX_IF_COW;
>>
>> ...and then calculate these flags from the source data. I'm just
>> reacting to "yet more flags".
>
> Except for the overly long line above that seems like a good idea.
> The iomap_iter didn't exist for most of the time this patch has been
> around.
>

So should I reuse the iter->flags to pass the insert_flags? (left-shift
them into the higher bits)

--
Thanks,
Ruan.


2021-08-27 03:26:04

by Shiyang Ruan

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero



On 2021/8/20 10:39, Dan Williams wrote:
> On Sun, Aug 15, 2021 at 11:04 PM Shiyang Ruan <[email protected]> wrote:
>>
>> Punch hole on a reflinked file needs dax_iomap_cow_copy() too.
>> Otherwise, data in not aligned area will be not correct. So, add the
>> srcmap to dax_iomap_zero() and replace memset() as dax_iomap_cow_copy().
>>
>> Signed-off-by: Shiyang Ruan <[email protected]>
>> Reviewed-by: Ritesh Harjani <[email protected]>
>> Reviewed-by: Darrick J. Wong <[email protected]>
>> ---
>> fs/dax.c | 25 +++++++++++++++----------
>> fs/iomap/buffered-io.c | 4 ++--
>> include/linux/dax.h | 3 ++-
>> 3 files changed, 19 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index e49ba68cc7e4..91ceb518f66a 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -1198,7 +1198,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>> }
>> #endif /* CONFIG_FS_DAX_PMD */
>>
>> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>> +s64 dax_iomap_zero(loff_t pos, u64 length, const struct iomap *iomap,
>> + const struct iomap *srcmap)
>> {
>> sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>> pgoff_t pgoff;
>> @@ -1220,19 +1221,23 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>>
>> if (page_aligned)
>> rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
>> - else
>> + else {
>> rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
>> - if (rc < 0) {
>> - dax_read_unlock(id);
>> - return rc;
>> - }
>> -
>> - if (!page_aligned) {
>> - memset(kaddr + offset, 0, size);
>> + if (rc < 0)
>> + goto out;
>> + if (iomap->addr != srcmap->addr) {
>> + rc = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
>> + kaddr);
>
> Apologies, I'm confused, why is it ok to skip zeroing here?
>

That was a mistake. It will be fixed in the next version.


--
Thanks,
Ruan.


2021-08-27 03:31:03

by Shiyang Ruan

[permalink] [raw]
Subject: Re: [PATCH v7 7/8] fsdax: Introduce dax_iomap_ops for end of reflink



On 2021/8/20 23:18, Dan Williams wrote:
> On Thu, Aug 19, 2021 at 11:13 PM ruansy.fnst <[email protected]> wrote:
>>
>>
>>
>> On 2021/8/20 11:01 AM, Dan Williams wrote:
>>> On Sun, Aug 15, 2021 at 11:05 PM Shiyang Ruan <[email protected]> wrote:
>>>>
>>>> After writing data, reflink requires end operations to remap those new
>>>> allocated extents. The current ->iomap_end() ignores the error code
>>>> returned from ->actor(), so we introduce this dax_iomap_ops and change
>>>> the dax_iomap_*() interfaces to do this job.
>>>>
>>>> - the dax_iomap_ops contains the original struct iomap_ops and fsdax
>>>> specific ->actor_end(), which is for the end operations of reflink
>>>> - also introduce dax specific zero_range, truncate_page
>>>> - create new dax_iomap_ops for ext2 and ext4
>>>>
>>>> Signed-off-by: Shiyang Ruan <[email protected]>
>>>> ---
>>>> fs/dax.c | 68 +++++++++++++++++++++++++++++++++++++-----
>>>> fs/ext2/ext2.h | 3 ++
>>>> fs/ext2/file.c | 6 ++--
>>>> fs/ext2/inode.c | 11 +++++--
>>>> fs/ext4/ext4.h | 3 ++
>>>> fs/ext4/file.c | 6 ++--
>>>> fs/ext4/inode.c | 13 ++++++--
>>>> fs/iomap/buffered-io.c | 3 +-
>>>> fs/xfs/xfs_bmap_util.c | 3 +-
>>>> fs/xfs/xfs_file.c | 8 ++---
>>>> fs/xfs/xfs_iomap.c | 36 +++++++++++++++++++++-
>>>> fs/xfs/xfs_iomap.h | 33 ++++++++++++++++++++
>>>> fs/xfs/xfs_iops.c | 7 ++---
>>>> fs/xfs/xfs_reflink.c | 3 +-
>>>> include/linux/dax.h | 21 ++++++++++---
>>>> include/linux/iomap.h | 1 +
>>>> 16 files changed, 189 insertions(+), 36 deletions(-)
>>>>
>>>> diff --git a/fs/dax.c b/fs/dax.c
>>>> index 74dd918cff1f..0e0536765a7e 100644
>>>> --- a/fs/dax.c
>>>> +++ b/fs/dax.c
>>>> @@ -1348,11 +1348,30 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>>>> return done ? done : ret;
>>>> }
>>>>
>>>> +static inline int
>>>> +__dax_iomap_iter(struct iomap_iter *iter, const struct dax_iomap_ops *ops)
>>>> +{
>>>> + int ret;
>>>> +
>>>> + /*
>>>> + * Call dax_iomap_ops->actor_end() before iomap_ops->iomap_end() in
>>>> + * each iteration.
>>>> + */
>>>> + if (iter->iomap.length && ops->actor_end) {
>>>> + ret = ops->actor_end(iter->inode, iter->pos, iter->len,
>>>> + iter->processed);
>>>> + if (ret < 0)
>>>> + return ret;
>>>> + }
>>>> +
>>>> + return iomap_iter(iter, &ops->iomap_ops);
>>>
>>> This reorganization looks needlessly noisy. Why not require the
>>> iomap_end operation to perform the actor_end work. I.e. why can't
>>> xfs_dax_write_iomap_actor_end() just be the passed in iomap_end? I am
>>> not seeing where the ->iomap_end() result is ignored?
>>>
>>
>> The V6 patch[1] was done this way.
>> [1]https://lore.kernel.org/linux-xfs/20210526005159.GF202144@locust/T/#m79a66a928da2d089e2458c1a97c0516dbfde2f7f
>>
>> But Darrick reminded me that ->iomap_end() will always take zero or
>> positive 'written' because iomap_apply() handles this argument.
>>
>> ```
>> if (ops->iomap_end) {
>> ret = ops->iomap_end(inode, pos, length,
>> written > 0 ? written : 0,
>> flags, &iomap);
>> }
>> ```
>>
>> So, we cannot get actual return code from CoW in ->actor(), and as a
>> result, we cannot handle the xfs end_cow correctly in ->iomap_end().
>> That's where the result of CoW was ignored.
>
> Ah, thank you for the explanation.
>
> However, this still seems like too much code thrash just to get back
> to the original value of iter->processed. I notice you are talking
> about iomap_apply(), but that routine is now gone in Darrick's latest
> iomap-for-next branch. Instead iomap_iter() does this:
>
> if (iter->iomap.length && ops->iomap_end) {
> ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
> iter->processed > 0 ? iter->processed : 0,

As you can see, this is the same logic as in the old iomap_apply(): a
negative iter->processed won't be passed into ->iomap_end().

> iter->flags, &iter->iomap);
> if (ret < 0 && !iter->processed)
> return ret;
> }
>
>
> I notice that the @iomap argument to ->iomap_end() is reliably coming
> from @iter. So you could do the following in your iomap_end()
> callback:
>
> struct iomap_iter *iter = container_of(iomap, typeof(*iter), iomap);
> struct xfs_inode *ip = XFS_I(inode);
> ssize_t written = iter->processed;

The 'written' value will be 0 or positive; the original error code is ignored.

> bool cow = xfs_is_cow_inode(ip);
>
> if (cow) {
> if (written <= 0)
> xfs_reflink_cancel_cow_range(ip, pos, length, true)
> }
>

--
Thanks,
Ruan.


2021-08-27 03:37:21

by Shiyang Ruan

[permalink] [raw]
Subject: Re: [PATCH v7 8/8] fs/xfs: Add dax dedupe support



On 2021/8/20 11:08, Dan Williams wrote:
> On Sun, Aug 15, 2021 at 11:05 PM Shiyang Ruan <[email protected]> wrote:
>>
>> Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
>> who are going to be deduped. After that, call compare range function
>> only when files are both DAX or not.
>>
>> Signed-off-by: Shiyang Ruan <[email protected]>
>> Reviewed-by: Darrick J. Wong <[email protected]>
>> ---
>> fs/xfs/xfs_file.c | 2 +-
>> fs/xfs/xfs_inode.c | 57 ++++++++++++++++++++++++++++++++++++++++++++
>> fs/xfs/xfs_inode.h | 1 +
>> fs/xfs/xfs_reflink.c | 4 ++--
>> 4 files changed, 61 insertions(+), 3 deletions(-)
> [..]
>> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
>> index 13e461cf2055..86c737c2baeb 100644
>> --- a/fs/xfs/xfs_reflink.c
>> +++ b/fs/xfs/xfs_reflink.c
>> @@ -1327,8 +1327,8 @@ xfs_reflink_remap_prep(
>> if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
>> goto out_unlock;
>>
>> - /* Don't share DAX file data for now. */
>> - if (IS_DAX(inode_in) || IS_DAX(inode_out))
>> + /* Don't share DAX file data with non-DAX file. */
>> + if (IS_DAX(inode_in) != IS_DAX(inode_out))
>> goto out_unlock;
>
> What if you have 2 DAX inodes sharing data and one is flipped to
> non-DAX? Does that operation need to first go undo all sharing?
>

Yes, I think we need to unshare the extents when the DAX flag of the
file is changed. I'll look into it.


--
Thanks,
Ruan.


2021-08-27 05:02:26

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] fsdax: Replace mmap entry in case of CoW

On Thu, Aug 26, 2021 at 8:22 PM Shiyang Ruan <[email protected]> wrote:
>
>
>
> On 2021/8/23 20:57, Christoph Hellwig wrote:
> > On Thu, Aug 19, 2021 at 03:54:01PM -0700, Dan Williams wrote:
> >>
> >> static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> >> const struct iomap_iter *iter, void
> >> *entry, pfn_t pfn,
> >> unsigned long flags)
> >>
> >>
> >>> {
> >>> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> >>> void *new_entry = dax_make_entry(pfn, flags);
> >>> + bool dirty = insert_flags & DAX_IF_DIRTY;
> >>> + bool cow = insert_flags & DAX_IF_COW;
> >>
> >> ...and then calculate these flags from the source data. I'm just
> >> reacting to "yet more flags".
> >
> > Except for the overly long line above that seems like a good idea.
> > The iomap_iter didn't exist for most of the time this patch has been
> > around.
> >
>
> So should I reuse the iter->flags to pass the insert_flags? (left shift
> it to higher bits)

No, the advice is to just pass the @iter to dax_insert_entry directly
and calculate @dirty and @cow internally.

2021-08-27 05:08:46

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v7 7/8] fsdax: Introduce dax_iomap_ops for end of reflink

On Thu, Aug 26, 2021 at 8:30 PM Shiyang Ruan <[email protected]> wrote:
>
>
>
> On 2021/8/20 23:18, Dan Williams wrote:
> > On Thu, Aug 19, 2021 at 11:13 PM ruansy.fnst <[email protected]> wrote:
> >>
> >>
> >>
> >> On 2021/8/20 11:01 AM, Dan Williams wrote:
> >>> On Sun, Aug 15, 2021 at 11:05 PM Shiyang Ruan <[email protected]> wrote:
> >>>>
> >>>> After writing data, reflink requires end operations to remap those new
> >>>> allocated extents. The current ->iomap_end() ignores the error code
> >>>> returned from ->actor(), so we introduce this dax_iomap_ops and change
> >>>> the dax_iomap_*() interfaces to do this job.
> >>>>
> >>>> - the dax_iomap_ops contains the original struct iomap_ops and fsdax
> >>>> specific ->actor_end(), which is for the end operations of reflink
> >>>> - also introduce dax specific zero_range, truncate_page
> >>>> - create new dax_iomap_ops for ext2 and ext4
> >>>>
> >>>> Signed-off-by: Shiyang Ruan <[email protected]>
> >>>> ---
> >>>> fs/dax.c | 68 +++++++++++++++++++++++++++++++++++++-----
> >>>> fs/ext2/ext2.h | 3 ++
> >>>> fs/ext2/file.c | 6 ++--
> >>>> fs/ext2/inode.c | 11 +++++--
> >>>> fs/ext4/ext4.h | 3 ++
> >>>> fs/ext4/file.c | 6 ++--
> >>>> fs/ext4/inode.c | 13 ++++++--
> >>>> fs/iomap/buffered-io.c | 3 +-
> >>>> fs/xfs/xfs_bmap_util.c | 3 +-
> >>>> fs/xfs/xfs_file.c | 8 ++---
> >>>> fs/xfs/xfs_iomap.c | 36 +++++++++++++++++++++-
> >>>> fs/xfs/xfs_iomap.h | 33 ++++++++++++++++++++
> >>>> fs/xfs/xfs_iops.c | 7 ++---
> >>>> fs/xfs/xfs_reflink.c | 3 +-
> >>>> include/linux/dax.h | 21 ++++++++++---
> >>>> include/linux/iomap.h | 1 +
> >>>> 16 files changed, 189 insertions(+), 36 deletions(-)
> >>>>
> >>>> diff --git a/fs/dax.c b/fs/dax.c
> >>>> index 74dd918cff1f..0e0536765a7e 100644
> >>>> --- a/fs/dax.c
> >>>> +++ b/fs/dax.c
> >>>> @@ -1348,11 +1348,30 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> >>>> return done ? done : ret;
> >>>> }
> >>>>
> >>>> +static inline int
> >>>> +__dax_iomap_iter(struct iomap_iter *iter, const struct dax_iomap_ops *ops)
> >>>> +{
> >>>> + int ret;
> >>>> +
> >>>> + /*
> >>>> + * Call dax_iomap_ops->actor_end() before iomap_ops->iomap_end() in
> >>>> + * each iteration.
> >>>> + */
> >>>> + if (iter->iomap.length && ops->actor_end) {
> >>>> + ret = ops->actor_end(iter->inode, iter->pos, iter->len,
> >>>> + iter->processed);
> >>>> + if (ret < 0)
> >>>> + return ret;
> >>>> + }
> >>>> +
> >>>> + return iomap_iter(iter, &ops->iomap_ops);
> >>>
> >>> This reorganization looks needlessly noisy. Why not require the
> >>> iomap_end operation to perform the actor_end work. I.e. why can't
> >>> xfs_dax_write_iomap_actor_end() just be the passed in iomap_end? I am
> >>> not seeing where the ->iomap_end() result is ignored?
> >>>
> >>
> >> The V6 patch[1] was done this way.
> >> [1]https://lore.kernel.org/linux-xfs/20210526005159.GF202144@locust/T/#m79a66a928da2d089e2458c1a97c0516dbfde2f7f
> >>
> >> But Darrick reminded me that ->iomap_end() will always take zero or
> >> positive 'written' because iomap_apply() handles this argument.
> >>
> >> ```
> >> if (ops->iomap_end) {
> >> ret = ops->iomap_end(inode, pos, length,
> >> written > 0 ? written : 0,
> >> flags, &iomap);
> >> }
> >> ```
> >>
> >> So, we cannot get actual return code from CoW in ->actor(), and as a
> >> result, we cannot handle the xfs end_cow correctly in ->iomap_end().
> >> That's where the result of CoW was ignored.
> >
> > Ah, thank you for the explanation.
> >
> > However, this still seems like too much code thrash just to get back
> > to the original value of iter->processed. I notice you are talking
> > about iomap_apply(), but that routine is now gone in Darrick's latest
> > iomap-for-next branch. Instead iomap_iter() does this:
> >
> > if (iter->iomap.length && ops->iomap_end) {
> > ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
> > iter->processed > 0 ? iter->processed : 0,
>
> As you can see, here is the same logic as the old iomap_apply(): the
> negative iter->processed won't be passed into ->iomap_end().
>
> > iter->flags, &iter->iomap);
> > if (ret < 0 && !iter->processed)
> > return ret;
> > }
> >
> >
> > I notice that the @iomap argument to ->iomap_end() is reliably coming
> > from @iter. So you could do the following in your iomap_end()
> > callback:
> >
> > struct iomap_iter *iter = container_of(iomap, typeof(*iter), iomap);
> > struct xfs_inode *ip = XFS_I(inode);
> > ssize_t written = iter->processed;
>
> The 'written' value will be 0 or positive; the original error code is ignored.

Correct, but you can use container_of() to get back to the iter and
consider the raw untranslated value of iter->processed. As Christoph
mentioned this needs a comment explaining the layering violation, but
that's a cleaner change than the dax_iomap_ops approach.

2021-08-27 05:28:51

by Shiyang Ruan

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] fsdax: Replace mmap entry in case of CoW



On 2021/8/27 13:00, Dan Williams wrote:
> On Thu, Aug 26, 2021 at 8:22 PM Shiyang Ruan <[email protected]> wrote:
>>
>>
>>
>> On 2021/8/23 20:57, Christoph Hellwig wrote:
>>> On Thu, Aug 19, 2021 at 03:54:01PM -0700, Dan Williams wrote:
>>>>
>>>> static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
>>>> const struct iomap_iter *iter, void
>>>> *entry, pfn_t pfn,
>>>> unsigned long flags)
>>>>
>>>>
>>>>> {
>>>>> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>>>>> void *new_entry = dax_make_entry(pfn, flags);
>>>>> + bool dirty = insert_flags & DAX_IF_DIRTY;
>>>>> + bool cow = insert_flags & DAX_IF_COW;
>>>>
>>>> ...and then calculate these flags from the source data. I'm just
>>>> reacting to "yet more flags".
>>>
>>> Except for the overly long line above that seems like a good idea.
>>> The iomap_iter didn't exist for most of the time this patch has been
>>> around.
>>>
>>
>> So should I reuse the iter->flags to pass the insert_flags? (left shift
>> it to higher bits)
>
> No, the advice is to just pass the @iter to dax_insert_entry directly
> and calculate @dirty and @cow internally.
>

I see. Yes, they can be calculated inside dax_insert_entry() because it
already has enough arguments.


--
Thanks,
Ruan.


2021-08-27 05:30:25

by Shiyang Ruan

[permalink] [raw]
Subject: Re: [PATCH v7 7/8] fsdax: Introduce dax_iomap_ops for end of reflink



On 2021/8/27 13:04, Dan Williams wrote:
> On Thu, Aug 26, 2021 at 8:30 PM Shiyang Ruan <[email protected]> wrote:
>>
>>
>>
>> On 2021/8/20 23:18, Dan Williams wrote:
>>> On Thu, Aug 19, 2021 at 11:13 PM ruansy.fnst <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 2021/8/20 11:01 AM, Dan Williams wrote:
>>>>> On Sun, Aug 15, 2021 at 11:05 PM Shiyang Ruan <[email protected]> wrote:
>>>>>>
>>>>>> After writing data, reflink requires end operations to remap those new
>>>>>> allocated extents. The current ->iomap_end() ignores the error code
>>>>>> returned from ->actor(), so we introduce this dax_iomap_ops and change
>>>>>> the dax_iomap_*() interfaces to do this job.
>>>>>>
>>>>>> - the dax_iomap_ops contains the original struct iomap_ops and fsdax
>>>>>> specific ->actor_end(), which is for the end operations of reflink
>>>>>> - also introduce dax specific zero_range, truncate_page
>>>>>> - create new dax_iomap_ops for ext2 and ext4
>>>>>>
>>>>>> Signed-off-by: Shiyang Ruan <[email protected]>
>>>>>> ---
>>>>>> fs/dax.c | 68 +++++++++++++++++++++++++++++++++++++-----
>>>>>> fs/ext2/ext2.h | 3 ++
>>>>>> fs/ext2/file.c | 6 ++--
>>>>>> fs/ext2/inode.c | 11 +++++--
>>>>>> fs/ext4/ext4.h | 3 ++
>>>>>> fs/ext4/file.c | 6 ++--
>>>>>> fs/ext4/inode.c | 13 ++++++--
>>>>>> fs/iomap/buffered-io.c | 3 +-
>>>>>> fs/xfs/xfs_bmap_util.c | 3 +-
>>>>>> fs/xfs/xfs_file.c | 8 ++---
>>>>>> fs/xfs/xfs_iomap.c | 36 +++++++++++++++++++++-
>>>>>> fs/xfs/xfs_iomap.h | 33 ++++++++++++++++++++
>>>>>> fs/xfs/xfs_iops.c | 7 ++---
>>>>>> fs/xfs/xfs_reflink.c | 3 +-
>>>>>> include/linux/dax.h | 21 ++++++++++---
>>>>>> include/linux/iomap.h | 1 +
>>>>>> 16 files changed, 189 insertions(+), 36 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/dax.c b/fs/dax.c
>>>>>> index 74dd918cff1f..0e0536765a7e 100644
>>>>>> --- a/fs/dax.c
>>>>>> +++ b/fs/dax.c
>>>>>> @@ -1348,11 +1348,30 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>>>>>> return done ? done : ret;
>>>>>> }
>>>>>>
>>>>>> +static inline int
>>>>>> +__dax_iomap_iter(struct iomap_iter *iter, const struct dax_iomap_ops *ops)
>>>>>> +{
>>>>>> + int ret;
>>>>>> +
>>>>>> + /*
>>>>>> + * Call dax_iomap_ops->actor_end() before iomap_ops->iomap_end() in
>>>>>> + * each iteration.
>>>>>> + */
>>>>>> + if (iter->iomap.length && ops->actor_end) {
>>>>>> + ret = ops->actor_end(iter->inode, iter->pos, iter->len,
>>>>>> + iter->processed);
>>>>>> + if (ret < 0)
>>>>>> + return ret;
>>>>>> + }
>>>>>> +
>>>>>> + return iomap_iter(iter, &ops->iomap_ops);
>>>>>
>>>>> This reorganization looks needlessly noisy. Why not require the
>>>>> iomap_end operation to perform the actor_end work. I.e. why can't
>>>>> xfs_dax_write_iomap_actor_end() just be the passed in iomap_end? I am
>>>>> not seeing where the ->iomap_end() result is ignored?
>>>>>
>>>>
>>>> The V6 patch[1] did it this way.
>>>> [1]https://lore.kernel.org/linux-xfs/20210526005159.GF202144@locust/T/#m79a66a928da2d089e2458c1a97c0516dbfde2f7f
>>>>
>>>> But Darrick reminded me that ->iomap_end() will always be given a zero
>>>> or positive 'written', because iomap_apply() clamps this argument:
>>>> ```
>>>> if (ops->iomap_end) {
>>>> ret = ops->iomap_end(inode, pos, length,
>>>> written > 0 ? written : 0,
>>>> flags, &iomap);
>>>> }
>>>> ```
>>>>
>>>> So, we cannot get the actual return code of the CoW from ->actor(),
>>>> and as a result, we cannot handle the xfs end_cow correctly in
>>>> ->iomap_end(). That's where the result of the CoW was ignored.
>>>
>>> Ah, thank you for the explanation.
>>>
>>> However, this still seems like too much code thrash just to get back
>>> to the original value of iter->processed. I notice you are talking
>>> about iomap_apply(), but that routine is now gone in Darrick's latest
>>> iomap-for-next branch. Instead iomap_iter() does this:
>>>
>>> if (iter->iomap.length && ops->iomap_end) {
>>> ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
>>> iter->processed > 0 ? iter->processed : 0,
>>
>> As you can see, this is the same logic as the old iomap_apply(): a
>> negative iter->processed won't be passed into ->iomap_end().
>>
>>> iter->flags, &iter->iomap);
>>> if (ret < 0 && !iter->processed)
>>> return ret;
>>> }
>>>
>>>
>>> I notice that the @iomap argument to ->iomap_end() is reliably coming
>>> from @iter. So you could do the following in your iomap_end()
>>> callback:
>>>
>>> struct iomap_iter *iter = container_of(iomap, typeof(*iter), iomap);
>>> struct xfs_inode *ip = XFS_I(inode);
>>> ssize_t written = iter->processed;
>>
>> The 'written' will be 0 or positive. The original error code is ignored.
>
> Correct, but you can use container_of() to get back to the iter and
> consider the raw untranslated value of iter->processed. As Christoph
> mentioned this needs a comment explaining the layering violation, but
> that's a cleaner change than the dax_iomap_ops approach.
>

Understood. Thanks.

--
Ruan.