2021-02-07 17:12:45

by Ruan Shiyang

[permalink] [raw]
Subject: [PATCH 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax

This patchset is attempt to add CoW support for fsdax, and take XFS,
which has both reflink and fsdax feature, as an example.

One of the key mechanism need to be implemented in fsdax is CoW. Copy
the data from srcmap before we actually write data to the destance
iomap. And we just copy range in which data won't be changed.

Another mechanism is range comparison . In page cache case, readpage()
is used to load data on disk to page cache in order to be able to
compare data. In fsdax case, readpage() does not work. So, we need
another compare data with direct access support.

With the two mechanism implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.


Some of the patches are picked up from Goldwyn's patchset. I made some
changes to adapt to this patchset.

(Rebased on v5.10)
==

Shiyang Ruan (7):
fsdax: Output address in dax_iomap_pfn() and rename it
fsdax: Introduce dax_copy_edges() for CoW
fsdax: Copy data before write
fsdax: Replace mmap entry in case of CoW
fsdax: Dedup file range to use a compare function
fs/xfs: Handle CoW for fsdax write() path
fs/xfs: Add dedupe support for fsdax

fs/btrfs/reflink.c | 3 +-
fs/dax.c | 188 ++++++++++++++++++++++++++++++++++++++---
fs/ocfs2/file.c | 2 +-
fs/remap_range.c | 14 +--
fs/xfs/xfs_bmap_util.c | 6 +-
fs/xfs/xfs_file.c | 30 ++++++-
fs/xfs/xfs_inode.c | 8 +-
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_iomap.c | 3 +-
fs/xfs/xfs_iops.c | 11 ++-
fs/xfs/xfs_reflink.c | 23 ++++-
include/linux/dax.h | 5 ++
include/linux/fs.h | 9 +-
13 files changed, 270 insertions(+), 33 deletions(-)

--
2.30.0




2021-02-07 17:12:47

by Ruan Shiyang

[permalink] [raw]
Subject: [PATCH 4/7] fsdax: Replace mmap entry in case of CoW

We replace the existing entry to the newly allocated one
in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
so writeback marks this entry as writeprotected. This
helps us snapshots so new write pagefaults after snapshots
trigger a CoW.

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 31 +++++++++++++++++++++++--------
1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b2195cbdf2dc..29698a3d2e37 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -722,6 +722,9 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
return 0;
}

+#define DAX_IF_DIRTY (1ULL << 0)
+#define DAX_IF_COW (1ULL << 1)
+
/*
* By this point grab_mapping_entry() has ensured that we have a locked entry
* of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -731,14 +734,16 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
*/
static void *dax_insert_entry(struct xa_state *xas,
struct address_space *mapping, struct vm_fault *vmf,
- void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+ void *entry, pfn_t pfn, unsigned long flags, bool insert_flags)
{
void *new_entry = dax_make_entry(pfn, flags);
+ bool dirty = insert_flags & DAX_IF_DIRTY;
+ bool cow = insert_flags & DAX_IF_COW;

if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

- if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+ if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -750,7 +755,7 @@ static void *dax_insert_entry(struct xa_state *xas,

xas_reset(xas);
xas_lock_irq(xas);
- if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+ if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;

dax_disassociate_entry(entry, mapping, false);
@@ -774,6 +779,9 @@ static void *dax_insert_entry(struct xa_state *xas,
if (dirty)
xas_set_mark(xas, PAGECACHE_TAG_DIRTY);

+ if (cow)
+ xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
xas_unlock_irq(xas);
return entry;
}
@@ -1319,6 +1327,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
void *entry;
pfn_t pfn;
void *kaddr;
+ unsigned long insert_flags = 0;

trace_dax_pte_fault(inode, vmf, ret);
/*
@@ -1444,8 +1453,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,

goto finish_iomap;
case IOMAP_UNWRITTEN:
- if (write && iomap.flags & IOMAP_F_SHARED)
+ if (write && (iomap.flags & IOMAP_F_SHARED)) {
+ insert_flags |= DAX_IF_COW;
goto cow;
+ }
fallthrough;
case IOMAP_HOLE:
if (!write) {
@@ -1555,6 +1566,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
int error;
pfn_t pfn;
void *kaddr;
+ unsigned long insert_flags = 0;

/*
* Check whether offset isn't beyond end of file now. Caller is
@@ -1670,14 +1682,17 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
result = vmf_insert_pfn_pmd(vmf, pfn, write);
break;
case IOMAP_UNWRITTEN:
- if (write && iomap.flags & IOMAP_F_SHARED)
+ if (write && (iomap.flags & IOMAP_F_SHARED)) {
+ insert_flags |= DAX_IF_COW;
goto cow;
+ }
fallthrough;
case IOMAP_HOLE:
- if (WARN_ON_ONCE(write))
+ if (!write) {
+ result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
break;
- result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
- break;
+ }
+ fallthrough;
default:
WARN_ON_ONCE(1);
break;
--
2.30.0



2021-02-07 17:12:57

by Ruan Shiyang

[permalink] [raw]
Subject: [PATCH 2/7] fsdax: Introduce dax_copy_edges() for CoW

dax_copy_edges() is a helper functions performs a copy from one part of
the device to another for data not page aligned.

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index b012b2db7ba2..ea4e8a434900 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1038,6 +1038,47 @@ static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
return rc;
}

+/*
+ * Copies the part of the pages not included in the write, but required for CoW
+ * because offset/offset+length are not page aligned.
+ */
+static int dax_copy_edges(loff_t pos, loff_t length, struct iomap *srcmap,
+ void *daddr, bool pmd)
+{
+ size_t page_size = pmd ? PMD_SIZE : PAGE_SIZE;
+ loff_t offset = pos & (page_size - 1);
+ size_t size = ALIGN(offset + length, page_size);
+ loff_t end = pos + length;
+ loff_t pg_end = round_up(end, page_size);
+ void *saddr = 0;
+ int ret = 0;
+
+ ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+ if (ret)
+ return ret;
+ /*
+ * Copy the first part of the page
+ * Note: we pass offset as length
+ */
+ if (offset) {
+ if (saddr)
+ ret = copy_mc_to_kernel(daddr, saddr, offset);
+ else
+ memset(daddr, 0, offset);
+ }
+
+ /* Copy the last part of the range */
+ if (end < pg_end) {
+ if (saddr)
+ ret = copy_mc_to_kernel(daddr + offset + length,
+ saddr + offset + length, pg_end - end);
+ else
+ memset(daddr + offset + length, 0, pg_end - end);
+ }
+
+ return ret;
+}
+
/*
* The user has performed a load from a hole in the file. Allocating a new
* page in the file would cause excessive storage usage for workloads with
--
2.30.0



2021-02-07 17:15:36

by Ruan Shiyang

[permalink] [raw]
Subject: [PATCH 6/7] fs/xfs: Handle CoW for fsdax write() path

In fsdax mode, WRITE and ZERO on a shared extent need CoW mechanism
performed. After CoW, new extents needs to be remapped to the file.

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 6 +++++-
fs/xfs/xfs_file.c | 10 +++++++---
fs/xfs/xfs_iomap.c | 3 ++-
fs/xfs/xfs_iops.c | 11 ++++++++---
fs/xfs/xfs_reflink.c | 2 ++
5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 7371a7f7c652..eb0c7ff103f8 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -977,10 +977,14 @@ xfs_free_file_space(
if (offset + len > XFS_ISIZE(ip))
len = XFS_ISIZE(ip) - offset;
error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
- &xfs_buffered_write_iomap_ops);
+ IS_DAX(VFS_I(ip)) ?
+ &xfs_direct_write_iomap_ops : &xfs_buffered_write_iomap_ops);
if (error)
return error;

+ if (xfs_is_reflink_inode(ip))
+ xfs_reflink_end_cow(ip, offset, len);
+
/*
* If we zeroed right up to EOF and EOF straddles a page boundary we
* must make sure that the post-EOF area is also zeroed because the
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..ab738641a3f4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -624,9 +624,13 @@ xfs_file_dax_write(

trace_xfs_file_dax_write(ip, count, pos);
ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
- if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
- i_size_write(inode, iocb->ki_pos);
- error = xfs_setfilesize(ip, pos, ret);
+ if (ret > 0) {
+ if (iocb->ki_pos > i_size_read(inode)) {
+ i_size_write(inode, iocb->ki_pos);
+ error = xfs_setfilesize(ip, pos, ret);
+ }
+ if (xfs_is_cow_inode(ip))
+ xfs_reflink_end_cow(ip, pos, ret);
}
out:
xfs_iunlock(ip, iolock);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 7b9ff824e82d..d6d4cc0f084e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -765,13 +765,14 @@ xfs_direct_write_iomap_begin(
goto out_unlock;

if (imap_needs_cow(ip, flags, &imap, nimaps)) {
+ bool need_convert = flags & IOMAP_DIRECT || IS_DAX(inode);
error = -EAGAIN;
if (flags & IOMAP_NOWAIT)
goto out_unlock;

/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
- &lockmode, flags & IOMAP_DIRECT);
+ &lockmode, need_convert);
if (error)
goto out_unlock;
if (shared)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 1414ab79eacf..480bda0f16d0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -860,6 +860,7 @@ xfs_setattr_size(
int error;
uint lock_flags = 0;
bool did_zeroing = false;
+ const struct iomap_ops *ops;

ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
@@ -906,10 +907,15 @@ xfs_setattr_size(
* extension, or zeroing out the rest of the block on a downward
* truncate.
*/
+ if (IS_DAX(inode))
+ ops = &xfs_direct_write_iomap_ops;
+ else
+ ops = &xfs_buffered_write_iomap_ops;
+
if (newsize > oldsize) {
trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
error = iomap_zero_range(inode, oldsize, newsize - oldsize,
- &did_zeroing, &xfs_buffered_write_iomap_ops);
+ &did_zeroing, ops);
} else {
/*
* iomap won't detect a dirty page over an unwritten block (or a
@@ -921,8 +927,7 @@ xfs_setattr_size(
newsize);
if (error)
return error;
- error = iomap_truncate_page(inode, newsize, &did_zeroing,
- &xfs_buffered_write_iomap_ops);
+ error = iomap_truncate_page(inode, newsize, &did_zeroing, ops);
}

if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 26708c01fd78..38bde16cdb39 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1246,6 +1246,8 @@ xfs_reflink_zero_posteof(

trace_xfs_zero_eof(ip, isize, pos - isize);
return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
+ IS_DAX(VFS_I(ip)) ?
+ &xfs_direct_write_iomap_ops :
&xfs_buffered_write_iomap_ops);
}

--
2.30.0



2021-02-07 17:15:58

by Ruan Shiyang

[permalink] [raw]
Subject: [PATCH 7/7] fs/xfs: Add dedupe support for fsdax

Add xfs_break_two_dax_layouts() to break layout for tow dax files. Then
call compare range function only when files are both DAX or not.

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/xfs/xfs_file.c | 20 ++++++++++++++++++++
fs/xfs/xfs_inode.c | 8 +++++++-
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_reflink.c | 21 ++++++++++++++++++---
4 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ab738641a3f4..64ded96b43c8 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -791,6 +791,26 @@ xfs_break_dax_layouts(
0, 0, xfs_wait_dax_page(inode));
}

+int
+xfs_break_two_dax_layouts(
+ struct inode *src,
+ struct inode *dest)
+{
+ int error;
+ bool retry = false;
+
+retry:
+ error = xfs_break_dax_layouts(src, &retry);
+ if (error || retry)
+ goto retry;
+
+ error = xfs_break_dax_layouts(dest, &retry);
+ if (error || retry)
+ goto retry;
+
+ return error;
+}
+
int
xfs_break_layouts(
struct inode *inode,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2bfbcf28b1bd..6483dbaa4d57 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3786,8 +3786,10 @@ xfs_ilock2_io_mmap(
struct xfs_inode *ip2)
{
int ret;
+ struct inode *inode1 = VFS_I(ip1);
+ struct inode *inode2 = VFS_I(ip2);

- ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
+ ret = xfs_iolock_two_inodes_and_break_layout(inode1, inode2);
if (ret)
return ret;
if (ip1 == ip2)
@@ -3795,6 +3797,10 @@ xfs_ilock2_io_mmap(
else
xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL,
ip2, XFS_MMAPLOCK_EXCL);
+
+ if (IS_DAX(inode1) && IS_DAX(inode2))
+ ret = xfs_break_two_dax_layouts(inode1, inode2);
+
return 0;
}

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 751a3d1d7d84..462b61bea691 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -431,6 +431,7 @@ enum xfs_prealloc_flags {

int xfs_update_prealloc_flags(struct xfs_inode *ip,
enum xfs_prealloc_flags flags);
+int xfs_break_two_dax_layouts(struct inode *inode1, struct inode *inode2);
int xfs_break_layouts(struct inode *inode, uint *iolock,
enum layout_break_reason reason);

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 38bde16cdb39..bdb9a07e703f 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -29,6 +29,7 @@
#include "xfs_iomap.h"
#include "xfs_sb.h"
#include "xfs_ag_resv.h"
+#include <linux/dax.h>

/*
* Copy on Write of Shared Blocks
@@ -1251,6 +1252,14 @@ xfs_reflink_zero_posteof(
&xfs_buffered_write_iomap_ops);
}

+int xfs_reflink_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same)
+{
+ return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same,
+ &xfs_read_iomap_ops);
+}
+
/*
* Prepare two files for range cloning. Upon a successful return both inodes
* will have the iolock and mmaplock held, the page cache of the out file will
@@ -1293,6 +1302,7 @@ xfs_reflink_remap_prep(
struct xfs_inode *src = XFS_I(inode_in);
struct inode *inode_out = file_inode(file_out);
struct xfs_inode *dest = XFS_I(inode_out);
+ compare_range_t compare_range_fn;
int ret;

/* Lock both files against IO */
@@ -1306,12 +1316,17 @@ xfs_reflink_remap_prep(
if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
goto out_unlock;

- /* Don't share DAX file data for now. */
- if (IS_DAX(inode_in) || IS_DAX(inode_out))
+ /* Don't share DAX file data with non-DAX file. */
+ if (IS_DAX(inode_in) != IS_DAX(inode_out))
goto out_unlock;

+ if (IS_DAX(inode_in))
+ compare_range_fn = xfs_reflink_dedupe_file_range_compare;
+ else
+ compare_range_fn = vfs_dedupe_file_range_compare;
+
ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags, vfs_dedupe_file_range_compare);
+ len, remap_flags, compare_range_fn);
if (ret || *len == 0)
goto out_unlock;

--
2.30.0



2021-02-07 17:16:03

by Ruan Shiyang

[permalink] [raw]
Subject: [PATCH 5/7] fsdax: Dedup file range to use a compare function

With dax we cannot deal with readpage() etc. So, we create a
funciton callback to perform the file data comparison and pass
it to generic_remap_file_range_prep() so it can use iomap-based
functions.

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/btrfs/reflink.c | 3 +-
fs/dax.c | 67 ++++++++++++++++++++++++++++++++++++++++++++
fs/ocfs2/file.c | 2 +-
fs/remap_range.c | 14 +++++----
fs/xfs/xfs_reflink.c | 2 +-
include/linux/dax.h | 5 ++++
include/linux/fs.h | 9 +++++-
7 files changed, 92 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 99aa87c08912..efe094f7f748 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -783,7 +783,8 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
return ret;

return generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags);
+ len, remap_flags,
+ vfs_dedupe_file_range_compare);
}

loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
diff --git a/fs/dax.c b/fs/dax.c
index 29698a3d2e37..6e9391845057 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,6 +30,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/fs_dax.h>

+#define MIN(a, b) (((a) < (b)) ? (a) : (b))
+
static inline unsigned int pe_order(enum page_entry_size pe_size)
{
if (pe_size == PE_SIZE_PTE)
@@ -1827,3 +1829,68 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
return dax_insert_pfn_mkwrite(vmf, pfn, order);
}
EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
+ loff_t destoff, loff_t len, bool *is_same,
+ const struct iomap_ops *ops)
+{
+ void *saddr, *daddr;
+ struct iomap smap = { 0 };
+ struct iomap dmap = { 0 };
+ bool same = false;
+ loff_t cmp_len;
+ int id, ret = 0;
+
+ id = dax_read_lock();
+ while (len) {
+ ret = ops->iomap_begin(src, srcoff, len, 0, &smap, NULL);
+ if (ret < 0)
+ goto out_src;
+ cmp_len = MIN(len, smap.offset + smap.length - srcoff);
+
+ ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &dmap, NULL);
+ if (ret < 0)
+ goto out_dest;
+ cmp_len = MIN(cmp_len, dmap.offset + dmap.length - destoff);
+
+ if (smap.type == IOMAP_HOLE && dmap.type == IOMAP_HOLE)
+ goto next;
+
+ if (smap.type == IOMAP_HOLE || dmap.type == IOMAP_HOLE) {
+ same = false;
+ goto next;
+ }
+
+ ret = dax_iomap_direct_access(&smap, srcoff,
+ ALIGN(srcoff + cmp_len, PAGE_SIZE), &saddr, NULL);
+ if (ret < 0)
+ goto out_dest;
+
+ ret = dax_iomap_direct_access(&dmap, destoff,
+ ALIGN(destoff + cmp_len, PAGE_SIZE), &daddr, NULL);
+ if (ret < 0)
+ goto out_dest;
+
+ same = !memcmp(saddr, daddr, cmp_len);
+ if (!same)
+ break;
+next:
+ len -= cmp_len;
+ srcoff += cmp_len;
+ destoff += cmp_len;
+out_dest:
+ if (ops->iomap_end)
+ ops->iomap_end(dest, destoff, cmp_len, 0, 0, &dmap);
+out_src:
+ if (ops->iomap_end)
+ ops->iomap_end(src, srcoff, len, 0, 0, &smap);
+
+ if (ret < 0)
+ goto out;
+ }
+ *is_same = same;
+out:
+ dax_read_unlock(id);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(dax_file_range_compare);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 85979e2214b3..9d101f129d16 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2591,7 +2591,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
goto out_unlock;

ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- &len, remap_flags);
+ &len, remap_flags, vfs_dedupe_file_range_compare);
if (ret < 0 || len == 0)
goto out_unlock;

diff --git a/fs/remap_range.c b/fs/remap_range.c
index e6099beefa97..fc64f7e2af5a 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -199,9 +199,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
* Compare extents of two files to see if they are the same.
* Caller must have locked both inodes to prevent write races.
*/
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
- struct inode *dest, loff_t destoff,
- loff_t len, bool *is_same)
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same)
{
loff_t src_poff;
loff_t dest_poff;
@@ -280,6 +280,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
out_error:
return error;
}
+EXPORT_SYMBOL(vfs_dedupe_file_range_compare);

/*
* Check that the two inodes are eligible for cloning, the ranges make
@@ -291,7 +292,8 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
*/
int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
- loff_t *len, unsigned int remap_flags)
+ loff_t *len, unsigned int remap_flags,
+ compare_range_t compare_range_fn)
{
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -351,8 +353,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
if (remap_flags & REMAP_FILE_DEDUP) {
bool is_same = false;

- ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
- inode_out, pos_out, *len, &is_same);
+ ret = compare_range_fn(inode_in, pos_in, inode_out, pos_out,
+ *len, &is_same);
if (ret)
return ret;
if (!is_same)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 6fa05fb78189..26708c01fd78 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1309,7 +1309,7 @@ xfs_reflink_remap_prep(
goto out_unlock;

ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags);
+ len, remap_flags, vfs_dedupe_file_range_compare);
if (ret || *len == 0)
goto out_unlock;

diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..183e084c2ac7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -238,6 +238,11 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
+int dax_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same,
+ const struct iomap_ops *ops);
+
static inline bool dax_mapping(struct address_space *mapping)
{
return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8667d0cdc71e..159ec755e47b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1912,13 +1912,20 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
+extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same);
+typedef int (*compare_range_t)(struct inode *src, loff_t srcpos,
+ struct inode *dest, loff_t destpos,
+ loff_t len, bool *is_same);
extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags);
extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *count,
- unsigned int remap_flags);
+ unsigned int remap_flags,
+ compare_range_t compare_range_fn);
extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
--
2.30.0



2021-02-08 16:24:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/7] fsdax: Introduce dax_copy_edges() for CoW

On Mon, Feb 08, 2021 at 01:09:19AM +0800, Shiyang Ruan wrote:
> dax_copy_edges() is a helper functions performs a copy from one part of
> the device to another for data not page aligned.

The function looks good to me, but adding a static function without a
user is not very bisection friendly.

2021-02-08 16:53:12

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 4/7] fsdax: Replace mmap entry in case of CoW

> static void *dax_insert_entry(struct xa_state *xas,
> struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> + void *entry, pfn_t pfn, unsigned long flags, bool insert_flags)
> {
> void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;
>
> if (dirty)
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

Can we just move the __mark_inode_dirty to the one caller that needs it
in a prep patch?

2021-02-08 17:00:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

On Mon, Feb 08, 2021 at 01:09:22AM +0800, Shiyang Ruan wrote:
> With dax we cannot deal with readpage() etc. So, we create a
> funciton callback to perform the file data comparison and pass

s/funciton/function/g

> +#define MIN(a, b) (((a) < (b)) ? (a) : (b))

This should use the existing min or min_t helpers.


> int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> - loff_t *len, unsigned int remap_flags)
> + loff_t *len, unsigned int remap_flags,
> + compare_range_t compare_range_fn)

Can we keep generic_remap_file_range_prep as-is, and add a new
dax_remap_file_range_prep, both sharing a low-level
__generic_remap_file_range_prep implementation? And for that
implementation I'd also go for classic if/else instead of the function
pointer.

> +extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff,
> + loff_t len, bool *is_same);

no need for the extern.

2021-02-08 17:14:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 6/7] fs/xfs: Handle CoW for fsdax write() path

> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -977,10 +977,14 @@ xfs_free_file_space(
> if (offset + len > XFS_ISIZE(ip))
> len = XFS_ISIZE(ip) - offset;
> error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> - &xfs_buffered_write_iomap_ops);
> + IS_DAX(VFS_I(ip)) ?
> + &xfs_direct_write_iomap_ops : &xfs_buffered_write_iomap_ops);
> if (error)
> return error;
> + if (xfs_is_reflink_inode(ip))
> + xfs_reflink_end_cow(ip, offset, len);

Maybe we need to add (back) and xfs_zero_range helper that encapsulates
the details?

> trace_xfs_file_dax_write(ip, count, pos);
> ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> - if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> - i_size_write(inode, iocb->ki_pos);
> - error = xfs_setfilesize(ip, pos, ret);
> + if (ret > 0) {
> + if (iocb->ki_pos > i_size_read(inode)) {
> + i_size_write(inode, iocb->ki_pos);
> + error = xfs_setfilesize(ip, pos, ret);
> + }
> + if (xfs_is_cow_inode(ip))
> + xfs_reflink_end_cow(ip, pos, ret);

Nitpick, but I'd just goto out for ret <= 0 to reduce the indentation a
bit.

> }
> out:
> xfs_iunlock(ip, iolock);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 7b9ff824e82d..d6d4cc0f084e 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -765,13 +765,14 @@ xfs_direct_write_iomap_begin(
> goto out_unlock;
>
> if (imap_needs_cow(ip, flags, &imap, nimaps)) {
> + bool need_convert = flags & IOMAP_DIRECT || IS_DAX(inode);
> error = -EAGAIN;
> if (flags & IOMAP_NOWAIT)
> goto out_unlock;
>
> /* may drop and re-acquire the ilock */
> error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> - &lockmode, flags & IOMAP_DIRECT);
> + &lockmode, need_convert);

Why not:

error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
&lockmode,
(flags & IOMAP_DIRECT) || IS_DAX(inode));

?

2021-02-08 18:12:51

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax

On Mon 08-02-21 01:09:17, Shiyang Ruan wrote:
> This patchset is attempt to add CoW support for fsdax, and take XFS,
> which has both reflink and fsdax feature, as an example.
>
> One of the key mechanism need to be implemented in fsdax is CoW. Copy
> the data from srcmap before we actually write data to the destance
> iomap. And we just copy range in which data won't be changed.
>
> Another mechanism is range comparison . In page cache case, readpage()
> is used to load data on disk to page cache in order to be able to
> compare data. In fsdax case, readpage() does not work. So, we need
> another compare data with direct access support.
>
> With the two mechanism implemented in fsdax, we are able to make reflink
> and fsdax work together in XFS.
>
> Some of the patches are picked up from Goldwyn's patchset. I made some
> changes to adapt to this patchset.

How do you deal with HWPoison code trying to reverse-map struct page back
to inode-offset pair? This also needs to be fixed for reflink to work with
DAX.

Honza

>
> (Rebased on v5.10)
> ==
>
> Shiyang Ruan (7):
> fsdax: Output address in dax_iomap_pfn() and rename it
> fsdax: Introduce dax_copy_edges() for CoW
> fsdax: Copy data before write
> fsdax: Replace mmap entry in case of CoW
> fsdax: Dedup file range to use a compare function
> fs/xfs: Handle CoW for fsdax write() path
> fs/xfs: Add dedupe support for fsdax
>
> fs/btrfs/reflink.c | 3 +-
> fs/dax.c | 188 ++++++++++++++++++++++++++++++++++++++---
> fs/ocfs2/file.c | 2 +-
> fs/remap_range.c | 14 +--
> fs/xfs/xfs_bmap_util.c | 6 +-
> fs/xfs/xfs_file.c | 30 ++++++-
> fs/xfs/xfs_inode.c | 8 +-
> fs/xfs/xfs_inode.h | 1 +
> fs/xfs/xfs_iomap.c | 3 +-
> fs/xfs/xfs_iops.c | 11 ++-
> fs/xfs/xfs_reflink.c | 23 ++++-
> include/linux/dax.h | 5 ++
> include/linux/fs.h | 9 +-
> 13 files changed, 270 insertions(+), 33 deletions(-)
>
> --
> 2.30.0
>
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-02-09 01:53:54

by Ruan Shiyang

[permalink] [raw]
Subject: Re: [PATCH 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax



On 2021/2/8 下午11:39, Jan Kara wrote:
> On Mon 08-02-21 01:09:17, Shiyang Ruan wrote:
>> This patchset is attempt to add CoW support for fsdax, and take XFS,
>> which has both reflink and fsdax feature, as an example.
>>
>> One of the key mechanism need to be implemented in fsdax is CoW. Copy
>> the data from srcmap before we actually write data to the destance
>> iomap. And we just copy range in which data won't be changed.
>>
>> Another mechanism is range comparison . In page cache case, readpage()
>> is used to load data on disk to page cache in order to be able to
>> compare data. In fsdax case, readpage() does not work. So, we need
>> another compare data with direct access support.
>>
>> With the two mechanism implemented in fsdax, we are able to make reflink
>> and fsdax work together in XFS.
>>
>> Some of the patches are picked up from Goldwyn's patchset. I made some
>> changes to adapt to this patchset.
>
> How do you deal with HWPoison code trying to reverse-map struct page back
> to inode-offset pair? This also needs to be fixed for reflink to work with
> DAX.
>

I have posted v3 patchset to add reverse-map support for dax page
yesterday. Please take a look at this:

https://lkml.org/lkml/2021/2/8/347


--
Thanks,
Ruan Shiyang.

> Honza
>
>>
>> (Rebased on v5.10)
>> ==
>>
>> Shiyang Ruan (7):
>> fsdax: Output address in dax_iomap_pfn() and rename it
>> fsdax: Introduce dax_copy_edges() for CoW
>> fsdax: Copy data before write
>> fsdax: Replace mmap entry in case of CoW
>> fsdax: Dedup file range to use a compare function
>> fs/xfs: Handle CoW for fsdax write() path
>> fs/xfs: Add dedupe support for fsdax
>>
>> fs/btrfs/reflink.c | 3 +-
>> fs/dax.c | 188 ++++++++++++++++++++++++++++++++++++++---
>> fs/ocfs2/file.c | 2 +-
>> fs/remap_range.c | 14 +--
>> fs/xfs/xfs_bmap_util.c | 6 +-
>> fs/xfs/xfs_file.c | 30 ++++++-
>> fs/xfs/xfs_inode.c | 8 +-
>> fs/xfs/xfs_inode.h | 1 +
>> fs/xfs/xfs_iomap.c | 3 +-
>> fs/xfs/xfs_iops.c | 11 ++-
>> fs/xfs/xfs_reflink.c | 23 ++++-
>> include/linux/dax.h | 5 ++
>> include/linux/fs.h | 9 +-
>> 13 files changed, 270 insertions(+), 33 deletions(-)
>>
>> --
>> 2.30.0
>>
>>
>>


2021-02-09 09:26:30

by Ruan Shiyang

[permalink] [raw]
Subject: Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function



On 2021/2/8 下午11:19, Christoph Hellwig wrote:
> On Mon, Feb 08, 2021 at 01:09:22AM +0800, Shiyang Ruan wrote:
>> With dax we cannot deal with readpage() etc. So, we create a
>> funciton callback to perform the file data comparison and pass
>
> s/funciton/function/g
>
>> +#define MIN(a, b) (((a) < (b)) ? (a) : (b))
>
> This should use the existing min or min_t helpers.
>
>
>> int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>> struct file *file_out, loff_t pos_out,
>> - loff_t *len, unsigned int remap_flags)
>> + loff_t *len, unsigned int remap_flags,
>> + compare_range_t compare_range_fn)
>
> Can we keep generic_remap_file_range_prep as-is, and add a new
> dax_remap_file_range_prep, both sharing a low-level
> __generic_remap_file_range_prep implementation? And for that
> implementation I'd also go for classic if/else instead of the function
> pointer.

The dax dedupe comparison need the iomap_ops pointer as argument, so my
understanding is that we don't modify the argument list of
generic_remap_file_range_prep(), but move its code into
__generic_remap_file_range_prep() whose argument list can be modified to
accepts the iomap_ops pointer. Then it looks like this:

```
int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *len, unsigned int remap_flags,
const struct iomap_ops *ops)
{
return __generic_remap_file_range_prep(file_in, pos_in, file_out,
pos_out, len, remap_flags, ops);
}
EXPORT_SYMBOL(dax_remap_file_range_prep);

int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *len, unsigned int remap_flags)
{
return __generic_remap_file_range_prep(file_in, pos_in, file_out,
pos_out, len, remap_flags, NULL);
}
EXPORT_SYMBOL(generic_remap_file_range_prep);
```

Am i right?


--
Thanks,
Ruan Shiyang.

>
>> +extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>> + struct inode *dest, loff_t destoff,
>> + loff_t len, bool *is_same);
>
> no need for the extern.
>
>


2021-02-09 09:42:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:
> The dax dedupe comparison need the iomap_ops pointer as argument, so my
> understanding is that we don't modify the argument list of
> generic_remap_file_range_prep(), but move its code into
> __generic_remap_file_range_prep() whose argument list can be modified to
> accepts the iomap_ops pointer. Then it looks like this:

I'd say just add the iomap_ops pointer to
generic_remap_file_range_prep and do away with the extra wrappers. We
only have three callers anyway.

2021-02-09 09:59:51

by Ruan Shiyang

[permalink] [raw]
Subject: Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function



On 2021/2/9 下午5:34, Christoph Hellwig wrote:
> On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:
>> The dax dedupe comparison need the iomap_ops pointer as argument, so my
>> understanding is that we don't modify the argument list of
>> generic_remap_file_range_prep(), but move its code into
>> __generic_remap_file_range_prep() whose argument list can be modified to
>> accepts the iomap_ops pointer. Then it looks like this:
>
> I'd say just add the iomap_ops pointer to
> generic_remap_file_range_prep and do away with the extra wrappers. We
> only have three callers anyway.

OK.


--
Thanks,
Ruan Shiyang.
>
>


2021-02-10 13:21:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

On Tue, Feb 09, 2021 at 05:46:13PM +0800, Ruan Shiyang wrote:
>
>
> On 2021/2/9 下午5:34, Christoph Hellwig wrote:
>> On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:
>>> The dax dedupe comparison need the iomap_ops pointer as argument, so my
>>> understanding is that we don't modify the argument list of
>>> generic_remap_file_range_prep(), but move its code into
>>> __generic_remap_file_range_prep() whose argument list can be modified to
>>> accepts the iomap_ops pointer. Then it looks like this:
>>
>> I'd say just add the iomap_ops pointer to
>> generic_remap_file_range_prep and do away with the extra wrappers. We
>> only have three callers anyway.
>
> OK.

So looking at this again I think your proposal actaully is better,
given that the iomap variant is still DAX specific. Sorry for
the noise.

Also I think dax_file_range_compare should use iomap_apply instead
of open coding it.

2021-02-16 13:18:13

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH 4/7] fsdax: Replace mmap entry in case of CoW

On Mon, Feb 08, 2021 at 01:09:21AM +0800, Shiyang Ruan wrote:
> We replace the existing entry to the newly allocated one
> in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
> so writeback marks this entry as writeprotected. This
> helps us snapshots so new write pagefaults after snapshots
> trigger a CoW.
>
> Signed-off-by: Goldwyn Rodrigues <[email protected]>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> fs/dax.c | 31 +++++++++++++++++++++++--------
> 1 file changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index b2195cbdf2dc..29698a3d2e37 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -722,6 +722,9 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
> return 0;
> }
>
> +#define DAX_IF_DIRTY (1ULL << 0)
> +#define DAX_IF_COW (1ULL << 1)

The constants are ULL, but I see other flags only 'unsigned long'

> +
> /*
> * By this point grab_mapping_entry() has ensured that we have a locked entry
> * of the appropriate size so we don't have to worry about downgrading PMDs to
> @@ -731,14 +734,16 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
> */
> static void *dax_insert_entry(struct xa_state *xas,
> struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> + void *entry, pfn_t pfn, unsigned long flags, bool insert_flags)

insert_flags is bool

> {
> void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;

"insert_flags & DAX_IF_DIRTY" is "bool & ULL", this can't be right

> + bool cow = insert_flags & DAX_IF_COW;

Same

>
> if (dirty)
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
> unsigned long index = xas->xa_index;
> /* we are replacing a zero page with block mapping */
> if (dax_is_pmd_entry(entry))
> @@ -750,7 +755,7 @@ static void *dax_insert_entry(struct xa_state *xas,
>
> xas_reset(xas);
> xas_lock_irq(xas);
> - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> void *old;
>
> dax_disassociate_entry(entry, mapping, false);
> @@ -774,6 +779,9 @@ static void *dax_insert_entry(struct xa_state *xas,
> if (dirty)
> xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>
> + if (cow)
> + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
> xas_unlock_irq(xas);
> return entry;
> }
> @@ -1319,6 +1327,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> void *entry;
> pfn_t pfn;
> void *kaddr;
> + unsigned long insert_flags = 0;
>
> trace_dax_pte_fault(inode, vmf, ret);
> /*
> @@ -1444,8 +1453,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>
> goto finish_iomap;
> case IOMAP_UNWRITTEN:
> - if (write && iomap.flags & IOMAP_F_SHARED)
> + if (write && (iomap.flags & IOMAP_F_SHARED)) {
> + insert_flags |= DAX_IF_COW;

Here's an example of 'unsigned long = unsigned long long', though it'll
work, it would be better to unify all the types.

> goto cow;
> + }
> fallthrough;
> case IOMAP_HOLE:
> if (!write) {
> @@ -1555,6 +1566,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> int error;
> pfn_t pfn;
> void *kaddr;
> + unsigned long insert_flags = 0;
>
> /*
> * Check whether offset isn't beyond end of file now. Caller is
> @@ -1670,14 +1682,17 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> result = vmf_insert_pfn_pmd(vmf, pfn, write);
> break;
> case IOMAP_UNWRITTEN:
> - if (write && iomap.flags & IOMAP_F_SHARED)
> + if (write && (iomap.flags & IOMAP_F_SHARED)) {
> + insert_flags |= DAX_IF_COW;
> goto cow;
> + }
> fallthrough;
> case IOMAP_HOLE:
> - if (WARN_ON_ONCE(write))
> + if (!write) {
> + result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
> break;
> - result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
> - break;
> + }
> + fallthrough;
> default:
> WARN_ON_ONCE(1);
> break;
> --
> 2.30.0
>
>

2021-02-17 03:39:39

by Ruan Shiyang

[permalink] [raw]
Subject: Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function



On 2021/2/10 下午9:19, Christoph Hellwig wrote:
> On Tue, Feb 09, 2021 at 05:46:13PM +0800, Ruan Shiyang wrote:
>>
>>
>> On 2021/2/9 下午5:34, Christoph Hellwig wrote:
>>> On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:
>>>> The dax dedupe comparison need the iomap_ops pointer as argument, so my
>>>> understanding is that we don't modify the argument list of
>>>> generic_remap_file_range_prep(), but move its code into
>>>> __generic_remap_file_range_prep() whose argument list can be modified to
>>>> accepts the iomap_ops pointer. Then it looks like this:
>>>
>>> I'd say just add the iomap_ops pointer to
>>> generic_remap_file_range_prep and do away with the extra wrappers. We
>>> only have three callers anyway.
>>
>> OK.
>
> So looking at this again I think your proposal actaully is better,
> given that the iomap variant is still DAX specific. Sorry for
> the noise.
>
> Also I think dax_file_range_compare should use iomap_apply instead
> of open coding it.
>

There are two files, which are not reflinked, need to be direct_access()
here. The iomap_apply() can handle one file each time. So, it seems
that iomap_apply() is not suitable for this case...


The pseudo code of this process is as follows:

srclen = ops->begin(&srcmap)
destlen = ops->begin(&destmap)

direct_access(&srcmap, &saddr)
direct_access(&destmap, &daddr)

same = memcpy(saddr, daddr, min(srclen,destlen))

ops->end(&destmap)
ops->end(&srcmap)

I think a nested call like this is necessary. That's why I use the open
code way.


--
Thanks,
Ruan Shiyang.
>


2021-02-17 03:54:17

by Ruan Shiyang

[permalink] [raw]
Subject: Re: [PATCH 4/7] fsdax: Replace mmap entry in case of CoW



On 2021/2/16 下午9:11, David Sterba wrote:
> On Mon, Feb 08, 2021 at 01:09:21AM +0800, Shiyang Ruan wrote:
>> We replace the existing entry to the newly allocated one
>> in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
>> so writeback marks this entry as writeprotected. This
>> helps us snapshots so new write pagefaults after snapshots
>> trigger a CoW.
>>
>> Signed-off-by: Goldwyn Rodrigues <[email protected]>
>> Signed-off-by: Shiyang Ruan <[email protected]>
>> ---
>> fs/dax.c | 31 +++++++++++++++++++++++--------
>> 1 file changed, 23 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index b2195cbdf2dc..29698a3d2e37 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -722,6 +722,9 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
>> return 0;
>> }
>>
>> +#define DAX_IF_DIRTY (1ULL << 0)
>> +#define DAX_IF_COW (1ULL << 1)
>
> The constants are ULL, but I see other flags only 'unsigned long'
>
>> +
>> /*
>> * By this point grab_mapping_entry() has ensured that we have a locked entry
>> * of the appropriate size so we don't have to worry about downgrading PMDs to
>> @@ -731,14 +734,16 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
>> */
>> static void *dax_insert_entry(struct xa_state *xas,
>> struct address_space *mapping, struct vm_fault *vmf,
>> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
>> + void *entry, pfn_t pfn, unsigned long flags, bool insert_flags)
>
> insert_flags is bool
>
>> {
>> void *new_entry = dax_make_entry(pfn, flags);
>> + bool dirty = insert_flags & DAX_IF_DIRTY;
>
> "insert_flags & DAX_IF_DIRTY" is "bool & ULL", this can't be right

This is a mistake caused by rebasing my old version patchset. Thanks
for pointing out. I'll fix this in next version.
>
>> + bool cow = insert_flags & DAX_IF_COW;
>
> Same
>
>>
>> if (dirty)
>> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>
>> - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
>> + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
>> unsigned long index = xas->xa_index;
>> /* we are replacing a zero page with block mapping */
>> if (dax_is_pmd_entry(entry))
>> @@ -750,7 +755,7 @@ static void *dax_insert_entry(struct xa_state *xas,
>>
>> xas_reset(xas);
>> xas_lock_irq(xas);
>> - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>> + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>> void *old;
>>
>> dax_disassociate_entry(entry, mapping, false);
>> @@ -774,6 +779,9 @@ static void *dax_insert_entry(struct xa_state *xas,
>> if (dirty)
>> xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>>
>> + if (cow)
>> + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
>> +
>> xas_unlock_irq(xas);
>> return entry;
>> }
>> @@ -1319,6 +1327,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>> void *entry;
>> pfn_t pfn;
>> void *kaddr;
>> + unsigned long insert_flags = 0;
>>
>> trace_dax_pte_fault(inode, vmf, ret);
>> /*
>> @@ -1444,8 +1453,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>>
>> goto finish_iomap;
>> case IOMAP_UNWRITTEN:
>> - if (write && iomap.flags & IOMAP_F_SHARED)
>> + if (write && (iomap.flags & IOMAP_F_SHARED)) {
>> + insert_flags |= DAX_IF_COW;
>
> Here's an example of 'unsigned long = unsigned long long', though it'll
> work, it would be better to unify all the types.

Yes, I'll fix it.


--
Thanks,
Ruan Shiyang.
>
>> goto cow;
>> + }
>> fallthrough;
>> case IOMAP_HOLE:
>> if (!write) {
>> @@ -1555,6 +1566,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>> int error;
>> pfn_t pfn;
>> void *kaddr;
>> + unsigned long insert_flags = 0;
>>
>> /*
>> * Check whether offset isn't beyond end of file now. Caller is
>> @@ -1670,14 +1682,17 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>> result = vmf_insert_pfn_pmd(vmf, pfn, write);
>> break;
>> case IOMAP_UNWRITTEN:
>> - if (write && iomap.flags & IOMAP_F_SHARED)
>> + if (write && (iomap.flags & IOMAP_F_SHARED)) {
>> + insert_flags |= DAX_IF_COW;
>> goto cow;
>> + }
>> fallthrough;
>> case IOMAP_HOLE:
>> - if (WARN_ON_ONCE(write))
>> + if (!write) {
>> + result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
>> break;
>> - result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
>> - break;
>> + }
>> + fallthrough;
>> default:
>> WARN_ON_ONCE(1);
>> break;
>> --
>> 2.30.0
>>
>>
>
>


2021-02-18 18:44:12

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

On Wed, Feb 17, 2021 at 11:24:18AM +0800, Ruan Shiyang wrote:
>
>
> On 2021/2/10 下午9:19, Christoph Hellwig wrote:
> > On Tue, Feb 09, 2021 at 05:46:13PM +0800, Ruan Shiyang wrote:
> > >
> > >
> > > On 2021/2/9 下午5:34, Christoph Hellwig wrote:
> > > > On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:
> > > > > The dax dedupe comparison need the iomap_ops pointer as argument, so my
> > > > > understanding is that we don't modify the argument list of
> > > > > generic_remap_file_range_prep(), but move its code into
> > > > > __generic_remap_file_range_prep() whose argument list can be modified to
> > > > > accepts the iomap_ops pointer. Then it looks like this:
> > > >
> > > > I'd say just add the iomap_ops pointer to
> > > > generic_remap_file_range_prep and do away with the extra wrappers. We
> > > > only have three callers anyway.
> > >
> > > OK.
> >
> > So looking at this again I think your proposal actaully is better,
> > given that the iomap variant is still DAX specific. Sorry for
> > the noise.
> >
> > Also I think dax_file_range_compare should use iomap_apply instead
> > of open coding it.
> >
>
> There are two files, which are not reflinked, need to be direct_access()
> here. The iomap_apply() can handle one file each time. So, it seems that
> iomap_apply() is not suitable for this case...
>
>
> The pseudo code of this process is as follows:
>
> srclen = ops->begin(&srcmap)
> destlen = ops->begin(&destmap)
>
> direct_access(&srcmap, &saddr)
> direct_access(&destmap, &daddr)
>
> same = memcpy(saddr, daddr, min(srclen,destlen))
>
> ops->end(&destmap)
> ops->end(&srcmap)
>
> I think a nested call like this is necessary. That's why I use the open
> code way.

This might be a good place to implement an iomap_apply2() loop that
actually /does/ walk all the extents of file1 and file2. There's now
two users of this idiom.

(Possibly structured as a "get next mappings from both" generator
function like Matthew Wilcox keeps asking for. :))

--D

>
> --
> Thanks,
> Ruan Shiyang.
> >
>
>

2021-02-25 11:39:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

On Thu, Feb 18, 2021 at 08:20:18AM -0800, Darrick J. Wong wrote:
> > I think a nested call like this is necessary. That's why I use the open
> > code way.
>
> This might be a good place to implement an iomap_apply2() loop that
> actually /does/ walk all the extents of file1 and file2. There's now
> two users of this idiom.

Why do we need a special helper for that?

> (Possibly structured as a "get next mappings from both" generator
> function like Matthew Wilcox keeps asking for. :))

OTOH this might be a good first use for that.