2020-08-07 13:15:16

by Ruan Shiyang

Subject: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink

This patchset is an attempt to solve the problem of tracking shared
pages for fsdax.

Instead of a per-page tracking method, this patchset introduces a query
interface, get_shared_files(), implemented by each FS, to obtain the
owners of a shared page. It returns an owner list for the shared page.
Then, memory_failure() iterates the list so that it can notify each
process using a file that shares this page.

The design of the tracking method is as follows:
1. dax_associate_entry() associates the owner's info with this page
- For non-reflink case:
page->mapping,->index stores the file's mapping, offset in file.
A dax page is not shared by other files. dax_associate_entry() is
called only once. So, use page->mapping,->index to store the
owner's info.
- For reflink case:
page->mapping,->index stores the block device, offset in device.
A dax page is shared more than once. So, dax_associate_entry()
will be called more than once. We introduce page->zone_device_data
as a reflink counter, to indicate that this page is shared and how
many owners are now using it. page->mapping,->index is used to
store the block_device of the fs and the page offset within this
device.

2. dax_lock_page() calls query interface to lock each dax entry
- For non-reflink case:
owner's info is stored in page->mapping,->index.
So, it is easy to lock its dax entry.
- For reflink case:
owner's info is obtained by calling get_shared_files(), which is
implemented by FS.
The FS context could be found in block_device that stored by
page->mapping. Then lock the dax entries of the owners.

In memory_failure(), since the owner list has been obtained in
dax_lock_page(), just iterate the list and handle the error. This
patchset doesn't handle memory failure on FS metadata because I
haven't found a way to distinguish whether a page contains
metadata yet. Still working on it.
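The association rules above can be sketched in plain userspace C (a toy model, not kernel code; struct fake_page, assoc() and disassoc() are invented names standing in for struct page and dax_{associate,disassociate}_entry()):

```c
/* Userspace sketch of the association scheme described above: the first
 * association stores the file's mapping/index directly; any further
 * association repurposes the fields for the block device and bumps a
 * reflink counter.  All names here are invented for illustration. */
#include <assert.h>
#include <stddef.h>

struct fake_page {
	void *mapping;          /* file mapping, or block_device if shared */
	unsigned long index;    /* offset in file, or offset in device */
	unsigned long refcount; /* stand-in for page->zone_device_data */
};

/* associate one more owner with the page */
static void assoc(struct fake_page *p, void *mapping, unsigned long index,
		  void *bdev, unsigned long dev_off)
{
	if (!p->mapping) {      /* first owner: plain mapping/index */
		p->mapping = mapping;
		p->index = index;
	} else {                /* shared: switch to bdev + counter */
		p->mapping = bdev;
		p->index = dev_off;
		p->refcount++;
	}
}

/* drop one owner; clear the fields only when the last reference goes */
static void disassoc(struct fake_page *p)
{
	if (p->refcount) {
		p->refcount--;
	} else {
		p->mapping = NULL;
		p->index = 0;
	}
}
```

This mirrors why the non-reflink case needs no extra state: a single association and disassociation leave page->mapping,->index usable as-is.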

==
I also borrowed and made some changes to Goldwyn's patchsets.
These patches make up for the lack of a CoW mechanism in fsdax.

The rest are dax & reflink support for xfs.

(Rebased on v5.8)
==
Shiyang Ruan (8):
fs: introduce get_shared_files() for dax&reflink
fsdax, mm: track files sharing dax page for memory-failure
fsdax: introduce dax_copy_edges() for COW
fsdax: copy data before write
fsdax: replace mmap entry in case of CoW
fsdax: dedup file range to use a compare function
fs/xfs: handle CoW for fsdax write() path
fs/xfs: support dedupe for fsdax

fs/btrfs/reflink.c | 3 +-
fs/dax.c | 302 +++++++++++++++++++++++++++++++++++------
fs/ocfs2/file.c | 2 +-
fs/read_write.c | 11 +-
fs/xfs/xfs_bmap_util.c | 6 +-
fs/xfs/xfs_file.c | 10 +-
fs/xfs/xfs_iomap.c | 3 +-
fs/xfs/xfs_iops.c | 11 +-
fs/xfs/xfs_reflink.c | 80 ++++++-----
fs/xfs/xfs_super.c | 67 +++++++++
include/linux/dax.h | 18 ++-
include/linux/fs.h | 11 +-
include/linux/mm.h | 8 ++
mm/memory-failure.c | 138 ++++++++++++-------
14 files changed, 529 insertions(+), 141 deletions(-)

--
2.27.0




2020-08-07 13:15:19

by Ruan Shiyang

Subject: [RFC PATCH 8/8] fs/xfs: support dedupe for fsdax

Use xfs_break_layouts() to break the files' layouts when locking them,
and call dax_file_range_compare() to compare ranges when both files
have the DAX flag set.
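The lock-ordering idea behind the rewritten locking helper can be sketched as a userspace toy (assuming only that the two locks must always be taken in ascending inode-address order so concurrent remaps on the same pair cannot ABBA-deadlock; fake_inode, order_pair() and lock_both() are made-up names):

```c
/* Toy model of the ordering rule in
 * xfs_reflink_remap_lock_and_break_layouts(): normalize the pair by
 * address before locking, and take the lock only once when both sides
 * are the same inode.  (Comparing unrelated pointers like this is the
 * same idiom the kernel uses for lock ordering.) */
#include <assert.h>
#include <stddef.h>

struct fake_inode { int locked; };

#define SWAP(a, b) \
	do { struct fake_inode *t = (a); (a) = (b); (b) = t; } while (0)

/* Normalize so (*in, *out) is in ascending address order. */
static void order_pair(struct fake_inode **in, struct fake_inode **out)
{
	if (*in > *out)
		SWAP(*in, *out);
}

static void lock_both(struct fake_inode *in, struct fake_inode *out)
{
	order_pair(&in, &out);
	in->locked = 1;
	if (in != out)          /* same inode: take the lock only once */
		out->locked = 1;
}
```

Whichever order the caller passes the files in, the same inode is always locked first, which is why the patch swaps (inode_in, inode_out) up front and again in the unlock path.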

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/xfs/xfs_reflink.c | 78 ++++++++++++++++++++++++++------------------
1 file changed, 46 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f87ab78dd421..b2901ad1a269 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -29,6 +29,7 @@
#include "xfs_iomap.h"
#include "xfs_sb.h"
#include "xfs_ag_resv.h"
+#include <linux/dax.h>

/*
* Copy on Write of Shared Blocks
@@ -1185,47 +1186,41 @@ xfs_reflink_remap_blocks(
* back out both locks.
*/
static int
-xfs_iolock_two_inodes_and_break_layout(
- struct inode *src,
- struct inode *dest)
+xfs_reflink_remap_lock_and_break_layouts(
+ struct file *file_in,
+ struct file *file_out)
{
int error;
+ struct inode *inode_in = file_inode(file_in);
+ struct xfs_inode *src = XFS_I(inode_in);
+ struct inode *inode_out = file_inode(file_out);
+ struct xfs_inode *dest = XFS_I(inode_out);
+ uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;

- if (src > dest)
+ if (inode_in > inode_out) {
+ swap(inode_in, inode_out);
swap(src, dest);
-
-retry:
- /* Wait to break both inodes' layouts before we start locking. */
- error = break_layout(src, true);
- if (error)
- return error;
- if (src != dest) {
- error = break_layout(dest, true);
- if (error)
- return error;
}

- /* Lock one inode and make sure nobody got in and leased it. */
- inode_lock(src);
- error = break_layout(src, false);
+ inode_lock(inode_in);
+ xfs_ilock(src, XFS_MMAPLOCK_EXCL);
+ error = xfs_break_layouts(inode_in, &iolock, BREAK_UNMAP);
+ xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
if (error) {
- inode_unlock(src);
- if (error == -EWOULDBLOCK)
- goto retry;
+ inode_unlock(inode_in);
return error;
}

- if (src == dest)
+ if (inode_in == inode_out)
return 0;

- /* Lock the other inode and make sure nobody got in and leased it. */
- inode_lock_nested(dest, I_MUTEX_NONDIR2);
- error = break_layout(dest, false);
+ inode_lock_nested(inode_out, I_MUTEX_NONDIR2);
+ xfs_ilock(dest, XFS_MMAPLOCK_EXCL);
+ error = xfs_break_layouts(inode_out, &iolock, BREAK_UNMAP);
+ xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
if (error) {
- inode_unlock(src);
- inode_unlock(dest);
- if (error == -EWOULDBLOCK)
- goto retry;
+ inode_unlock(inode_in);
+ inode_unlock(inode_out);
return error;
}

@@ -1244,6 +1239,11 @@ xfs_reflink_remap_unlock(
struct xfs_inode *dest = XFS_I(inode_out);
bool same_inode = (inode_in == inode_out);

+ if (inode_in > inode_out) {
+ swap(inode_in, inode_out);
+ swap(src, dest);
+ }
+
xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
if (!same_inode)
xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
@@ -1274,6 +1274,14 @@ xfs_reflink_zero_posteof(
&xfs_buffered_write_iomap_ops);
}

+int xfs_reflink_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same)
+{
+ return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same,
+ &xfs_read_iomap_ops);
+}
+
/*
* Prepare two files for range cloning. Upon a successful return both inodes
* will have the iolock and mmaplock held, the page cache of the out file will
@@ -1318,9 +1326,10 @@ xfs_reflink_remap_prep(
struct xfs_inode *dest = XFS_I(inode_out);
bool same_inode = (inode_in == inode_out);
ssize_t ret;
+ compare_range_t cmp;

/* Lock both files against IO */
- ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
+ ret = xfs_reflink_remap_lock_and_break_layouts(file_in, file_out);
if (ret)
return ret;
if (same_inode)
@@ -1335,12 +1344,17 @@ xfs_reflink_remap_prep(
if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
goto out_unlock;

- /* Don't share DAX file data for now. */
- if (IS_DAX(inode_in) || IS_DAX(inode_out))
+ /* Don't share DAX file data with non-DAX file. */
+ if (IS_DAX(inode_in) != IS_DAX(inode_out))
goto out_unlock;

+ if (IS_DAX(inode_in))
+ cmp = xfs_reflink_dedupe_file_range_compare;
+ else
+ cmp = vfs_dedupe_file_range_compare;
+
ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags, vfs_dedupe_file_range_compare);
+ len, remap_flags, cmp);
if (ret < 0 || *len == 0)
goto out_unlock;

--
2.27.0



2020-08-07 13:15:40

by Ruan Shiyang

Subject: [RFC PATCH 6/8] fsdax: dedup file range to use a compare function

With DAX we cannot deal with readpage() etc. So, we create a
function callback to perform the file data comparison and pass
it to generic_remap_file_range_prep() so it can use iomap-based
functions.

This may not be the best way to solve this. Suggestions welcome.
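The shape of the change can be sketched outside the kernel (a toy model: the compare_range_t shape mirrors the typedef added to include/linux/fs.h, while range_is_same() and dedupe_prep() are invented stand-ins for the DAX compare and generic_remap_file_range_prep()):

```c
/* The generic prep code no longer hard-codes a page-cache compare; it
 * calls whatever compare callback the filesystem passed in.  Here the
 * "files" are plain buffers and the callback compares them directly,
 * the way a DAX compare works on directly mapped device memory. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef int (*compare_range_t)(const void *src, long srcoff,
			       const void *dest, long destoff,
			       long len, bool *is_same);

/* Toy compare: memcmp over the two ranges. */
static int range_is_same(const void *src, long srcoff, const void *dest,
			 long destoff, long len, bool *is_same)
{
	*is_same = !memcmp((const char *)src + srcoff,
			   (const char *)dest + destoff, len);
	return 0;
}

/* The generic path just invokes the callback it was handed. */
static int dedupe_prep(const void *src, const void *dest, long len,
		       compare_range_t cmp, bool *is_same)
{
	return cmp(src, 0, dest, 0, len, is_same);
}
```

A non-DAX filesystem would pass vfs_dedupe_file_range_compare here; a DAX file passes the iomap-based compare, which is exactly the dispatch patch 8 adds in xfs_reflink_remap_prep().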

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/btrfs/reflink.c | 3 +-
fs/dax.c | 67 ++++++++++++++++++++++++++++++++++++++++++++
fs/ocfs2/file.c | 2 +-
fs/read_write.c | 11 ++++----
fs/xfs/xfs_reflink.c | 2 +-
include/linux/dax.h | 5 ++++
include/linux/fs.h | 9 +++++-
7 files changed, 90 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 040009d1cc31..d0412bc338da 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -765,7 +765,8 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
return ret;

return generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags);
+ len, remap_flags,
+ vfs_dedupe_file_range_compare);
}

loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
diff --git a/fs/dax.c b/fs/dax.c
index 28b8e23b11ac..80a9946fd25a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,6 +30,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/fs_dax.h>

+#define MIN(a, b) (((a) < (b)) ? (a) : (b))
+
static inline unsigned int pe_order(enum page_entry_size pe_size)
{
if (pe_size == PE_SIZE_PTE)
@@ -1874,3 +1876,68 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
return dax_insert_pfn_mkwrite(vmf, pfn, order);
}
EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
+ loff_t destoff, loff_t len, bool *is_same,
+ const struct iomap_ops *ops)
+{
+ void *saddr, *daddr;
+ struct iomap smap = { 0 };
+ struct iomap dmap = { 0 };
+ bool same = false;
+ loff_t cmp_len;
+ int id, ret = 0;
+
+ id = dax_read_lock();
+ while (len) {
+ ret = ops->iomap_begin(src, srcoff, len, 0, &smap, NULL);
+ if (ret < 0)
+ goto out_src;
+ cmp_len = MIN(len, smap.offset + smap.length - srcoff);
+
+ ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &dmap, NULL);
+ if (ret < 0)
+ goto out_dest;
+ cmp_len = MIN(cmp_len, dmap.offset + dmap.length - destoff);
+
+ if (smap.type == IOMAP_HOLE && dmap.type == IOMAP_HOLE)
+ goto next;
+
+ if (smap.type == IOMAP_HOLE || dmap.type == IOMAP_HOLE) {
+ same = false;
+ goto next;
+ }
+
+ ret = dax_iomap_direct_access(&smap, srcoff,
+ ALIGN(srcoff + cmp_len, PAGE_SIZE), NULL, &saddr);
+ if (ret < 0)
+ goto out_dest;
+
+ ret = dax_iomap_direct_access(&dmap, destoff,
+ ALIGN(destoff + cmp_len, PAGE_SIZE), NULL, &daddr);
+ if (ret < 0)
+ goto out_dest;
+
+ same = !memcmp(saddr, daddr, cmp_len);
+ if (!same)
+ break;
+next:
+ len -= cmp_len;
+ srcoff += cmp_len;
+ destoff += cmp_len;
+out_dest:
+ if (ops->iomap_end)
+ ops->iomap_end(dest, destoff, cmp_len, 0, 0, &dmap);
+out_src:
+ if (ops->iomap_end)
+ ops->iomap_end(src, srcoff, len, 0, 0, &smap);
+
+ if (ret < 0)
+ goto out;
+ }
+ *is_same = same;
+out:
+ dax_read_unlock(id);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(dax_file_range_compare);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 85979e2214b3..9d101f129d16 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2591,7 +2591,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
goto out_unlock;

ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- &len, remap_flags);
+ &len, remap_flags, vfs_dedupe_file_range_compare);
if (ret < 0 || len == 0)
goto out_unlock;

diff --git a/fs/read_write.c b/fs/read_write.c
index 4fb797822567..2974a624f232 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1927,7 +1927,7 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
* Compare extents of two files to see if they are the same.
* Caller must have locked both inodes to prevent write races.
*/
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
struct inode *dest, loff_t destoff,
loff_t len, bool *is_same)
{
@@ -2008,6 +2008,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
out_error:
return error;
}
+EXPORT_SYMBOL_GPL(vfs_dedupe_file_range_compare);

/*
* Check that the two inodes are eligible for cloning, the ranges make
@@ -2019,7 +2020,8 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
*/
int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
- loff_t *len, unsigned int remap_flags)
+ loff_t *len, unsigned int remap_flags,
+ compare_range_t compare)
{
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -2078,9 +2080,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
*/
if (remap_flags & REMAP_FILE_DEDUP) {
bool is_same = false;
-
- ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
- inode_out, pos_out, *len, &is_same);
+ ret = (*compare)(inode_in, pos_in,
+ inode_out, pos_out, *len, &is_same);
if (ret)
return ret;
if (!is_same)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 107bf2a2f344..792217cd1e64 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1338,7 +1338,7 @@ xfs_reflink_remap_prep(
goto out_unlock;

ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags);
+ len, remap_flags, vfs_dedupe_file_range_compare);
if (ret < 0 || *len == 0)
goto out_unlock;

diff --git a/include/linux/dax.h b/include/linux/dax.h
index d6d9f4b721bc..607e21933928 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -223,6 +223,11 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
struct iomap *iomap);
+int dax_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same,
+ const struct iomap_ops *ops);
+
static inline bool dax_mapping(struct address_space *mapping)
{
return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 81de3d2739b9..c2d985295db5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1924,13 +1924,20 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
unsigned long, loff_t *, rwf_t);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
+extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same);
+typedef int (*compare_range_t)(struct inode *src, loff_t srcpos,
+ struct inode *dest, loff_t destpos,
+ loff_t len, bool *is_same);
extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags);
extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *count,
- unsigned int remap_flags);
+ unsigned int remap_flags,
+ compare_range_t cmp);
extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
--
2.27.0



2020-08-07 13:16:22

by Ruan Shiyang

Subject: [RFC PATCH 4/8] fsdax: copy data before write

Add dax_copy_edges() calls to the dax actor functions to perform CoW.
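A rough model of when the fault path takes the CoW branch after this patch (the enum, flag value and needs_cow() are invented for illustration; the real code also rechecks srcmap after the goto and handles read faults separately):

```c
/* Toy predicate for the dispatch this patch adds: a fault takes the CoW
 * path either when the extent is mapped and srcmap points at real data,
 * or when the extent is unwritten but flagged shared and the fault is a
 * write.  Values are stand-ins, not the kernel's IOMAP_* constants. */
#include <assert.h>
#include <stdbool.h>

enum ext_type { EXT_HOLE, EXT_MAPPED, EXT_UNWRITTEN };
#define F_SHARED 0x1

static bool needs_cow(bool write, enum ext_type type, unsigned int flags,
		      enum ext_type srcmap_type)
{
	if (type == EXT_MAPPED)
		return srcmap_type != EXT_HOLE; /* copy edges from old extent */
	if (type == EXT_UNWRITTEN)
		return write && (flags & F_SHARED); /* the "goto cow" case */
	return false;
}
```

This is why the patch replaces the bare IOMAP_UNWRITTEN fallthrough with a `goto cow` guarded by `write && iomap.flags & IOMAP_F_SHARED`.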

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 308678c58d4d..65553e3f7602 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1208,7 +1208,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
return iov_iter_zero(min(length, end - pos), iter);
}

- if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+ if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
+ !(iomap->flags & IOMAP_F_SHARED)))
return -EIO;

/*
@@ -1247,6 +1248,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
break;
}

+ if (iomap != srcmap) {
+ ret = dax_copy_edges(pos, length, srcmap, kaddr, false);
+ if (ret)
+ break;
+ }
+
map_len = PFN_PHYS(map_len);
kaddr += offset;
map_len -= offset;
@@ -1358,6 +1365,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
vm_fault_t ret = 0;
void *entry;
pfn_t pfn;
+ void *kaddr;

trace_dax_pte_fault(inode, vmf, ret);
/*
@@ -1439,19 +1447,27 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,

switch (iomap.type) {
case IOMAP_MAPPED:
+cow:
if (iomap.flags & IOMAP_F_NEW) {
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
major = VM_FAULT_MAJOR;
}
error = dax_iomap_direct_access(&iomap, pos, PAGE_SIZE, &pfn,
- NULL);
+ &kaddr);
if (error < 0)
goto error_finish_iomap;

entry = dax_insert_entry(&xas, mapping, vmf, iomap.addr, entry,
pfn, 0, write && !sync);

+ if (srcmap.type != IOMAP_HOLE) {
+ error = dax_copy_edges(pos, PAGE_SIZE, &srcmap, kaddr,
+ false);
+ if (error)
+ goto error_finish_iomap;
+ }
+
/*
* If we are doing synchronous page fault and inode needs fsync,
* we can insert PTE into page tables only after that happens.
@@ -1475,12 +1491,15 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,

goto finish_iomap;
case IOMAP_UNWRITTEN:
+ if (write && iomap.flags & IOMAP_F_SHARED)
+ goto cow;
+ fallthrough;
case IOMAP_HOLE:
if (!write) {
ret = dax_load_hole(&xas, mapping, &entry, vmf);
goto finish_iomap;
}
- /*FALLTHRU*/
+ fallthrough;
default:
WARN_ON_ONCE(1);
error = -EIO;
@@ -1582,6 +1601,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
loff_t pos;
int error;
pfn_t pfn;
+ void *kaddr;

/*
* Check whether offset isn't beyond end of file now. Caller is
@@ -1663,14 +1683,22 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,

switch (iomap.type) {
case IOMAP_MAPPED:
+cow:
error = dax_iomap_direct_access(&iomap, pos, PMD_SIZE, &pfn,
- NULL);
+ &kaddr);
if (error < 0)
goto finish_iomap;

entry = dax_insert_entry(&xas, mapping, vmf, iomap.addr, entry,
pfn, DAX_PMD, write && !sync);

+ if (srcmap.type != IOMAP_HOLE) {
+ error = dax_copy_edges(pos, PMD_SIZE, &srcmap, kaddr,
+ true);
+ if (error)
+ goto unlock_entry;
+ }
+
/*
* If we are doing synchronous page fault and inode needs fsync,
* we can insert PMD into page tables only after that happens.
@@ -1689,6 +1717,9 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
result = vmf_insert_pfn_pmd(vmf, pfn, write);
break;
case IOMAP_UNWRITTEN:
+ if (write && iomap.flags & IOMAP_F_SHARED)
+ goto cow;
+ fallthrough;
case IOMAP_HOLE:
if (WARN_ON_ONCE(write))
break;
--
2.27.0



2020-08-07 13:16:43

by Ruan Shiyang

Subject: [RFC PATCH 3/8] fsdax: introduce dax_copy_edges() for COW

Add an address output parameter to dax_iomap_pfn() in order to perform a
memcpy() in the CoW case. Since this function now outputs both the address
and the pfn, rename it to dax_iomap_direct_access().

dax_copy_edges() is a helper function that performs a copy from one part of
the device to another for data that is not page aligned.
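The edge arithmetic can be exercised in userspace (a toy on plain buffers; copy_edges() is an invented stand-in that mirrors the offset/pg_end computation of dax_copy_edges() without the DAX mapping or memcpy_mcsafe()):

```c
/* For a write of `len` bytes at `pos` into a `page_size` block, only the
 * head ([0, offset)) and tail ([offset+len, pg_end)) of the block must
 * be copied from the old extent; the middle is about to be overwritten
 * anyway.  dst and src point at the start of the enclosing block. */
#include <assert.h>
#include <string.h>

static void copy_edges(char *dst, const char *src, long pos, long len,
		       long page_size)
{
	long offset = pos & (page_size - 1);    /* write start in block */
	long end = offset + len;                /* write end in block */
	long pg_end = (end + page_size - 1) & ~(page_size - 1);

	if (offset)                             /* head before the write */
		memcpy(dst, src, offset);
	if (end < pg_end)                       /* tail after the write */
		memcpy(dst + end, src + end, pg_end - end);
}
```

For example, a 3-byte write at offset 2 of an 8-byte block copies bytes [0,2) and [5,8) from the source block and leaves [2,5) for the write itself.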

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 58 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 47380f75ef38..308678c58d4d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1043,8 +1043,8 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
}

-static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
- pfn_t *pfnp)
+static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
+ pfn_t *pfnp, void **addr)
{
const sector_t sector = dax_iomap_sector(iomap, pos);
pgoff_t pgoff;
@@ -1055,12 +1055,14 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
if (rc)
return rc;
id = dax_read_lock();
- length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
- NULL, pfnp);
+ length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size), addr,
+ pfnp);
if (length < 0) {
rc = length;
goto out;
}
+ if (!pfnp)
+ goto out_check_addr;
rc = -EINVAL;
if (PFN_PHYS(length) < size)
goto out;
@@ -1070,11 +1072,59 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
if (length > 1 && !pfn_t_devmap(*pfnp))
goto out;
rc = 0;
+
+out_check_addr:
+ if (!addr)
+ goto out;
+ if (!*addr)
+ rc = -EFAULT;
out:
dax_read_unlock(id);
return rc;
}

+/*
+ * dax_copy_edges - Copies the part of the pages not included in
+ * the write, but required for CoW because
+ * offset/offset+length are not page aligned.
+ */
+static int dax_copy_edges(loff_t pos, loff_t length, struct iomap *srcmap,
+ void *daddr, bool pmd)
+{
+ size_t page_size = pmd ? PMD_SIZE : PAGE_SIZE;
+ loff_t offset = pos & (page_size - 1);
+ size_t size = ALIGN(offset + length, page_size);
+ loff_t end = pos + length;
+ loff_t pg_end = round_up(end, page_size);
+ void *saddr = 0;
+ int ret = 0;
+
+ ret = dax_iomap_direct_access(srcmap, pos, size, NULL, &saddr);
+ if (ret)
+ return ret;
+ /*
+ * Copy the first part of the page
+ * Note: we pass offset as length
+ */
+ if (offset) {
+ if (saddr)
+ ret = memcpy_mcsafe(daddr, saddr, offset);
+ else
+ memset(daddr, 0, offset);
+ }
+
+ /* Copy the last part of the range */
+ if (end < pg_end) {
+ if (saddr)
+ ret = memcpy_mcsafe(daddr + offset + length,
+ saddr + offset + length, pg_end - end);
+ else
+ memset(daddr + offset + length, 0,
+ pg_end - end);
+ }
+ return ret;
+}
+
/*
* The user has performed a load from a hole in the file. Allocating a new
* page in the file would cause excessive storage usage for workloads with
@@ -1394,7 +1444,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
major = VM_FAULT_MAJOR;
}
- error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
+ error = dax_iomap_direct_access(&iomap, pos, PAGE_SIZE, &pfn,
+ NULL);
if (error < 0)
goto error_finish_iomap;

@@ -1612,7 +1663,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,

switch (iomap.type) {
case IOMAP_MAPPED:
- error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
+ error = dax_iomap_direct_access(&iomap, pos, PMD_SIZE, &pfn,
+ NULL);
if (error < 0)
goto finish_iomap;

--
2.27.0



2020-08-07 13:17:53

by Ruan Shiyang

Subject: [RFC PATCH 2/8] fsdax, mm: track files sharing dax page for memory-failure

When memory-failure occurs on a pmem device which contains a filesystem,
we need to find out which files are in use, and then notify the processes
that are using these files so they can handle the error.

The design of the tracking method is as follows:
1. dax_associate_entry() associates the owner's info with this page
- For non-reflink case:
page->mapping,->index stores the file's mapping, offset in file.
A dax page is not shared by other files. dax_associate_entry() is
called only once. So, use page->mapping,->index to store the
owner's info.
- For reflink case:
page->mapping,->index stores the block device, offset in device.
A dax page is shared more than once. So, dax_associate_entry()
will be called more than once. We introduce page->zone_device_data
as a reflink counter, to indicate that this page is shared and how
many owners are now using it. page->mapping,->index is used to
store the block_device of the fs and the page offset within this
device.

2. dax_lock_page() calls query interface to lock each dax entry
- For non-reflink case:
owner's info is stored in page->mapping,->index.
So, it is easy to lock its dax entry.
- For reflink case:
owner's info is obtained by calling get_shared_files(), which is
implemented by FS.
The FS context could be found in block_device that stored by
page->mapping. Then lock the dax entries of the owners.

In memory_failure(), since the owner list has been obtained in
dax_lock_page(), just iterate the list and handle the error.
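The consumer side can be sketched as a plain list walk (a userspace toy; struct owner and for_each_owner() are invented stand-ins for struct shared_files and the list_for_each_entry() loop the real memory_failure() path would run):

```c
/* Once dax_lock_page() has filled the owner list, the error handler
 * simply walks it and acts on each (mapping, index) pair.  A singly
 * linked list stands in for the kernel's list_head here. */
#include <assert.h>
#include <stddef.h>

struct owner {
	void *mapping;          /* file this page belongs to */
	unsigned long index;    /* offset of the page in that file */
	struct owner *next;
};

/* Invoke `handle` (if any) for every owner; return how many were seen. */
static int for_each_owner(struct owner *head,
			  void (*handle)(struct owner *))
{
	int n = 0;

	for (struct owner *o = head; o; o = o->next) {
		if (handle)
			handle(o);
		n++;
	}
	return n;
}
```

The point of the design is that this loop is the only part memory_failure() needs: all the reflink-awareness lives behind get_shared_files() in the filesystem.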

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 111 +++++++++++++++++++++++++++--------
include/linux/dax.h | 6 +-
include/linux/mm.h | 8 +++
mm/memory-failure.c | 138 +++++++++++++++++++++++++++-----------------
4 files changed, 183 insertions(+), 80 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..47380f75ef38 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -329,7 +329,7 @@ static unsigned long dax_end_pfn(void *entry)
* offsets.
*/
static void dax_associate_entry(void *entry, struct address_space *mapping,
- struct vm_area_struct *vma, unsigned long address)
+ struct vm_fault *vmf, pgoff_t offset)
{
unsigned long size = dax_entry_size(entry), pfn, index;
int i = 0;
@@ -337,13 +337,27 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
return;

- index = linear_page_index(vma, address & ~(size - 1));
+ index = linear_page_index(vmf->vma, vmf->address & ~(size - 1));
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);

- WARN_ON_ONCE(page->mapping);
- page->mapping = mapping;
- page->index = index + i++;
+ BUG_ON(!page->mapping && IS_FSDAX_SHARED(page));
+
+ /* Use zone_device_data as reflink counter here.
+ * If one page is associated for the first time, then use the
+ * ->mapping,->index as normal. For the second time it is
+ * assocated, we store the block_device that this page belongs
+ * to in ->mapping and the offset within this block_device in
+ * ->index, and increase the reflink counter.
+ */
+ if (!page->mapping) {
+ page->mapping = mapping;
+ page->index = index + i++;
+ } else {
+ page->mapping = (struct address_space *)mapping->host->i_sb->s_bdev;
+ page->index = offset;
+ page->zone_device_data++;
+ }
}
}

@@ -359,9 +373,12 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
struct page *page = pfn_to_page(pfn);

WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
- WARN_ON_ONCE(page->mapping && page->mapping != mapping);
- page->mapping = NULL;
- page->index = 0;
+ if (!IS_FSDAX_SHARED(page)) {
+ page->mapping = NULL;
+ page->index = 0;
+ } else {
+ page->zone_device_data--;
+ }
}
}

@@ -386,7 +403,7 @@ static struct page *dax_busy_page(void *entry)
* Return: A cookie to pass to dax_unlock_page() or 0 if the entry could
* not be locked.
*/
-dax_entry_t dax_lock_page(struct page *page)
+void _dax_lock_page(struct page *page, struct shared_files *sfp)
{
XA_STATE(xas, NULL, 0);
void *entry;
@@ -394,7 +411,7 @@ dax_entry_t dax_lock_page(struct page *page)
/* Ensure page->mapping isn't freed while we look at it */
rcu_read_lock();
for (;;) {
- struct address_space *mapping = READ_ONCE(page->mapping);
+ struct address_space *mapping = READ_ONCE(sfp->mapping);

entry = NULL;
if (!mapping || !dax_mapping(mapping))
@@ -413,11 +430,11 @@ dax_entry_t dax_lock_page(struct page *page)

xas.xa = &mapping->i_pages;
xas_lock_irq(&xas);
- if (mapping != page->mapping) {
+ if (mapping != sfp->mapping) {
xas_unlock_irq(&xas);
continue;
}
- xas_set(&xas, page->index);
+ xas_set(&xas, sfp->index);
entry = xas_load(&xas);
if (dax_is_locked(entry)) {
rcu_read_unlock();
@@ -430,18 +447,61 @@ dax_entry_t dax_lock_page(struct page *page)
break;
}
rcu_read_unlock();
- return (dax_entry_t)entry;
+ sfp->cookie = (dax_entry_t)entry;
+}
+
+int dax_lock_page(struct page *page, struct list_head *shared_files)
+{
+ struct shared_files *sfp;
+
+ if (IS_FSDAX_SHARED(page)) {
+ struct block_device *bdev = (struct block_device *)page->mapping;
+
+ bdev->bd_super->s_op->get_shared_files(bdev->bd_super,
+ page->index, shared_files);
+ list_for_each_entry(sfp, shared_files, list) {
+ _dax_lock_page(page, sfp);
+ if (!sfp->cookie)
+ return 1;
+ }
+ } else {
+ sfp = kmalloc(sizeof(*sfp), GFP_KERNEL);
+ sfp->mapping = page->mapping;
+ sfp->index = page->index;
+ list_add(&sfp->list, shared_files);
+ _dax_lock_page(page, sfp);
+ if (!sfp->cookie)
+ return 1;
+ }
+ return 0;
}

-void dax_unlock_page(struct page *page, dax_entry_t cookie)
+void _dax_unlock_page(struct page *page, struct shared_files *sfp)
{
- struct address_space *mapping = page->mapping;
- XA_STATE(xas, &mapping->i_pages, page->index);
+ struct address_space *mapping = sfp->mapping;

+ XA_STATE(xas, &mapping->i_pages, sfp->index);
if (S_ISCHR(mapping->host->i_mode))
return;

- dax_unlock_entry(&xas, (void *)cookie);
+ dax_unlock_entry(&xas, (void *)sfp->cookie);
+}
+
+void dax_unlock_page(struct page *page, struct list_head *shared_files)
+{
+ struct shared_files *sfp, *next;
+
+ if (IS_FSDAX_SHARED(page)) {
+ list_for_each_entry_safe(sfp, next, shared_files, list) {
+ if (sfp->cookie)
+ _dax_unlock_page(page, sfp);
+ kfree(sfp);
+ }
+ } else {
+ sfp = list_first_entry(shared_files, struct shared_files, list);
+ if (sfp->cookie)
+ _dax_unlock_page(page, sfp);
+ kfree(sfp);
+ }
}

/*
@@ -715,7 +775,8 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
*/
static void *dax_insert_entry(struct xa_state *xas,
struct address_space *mapping, struct vm_fault *vmf,
- void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+ pgoff_t bdoff, void *entry, pfn_t pfn, unsigned long flags,
+ bool dirty)
{
void *new_entry = dax_make_entry(pfn, flags);

@@ -738,7 +799,7 @@ static void *dax_insert_entry(struct xa_state *xas,
void *old;

dax_disassociate_entry(entry, mapping, false);
- dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
+ dax_associate_entry(new_entry, mapping, vmf, bdoff);
/*
* Only swap our new entry into the page cache if the current
* entry is a zero page or an empty entry. If a normal PTE or
@@ -1030,7 +1091,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
vm_fault_t ret;

- *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
+ *entry = dax_insert_entry(xas, mapping, vmf, 0, *entry, pfn,
DAX_ZERO_PAGE, false);

ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
@@ -1337,8 +1398,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
if (error < 0)
goto error_finish_iomap;

- entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
- 0, write && !sync);
+ entry = dax_insert_entry(&xas, mapping, vmf, iomap.addr, entry,
+ pfn, 0, write && !sync);

/*
* If we are doing synchronous page fault and inode needs fsync,
@@ -1418,7 +1479,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
goto fallback;

pfn = page_to_pfn_t(zero_page);
- *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
+ *entry = dax_insert_entry(xas, mapping, vmf, 0, *entry, pfn,
DAX_PMD | DAX_ZERO_PAGE, false);

if (arch_needs_pgtable_deposit()) {
@@ -1555,8 +1616,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
if (error < 0)
goto finish_iomap;

- entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
- DAX_PMD, write && !sync);
+ entry = dax_insert_entry(&xas, mapping, vmf, iomap.addr, entry,
+ pfn, DAX_PMD, write && !sync);

/*
* If we are doing synchronous page fault and inode needs fsync,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0a85e321d6b4..d6d9f4b721bc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,7 +8,7 @@

/* Flag for synchronous flush */
#define DAXDEV_F_SYNC (1UL << 0)
-
+#define IS_FSDAX_SHARED(p) ((unsigned long)(p)->zone_device_data != 0)
typedef unsigned long dax_entry_t;

struct iomap_ops;
@@ -148,8 +148,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
struct dax_device *dax_dev, struct writeback_control *wbc);

struct page *dax_layout_busy_page(struct address_space *mapping);
-dax_entry_t dax_lock_page(struct page *page);
-void dax_unlock_page(struct page *page, dax_entry_t cookie);
+int dax_lock_page(struct page *page, struct list_head *shared_files);
+void dax_unlock_page(struct page *page, struct list_head *shared_files);
#else
static inline bool bdev_dax_supported(struct block_device *bdev,
int blocksize)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..bffccc34f1a1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1115,6 +1115,14 @@ static inline bool is_device_private_page(const struct page *page)
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}

+static inline bool is_device_fsdax_page(const struct page *page)
+{
+ return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+ IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+ is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_FS_DAX;
+}
+
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47b8ccb1fb9b..53cc65821c99 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -93,9 +93,13 @@ static int hwpoison_filter_dev(struct page *p)
if (PageSlab(p))
return -EINVAL;

- mapping = page_mapping(p);
- if (mapping == NULL || mapping->host == NULL)
- return -EINVAL;
+ if (is_device_fsdax_page(p) && IS_FSDAX_SHARED(p)) {
+ dev = ((struct block_device *)p->mapping)->s_dev;
+ } else {
+ mapping = page_mapping(p);
+ if (mapping == NULL || mapping->host == NULL)
+ return -EINVAL;
+ }

dev = mapping->host->i_sb->s_dev;
if (hwpoison_filter_dev_major != ~0U &&
@@ -263,9 +267,8 @@ void shake_page(struct page *p, int access)
EXPORT_SYMBOL_GPL(shake_page);

static unsigned long dev_pagemap_mapping_shift(struct page *page,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma, unsigned long address)
{
- unsigned long address = vma_address(page, vma);
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -306,8 +309,8 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
* Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
*/
static void add_to_kill(struct task_struct *tsk, struct page *p,
- struct vm_area_struct *vma,
- struct list_head *to_kill)
+ struct address_space *mapping, pgoff_t offset,
+ struct vm_area_struct *vma, struct list_head *to_kill)
{
struct to_kill *tk;

@@ -317,12 +320,18 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
return;
}

- tk->addr = page_address_in_vma(p, vma);
- if (is_zone_device_page(p))
- tk->size_shift = dev_pagemap_mapping_shift(p, vma);
- else
- tk->size_shift = page_shift(compound_head(p));
-
+ if (is_device_fsdax_page(p)) {
+ tk->addr = vma->vm_start +
+ ((offset - vma->vm_pgoff) << PAGE_SHIFT);
+ tk->size_shift = dev_pagemap_mapping_shift(p, vma, tk->addr);
+ } else {
+ tk->addr = page_address_in_vma(p, vma);
+ if (is_zone_device_page(p)) {
+ tk->size_shift = dev_pagemap_mapping_shift(p, vma,
+ vma_address(p, vma));
+ } else
+ tk->size_shift = page_shift(compound_head(p));
+ }
/*
* Send SIGKILL if "tk->addr == -EFAULT". Also, as
* "tk->size_shift" is always non-zero for !is_zone_device_page(),
@@ -468,33 +477,26 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
if (!page_mapped_in_vma(page, vma))
continue;
if (vma->vm_mm == t->mm)
- add_to_kill(t, page, vma, to_kill);
+ add_to_kill(t, page, NULL, 0, vma, to_kill);
}
}
read_unlock(&tasklist_lock);
page_unlock_anon_vma_read(av);
}

-/*
- * Collect processes when the error hit a file mapped page.
- */
-static void collect_procs_file(struct page *page, struct list_head *to_kill,
- int force_early)
+static void collect_procs_file_one(struct page *page, struct list_head *to_kill,
+ struct address_space *mapping, pgoff_t pgoff, bool force_early)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
- struct address_space *mapping = page->mapping;

i_mmap_lock_read(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
- pgoff_t pgoff = page_to_pgoff(page);
struct task_struct *t = task_early_kill(tsk, force_early);
-
if (!t)
continue;
- vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
- pgoff) {
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
/*
* Send early kill signal to tasks where a vma covers
* the page but the corrupted page is not necessarily
@@ -502,19 +504,39 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
* Assume applications who requested early kill want
* to be informed of all such data corruptions.
*/
- if (vma->vm_mm == t->mm)
- add_to_kill(t, page, vma, to_kill);
+ if (vma->vm_mm == t->mm) {
+ add_to_kill(t, page, mapping, pgoff, vma,
+ to_kill);
+ }
}
}
read_unlock(&tasklist_lock);
i_mmap_unlock_read(mapping);
}

+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+ struct list_head *shared_files, int force_early)
+{
+ struct shared_files *sfp;
+
+ if (shared_files) {
+ list_for_each_entry(sfp, shared_files, list) {
+ collect_procs_file_one(page, to_kill, sfp->mapping,
+ sfp->index, force_early);
+ }
+ } else
+ collect_procs_file_one(page, to_kill, page->mapping,
+ page_to_pgoff(page), force_early);
+}
+
/*
* Collect the processes who have the corrupted page mapped to kill.
*/
static void collect_procs(struct page *page, struct list_head *tokill,
- int force_early)
+ struct list_head *shared_files, int force_early)
{
if (!page->mapping)
return;
@@ -522,7 +544,33 @@ static void collect_procs(struct page *page, struct list_head *tokill,
if (PageAnon(page))
collect_procs_anon(page, tokill, force_early);
else
- collect_procs_file(page, tokill, force_early);
+ collect_procs_file(page, tokill, shared_files, force_early);
+}
+
+static void unmap_mappings_range(struct list_head *to_kill,
+ struct list_head *shared_files)
+{
+ unsigned long size = 0;
+ loff_t start = 0;
+ struct to_kill *tk;
+ struct shared_files *sfp;
+
+ list_for_each_entry(tk, to_kill, nd)
+ if (tk->size_shift)
+ size = max(size, 1UL << tk->size_shift);
+ if (size) {
+ /*
+ * Unmap the largest mapping to avoid breaking up
+ * device-dax mappings which are constant size. The
+ * actual size of the mapping being torn down is
+ * communicated in siginfo, see kill_proc()
+ */
+ list_for_each_entry(sfp, shared_files, list) {
+ start = (sfp->index << PAGE_SHIFT) & ~(size - 1);
+ unmap_mapping_range(sfp->mapping, start,
+ start + size, 0);
+ }
+ }
}

static const char *action_name[] = {
@@ -1026,7 +1074,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
* there's nothing that can be done.
*/
if (kill)
- collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
+ collect_procs(hpage, &tokill, NULL, flags & MF_ACTION_REQUIRED);

if (!PageHuge(hpage)) {
unmap_success = try_to_unmap(hpage, ttu);
@@ -1181,12 +1229,10 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
{
struct page *page = pfn_to_page(pfn);
const bool unmap_success = true;
- unsigned long size = 0;
- struct to_kill *tk;
LIST_HEAD(tokill);
+ LIST_HEAD(shared_files);
int rc = -EBUSY;
- loff_t start;
- dax_entry_t cookie;
+ int error = 0;

/*
* Prevent the inode from being freed while we are interrogating
@@ -1195,9 +1241,9 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
* also prevents changes to the mapping of this pfn until
* poison signaling is complete.
*/
- cookie = dax_lock_page(page);
- if (!cookie)
- goto out;
+ error = dax_lock_page(page, &shared_files);
+ if (error)
+ goto unlock;

if (hwpoison_filter(page)) {
rc = 0;
@@ -1225,26 +1271,14 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
* SIGBUS (i.e. MF_MUST_KILL)
*/
flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
- collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
+ collect_procs(page, &tokill, &shared_files, flags & MF_ACTION_REQUIRED);
+
+ unmap_mappings_range(&tokill, &shared_files);

- list_for_each_entry(tk, &tokill, nd)
- if (tk->size_shift)
- size = max(size, 1UL << tk->size_shift);
- if (size) {
- /*
- * Unmap the largest mapping to avoid breaking up
- * device-dax mappings which are constant size. The
- * actual size of the mapping being torn down is
- * communicated in siginfo, see kill_proc()
- */
- start = (page->index << PAGE_SHIFT) & ~(size - 1);
- unmap_mapping_range(page->mapping, start, start + size, 0);
- }
kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
rc = 0;
unlock:
- dax_unlock_page(page, cookie);
-out:
+ dax_unlock_page(page, &shared_files);
/* drop pgmap ref acquired in caller */
put_dev_pagemap(pgmap);
action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
--
2.27.0



2020-08-07 13:18:23

by Ruan Shiyang

Subject: [RFC PATCH 7/8] fs/xfs: handle CoW for fsdax write() path

In fsdax mode, WRITE and ZERO on a shared extent require the CoW
mechanism to be performed. After CoW, the new extents need to be
remapped to the file.

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 6 +++++-
fs/xfs/xfs_file.c | 10 +++++++---
fs/xfs/xfs_iomap.c | 3 ++-
fs/xfs/xfs_iops.c | 11 ++++++++---
fs/xfs/xfs_reflink.c | 2 ++
5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index f37f5cc4b19f..5d09d6c454b6 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -969,10 +969,14 @@ xfs_free_file_space(
if (offset + len > XFS_ISIZE(ip))
len = XFS_ISIZE(ip) - offset;
error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
- &xfs_buffered_write_iomap_ops);
+ IS_DAX(VFS_I(ip)) ?
+ &xfs_direct_write_iomap_ops : &xfs_buffered_write_iomap_ops);
if (error)
return error;

+ if (xfs_is_reflink_inode(ip))
+ xfs_reflink_end_cow(ip, offset, len);
+
/*
* If we zeroed right up to EOF and EOF straddles a page boundary we
* must make sure that the post-EOF area is also zeroed because the
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d..45041913129b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -588,9 +588,13 @@ xfs_file_dax_write(

trace_xfs_file_dax_write(ip, count, pos);
ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
- if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
- i_size_write(inode, iocb->ki_pos);
- error = xfs_setfilesize(ip, pos, ret);
+ if (ret > 0) {
+ if (iocb->ki_pos > i_size_read(inode)) {
+ i_size_write(inode, iocb->ki_pos);
+ error = xfs_setfilesize(ip, pos, ret);
+ }
+ if (xfs_is_cow_inode(ip))
+ xfs_reflink_end_cow(ip, pos, ret);
}
out:
xfs_iunlock(ip, iolock);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index b9a8c3798e08..a1fc75f11cf9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -748,13 +748,14 @@ xfs_direct_write_iomap_begin(
goto out_unlock;

if (imap_needs_cow(ip, flags, &imap, nimaps)) {
+ bool need_convert = (flags & IOMAP_DIRECT) || IS_DAX(inode);
error = -EAGAIN;
if (flags & IOMAP_NOWAIT)
goto out_unlock;

/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
- &lockmode, flags & IOMAP_DIRECT);
+ &lockmode, need_convert);
if (error)
goto out_unlock;
if (shared)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 80a13c8561d8..6dd6a973ea75 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -860,6 +860,7 @@ xfs_setattr_size(
int error;
uint lock_flags = 0;
bool did_zeroing = false;
+ const struct iomap_ops *ops;

ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
@@ -906,13 +907,17 @@ xfs_setattr_size(
* extension, or zeroing out the rest of the block on a downward
* truncate.
*/
+ if (IS_DAX(inode))
+ ops = &xfs_direct_write_iomap_ops;
+ else
+ ops = &xfs_buffered_write_iomap_ops;
+
if (newsize > oldsize) {
trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
error = iomap_zero_range(inode, oldsize, newsize - oldsize,
- &did_zeroing, &xfs_buffered_write_iomap_ops);
+ &did_zeroing, ops);
} else {
- error = iomap_truncate_page(inode, newsize, &did_zeroing,
- &xfs_buffered_write_iomap_ops);
+ error = iomap_truncate_page(inode, newsize, &did_zeroing, ops);
}

if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 792217cd1e64..f87ab78dd421 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1269,6 +1269,8 @@ xfs_reflink_zero_posteof(

trace_xfs_zero_eof(ip, isize, pos - isize);
return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
+ IS_DAX(VFS_I(ip)) ?
+ &xfs_direct_write_iomap_ops :
&xfs_buffered_write_iomap_ops);
}

--
2.27.0



2020-08-07 13:40:02

by Matthew Wilcox

Subject: Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink

On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
> This patchset is a try to resolve the problem of tracking shared page
> for fsdax.
>
> Instead of per-page tracking method, this patchset introduces a query
> interface: get_shared_files(), which is implemented by each FS, to
> obtain the owners of a shared page. It returns an owner list of this
> shared page. Then, the memory-failure() iterates the list to be able
> to notify each process using files that sharing this page.
>
> The design of the tracking method is as follow:
> 1. dax_assocaite_entry() associates the owner's info to this page

I think that's the first problem with this design. dax_associate_entry is
a horrendous idea which needs to be ripped out, not made more important.
It's all part of the general problem of trying to do something on a
per-page basis instead of per-extent basis.

2020-08-10 08:19:15

by Ruan Shiyang

Subject: Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink



On 2020/8/7 9:38 PM, Matthew Wilcox wrote:
> On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
>> This patchset is a try to resolve the problem of tracking shared page
>> for fsdax.
>>
>> Instead of per-page tracking method, this patchset introduces a query
>> interface: get_shared_files(), which is implemented by each FS, to
>> obtain the owners of a shared page. It returns an owner list of this
>> shared page. Then, the memory-failure() iterates the list to be able
>> to notify each process using files that sharing this page.
>>
>> The design of the tracking method is as follow:
>> 1. dax_assocaite_entry() associates the owner's info to this page
>
> I think that's the first problem with this design. dax_associate_entry is
> a horrendous idea which needs to be ripped out, not made more important.
> It's all part of the general problem of trying to do something on a
> per-page basis instead of per-extent basis.
>

The memory-failure code needs to track owner info from a dax page, so I
should associate the owner with this page. In this version, I associate
the block device with the dax page, so that memory-failure is able to
iterate the owners via the query interface provided by the filesystem.


--
Thanks,
Ruan Shiyang.


2020-08-10 11:18:17

by Matthew Wilcox

Subject: Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink

On Mon, Aug 10, 2020 at 04:15:50PM +0800, Ruan Shiyang wrote:
>
>
> > On 2020/8/7 9:38 PM, Matthew Wilcox wrote:
> > On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
> > > This patchset is a try to resolve the problem of tracking shared page
> > > for fsdax.
> > >
> > > Instead of per-page tracking method, this patchset introduces a query
> > > interface: get_shared_files(), which is implemented by each FS, to
> > > obtain the owners of a shared page. It returns an owner list of this
> > > shared page. Then, the memory-failure() iterates the list to be able
> > > to notify each process using files that sharing this page.
> > >
> > > The design of the tracking method is as follow:
> > > 1. dax_assocaite_entry() associates the owner's info to this page
> >
> > I think that's the first problem with this design. dax_associate_entry is
> > a horrendous idea which needs to be ripped out, not made more important.
> > It's all part of the general problem of trying to do something on a
> > per-page basis instead of per-extent basis.
> >
>
> The memory-failure needs to track owners info from a dax page, so I should
> associate the owner with this page. In this version, I associate the block
> device to the dax page, so that the memory-failure is able to iterate the
> owners by the query interface provided by filesystem.

No, it doesn't need to track owner info from a DAX page. What it needs to
do is ask the filesystem.

2020-08-10 21:12:27

by Dave Chinner

Subject: Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink

On Mon, Aug 10, 2020 at 12:16:57PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 10, 2020 at 04:15:50PM +0800, Ruan Shiyang wrote:
> >
> >
> > > On 2020/8/7 9:38 PM, Matthew Wilcox wrote:
> > > On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
> > > > This patchset is a try to resolve the problem of tracking shared page
> > > > for fsdax.
> > > >
> > > > Instead of per-page tracking method, this patchset introduces a query
> > > > interface: get_shared_files(), which is implemented by each FS, to
> > > > obtain the owners of a shared page. It returns an owner list of this
> > > > shared page. Then, the memory-failure() iterates the list to be able
> > > > to notify each process using files that sharing this page.
> > > >
> > > > The design of the tracking method is as follow:
> > > > 1. dax_assocaite_entry() associates the owner's info to this page
> > >
> > > I think that's the first problem with this design. dax_associate_entry is
> > > a horrendous idea which needs to be ripped out, not made more important.
> > > It's all part of the general problem of trying to do something on a
> > > per-page basis instead of per-extent basis.
> > >
> >
> > The memory-failure needs to track owners info from a dax page, so I should
> > associate the owner with this page. In this version, I associate the block
> > device to the dax page, so that the memory-failure is able to iterate the
> > owners by the query interface provided by filesystem.
>
> No, it doesn't need to track owner info from a DAX page. What it needs to
> do is ask the filesystem.

Just to add to this: the owner tracking that is currently done deep
inside the DAX code needs to be moved out to the owner of the dax
device. That may be the dax device itself, or it may be a filesystem
like ext4 or XFS. Initially, nothing will be able to share pages and
page owner tracking should be done on the page itself as the DAX
code currently does.

Hence when a page error occurs, the device owner is called with the
page and error information, and the dax device or filesystem can
then look up the page owner (via mapping/index pointers) and run the
Die Userspace Die functions instead of running this all internally
in the DAX code.

Once the implementation is abstracted and controlled by the device
owner, then we can start working to change the XFS implementation to
support shared pages....

Cheers,

Dave.
--
Dave Chinner
[email protected]