2020-04-27 08:52:08

by Ruan Shiyang

Subject: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

This patchset is an attempt to resolve the shared 'page cache' problem
for fsdax.

In order to track multiple mappings and indexes on one page, I
introduced a dax-rmap rb-tree to manage the relationship. A dax entry
will be associated more than once if it is shared. The second time we
associate an entry, we create this rb-tree and store its root in
page->private (which is otherwise unused in fsdax). A (->mapping,
->index) tuple is inserted in dax_associate_entry() and deleted in
dax_disassociate_entry().

Code that needs to operate on every mapping of a file, such as
memory-failure and rmap, can iterate the dax-rmap rb-tree first.
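
For reference, the per-owner record and its sort key look like this
(condensed from patch 1 below, which has the full code):

	/* one (->mapping, ->index) owner of a shared dax page */
	struct shared_file {
		struct address_space *mapping;
		pgoff_t index;
		struct rb_node node;
	};

	/* nodes are ordered by mapping first, then by index */
	long result = (long)mapping - (long)this->mapping;
	if (result == 0)
		result = (long)index - (long)this->index;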

Same as before, I borrowed Goldwyn's patchsets and made some changes to
them. These patches make up for the lack of a CoW mechanism in fsdax.

The rest is dax & reflink support for xfs.

(Rebased to 5.7-rc2)


Shiyang Ruan (8):
fs/dax: Introduce dax-rmap btree for reflink
mm: add dax-rmap for memory-failure and rmap
fs/dax: Introduce dax_copy_edges() for COW
fs/dax: copy data before write
fs/dax: replace mmap entry in case of CoW
fs/dax: dedup file range to use a compare function
fs/xfs: handle CoW for fsdax write() path
fs/xfs: support dedupe for fsdax

 fs/dax.c               | 343 +++++++++++++++++++++++++++++++++++++----
 fs/ocfs2/file.c        |   2 +-
 fs/read_write.c        |  11 +-
 fs/xfs/xfs_bmap_util.c |   6 +-
 fs/xfs/xfs_file.c      |  10 +-
 fs/xfs/xfs_iomap.c     |   3 +-
 fs/xfs/xfs_iops.c      |  11 +-
 fs/xfs/xfs_reflink.c   |  79 ++++++----
 include/linux/dax.h    |  11 ++
 include/linux/fs.h     |   9 +-
 mm/memory-failure.c    |  63 ++++++--
 mm/rmap.c              |  54 +++++--
 12 files changed, 498 insertions(+), 104 deletions(-)

--
2.26.2




2020-04-27 08:53:20

by Ruan Shiyang

Subject: [RFC PATCH 1/8] fs/dax: Introduce dax-rmap btree for reflink

Normally, when a mmapped file takes a page fault, the file's
(->mapping, ->index) is associated with the dax entry (which represents
one page or a couple of pages) to facilitate the reverse mapping
search. In the case of reflink, however, a dax entry may be shared by
multiple files or offsets. In order to establish a reverse mapping
relationship in this case, I introduce an rb-tree to track the multiple
files and offsets.

The root of the rb-tree is stored in page->private, since as far as I
can tell it is unused in fsdax. We create the rb-tree and insert the
(->mapping, ->index) tuple the second time a dax entry is associated,
which means this dax entry is shared. The tuple is deleted from the
rb-tree when disassociating.

Signed-off-by: Shiyang Ruan <[email protected]>
---
 fs/dax.c            | 153 ++++++++++++++++++++++++++++++++++++++++----
 include/linux/dax.h |   6 ++
 2 files changed, 147 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..2f996c566103 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -25,6 +25,7 @@
#include <linux/sizes.h>
#include <linux/mmu_notifier.h>
#include <linux/iomap.h>
+#include <linux/rbtree.h>
#include <asm/pgalloc.h>

#define CREATE_TRACE_POINTS
@@ -310,6 +311,120 @@ static unsigned long dax_entry_size(void *entry)
return PAGE_SIZE;
}

+static struct kmem_cache *dax_rmap_node_cachep;
+static struct kmem_cache *dax_rmap_root_cachep;
+
+static int __init init_dax_rmap_cache(void)
+{
+ dax_rmap_root_cachep = KMEM_CACHE(rb_root_cached, SLAB_PANIC|SLAB_ACCOUNT);
+ dax_rmap_node_cachep = KMEM_CACHE(shared_file, SLAB_PANIC|SLAB_ACCOUNT);
+ return 0;
+}
+fs_initcall(init_dax_rmap_cache);
+
+struct rb_root_cached *dax_create_rbroot(void)
+{
+ struct rb_root_cached *root = kmem_cache_alloc(dax_rmap_root_cachep,
+ GFP_KERNEL);
+
+ memset(root, 0, sizeof(struct rb_root_cached));
+ return root;
+}
+
+static bool dax_rmap_insert(struct page *page, struct address_space *mapping,
+ pgoff_t index)
+{
+ struct rb_root_cached *root = (struct rb_root_cached *)page_private(page);
+ struct rb_node **new, *parent = NULL;
+ struct shared_file *p;
+ bool leftmost = true;
+
+ if (!root) {
+ root = dax_create_rbroot();
+ set_page_private(page, (unsigned long)root);
+ dax_rmap_insert(page, page->mapping, page->index);
+ }
+ new = &root->rb_root.rb_node;
+ /* Figure out where to insert new node */
+ while (*new) {
+ struct shared_file *this = container_of(*new, struct shared_file, node);
+ long result = (long)mapping - (long)this->mapping;
+
+ if (result == 0)
+ result = (long)index - (long)this->index;
+ parent = *new;
+ if (result < 0)
+ new = &((*new)->rb_left);
+ else if (result > 0) {
+ new = &((*new)->rb_right);
+ leftmost = false;
+ } else
+ return false;
+ }
+ p = kmem_cache_alloc(dax_rmap_node_cachep, GFP_KERNEL);
+ p->mapping = mapping;
+ p->index = index;
+
+ /* Add new node and rebalance tree. */
+ rb_link_node(&p->node, parent, new);
+ rb_insert_color_cached(&p->node, root, leftmost);
+
+ return true;
+}
+
+static struct shared_file *dax_rmap_search(struct page *page,
+ struct address_space *mapping,
+ pgoff_t index)
+{
+ struct rb_root_cached *root = (struct rb_root_cached *)page_private(page);
+ struct rb_node *node = root->rb_root.rb_node;
+
+ while (node) {
+ struct shared_file *this = container_of(node, struct shared_file, node);
+ long result = (long)mapping - (long)this->mapping;
+
+ if (result == 0)
+ result = (long)index - (long)this->index;
+ if (result < 0)
+ node = node->rb_left;
+ else if (result > 0)
+ node = node->rb_right;
+ else
+ return this;
+ }
+ return NULL;
+}
+
+static void dax_rmap_delete(struct page *page, struct address_space *mapping,
+ pgoff_t index)
+{
+ struct rb_root_cached *root = (struct rb_root_cached *)page_private(page);
+ struct shared_file *this;
+
+ if (!root) {
+ page->mapping = NULL;
+ page->index = 0;
+ return;
+ }
+
+ this = dax_rmap_search(page, mapping, index);
+ rb_erase_cached(&this->node, root);
+ kmem_cache_free(dax_rmap_node_cachep, this);
+
+ if (!RB_EMPTY_ROOT(&root->rb_root)) {
+ if (page->mapping == mapping && page->index == index) {
+ this = container_of(rb_first_cached(root), struct shared_file, node);
+ page->mapping = this->mapping;
+ page->index = this->index;
+ }
+ } else {
+ kmem_cache_free(dax_rmap_root_cachep, root);
+ set_page_private(page, 0);
+ page->mapping = NULL;
+ page->index = 0;
+ }
+}
+
static unsigned long dax_end_pfn(void *entry)
{
return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
@@ -341,16 +456,20 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);

- WARN_ON_ONCE(page->mapping);
- page->mapping = mapping;
- page->index = index + i++;
+ if (!page->mapping) {
+ page->mapping = mapping;
+ page->index = index + i++;
+ } else {
+ dax_rmap_insert(page, mapping, index + i++);
+ }
}
}

static void dax_disassociate_entry(void *entry, struct address_space *mapping,
- bool trunc)
+ pgoff_t index, bool trunc)
{
unsigned long pfn;
+ int i = 0;

if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
return;
@@ -359,9 +478,19 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
struct page *page = pfn_to_page(pfn);

WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
- WARN_ON_ONCE(page->mapping && page->mapping != mapping);
- page->mapping = NULL;
- page->index = 0;
+ WARN_ON_ONCE(!page->mapping);
+ dax_rmap_delete(page, mapping, index + i++);
+ }
+}
+
+static void __dax_decrease_nrexceptional(void *entry,
+ struct address_space *mapping)
+{
+ if (dax_is_empty_entry(entry) || dax_is_zero_entry(entry) ||
+ dax_is_pmd_entry(entry)) {
+ mapping->nrexceptional--;
+ } else {
+ mapping->nrexceptional -= PHYS_PFN(dax_entry_size(entry));
}
}

@@ -522,10 +651,10 @@ static void *grab_mapping_entry(struct xa_state *xas,
xas_lock_irq(xas);
}

- dax_disassociate_entry(entry, mapping, false);
+ dax_disassociate_entry(entry, mapping, index, false);
xas_store(xas, NULL); /* undo the PMD join */
dax_wake_entry(xas, entry, true);
- mapping->nrexceptional--;
+ __dax_decrease_nrexceptional(entry, mapping);
entry = NULL;
xas_set(xas, index);
}
@@ -642,9 +771,9 @@ static int __dax_invalidate_entry(struct address_space *mapping,
(xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
goto out;
- dax_disassociate_entry(entry, mapping, trunc);
+ dax_disassociate_entry(entry, mapping, index, trunc);
xas_store(&xas, NULL);
- mapping->nrexceptional--;
+ __dax_decrease_nrexceptional(entry, mapping);
ret = 1;
out:
put_unlocked_entry(&xas, entry);
@@ -737,7 +866,7 @@ static void *dax_insert_entry(struct xa_state *xas,
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;

- dax_disassociate_entry(entry, mapping, false);
+ dax_disassociate_entry(entry, mapping, xas->xa_index, false);
dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
/*
* Only swap our new entry into the page cache if the current
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d7af5d243f24..1e2e81c701b6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -39,6 +39,12 @@ struct dax_operations {
int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
};

+struct shared_file {
+ struct address_space *mapping;
+ pgoff_t index;
+ struct rb_node node;
+};
+
extern struct attribute_group dax_attribute_group;

#if IS_ENABLED(CONFIG_DAX)
--
2.26.2



2020-04-27 08:53:34

by Ruan Shiyang

Subject: [RFC PATCH 8/8] fs/xfs: support dedupe for fsdax

Use xfs_break_layouts() to break the files' layouts when locking them,
and call dax_file_range_compare() to compare ranges when both files are
DAX.

Signed-off-by: Shiyang Ruan <[email protected]>
---
 fs/xfs/xfs_reflink.c | 77 ++++++++++++++++++++++++++------------------
 1 file changed, 45 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f87ab78dd421..efbe3464ae85 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1185,47 +1185,41 @@ xfs_reflink_remap_blocks(
* back out both locks.
*/
static int
-xfs_iolock_two_inodes_and_break_layout(
- struct inode *src,
- struct inode *dest)
+xfs_reflink_remap_lock_and_break_layouts(
+ struct file *file_in,
+ struct file *file_out)
{
int error;
+ struct inode *inode_in = file_inode(file_in);
+ struct xfs_inode *src = XFS_I(inode_in);
+ struct inode *inode_out = file_inode(file_out);
+ struct xfs_inode *dest = XFS_I(inode_out);
+ uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;

- if (src > dest)
+ if (inode_in > inode_out) {
+ swap(inode_in, inode_out);
swap(src, dest);
-
-retry:
- /* Wait to break both inodes' layouts before we start locking. */
- error = break_layout(src, true);
- if (error)
- return error;
- if (src != dest) {
- error = break_layout(dest, true);
- if (error)
- return error;
}

- /* Lock one inode and make sure nobody got in and leased it. */
- inode_lock(src);
- error = break_layout(src, false);
+ inode_lock(inode_in);
+ xfs_ilock(src, XFS_MMAPLOCK_EXCL);
+ error = xfs_break_layouts(inode_in, &iolock, BREAK_UNMAP);
+ xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
if (error) {
- inode_unlock(src);
- if (error == -EWOULDBLOCK)
- goto retry;
+ inode_unlock(inode_in);
return error;
}

- if (src == dest)
+ if (inode_in == inode_out)
return 0;

- /* Lock the other inode and make sure nobody got in and leased it. */
- inode_lock_nested(dest, I_MUTEX_NONDIR2);
- error = break_layout(dest, false);
+ inode_lock_nested(inode_out, I_MUTEX_NONDIR2);
+ xfs_ilock(dest, XFS_MMAPLOCK_EXCL);
+ error = xfs_break_layouts(inode_out, &iolock, BREAK_UNMAP);
+ xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
if (error) {
- inode_unlock(src);
- inode_unlock(dest);
- if (error == -EWOULDBLOCK)
- goto retry;
+ inode_unlock(inode_in);
+ inode_unlock(inode_out);
return error;
}

@@ -1244,6 +1238,11 @@ xfs_reflink_remap_unlock(
struct xfs_inode *dest = XFS_I(inode_out);
bool same_inode = (inode_in == inode_out);

+ if (inode_in > inode_out) {
+ swap(inode_in, inode_out);
+ swap(src, dest);
+ }
+
xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
if (!same_inode)
xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
@@ -1274,6 +1273,14 @@ xfs_reflink_zero_posteof(
&xfs_buffered_write_iomap_ops);
}

+int xfs_reflink_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same)
+{
+ return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same,
+ &xfs_read_iomap_ops);
+}
+
/*
* Prepare two files for range cloning. Upon a successful return both inodes
* will have the iolock and mmaplock held, the page cache of the out file will
@@ -1318,9 +1325,10 @@ xfs_reflink_remap_prep(
struct xfs_inode *dest = XFS_I(inode_out);
bool same_inode = (inode_in == inode_out);
ssize_t ret;
+ compare_range_t cmp;

/* Lock both files against IO */
- ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
+ ret = xfs_reflink_remap_lock_and_break_layouts(file_in, file_out);
if (ret)
return ret;
if (same_inode)
@@ -1335,12 +1343,17 @@ xfs_reflink_remap_prep(
if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
goto out_unlock;

- /* Don't share DAX file data for now. */
- if (IS_DAX(inode_in) || IS_DAX(inode_out))
+ /* Don't share DAX file data with non-DAX file. */
+ if (IS_DAX(inode_in) != IS_DAX(inode_out))
goto out_unlock;

+ if (IS_DAX(inode_in))
+ cmp = xfs_reflink_dedupe_file_range_compare;
+ else
+ cmp = vfs_dedupe_file_range_compare;
+
ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags, vfs_dedupe_file_range_compare);
+ len, remap_flags, cmp);
if (ret < 0 || *len == 0)
goto out_unlock;

--
2.26.2



2020-04-27 12:33:00

by Matthew Wilcox

Subject: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> This patchset is a try to resolve the shared 'page cache' problem for
> fsdax.
>
> In order to track multiple mappings and indexes on one page, I
> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> will be associated more than once if is shared. At the second time we
> associate this entry, we create this rb-tree and store its root in
> page->private(not used in fsdax). Insert (->mapping, ->index) when
> dax_associate_entry() and delete it when dax_disassociate_entry().

Do we really want to track all of this on a per-page basis? I would
have thought a per-extent basis was more useful. Essentially, create
a new address_space for each shared extent. Per page just seems like
a huge overhead.
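
Something along these lines, though this is only a sketch of the shape
of the idea - no such object exists today:

	/*
	 * Sketch: one address_space per shared extent. Pages of the
	 * extent get a single ->mapping/->index pair pointing here,
	 * and the filesystem records which (file, offset) ranges
	 * share the extent.
	 */
	struct shared_extent {
		struct address_space	mapping; /* pages of this extent */
		/* fs-private list of (inode, offset) sharers */
	};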

2020-04-28 06:11:46

by Ruan Shiyang

Subject: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink


On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:

>On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>> This patchset is a try to resolve the shared 'page cache' problem for
>> fsdax.
>>
>> In order to track multiple mappings and indexes on one page, I
>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>> will be associated more than once if is shared. At the second time we
>> associate this entry, we create this rb-tree and store its root in
>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>> dax_associate_entry() and delete it when dax_disassociate_entry().
>
>Do we really want to track all of this on a per-page basis? I would
>have thought a per-extent basis was more useful. Essentially, create
>a new address_space for each shared extent. Per page just seems like
>a huge overhead.
>
Per-extent tracking sounds like a nice idea to me. I hadn't thought of
it yet...

But the extent info is maintained by the filesystem. I think we need a
way to obtain this info from the FS when associating a page. It may be
a bit complicated. Let me think about it...


--
Thanks,
Ruan Shiyang.

2020-04-28 06:45:20

by Dave Chinner

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>
> On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
>
> >On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> >> This patchset is a try to resolve the shared 'page cache' problem for
> >> fsdax.
> >>
> >> In order to track multiple mappings and indexes on one page, I
> >> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> >> will be associated more than once if is shared. At the second time we
> >> associate this entry, we create this rb-tree and store its root in
> >> page->private(not used in fsdax). Insert (->mapping, ->index) when
> >> dax_associate_entry() and delete it when dax_disassociate_entry().
> >
> >Do we really want to track all of this on a per-page basis? I would
> >have thought a per-extent basis was more useful. Essentially, create
> >a new address_space for each shared extent. Per page just seems like
> >a huge overhead.
> >
> Per-extent tracking is a nice idea for me. I haven't thought of it
> yet...
>
> But the extent info is maintained by filesystem. I think we need a way
> to obtain this info from FS when associating a page. May be a bit
> complicated. Let me think about it...

That's why I want the -user of this association- to do a filesystem
callout instead of keeping its own naive tracking infrastructure.
The filesystem can do an efficient, on-demand reverse mapping lookup
from its own extent tracking infrastructure, and there's zero
runtime overhead when there are no errors present.

At the moment, this "dax association" is used to "report" a storage
media error directly to userspace. I say "report" because what it
does is kill userspace processes dead. The storage media error
actually needs to be reported to the owner of the storage media,
which in the case of FS-DAX is the filesystem.

That way the filesystem can then look up all the owners of that bad
media range (i.e. the filesystem block it corresponds to) and take
appropriate action. e.g.

- if it falls in filesystem metadata, shut down the filesystem
- if it falls in user data, call the "kill userspace dead" routines
for each mapping/index tuple the filesystem finds for the given
LBA address at which the media error occurred.

Right now if the media error is in filesystem metadata, the
filesystem isn't even told about it. The filesystem can't even shut
down - the error is just dropped on the floor and it won't be until
the filesystem next tries to reference that metadata that we notice
there is an issue.
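
IOWs, the shape of the callout is something like this (sketch only -
nothing like this exists today, and the name is made up):

	/*
	 * Sketch: the media error path tells the filesystem that a
	 * range of its block device has gone bad, and the filesystem
	 * uses its own reverse mapping to find and act on each owner.
	 */
	int (*notify_media_error)(struct super_block *sb,
				  loff_t pos, size_t len);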

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-04-28 09:34:49

by Ruan Shiyang

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink



On 2020/4/28 2:43 PM, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>
>> On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
>>
>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>> This patchset is a try to resolve the shared 'page cache' problem for
>>>> fsdax.
>>>>
>>>> In order to track multiple mappings and indexes on one page, I
>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>>>> will be associated more than once if is shared. At the second time we
>>>> associate this entry, we create this rb-tree and store its root in
>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>>>> dax_associate_entry() and delete it when dax_disassociate_entry().
>>>
>>> Do we really want to track all of this on a per-page basis? I would
>>> have thought a per-extent basis was more useful. Essentially, create
>>> a new address_space for each shared extent. Per page just seems like
>>> a huge overhead.
>>>
>> Per-extent tracking is a nice idea for me. I haven't thought of it
>> yet...
>>
>> But the extent info is maintained by filesystem. I think we need a way
>> to obtain this info from FS when associating a page. May be a bit
>> complicated. Let me think about it...
>
> That's why I want the -user of this association- to do a filesystem
> callout instead of keeping it's own naive tracking infrastructure.
> The filesystem can do an efficient, on-demand reverse mapping lookup
> from it's own extent tracking infrastructure, and there's zero
> runtime overhead when there are no errors present.
>
> At the moment, this "dax association" is used to "report" a storage
> media error directly to userspace. I say "report" because what it
> does is kill userspace processes dead. The storage media error
> actually needs to be reported to the owner of the storage media,
> which in the case of FS-DAX is the filesytem.

Understood.

BTW, this is the usage in memory-failure; what about rmap? I have not
figured out how to use this tracking in rmap. Do you have any ideas?

>
> That way the filesystem can then look up all the owners of that bad
> media range (i.e. the filesystem block it corresponds to) and take
> appropriate action. e.g.

I tried writing a function to look up all the owners' info for one
block in xfs for memory-failure use. It was dropped from this patchset
because I found out that this lookup function needs 'rmapbt' to be
enabled at mkfs time. But rmapbt is disabled by default. I am not sure
if that matters...

>
> - if it falls in filesytem metadata, shutdown the filesystem
> - if it falls in user data, call the "kill userspace dead" routines
> for each mapping/index tuple the filesystem finds for the given
> LBA address that the media error occurred.
>
> Right now if the media error is in filesystem metadata, the
> filesystem isn't even told about it. The filesystem can't even shut
> down - the error is just dropped on the floor and it won't be until
> the filesystem next tries to reference that metadata that we notice
> there is an issue.

Understood. Thanks.

>
> Cheers,
>
> Dave.
>


--
Thanks,
Ruan Shiyang.


2020-04-28 11:20:13

by Matthew Wilcox

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> On 2020/4/28 2:43 PM, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
> > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > This patchset is a try to resolve the shared 'page cache' problem for
> > > > > fsdax.
> > > > >
> > > > > In order to track multiple mappings and indexes on one page, I
> > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> > > > > will be associated more than once if is shared. At the second time we
> > > > > associate this entry, we create this rb-tree and store its root in
> > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when
> > > > > dax_associate_entry() and delete it when dax_disassociate_entry().
> > > >
> > > > Do we really want to track all of this on a per-page basis? I would
> > > > have thought a per-extent basis was more useful. Essentially, create
> > > > a new address_space for each shared extent. Per page just seems like
> > > > a huge overhead.
> > > >
> > > Per-extent tracking is a nice idea for me. I haven't thought of it
> > > yet...
> > >
> > > But the extent info is maintained by filesystem. I think we need a way
> > > to obtain this info from FS when associating a page. May be a bit
> > > complicated. Let me think about it...
> >
> > That's why I want the -user of this association- to do a filesystem
> > callout instead of keeping it's own naive tracking infrastructure.
> > The filesystem can do an efficient, on-demand reverse mapping lookup
> > from it's own extent tracking infrastructure, and there's zero
> > runtime overhead when there are no errors present.
> >
> > At the moment, this "dax association" is used to "report" a storage
> > media error directly to userspace. I say "report" because what it
> > does is kill userspace processes dead. The storage media error
> > actually needs to be reported to the owner of the storage media,
> > which in the case of FS-DAX is the filesytem.
>
> Understood.
>
> BTW, this is the usage in memory-failure, so what about rmap? I have not
> found how to use this tracking in rmap. Do you have any ideas?
>
> >
> > That way the filesystem can then look up all the owners of that bad
> > media range (i.e. the filesystem block it corresponds to) and take
> > appropriate action. e.g.
>
> I tried writing a function to look up all the owners' info of one block in
> xfs for memory-failure use. It was dropped in this patchset because I found
> out that this lookup function needs 'rmapbt' to be enabled when mkfs. But
> by default, rmapbt is disabled. I am not sure if it matters...

I'm pretty sure you can't have shared extents on an XFS filesystem if you
_don't_ have the rmapbt feature enabled. I mean, that's why it exists.

2020-04-28 11:28:56

by Dave Chinner

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > On 2020/4/28 2:43 PM, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > > On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
> > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > > This patchset is a try to resolve the shared 'page cache' problem for
> > > > > > fsdax.
> > > > > >
> > > > > > In order to track multiple mappings and indexes on one page, I
> > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> > > > > > will be associated more than once if is shared. At the second time we
> > > > > > associate this entry, we create this rb-tree and store its root in
> > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when
> > > > > > dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > >
> > > > > Do we really want to track all of this on a per-page basis? I would
> > > > > have thought a per-extent basis was more useful. Essentially, create
> > > > > a new address_space for each shared extent. Per page just seems like
> > > > > a huge overhead.
> > > > >
> > > > Per-extent tracking is a nice idea for me. I haven't thought of it
> > > > yet...
> > > >
> > > > But the extent info is maintained by filesystem. I think we need a way
> > > > to obtain this info from FS when associating a page. May be a bit
> > > > complicated. Let me think about it...
> > >
> > > That's why I want the -user of this association- to do a filesystem
> > > callout instead of keeping it's own naive tracking infrastructure.
> > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > from it's own extent tracking infrastructure, and there's zero
> > > runtime overhead when there are no errors present.
> > >
> > > At the moment, this "dax association" is used to "report" a storage
> > > media error directly to userspace. I say "report" because what it
> > > does is kill userspace processes dead. The storage media error
> > > actually needs to be reported to the owner of the storage media,
> > > which in the case of FS-DAX is the filesytem.
> >
> > Understood.
> >
> > BTW, this is the usage in memory-failure, so what about rmap? I have not
> > found how to use this tracking in rmap. Do you have any ideas?
> >
> > >
> > > That way the filesystem can then look up all the owners of that bad
> > > media range (i.e. the filesystem block it corresponds to) and take
> > > appropriate action. e.g.
> >
> > I tried writing a function to look up all the owners' info of one block in
> > xfs for memory-failure use. It was dropped in this patchset because I found
> > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But
> > by default, rmapbt is disabled. I am not sure if it matters...
>
> I'm pretty sure you can't have shared extents on an XFS filesystem if you
> _don't_ have the rmapbt feature enabled. I mean, that's why it exists.

You're confusing reflink with rmap. :)

rmapbt does all the reverse mapping tracking, reflink just does the
shared data extent tracking.

But given that anyone who wants to use DAX with reflink is going to
have to mkfs their filesystem anyway (to turn on reflink), requiring
that rmapbt is also turned on is not a big deal. Especially as we
can check it at mount time in the kernel...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-04-28 15:42:08

by Darrick J. Wong

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > > On 2020/4/28 2:43 PM, Dave Chinner wrote:
> > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > > > On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
> > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > > > This patchset is a try to resolve the shared 'page cache' problem for
> > > > > > > fsdax.
> > > > > > >
> > > > > > > In order to track multiple mappings and indexes on one page, I
> > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> > > > > > > will be associated more than once if is shared. At the second time we
> > > > > > > associate this entry, we create this rb-tree and store its root in
> > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when
> > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > > >
> > > > > > Do we really want to track all of this on a per-page basis? I would
> > > > > > have thought a per-extent basis was more useful. Essentially, create
> > > > > > a new address_space for each shared extent. Per page just seems like
> > > > > > a huge overhead.
> > > > > >
> > > > > Per-extent tracking is a nice idea for me. I haven't thought of it
> > > > > yet...
> > > > >
> > > > > But the extent info is maintained by filesystem. I think we need a way
> > > > > to obtain this info from FS when associating a page. May be a bit
> > > > > complicated. Let me think about it...
> > > >
> > > > That's why I want the -user of this association- to do a filesystem
> > > > callout instead of keeping it's own naive tracking infrastructure.
> > > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > > from it's own extent tracking infrastructure, and there's zero
> > > > runtime overhead when there are no errors present.
> > > >
> > > > At the moment, this "dax association" is used to "report" a storage
> > > > media error directly to userspace. I say "report" because what it
> > > > does is kill userspace processes dead. The storage media error
> > > > actually needs to be reported to the owner of the storage media,
> > > > which in the case of FS-DAX is the filesytem.
> > >
> > > Understood.
> > >
> > > BTW, this is the usage in memory-failure, so what about rmap? I have not
> > > found how to use this tracking in rmap. Do you have any ideas?
> > >
> > > >
> > > > That way the filesystem can then look up all the owners of that bad
> > > > media range (i.e. the filesystem block it corresponds to) and take
> > > > appropriate action. e.g.
> > >
> > > I tried writing a function to look up all the owners' info of one block in
> > > xfs for memory-failure use. It was dropped in this patchset because I found
> > > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But
> > > by default, rmapbt is disabled. I am not sure if it matters...
> >
> > I'm pretty sure you can't have shared extents on an XFS filesystem if you
> > _don't_ have the rmapbt feature enabled. I mean, that's why it exists.
>
> You're confusing reflink with rmap. :)
>
> rmapbt does all the reverse mapping tracking, reflink just does the
> shared data extent tracking.
>
> But given that anyone who wants to use DAX with reflink is going to
> have to mkfs their filesystem anyway (to turn on reflink) requiring
> that rmapbt is also turned on is not a big deal. Especially as we
> can check it at mount time in the kernel...

Are we going to turn on rmap by default? The last I checked, it did
have a 10-20% performance cost on extreme metadata-heavy workloads.
Or do we only enable it by default if mkfs detects a pmem device?

(Admittedly, most people do not run fsx as a productivity app; the
normal hit is usually 3-5% which might not be such a big deal since you
also get (half of) online fsck. :P)

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2020-04-28 22:05:09

by Dave Chinner

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

On Tue, Apr 28, 2020 at 08:37:32AM -0700, Darrick J. Wong wrote:
> On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > > > On 2020/4/28 2:43 PM, Dave Chinner wrote:
> > > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > > > > On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
> > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > > > > This patchset is a try to resolve the shared 'page cache' problem for
> > > > > > > > fsdax.
> > > > > > > >
> > > > > > > > In order to track multiple mappings and indexes on one page, I
> > > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> > > > > > > > will be associated more than once if is shared. At the second time we
> > > > > > > > associate this entry, we create this rb-tree and store its root in
> > > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when
> > > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > > > >
> > > > > > > Do we really want to track all of this on a per-page basis? I would
> > > > > > > have thought a per-extent basis was more useful. Essentially, create
> > > > > > > a new address_space for each shared extent. Per page just seems like
> > > > > > > a huge overhead.
> > > > > > >
> > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it
> > > > > > yet...
> > > > > >
> > > > > > But the extent info is maintained by filesystem. I think we need a way
> > > > > > to obtain this info from FS when associating a page. May be a bit
> > > > > > complicated. Let me think about it...
> > > > >
> > > > > That's why I want the -user of this association- to do a filesystem
> > > > > callout instead of keeping it's own naive tracking infrastructure.
> > > > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > > > from it's own extent tracking infrastructure, and there's zero
> > > > > runtime overhead when there are no errors present.
> > > > >
> > > > > At the moment, this "dax association" is used to "report" a storage
> > > > > media error directly to userspace. I say "report" because what it
> > > > > does is kill userspace processes dead. The storage media error
> > > > > actually needs to be reported to the owner of the storage media,
> > > > > which in the case of FS-DAX is the filesytem.
> > > >
> > > > Understood.
> > > >
> > > > BTW, this is the usage in memory-failure, so what about rmap? I have not
> > > > found how to use this tracking in rmap. Do you have any ideas?
> > > >
> > > > >
> > > > > That way the filesystem can then look up all the owners of that bad
> > > > > media range (i.e. the filesystem block it corresponds to) and take
> > > > > appropriate action. e.g.
> > > >
> > > > I tried writing a function to look up all the owners' info of one block in
> > > > xfs for memory-failure use. It was dropped in this patchset because I found
> > > > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But
> > > > by default, rmapbt is disabled. I am not sure if it matters...
> > >
> > > I'm pretty sure you can't have shared extents on an XFS filesystem if you
> > > _don't_ have the rmapbt feature enabled. I mean, that's why it exists.
> >
> > You're confusing reflink with rmap. :)
> >
> > rmapbt does all the reverse mapping tracking, reflink just does the
> > shared data extent tracking.
> >
> > But given that anyone who wants to use DAX with reflink is going to
> > have to mkfs their filesystem anyway (to turn on reflink) requiring
> > that rmapbt is also turned on is not a big deal. Especially as we
> > can check it at mount time in the kernel...
>
> Are we going to turn on rmap by default? The last I checked, it did
> have a 10-20% performance cost on extreme metadata-heavy workloads.
> Or do we only enable it by default if mkfs detects a pmem device?

Just have the kernel refuse to mount a reflink enabled filesystem on
a DAX capable device unless -o dax=never or rmapbt is enabled.

That'll get the message across pretty quickly....

> (Admittedly, most people do not run fsx as a productivity app; the
> normal hit is usually 3-5% which might not be such a big deal since you
> also get (half of) online fsck. :P)

I have not noticed the overhead at all on any of my production
machines since I enabled it on all of them way back when....

And, really, pmem is a _very poor choice_ for metadata intensive
applications on XFS as pmem is completely synchronous. XFS has an
async IO model for its metadata that *must* be buffered (so no
DAX!) and the synchronous nature of pmem completely defeats the
architectural IO pipelining XFS uses to allow thousands of
concurrent metadata IOs in flight. OTOH, pmem IO depth is limited to
the number of CPUs that are concurrently issuing IO, so it really,
really sucks compared to a handful of high end nvme SSDs on PCIe
4.0....

So with that in mind, I see little reason to care about the small
additional overhead of rmapbt on FS-DAX installations that require
reflink...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-06-04 07:40:17

by Ruan Shiyang

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink



On 2020/4/28 2:43 PM, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>
>> On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
>>
>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>> This patchset is a try to resolve the shared 'page cache' problem for
>>>> fsdax.
>>>>
>>>> In order to track multiple mappings and indexes on one page, I
>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>>>> will be associated more than once if is shared. At the second time we
>>>> associate this entry, we create this rb-tree and store its root in
>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>>>> dax_associate_entry() and delete it when dax_disassociate_entry().
>>>
>>> Do we really want to track all of this on a per-page basis? I would
>>> have thought a per-extent basis was more useful. Essentially, create
>>> a new address_space for each shared extent. Per page just seems like
>>> a huge overhead.
>>>
>> Per-extent tracking is a nice idea for me. I haven't thought of it
>> yet...
>>
>> But the extent info is maintained by filesystem. I think we need a way
>> to obtain this info from FS when associating a page. May be a bit
>> complicated. Let me think about it...
>
> That's why I want the -user of this association- to do a filesystem
> callout instead of keeping it's own naive tracking infrastructure.
> The filesystem can do an efficient, on-demand reverse mapping lookup
> from it's own extent tracking infrastructure, and there's zero
> runtime overhead when there are no errors present.

Hi Dave,

I ran into some difficulties when trying to implement the per-extent
rmap tracking. So, I re-read your comments and found that I was
misunderstanding what you described here.

I think what you mean is: we don't need the in-memory dax-rmap tracking
now. Just ask the FS for the owner's information associated with one
page at memory-failure time. So, the per-page (or even per-extent)
dax-rmap is needless in this case. Is this right?

Based on this, we only need to store the extent information of a fsdax
page in its ->mapping (by searching from the FS). Then we obtain the
owners of this page (also by searching from the FS) when memory-failure
or another rmap case occurs.

So, a fsdax page is no longer associated with a specific file, but with
the FS (or the pmem device). I think this is easier to understand and
implement.


--
Thanks,
Ruan Shiyang.
>
> At the moment, this "dax association" is used to "report" a storage
> media error directly to userspace. I say "report" because what it
> does is kill userspace processes dead. The storage media error
> actually needs to be reported to the owner of the storage media,
> which in the case of FS-DAX is the filesytem.
>
> That way the filesystem can then look up all the owners of that bad
> media range (i.e. the filesystem block it corresponds to) and take
> appropriate action. e.g.
>
> - if it falls in filesytem metadata, shutdown the filesystem
> - if it falls in user data, call the "kill userspace dead" routines
> for each mapping/index tuple the filesystem finds for the given
> LBA address that the media error occurred.
>
> Right now if the media error is in filesystem metadata, the
> filesystem isn't even told about it. The filesystem can't even shut
> down - the error is just dropped on the floor and it won't be until
> the filesystem next tries to reference that metadata that we notice
> there is an issue.
>
> Cheers,
>
> Dave.
>


2020-06-04 14:54:20

by Darrick J. Wong

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:
>
>
> > On 2020/4/28 2:43 PM, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > >
> > > On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
> > >
> > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > This patchset is a try to resolve the shared 'page cache' problem for
> > > > > fsdax.
> > > > >
> > > > > In order to track multiple mappings and indexes on one page, I
> > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> > > > > will be associated more than once if is shared. At the second time we
> > > > > associate this entry, we create this rb-tree and store its root in
> > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when
> > > > > dax_associate_entry() and delete it when dax_disassociate_entry().
> > > >
> > > > Do we really want to track all of this on a per-page basis? I would
> > > > have thought a per-extent basis was more useful. Essentially, create
> > > > a new address_space for each shared extent. Per page just seems like
> > > > a huge overhead.
> > > >
> > > Per-extent tracking is a nice idea for me. I haven't thought of it
> > > yet...
> > >
> > > But the extent info is maintained by filesystem. I think we need a way
> > > to obtain this info from FS when associating a page. May be a bit
> > > complicated. Let me think about it...
> >
> > That's why I want the -user of this association- to do a filesystem
> > callout instead of keeping it's own naive tracking infrastructure.
> > The filesystem can do an efficient, on-demand reverse mapping lookup
> > from it's own extent tracking infrastructure, and there's zero
> > runtime overhead when there are no errors present.
>
> Hi Dave,
>
> I ran into some difficulties when trying to implement the per-extent rmap
> tracking. So, I re-read your comments and found that I was misunderstanding
> what you described here.
>
> I think what you mean is: we don't need the in-memory dax-rmap tracking now.
> Just ask the FS for the owner's information that associate with one page
> when memory-failure. So, the per-page (even per-extent) dax-rmap is
> needless in this case. Is this right?

Right. XFS already has its own rmap tree.

> Based on this, we only need to store the extent information of a fsdax page
> in its ->mapping (by searching from FS). Then obtain the owners of this
> page (also by searching from FS) when memory-failure or other rmap case
> occurs.

I don't even think you need that much. All you need is the "physical"
offset of that page within the pmem device (e.g. 'this is the 307th 4k
page == offset 1257472 since the start of /dev/pmem0') and xfs can look
up the owner of that range of physical storage and deal with it as
needed.
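
Roughly like the below, though it is only a sketch: the rmap query
helpers are the real ones, but the failure callback, the key setup,
and the cursor handling are made up and heavily simplified:

/* sketch: walk every owner of one bad block on the pmem device */
static int
xfs_dax_failure_fn(
	struct xfs_btree_cur	*cur,
	struct xfs_rmap_irec	*rec,
	void			*priv)
{
	/* rec->rm_owner is the inode number (or metadata owner) */
	return 0;
}

	/* 'offset' is the byte offset of the bad page into /dev/pmem0 */
	xfs_daddr_t		daddr = BTOBB(offset);
	xfs_agnumber_t		agno = xfs_daddr_to_agno(mp, daddr);
	struct xfs_rmap_irec	low = {
		.rm_startblock = xfs_daddr_to_agbno(mp, daddr),
	};
	struct xfs_rmap_irec	high = low;	/* key setup simplified */

	/* 'cur' is an rmapbt cursor set up against AG 'agno' */
	error = xfs_rmap_query_range(cur, &low, &high,
			xfs_dax_failure_fn, NULL);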

> So, a fsdax page is no longer associated with a specific file, but with a
> FS(or the pmem device). I think it's easier to understand and implement.

Yes. I also suspect this will be necessary to support reflink...

--D

>
> --
> Thanks,
> Ruan Shiyang.
> >
> > At the moment, this "dax association" is used to "report" a storage
> > media error directly to userspace. I say "report" because what it
> > does is kill userspace processes dead. The storage media error
> > actually needs to be reported to the owner of the storage media,
> > which in the case of FS-DAX is the filesytem.
> >
> > That way the filesystem can then look up all the owners of that bad
> > media range (i.e. the filesystem block it corresponds to) and take
> > appropriate action. e.g.
> >
> > - if it falls in filesytem metadata, shutdown the filesystem
> > - if it falls in user data, call the "kill userspace dead" routines
> > for each mapping/index tuple the filesystem finds for the given
> > LBA address that the media error occurred.
> >
> > Right now if the media error is in filesystem metadata, the
> > filesystem isn't even told about it. The filesystem can't even shut
> > down - the error is just dropped on the floor and it won't be until
> > the filesystem next tries to reference that metadata that we notice
> > there is an issue.
> >
> > Cheers,
> >
> > Dave.
> >
>
>

2020-06-05 01:33:02

by Dave Chinner

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

On Thu, Jun 04, 2020 at 07:51:07AM -0700, Darrick J. Wong wrote:
> On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:
> >
> >
> > > On 2020/4/28 2:43 PM, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > >
> > > > On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
> > > >
> > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > > This patchset is a try to resolve the shared 'page cache' problem for
> > > > > > fsdax.
> > > > > >
> > > > > > In order to track multiple mappings and indexes on one page, I
> > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> > > > > > will be associated more than once if is shared. At the second time we
> > > > > > associate this entry, we create this rb-tree and store its root in
> > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when
> > > > > > dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > >
> > > > > Do we really want to track all of this on a per-page basis? I would
> > > > > have thought a per-extent basis was more useful. Essentially, create
> > > > > a new address_space for each shared extent. Per page just seems like
> > > > > a huge overhead.
> > > > >
> > > > Per-extent tracking is a nice idea for me. I haven't thought of it
> > > > yet...
> > > >
> > > > But the extent info is maintained by filesystem. I think we need a way
> > > > to obtain this info from FS when associating a page. May be a bit
> > > > complicated. Let me think about it...
> > >
> > > That's why I want the -user of this association- to do a filesystem
> > > callout instead of keeping it's own naive tracking infrastructure.
> > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > from it's own extent tracking infrastructure, and there's zero
> > > runtime overhead when there are no errors present.
> >
> > Hi Dave,
> >
> > I ran into some difficulties when trying to implement the per-extent rmap
> > tracking. So, I re-read your comments and found that I was misunderstanding
> > what you described here.
> >
> > I think what you mean is: we don't need the in-memory dax-rmap tracking now.
> > Just ask the FS for the owner's information that associate with one page
> > when memory-failure. So, the per-page (even per-extent) dax-rmap is
> > needless in this case. Is this right?
>
> Right. XFS already has its own rmap tree.

*nod*

> > Based on this, we only need to store the extent information of a fsdax page
> > in its ->mapping (by searching from FS). Then obtain the owners of this
> > page (also by searching from FS) when memory-failure or other rmap case
> > occurs.
>
> I don't even think you need that much. All you need is the "physical"
> offset of that page within the pmem device (e.g. 'this is the 307th 4k
> page == offset 1257472 since the start of /dev/pmem0') and xfs can look
> up the owner of that range of physical storage and deal with it as
> needed.

Right. If we have the dax device associated with the page that had
the failure, then we can determine the offset of the page into the
block device address space and that's all we need to find the owner
of the page in the filesystem.

Note that there may actually be no owner - the page that had the
fault might land in free space, in which case we can simply zero
the page and clear the error.

> > So, a fsdax page is no longer associated with a specific file, but with a
> > FS(or the pmem device). I think it's easier to understand and implement.

Effectively, yes. But we shouldn't need to actually associate the
page with anything at the filesystem level because it is already
associated with a DAX device at a lower level via a dev_pagemap.
The hardware page fault already runs through this code,
memory_failure_dev_pagemap(), before it gets to the DAX code, so
really all we need to do is have that function pass us the page, the
offset into the device and, say, the struct dax_device associated with
that page so we can get to the filesystem superblock we can then use
for rmap lookups on...
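
In pseudocode, something like this (sketch only - the function itself
is real as of 5.7, but the offset calculation and the two dax helpers
are names I just made up):

static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
		struct dev_pagemap *pgmap)
{
	/*
	 * Work out the byte offset of the failed page into the dax
	 * device and hand the error to the device's holder (the
	 * filesystem) instead of walking page->mapping here.
	 * pgmap_to_dax_dev() and dax_notify_media_error() don't
	 * exist - they are illustrative only.
	 */
	u64 offset = PFN_PHYS(pfn) - pgmap->res.start;

	return dax_notify_media_error(pgmap_to_dax_dev(pgmap),
			offset, PAGE_SIZE);
}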

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-06-05 02:15:48

by Ruan Shiyang

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink



On 2020/6/4 10:51 PM, Darrick J. Wong wrote:
> On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:
>>
>>
>> On 2020/4/28 2:43 PM, Dave Chinner wrote:
>>> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>>>
>>>> On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
>>>>
>>>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>>>> This patchset is a try to resolve the shared 'page cache' problem for
>>>>>> fsdax.
>>>>>>
>>>>>> In order to track multiple mappings and indexes on one page, I
>>>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>>>>>> will be associated more than once if is shared. At the second time we
>>>>>> associate this entry, we create this rb-tree and store its root in
>>>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>>>>>> dax_associate_entry() and delete it when dax_disassociate_entry().
>>>>>
>>>>> Do we really want to track all of this on a per-page basis? I would
>>>>> have thought a per-extent basis was more useful. Essentially, create
>>>>> a new address_space for each shared extent. Per page just seems like
>>>>> a huge overhead.
>>>>>
>>>> Per-extent tracking is a nice idea for me. I haven't thought of it
>>>> yet...
>>>>
>>>> But the extent info is maintained by filesystem. I think we need a way
>>>> to obtain this info from FS when associating a page. May be a bit
>>>> complicated. Let me think about it...
>>>
>>> That's why I want the -user of this association- to do a filesystem
>>> callout instead of keeping it's own naive tracking infrastructure.
>>> The filesystem can do an efficient, on-demand reverse mapping lookup
>>> from it's own extent tracking infrastructure, and there's zero
>>> runtime overhead when there are no errors present.
>>
>> Hi Dave,
>>
>> I ran into some difficulties when trying to implement the per-extent rmap
>> tracking. So, I re-read your comments and found that I was misunderstanding
>> what you described here.
>>
>> I think what you mean is: we don't need the in-memory dax-rmap tracking now.
>> Just ask the FS for the owner's information that associate with one page
>> when memory-failure. So, the per-page (even per-extent) dax-rmap is
>> needless in this case. Is this right?
>
> Right. XFS already has its own rmap tree.
>
>> Based on this, we only need to store the extent information of a fsdax page
>> in its ->mapping (by searching from FS). Then obtain the owners of this
>> page (also by searching from FS) when memory-failure or other rmap case
>> occurs.
>
> I don't even think you need that much. All you need is the "physical"
> offset of that page within the pmem device (e.g. 'this is the 307th 4k
> page == offset 1257472 since the start of /dev/pmem0') and xfs can look
> up the owner of that range of physical storage and deal with it as
> needed.

Yes, I think so.

>
>> So, a fsdax page is no longer associated with a specific file, but with a
>> FS(or the pmem device). I think it's easier to understand and implement.
>
> Yes. I also suspect this will be necessary to support reflink...
>
> --D

OK, thank you very much.


--
Thanks,
Ruan Shiyang.

>
>>
>> --
>> Thanks,
>> Ruan Shiyang.
>>>
>>> At the moment, this "dax association" is used to "report" a storage
>>> media error directly to userspace. I say "report" because what it
>>> does is kill userspace processes dead. The storage media error
>>> actually needs to be reported to the owner of the storage media,
>>> which in the case of FS-DAX is the filesytem.
>>>
>>> That way the filesystem can then look up all the owners of that bad
>>> media range (i.e. the filesystem block it corresponds to) and take
>>> appropriate action. e.g.
>>>
>>> - if it falls in filesytem metadata, shutdown the filesystem
>>> - if it falls in user data, call the "kill userspace dead" routines
>>> for each mapping/index tuple the filesystem finds for the given
>>> LBA address that the media error occurred.
>>>
>>> Right now if the media error is in filesystem metadata, the
>>> filesystem isn't even told about it. The filesystem can't even shut
>>> down - the error is just dropped on the floor and it won't be until
>>> the filesystem next tries to reference that metadata that we notice
>>> there is an issue.
>>>
>>> Cheers,
>>>
>>> Dave.
>>>
>>
>>
>
>


2020-06-05 02:33:32

by Ruan Shiyang

Subject: Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink



On 2020/6/5 9:30 AM, Dave Chinner wrote:
> On Thu, Jun 04, 2020 at 07:51:07AM -0700, Darrick J. Wong wrote:
>> On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:
>>>
>>>
>>> On 2020/4/28 2:43 PM, Dave Chinner wrote:
>>>> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>>>>
>>>>> On 2020/4/27 20:28:36, "Matthew Wilcox" <[email protected]> wrote:
>>>>>
>>>>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>>>>> This patchset is a try to resolve the shared 'page cache' problem for
>>>>>>> fsdax.
>>>>>>>
>>>>>>> In order to track multiple mappings and indexes on one page, I
>>>>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>>>>>>> will be associated more than once if is shared. At the second time we
>>>>>>> associate this entry, we create this rb-tree and store its root in
>>>>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>>>>>>> dax_associate_entry() and delete it when dax_disassociate_entry().
>>>>>>
>>>>>> Do we really want to track all of this on a per-page basis? I would
>>>>>> have thought a per-extent basis was more useful. Essentially, create
>>>>>> a new address_space for each shared extent. Per page just seems like
>>>>>> a huge overhead.
>>>>>>
>>>>> Per-extent tracking is a nice idea for me. I haven't thought of it
>>>>> yet...
>>>>>
>>>>> But the extent info is maintained by filesystem. I think we need a way
>>>>> to obtain this info from FS when associating a page. May be a bit
>>>>> complicated. Let me think about it...
>>>>
>>>> That's why I want the -user of this association- to do a filesystem
>>>> callout instead of keeping it's own naive tracking infrastructure.
>>>> The filesystem can do an efficient, on-demand reverse mapping lookup
>>>> from it's own extent tracking infrastructure, and there's zero
>>>> runtime overhead when there are no errors present.
>>>
>>> Hi Dave,
>>>
>>> I ran into some difficulties when trying to implement the per-extent rmap
>>> tracking. So, I re-read your comments and found that I was misunderstanding
>>> what you described here.
>>>
>>> I think what you mean is: we don't need the in-memory dax-rmap tracking now.
>>> Just ask the FS for the owner's information that associate with one page
>>> when memory-failure. So, the per-page (even per-extent) dax-rmap is
>>> needless in this case. Is this right?
>>
>> Right. XFS already has its own rmap tree.
>
> *nod*
>
>>> Based on this, we only need to store the extent information of a fsdax page
>>> in its ->mapping (by searching from FS). Then obtain the owners of this
>>> page (also by searching from FS) when memory-failure or other rmap case
>>> occurs.
>>
>> I don't even think you need that much. All you need is the "physical"
>> offset of that page within the pmem device (e.g. 'this is the 307th 4k
>> page == offset 1257472 since the start of /dev/pmem0') and xfs can look
>> up the owner of that range of physical storage and deal with it as
>> needed.
>
> Right. If we have the dax device associated with the page that had
> the failure, then we can determine the offset of the page into the
> block device address space and that's all we need to find the owner
> of the page in the filesystem.
>
> Note that there may actually be no owner - the page that had the
> fault might land in free space, in which case we can simply zero
> the page and clear the error.

OK. Thanks for pointing that out.

>
>>> So, a fsdax page is no longer associated with a specific file, but with a
>>> FS(or the pmem device). I think it's easier to understand and implement.
>
> Effectively, yes. But we shouldn't need to actually associate the
> page with anything at the filesystem level because it is already
> associated with a DAX device at a lower level via a dev_pagemap.
> The hardware page fault already runs thought this code
> memory_failure_dev_pagemap() before it gets to the DAX code, so
> really all we need to is have that function pass us the page, offset
> into the device and, say, the struct dax_device associated with that
> page so we can get to the filesystem superblock we can then use for
> rmap lookups on...
>

OK. I was just thinking about how I can execute the FS rmap search
from memory-failure. Thanks again for pointing that out. :)


--
Thanks,
Ruan Shiyang.

> Cheers,
>
> Dave.
>