2022-04-20 10:21:48

by Shiyang Ruan

Subject: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

This patchset aims to support shared page tracking for fsdax.

Changes since V12:
- Rebased onto next-20220414
- Do not continue ->notify_failure() if filesystem is not ready yet
- Simplify the logic of setting CoW flag
- Fix build warning/error and typo

This patchset moves owner tracking from dax_associate_entry() to the pmem
device driver by introducing a ->memory_failure() interface for struct
pagemap. This interface is called by memory_failure() in mm and
implemented by the pmem device.

It then calls the holder operations to find the filesystem in which the
corrupted data is located, and calls the filesystem handler to track the
files or metadata associated with this page.

Finally, we are able to try to fix the corrupted data in the filesystem and
do other necessary processing, such as killing processes that are using
the affected files.

The call trace is like this:
memory_failure()
|* fsdax case
|------------
|pgmap->ops->memory_failure()      => pmem_pgmap_memory_failure()
| dax_holder_notify_failure()      =>
|  dax_device->holder_ops->notify_failure() =>
|                                     - xfs_dax_notify_failure()
|  |* xfs_dax_notify_failure()
|  |--------------------------
|  |   xfs_rmap_query_range()
|  |    xfs_dax_failure_fn()
|  |    * corrupted on metadata
|  |       try to recover data, call xfs_force_shutdown()
|  |    * corrupted on file data
|  |       try to recover data, call mf_dax_kill_procs()
|* normal case
|-------------
|mf_generic_kill_procs()
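
As a rough illustration of the dispatch shown in the trace above, the following user-space sketch tries a driver-provided hook first and falls back to the generic handler. All names here (struct pagemap_ops, dispatch_memory_failure(), the HANDLER_* values and the mock hooks) are hypothetical stand-ins rather than kernel code, and the fallback on -EOPNOTSUPP is an assumption made for the sketch, not something stated in this cover letter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct pagemap;

/* Hypothetical stand-in for the per-device operations table. */
struct pagemap_ops {
	int (*memory_failure)(struct pagemap *pgmap, unsigned long pfn,
			      int flags);
};

struct pagemap {
	const struct pagemap_ops *ops;
};

/* Records which path handled the most recent failure (for illustration). */
enum handler { HANDLER_NONE, HANDLER_DRIVER, HANDLER_GENERIC };
static enum handler last_handler = HANDLER_NONE;

/* Stand-in for the "normal case" branch of the trace. */
static int generic_kill_procs(unsigned long pfn, int flags)
{
	(void)pfn;
	(void)flags;
	last_handler = HANDLER_GENERIC;
	return 0;
}

/* Mock driver hook that handles the failure itself ("fsdax case"). */
static int fsdax_hook(struct pagemap *pgmap, unsigned long pfn, int flags)
{
	(void)pgmap;
	(void)pfn;
	(void)flags;
	last_handler = HANDLER_DRIVER;
	return 0;
}

/* Mock driver hook that declines, e.g. because no holder is registered. */
static int unsupported_hook(struct pagemap *pgmap, unsigned long pfn,
			    int flags)
{
	(void)pgmap;
	(void)pfn;
	(void)flags;
	return -EOPNOTSUPP;
}

/*
 * Try the driver hook if one is registered; fall back to the generic
 * handler when there is no hook or the hook reports -EOPNOTSUPP
 * (assumed behaviour for this sketch).
 */
static int dispatch_memory_failure(struct pagemap *pgmap, unsigned long pfn,
				   int flags)
{
	assert(pgmap != NULL);

	if (pgmap->ops && pgmap->ops->memory_failure) {
		int rc = pgmap->ops->memory_failure(pgmap, pfn, flags);

		if (rc != -EOPNOTSUPP)
			return rc;
	}

	return generic_kill_procs(pfn, flags);
}
```

The point of the function-pointer table is that mm code does not need to know which filesystem, if any, sits on top of the device; it only asks the device's hook and keeps the old behaviour as the fallback.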

==
Shiyang Ruan (7):
dax: Introduce holder for dax_device
mm: factor helpers for memory_failure_dev_pagemap
pagemap,pmem: Introduce ->memory_failure()
fsdax: Introduce dax_lock_mapping_entry()
mm: Introduce mf_dax_kill_procs() for fsdax case
xfs: Implement ->notify_failure() for XFS
fsdax: set a CoW flag when associate reflink mappings

drivers/dax/super.c | 67 +++++++++-
drivers/md/dm.c | 2 +-
drivers/nvdimm/pmem.c | 17 +++
fs/dax.c | 113 ++++++++++++++--
fs/erofs/super.c | 10 +-
fs/ext2/super.c | 7 +-
fs/ext4/super.c | 9 +-
fs/xfs/Makefile | 5 +
fs/xfs/xfs_buf.c | 10 +-
fs/xfs/xfs_fsops.c | 3 +
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_notify_failure.c | 220 +++++++++++++++++++++++++++++++
fs/xfs/xfs_super.h | 1 +
include/linux/dax.h | 48 +++++--
include/linux/memremap.h | 12 ++
include/linux/mm.h | 2 +
include/linux/page-flags.h | 6 +
mm/memory-failure.c | 255 +++++++++++++++++++++++++-----------
18 files changed, 682 insertions(+), 106 deletions(-)
create mode 100644 fs/xfs/xfs_notify_failure.c

--
2.35.1




2022-04-21 08:52:03

by Dave Chinner

Subject: Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> > Sure, I'm not a maintainer and just the stand-in patch shepherd for
> > a single release. However, being unable to cleanly merge code we
> > need integrated into our local subsystem tree for integration
> > testing because a patch dependency with another subsystem won't gain
> > a stable commit ID until the next merge window is .... distinctly
> > suboptimal.
>
> Yes. Which is why we've taken a lot of mm patches through other trees,
> sometimes specially crafted for that. So I guess in this case we'll
> just need to take non-trivial dependencies into the XFS tree, and just
> deal with small merge conflicts for the trivial ones.

OK. As Naoya has pointed out, the first dependency/conflict Ruan has
listed looks trivial to resolve.

The second dependency, OTOH, is on a new function added in the patch
pointed to. That said, at first glance it looks to be independent of
the first two patches in that series so I might just be able to pull
that one patch in and have that leave us with a working
fsdax+reflink tree.

Regardless, I'll wait to see how much work the updated XFS/DAX
reflink enablement patchset still requires when Ruan posts it before
deciding what to do here. If it isn't going to be a merge
candidate, what to do with this patchset is moot because there's
little to test without reflink enabled...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2022-04-21 09:07:38

by Shiyang Ruan

Subject: [PATCH v13 2/7] mm: factor helpers for memory_failure_dev_pagemap

The memory_failure_dev_pagemap() code is already somewhat complex before
the RMAP feature for fsdax is introduced, so factor out some helper
functions to simplify it.

Signed-off-by: Shiyang Ruan <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
---
mm/memory-failure.c | 157 ++++++++++++++++++++++++--------------------
1 file changed, 87 insertions(+), 70 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e3fbff5bd467..7c8c047bfdc8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1498,6 +1498,90 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
 	return 0;
 }

+static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
+		struct address_space *mapping, pgoff_t index, int flags)
+{
+	struct to_kill *tk;
+	unsigned long size = 0;
+
+	list_for_each_entry(tk, to_kill, nd)
+		if (tk->size_shift)
+			size = max(size, 1UL << tk->size_shift);
+
+	if (size) {
+		/*
+		 * Unmap the largest mapping to avoid breaking up device-dax
+		 * mappings which are constant size. The actual size of the
+		 * mapping being torn down is communicated in siginfo, see
+		 * kill_proc()
+		 */
+		loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
+
+		unmap_mapping_range(mapping, start, size, 0);
+	}
+
+	kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
+}
+
+static int mf_generic_kill_procs(unsigned long long pfn, int flags,
+		struct dev_pagemap *pgmap)
+{
+	struct page *page = pfn_to_page(pfn);
+	LIST_HEAD(to_kill);
+	dax_entry_t cookie;
+	int rc = 0;
+
+	/*
+	 * Pages instantiated by device-dax (not filesystem-dax)
+	 * may be compound pages.
+	 */
+	page = compound_head(page);
+
+	/*
+	 * Prevent the inode from being freed while we are interrogating
+	 * the address_space, typically this would be handled by
+	 * lock_page(), but dax pages do not use the page lock. This
+	 * also prevents changes to the mapping of this pfn until
+	 * poison signaling is complete.
+	 */
+	cookie = dax_lock_page(page);
+	if (!cookie)
+		return -EBUSY;
+
+	if (hwpoison_filter(page)) {
+		rc = -EOPNOTSUPP;
+		goto unlock;
+	}
+
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+		/*
+		 * TODO: Handle HMM pages which may need coordination
+		 * with device-side memory.
+		 */
+		return -EBUSY;
+	}
+
+	/*
+	 * Use this flag as an indication that the dax page has been
+	 * remapped UC to prevent speculative consumption of poison.
+	 */
+	SetPageHWPoison(page);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a
+	 * different physical page at a given virtual address, so all
+	 * userspace consumption of ZONE_DEVICE memory necessitates
+	 * SIGBUS (i.e. MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	collect_procs(page, &to_kill, true);
+
+	unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
+unlock:
+	dax_unlock_page(page, cookie);
+	return rc;
+}
+
 /*
  * Called from hugetlb code with hugetlb_lock held.
  *
@@ -1644,12 +1728,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		struct dev_pagemap *pgmap)
 {
 	struct page *page = pfn_to_page(pfn);
-	unsigned long size = 0;
-	struct to_kill *tk;
 	LIST_HEAD(tokill);
-	int rc = -EBUSY;
-	loff_t start;
-	dax_entry_t cookie;
+	int rc = -ENXIO;

 	if (flags & MF_COUNT_INCREASED)
 		/*
@@ -1658,73 +1738,10 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		put_page(page);

 	/* device metadata space is not recoverable */
-	if (!pgmap_pfn_valid(pgmap, pfn)) {
-		rc = -ENXIO;
-		goto out;
-	}
-
-	/*
-	 * Pages instantiated by device-dax (not filesystem-dax)
-	 * may be compound pages.
-	 */
-	page = compound_head(page);
-
-	/*
-	 * Prevent the inode from being freed while we are interrogating
-	 * the address_space, typically this would be handled by
-	 * lock_page(), but dax pages do not use the page lock. This
-	 * also prevents changes to the mapping of this pfn until
-	 * poison signaling is complete.
-	 */
-	cookie = dax_lock_page(page);
-	if (!cookie)
+	if (!pgmap_pfn_valid(pgmap, pfn))
 		goto out;

-	if (hwpoison_filter(page)) {
-		rc = -EOPNOTSUPP;
-		goto unlock;
-	}
-
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		/*
-		 * TODO: Handle HMM pages which may need coordination
-		 * with device-side memory.
-		 */
-		goto unlock;
-	}
-
-	/*
-	 * Use this flag as an indication that the dax page has been
-	 * remapped UC to prevent speculative consumption of poison.
-	 */
-	SetPageHWPoison(page);
-
-	/*
-	 * Unlike System-RAM there is no possibility to swap in a
-	 * different physical page at a given virtual address, so all
-	 * userspace consumption of ZONE_DEVICE memory necessitates
-	 * SIGBUS (i.e. MF_MUST_KILL)
-	 */
-	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
-	collect_procs(page, &tokill, true);
-
-	list_for_each_entry(tk, &tokill, nd)
-		if (tk->size_shift)
-			size = max(size, 1UL << tk->size_shift);
-	if (size) {
-		/*
-		 * Unmap the largest mapping to avoid breaking up
-		 * device-dax mappings which are constant size. The
-		 * actual size of the mapping being torn down is
-		 * communicated in siginfo, see kill_proc()
-		 */
-		start = (page->index << PAGE_SHIFT) & ~(size - 1);
-		unmap_mapping_range(page->mapping, start, size, 0);
-	}
-	kill_procs(&tokill, true, false, pfn, flags);
-	rc = 0;
-unlock:
-	dax_unlock_page(page, cookie);
+	rc = mf_generic_kill_procs(pfn, flags, pgmap);
 out:
 	/* drop pgmap ref acquired in caller */
 	put_dev_pagemap(pgmap);
--
2.35.1
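
The size and offset arithmetic in unmap_and_kill() above can be tried out in user space. The sketch below is only an illustration under stated assumptions: PAGE_SHIFT is fixed at 12 (4 KiB pages), and largest_mapping_size() and unmap_start() are hypothetical helper names that mirror the loop over tk->size_shift and the `(index << PAGE_SHIFT) & ~(size - 1)` expression from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB base pages */

/*
 * Largest mapping size in bytes among the entries, where each entry is
 * log2 of a mapping size and 0 means "no size recorded". Returns 0 if
 * no entry carries a size, in which case nothing would be unmapped.
 */
static uint64_t largest_mapping_size(const unsigned int *size_shifts, size_t n)
{
	uint64_t size = 0;

	for (size_t i = 0; i < n; i++) {
		if (size_shifts[i]) {
			uint64_t cur = UINT64_C(1) << size_shifts[i];

			if (cur > size)
				size = cur;
		}
	}
	return size;
}

/*
 * Byte offset of the start of the size-aligned range that contains page
 * `index`. `size` must be a non-zero power of two, so that masking with
 * ~(size - 1) rounds the offset down to a multiple of size.
 */
static uint64_t unmap_start(uint64_t index, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (index << PAGE_SHIFT) & ~(size - 1);
}
```

For example, page index 513 lies at byte offset 0x201000. With a 2 MiB largest mapping the start is rounded down to 0x200000, so the whole 2 MiB range containing the poisoned page is unmapped rather than a single 4 KiB page.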



2022-04-21 15:16:03

by Miaohe Lin

Subject: Re: [PATCH v13 2/7] mm: factor helpers for memory_failure_dev_pagemap

On 2022/4/21 14:13, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Apr 19, 2022 at 12:50:40PM +0800, Shiyang Ruan wrote:
>> memory_failure_dev_pagemap code is a bit complex before introduce RMAP
>> feature for fsdax. So it is needed to factor some helper functions to
>> simplify these code.
>>
>> Signed-off-by: Shiyang Ruan <[email protected]>
>> Reviewed-by: Darrick J. Wong <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> Reviewed-by: Dan Williams <[email protected]>
>
> Thanks for the refactoring. As I commented on 0/7, the conflict with
> "mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()"
> can be trivially resolved.
>
> A few more comments below ...
>
>> ---
>> mm/memory-failure.c | 157 ++++++++++++++++++++++++--------------------
>> 1 file changed, 87 insertions(+), 70 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index e3fbff5bd467..7c8c047bfdc8 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1498,6 +1498,90 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
>> return 0;
>> }
>>
>> +static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
>> + struct address_space *mapping, pgoff_t index, int flags)
>> +{
>> + struct to_kill *tk;
>> + unsigned long size = 0;
>> +
>> + list_for_each_entry(tk, to_kill, nd)
>> + if (tk->size_shift)
>> + size = max(size, 1UL << tk->size_shift);
>> +
>> + if (size) {
>> + /*
>> + * Unmap the largest mapping to avoid breaking up device-dax
>> + * mappings which are constant size. The actual size of the
>> + * mapping being torn down is communicated in siginfo, see
>> + * kill_proc()
>> + */
>> + loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
>> +
>> + unmap_mapping_range(mapping, start, size, 0);
>> + }
>> +
>> + kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
>> +}
>> +
>> +static int mf_generic_kill_procs(unsigned long long pfn, int flags,
>> + struct dev_pagemap *pgmap)
>> +{
>> + struct page *page = pfn_to_page(pfn);
>> + LIST_HEAD(to_kill);
>> + dax_entry_t cookie;
>> + int rc = 0;
>> +
>> + /*
>> + * Pages instantiated by device-dax (not filesystem-dax)
>> + * may be compound pages.
>> + */
>> + page = compound_head(page);
>> +
>> + /*
>> + * Prevent the inode from being freed while we are interrogating
>> + * the address_space, typically this would be handled by
>> + * lock_page(), but dax pages do not use the page lock. This
>> + * also prevents changes to the mapping of this pfn until
>> + * poison signaling is complete.
>> + */
>> + cookie = dax_lock_page(page);
>> + if (!cookie)
>> + return -EBUSY;
>> +
>> + if (hwpoison_filter(page)) {
>> + rc = -EOPNOTSUPP;
>> + goto unlock;
>> + }
>> +
>> + if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>> + /*
>> + * TODO: Handle HMM pages which may need coordination
>> + * with device-side memory.
>> + */
>> + return -EBUSY;
>
> Don't we need to go to dax_unlock_page() as the original code does?

I think dax_unlock_page() is needed too. Please also remember to set rc to -EBUSY before jumping to the exit path.

>
>> + }
>> +
>> + /*
>> + * Use this flag as an indication that the dax page has been
>> + * remapped UC to prevent speculative consumption of poison.
>> + */
>> + SetPageHWPoison(page);
>> +
>> + /*
>> + * Unlike System-RAM there is no possibility to swap in a
>> + * different physical page at a given virtual address, so all
>> + * userspace consumption of ZONE_DEVICE memory necessitates
>> + * SIGBUS (i.e. MF_MUST_KILL)
>> + */
>> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>> + collect_procs(page, &to_kill, true);
>> +
>> + unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
>> +unlock:
>> + dax_unlock_page(page, cookie);
>> + return rc;
>> +}
>> +
>> /*
>> * Called from hugetlb code with hugetlb_lock held.
>> *
>> @@ -1644,12 +1728,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>> struct dev_pagemap *pgmap)
>> {
>> struct page *page = pfn_to_page(pfn);
>> - unsigned long size = 0;
>> - struct to_kill *tk;
>> LIST_HEAD(tokill);
>
> Is this variable unused in this function?

There is already a to_kill in mf_generic_kill_procs(), so this one is not needed and should be removed.

>
> Thanks,
> Naoya Horiguchi
>

Except for the above nit, the patch looks good to me. Thanks!

Reviewed-by: Miaohe Lin <[email protected]>
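
The two review comments above (the missing dax_unlock_page() on the MEMORY_DEVICE_PRIVATE path, and setting rc to -EBUSY before leaving) describe a single-exit cleanup pattern. The following user-space sketch shows that control flow only; mock_lock(), mock_unlock(), lock_depth and kill_procs_sketch() are hypothetical names, and this is not the code from the next version of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the DAX entry lock: counts current holders. */
static int lock_depth;

static bool mock_lock(void)
{
	lock_depth++;
	return true;
}

static void mock_unlock(void)
{
	assert(lock_depth > 0);
	lock_depth--;
}

/*
 * Control flow with the review comments applied: once the lock is held,
 * every exit goes through the unlock label, and the device-private case
 * records -EBUSY in rc instead of returning directly.
 */
static int kill_procs_sketch(bool filtered, bool device_private)
{
	int rc = 0;

	if (!mock_lock())
		return -EBUSY;	/* lock not taken, nothing to release */

	if (filtered) {
		rc = -EOPNOTSUPP;
		goto unlock;
	}

	if (device_private) {
		rc = -EBUSY;	/* set rc first, then release the lock */
		goto unlock;
	}

	/* Poison marking and process killing would happen here. */
unlock:
	mock_unlock();
	return rc;
}
```

With a direct return in the device-private branch, lock_depth would stay at 1 after the call, which is the leak the reviewers are pointing at.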

2022-04-22 02:41:21

by Shiyang Ruan

Subject: Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

Hi Dave,

On 2022/4/21 9:20, Dave Chinner wrote:
> Hi Ruan,
>
> On Tue, Apr 19, 2022 at 12:50:38PM +0800, Shiyang Ruan wrote:
>> This patchset is aimed to support shared pages tracking for fsdax.
>
> Now that this is largely reviewed, it's time to work out the
> logistics of merging it.

Thanks!

>
>> Changes since V12:
>> - Rebased onto next-20220414
>
> What does this depend on that is in the linux-next kernel?
>
> i.e. can this be applied successfully to a v5.18-rc2 kernel without
> needing to drag in any other patchsets/commits/trees?

First, I tried to apply it to v5.18-rc2, but it failed.

Besides my Patch-02, there is another change to memory-failure.c:
"mm/hwpoison: fix race between hugetlb free/demotion and
memory_failure_hugetlb()"

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=423228ce93c6a283132be38d442120c8e4cdb061

As for why it is based on linux-next: I was told[1] that there is a better
fix for "pgoff_address()" in linux-next:
"mm: rmap: introduce pfn_mkclean_range() to cleans PTEs"

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=65c9605009f8317bb3983519874d755a0b2ca746
so I rebased my patches onto it and dropped one of mine.

[1] https://lore.kernel.org/linux-xfs/[email protected]/

>
> What are your plans for the followup patches that enable
> reflink+fsdax in XFS? AFAICT that patchset hasn't been posted for
> a while so I don't know what its status is. Is that patchset anywhere
> near ready for merge in this cycle?
>
> If that patchset is not a candidate for this cycle, then it largely
> doesn't matter what tree this is merged through as there shouldn't
> be any major XFS or dax dependencies being built on top of it during
> this cycle. The filesystem side changes are isolated and won't
> conflict with other work in XFS, either, so this could easily go
> through Dan's tree.
>
> However, if the reflink enablement is ready to go, then this all
> needs to be in the XFS tree so that we can run it through filesystem
> level DAX+reflink testing. That will mean we need this in a stable
> shared topic branch and tighter co-ordination between the trees.
>
> So before we go any further we need to know if the dax+reflink
> enablement patchset is near being ready to merge....

The "reflink+fsdax" patchset is here:

https://lore.kernel.org/linux-xfs/[email protected]/

It was based on v5.15-rc3, so I think I should rebase it.


--
Thanks,
Ruan.

>
> Cheers,
>
> Dave.


2022-04-22 16:22:48

by Shiyang Ruan

Subject: Re: [PATCH v13 2/7] mm: factor helpers for memory_failure_dev_pagemap



On 2022/4/21 14:13, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Apr 19, 2022 at 12:50:40PM +0800, Shiyang Ruan wrote:
>> memory_failure_dev_pagemap code is a bit complex before introduce RMAP
>> feature for fsdax. So it is needed to factor some helper functions to
>> simplify these code.
>>
>> Signed-off-by: Shiyang Ruan <[email protected]>
>> Reviewed-by: Darrick J. Wong <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> Reviewed-by: Dan Williams <[email protected]>
>
> Thanks for the refactoring. As I commented on 0/7, the conflict with
> "mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()"
> can be trivially resolved.
>
> A few more comments below ...
>
>> ---
>> mm/memory-failure.c | 157 ++++++++++++++++++++++++--------------------
>> 1 file changed, 87 insertions(+), 70 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index e3fbff5bd467..7c8c047bfdc8 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1498,6 +1498,90 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
>> return 0;
>> }
>>
>> +static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
>> + struct address_space *mapping, pgoff_t index, int flags)
>> +{
>> + struct to_kill *tk;
>> + unsigned long size = 0;
>> +
>> + list_for_each_entry(tk, to_kill, nd)
>> + if (tk->size_shift)
>> + size = max(size, 1UL << tk->size_shift);
>> +
>> + if (size) {
>> + /*
>> + * Unmap the largest mapping to avoid breaking up device-dax
>> + * mappings which are constant size. The actual size of the
>> + * mapping being torn down is communicated in siginfo, see
>> + * kill_proc()
>> + */
>> + loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
>> +
>> + unmap_mapping_range(mapping, start, size, 0);
>> + }
>> +
>> + kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
>> +}
>> +
>> +static int mf_generic_kill_procs(unsigned long long pfn, int flags,
>> + struct dev_pagemap *pgmap)
>> +{
>> + struct page *page = pfn_to_page(pfn);
>> + LIST_HEAD(to_kill);
>> + dax_entry_t cookie;
>> + int rc = 0;
>> +
>> + /*
>> + * Pages instantiated by device-dax (not filesystem-dax)
>> + * may be compound pages.
>> + */
>> + page = compound_head(page);
>> +
>> + /*
>> + * Prevent the inode from being freed while we are interrogating
>> + * the address_space, typically this would be handled by
>> + * lock_page(), but dax pages do not use the page lock. This
>> + * also prevents changes to the mapping of this pfn until
>> + * poison signaling is complete.
>> + */
>> + cookie = dax_lock_page(page);
>> + if (!cookie)
>> + return -EBUSY;
>> +
>> + if (hwpoison_filter(page)) {
>> + rc = -EOPNOTSUPP;
>> + goto unlock;
>> + }
>> +
>> + if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>> + /*
>> + * TODO: Handle HMM pages which may need coordination
>> + * with device-side memory.
>> + */
>> + return -EBUSY;
>
> Don't we need to go to dax_unlock_page() as the original code does?
>
>> + }
>> +
>> + /*
>> + * Use this flag as an indication that the dax page has been
>> + * remapped UC to prevent speculative consumption of poison.
>> + */
>> + SetPageHWPoison(page);
>> +
>> + /*
>> + * Unlike System-RAM there is no possibility to swap in a
>> + * different physical page at a given virtual address, so all
>> + * userspace consumption of ZONE_DEVICE memory necessitates
>> + * SIGBUS (i.e. MF_MUST_KILL)
>> + */
>> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>> + collect_procs(page, &to_kill, true);
>> +
>> + unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
>> +unlock:
>> + dax_unlock_page(page, cookie);
>> + return rc;
>> +}
>> +
>> /*
>> * Called from hugetlb code with hugetlb_lock held.
>> *
>> @@ -1644,12 +1728,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>> struct dev_pagemap *pgmap)
>> {
>> struct page *page = pfn_to_page(pfn);
>> - unsigned long size = 0;
>> - struct to_kill *tk;
>> LIST_HEAD(tokill);
>
> Is this variable unused in this function?

Yes, this one and the one above are mistakes I didn't notice when I was
resolving conflicts with the newer next- branch. I'll fix them in the next
version.


--
Thanks,
Ruan.

>
> Thanks,
> Naoya Horiguchi


Subject: Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

Hi everyone,

On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> On Wed, Apr 20, 2022 at 07:20:07PM -0700, Dan Williams wrote:
> > [ add Andrew and Naoya ]
> >
> > On Wed, Apr 20, 2022 at 6:48 PM Shiyang Ruan <[email protected]> wrote:
> > >
> > > Hi Dave,
> > >
> > > On 2022/4/21 9:20, Dave Chinner wrote:
> > > > Hi Ruan,
> > > >
> > > > On Tue, Apr 19, 2022 at 12:50:38PM +0800, Shiyang Ruan wrote:
> > > >> This patchset is aimed to support shared pages tracking for fsdax.
> > > >
> > > > Now that this is largely reviewed, it's time to work out the
> > > > logistics of merging it.
> > >
> > > Thanks!
> > >
> > > >
> > > >> Changes since V12:
> > > >> - Rebased onto next-20220414
> > > >
> > > > What does this depend on that is in the linux-next kernel?
> > > >
> > > > i.e. can this be applied successfully to a v5.18-rc2 kernel without
> > > > needing to drag in any other patchsets/commits/trees?
> > >
> > > Firstly, I tried to apply to v5.18-rc2 but it failed.
> > >
> > > There are some changes in memory-failure.c, which besides my Patch-02
> > > "mm/hwpoison: fix race between hugetlb free/demotion and
> > > memory_failure_hugetlb()"
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=423228ce93c6a283132be38d442120c8e4cdb061

This commit should not logically conflict with patch 2/7 (it is just a mismatch in context),
and the conflict can be trivially resolved: simply defining the two new functions
(unmap_and_kill() and mf_generic_kill_procs()) just below try_to_split_thp_page()
(or somewhere else before memory_failure_dev_pagemap()) is a correct resolution.

> > >
> > > Then, why it is on linux-next is: I was told[1] there is a better fix
> > > about "pgoff_address()" in linux-next:
> > > "mm: rmap: introduce pfn_mkclean_range() to cleans PTEs"
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=65c9605009f8317bb3983519874d755a0b2ca746
> > > so I rebased my patches to it and dropped one of mine.
> > >
> > > [1] https://lore.kernel.org/linux-xfs/[email protected]/
> >
> > From my perspective, once something has -mm dependencies it needs to
> > go through Andrew's tree, and if it's going through Andrew's tree I
> > think that means the reflink side of this needs to wait a cycle as
> > there is no stable point that the XFS tree could merge to build on top
> > of.
>
> Ngggh. Still? Really?
>
> Sure, I'm not a maintainer and just the stand-in patch shepherd for
> a single release. However, being unable to cleanly merge code we
> need integrated into our local subsystem tree for integration
> testing because a patch dependency with another subsystem won't gain
> a stable commit ID until the next merge window is .... distinctly
> suboptimal.
>
> We know how to do this cleanly, quickly and efficiently - we've been
> doing cross-subsystem shared git branch co-ordination for
> VFS/fs/block stuff when needed for many, many years. It's pretty
> easy to do, just requires clear communication to decide where the
> source branch will be kept. It doesn't even matter what order Linus
> then merges the trees - they are self contained and git sorts out
> the duplicated commits without an issue.
>
> I mean, we've been using git for *17 years* now - this stuff should
> be second nature to maintainers by now. So how is it still
> considered acceptable for a core kernel subsystem not to have the
> ability to provide other subsystems with stable commits/branches
> so we can cleanly develop cross-subsystem functionality quickly and
> efficiently?
>
> > The last reviewed-by this wants before going through there is Naoya's
> > on the memory-failure.c changes.
>
> Naoya?

I'll reply to the individual patches soon.

Thanks,
Naoya Horiguchi

2022-04-22 17:57:21

by Dave Chinner

Subject: Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

On Wed, Apr 20, 2022 at 07:20:07PM -0700, Dan Williams wrote:
> [ add Andrew and Naoya ]
>
> On Wed, Apr 20, 2022 at 6:48 PM Shiyang Ruan <[email protected]> wrote:
> >
> > Hi Dave,
> >
> > On 2022/4/21 9:20, Dave Chinner wrote:
> > > Hi Ruan,
> > >
> > > On Tue, Apr 19, 2022 at 12:50:38PM +0800, Shiyang Ruan wrote:
> > >> This patchset is aimed to support shared pages tracking for fsdax.
> > >
> > > Now that this is largely reviewed, it's time to work out the
> > > logistics of merging it.
> >
> > Thanks!
> >
> > >
> > >> Changes since V12:
> > >> - Rebased onto next-20220414
> > >
> > > What does this depend on that is in the linux-next kernel?
> > >
> > > i.e. can this be applied successfully to a v5.18-rc2 kernel without
> > > needing to drag in any other patchsets/commits/trees?
> >
> > Firstly, I tried to apply to v5.18-rc2 but it failed.
> >
> > There are some changes in memory-failure.c, which besides my Patch-02
> > "mm/hwpoison: fix race between hugetlb free/demotion and
> > memory_failure_hugetlb()"
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=423228ce93c6a283132be38d442120c8e4cdb061
> >
> > Then, why it is on linux-next is: I was told[1] there is a better fix
> > about "pgoff_address()" in linux-next:
> > "mm: rmap: introduce pfn_mkclean_range() to cleans PTEs"
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=65c9605009f8317bb3983519874d755a0b2ca746
> > so I rebased my patches to it and dropped one of mine.
> >
> > [1] https://lore.kernel.org/linux-xfs/[email protected]/
>
> From my perspective, once something has -mm dependencies it needs to
> go through Andrew's tree, and if it's going through Andrew's tree I
> think that means the reflink side of this needs to wait a cycle as
> there is no stable point that the XFS tree could merge to build on top
> of.

Ngggh. Still? Really?

Sure, I'm not a maintainer and just the stand-in patch shepherd for
a single release. However, being unable to cleanly merge code we
need integrated into our local subsystem tree for integration
testing because a patch dependency with another subsystem won't gain
a stable commit ID until the next merge window is .... distinctly
suboptimal.

We know how to do this cleanly, quickly and efficiently - we've been
doing cross-subsystem shared git branch co-ordination for
VFS/fs/block stuff when needed for many, many years. It's pretty
easy to do, just requires clear communication to decide where the
source branch will be kept. It doesn't even matter what order Linus
then merges the trees - they are self contained and git sorts out
the duplicated commits without an issue.

I mean, we've been using git for *17 years* now - this stuff should
be second nature to maintainers by now. So how is it still
considered acceptable for a core kernel subsystem not to have the
ability to provide other subsystems with stable commits/branches
so we can cleanly develop cross-subsystem functionality quickly and
efficiently?

> The last reviewed-by this wants before going through there is Naoya's
> on the memory-failure.c changes.

Naoya?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2022-04-22 19:32:19

by Dan Williams

Subject: Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

[ add Andrew and Naoya ]


On Wed, Apr 20, 2022 at 6:48 PM Shiyang Ruan <[email protected]> wrote:
>
> Hi Dave,
>
> On 2022/4/21 9:20, Dave Chinner wrote:
> > Hi Ruan,
> >
> > On Tue, Apr 19, 2022 at 12:50:38PM +0800, Shiyang Ruan wrote:
> >> This patchset is aimed to support shared pages tracking for fsdax.
> >
> > Now that this is largely reviewed, it's time to work out the
> > logistics of merging it.
>
> Thanks!
>
> >
> >> Changes since V12:
> >> - Rebased onto next-20220414
> >
> > What does this depend on that is in the linux-next kernel?
> >
> > i.e. can this be applied successfully to a v5.18-rc2 kernel without
> > needing to drag in any other patchsets/commits/trees?
>
> Firstly, I tried to apply to v5.18-rc2 but it failed.
>
> There are some changes in memory-failure.c, which besides my Patch-02
> "mm/hwpoison: fix race between hugetlb free/demotion and
> memory_failure_hugetlb()"
>
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=423228ce93c6a283132be38d442120c8e4cdb061
>
> Then, why it is on linux-next is: I was told[1] there is a better fix
> about "pgoff_address()" in linux-next:
> "mm: rmap: introduce pfn_mkclean_range() to cleans PTEs"
>
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=65c9605009f8317bb3983519874d755a0b2ca746
> so I rebased my patches to it and dropped one of mine.
>
> [1] https://lore.kernel.org/linux-xfs/[email protected]/

From my perspective, once something has -mm dependencies it needs to
go through Andrew's tree, and if it's going through Andrew's tree I
think that means the reflink side of this needs to wait a cycle as
there is no stable point that the XFS tree could merge to build on top
of.

The last reviewed-by this wants before going through there is Naoya's
on the memory-failure.c changes.

Subject: Re: [PATCH v13 2/7] mm: factor helpers for memory_failure_dev_pagemap

On Tue, Apr 19, 2022 at 12:50:40PM +0800, Shiyang Ruan wrote:
> memory_failure_dev_pagemap code is a bit complex before introduce RMAP
> feature for fsdax. So it is needed to factor some helper functions to
> simplify these code.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> Reviewed-by: Darrick J. Wong <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Dan Williams <[email protected]>

Thanks for the refactoring. As I commented on 0/7, the conflict with
"mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()"
can be trivially resolved.

A few more comments below ...

> ---
> mm/memory-failure.c | 157 ++++++++++++++++++++++++--------------------
> 1 file changed, 87 insertions(+), 70 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index e3fbff5bd467..7c8c047bfdc8 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1498,6 +1498,90 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
> return 0;
> }
>
> +static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
> + struct address_space *mapping, pgoff_t index, int flags)
> +{
> + struct to_kill *tk;
> + unsigned long size = 0;
> +
> + list_for_each_entry(tk, to_kill, nd)
> + if (tk->size_shift)
> + size = max(size, 1UL << tk->size_shift);
> +
> + if (size) {
> + /*
> + * Unmap the largest mapping to avoid breaking up device-dax
> + * mappings which are constant size. The actual size of the
> + * mapping being torn down is communicated in siginfo, see
> + * kill_proc()
> + */
> + loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
> +
> + unmap_mapping_range(mapping, start, size, 0);
> + }
> +
> + kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
> +}
> +
> +static int mf_generic_kill_procs(unsigned long long pfn, int flags,
> + struct dev_pagemap *pgmap)
> +{
> + struct page *page = pfn_to_page(pfn);
> + LIST_HEAD(to_kill);
> + dax_entry_t cookie;
> + int rc = 0;
> +
> + /*
> + * Pages instantiated by device-dax (not filesystem-dax)
> + * may be compound pages.
> + */
> + page = compound_head(page);
> +
> + /*
> + * Prevent the inode from being freed while we are interrogating
> + * the address_space, typically this would be handled by
> + * lock_page(), but dax pages do not use the page lock. This
> + * also prevents changes to the mapping of this pfn until
> + * poison signaling is complete.
> + */
> + cookie = dax_lock_page(page);
> + if (!cookie)
> + return -EBUSY;
> +
> + if (hwpoison_filter(page)) {
> + rc = -EOPNOTSUPP;
> + goto unlock;
> + }
> +
> + if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> + /*
> + * TODO: Handle HMM pages which may need coordination
> + * with device-side memory.
> + */
> + return -EBUSY;

Don't we need to go to dax_unlock_page() as the original code does?
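
To make the concern concrete, here is a minimal userspace sketch of the
cleanup pattern in question: once the lock has been taken, every exit,
including the MEMORY_DEVICE_PRIVATE one, goes through the unlock label.
All demo_* names are hypothetical stand-ins, not kernel interfaces, and
the real dax_lock_page() cookie handling is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dax entry lock; not a kernel type. */
struct demo_lock {
	bool held;
};

/* Hypothetical stand-in for pgmap->type. */
enum demo_type { DEMO_TYPE_FS_DAX, DEMO_TYPE_PRIVATE };

/* Returns false if the lock is already held, mimicking a failed lock. */
static bool demo_lock_page(struct demo_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void demo_unlock_page(struct demo_lock *l)
{
	assert(l->held);	/* unlocking an unheld lock is a bug */
	l->held = false;
}

/* Every exit taken after the lock succeeds passes through "unlock". */
static int demo_kill_procs(struct demo_lock *l, enum demo_type type,
			   bool filtered)
{
	int rc = 0;

	if (!demo_lock_page(l))
		return -EBUSY;	/* nothing to undo yet */

	if (filtered) {
		rc = -EOPNOTSUPP;
		goto unlock;
	}

	if (type == DEMO_TYPE_PRIVATE) {
		/* Set rc and jump; a bare return would leak the lock. */
		rc = -EBUSY;
		goto unlock;
	}

	/* ... poison marking and process killing would go here ... */
unlock:
	demo_unlock_page(l);
	return rc;
}
```

With this shape, the lock is released on every path that acquired it,
whichever error code is returned.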

> + }
> +
> + /*
> + * Use this flag as an indication that the dax page has been
> + * remapped UC to prevent speculative consumption of poison.
> + */
> + SetPageHWPoison(page);
> +
> + /*
> + * Unlike System-RAM there is no possibility to swap in a
> + * different physical page at a given virtual address, so all
> + * userspace consumption of ZONE_DEVICE memory necessitates
> + * SIGBUS (i.e. MF_MUST_KILL)
> + */
> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> + collect_procs(page, &to_kill, true);
> +
> + unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
> +unlock:
> + dax_unlock_page(page, cookie);
> + return rc;
> +}
> +
> /*
> * Called from hugetlb code with hugetlb_lock held.
> *
> @@ -1644,12 +1728,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> struct dev_pagemap *pgmap)
> {
> struct page *page = pfn_to_page(pfn);
> - unsigned long size = 0;
> - struct to_kill *tk;
> LIST_HEAD(tokill);

Is this variable unused in this function?

Thanks,
Naoya Horiguchi

2022-04-22 22:42:16

by Dave Chinner

Subject: Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

Hi Ruan,

On Tue, Apr 19, 2022 at 12:50:38PM +0800, Shiyang Ruan wrote:
> This patchset is aimed to support shared pages tracking for fsdax.

Now that this is largely reviewed, it's time to work out the
logistics of merging it.

> Changes since V12:
> - Rebased onto next-20220414

What does this depend on that is in the linux-next kernel?

i.e. can this be applied successfully to a v5.18-rc2 kernel without
needing to drag in any other patchsets/commits/trees?

What are your plans for the followup patches that enable
reflink+fsdax in XFS? AFAICT that patchset hasn't been posted for a
while so I don't know what its status is. Is that patchset anywhere
near ready for merge in this cycle?

If that patchset is not a candidate for this cycle, then it largely
doesn't matter what tree this is merged through as there shouldn't
be any major XFS or dax dependencies being built on top of it during
this cycle. The filesystem side changes are isolated and won't
conflict with other work in XFS, either, so this could easily go
through Dan's tree.

However, if the reflink enablement is ready to go, then this all
needs to be in the XFS tree so that we can run it through filesystem
level DAX+reflink testing. That will mean we need this in a stable
shared topic branch and tighter co-ordination between the trees.

So before we go any further we need to know if the dax+reflink
enablement patchset is near being ready to merge....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2022-04-22 22:52:15

by Christoph Hellwig

Subject: Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> Sure, I'm not a maintainer and just the stand-in patch shepherd for
> a single release. However, being unable to cleanly merge code we
> need integrated into our local subsystem tree for integration
> testing because a patch dependency with another subsystem won't gain
> a stable commit ID until the next merge window is .... distinctly
> suboptimal.

Yes. Which is why we've taken a lot of mm patches through other trees,
sometimes specially crafted for that. So I guess in this case we'll
just need to take non-trivial dependencies into the XFS tree, and just
deal with small merge conflicts for the trivial ones.

2022-04-22 23:19:13

by Dan Williams

Subject: Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

On Thu, Apr 21, 2022 at 12:47 AM Dave Chinner <[email protected]> wrote:
>
> On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> > On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> > > Sure, I'm not a maintainer and just the stand-in patch shepherd for
> > > a single release. However, being unable to cleanly merge code we
> > > need integrated into our local subsystem tree for integration
> > > testing because a patch dependency with another subsystem won't gain
> > > a stable commit ID until the next merge window is .... distinctly
> > > suboptimal.
> >
> > Yes. Which is why we've taken a lot of mm patches through other trees,
> > sometimes specially crafted for that. So I guess in this case we'll
> > just need to take non-trivial dependencies into the XFS tree, and just
> > deal with small merge conflicts for the trivial ones.
>
> OK. As Naoya has pointed out, the first dependency/conflict Ruan has
> listed looks trivial to resolve.
>
> The second dependency, OTOH, is on a new function added in the patch
> pointed to. That said, at first glance it looks to be independent of
> the first two patches in that series so I might just be able to pull
> that one patch in and have that leave us with a working
> fsdax+reflink tree.
>
> Regardless, I'll wait to see how much work the updated XFS/DAX
> reflink enablement patchset still requires when Ruan posts it before
> deciding what to do here. If it isn't going to be a merge
> candidate, what to do with this patchset is moot because there's
> little to test without reflink enabled...

I do have a use case for this work absent the reflink work. Recall we
had a conversation about how to communicate "dax-device has been
ripped away from the fs" events and we ended up on the idea of reusing
->notify_failure(), but with the device's entire logical address range
as the notification span. That will let me unwind and delete the
PTE_DEVMAP infrastructure for taking extra device references to hold
off device-removal. Instead ->notify_failure() arranges for all active
DAX mappings to be invalidated and allow the removal to proceed
especially since physical removal does not care about software pins.
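
As a rough illustration of that idea, the userspace sketch below models
a holder callback that is invoked with offset 0 and the full device
length when the device goes away. Every demo_* name, the flag value, and
the struct layout are assumptions made for the example; only the
->notify_failure() hook itself comes from this series, and the real
prototype may differ.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_dax_dev;

/* Hypothetical holder operations, loosely modeled on the series. */
struct demo_holder_ops {
	int (*notify_failure)(struct demo_dax_dev *dev, uint64_t offset,
			      uint64_t len, int flags);
};

/* Hypothetical device: size in bytes, plus the registered holder. */
struct demo_dax_dev {
	uint64_t size;
	void *holder;
	const struct demo_holder_ops *ops;
};

/* Made-up flag marking a removal event rather than a media error. */
#define DEMO_FLAG_REMOVAL 0x1

/* Forward a failure range to the holder, if one is registered. */
static int demo_notify(struct demo_dax_dev *dev, uint64_t offset,
		       uint64_t len, int flags)
{
	if (!dev->ops || !dev->ops->notify_failure)
		return -EOPNOTSUPP;
	return dev->ops->notify_failure(dev, offset, len, flags);
}

/* Removal is reported as a failure spanning the whole device. */
static int demo_notify_removal(struct demo_dax_dev *dev)
{
	return demo_notify(dev, 0, dev->size, DEMO_FLAG_REMOVAL);
}

/* Example holder state: remembers the last range it was told about. */
struct demo_fs {
	uint64_t last_offset;
	uint64_t last_len;
	int last_flags;
};

/* Example holder callback; a filesystem would invalidate mappings here. */
static int demo_fs_notify_failure(struct demo_dax_dev *dev, uint64_t offset,
				  uint64_t len, int flags)
{
	struct demo_fs *fs = dev->holder;

	fs->last_offset = offset;
	fs->last_len = len;
	fs->last_flags = flags;
	return 0;
}
```

The point of the sketch is only that the same callback serves both
cases: a media error passes a small range, while removal passes the
device's entire range.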

2022-04-23 00:04:06

by Dave Chinner

Subject: Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

On Fri, Apr 22, 2022 at 02:27:32PM -0700, Dan Williams wrote:
> On Thu, Apr 21, 2022 at 12:47 AM Dave Chinner <[email protected]> wrote:
> >
> > On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> > > On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> > > > Sure, I'm not a maintainer and just the stand-in patch shepherd for
> > > > a single release. However, being unable to cleanly merge code we
> > > > need integrated into our local subsystem tree for integration
> > > > testing because a patch dependency with another subsystem won't gain
> > > > a stable commit ID until the next merge window is .... distinctly
> > > > suboptimal.
> > >
> > > Yes. Which is why we've taken a lot of mm patches through other trees,
> > > sometimes specially crafted for that. So I guess in this case we'll
> > > just need to take non-trivial dependencies into the XFS tree, and just
> > > deal with small merge conflicts for the trivial ones.
> >
> > OK. As Naoya has pointed out, the first dependency/conflict Ruan has
> > listed looks trivial to resolve.
> >
> > The second dependency, OTOH, is on a new function added in the patch
> > pointed to. That said, at first glance it looks to be independent of
> > the first two patches in that series so I might just be able to pull
> > that one patch in and have that leave us with a working
> > fsdax+reflink tree.
> >
> > Regardless, I'll wait to see how much work the updated XFS/DAX
> > reflink enablement patchset still requires when Ruan posts it before
> > deciding what to do here. If it isn't going to be a merge
> > candidate, what to do with this patchset is moot because there's
> > little to test without reflink enabled...
>
> I do have a use case for this work absent the reflink work. Recall we
> had a conversation about how to communicate "dax-device has been
> ripped away from the fs" events and we ended up on the idea of reusing
> ->notify_failure(), but with the device's entire logical address range
> as the notification span. That will let me unwind and delete the
> PTE_DEVMAP infrastructure for taking extra device references to hold
> off device-removal. Instead ->notify_failure() arranges for all active
> DAX mappings to be invalidated and allow the removal to proceed
> especially since physical removal does not care about software pins.

Sure. My point is that if the reflink enablement isn't ready to go,
then from an XFS POV none of this matters in this cycle and we can
just leave the dependencies to commit via Andrew's tree. Hence by
the time we get to the reflink enablement all the prior dependencies
will have been merged and have stable commit IDs, and we can just
stage this series and the reflink enablement as we normally would in
the next cycle.

However, if we don't get the XFS reflink dax enablement sorted out
in the next week or two, then we don't need this patchset in this
cycle. Hence if you still need this patchset for other code you need
to merge in this cycle, then you're the poor schmuck that has to run
the mm-tree conflict gauntlet to get a stable commit ID for the
dependent patches in this cycle, not me....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2022-04-24 09:11:27

by Dan Williams

Subject: Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

On Fri, Apr 22, 2022 at 5:02 PM Dave Chinner <[email protected]> wrote:
>
> On Fri, Apr 22, 2022 at 02:27:32PM -0700, Dan Williams wrote:
> > On Thu, Apr 21, 2022 at 12:47 AM Dave Chinner <[email protected]> wrote:
> > >
> > > On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> > > > On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> > > > > Sure, I'm not a maintainer and just the stand-in patch shepherd for
> > > > > a single release. However, being unable to cleanly merge code we
> > > > > need integrated into our local subsystem tree for integration
> > > > > testing because a patch dependency with another subsystem won't gain
> > > > > a stable commit ID until the next merge window is .... distinctly
> > > > > suboptimal.
> > > >
> > > > Yes. Which is why we've taken a lot of mm patches through other trees,
> > > > sometimes specially crafted for that. So I guess in this case we'll
> > > > just need to take non-trivial dependencies into the XFS tree, and just
> > > > deal with small merge conflicts for the trivial ones.
> > >
> > > OK. As Naoya has pointed out, the first dependency/conflict Ruan has
> > > listed looks trivial to resolve.
> > >
> > > The second dependency, OTOH, is on a new function added in the patch
> > > pointed to. That said, at first glance it looks to be independent of
> > > the first two patches in that series so I might just be able to pull
> > > that one patch in and have that leave us with a working
> > > fsdax+reflink tree.
> > >
> > > Regardless, I'll wait to see how much work the updated XFS/DAX
> > > reflink enablement patchset still requires when Ruan posts it before
> > > deciding what to do here. If it isn't going to be a merge
> > > candidate, what to do with this patchset is moot because there's
> > > little to test without reflink enabled...
> >
> > I do have a use case for this work absent the reflink work. Recall we
> > had a conversation about how to communicate "dax-device has been
> > ripped away from the fs" events and we ended up on the idea of reusing
> > ->notify_failure(), but with the device's entire logical address range
> > as the notification span. That will let me unwind and delete the
> > PTE_DEVMAP infrastructure for taking extra device references to hold
> > off device-removal. Instead ->notify_failure() arranges for all active
> > DAX mappings to be invalidated and allow the removal to proceed
> > especially since physical removal does not care about software pins.
>
> Sure. My point is that if the reflink enablement isn't ready to go,
> then from an XFS POV none of this matters in this cycle and we can
> just leave the dependencies to commit via Andrew's tree. Hence by
> the time we get to the reflink enablement all the prior dependencies
> will have been merged and have stable commit IDs, and we can just
> stage this series and the reflink enablement as we normally would in
> the next cycle.
>
> However, if we don't get the XFS reflink dax enablement sorted out
> in the next week or two, then we don't need this patchset in this
> cycle. Hence if you still need this patchset for other code you need
> to merge in this cycle, then you're the poor schmuck that has to run
> the mm-tree conflict gauntlet to get a stable commit ID for the
> dependent patches in this cycle, not me....

Yup. Let's give it another week or so to see if the reflink rebase
materializes and go from there.