2023-04-28 00:43:31

by Jiaqi Yan

Subject: [RFC PATCH v1 0/7] PAGE_SIZE Unmapping in Memory Failure Recovery for HugeTLB Pages

Goal
====
Currently, once a byte in a HugeTLB hugepage becomes HWPOISON, the whole
hugepage is unmapped from the page table, because that is the finest
granularity of the mapping.

High granularity mapping (HGM) [1], the functionality to map memory
addresses at finer granularities (in the extreme case, PAGE_SIZE), was
recently proposed upstream and provides the opportunity to handle memory
errors more efficiently: instead of unmapping the whole hugepage, only the
raw HWPOISON subpage needs to be thrown away, while all the healthy
subpages remain available to users.

Idea
====
Today memory failure recovery for HugeTLB pages (hugepages) differs from
that for raw and THP pages. We are only interested in in-use hugepages,
which are handled in these simplified steps:
1. Increment the refcount on the compound head of the hugepage.
2. Insert the raw HWPOISON page to the compound head’s raw_hwp_list
(_hugetlb_hwpoison) if it is not already in the list.
3. Unmap the entire hugepage from HugeTLB’s page table.
4. Kill the processes that are accessing the poisoned hugepage.

HGM can greatly improve this recovery mechanism. Step #3 (unmapping the
entire hugepage) can be replaced by
3.1 Map the entire hugepage at finer granularity, so that the exact
HWPOISON address is mapped by a PAGE_SIZE PTE, and the rest of the
address range is optimally mapped by either smaller P*Ds or PTEs. In
other words, the original HugeTLB PTE is split into smaller P*Ds
and PTEs.
3.2 Only unmap the newly mapped PTE that maps the HWPOISON address.

For shared mappings, the current HGM patches are already a solid basis for
the splitting functionality in step #3.1. This RFC drafts a complete
solution for shared mappings. The splitting-based idea can be applied to
private mappings as well, but additional subtle complexity needs to be
dealt with. We defer the private mapping case as future work.
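
To make the intended behavior concrete, here is a minimal userspace sketch
(not part of this series; it mirrors what the selftest in patch 6 does). It
assumes CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING is enabled, a hugetlbfs
mount at the hypothetical path /mnt/hugetlbfs, a 2M hugepage size, and
CAP_SYS_ADMIN for MADV_HWPOISON; error handling is omitted. Today both
accesses at the end raise SIGBUS; with this series, only the access to the
poisoned raw page does:

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
        /* BUS_MCEERR_AR is expected only for the poisoned raw page. */
        printf("SIGBUS: si_code=%d addr=%p\n", info->si_code, info->si_addr);
        _exit(1);
}

int main(void)
{
        const size_t len = 2UL << 20;   /* one 2M hugepage (size is arbitrary) */
        const long pagesz = sysconf(_SC_PAGESIZE);
        struct sigaction sa = { .sa_sigaction = sigbus_handler,
                                .sa_flags = SA_SIGINFO };
        /* Hypothetical hugetlbfs mount point; error handling omitted. */
        int fd = open("/mnt/hugetlbfs/test", O_CREAT | O_RDWR, 0600);
        char *map;

        sigaction(SIGBUS, &sa, NULL);
        ftruncate(fd, len);
        map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        memset(map, 1, len);            /* fault the whole hugepage in */

        /* Needs CAP_SYS_ADMIN: simulate a memory error on one raw page. */
        madvise(map + pagesz, pagesz, MADV_HWPOISON);

        /* Healthy raw pages stay mapped and keep their data ... */
        printf("healthy byte: %d\n", map[0]);
        /* ... while touching the poisoned raw page raises BUS_MCEERR_AR. */
        printf("poisoned byte: %d\n", map[pagesz]);
        return 0;
}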

Splitting HugeTLB PTEs (Step #3.1)
==================================
The general process of splitting a present leaf HugeTLB PTE is (see the
sketch at the end of this section):
1. Get and clear the original HugeTLB PTE old_pte.
2. Initialize curr to the start of the address range covered by old_pte.
3. Find the optimal level that curr should be mapped at.
4. Perform an HGM walk on curr with the optimal level found in step 3,
potentially allocating a new PTE at the optimal level.
5. Populate the newly allocated PTE with bits from old_pte, including
dirty, write, and UFFD_WP.
6. Update curr += the newly created PTE size, and repeat from step 3 until
the entire range covered by old_pte is split.

Splitting the hugepage mapping is not meaningful for none-mostly PTEs, i.e.
none PTEs and userfaultfd write protect (UFFD_WP) marker PTEs; those are
handled at page-fault time instead. Migration and HWPOISON PTEs are better
left untouched.
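
For reference, a condensed sketch of the splitting loop, taken from
hugetlb_split_to_shift()/hugetlb_split_to_shift_present_leaf() in patch 1;
locking, rmap/refcount bookkeeping and error handling are omitted, and the
local variable declarations are assumed:

        /* Step 1: detach the original mapping. */
        old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);

        /* Steps 2-6: rebuild [start, end) with finer-grained PTEs. */
        for (curr = start; curr < end; curr += sz) {
                /*
                 * Step 3: largest aligned size; desired_shift (PAGE_SHIFT
                 * for HWPOISON) at the poisoned address @addr.
                 */
                hugetlb_find_shift(vma, curr, end, addr, desired_shift, &shift);
                sz = 1UL << shift;

                /* Step 4: walk (and allocate) down to the chosen level. */
                hugetlb_full_walk_alloc(&new_hpte, vma, curr, sz);

                /* Step 5: copy dirty/write/UFFD_WP bits from old_entry. */
                subpage = hugetlb_find_subpage(h, folio, curr);
                new_entry = make_huge_pte(vma, subpage,
                                          huge_pte_write(old_entry), shift);
                if (huge_pte_dirty(old_entry))
                        new_entry = huge_pte_mkdirty(new_entry);
                if (huge_pte_uffd_wp(old_entry))
                        new_entry = huge_pte_mkuffd_wp(new_entry);
                set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
        }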

Memory Failure Recovery and Unmapping (Step #3.2)
=================================================
A few changes are made in memory_failure and rmap to only unmap raw
HWPOISON pages (see the sketch after this list):
1. As long as HGM is turned on in CONFIG, memory_failure attempts to enable
HGM on each VMA containing the poisoned hugepage.
2. memory_failure attempts to split the HugeTLB PTE so that the poisoned
address is mapped by a PAGE_SIZE PTE, for all the VMAs containing the
poisoned hugepage.
3. get_huge_page_for_hwpoison only returns -EHWPOISON if the raw page is
already in the compound head's raw_hwp_list. This makes unmapping work
correctly when multiple raw pages in the same hugepage become HWPOISON.
4. rmap utilizes the compound head's raw_hwp_list to 1) avoid unmapping raw
pages not in the list, and 2) keep track of whether the raw pages in the
list are already unmapped.
5. The page refcount check in me_huge_page is skipped.
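
For reference, the bookkeeping added in patch 4 boils down to the following
condensed sketch (split_ok is shorthand for hugetlb_split_to_shift()
returning 0, not a real variable in the patches):

        /* try_to_split_huge_mapping(): for each VMA that maps the folio */
        ++hwp_page->nr_vmas_mapped;
        if (split_ok)
                ++hwp_page->nr_expected_unmaps;

        /* try_to_unmap_one(): after hugetlb_remove_rmap() on the raw page */
        ++hwp_page->nr_actual_unmaps;

        /* is_unmapping_successful() */
        unmap_success = (hwp_page->nr_expected_unmaps ==
                         hwp_page->nr_actual_unmaps);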

Between mmap() and Page Fault
=============================
A memory error can occur between the time userspace maps a hugepage and
the time userspace faults in the mapped hugepage. The general idea is to
not create any raw-page-size page table entry for the HWPOISON memory,
while keeping memory in healthy raw pages available to userspace (via
normal fault handling). At the time of hugetlb_no_page (a condensed sketch
follows this list):
- If the entire hugepage doesn't contain any HWPOISON page, the normal
page fault handler continues.
- If the memory address being faulted is within a HWPOISON raw page,
hugetlb_no_page returns VM_FAULT_HWPOISON_LARGE (so that the page fault
handler sends a BUS_MCEERR_AR SIGBUS to the faulting process).
- If the memory address being faulted is within a healthy raw page,
hugetlb_no_page utilizes HGM to create a new HugeTLB PTE whose
hugetlb_pte_size is as large as possible while not mapping any HWPOISON
address. Then the normal page fault handler continues.
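
A condensed sketch of this check (hugetlb_no_page_hwpoison() in patch 5;
pte_range_covers_hwpoison() is shorthand for the raw_hwp_list scan in the
real code, and VM_FAULT_SET_HINDEX is omitted):

        if (likely(!folio_test_hwpoison(folio)))
                return 0;                       /* no HWPOISON: normal fault */

        if (hugetlb_enable_hgm_vma(vma))
                return VM_FAULT_HWPOISON_LARGE; /* no HGM: old behavior */

        while (pte_range_covers_hwpoison(folio, hpte, address)) {
                if (hugetlb_pte_size(hpte) == PAGE_SIZE)
                        return VM_FAULT_HWPOISON; /* faulting raw page poisoned */
                /* Split one level down and re-check the smaller range. */
                if (hugetlb_full_walk_alloc(hpte, vma, address,
                                            hgm_next_size(vma, hpte)))
                        return VM_FAULT_HWPOISON_LARGE;
        }
        return 0;                               /* healthy raw page: continue */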

Failure Handling
================
- If the kernel still fails to allocate a new raw_hwp_page after a retry,
memory_failure returns MF_IGNORED with MF_MSG_UNKNOWN.
- For each VMA that maps the HWPOISON hugepage:
  - If the VMA is not eligible for HGM, the old behavior is taken: unmap
    the entire hugepage from that VMA.
  - If memory_failure fails to enable HGM on the VMA, or fails to split
    the HugeTLB PTE in any VMA that maps the HWPOISON page, the recovery
    returns MF_IGNORED with MF_MSG_UNMAP_FAILED.
  - For a particular VMA, if splitting the HugeTLB PTE fails, the original
    PTE is restored to the page table.

Code Changes
============
The code patches in this RFC are based on the HGM patchset v2 [1] and are
composed of two parts. The first part implements the idea laid out in this
cover letter; the second part tests two major scenarios: HWPOISON on
already-faulted pages and HWPOISON between mmap() and page fault.

Future Changes
==============
There is a pending improvement to hugetlbfs_read_iter. If a hugepage is
found in the page cache and it contains HWPOISON subpages, today the kernel
returns -EIO immediately. With the new split-then-unmap behavior, the
kernel can return to userspace every byte up to the first raw HWPOISON
byte. If userspace asks the read to start within a raw HWPOISON page, the
kernel still has to return -EIO. This improvement and its selftest will be
done in a future patch series.
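
As a sketch of how userspace could consume that behavior once it lands
(assuming only the read semantics described above; not part of this
series):

#include <errno.h>
#include <unistd.h>

/*
 * Sketch only: read from a hugetlbfs fd until EOF or the first HWPOISON
 * byte, assuming the future read semantics described above.
 */
static ssize_t read_until_poison(int fd, char *buf, size_t len)
{
        ssize_t total = 0;

        while ((size_t)total < len) {
                ssize_t n = pread(fd, buf + total, len - total, total);

                if (n > 0)
                        total += n;     /* healthy bytes */
                else if (n == 0)
                        break;          /* EOF */
                else if (errno == EIO)
                        break;          /* next byte is in a HWPOISON page */
                else
                        return -1;      /* unrelated error */
        }
        return total;
}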

[1] https://lore.kernel.org/all/[email protected]/

Jiaqi Yan (7):
hugetlb: add HugeTLB splitting functionality
hugetlb: create PTE level mapping when possible
mm: publish raw_hwp_page in mm.h
mm/memory_failure: unmap raw HWPoison PTEs when possible
hugetlb: only VM_FAULT_HWPOISON_LARGE raw page
selftest/mm: test PAGESIZE unmapping HWPOISON pages
selftest/mm: test PAGESIZE unmapping UFFD WP marker HWPOISON pages

include/linux/hugetlb.h | 14 +
include/linux/mm.h | 36 ++
mm/hugetlb.c | 405 ++++++++++++++++++++++-
mm/memory-failure.c | 206 ++++++++++--
mm/rmap.c | 38 ++-
tools/testing/selftests/mm/hugetlb-hgm.c | 364 ++++++++++++++++++--
6 files changed, 1004 insertions(+), 59 deletions(-)

--
2.40.1.495.gc816e09b53d-goog


2023-04-28 00:43:38

by Jiaqi Yan

Subject: [RFC PATCH v1 2/7] hugetlb: create PTE level mapping when possible

In memory_failure handling, for each VMA that the HWPOISON HugeTLB
page is mapped into, enable HGM if eligible, then split the P*D-mapped
hugepage into smaller PTEs. try_to_unmap still unmaps the entire hugetlb
page, one PTE at a time, at levels smaller than the original P*D.
For example, if a hugepage was originally mapped at PUD size, it will
be split into PMDs and PTEs, and all of these PMDs and PTEs will
be unmapped. The next commit will only unmap the raw HWPOISON PTE.

For VMAs that are not HGM eligible, or where enabling HGM or splitting
the hugepage mapping failed, the hugepage stays mapped by its original
P*D and is unmapped at that P*D.

Signed-off-by: Jiaqi Yan <[email protected]>
---
include/linux/hugetlb.h | 5 +++
mm/hugetlb.c | 27 ++++++++++++++++
mm/memory-failure.c | 68 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 100 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d44bf6a794e5..03074b23c396 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1266,6 +1266,7 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
unsigned long end);
int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
unsigned long end);
+int hugetlb_enable_hgm_vma(struct vm_area_struct *vma);
int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
struct hugetlb_pte *hpte, unsigned long addr,
unsigned int desired_shift);
@@ -1295,6 +1296,10 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
{
return -EINVAL;
}
+int hugetlb_enable_hgm_vma(struct vm_area_struct *vma)
+{
+ return -EINVAL;
+}
int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
const struct hugetlb_pte *hpte, unsigned long addr,
unsigned int desired_shift)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d3f3f1c2d293..1419176b7e51 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -8203,6 +8203,33 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
return ret;
}

+int hugetlb_enable_hgm_vma(struct vm_area_struct *vma)
+{
+ if (hugetlb_hgm_enabled(vma))
+ return 0;
+
+ if (!is_vm_hugetlb_page(vma)) {
+ pr_warn("VMA=[%#lx, %#lx) is not HugeTLB\n",
+ vma->vm_start, vma->vm_end);
+ return -EINVAL;
+ }
+
+ if (!hugetlb_hgm_eligible(vma)) {
+ pr_warn("VMA=[%#lx, %#lx) is not HGM eligible\n",
+ vma->vm_start, vma->vm_end);
+ return -EINVAL;
+ }
+
+ hugetlb_unshare_all_pmds(vma);
+
+ /*
+ * TODO: add the ability to tell if HGM is enabled by kernel
+ * (for HWPOISON unmapping) or by userspace (via MADV_SPLIT).
+ */
+ vm_flags_set(vma, VM_HUGETLB_HGM);
+ return 0;
+}
+
/*
* Find the optimal HugeTLB PTE shift that @desired_addr could be mapped at.
*/
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0b37cbc6e8ae..eb5579b6787e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1479,6 +1479,73 @@ static int get_hwpoison_page(struct page *p, unsigned long flags)
return ret;
}

+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+/*
+ * For each HGM-eligible VMA that the poisoned page mapped to, create new
+ * HGM mapping for hugepage @folio and make sure @poisoned_page is mapped
+ * by a PAGESIZE level PTE. Caller (hwpoison_user_mappings) must ensure
+ * 1. folio's address space (mapping) is locked in write mode.
+ * 2. folio is locked.
+ */
+static void try_to_split_huge_mapping(struct folio *folio,
+ struct page *poisoned_page)
+{
+ struct address_space *mapping = folio_mapping(folio);
+ pgoff_t pgoff_start;
+ pgoff_t pgoff_end;
+ struct vm_area_struct *vma;
+ unsigned long poisoned_addr;
+ unsigned long head_addr;
+ struct hugetlb_pte hpte;
+
+ if (WARN_ON(!mapping))
+ return;
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+ pgoff_start = folio_pgoff(folio);
+ pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
+
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
+ /* Enable HGM on HGM-eligible VMAs. */
+ if (!hugetlb_hgm_eligible(vma))
+ continue;
+
+ i_mmap_assert_locked(vma->vm_file->f_mapping);
+ if (hugetlb_enable_hgm_vma(vma)) {
+ pr_err("Failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
+ vma->vm_start, vma->vm_end);
+ continue;
+ }
+
+ poisoned_addr = vma_address(poisoned_page, vma);
+ head_addr = vma_address(folio_page(folio, 0), vma);
+ /*
+ * Get the hugetlb_pte of the PUD-mapped hugepage first,
+ * then split the PUD entry into PMD + PTE entries.
+ *
+ * Both getting original huge PTE and splitting requires write
+ * lock on vma->vm_file->f_mapping, which caller
+ * (e.g. hwpoison_user_mappings) should have already acquired.
+ */
+ if (hugetlb_full_walk(&hpte, vma, head_addr))
+ continue;
+
+ if (hugetlb_split_to_shift(vma->vm_mm, vma, &hpte,
+ poisoned_addr, PAGE_SHIFT)) {
+ pr_err("Failed to split huge mapping: pfn=%#lx, vaddr=%#lx in VMA=[%#lx, %#lx)\n",
+ page_to_pfn(poisoned_page), poisoned_addr,
+ vma->vm_start, vma->vm_end);
+ }
+ }
+}
+#else
+static void try_to_split_huge_mapping(struct folio *folio,
+ struct page *poisoned_page)
+{
+}
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+
/*
* Do all that is necessary to remove user space mappings. Unmap
* the pages and send SIGBUS to the processes if the data was dirty.
@@ -1555,6 +1622,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
*/
mapping = hugetlb_page_mapping_lock_write(hpage);
if (mapping) {
+ try_to_split_huge_mapping(folio, p);
try_to_unmap(folio, ttu|TTU_RMAP_LOCKED);
i_mmap_unlock_write(mapping);
} else
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:43:43

by Jiaqi Yan

Subject: [RFC PATCH v1 3/7] mm: publish raw_hwp_page in mm.h

raw_hwp_page will be needed by HugeTLB to determine if a raw subpage
in a hugepage is poisoned, and therefore should be unmapped or should
not be faulted in at the PAGE_SIZE PTE level.

Signed-off-by: Jiaqi Yan <[email protected]>
---
include/linux/mm.h | 16 ++++++++++++++++
mm/memory-failure.c | 13 -------------
2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9d3216b4284a..4496d7bdd3ea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3522,6 +3522,22 @@ enum mf_action_page_type {
*/
extern const struct attribute_group memory_failure_attr_group;

+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Struct raw_hwp_page represents information about "raw error page",
+ * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
+ */
+struct raw_hwp_page {
+ struct llist_node node;
+ struct page *page;
+};
+
+static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
+{
+ return (struct llist_head *)&folio->_hugetlb_hwpoison;
+}
+#endif
+
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
extern void clear_huge_page(struct page *page,
unsigned long addr_hint,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index eb5579b6787e..48e62d04af17 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1826,19 +1826,6 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
#endif /* CONFIG_FS_DAX */

#ifdef CONFIG_HUGETLB_PAGE
-/*
- * Struct raw_hwp_page represents information about "raw error page",
- * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
- */
-struct raw_hwp_page {
- struct llist_node node;
- struct page *page;
-};
-
-static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
-{
- return (struct llist_head *)&folio->_hugetlb_hwpoison;
-}

static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
{
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:44:05

by Jiaqi Yan

Subject: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible

When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps
the raw HWPOISON page (previously split and mapped at PTE size).
If HGM failed to be enabled on an eligible VMA or splitting failed,
try_to_unmap_one fails.

For VMAs that are not HGM eligible, try_to_unmap_one still unmaps
the whole P*D.

When only the raw HWPOISON subpage is unmapped and the other subpages
remain mapped, the old way memory_failure checks whether unmapping
succeeded no longer works. So introduce is_unmapping_successful() to
cover both the existing and the new unmapping behavior.

For the new unmapping behavior, store how many times a raw HWPOISON page
is expected to be unmapped, and how many times it is actually unmapped
in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped
from a VMA if splitting succeeded in try_to_split_huge_mapping(), so
unmap_success = (nr_expected_unmaps == nr_actual_unmaps).

The old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any
raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt
recovery actions again, because recovery used to be done on the entire
hugepage. With the new unmapping behavior, this no longer holds. More
subpages in the hugepage can become corrupted and need to be recovered
(i.e. unmapped) individually. The new folio_set_hugetlb_hwpoison returns
0 after adding a new raw subpage to raw_hwp_list.

Unmapping a raw HWPOISON page requires successfully allocating a
raw_hwp_page in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb
may now fail due to OOM.

Signed-off-by: Jiaqi Yan <[email protected]>
---
include/linux/mm.h | 20 ++++++-
mm/memory-failure.c | 140 ++++++++++++++++++++++++++++++++++++++------
mm/rmap.c | 38 +++++++++++-
3 files changed, 175 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4496d7bdd3ea..dc192f98cb1d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3522,20 +3522,38 @@ enum mf_action_page_type {
*/
extern const struct attribute_group memory_failure_attr_group;

-#ifdef CONFIG_HUGETLB_PAGE
/*
* Struct raw_hwp_page represents information about "raw error page",
* constructing singly linked list from ->_hugetlb_hwpoison field of folio.
+ * @node: the node in folio->_hugetlb_hwpoison list.
+ * @page: the raw HWPOISON page struct.
+ * @nr_vmas_mapped: the number of VMAs that map @page when detected.
+ * @nr_expected_unmaps: if a VMA that maps @page when detected is eligible
+ * for high granularity mapping, @page is expected to be unmapped.
+ * @nr_actual_unmaps: how many times the raw page is actually unmapped.
*/
struct raw_hwp_page {
struct llist_node node;
struct page *page;
+ int nr_vmas_mapped;
+ int nr_expected_unmaps;
+ int nr_actual_unmaps;
};

+#ifdef CONFIG_HUGETLB_PAGE
static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
{
return (struct llist_head *)&folio->_hugetlb_hwpoison;
}
+
+struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
+ struct page *subpage);
+#else
+static inline struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
+ struct page *subpage)
+{
+ return NULL;
+}
#endif

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 48e62d04af17..47b935918ceb 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1120,10 +1120,10 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p)
}

/*
- * Huge pages. Needs work.
- * Issues:
- * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
- * To narrow down kill region to one page, we need to break up pmd.
+ * Huge pages.
+ * - Without HGM: Error on hugepage is contained in hugepage unit (not in
+ * raw page unit).
+ * - With HGM: Kill region is narrowed down to just one page.
*/
static int me_huge_page(struct page_state *ps, struct page *p)
{
@@ -1131,6 +1131,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
struct page *hpage = compound_head(p);
struct address_space *mapping;
bool extra_pins = false;
+ struct raw_hwp_page *hwp_page = find_in_raw_hwp_list(page_folio(p), p);

if (!PageHuge(hpage))
return MF_DELAYED;
@@ -1157,7 +1158,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
}
}

- if (has_extra_refcount(ps, p, extra_pins))
+ if (hwp_page->nr_expected_unmaps == 0 &&
+ has_extra_refcount(ps, p, extra_pins))
res = MF_FAILED;

return res;
@@ -1497,24 +1499,30 @@ static void try_to_split_huge_mapping(struct folio *folio,
unsigned long poisoned_addr;
unsigned long head_addr;
struct hugetlb_pte hpte;
+ struct raw_hwp_page *hwp_page = NULL;

if (WARN_ON(!mapping))
return;

VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

+ hwp_page = find_in_raw_hwp_list(folio, poisoned_page);
+ VM_BUG_ON_PAGE(!hwp_page, poisoned_page);
+
pgoff_start = folio_pgoff(folio);
pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;

vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
+ ++hwp_page->nr_vmas_mapped;
+
/* Enable HGM on HGM-eligible VMAs. */
if (!hugetlb_hgm_eligible(vma))
continue;

i_mmap_assert_locked(vma->vm_file->f_mapping);
if (hugetlb_enable_hgm_vma(vma)) {
- pr_err("Failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
- vma->vm_start, vma->vm_end);
+ pr_err("%#lx: failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
+ page_to_pfn(poisoned_page), vma->vm_start, vma->vm_end);
continue;
}

@@ -1528,15 +1536,21 @@ static void try_to_split_huge_mapping(struct folio *folio,
* lock on vma->vm_file->f_mapping, which caller
* (e.g. hwpoison_user_mappings) should have already acquired.
*/
- if (hugetlb_full_walk(&hpte, vma, head_addr))
+ if (hugetlb_full_walk(&hpte, vma, head_addr)) {
+ pr_err("%#lx: failed to PT-walk with HGM on eligible VMA=[%#lx, %#lx)\n",
+ page_to_pfn(poisoned_page), vma->vm_start, vma->vm_end);
continue;
+ }

if (hugetlb_split_to_shift(vma->vm_mm, vma, &hpte,
poisoned_addr, PAGE_SHIFT)) {
- pr_err("Failed to split huge mapping: pfn=%#lx, vaddr=%#lx in VMA=[%#lx, %#lx)\n",
+ pr_err("%#lx: Failed to split huge mapping: vaddr=%#lx in VMA=[%#lx, %#lx)\n",
page_to_pfn(poisoned_page), poisoned_addr,
vma->vm_start, vma->vm_end);
+ continue;
}
+
+ ++hwp_page->nr_expected_unmaps;
}
}
#else
@@ -1546,6 +1560,47 @@ static void try_to_split_huge_mapping(struct folio *folio,
}
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

+static bool is_unmapping_successful(struct folio *folio,
+ struct page *poisoned_page)
+{
+ bool unmap_success = false;
+ struct raw_hwp_page *hwp_page = find_in_raw_hwp_list(folio, poisoned_page);
+
+ if (!folio_test_hugetlb(folio) ||
+ folio_test_anon(folio) ||
+ !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING)) {
+ unmap_success = folio_mapped(folio);
+ if (!unmap_success)
+ pr_err("%#lx: failed to unmap page (mapcount=%d)\n",
+ page_to_pfn(poisoned_page),
+ page_mapcount(folio_page(folio, 0)));
+
+ return unmap_success;
+ }
+
+ VM_BUG_ON_PAGE(!hwp_page, poisoned_page);
+
+ /*
+ * Unmapping may not happen for some VMAs:
+ * - HGM-eligible VMA but @poisoned_page is not faulted yet: nothing
+ * needs to be done until page fault handling.
+ * - HGM-non-eligible VMA: mapcount decreases by nr_subpages for each VMA,
+ * but that is not tracked, so we cannot tell if unmapping from it succeeded.
+ */
+ if (hwp_page->nr_vmas_mapped != hwp_page->nr_expected_unmaps)
+ pr_info("%#lx: mapped by %d VMAs but %d unmappings are expected\n",
+ page_to_pfn(poisoned_page), hwp_page->nr_vmas_mapped,
+ hwp_page->nr_expected_unmaps);
+
+ unmap_success = hwp_page->nr_expected_unmaps == hwp_page->nr_actual_unmaps;
+
+ if (!unmap_success)
+ pr_err("%#lx: failed to unmap page (folio_mapcount=%d)\n",
+ page_to_pfn(poisoned_page), folio_mapcount(folio));
+
+ return unmap_success;
+}
+
/*
* Do all that is necessary to remove user space mappings. Unmap
* the pages and send SIGBUS to the processes if the data was dirty.
@@ -1631,10 +1686,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
try_to_unmap(folio, ttu);
}

- unmap_success = !page_mapped(hpage);
- if (!unmap_success)
- pr_err("%#lx: failed to unmap page (mapcount=%d)\n",
- pfn, page_mapcount(hpage));
+ unmap_success = is_unmapping_successful(folio, p);

/*
* try_to_unmap() might put mlocked page in lru cache, so call
@@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);

#ifdef CONFIG_HUGETLB_PAGE

+/*
+ * Given a HWPOISON @subpage as raw page, find its location in @folio's
+ * _hugetlb_hwpoison. Return NULL if @subpage is not in the list.
+ */
+struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
+ struct page *subpage)
+{
+ struct llist_node *t, *tnode;
+ struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
+ struct raw_hwp_page *hwp_page = NULL;
+ struct raw_hwp_page *p;
+
+ VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage);
+
+ llist_for_each_safe(tnode, t, raw_hwp_head->first) {
+ p = container_of(tnode, struct raw_hwp_page, node);
+ if (subpage == p->page) {
+ hwp_page = p;
+ break;
+ }
+ }
+
+ return hwp_page;
+}
+
static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
{
struct llist_head *head;
@@ -1837,6 +1914,9 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
llist_for_each_safe(tnode, t, head->first) {
struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);

+ /* Ideally raw HWPoison pages are fully unmapped if possible. */
+ WARN_ON(p->nr_expected_unmaps != p->nr_actual_unmaps);
+
if (move_flag)
SetPageHWPoison(p->page);
else
@@ -1853,7 +1933,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
struct llist_head *head;
struct raw_hwp_page *raw_hwp;
struct llist_node *t, *tnode;
- int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
+ bool has_hwpoison = folio_test_set_hwpoison(folio);
+ bool hgm_enabled = IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING);

/*
* Once the hwpoison hugepage has lost reliable raw error info,
@@ -1873,9 +1954,20 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_ATOMIC);
if (raw_hwp) {
raw_hwp->page = page;
+ raw_hwp->nr_vmas_mapped = 0;
+ raw_hwp->nr_expected_unmaps = 0;
+ raw_hwp->nr_actual_unmaps = 0;
llist_add(&raw_hwp->node, head);
+ if (hgm_enabled)
+ /*
+ * A new raw poisoned page. Don't return
+ * HWPOISON. Error event will be counted
+ * in action_result().
+ */
+ return 0;
+
/* the first error event will be counted in action_result(). */
- if (ret)
+ if (has_hwpoison)
num_poisoned_pages_inc(page_to_pfn(page));
} else {
/*
@@ -1889,8 +1981,16 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
* used any more, so free it.
*/
__folio_free_raw_hwp(folio, false);
+
+ /*
+ * HGM relies on raw_hwp allocated and inserted to raw_hwp_list.
+ */
+ if (hgm_enabled)
+ return -ENOMEM;
}
- return ret;
+
+ BUG_ON(hgm_enabled);
+ return has_hwpoison ? -EHWPOISON : 0;
}

static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
@@ -1936,6 +2036,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
struct page *page = pfn_to_page(pfn);
struct folio *folio = page_folio(page);
int ret = 2; /* fallback to normal page handling */
+ int set_page_hwpoison = 0;
bool count_increased = false;

if (!folio_test_hugetlb(folio))
@@ -1956,8 +2057,9 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
goto out;
}

- if (folio_set_hugetlb_hwpoison(folio, page)) {
- ret = -EHWPOISON;
+ set_page_hwpoison = folio_set_hugetlb_hwpoison(folio, page);
+ if (set_page_hwpoison) {
+ ret = set_page_hwpoison;
goto out;
}

@@ -2004,7 +2106,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
res = kill_accessing_process(current, folio_pfn(folio), flags);
}
return res;
- } else if (res == -EBUSY) {
+ } else if (res == -EBUSY || res == -ENOMEM) {
if (!(flags & MF_NO_RETRY)) {
flags |= MF_NO_RETRY;
goto retry;
diff --git a/mm/rmap.c b/mm/rmap.c
index d3bc81466902..4cfaa34b001e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1453,6 +1453,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
bool page_poisoned;
+ bool hgm_eligible = hugetlb_hgm_eligible(vma);
+ struct raw_hwp_page *hwp_page;

/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -1525,6 +1527,29 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* in the case where the hugetlb page is poisoned.
*/
VM_BUG_ON_FOLIO(!page_poisoned, folio);
+
+ /*
+ * When VMA is not HGM eligible, unmap at hugepage's
+ * original P*D.
+ *
+ * When HGM is eligible:
+ * - if original P*D is split to smaller P*Ds and
+ * PTEs, we skip subpage if it is not raw HWPoison
+ * page, or it was but was already unmapped.
+ * - if original P*D is not split, skip unmapping
+ * and memory_failure result will be MF_IGNORED.
+ */
+ if (hgm_eligible) {
+ if (pvmw.pte_order > 0)
+ continue;
+ hwp_page = find_in_raw_hwp_list(folio, subpage);
+ if (hwp_page == NULL)
+ continue;
+ if (hwp_page->nr_expected_unmaps ==
+ hwp_page->nr_actual_unmaps)
+ continue;
+ }
+
/*
* huge_pmd_unshare may unmap an entire PMD page.
* There is no way of knowing exactly which PMDs may
@@ -1760,12 +1785,19 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*
* See Documentation/mm/mmu_notifier.rst
*/
- if (folio_test_hugetlb(folio))
+ if (!folio_test_hugetlb(folio))
+ page_remove_rmap(subpage, vma, false);
+ else {
hugetlb_remove_rmap(subpage,
pvmw.pte_order + PAGE_SHIFT,
hstate_vma(vma), vma);
- else
- page_remove_rmap(subpage, vma, false);
+ if (hgm_eligible) {
+ VM_BUG_ON_FOLIO(pvmw.pte_order > 0, folio);
+ VM_BUG_ON_FOLIO(!hwp_page, folio);
+ VM_BUG_ON_FOLIO(subpage != hwp_page->page, folio);
+ ++hwp_page->nr_actual_unmaps;
+ }
+ }

if (vma->vm_flags & VM_LOCKED)
mlock_drain_local();
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:44:16

by Jiaqi Yan

Subject: [RFC PATCH v1 5/7] hugetlb: only VM_FAULT_HWPOISON_LARGE raw page

Memory raw pages can become HWPOISON between when userspace maps
a hugepage and when userspace faults in the hugepage.

Today when hugetlb faults somewhere in a hugepage containing
HWPOISON raw pages, the result is a VM_FAULT_HWPOISON_LARGE.

This commit teaches the hugetlb page fault handler to return
VM_FAULT_HWPOISON_LARGE only if the faulting address is within a HWPOISON
raw page; otherwise, the fault handler can continue to fault in healthy
raw pages.

Signed-off-by: Jiaqi Yan <[email protected]>
---
include/linux/mm.h | 2 +
mm/hugetlb.c | 129 ++++++++++++++++++++++++++++++++++++++++++--
mm/memory-failure.c | 1 +
3 files changed, 127 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc192f98cb1d..7caa4530953f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3531,6 +3531,7 @@ extern const struct attribute_group memory_failure_attr_group;
* @nr_expected_unmaps: if a VMA that maps @page when detected is eligible
* for high granularity mapping, @page is expected to be unmapped.
* @nr_actual_unmaps: how many times the raw page is actually unmapped.
+ * @index: index of the poisoned subpage in the folio.
*/
struct raw_hwp_page {
struct llist_node node;
@@ -3538,6 +3539,7 @@ struct raw_hwp_page {
int nr_vmas_mapped;
int nr_expected_unmaps;
int nr_actual_unmaps;
+ unsigned long index;
};

#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1419176b7e51..f8ddf04ae0c4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6158,6 +6158,30 @@ static struct folio *hugetlb_try_find_lock_folio(struct address_space *mapping,
return folio;
}

+static vm_fault_t hugetlb_no_page_hwpoison(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct folio *folio,
+ unsigned long address,
+ struct hugetlb_pte *hpte,
+ unsigned int flags);
+
+#ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+static vm_fault_t hugetlb_no_page_hwpoison(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct folio *folio,
+ unsigned long address,
+ struct hugetlb_pte *hpte,
+ unsigned int flags)
+{
+ if (unlikely(folio_test_hwpoison(folio))) {
+ return VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(hstate_vma(vma)));
+ }
+
+ return 0;
+}
+#endif
+
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping, pgoff_t idx,
@@ -6287,13 +6311,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
/*
* If memory error occurs between mmap() and fault, some process
* don't have hwpoisoned swap entry for errored virtual address.
- * So we need to block hugepage fault by PG_hwpoison bit check.
+ * So we need to block hugepage fault by hwpoison check:
+ * - without HGM, the check is based on PG_hwpoison
+ * - with HGM, check if the raw page for address is poisoned
*/
- if (unlikely(folio_test_hwpoison(folio))) {
- ret = VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(hstate_index(h));
+ ret = hugetlb_no_page_hwpoison(mm, vma, folio, address, hpte, flags);
+ if (unlikely(ret))
goto backout_unlocked;
- }

/* Check for page in userfault range. */
if (userfaultfd_minor(vma)) {
@@ -8426,6 +8450,11 @@ int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
* the allocated PTEs created before splitting fails.
*/

+ /*
+ * For none and UFFD_WP marker PTEs, given try_to_unmap_one doesn't
+ * unmap them, delay the splitting until page fault happens. See the
+ * hugetlb_no_page_hwpoison check in hugetlb_no_page.
+ */
if (unlikely(huge_pte_none_mostly(old_entry))) {
ret = -EAGAIN;
goto skip;
@@ -8479,6 +8508,96 @@ int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}

+/*
+ * Given a hugetlb PTE, if we want to split it into the next smaller level
+ * of PTEs, return what size we should use to do the HGM walk with allocations.
+ * If the given hugetlb PTE is already at PAGE_SIZE, return -EINVAL.
+ */
+static int hgm_next_size(struct vm_area_struct *vma, struct hugetlb_pte *hpte)
+{
+ struct hstate *h = hstate_vma(vma), *tmp_h;
+ unsigned int shift;
+ unsigned long curr_size = hugetlb_pte_size(hpte);
+ unsigned long next_size;
+
+ for_each_hgm_shift(h, tmp_h, shift) {
+ next_size = 1UL << shift;
+ if (next_size < curr_size)
+ return next_size;
+ }
+
+ return -EINVAL;
+}
+
+/*
+ * Check if address is in the range of a HWPOISON raw page.
+ * During the check, the hugetlb PTE may be split into smaller hugetlb PTEs.
+ */
+static vm_fault_t hugetlb_no_page_hwpoison(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct folio *folio,
+ unsigned long address,
+ struct hugetlb_pte *hpte,
+ unsigned int flags)
+{
+ unsigned long range_start, range_end;
+ unsigned long start_index, end_index;
+ unsigned long folio_start = vma_address(folio_page(folio, 0), vma);
+ struct llist_node *t, *tnode;
+ struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
+ struct raw_hwp_page *p = NULL;
+ bool contain_hwpoison = false;
+ int hgm_size;
+ int hgm_ret = 0;
+
+ if (likely(!folio_test_hwpoison(folio)))
+ return 0;
+
+ if (hugetlb_enable_hgm_vma(vma))
+ return VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(hstate_vma(vma)));
+
+recheck:
+ range_start = address & hugetlb_pte_mask(hpte);
+ range_end = range_start + hugetlb_pte_size(hpte);
+ start_index = (range_start - folio_start) / PAGE_SIZE;
+ end_index = start_index + hugetlb_pte_size(hpte) / PAGE_SIZE;
+
+ contain_hwpoison = false;
+ llist_for_each_safe(tnode, t, raw_hwp_head->first) {
+ p = container_of(tnode, struct raw_hwp_page, node);
+ if (start_index <= p->index && p->index < end_index) {
+ contain_hwpoison = true;
+ break;
+ }
+ }
+
+ if (!contain_hwpoison)
+ return 0;
+
+ if (hugetlb_pte_size(hpte) == PAGE_SIZE)
+ return VM_FAULT_HWPOISON;
+
+ /*
+ * hugetlb_fault already ensured hugetlb_vma_lock_read.
+ * We also checked hugetlb_pte_size(hpte) != PAGE_SIZE,
+ * so hgm_size must be something meaningful to HGM.
+ */
+ hgm_size = hgm_next_size(vma, hpte);
+ VM_BUG_ON(hgm_size == -EINVAL);
+ hgm_ret = hugetlb_full_walk_alloc(hpte, vma, address, hgm_size);
+ if (hgm_ret) {
+ WARN_ON_ONCE(hgm_ret);
+ /*
+ * When splitting using HGM fails, return like
+ * HGM is not eligible or enabled.
+ */
+ return VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(hstate_vma(vma)));
+ }
+ goto recheck;
+}
+
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

/*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47b935918ceb..9093ba53feed 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1957,6 +1957,7 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
raw_hwp->nr_vmas_mapped = 0;
raw_hwp->nr_expected_unmaps = 0;
raw_hwp->nr_actual_unmaps = 0;
+ raw_hwp->index = folio_page_idx(folio, page);
llist_add(&raw_hwp->node, head);
if (hgm_enabled)
/*
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:44:17

by Jiaqi Yan

Subject: [RFC PATCH v1 1/7] hugetlb: add HugeTLB splitting functionality

The new function, hugetlb_split_to_shift, optimally splits the page
table to map a particular address at a particular granularity. This
is useful for punching a hole in the mapping and for mapping (and
unmapping) small sections of a HugeTLB page.

Splitting applies to present leaf HugeTLB PTEs only. None HugeTLB PTEs
and other non-present HugeTLB PTE types are not supported, as they are
better left untouched:
* None PTEs
* Migration PTEs
* HWPOISON PTEs
* UFFD writeprotect PTEs

Signed-off-by: Jiaqi Yan <[email protected]>
---
include/linux/hugetlb.h | 9 ++
mm/hugetlb.c | 249 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 258 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 742e7f2cb170..d44bf6a794e5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1266,6 +1266,9 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
unsigned long end);
int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
unsigned long end);
+int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct hugetlb_pte *hpte, unsigned long addr,
+ unsigned int desired_shift);
#else
static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
{
@@ -1292,6 +1295,12 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
{
return -EINVAL;
}
+int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
+ const struct hugetlb_pte *hpte, unsigned long addr,
+ unsigned int desired_shift)
+{
+ return -EINVAL;
+}
#endif

static inline
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df4c17164abb..d3f3f1c2d293 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -8203,6 +8203,255 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
return ret;
}

+/*
+ * Find the optimal HugeTLB PTE shift that @desired_addr could be mapped at.
+ */
+static int hugetlb_find_shift(struct vm_area_struct *vma,
+ unsigned long curr,
+ unsigned long end,
+ unsigned long desired_addr,
+ unsigned long desired_shift,
+ unsigned int *shift_found)
+{
+ struct hstate *h = hstate_vma(vma);
+ struct hstate *tmp_h;
+ unsigned int shift;
+ unsigned long sz;
+
+ for_each_hgm_shift(h, tmp_h, shift) {
+ sz = 1UL << shift;
+ /* This sz is not aligned or too large. */
+ if (!IS_ALIGNED(curr, sz) || curr + sz > end)
+ continue;
+ /*
+ * When desired_addr is in [curr, curr + sz),
+ * we want shift to be as close to desired_shift
+ * as possible.
+ */
+ if (curr <= desired_addr && desired_addr < curr + sz
+ && shift > desired_shift)
+ continue;
+
+ *shift_found = shift;
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
+/*
+ * Given a particular address @addr and it is a present leaf HugeTLB PTE,
+ * split it so that the PTE that maps @addr is at @desired_shift.
+ */
+static int hugetlb_split_to_shift_present_leaf(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ pte_t old_entry,
+ unsigned long start,
+ unsigned long end,
+ unsigned long addr,
+ unsigned int orig_shift,
+ unsigned int desired_shift)
+{
+ bool old_entry_dirty;
+ bool old_entry_write;
+ bool old_entry_uffd_wp;
+ pte_t new_entry;
+ unsigned long curr;
+ unsigned long sz;
+ unsigned int shift;
+ int ret = 0;
+ struct hugetlb_pte new_hpte;
+ struct page *subpage = NULL;
+ struct folio *folio = page_folio(compound_head(pte_page(old_entry)));
+ struct hstate *h = hstate_vma(vma);
+ spinlock_t *ptl;
+
+ /* Unmap original unsplit hugepage per huge_ptep_get_and_clear. */
+ hugetlb_remove_rmap(folio_page(folio, 0), orig_shift, h, vma);
+
+ old_entry_dirty = huge_pte_dirty(old_entry);
+ old_entry_write = huge_pte_write(old_entry);
+ old_entry_uffd_wp = huge_pte_uffd_wp(old_entry);
+
+ for (curr = start; curr < end; curr += sz) {
+ ret = hugetlb_find_shift(vma, curr, end, addr,
+ desired_shift, &shift);
+
+ /* Unable to find a shift that works */
+ if (WARN_ON(ret))
+ goto abort;
+
+ /*
+ * Do HGM full walk and allocate new page table structures
+ * to continue to walk to the level we want.
+ */
+ sz = 1UL << shift;
+ ret = hugetlb_full_walk_alloc(&new_hpte, vma, curr, sz);
+ if (WARN_ON(ret))
+ goto abort;
+
+ BUG_ON(hugetlb_pte_size(&new_hpte) > sz);
+ /*
+ * When hugetlb_pte_size(new_hpte) is smaller than sz, advance
+ * curr by hugetlb_pte_size(new_hpte) to avoid skipping over
+ * some PTEs.
+ */
+ if (hugetlb_pte_size(&new_hpte) < sz)
+ sz = hugetlb_pte_size(&new_hpte);
+
+ subpage = hugetlb_find_subpage(h, folio, curr);
+ /*
+ * Creating a new (finer granularity) PT entry and
+ * populate it with old_entry's bits.
+ */
+ new_entry = make_huge_pte(vma, subpage,
+ huge_pte_write(old_entry), shift);
+ if (old_entry_dirty)
+ new_entry = huge_pte_mkdirty(new_entry);
+ if (old_entry_write)
+ new_entry = huge_pte_mkwrite(new_entry);
+ if (old_entry_uffd_wp)
+ new_entry = huge_pte_mkuffd_wp(new_entry);
+ ptl = hugetlb_pte_lock(&new_hpte);
+ set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
+ spin_unlock(ptl);
+ /* Increment ref/mapcount per set_huge_pte_at(). */
+ hugetlb_add_file_rmap(subpage, shift, h, vma);
+ folio_get(folio);
+ }
+ /*
+ * This refcount decrement is for the huge_ptep_get_and_clear
+ * on the hpte BEFORE splitting, for the same reason as
+ * hugetlb_remove_rmap(), but we cannot do it at that time.
+ * Now that splitting succeeded, the refcount can be decremented.
+ */
+ folio_put(folio);
+ return 0;
+abort:
+ /*
+ * Restore mapcount on the unsplit hugepage. No need to restore
+ * refcount as we won't folio_put() until splitting succeeded.
+ */
+ hugetlb_add_file_rmap(folio_page(folio, 0), orig_shift, h, vma);
+ return ret;
+}
+
+/*
+ * Given a particular address @addr, split the HugeTLB PTE that currently
+ * maps it so that, for the given @addr, the PTE that maps it is @desired_shift.
+ * The splitting is always done optimally.
+ *
+ * Example: given a HugeTLB 1G page mapped from VA 0 to 1G, if caller calls
+ * this API with addr=0 and desired_shift=PAGE_SHIFT, we will change the page
+ * table as follows:
+ * 1. The original PUD will be split into 512 2M PMDs first
+ * 2. The 1st PMD will further be split into 512 4K PTEs
+ *
+ * Callers are required to hold locks on the file mapping within vma.
+ */
+int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct hugetlb_pte *hpte, unsigned long addr,
+ unsigned int desired_shift)
+{
+ unsigned long start, end;
+ unsigned long desired_sz = 1UL << desired_shift;
+ int ret;
+ pte_t old_entry;
+ struct mmu_gather tlb;
+ struct mmu_notifier_range range;
+ spinlock_t *ptl;
+
+ BUG_ON(!hpte->ptep);
+
+ start = addr & hugetlb_pte_mask(hpte);
+ end = start + hugetlb_pte_size(hpte);
+ BUG_ON(!IS_ALIGNED(start, desired_sz));
+ BUG_ON(!IS_ALIGNED(end, desired_sz));
+ BUG_ON(addr < start || end <= addr);
+
+ if (hpte->shift == desired_shift)
+ return 0;
+
+ /*
+ * Non none-mostly hugetlb PTEs must be present leaf-level PTE,
+ * i.e. not split before.
+ */
+ ptl = hugetlb_pte_lock(hpte);
+ BUG_ON(!huge_pte_none_mostly(huge_ptep_get(hpte->ptep)) &&
+ !hugetlb_pte_present_leaf(hpte, huge_ptep_get(hpte->ptep)));
+
+ i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start, end);
+ mmu_notifier_invalidate_range_start(&range);
+
+ /*
+ * Get and clear the PTE. We will allocate new page table structures
+ * when walking the page table.
+ */
+ old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
+ spin_unlock(ptl);
+
+ /*
+ * From now on, any failure exit needs to go through "skip" to
+ * put old_entry back. If any form of hugetlb_split_to_shift_xxx
+ * is invoked, it also needs to go through "abort" to get rid of
+ * the allocated PTEs created before splitting fails.
+ */
+
+ if (unlikely(huge_pte_none_mostly(old_entry))) {
+ ret = -EAGAIN;
+ goto skip;
+ }
+ if (unlikely(!pte_present(old_entry))) {
+ if (is_hugetlb_entry_migration(old_entry))
+ ret = -EBUSY;
+ else if (is_hugetlb_entry_hwpoisoned(old_entry))
+ ret = -EHWPOISON;
+ else {
+ WARN_ONCE(1, "Unexpected case of non-present HugeTLB PTE\n");
+ ret = -EINVAL;
+ }
+ goto skip;
+ }
+
+ if (!hugetlb_pte_present_leaf(hpte, old_entry)) {
+ WARN_ONCE(1, "HugeTLB present PTE is not leaf\n");
+ ret = -EAGAIN;
+ goto skip;
+ }
+ /* From now on old_entry is present leaf entry. */
+ ret = hugetlb_split_to_shift_present_leaf(mm, vma, old_entry,
+ start, end, addr,
+ hpte->shift,
+ desired_shift);
+ if (ret)
+ goto abort;
+
+ /* Splitting done, new page table entries successfully setup. */
+ mmu_notifier_invalidate_range_end(&range);
+ return 0;
+abort:
+ /* Splitting failed, restoring to the original page table state. */
+ tlb_gather_mmu(&tlb, mm);
+ /* Decrement mapcount for all the split PTEs. */
+ __unmap_hugepage_range(&tlb, vma, start, end, NULL, ZAP_FLAG_DROP_MARKER);
+ /*
+ * Free any newly allocated page table entries.
+ * Ok if no new entries allocated at all.
+ */
+ hugetlb_free_pgd_range(&tlb, start, end, start, end);
+ /* Decrement refcount for all the split PTEs. */
+ tlb_finish_mmu(&tlb);
+skip:
+ /* Restore the old entry. */
+ ptl = hugetlb_pte_lock(hpte);
+ set_huge_pte_at(mm, start, hpte->ptep, old_entry);
+ spin_unlock(ptl);
+ mmu_notifier_invalidate_range_end(&range);
+ return ret;
+}
+
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

/*
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:44:45

by Jiaqi Yan

Subject: [RFC PATCH v1 6/7] selftest/mm: test PAGESIZE unmapping HWPOISON pages

After injecting memory errors to byte addresses inside a HugeTLB page,
the updated test checks that
1. only a raw page is unmapped, and userspace gets the correct SIGBUS
from the kernel.
2. other subpages in the same hugepage are still mapped and their data
is not corrupted.

Signed-off-by: Jiaqi Yan <[email protected]>
---
tools/testing/selftests/mm/hugetlb-hgm.c | 194 +++++++++++++++++++----
1 file changed, 167 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/mm/hugetlb-hgm.c b/tools/testing/selftests/mm/hugetlb-hgm.c
index c0ba6ad44005..bc9529986b66 100644
--- a/tools/testing/selftests/mm/hugetlb-hgm.c
+++ b/tools/testing/selftests/mm/hugetlb-hgm.c
@@ -39,6 +39,10 @@
#define MADV_SPLIT 26
#endif

+#ifndef NUM_HWPOISON_PAGES
+#define NUM_HWPOISON_PAGES 3UL
+#endif
+
#define PREFIX " ... "
#define ERROR_PREFIX " !!! "

@@ -241,6 +245,9 @@ static int test_sigbus(char *addr, bool poison)
sigbus_addr, addr);
else if (poison && !was_mceerr)
printf(ERROR_PREFIX "didn't get an MCEERR?\n");
+ else if (!poison && was_mceerr)
+ printf(ERROR_PREFIX "got BUS_MCEERR_AR sigbus on expected healthy address: %p\n",
+ sigbus_addr);
else
ret = 0;
out:
@@ -272,43 +279,176 @@ static int read_event_from_uffd(int *uffd, pthread_t *pthread)
return 0;
}

-static int test_sigbus_range(char *primary_map, size_t len, bool hwpoison)
+struct range_exclude_pages {
+ /* Starting address of the buffer. */
+ char *mapping;
+ /* Length of the buffer in bytes. */
+ size_t length;
+ /* The value that each byte in buffer should equal to. */
+ char value;
+ /*
+ * PAGESIZE aligned addresses excluded from the checking,
+ * e.g. if PAGE_SIZE=4k, for each addr in excludes,
+ * skips checking on [addr, addr + 4096).
+ */
+ unsigned long excluded[NUM_HWPOISON_PAGES];
+};
+
+static int check_range_exclude_pages(struct range_exclude_pages *range)
+{
+ const unsigned long pagesize = getpagesize();
+ unsigned long excluded_index;
+ unsigned long page_index;
+ bool should_skip;
+ size_t i = 0;
+ size_t j = 0;
+
+ while (i < range->length) {
+ page_index = ((unsigned long)(range->mapping + i)) / pagesize;
+ should_skip = false;
+ for (j = 0; j < NUM_HWPOISON_PAGES; ++j) {
+ excluded_index = range->excluded[j] / pagesize;
+ if (page_index == excluded_index) {
+ should_skip = true;
+ break;
+ }
+ }
+ if (should_skip) {
+ printf(PREFIX "skip excluded addr range [%#lx, %#lx)\n",
+ (unsigned long)(range->mapping + i),
+ (unsigned long)(range->mapping + i + pagesize));
+ i += pagesize;
+ continue;
+ }
+ if (range->mapping[i] != range->value) {
+ printf(ERROR_PREFIX "mismatch at %p (%d != %d)\n",
+ &range->mapping[i], range->mapping[i], range->value);
+ return -1;
+ }
+ ++i;
+ }
+
+ return 0;
+}
+
+enum test_status verify_raw_pages(char *map, size_t len,
+ unsigned long excluded[NUM_HWPOISON_PAGES])
{
const unsigned long pagesize = getpagesize();
- const int num_checks = 512;
- unsigned long bytes_per_check = len/num_checks;
- int i;
+ unsigned long size, offset, value;
+ size_t j = 0;
+
+ for (size = len / 2, offset = 0, value = 1; size > pagesize;
+ offset += size, size /= 2, ++value) {
+ struct range_exclude_pages range = {
+ .mapping = map + offset,
+ .length = size,
+ .value = value,
+ };
+ for (j = 0; j < NUM_HWPOISON_PAGES; ++j)
+ range.excluded[j] = excluded[j];
+
+ printf(PREFIX "checking non-poisoned range [%p, %p) "
+ "(len=%#lx) per-byte value=%lu\n",
+ range.mapping, range.mapping + range.length,
+ range.length, value);
+ if (check_range_exclude_pages(&range))
+ return TEST_FAILED;
+
+ printf(PREFIX PREFIX "good\n");
+ }

- printf(PREFIX "checking that we can't access "
- "(%d addresses within %p -> %p)\n",
- num_checks, primary_map, primary_map + len);
+ return TEST_PASSED;
+}

- if (pagesize > bytes_per_check)
- bytes_per_check = pagesize;
+static int read_hwpoison_pages(unsigned long *nr_hwp_pages)
+{
+ const unsigned long pagesize = getpagesize();
+ char buffer[256] = {0};
+ char *cmd = "cat /proc/meminfo | grep -i HardwareCorrupted | grep -o '[0-9]*'";
+ FILE *cmdfile = popen(cmd, "r");

- for (i = 0; i < len; i += bytes_per_check)
- if (test_sigbus(primary_map + i, hwpoison) < 0)
- return 1;
- /* check very last byte, because we left it unmapped */
- if (test_sigbus(primary_map + len - 1, hwpoison))
- return 1;
+ if (!(fgets(buffer, sizeof(buffer), cmdfile))) {
+ perror("failed to read HardwareCorrupted from /proc/meminfo\n");
+ return -1;
+ }
+ pclose(cmdfile);
+ *nr_hwp_pages = atoll(buffer) * 1024 / pagesize;

return 0;
}

-static enum test_status test_hwpoison(char *primary_map, size_t len)
+static enum test_status test_hwpoison_one_raw_page(char *hwpoison_addr)
{
- printf(PREFIX "poisoning %p -> %p\n", primary_map, primary_map + len);
- if (madvise(primary_map, len, MADV_HWPOISON) < 0) {
+ const unsigned long pagesize = getpagesize();
+
+ printf(PREFIX "poisoning [%p, %p) (len=%#lx)\n",
+ hwpoison_addr, hwpoison_addr + pagesize, pagesize);
+ if (madvise(hwpoison_addr, pagesize, MADV_HWPOISON) < 0) {
perror(ERROR_PREFIX "MADV_HWPOISON failed");
return TEST_SKIPPED;
}

- return test_sigbus_range(primary_map, len, true)
- ? TEST_FAILED : TEST_PASSED;
+ printf(PREFIX "checking poisoned range [%p, %p) (len=%#lx)\n",
+ hwpoison_addr, hwpoison_addr + pagesize, pagesize);
+ if (test_sigbus(hwpoison_addr, true) < 0)
+ return TEST_FAILED;
+
+ return TEST_PASSED;
}

-static int test_fork(int uffd, char *primary_map, size_t len)
+static enum test_status test_hwpoison_present(char *map, size_t len,
+ bool already_injected)
+{
+ const unsigned long pagesize = getpagesize();
+ const unsigned long hwpoison_next = 128;
+ unsigned long nr_hwpoison_pages_before, nr_hwpoison_pages_after;
+ enum test_status ret;
+ size_t i;
+ char *hwpoison_addr = map;
+ unsigned long hwpoison_addrs[NUM_HWPOISON_PAGES];
+
+ if (hwpoison_next * (NUM_HWPOISON_PAGES - 1) >= (len / pagesize)) {
+ printf(ERROR_PREFIX "max hwpoison_addr out of range");
+ return TEST_SKIPPED;
+ }
+
+ for (i = 0; i < NUM_HWPOISON_PAGES; ++i) {
+ hwpoison_addrs[i] = (unsigned long)hwpoison_addr;
+ hwpoison_addr += hwpoison_next * pagesize;
+ }
+
+ if (already_injected)
+ return verify_raw_pages(map, len, hwpoison_addrs);
+
+ if (read_hwpoison_pages(&nr_hwpoison_pages_before)) {
+ printf(ERROR_PREFIX "check #HWPOISON pages\n");
+ return TEST_SKIPPED;
+ }
+ printf(PREFIX "Before injections, #HWPOISON pages = %ld\n", nr_hwpoison_pages_before);
+
+ for (i = 0; i < NUM_HWPOISON_PAGES; ++i) {
+ ret = test_hwpoison_one_raw_page((char *)hwpoison_addrs[i]);
+ if (ret != TEST_PASSED)
+ return ret;
+ }
+
+ if (read_hwpoison_pages(&nr_hwpoison_pages_after)) {
+ printf(ERROR_PREFIX "check #HWPOISON pages\n");
+ return TEST_SKIPPED;
+ }
+ printf(PREFIX "After injections, #HWPOISON pages = %ld\n", nr_hwpoison_pages_after);
+
+ if (nr_hwpoison_pages_after - nr_hwpoison_pages_before != NUM_HWPOISON_PAGES) {
+ printf(ERROR_PREFIX "delta #HWPOISON pages != %ld",
+ NUM_HWPOISON_PAGES);
+ return TEST_FAILED;
+ }
+
+ return verify_raw_pages(map, len, hwpoison_addrs);
+}
+
+int test_fork(int uffd, char *primary_map, size_t len)
{
int status;
int ret = 0;
@@ -360,7 +500,6 @@ static int test_fork(int uffd, char *primary_map, size_t len)

pthread_join(uffd_thd, NULL);
return ret;
-
}

static int uffd_register(int uffd, char *primary_map, unsigned long len,
@@ -394,6 +533,7 @@ test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type)
bool uffd_wp = type == TEST_UFFDWP;
bool verify = type == TEST_DEFAULT;
int register_args;
+ enum test_status hwp_status = TEST_SKIPPED;

if (ftruncate(fd, len) < 0) {
perror(ERROR_PREFIX "ftruncate failed");
@@ -489,10 +629,10 @@ test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type)
* mapping.
*/
if (hwpoison) {
- enum test_status new_status = test_hwpoison(primary_map, len);
-
- if (new_status != TEST_PASSED) {
- status = new_status;
+ /* test_hwpoison can fail with TEST_SKIPPED. */
+ hwp_status = test_hwpoison_present(primary_map, len, false);
+ if (hwp_status != TEST_PASSED) {
+ status = hwp_status;
goto done;
}
}
@@ -539,7 +679,7 @@ test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type)
/*
* Verify that memory is still poisoned.
*/
- if (hwpoison && test_sigbus_range(primary_map, len, true))
+ if (hwpoison && test_hwpoison_present(primary_map, len, true))
goto done;

status = TEST_PASSED;
--
2.40.1.495.gc816e09b53d-goog

2023-04-28 00:46:09

by Jiaqi Yan

Subject: [RFC PATCH v1 7/7] selftest/mm: test PAGESIZE unmapping UFFD WP marker HWPOISON pages

For a not-yet-faulted hugepage containing a HWPOISON raw page, test that
1. only the HWPOISON raw page is not faulted in, and a BUS_MCEERR_AR
SIGBUS is sent to userspace.
2. healthy raw pages are faulted in as normal. Since the hugepage
has been write-protected via UFFD, a non-BUS_MCEERR_AR SIGBUS is
sent to userspace.

Signed-off-by: Jiaqi Yan <[email protected]>
---
tools/testing/selftests/mm/hugetlb-hgm.c | 170 +++++++++++++++++++++++
1 file changed, 170 insertions(+)

diff --git a/tools/testing/selftests/mm/hugetlb-hgm.c b/tools/testing/selftests/mm/hugetlb-hgm.c
index bc9529986b66..81ee2d99fea8 100644
--- a/tools/testing/selftests/mm/hugetlb-hgm.c
+++ b/tools/testing/selftests/mm/hugetlb-hgm.c
@@ -515,6 +515,169 @@ static int uffd_register(int uffd, char *primary_map, unsigned long len,
return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

+static int setup_present_map(char *present_map, size_t len)
+{
+ size_t offset = 0;
+ unsigned char iter = 0;
+ unsigned long pagesize = getpagesize();
+ uint64_t size;
+
+ for (size = len/2; size >= pagesize;
+ offset += size, size /= 2) {
+ iter++;
+ memset(present_map + offset, iter, size);
+ }
+ return 0;
+}
+
+static enum test_status test_hwpoison_absent_uffd_wp(int fd, size_t hugepagesize, size_t len)
+{
+ int uffd;
+ char *absent_map, *present_map;
+ struct uffdio_api api;
+ int register_args;
+ struct sigaction new, old;
+ enum test_status status = TEST_SKIPPED;
+ const unsigned long pagesize = getpagesize();
+ const unsigned long hwpoison_index = 128;
+ char *hwpoison_addr;
+
+ if (hwpoison_index >= (len / pagesize)) {
+ printf(ERROR_PREFIX "hwpoison_index out of range");
+ return TEST_FAILED;
+ }
+
+ if (ftruncate(fd, len) < 0) {
+ perror(ERROR_PREFIX "ftruncate failed");
+ return TEST_FAILED;
+ }
+
+ uffd = userfaultfd(O_CLOEXEC);
+ if (uffd < 0) {
+ perror(ERROR_PREFIX "uffd not created");
+ return TEST_FAILED;
+ }
+
+ absent_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (absent_map == MAP_FAILED) {
+ perror(ERROR_PREFIX "mmap for ABSENT mapping failed");
+ goto close_uffd;
+ }
+ printf(PREFIX "ABSENT mapping: %p\n", absent_map);
+
+ api.api = UFFD_API;
+ api.features = UFFD_FEATURE_SIGBUS | UFFD_FEATURE_EXACT_ADDRESS |
+ UFFD_FEATURE_EVENT_FORK;
+ if (ioctl(uffd, UFFDIO_API, &api) == -1) {
+ perror(ERROR_PREFIX "UFFDIO_API failed");
+ goto unmap_absent;
+ }
+
+ /*
+ * Register with UFFDIO_REGISTER_MODE_WP to have UFFD WP bit on
+ * the HugeTLB page table entry.
+ */
+ register_args = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP;
+ if (uffd_register(uffd, absent_map, len, register_args)) {
+ perror(ERROR_PREFIX "UFFDIO_REGISTER failed");
+ goto unmap_absent;
+ }
+
+ new.sa_sigaction = &sigbus_handler;
+ new.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGBUS, &new, &old) < 0) {
+ perror(ERROR_PREFIX "could not setup SIGBUS handler");
+ goto unmap_absent;
+ }
+
+ /*
+ * Set WP markers on the absent huge mapping. With HGM enabled in
+ * the kernel config, memory_failure will enable HGM in the kernel,
+ * so there is no need to enable HGM from userspace.
+ */
+ if (userfaultfd_writeprotect(uffd, absent_map, len, true) < 0) {
+ status = TEST_FAILED;
+ goto unmap_absent;
+ }
+
+ status = TEST_PASSED;
+
+ /*
+ * With MAP_SHARED hugetlb memory, we can inject a memory error into
+ * the not-yet-faulted mapping (absent_map) by injecting the error
+ * into an already faulted mapping (present_map).
+ */
+ present_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (present_map == MAP_FAILED) {
+ perror(ERROR_PREFIX "mmap for PRESENT mapping failed");
+ status = TEST_FAILED;
+ goto unmap_absent;
+ }
+ printf(PREFIX "PRESENT mapping: %p\n", present_map);
+ setup_present_map(present_map, len);
+
+ hwpoison_addr = present_map + hwpoison_index * pagesize;
+ if (madvise(hwpoison_addr, pagesize, MADV_HWPOISON)) {
+ perror(ERROR_PREFIX "MADV_HWPOISON a page in PRESENT mapping failed");
+ status = TEST_FAILED;
+ goto unmap_present;
+ }
+
+ printf(PREFIX "checking poisoned range [%p, %p) (len=%#lx) in PRESENT mapping\n",
+ hwpoison_addr, hwpoison_addr + pagesize, pagesize);
+ if (test_sigbus(hwpoison_addr, true) < 0) {
+ status = TEST_FAILED;
+ goto done;
+ }
+ printf(PREFIX "checking healthy pages in PRESENT mapping\n");
+ unsigned long hwpoison_addrs[] = {
+ (unsigned long)hwpoison_addr,
+ (unsigned long)hwpoison_addr,
+ (unsigned long)hwpoison_addr
+ };
+ status = verify_raw_pages(present_map, len, hwpoison_addrs);
+ if (status != TEST_PASSED) {
+ printf(ERROR_PREFIX "checking healthy pages failed\n");
+ goto done;
+ }
+
+ for (int i = 0; i < len; i += pagesize) {
+ if (i == hwpoison_index * pagesize) {
+ printf(PREFIX "checking poisoned range [%p, %p) (len=%#lx) in ABSENT mapping\n",
+ absent_map + i, absent_map + i + pagesize, pagesize);
+ if (test_sigbus(absent_map + i, true) < 0) {
+ status = TEST_FAILED;
+ break;
+ }
+ } else {
+ /*
+ * With UFFD_FEATURE_SIGBUS, we should get a SIGBUS for
+ * every not faulted (non present) page/byte.
+ */
+ if (test_sigbus(absent_map + i, false) < 0) {
+ printf(PREFIX "checking healthy range [%p, %p) (len=%#lx) in ABSENT mapping failed\n",
+ absent_map + i, absent_map + i + pagesize, pagesize);
+ status = TEST_FAILED;
+ break;
+ }
+ }
+ }
+done:
+ if (ftruncate(fd, 0) < 0) {
+ perror(ERROR_PREFIX "ftruncate back to 0 failed");
+ status = TEST_FAILED;
+ }
+unmap_present:
+ printf(PREFIX "Unmap PRESENT mapping=%p\n", present_map);
+ munmap(present_map, len);
+unmap_absent:
+ printf(PREFIX "Unmap ABSENT mapping=%p\n", absent_map);
+ munmap(absent_map, len);
+close_uffd:
+ printf(PREFIX "Close UFFD\n");
+ close(uffd);
+ return status;
+}
+
enum test_type {
TEST_DEFAULT,
TEST_UFFDWP,
@@ -744,6 +907,13 @@ int main(void)
printf("HGM hwpoison test: %s\n", status_to_str(status));
if (status == TEST_FAILED)
ret = -1;
+
+ printf("HGM hwpoison UFFD-WP marker test...\n");
+ status = test_hwpoison_absent_uffd_wp(fd, hugepagesize, len);
+ printf("HGM hwpoison UFFD-WP marker test: %s\n",
+ status_to_str(status));
+ if (status == TEST_FAILED)
+ ret = -1;
close:
close(fd);

--
2.40.1.495.gc816e09b53d-goog

Subject: Re: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible

On Fri, Apr 28, 2023 at 12:41:36AM +0000, Jiaqi Yan wrote:
> When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps
> the raw HWPOISON page (previously split and mapped at PTE size).
> If HGM failed to be enabled on eligible VMA or splitting failed,
> try_to_unmap_one fails.
>
> For VMS that is not HGM eligible, try_to_unmap_one still unmaps
> the whole P*D.
>
> When only the raw HWPOISON subpage is unmapped but others keep mapped,
> the old way in memory_failure to check if unmapping successful doesn't
> work. So introduce is_unmapping_successful() to cover both existing and
> new unmapping behavior.
>
> For the new unmapping behavior, store how many times a raw HWPOISON page
> is expected to be unmapped, and how many times it is actually unmapped
> in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped
> from a VMA if splitting succeeded in try_to_split_huge_mapping(), so
> unmap_success = (nr_expected_unamps == nr_actual_unmaps).
>
> Old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any
> raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt
> recovery actions again because recovery used to be done on the entire
> hugepage. With the new unmapping behavior, this doesn't hold. More
> subpages in the hugepage can become corrupted, and needs to be recovered
> (i.e. unmapped) individually. New folio_set_hugetlb_hwpoison returns
> 0 after adding a new raw subpage to raw_hwp_list.
>
> Unmapping raw HWPOISON page requires allocating raw_hwp_page
> successfully in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb
> now may fail due to OOM.
>
> Signed-off-by: Jiaqi Yan <[email protected]>
> ---
...

> @@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
>
> #ifdef CONFIG_HUGETLB_PAGE
>
> +/*
> + * Given a HWPOISON @subpage as raw page, find its location in @folio's
> + * _hugetlb_hwpoison. Return NULL if @subpage is not in the list.
> + */
> +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
> + struct page *subpage)
> +{
> + struct llist_node *t, *tnode;
> + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
> + struct raw_hwp_page *hwp_page = NULL;
> + struct raw_hwp_page *p;
> +
> + VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage);

I'm testing the series (on top of v6.2-rc4 + HGM v2 patchset) and found the
following error triggered by this VM_BUG_ON_PAGE(). The testcase is just to
inject hwpoison on an anonymous page (it's not hugetlb-related one).

[ 790.610985] ===> testcase 'mm/hwpoison/base/backend-anonymous_error-hard-offline_access-avoid.auto3' start
[ 793.304927] page:000000006743177b refcount:1 mapcount:0 mapping:0000000000000000 index:0x700000000 pfn:0x14d739
[ 793.309322] memcg:ffff8a30c50b6000
[ 793.310934] anon flags: 0x57ffffe08a001d(locked|uptodate|dirty|lru|mappedtodisk|swapbacked|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
[ 793.316665] raw: 0057ffffe08a001d ffffe93cc5353c88 ffffe93cc5685fc8 ffff8a30c91878f1
[ 793.320211] raw: 0000000700000000 0000000000000000 00000001ffffffff ffff8a30c50b6000
[ 793.323665] page dumped because: VM_BUG_ON_PAGE(PageHWPoison(subpage))
[ 793.326764] ------------[ cut here ]------------
[ 793.329080] kernel BUG at mm/memory-failure.c:1894!
[ 793.331895] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 793.334854] CPU: 4 PID: 2644 Comm: mceinj.sh Tainted: G E N 6.2.0-rc4-v6.2-rc2-230529-1404+ #63
[ 793.340710] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
[ 793.345875] RIP: 0010:hwpoison_user_mappings+0x654/0x780
[ 793.349066] Code: ef 89 de e8 6e bc f8 ff 48 8b 7c 24 20 48 83 c7 58 e8 10 bb d9 ff e9 5f fb ff ff 48 c7 c6 80 ce 4c b1 4c 89 ef e8 1c 38 f6 ff <0f> 0b 48 c7 c6 7b c8 4c b1 4c 89 ef e8 0b 38 f6 ff 0f 0b 8b 45 58
[ 793.359732] RSP: 0018:ffffa3ff85ed3d28 EFLAGS: 00010296
[ 793.362367] RAX: 000000000000003a RBX: 0000000000000018 RCX: 0000000000000000
[ 793.365763] RDX: 0000000000000001 RSI: ffffffffb14ac451 RDI: 00000000ffffffff
[ 793.368698] RBP: ffffe93cc535ce40 R08: 0000000000000000 R09: ffffa3ff85ed3ba0
[ 793.370837] R10: 0000000000000003 R11: ffffffffb1d3ed28 R12: 000000000014d739
[ 793.372903] R13: ffffe93cc535ce40 R14: ffffe93cc535ce40 R15: ffffe93cc535ce40
[ 793.374931] FS: 00007f6ccc42a740(0000) GS:ffff8a31bbc00000(0000) knlGS:0000000000000000
[ 793.377136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 793.378656] CR2: 0000561aad6474b2 CR3: 00000001492d4005 CR4: 0000000000170ee0
[ 793.380514] DR0: ffffffffb28ed7d0 DR1: ffffffffb28ed7d1 DR2: ffffffffb28ed7d2
[ 793.382296] DR3: ffffffffb28ed7d3 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 793.384028] Call Trace:
[ 793.384655] <TASK>
[ 793.385210] ? __lru_add_drain_all+0x164/0x1f0
[ 793.386316] memory_failure+0x352/0xaa0
[ 793.387249] ? __pfx_bpf_lsm_capable+0x10/0x10
[ 793.388323] ? __pfx_security_capable+0x10/0x10
[ 793.389350] hard_offline_page_store+0x46/0x80
[ 793.390397] kernfs_fop_write_iter+0x11e/0x200
[ 793.391441] vfs_write+0x1e4/0x3a0
[ 793.392221] ksys_write+0x53/0xd0
[ 793.392976] do_syscall_64+0x3a/0x90
[ 793.393790] entry_SYSCALL_64_after_hwframe+0x72/0xdc

I'm wondering how this code path is called, one possible path is like this:

hwpoison_user_mappings
if PageHuge(hpage) && !PageAnon(hpage)
try_to_split_huge_mapping()
find_in_raw_hwp_list
VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)

but this looks unlikely because the precheck "PageHuge(hpage) && !PageAnon(hpage)" is
false for anonymous pages.

Another possible code path is:

hwpoison_user_mappings
if PageHuge(hpage) && !PageAnon(hpage)
...
else
try_to_unmap
rmap_walk
rmap_walk_anon
try_to_unmap_one
if folio_test_hugetlb
if hgm_eligible
find_in_raw_hwp_list
VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)

but this looks also unlikely because of checking folio_test_hugetlb and hgm_eligible
(I think both are false in this testcase.)
Maybe I miss something (and I'll dig this more), but let me share the issue.

Thanks,
Naoya Horiguchi

2023-05-30 21:43:56

by Jiaqi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible

On Mon, May 29, 2023 at 7:25 PM HORIGUCHI NAOYA(堀口 直也)
<[email protected]> wrote:
>
> On Fri, Apr 28, 2023 at 12:41:36AM +0000, Jiaqi Yan wrote:
> > When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps
> > the raw HWPOISON page (previously split and mapped at PTE size).
> > If HGM failed to be enabled on eligible VMA or splitting failed,
> > try_to_unmap_one fails.
> >
> > For VMS that is not HGM eligible, try_to_unmap_one still unmaps
> > the whole P*D.
> >
> > When only the raw HWPOISON subpage is unmapped but others keep mapped,
> > the old way in memory_failure to check if unmapping successful doesn't
> > work. So introduce is_unmapping_successful() to cover both existing and
> > new unmapping behavior.
> >
> > For the new unmapping behavior, store how many times a raw HWPOISON page
> > is expected to be unmapped, and how many times it is actually unmapped
> > in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped
> > from a VMA if splitting succeeded in try_to_split_huge_mapping(), so
> > unmap_success = (nr_expected_unamps == nr_actual_unmaps).
> >
> > Old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any
> > raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt
> > recovery actions again because recovery used to be done on the entire
> > hugepage. With the new unmapping behavior, this doesn't hold. More
> > subpages in the hugepage can become corrupted, and needs to be recovered
> > (i.e. unmapped) individually. New folio_set_hugetlb_hwpoison returns
> > 0 after adding a new raw subpage to raw_hwp_list.
> >
> > Unmapping raw HWPOISON page requires allocating raw_hwp_page
> > successfully in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb
> > now may fail due to OOM.
> >
> > Signed-off-by: Jiaqi Yan <[email protected]>
> > ---
> ...
>
> > @@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
> >
> > #ifdef CONFIG_HUGETLB_PAGE
> >
> > +/*
> > + * Given a HWPOISON @subpage as raw page, find its location in @folio's
> > + * _hugetlb_hwpoison. Return NULL if @subpage is not in the list.
> > + */
> > +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
> > + struct page *subpage)
> > +{
> > + struct llist_node *t, *tnode;
> > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
> > + struct raw_hwp_page *hwp_page = NULL;
> > + struct raw_hwp_page *p;
> > +
> > + VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage);
>
> I'm testing the series (on top of v6.2-rc4 + HGM v2 patchset) and found the
> following error triggered by this VM_BUG_ON_PAGE(). The testcase is just to
> inject hwpoison on an anonymous page (it's not hugetlb-related one).

Thanks for reporting this problem, Naoya!

My mistake: this assertion was meant to check "!PageHWPoison(subpage)", to
make sure the caller of find_in_raw_hwp_list already knows that subpage is
hw-corrupted.
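
A sketch of the intended assertion (not a posted fix yet):

    /* The caller must already know that @subpage is HWPOISON. */
    VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);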

>
> [ 790.610985] ===> testcase 'mm/hwpoison/base/backend-anonymous_error-hard-offline_access-avoid.auto3' start
> [ 793.304927] page:000000006743177b refcount:1 mapcount:0 mapping:0000000000000000 index:0x700000000 pfn:0x14d739
> [ 793.309322] memcg:ffff8a30c50b6000
> [ 793.310934] anon flags: 0x57ffffe08a001d(locked|uptodate|dirty|lru|mappedtodisk|swapbacked|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
> [ 793.316665] raw: 0057ffffe08a001d ffffe93cc5353c88 ffffe93cc5685fc8 ffff8a30c91878f1
> [ 793.320211] raw: 0000000700000000 0000000000000000 00000001ffffffff ffff8a30c50b6000
> [ 793.323665] page dumped because: VM_BUG_ON_PAGE(PageHWPoison(subpage))
> [ 793.326764] ------------[ cut here ]------------
> [ 793.329080] kernel BUG at mm/memory-failure.c:1894!
> [ 793.331895] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [ 793.334854] CPU: 4 PID: 2644 Comm: mceinj.sh Tainted: G E N 6.2.0-rc4-v6.2-rc2-230529-1404+ #63
> [ 793.340710] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
> [ 793.345875] RIP: 0010:hwpoison_user_mappings+0x654/0x780
> [ 793.349066] Code: ef 89 de e8 6e bc f8 ff 48 8b 7c 24 20 48 83 c7 58 e8 10 bb d9 ff e9 5f fb ff ff 48 c7 c6 80 ce 4c b1 4c 89 ef e8 1c 38 f6 ff <0f> 0b 48 c7 c6 7b c8 4c b1 4c 89 ef e8 0b 38 f6 ff 0f 0b 8b 45 58
> [ 793.359732] RSP: 0018:ffffa3ff85ed3d28 EFLAGS: 00010296
> [ 793.362367] RAX: 000000000000003a RBX: 0000000000000018 RCX: 0000000000000000
> [ 793.365763] RDX: 0000000000000001 RSI: ffffffffb14ac451 RDI: 00000000ffffffff
> [ 793.368698] RBP: ffffe93cc535ce40 R08: 0000000000000000 R09: ffffa3ff85ed3ba0
> [ 793.370837] R10: 0000000000000003 R11: ffffffffb1d3ed28 R12: 000000000014d739
> [ 793.372903] R13: ffffe93cc535ce40 R14: ffffe93cc535ce40 R15: ffffe93cc535ce40
> [ 793.374931] FS: 00007f6ccc42a740(0000) GS:ffff8a31bbc00000(0000) knlGS:0000000000000000
> [ 793.377136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 793.378656] CR2: 0000561aad6474b2 CR3: 00000001492d4005 CR4: 0000000000170ee0
> [ 793.380514] DR0: ffffffffb28ed7d0 DR1: ffffffffb28ed7d1 DR2: ffffffffb28ed7d2
> [ 793.382296] DR3: ffffffffb28ed7d3 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> [ 793.384028] Call Trace:
> [ 793.384655] <TASK>
> [ 793.385210] ? __lru_add_drain_all+0x164/0x1f0
> [ 793.386316] memory_failure+0x352/0xaa0
> [ 793.387249] ? __pfx_bpf_lsm_capable+0x10/0x10
> [ 793.388323] ? __pfx_security_capable+0x10/0x10
> [ 793.389350] hard_offline_page_store+0x46/0x80
> [ 793.390397] kernfs_fop_write_iter+0x11e/0x200
> [ 793.391441] vfs_write+0x1e4/0x3a0
> [ 793.392221] ksys_write+0x53/0xd0
> [ 793.392976] do_syscall_64+0x3a/0x90
> [ 793.393790] entry_SYSCALL_64_after_hwframe+0x72/0xdc
>
> I'm wondering how this code path is called, one possible path is like this:
>
> hwpoison_user_mappings
> if PageHuge(hpage) && !PageAnon(hpage)
> try_to_split_huge_mapping()
> find_in_raw_hwp_list
> VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)
>
> but this looks unlikely because the precheck "PageHuge(hpage) && !PageAnon(hpage)" is
> false for anonymous pages.
>
> Another possible code path is:
>
> hwpoison_user_mappings
> if PageHuge(hpage) && !PageAnon(hpage)
> ...
> else
> try_to_unmap
> rmap_walk
> rmap_walk_anon
> try_to_unmap_one
> if folio_test_hugetlb
> if hgm_eligible
> find_in_raw_hwp_list
> VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)
>
> but this looks also unlikely because of checking folio_test_hugetlb and hgm_eligible
> (I think both are false in this testcase.)
> Maybe I miss something (and I'll dig this more), but let me share the issue.

I bet it is in "is_unmapping_successful". So another problem with this
patch is that "is_unmapping_successful" should only call
find_in_raw_hwp_list after it has handled the non-hugetlb and non-shared
mapping cases, i.e.:

struct raw_hwp_page *hwp_page = NULL;

if (!folio_test_hugetlb(folio) ||
    folio_test_anon(folio) ||
    !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING)) {
	...
}

hwp_page = find_in_raw_hwp_list(folio, poisoned_page);
VM_BUG_ON_PAGE(!hwp_page, poisoned_page);

I will make sure these two issues get fixed up in follow-up revisions.

>
> Thanks,
> Naoya Horiguchi

2023-05-30 22:22:54

by Jiaqi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible

On Tue, May 30, 2023 at 2:31 PM Jiaqi Yan <[email protected]> wrote:
>
> On Mon, May 29, 2023 at 7:25 PM HORIGUCHI NAOYA(堀口 直也)
> <[email protected]> wrote:
> >
> > On Fri, Apr 28, 2023 at 12:41:36AM +0000, Jiaqi Yan wrote:
> > > When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps
> > > the raw HWPOISON page (previously split and mapped at PTE size).
> > > If HGM failed to be enabled on eligible VMA or splitting failed,
> > > try_to_unmap_one fails.
> > >
> > > For VMS that is not HGM eligible, try_to_unmap_one still unmaps
> > > the whole P*D.
> > >
> > > When only the raw HWPOISON subpage is unmapped but others keep mapped,
> > > the old way in memory_failure to check if unmapping successful doesn't
> > > work. So introduce is_unmapping_successful() to cover both existing and
> > > new unmapping behavior.
> > >
> > > For the new unmapping behavior, store how many times a raw HWPOISON page
> > > is expected to be unmapped, and how many times it is actually unmapped
> > > in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped
> > > from a VMA if splitting succeeded in try_to_split_huge_mapping(), so
> > > unmap_success = (nr_expected_unamps == nr_actual_unmaps).
> > >
> > > Old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any
> > > raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt
> > > recovery actions again because recovery used to be done on the entire
> > > hugepage. With the new unmapping behavior, this doesn't hold. More
> > > subpages in the hugepage can become corrupted, and needs to be recovered
> > > (i.e. unmapped) individually. New folio_set_hugetlb_hwpoison returns
> > > 0 after adding a new raw subpage to raw_hwp_list.
> > >
> > > Unmapping raw HWPOISON page requires allocating raw_hwp_page
> > > successfully in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb
> > > now may fail due to OOM.
> > >
> > > Signed-off-by: Jiaqi Yan <[email protected]>
> > > ---
> > ...
> >
> > > @@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
> > >
> > > #ifdef CONFIG_HUGETLB_PAGE
> > >
> > > +/*
> > > + * Given a HWPOISON @subpage as raw page, find its location in @folio's
> > > + * _hugetlb_hwpoison. Return NULL if @subpage is not in the list.
> > > + */
> > > +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,

BTW, per our discussion here[1], this routine will probably reuse what
comes out of the refactored routine.

It should be safe for try_to_unmap_one to hold a raw_hwp_page returned
from find_in_raw_hwp_list as long as raw_hwp_list is protected by
mf_mutex.

[1] https://lore.kernel.org/linux-mm/CACw3F53+Hg4CgFoPj3LLSiURzWfa2egWLO-=12GzfhsNC3XTvQ@mail.gmail.com/T/#m9966de1007b80eb8bd2c2ce0a9db13624cd2652e

> > > + struct page *subpage)
> > > +{
> > > + struct llist_node *t, *tnode;
> > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
> > > + struct raw_hwp_page *hwp_page = NULL;
> > > + struct raw_hwp_page *p;
> > > +
> > > + VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage);
> >
> > I'm testing the series (on top of v6.2-rc4 + HGM v2 patchset) and found the
> > following error triggered by this VM_BUG_ON_PAGE(). The testcase is just to
> > inject hwpoison on an anonymous page (it's not hugetlb-related one).
>
> Thanks for reporting this problem, Naoya!
>
> My mistake, this assertion meant to be "if !PageHWPoison(subpage)", to
> make sure the caller of find_in_raw_hwp_list is sure that subpage is
> hw corrupted.
>
> >
> > [ 790.610985] ===> testcase 'mm/hwpoison/base/backend-anonymous_error-hard-offline_access-avoid.auto3' start
> > [ 793.304927] page:000000006743177b refcount:1 mapcount:0 mapping:0000000000000000 index:0x700000000 pfn:0x14d739
> > [ 793.309322] memcg:ffff8a30c50b6000
> > [ 793.310934] anon flags: 0x57ffffe08a001d(locked|uptodate|dirty|lru|mappedtodisk|swapbacked|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
> > [ 793.316665] raw: 0057ffffe08a001d ffffe93cc5353c88 ffffe93cc5685fc8 ffff8a30c91878f1
> > [ 793.320211] raw: 0000000700000000 0000000000000000 00000001ffffffff ffff8a30c50b6000
> > [ 793.323665] page dumped because: VM_BUG_ON_PAGE(PageHWPoison(subpage))
> > [ 793.326764] ------------[ cut here ]------------
> > [ 793.329080] kernel BUG at mm/memory-failure.c:1894!
> > [ 793.331895] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > [ 793.334854] CPU: 4 PID: 2644 Comm: mceinj.sh Tainted: G E N 6.2.0-rc4-v6.2-rc2-230529-1404+ #63
> > [ 793.340710] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
> > [ 793.345875] RIP: 0010:hwpoison_user_mappings+0x654/0x780
> > [ 793.349066] Code: ef 89 de e8 6e bc f8 ff 48 8b 7c 24 20 48 83 c7 58 e8 10 bb d9 ff e9 5f fb ff ff 48 c7 c6 80 ce 4c b1 4c 89 ef e8 1c 38 f6 ff <0f> 0b 48 c7 c6 7b c8 4c b1 4c 89 ef e8 0b 38 f6 ff 0f 0b 8b 45 58
> > [ 793.359732] RSP: 0018:ffffa3ff85ed3d28 EFLAGS: 00010296
> > [ 793.362367] RAX: 000000000000003a RBX: 0000000000000018 RCX: 0000000000000000
> > [ 793.365763] RDX: 0000000000000001 RSI: ffffffffb14ac451 RDI: 00000000ffffffff
> > [ 793.368698] RBP: ffffe93cc535ce40 R08: 0000000000000000 R09: ffffa3ff85ed3ba0
> > [ 793.370837] R10: 0000000000000003 R11: ffffffffb1d3ed28 R12: 000000000014d739
> > [ 793.372903] R13: ffffe93cc535ce40 R14: ffffe93cc535ce40 R15: ffffe93cc535ce40
> > [ 793.374931] FS: 00007f6ccc42a740(0000) GS:ffff8a31bbc00000(0000) knlGS:0000000000000000
> > [ 793.377136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 793.378656] CR2: 0000561aad6474b2 CR3: 00000001492d4005 CR4: 0000000000170ee0
> > [ 793.380514] DR0: ffffffffb28ed7d0 DR1: ffffffffb28ed7d1 DR2: ffffffffb28ed7d2
> > [ 793.382296] DR3: ffffffffb28ed7d3 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> > [ 793.384028] Call Trace:
> > [ 793.384655] <TASK>
> > [ 793.385210] ? __lru_add_drain_all+0x164/0x1f0
> > [ 793.386316] memory_failure+0x352/0xaa0
> > [ 793.387249] ? __pfx_bpf_lsm_capable+0x10/0x10
> > [ 793.388323] ? __pfx_security_capable+0x10/0x10
> > [ 793.389350] hard_offline_page_store+0x46/0x80
> > [ 793.390397] kernfs_fop_write_iter+0x11e/0x200
> > [ 793.391441] vfs_write+0x1e4/0x3a0
> > [ 793.392221] ksys_write+0x53/0xd0
> > [ 793.392976] do_syscall_64+0x3a/0x90
> > [ 793.393790] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> >
> > I'm wondering how this code path is called, one possible path is like this:
> >
> > hwpoison_user_mappings
> > if PageHuge(hpage) && !PageAnon(hpage)
> > try_to_split_huge_mapping()
> > find_in_raw_hwp_list
> > VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)
> >
> > but this looks unlikely because the precheck "PageHuge(hpage) && !PageAnon(hpage)" is
> > false for anonymous pages.
> >
> > Another possible code path is:
> >
> > hwpoison_user_mappings
> > if PageHuge(hpage) && !PageAnon(hpage)
> > ...
> > else
> > try_to_unmap
> > rmap_walk
> > rmap_walk_anon
> > try_to_unmap_one
> > if folio_test_hugetlb
> > if hgm_eligible
> > find_in_raw_hwp_list
> > VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage)
> >
> > but this looks also unlikely because of checking folio_test_hugetlb and hgm_eligible
> > (I think both are false in this testcase.)
> > Maybe I miss something (and I'll dig this more), but let me share the issue.
>
> I bet it is in "is_unmapping_successful". So another problem with this
> patch is, "is_unmapping_successful" should only calls
> find_in_raw_hwp_list after it handles non hugetlb and non shared
> mapping, i.e.:
>
> struct raw_hwp_page *hwp_page = NULL;
>
> if (!folio_test_hugetlb(folio) ||
> folio_test_anon(folio) ||
> !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING)) {
> ...
> }
>
> hwp_page = find_in_raw_hwp_list(folio, poisoned_page);
> VM_BUG_ON_PAGE(!hwp_page, poisoned_page);
>
> I will make sure these two issues get fixed up in follow-up revisions.
>
> >
> > Thanks,
> > Naoya Horiguchi

Subject: Re: [RFC PATCH v1 4/7] mm/memory_failure: unmap raw HWPoison PTEs when possible

On Fri, Apr 28, 2023 at 12:41:36AM +0000, Jiaqi Yan wrote:
> When a folio's VMA is HGM eligible, try_to_unmap_one now only unmaps
> the raw HWPOISON page (previously split and mapped at PTE size).
> If HGM failed to be enabled on eligible VMA or splitting failed,
> try_to_unmap_one fails.
>
> For VMS that is not HGM eligible, try_to_unmap_one still unmaps
> the whole P*D.
>
> When only the raw HWPOISON subpage is unmapped but others keep mapped,
> the old way in memory_failure to check if unmapping successful doesn't
> work. So introduce is_unmapping_successful() to cover both existing and
> new unmapping behavior.
>
> For the new unmapping behavior, store how many times a raw HWPOISON page
> is expected to be unmapped, and how many times it is actually unmapped
> in try_to_unmap_one(). A HWPOISON raw page is expected to be unmapped
> from a VMA if splitting succeeded in try_to_split_huge_mapping(), so
> unmap_success = (nr_expected_unamps == nr_actual_unmaps).

s/nr_expected_unamps/nr_expected_unmaps/

>
> Old folio_set_hugetlb_hwpoison returns -EHWPOISON if a folio has any
> raw HWPOISON subpage, and try_memory_failure_hugetlb won't attempt
> recovery actions again because recovery used to be done on the entire
> hugepage. With the new unmapping behavior, this doesn't hold. More
> subpages in the hugepage can become corrupted, and needs to be recovered
> (i.e. unmapped) individually. New folio_set_hugetlb_hwpoison returns
> 0 after adding a new raw subpage to raw_hwp_list.
>
> Unmapping raw HWPOISON page requires allocating raw_hwp_page
> successfully in folio_set_hugetlb_hwpoison, so try_memory_failure_hugetlb
> now may fail due to OOM.
>
> Signed-off-by: Jiaqi Yan <[email protected]>
> ---
> include/linux/mm.h | 20 ++++++-
> mm/memory-failure.c | 140 ++++++++++++++++++++++++++++++++++++++------
> mm/rmap.c | 38 +++++++++++-
> 3 files changed, 175 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4496d7bdd3ea..dc192f98cb1d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3522,20 +3522,38 @@ enum mf_action_page_type {
> */
> extern const struct attribute_group memory_failure_attr_group;
>
> -#ifdef CONFIG_HUGETLB_PAGE
> /*
> * Struct raw_hwp_page represents information about "raw error page",
> * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
> + * @node: the node in folio->_hugetlb_hwpoison list.
> + * @page: the raw HWPOISON page struct.
> + * @nr_vmas_mapped: the number of VMAs that map @page when detected.
> + * @nr_expected_unmaps: if a VMA that maps @page when detected is eligible
> + * for high granularity mapping, @page is expected to be unmapped.
> + * @nr_actual_unmaps: how many times the raw page is actually unmapped.
> */
> struct raw_hwp_page {
> struct llist_node node;
> struct page *page;
> + int nr_vmas_mapped;
> + int nr_expected_unmaps;
> + int nr_actual_unmaps;
> };
>
> +#ifdef CONFIG_HUGETLB_PAGE
> static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
> {
> return (struct llist_head *)&folio->_hugetlb_hwpoison;
> }
> +
> +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
> + struct page *subpage);
> +#else
> +static inline struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
> + struct page *subpage)
> +{
> + return NULL;
> +}
> #endif
>
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 48e62d04af17..47b935918ceb 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1120,10 +1120,10 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p)
> }
>
> /*
> - * Huge pages. Needs work.
> - * Issues:
> - * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
> - * To narrow down kill region to one page, we need to break up pmd.
> + * Huge pages.
> + * - Without HGM: Error on hugepage is contained in hugepage unit (not in
> + * raw page unit).
> + * - With HGM: Kill region is narrowed down to just one page.
> */
> static int me_huge_page(struct page_state *ps, struct page *p)
> {
> @@ -1131,6 +1131,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
> struct page *hpage = compound_head(p);
> struct address_space *mapping;
> bool extra_pins = false;
> + struct raw_hwp_page *hwp_page = find_in_raw_hwp_list(page_folio(p), p);
>
> if (!PageHuge(hpage))
> return MF_DELAYED;
> @@ -1157,7 +1158,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
> }
> }
>
> - if (has_extra_refcount(ps, p, extra_pins))
> + if (hwp_page->nr_expected_unmaps == 0 &&

This additional check seems unclear to me, so could you add a comment about
the intention?

I think that "hwp_page->nr_expected_unmaps == 0" means no mapping is split.
This is OK because it just keeps the current behavior (where no HGM happens).
But thinking about the "partial" case where some VMAs use HGM and some do not,
the has_extra_refcount() check still seems meaningful, so maybe a more
restricted check like "hwp_page->nr_expected_unmaps < hwp_page->nr_vmas_mapped"
could be better, as sketched below.

> + has_extra_refcount(ps, p, extra_pins))
> res = MF_FAILED;
>
> return res;
> @@ -1497,24 +1499,30 @@ static void try_to_split_huge_mapping(struct folio *folio,
> unsigned long poisoned_addr;
> unsigned long head_addr;
> struct hugetlb_pte hpte;
> + struct raw_hwp_page *hwp_page = NULL;
>
> if (WARN_ON(!mapping))
> return;
>
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>
> + hwp_page = find_in_raw_hwp_list(folio, poisoned_page);
> + VM_BUG_ON_PAGE(!hwp_page, poisoned_page);
> +
> pgoff_start = folio_pgoff(folio);
> pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
>
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
> + ++hwp_page->nr_vmas_mapped;
> +
> /* Enable HGM on HGM-eligible VMAs. */
> if (!hugetlb_hgm_eligible(vma))
> continue;
>
> i_mmap_assert_locked(vma->vm_file->f_mapping);
> if (hugetlb_enable_hgm_vma(vma)) {
> - pr_err("Failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
> - vma->vm_start, vma->vm_end);
> + pr_err("%#lx: failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
> + page_to_pfn(poisoned_page), vma->vm_start, vma->vm_end);

page_to_pfn(poisoned_page) is called multiple times in this function, so its
result could be stored in a local variable.
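
For example (illustrative):

    unsigned long pfn = page_to_pfn(poisoned_page);
    ...
    pr_err("%#lx: failed to enable HGM on eligible VMA=[%#lx, %#lx)\n",
           pfn, vma->vm_start, vma->vm_end);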

> continue;
> }
>
> @@ -1528,15 +1536,21 @@ static void try_to_split_huge_mapping(struct folio *folio,
> * lock on vma->vm_file->f_mapping, which caller
> * (e.g. hwpoison_user_mappings) should already acquired.
> */
> - if (hugetlb_full_walk(&hpte, vma, head_addr))
> + if (hugetlb_full_walk(&hpte, vma, head_addr)) {
> + pr_err("%#lx: failed to PT-walk with HGM on eligible VMA=[%#lx, %#lx)\n",
> + page_to_pfn(poisoned_page), vma->vm_start, vma->vm_end);
> continue;
> + }
>
> if (hugetlb_split_to_shift(vma->vm_mm, vma, &hpte,
> poisoned_addr, PAGE_SHIFT)) {
> - pr_err("Failed to split huge mapping: pfn=%#lx, vaddr=%#lx in VMA=[%#lx, %#lx)\n",
> + pr_err("%#lx: Failed to split huge mapping: vaddr=%#lx in VMA=[%#lx, %#lx)\n",
> page_to_pfn(poisoned_page), poisoned_addr,
> vma->vm_start, vma->vm_end);
> + continue;
> }
> +
> + ++hwp_page->nr_expected_unmaps;
> }
> }
> #else
> @@ -1546,6 +1560,47 @@ static void try_to_split_huge_mapping(struct folio *folio,
> }
> #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>
> +static bool is_unmapping_successful(struct folio *folio,
> + struct page *poisoned_page)
> +{
> + bool unmap_success = false;
> + struct raw_hwp_page *hwp_page = find_in_raw_hwp_list(folio, poisoned_page);

As a small optimization, this find_in_raw_hwp_list() can be called after the
if-block below, because hwp_page is used only after it.

> +
> + if (!folio_test_hugetlb(folio) ||
> + folio_test_anon(folio) ||
> + !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING)) {
> + unmap_success = folio_mapped(folio);

Maybe you mean "unmap_success = !folio_mapped(folio)" ?

> + if (!unmap_success)
> + pr_err("%#lx: failed to unmap page (mapcount=%d)\n",
> + page_to_pfn(poisoned_page),
> + page_mapcount(folio_page(folio, 0)));
> +
> + return unmap_success;
> + }
> +
> + VM_BUG_ON_PAGE(!hwp_page, poisoned_page);
> +
> + /*
> + * Unmapping may not happen for some VMA:
> + * - HGM-eligible VMA but @poisoned_page is not faulted yet: nothing
> + * needs to be done at this point yet until page fault handling.
> + * - HGM-non-eliggible VMA: mapcount decreases by nr_subpages for each VMA,

s/eliggible/eligible/

> + * but not tracked so cannot tell if successfully unmapped from such VMA.
> + */
> + if (hwp_page->nr_vmas_mapped != hwp_page->nr_expected_unmaps)
> + pr_info("%#lx: mapped by %d VMAs but %d unmappings are expected\n",
> + page_to_pfn(poisoned_page), hwp_page->nr_vmas_mapped,
> + hwp_page->nr_expected_unmaps);
> +
> + unmap_success = hwp_page->nr_expected_unmaps == hwp_page->nr_actual_unmaps;
> +
> + if (!unmap_success)
> + pr_err("%#lx: failed to unmap page (folio_mapcount=%d)\n",
> + page_to_pfn(poisoned_page), folio_mapcount(folio));
> +
> + return unmap_success;
> +}
> +
> /*
> * Do all that is necessary to remove user space mappings. Unmap
> * the pages and send SIGBUS to the processes if the data was dirty.
> @@ -1631,10 +1686,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
> try_to_unmap(folio, ttu);
> }
>
> - unmap_success = !page_mapped(hpage);
> - if (!unmap_success)
> - pr_err("%#lx: failed to unmap page (mapcount=%d)\n",
> - pfn, page_mapcount(hpage));
> + unmap_success = is_unmapping_successful(folio, p);
>
> /*
> * try_to_unmap() might put mlocked page in lru cache, so call
> @@ -1827,6 +1879,31 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
>
> #ifdef CONFIG_HUGETLB_PAGE
>
> +/*
> + * Given a HWPOISON @subpage as raw page, find its location in @folio's
> + * _hugetlb_hwpoison. Return NULL if @subpage is not in the list.
> + */
> +struct raw_hwp_page *find_in_raw_hwp_list(struct folio *folio,
> + struct page *subpage)
> +{
> + struct llist_node *t, *tnode;
> + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
> + struct raw_hwp_page *hwp_page = NULL;
> + struct raw_hwp_page *p;
> +
> + VM_BUG_ON_PAGE(PageHWPoison(subpage), subpage);
> +
> + llist_for_each_safe(tnode, t, raw_hwp_head->first) {
> + p = container_of(tnode, struct raw_hwp_page, node);
> + if (subpage == p->page) {
> + hwp_page = p;
> + break;
> + }
> + }
> +
> + return hwp_page;
> +}
> +
> static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
> {
> struct llist_head *head;
> @@ -1837,6 +1914,9 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
> llist_for_each_safe(tnode, t, head->first) {
> struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
>
> + /* Ideally raw HWPoison pages are fully unmapped if possible. */
> + WARN_ON(p->nr_expected_unmaps != p->nr_actual_unmaps);
> +
> if (move_flag)
> SetPageHWPoison(p->page);
> else
> @@ -1853,7 +1933,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> struct llist_head *head;
> struct raw_hwp_page *raw_hwp;
> struct llist_node *t, *tnode;
> - int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
> + bool has_hwpoison = folio_test_set_hwpoison(folio);
> + bool hgm_enabled = IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING);
>
> /*
> * Once the hwpoison hugepage has lost reliable raw error info,
> @@ -1873,9 +1954,20 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_ATOMIC);
> if (raw_hwp) {
> raw_hwp->page = page;
> + raw_hwp->nr_vmas_mapped = 0;
> + raw_hwp->nr_expected_unmaps = 0;
> + raw_hwp->nr_actual_unmaps = 0;
> llist_add(&raw_hwp->node, head);
> + if (hgm_enabled)
> + /*
> + * A new raw poisoned page. Don't return
> + * HWPOISON. Error event will be counted

(Sorry nitpicking ...)
"Don't return -EHWPOISON" ?

> + * in action_result().
> + */
> + return 0;
> +
> /* the first error event will be counted in action_result(). */
> - if (ret)
> + if (has_hwpoison)
> num_poisoned_pages_inc(page_to_pfn(page));
> } else {
> /*
> @@ -1889,8 +1981,16 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> * used any more, so free it.
> */
> __folio_free_raw_hwp(folio, false);
> +
> + /*
> + * HGM relies on raw_hwp allocated and inserted to raw_hwp_list.
> + */
> + if (hgm_enabled)
> + return -ENOMEM;
> }
> - return ret;
> +
> + BUG_ON(hgm_enabled);
> + return has_hwpoison ? -EHWPOISON : 0;
> }
>
> static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
> @@ -1936,6 +2036,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> struct page *page = pfn_to_page(pfn);
> struct folio *folio = page_folio(page);
> int ret = 2; /* fallback to normal page handling */
> + int set_page_hwpoison = 0;
> bool count_increased = false;
>
> if (!folio_test_hugetlb(folio))
> @@ -1956,8 +2057,9 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> goto out;
> }
>
> - if (folio_set_hugetlb_hwpoison(folio, page)) {
> - ret = -EHWPOISON;
> + set_page_hwpoison = folio_set_hugetlb_hwpoison(folio, page);
> + if (set_page_hwpoison) {
> + ret = set_page_hwpoison;

This adds a new error return code, so could you update the comment on this function?

Thanks,
Naoya Horiguchi

> goto out;
> }
>
> @@ -2004,7 +2106,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
> res = kill_accessing_process(current, folio_pfn(folio), flags);
> }
> return res;
> - } else if (res == -EBUSY) {
> + } else if (res == -EBUSY || res == -ENOMEM) {
> if (!(flags & MF_NO_RETRY)) {
> flags |= MF_NO_RETRY;
> goto retry;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d3bc81466902..4cfaa34b001e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1453,6 +1453,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> struct mmu_notifier_range range;
> enum ttu_flags flags = (enum ttu_flags)(long)arg;
> bool page_poisoned;
> + bool hgm_eligible = hugetlb_hgm_eligible(vma);
> + struct raw_hwp_page *hwp_page;
>
> /*
> * When racing against e.g. zap_pte_range() on another cpu,
> @@ -1525,6 +1527,29 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * in the case where the hugetlb page is poisoned.
> */
> VM_BUG_ON_FOLIO(!page_poisoned, folio);
> +
> + /*
> + * When VMA is not HGM eligible, unmap at hugepage's
> + * original P*D.
> + *
> + * When HGM is eligible:
> + * - if original P*D is split to smaller P*Ds and
> + * PTEs, we skip subpage if it is not raw HWPoison
> + * page, or it was but was already unmapped.
> + * - if original P*D is not split, skip unmapping
> + * and memory_failure result will be MF_IGNORED.
> + */
> + if (hgm_eligible) {
> + if (pvmw.pte_order > 0)
> + continue;
> + hwp_page = find_in_raw_hwp_list(folio, subpage);
> + if (hwp_page == NULL)
> + continue;
> + if (hwp_page->nr_expected_unmaps ==
> + hwp_page->nr_actual_unmaps)
> + continue;
> + }
> +
> /*
> * huge_pmd_unshare may unmap an entire PMD page.
> * There is no way of knowing exactly which PMDs may
> @@ -1760,12 +1785,19 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> *
> * See Documentation/mm/mmu_notifier.rst
> */
> - if (folio_test_hugetlb(folio))
> + if (!folio_test_hugetlb(folio))
> + page_remove_rmap(subpage, vma, false);
> + else {
> hugetlb_remove_rmap(subpage,
> pvmw.pte_order + PAGE_SHIFT,
> hstate_vma(vma), vma);
> - else
> - page_remove_rmap(subpage, vma, false);
> + if (hgm_eligible) {
> + VM_BUG_ON_FOLIO(pvmw.pte_order > 0, folio);
> + VM_BUG_ON_FOLIO(!hwp_page, folio);
> + VM_BUG_ON_FOLIO(subpage != hwp_page->page, folio);
> + ++hwp_page->nr_actual_unmaps;
> + }
> + }
>
> if (vma->vm_flags & VM_LOCKED)
> mlock_drain_local();
> --
> 2.40.1.495.gc816e09b53d-goog
>
>
>