LinuxLists.cc - [PATCH v2 0/5] Enhance soft hwpoison handling and injection

2024-05-10 06:27:00

Subject: [PATCH v2 0/5] Enhance soft hwpoison handling and injection

This series aim at the following enhancement -
- Let one hwpoison injector, that is, madvise(MADV_HWPOISON) to behave
more like as if a real UE occurred. Because the other two injectors
such as hwpoison-inject and the 'einj' on x86 can't, and it seems to
me we need a better simulation to real UE scenario.
- For years, if the kernel is unable to unmap a hwpoisoned page, it send
a SIGKILL instead of SIGBUS to prevent user process from potentially
accessing the page again. But in doing so, the user process also lose
important information: vaddr, for recovery. Fortunately, the kernel
already has code to kill process re-accessing a hwpoisoned page, so
remove the '!unmap_success' check.
- Right now, if a thp page under GUP longterm pin is hwpoisoned, and
kernel cannot split the thp page, memory-failure simply ignores
the UE and returns. That's not ideal, it could deliver a SIGBUS with
useful information for userspace recovery.

Changes in v2:
- rebased to mm-stable as of 5/8/2024
- added RB by Oscar Salvador
- comments from Oscar on patch 1-of-3: clarify changelog
- comments from Miahe Lin on patch 3-of-3: remove unnecessary user page
checking and remove incorrect put_page() in kill_procs_now().
Invoke kill_procs_now() regardless MF_ACTIN_REQUIRED is set or not,
moved hwpoison_filter() higher up.
- added two patches 3-of-5 and 4-of-5

Jane Chu (5):
mm/memory-failure: try to send SIGBUS even if unmap failed
mm/madvise: Add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON)
mm/memory-failure: improve memory failure action_result messages
mm/memory-failure: move hwpoison_filter() higher up
mm/memory-failure: send SIGBUS in the event of thp split fail

include/linux/mm.h | 2 +
include/ras/ras_event.h | 2 +
mm/madvise.c | 2 +-
mm/memory-failure.c | 89 +++++++++++++++++++++++++++--------------
4 files changed, 65 insertions(+), 30 deletions(-)

--
2.39.3

2024-05-10 06:27:11

by Jane Chu

[permalink] [raw]

Subject: [PATCH v2 4/5] mm/memory-failure: move hwpoison_filter() higher up

Move hwpoison_filter() higher up as there is no need to spend a lot
cycles only to find out later that the page is supposed to be skipped
for hwpoison handling.

Signed-off-by: Jane Chu <[email protected]>
---
mm/memory-failure.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 62133c10fb51..2fa884d8b5a3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2236,6 +2236,13 @@ int memory_failure(unsigned long pfn, int flags)
goto unlock_mutex;
}

+ if (hwpoison_filter(p)) {
+ if (flags & MF_COUNT_INCREASED)
+ put_page(p);
+ res = -EOPNOTSUPP;
+ goto unlock_mutex;
+ }
+
try_again:
res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
if (hugetlb)
@@ -2354,14 +2361,6 @@ int memory_failure(unsigned long pfn, int flags)
*/
page_flags = folio->flags;

- if (hwpoison_filter(p)) {
- ClearPageHWPoison(p);
- folio_unlock(folio);
- folio_put(folio);
- res = -EOPNOTSUPP;
- goto unlock_mutex;
- }
-
/*
* __munlock_folio() may clear a writeback folio's LRU flag without
* the folio lock. We need to wait for writeback completion for this
--
2.39.3

2024-05-10 06:27:25

by Jane Chu

[permalink] [raw]

Subject: [PATCH v2 3/5] mm/memory-failure: improve memory failure action_result messages

Added two explicit MF_MSG messages describing failure in get_hwpoison_page.
Attemped to document the definition of various action names, and made a few
adjustment to the action_result() calls.

Signed-off-by: Jane Chu <[email protected]>
---
include/linux/mm.h | 2 ++
include/ras/ras_event.h | 2 ++
mm/memory-failure.c | 28 +++++++++++++++++++++++-----
3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..b4598c6a393a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4111,6 +4111,7 @@ enum mf_action_page_type {
MF_MSG_DIFFERENT_COMPOUND,
MF_MSG_HUGE,
MF_MSG_FREE_HUGE,
+ MF_MSG_GET_HWPOISON,
MF_MSG_UNMAP_FAILED,
MF_MSG_DIRTY_SWAPCACHE,
MF_MSG_CLEAN_SWAPCACHE,
@@ -4124,6 +4125,7 @@ enum mf_action_page_type {
MF_MSG_BUDDY,
MF_MSG_DAX,
MF_MSG_UNSPLIT_THP,
+ MF_MSG_ALREADY_POISONED,
MF_MSG_UNKNOWN,
};

diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index c011ea236e9b..b3f6832a94fe 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -360,6 +360,7 @@ TRACE_EVENT(aer_event,
EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
EM ( MF_MSG_HUGE, "huge page" ) \
EM ( MF_MSG_FREE_HUGE, "free huge page" ) \
+ EM ( MF_MSG_GET_HWPOISON, "get hwpoison page" ) \
EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" ) \
EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" ) \
EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" ) \
@@ -373,6 +374,7 @@ TRACE_EVENT(aer_event,
EM ( MF_MSG_BUDDY, "free buddy page" ) \
EM ( MF_MSG_DAX, "dax page" ) \
EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" ) \
+ EM ( MF_MSG_ALREADY_POISONED, "already poisoned" ) \
EMe ( MF_MSG_UNKNOWN, "unknown page" )

/*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 739311e121af..62133c10fb51 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -879,6 +879,20 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
return ret > 0 ? -EHWPOISON : -EFAULT;
}

+/*
+ * MF_IGNORED - Either the m-f() handler did nothing to recover, or it did
+ * something, then caught in a race condition which renders the effort sort
+ * of moot. This is perhaps the most sever.
+ * MF_FAILED - The m-f() handler might have killed the process, but it can't
+ * do much to the poisoned page other than marking it poisoned, due to
+ * conditions such as extra pin, unmap failure, etc.
+ * MF_DELAYED - The poisoned page has been unmapped but not completely isolated,
+ * such as page might remain in LRU. But an attempt by userspace process
+ * to access the page again will hit page fault which will kill the process.
+ * MF_RECOVERED - The poisoned page has been completely isolated, via unmap,
+ * taking the page out of the buddy system, or hole punching out of the
+ * mapping.
+ */
static const char *action_name[] = {
[MF_IGNORED] = "Ignored",
[MF_FAILED] = "Failed",
@@ -893,6 +907,7 @@ static const char * const action_page_types[] = {
[MF_MSG_DIFFERENT_COMPOUND] = "different compound page after locking",
[MF_MSG_HUGE] = "huge page",
[MF_MSG_FREE_HUGE] = "free huge page",
+ [MF_MSG_GET_HWPOISON] = "get hwpoison page",
[MF_MSG_UNMAP_FAILED] = "unmapping failed page",
[MF_MSG_DIRTY_SWAPCACHE] = "dirty swapcache page",
[MF_MSG_CLEAN_SWAPCACHE] = "clean swapcache page",
@@ -906,6 +921,7 @@ static const char * const action_page_types[] = {
[MF_MSG_BUDDY] = "free buddy page",
[MF_MSG_DAX] = "dax page",
[MF_MSG_UNSPLIT_THP] = "unsplit thp",
+ [MF_MSG_ALREADY_POISONED] = "already poisoned",
[MF_MSG_UNKNOWN] = "unknown page",
};

@@ -1018,7 +1034,7 @@ static int me_unknown(struct page_state *ps, struct page *p)
{
pr_err("%#lx: Unknown page state\n", page_to_pfn(p));
unlock_page(p);
- return MF_FAILED;
+ return MF_IGNORED;
}

/*
@@ -2055,6 +2071,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
if (flags & MF_ACTION_REQUIRED) {
folio = page_folio(p);
res = kill_accessing_process(current, folio_pfn(folio), flags);
+ return action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
}
return res;
} else if (res == -EBUSY) {
@@ -2062,7 +2079,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
flags |= MF_NO_RETRY;
goto retry;
}
- return action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
+ return action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
}

folio = page_folio(p);
@@ -2097,7 +2114,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb

if (!hwpoison_user_mappings(folio, p, pfn, flags)) {
folio_unlock(folio);
- return action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
+ return action_result(pfn, MF_MSG_UNMAP_FAILED, MF_FAILED);
}

return identify_page_state(pfn, p, page_flags);
@@ -2231,6 +2248,7 @@ int memory_failure(unsigned long pfn, int flags)
res = kill_accessing_process(current, pfn, flags);
if (flags & MF_COUNT_INCREASED)
put_page(p);
+ action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
goto unlock_mutex;
}

@@ -2267,7 +2285,7 @@ int memory_failure(unsigned long pfn, int flags)
}
goto unlock_mutex;
} else if (res < 0) {
- res = action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
+ res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
goto unlock_mutex;
}
}
@@ -2363,7 +2381,7 @@ int memory_failure(unsigned long pfn, int flags)
* Abort on fail: __filemap_remove_folio() assumes unmapped page.
*/
if (!hwpoison_user_mappings(folio, p, pfn, flags)) {
- res = action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
+ res = action_result(pfn, MF_MSG_UNMAP_FAILED, MF_FAILED);
goto unlock_page;
}

--
2.39.3

2024-05-10 06:27:37

by Jane Chu

[permalink] [raw]

Subject: [PATCH v2 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail

When handle hwpoison in a RDMA longterm pinned thp page,
try_to_split_thp_page() will fail. And at this point, there is
little else the kernel could do except sending a SIGBUS to
the user process, thus give it a chance to recover.

Signed-off-by: Jane Chu <[email protected]>
---
mm/memory-failure.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2fa884d8b5a3..15bb1c0c42e8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1697,7 +1697,7 @@ static int identify_page_state(unsigned long pfn, struct page *p,
return page_action(ps, p, pfn);
}

-static int try_to_split_thp_page(struct page *page)
+static int try_to_split_thp_page(struct page *page, bool release)
{
int ret;

@@ -1705,7 +1705,7 @@ static int try_to_split_thp_page(struct page *page)
ret = split_huge_page(page);
unlock_page(page);

- if (unlikely(ret))
+ if (ret && release)
put_page(page);

return ret;
@@ -2177,6 +2177,24 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
return rc;
}

+/*
+ * The calling condition is as such: thp split failed, page might have
+ * been RDMA pinned, not much can be done for recovery.
+ * But a SIGBUS should be delivered with vaddr provided so that the user
+ * application has a chance to recover. Also, application processes'
+ * election for MCE early killed will be honored.
+ */
+static int kill_procs_now(struct page *p, unsigned long pfn, int flags,
+ struct folio *folio)
+{
+ LIST_HEAD(tokill);
+
+ collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
+ kill_procs(&tokill, true, pfn, flags);
+
+ return -EHWPOISON;
+}
+
/**
* memory_failure - Handle memory failure of a page.
* @pfn: Page Number of the corrupted page
@@ -2313,8 +2331,11 @@ int memory_failure(unsigned long pfn, int flags)
* page is a valid handlable page.
*/
folio_set_has_hwpoisoned(folio);
- if (try_to_split_thp_page(p) < 0) {
- res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+ if (try_to_split_thp_page(p, false) < 0) {
+ pr_err("%#lx: thp split failed\n", pfn);
+ res = kill_procs_now(p, pfn, flags, folio);
+ put_page(p);
+ res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_FAILED);
goto unlock_mutex;
}
VM_BUG_ON_PAGE(!page_count(p), p);
@@ -2688,7 +2709,7 @@ static int soft_offline_in_use_page(struct page *page)
};

if (!huge && folio_test_large(folio)) {
- if (try_to_split_thp_page(page)) {
+ if (try_to_split_thp_page(page, true)) {
pr_info("soft offline: %#lx: thp split failed\n", pfn);
return -EBUSY;
}
--
2.39.3

2024-05-16 09:47:42

by Oscar Salvador

[permalink] [raw]

Subject: Re: [PATCH v2 3/5] mm/memory-failure: improve memory failure action_result messages

On Fri, May 10, 2024 at 12:26:00AM -0600, Jane Chu wrote:
> Added two explicit MF_MSG messages describing failure in get_hwpoison_page.
> Attemped to document the definition of various action names, and made a few
> adjustment to the action_result() calls.
>
> Signed-off-by: Jane Chu <[email protected]>
..
> +/*
> + * MF_IGNORED - Either the m-f() handler did nothing to recover, or it did
"or if it "
> + * something, then caught in a race condition which renders the effort sort
"it was caught"

I would also add to MF_IGNORED that we mark the page hwpoisoned anyway.

> @@ -1018,7 +1034,7 @@ static int me_unknown(struct page_state *ps, struct page *p)
> {
> pr_err("%#lx: Unknown page state\n", page_to_pfn(p));
> unlock_page(p);
> - return MF_FAILED;
> + return MF_IGNORED;
> }

I was confused because I saw you replaced all MF_MSG_UNKNOWN, so I
wondered how we can end up here until I saw this is a catch-all in case
we fail to make sense of the page->flags.
While you are improving this, I would suggest to add a little comment
above the function explaining how we can reach it.

> /*
> @@ -2055,6 +2071,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
> if (flags & MF_ACTION_REQUIRED) {
> folio = page_folio(p);
> res = kill_accessing_process(current, folio_pfn(folio), flags);
> + return action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);

I do not understand why are you doing this.

First of all, why is this considered a failure? We did not fail this
time, did we? We went right ahead and kill the process which was
re-accessing the hwpoisoned page. Is that considered a failure?

Second, you are know supressing -EHWPOISON with whatever action_result()
will gives us, which judging from the actual code would be -EBUSY?
I do not think that that is right, and we should be returning
-EHWPOISON. Or whatever error code kill_accessing_process() gives us.

> @@ -2231,6 +2248,7 @@ int memory_failure(unsigned long pfn, int flags)
> res = kill_accessing_process(current, pfn, flags);
> if (flags & MF_COUNT_INCREASED)
> put_page(p);
> + action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);

This is not coherent with what you did in try_memory_failure_hugetlb
for MF_MSG_ALREADY_POISONED, I __think__ that in there we should be
doing the same as we do here.

--
Oscar Salvador
SUSE Labs

2024-05-16 10:11:27

by Oscar Salvador

[permalink] [raw]

Subject: Re: [PATCH v2 4/5] mm/memory-failure: move hwpoison_filter() higher up

On Fri, May 10, 2024 at 12:26:01AM -0600, Jane Chu wrote:
> Move hwpoison_filter() higher up as there is no need to spend a lot
> cycles only to find out later that the page is supposed to be skipped
> for hwpoison handling.
>
> Signed-off-by: Jane Chu <[email protected]>
> ---
> mm/memory-failure.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 62133c10fb51..2fa884d8b5a3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2236,6 +2236,13 @@ int memory_failure(unsigned long pfn, int flags)
> goto unlock_mutex;
> }
>
> + if (hwpoison_filter(p)) {
> + if (flags & MF_COUNT_INCREASED)
> + put_page(p);
> + res = -EOPNOTSUPP;
> + goto unlock_mutex;
> + }

Now, it is true that doing this might not be optimal for the reasons
explained by Miaohe, but the whole hwpoison_filter() thing is only used
by the hwpoison-inject code AFAICS, which is just for testing purposes,
so I do not think there is any harm in lifting the check.

But no real strong opinion here.

--
Oscar Salvador
SUSE Labs

2024-05-16 12:47:30

by Oscar Salvador

[permalink] [raw]

Subject: Re: [PATCH v2 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail

On Fri, May 10, 2024 at 12:26:02AM -0600, Jane Chu wrote:
> When handle hwpoison in a RDMA longterm pinned thp page,
> try_to_split_thp_page() will fail. And at this point, there is
> little else the kernel could do except sending a SIGBUS to
> the user process, thus give it a chance to recover.

Well, it does need to be a RDMA longterm pinned, right?
Anything holding an extra refcount can already make us bite the dust, so
I would not make it that specific.

> Signed-off-by: Jane Chu <[email protected]>
> ---
> mm/memory-failure.c | 31 ++++++++++++++++++++++++++-----
> 1 file changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 2fa884d8b5a3..15bb1c0c42e8 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1697,7 +1697,7 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> return page_action(ps, p, pfn);
> }
>
> -static int try_to_split_thp_page(struct page *page)
> +static int try_to_split_thp_page(struct page *page, bool release)
> {
> int ret;
>
> @@ -1705,7 +1705,7 @@ static int try_to_split_thp_page(struct page *page)
> ret = split_huge_page(page);
> unlock_page(page);
>
> - if (unlikely(ret))
> + if (ret && release)
> put_page(page);

I would document whhen and when not we can release the page.
E.g: we cannot release it if there are still processes mapping the thp.

> +static int kill_procs_now(struct page *p, unsigned long pfn, int flags,
> + struct folio *folio)
> +{
> + LIST_HEAD(tokill);
> +
> + collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
> + kill_procs(&tokill, true, pfn, flags);
> +
> + return -EHWPOISON;

You are returning -EHWPOISON here,

> +}
> +
> /**
> * memory_failure - Handle memory failure of a page.
> * @pfn: Page Number of the corrupted page
> @@ -2313,8 +2331,11 @@ int memory_failure(unsigned long pfn, int flags)
> * page is a valid handlable page.
> */
> folio_set_has_hwpoisoned(folio);
> - if (try_to_split_thp_page(p) < 0) {
> - res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> + if (try_to_split_thp_page(p, false) < 0) {
> + pr_err("%#lx: thp split failed\n", pfn);
> + res = kill_procs_now(p, pfn, flags, folio);
> + put_page(p);
> + res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_FAILED);

just to overwrite it here with action_result(). Which one do we need?
I think we would need -EBUSY here, right? So I would drop the retcode
from kill_procs_now.

Also, do we want the extra pr_err() here.
action_result() will already provide us the pfn and the
action_page_types which will be "unsplit thp". Is not that clear enough?

I would drop that.

--
Oscar Salvador
SUSE Labs

2024-05-20 18:16:20

by Jane Chu

[permalink] [raw]

Subject: Re: [PATCH v2 4/5] mm/memory-failure: move hwpoison_filter() higher up

On 5/11/2024 1:29 AM, Miaohe Lin wrote:

> On 2024/5/10 14:26, Jane Chu wrote:
>> Move hwpoison_filter() higher up as there is no need to spend a lot
>> cycles only to find out later that the page is supposed to be skipped
>> for hwpoison handling.
>>
>> Signed-off-by: Jane Chu <[email protected]>
>> ---
>> mm/memory-failure.c | 15 +++++++--------
>> 1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 62133c10fb51..2fa884d8b5a3 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2236,6 +2236,13 @@ int memory_failure(unsigned long pfn, int flags)
>> goto unlock_mutex;
>> }
>>
>> + if (hwpoison_filter(p)) {
>> + if (flags & MF_COUNT_INCREASED)
>> + put_page(p);
>> + res = -EOPNOTSUPP;
>> + goto unlock_mutex;
>> + }
> It might not be a good idea to do hwpoison_filter() here. We don't hold extra page refcnt
> yet, so the page state will be really unstable. Or am I miss something?

I agree with you.

It looks like hwpoison_filter_flags() in particular needs a stable page
in order to retrieve

a wholesome KPF_ flags set that at any time, although the flags could
change immediately

afterwards, they won't be torn flags. For that, it looks like the folio
should be locked as well.

thanks!

-jane

> Thanks.
> .

2024-05-20 18:31:13

by Jane Chu

[permalink] [raw]

Subject: Re: [PATCH v2 3/5] mm/memory-failure: improve memory failure action_result messages

On 5/16/2024 2:46 AM, Oscar Salvador wrote:

> On Fri, May 10, 2024 at 12:26:00AM -0600, Jane Chu wrote:
>> Added two explicit MF_MSG messages describing failure in get_hwpoison_page.
>> Attemped to document the definition of various action names, and made a few
>> adjustment to the action_result() calls.
>>
>> Signed-off-by: Jane Chu <[email protected]>
> ...
>> +/*
>> + * MF_IGNORED - Either the m-f() handler did nothing to recover, or it did
> "or if it"
>> + * something, then caught in a race condition which renders the effort sort
> "it was caught"
>
> I would also add to MF_IGNORED that we mark the page hwpoisoned anyway.
Thanks! Will fix, and add comments.
>
>> @@ -1018,7 +1034,7 @@ static int me_unknown(struct page_state *ps, struct page *p)
>> {
>> pr_err("%#lx: Unknown page state\n", page_to_pfn(p));
>> unlock_page(p);
>> - return MF_FAILED;
>> + return MF_IGNORED;
>> }
> I was confused because I saw you replaced all MF_MSG_UNKNOWN, so I
> wondered how we can end up here until I saw this is a catch-all in case
> we fail to make sense of the page->flags.
> While you are improving this, I would suggest to add a little comment
> above the function explaining how we can reach it.
Yes, it's a catch-all versus something that suppose to happen, will add
comments.
>
>> /*
>> @@ -2055,6 +2071,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>> if (flags & MF_ACTION_REQUIRED) {
>> folio = page_folio(p);
>> res = kill_accessing_process(current, folio_pfn(folio), flags);
>> + return action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
> I do not understand why are you doing this.
>
> First of all, why is this considered a failure? We did not fail this
> time, did we? We went right ahead and kill the process which was
> re-accessing the hwpoisoned page. Is that considered a failure?

Given that the goal for the m-f() handler is to isolate the poisoned
page, so in this case,

even if the kernel has killed the page, nothing else could be done to
prevent the page from

being accessed and triggering another MCE, which is, I suppose more
sever than triggering a page fault.

>
> Second, you are know supressing -EHWPOISON with whatever action_result()
> will gives us, which judging from the actual code would be -EBUSY?
> I do not think that that is right, and we should be returning
> -EHWPOISON. Or whatever error code kill_accessing_process() gives us.
>
>
>> @@ -2231,6 +2248,7 @@ int memory_failure(unsigned long pfn, int flags)
>> res = kill_accessing_process(current, pfn, flags);
>> if (flags & MF_COUNT_INCREASED)
>> put_page(p);
>> + action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
> This is not coherent with what you did in try_memory_failure_hugetlb
> for MF_MSG_ALREADY_POISONED, I __think__ that in there we should be
> doing the same as we do here.
>
>

Good catch, thanks. In both cases, 'res' from kill_accessing_process()
should be returned.

Thanks!

-jane

>

2024-05-20 18:49:27

by Jane Chu

[permalink] [raw]

Subject: Re: [PATCH v2 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail

On 5/16/2024 5:47 AM, Oscar Salvador wrote:

> On Fri, May 10, 2024 at 12:26:02AM -0600, Jane Chu wrote:
>> When handle hwpoison in a RDMA longterm pinned thp page,
>> try_to_split_thp_page() will fail. And at this point, there is
>> little else the kernel could do except sending a SIGBUS to
>> the user process, thus give it a chance to recover.
> Well, it does need to be a RDMA longterm pinned, right?
> Anything holding an extra refcount can already make us bite the dust, so
> I would not make it that specific.

How about let me just mention RDMA longterm pin as one of the use cases?

To be honest, it is the only known case to me, and not all FOLL_LONGTERM
pin

lead to THP split failure.

>
>> Signed-off-by: Jane Chu <[email protected]>
>> ---
>> mm/memory-failure.c | 31 ++++++++++++++++++++++++++-----
>> 1 file changed, 26 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 2fa884d8b5a3..15bb1c0c42e8 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1697,7 +1697,7 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>> return page_action(ps, p, pfn);
>> }
>>
>> -static int try_to_split_thp_page(struct page *page)
>> +static int try_to_split_thp_page(struct page *page, bool release)
>> {
>> int ret;
>>
>> @@ -1705,7 +1705,7 @@ static int try_to_split_thp_page(struct page *page)
>> ret = split_huge_page(page);
>> unlock_page(page);
>>
>> - if (unlikely(ret))
>> + if (ret && release)
>> put_page(page);
> I would document whhen and when not we can release the page.
> E.g: we cannot release it if there are still processes mapping the thp.
Sure.
>
>> +static int kill_procs_now(struct page *p, unsigned long pfn, int flags,
>> + struct folio *folio)
>> +{
>> + LIST_HEAD(tokill);
>> +
>> + collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
>> + kill_procs(&tokill, true, pfn, flags);
>> +
>> + return -EHWPOISON;
> You are returning -EHWPOISON here,

Yes, indeed.

>> +}
>> +
>> /**
>> * memory_failure - Handle memory failure of a page.
>> * @pfn: Page Number of the corrupted page
>> @@ -2313,8 +2331,11 @@ int memory_failure(unsigned long pfn, int flags)
>> * page is a valid handlable page.
>> */
>> folio_set_has_hwpoisoned(folio);
>> - if (try_to_split_thp_page(p) < 0) {
>> - res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
>> + if (try_to_split_thp_page(p, false) < 0) {
>> + pr_err("%#lx: thp split failed\n", pfn);
>> + res = kill_procs_now(p, pfn, flags, folio);
>> + put_page(p);
>> + res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_FAILED);
> just to overwrite it here with action_result(). Which one do we need?
> I think we would need -EBUSY here, right? So I would drop the retcode
> from kill_procs_now.

The overwrite was wrong, it should return -EHWPOISON to indicate to the
caller (such as kill_me_maybe)

that no further action against the process is needed since the m-f()
handler has killed the process.

>
> Also, do we want the extra pr_err() here.
> action_result() will already provide us the pfn and the
> action_page_types which will be "unsplit thp". Is not that clear enough?
>
> I would drop that.

Agreed, will drop the extra print.

thanks!

-jane

>
>