2023-09-14 06:56:40

by Zhiquan Li

[permalink] [raw]
Subject: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

Kdump can exclude the HWPosion page to avoid touch the error page
again, the prerequisite is the PG_hwpoison page flag is set.
However, for some MCE fatal error cases, there is no opportunity
to queue a task for calling memory_failure(), as a result,
the capture kernel touches the error page again and panics.

Add function mce_set_page_hwpoison_now() which marks a page as
HWPoison before kernel panic() for MCE error, so that the dump
program can check and skip the error page and prevent the capture
kernel panic.

[Tony: Changed TestSetPageHWPoison() to SetPageHWPoison()]

Co-developed-by: Youquan Song <[email protected]>
Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Zhiquan Li <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>

---
V2 RESEND notes:
- No changes on this, just rebasing as v6.6-rc1 is out.
- Added the tag from Naoya.
Link: https://lore.kernel.org/all/[email protected]/#t

Changes since V1:
- Revised the commit message as per Naoya's suggestion.
- Replaced "TODO" comment in code with comments based on mailing list
discussion on the lack of value in covering other page types.
Link: https://lore.kernel.org/all/[email protected]/
---
arch/x86/kernel/cpu/mce/core.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..2725698268f3 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -156,6 +156,22 @@ void mce_unregister_decode_chain(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);

+/*
+ * Kdump can exclude the HWPosion page to avoid touch the error page again,
+ * the prerequisite is the PG_hwpoison page flag is set. However, for some
+ * MCE fatal error cases, there are no opportunity to queue a task
+ * for calling memory_failure(), as a result, the capture kernel panics.
+ * This function marks the page as HWPoison before kernel panic() for MCE.
+ */
+static void mce_set_page_hwpoison_now(unsigned long pfn)
+{
+ struct page *p;
+
+ p = pfn_to_online_page(pfn);
+ if (p)
+ SetPageHWPoison(p);
+}
+
static void __print_mce(struct mce *m)
{
pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
@@ -286,6 +302,8 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (!fake_panic) {
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
+ if (final && (final->status & MCI_STATUS_ADDRV))
+ mce_set_page_hwpoison_now(final->addr >> PAGE_SHIFT);
panic(msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
--
2.25.1


2023-10-02 21:58:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic


* Zhiquan Li <[email protected]> wrote:

> Kdump can exclude the HWPosion page to avoid touch the error page
> again, the prerequisite is the PG_hwpoison page flag is set.
> However, for some MCE fatal error cases, there is no opportunity
> to queue a task for calling memory_failure(), as a result,
> the capture kernel touches the error page again and panics.
>
> Add function mce_set_page_hwpoison_now() which marks a page as
> HWPoison before kernel panic() for MCE error, so that the dump
> program can check and skip the error page and prevent the capture
> kernel panic.
>
> [Tony: Changed TestSetPageHWPoison() to SetPageHWPoison()]
>
> Co-developed-by: Youquan Song <[email protected]>
> Signed-off-by: Youquan Song <[email protected]>
> Signed-off-by: Zhiquan Li <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Naoya Horiguchi <[email protected]>
>
> ---
> V2 RESEND notes:
> - No changes on this, just rebasing as v6.6-rc1 is out.
> - Added the tag from Naoya.
> Link: https://lore.kernel.org/all/[email protected]/#t
>
> Changes since V1:
> - Revised the commit message as per Naoya's suggestion.
> - Replaced "TODO" comment in code with comments based on mailing list
> discussion on the lack of value in covering other page types.
> Link: https://lore.kernel.org/all/[email protected]/
> ---
> arch/x86/kernel/cpu/mce/core.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 6f35f724cc14..2725698268f3 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -156,6 +156,22 @@ void mce_unregister_decode_chain(struct notifier_block *nb)
> }
> EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);
>
> +/*
> + * Kdump can exclude the HWPosion page to avoid touch the error page again,
> + * the prerequisite is the PG_hwpoison page flag is set. However, for some
> + * MCE fatal error cases, there are no opportunity to queue a task
> + * for calling memory_failure(), as a result, the capture kernel panics.
> + * This function marks the page as HWPoison before kernel panic() for MCE.
> + */

The English in this commit is *atrocious*, both in the changelog and in
the comments - how on Earth did 'Posion' typo and half a dozen other typos
and bad grammar survive ~3 iterations and a Reviewed-by tag??

The version below fixes up the worst, but I suspect that's not the only problem
with this patch...

Thanks,

Ingo

================>
From: Zhiquan Li <[email protected]>
Date: Thu, 14 Sep 2023 11:05:39 +0800
Subject: [PATCH] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

Kdump can exclude the HWPoison page to avoid touching the error page
again, the prerequisite is the PG_hwpoison page flag is set.

However, for some MCE fatal error cases, there is no opportunity
to queue a task for calling memory_failure(), and as a result,
the capture kernel touches the error page again and panics.

Add the mce_set_page_hwpoison_now() function, which marks a page as
HWPoison before kernel panic() for MCE error, so that the dump
program can check and skip the error page and prevent the capture
kernel panic.

[ Tony: Changed TestSetPageHWPoison() to SetPageHWPoison() ]
[ mingo: Fixed the comments & changelog ]

Co-developed-by: Youquan Song <[email protected]>
Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Zhiquan Li <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Cc: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/#t
---
arch/x86/kernel/cpu/mce/core.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..1a14e8233c5a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -156,6 +156,22 @@ void mce_unregister_decode_chain(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);

+/*
+ * Kdump can exclude the HWPoison page to avoid touching the error page again,
+ * the prerequisite is that the PG_hwpoison page flag is set. However, for some
+ * MCE fatal error cases, there is no opportunity to queue a task
+ * for calling memory_failure(), and as a result, the capture kernel panics.
+ * This function marks the page as HWPoison before kernel panic() for MCE.
+ */
+static void mce_set_page_hwpoison_now(unsigned long pfn)
+{
+ struct page *p;
+
+ p = pfn_to_online_page(pfn);
+ if (p)
+ SetPageHWPoison(p);
+}
+
static void __print_mce(struct mce *m)
{
pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
@@ -286,6 +302,8 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (!fake_panic) {
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
+ if (final && (final->status & MCI_STATUS_ADDRV))
+ mce_set_page_hwpoison_now(final->addr >> PAGE_SHIFT);
panic(msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);

2023-10-10 00:38:25

by Zhiquan Li

[permalink] [raw]
Subject: Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic


On 2023/10/3 03:06, Ingo Molnar wrote:
> The English in this commit is *atrocious*, both in the changelog and in
> the comments - how on Earth did 'Posion' typo and half a dozen other
> typos and bad grammar survive ~3 iterations and a Reviewed-by tag?? The
> version below fixes up the worst, but I suspect that's not the only
> problem with this patch...

Many thanks for your attention and fixes up, Ingo.

I’d like to introduce more background of this patch.

Memory errors don’t happen very often, especially the severity is fatal.
However, in large-scale scenarios, such as data centers, it might still
happen. For some MCE fatal error cases, the kernel might call
mce_panic() to terminate the production kernel directly, but not try to
make the kernel survive via memory_failure() handling. Unfortunately,
the capture kernel will panic for the same reason if it touches the
error memory again. The consequence is that only an incomplete vmcore
is left for sustaining engineers, it’s a big headache for them to make
clear what happened in the past.


We had considered 3 solutions and finally chose the last one.

1. When the capture kernel boots up, re-scans the MCE banks to check if
there are fatal errors, set the PG_hwpoison flag for each error
pages.
We can foresee this solution is heavy. It needs to find the struct
page of error pages from old memory and set the flag. Looks like we
need to remake the wheel, so we gave up it.

2. Replace the function copy_to_iter() at __copy_oldmem_page() with the
function _copy_mc_to_iter(), which is a #MC safe version.
This solution is lightweight but has following drawbacks:

1) Such issues are quite rare events; we don’t want to use a #MC safe
copy to accommodate it. Especially, if the problem can be deal
with by MCE handling rather than touching the Kdump stuff.

2) The #MC safe copy is conditionally, whether it can fix the #MC
error depends on MCE handling can reach the fixup_exception()
function at do_machine_check(). However, in fatal error case, it
might invoke mce_panic() to crash the capture kernel earlier than
fixing up the error.

3. The solution in this patch overcomes all above drawbacks. It set the
flag just before the production kernel calls panic(), which would not
introduce additional overhead in capture kernel or conflict with
other hwpoision-related code in production kernel. Furthermore, it
leverages the already existing mechanisms to fix the issue as much as
possible, the code changes are also lightweight.


To verify the fix is not difficult. The issue can be simulated by
ras-tools
(https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git),
"copyout" test case. It can inject a fatal memory error in kernel space
via APEI ENIJ interface (need hardware platform support), and then it
touches the error page to produce the issue. The patch has been
validated by this tool.

Any idea is welcome!

Best Regards,
Zhiquan

2023-10-10 08:26:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

On Tue, Oct 10, 2023 at 08:56:38AM +0800, Zhiquan Li wrote:
> 3. The solution in this patch overcomes all above drawbacks. It set the
> flag just before the production kernel calls panic(), which would not
> introduce additional overhead in capture kernel or conflict with
> other hwpoision-related code in production kernel. Furthermore, it
> leverages the already existing mechanisms to fix the issue as much as
> possible, the code changes are also lightweight.

And the kdump kernel reads the page poison flags and so it avoids
copying them?

More comments as a reply to the patch.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-10 08:29:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

On Thu, Sep 14, 2023 at 11:05:39AM +0800, Zhiquan Li wrote:
> Kdump can exclude the HWPosion page to avoid touch the error page
> again, the prerequisite is the PG_hwpoison page flag is set.
> However, for some MCE fatal error cases, there is no opportunity
> to queue a task for calling memory_failure(), as a result,
> the capture kernel touches the error page again and panics.
>
> Add function mce_set_page_hwpoison_now() which marks a page as
> HWPoison before kernel panic() for MCE error, so that the dump
> program can check and skip the error page and prevent the capture
> kernel panic.

This commit message should explain the full scenario, like you did in
your other reply.

Also explain how the poison flag is consumed by the kdump kernel and put
that in the comment below.

> [Tony: Changed TestSetPageHWPoison() to SetPageHWPoison()]
>
> Co-developed-by: Youquan Song <[email protected]>
> Signed-off-by: Youquan Song <[email protected]>
> Signed-off-by: Zhiquan Li <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>

What does Tony's SOB mean here?

If I read it correctly, it is him sending this patch now. But you're
sending it so you folks need to read up on SOB chains.

> Reviewed-by: Naoya Horiguchi <[email protected]>
>
> ---
> V2 RESEND notes:
> - No changes on this, just rebasing as v6.6-rc1 is out.
> - Added the tag from Naoya.
> Link: https://lore.kernel.org/all/[email protected]/#t
>
> Changes since V1:
> - Revised the commit message as per Naoya's suggestion.
> - Replaced "TODO" comment in code with comments based on mailing list
> discussion on the lack of value in covering other page types.
> Link: https://lore.kernel.org/all/[email protected]/
> ---
> arch/x86/kernel/cpu/mce/core.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 6f35f724cc14..2725698268f3 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -156,6 +156,22 @@ void mce_unregister_decode_chain(struct notifier_block *nb)
> }
> EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);
>
> +/*
> + * Kdump can exclude the HWPosion page to avoid touch the error page again,
> + * the prerequisite is the PG_hwpoison page flag is set. However, for some
> + * MCE fatal error cases, there are no opportunity to queue a task
> + * for calling memory_failure(), as a result, the capture kernel panics.
> + * This function marks the page as HWPoison before kernel panic() for MCE.
> + */
> +static void mce_set_page_hwpoison_now(unsigned long pfn)
> +{
> + struct page *p;
> +
> + p = pfn_to_online_page(pfn);
> + if (p)
> + SetPageHWPoison(p);
> +}

there's no need for that function - just put everything...

> +
> static void __print_mce(struct mce *m)
> {
> pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
> @@ -286,6 +302,8 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
> if (!fake_panic) {
> if (panic_timeout == 0)
> panic_timeout = mca_cfg.panic_timeout;
> + if (final && (final->status & MCI_STATUS_ADDRV))
> + mce_set_page_hwpoison_now(final->addr >> PAGE_SHIFT);

... here, along with the comment.

> panic(msg);
> } else
> pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
> --

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-11 02:42:32

by Zhiquan Li

[permalink] [raw]
Subject: Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic


On 2023/10/10 16:28, Borislav Petkov wrote:
> This commit message should explain the full scenario, like you did in
> your other reply.
>

Thanks for your review, Boris!

I'll improve the commit message in V3 as you said. Just adding the full
scenario part, the paragraphs to introduce the considerations for the 3
solutions and how to validate the patch are unnecessary, right?

> Also explain how the poison flag is consumed by the kdump kernel and put
> that in the comment below.
>

Aha, this is the neat thing about the patch. The main task of kdump
kernel is providing a "window" - /proc/vmcore, for the dump program to
access old memory. A dump program running in userspace determines the
"policy". Which pages need to be dumped is determined by the
configuration of dump program, it reads out the pages that the
sustaining engineer is interested in and excludes the rest. The de
facto dump program (makedumpfile) already supports to identify those
poisoned pages and exclude them a decade ago:

https://github.com/makedumpfile/makedumpfile/commit/030800d88d4baca5f60ade0acc1846a65608883c

That's why I said the solution 1 is remaking the wheels, scanning MCE
banks, checking the poison flag, and excluding error pages are
duplicated actions. To set the HWPosion flag in the production kernel
before panics is the only missing step to make everything work.


>> [Tony: Changed TestSetPageHWPoison() to SetPageHWPoison()]
>>
>> Co-developed-by: Youquan Song <[email protected]>
>> Signed-off-by: Youquan Song <[email protected]>
>> Signed-off-by: Zhiquan Li <[email protected]>
>> Signed-off-by: Tony Luck <[email protected]>
> What does Tony's SOB mean here?
>
> If I read it correctly, it is him sending this patch now. But you're
> sending it so you folks need to read up on SOB chains.
>

When we were developing the patch internally, Tony contributed a lot of
precious ideas and guidance, not only the code change he mentioned in
commit message.

The previous V2 sent by Tony missed the merge window of v6.5, so I
re-based it onto the latest v6.6 rc, re-validated and re-send the patch.
And I will follow up the feedback from community.


>> Reviewed-by: Naoya Horiguchi <[email protected]>
>>
>> ---
>> V2 RESEND notes:
>> - No changes on this, just rebasing as v6.6-rc1 is out.
>> - Added the tag from Naoya.
>> Link: https://lore.kernel.org/all/[email protected]/#t
>>
>> Changes since V1:
>> - Revised the commit message as per Naoya's suggestion.
>> - Replaced "TODO" comment in code with comments based on mailing list
>> discussion on the lack of value in covering other page types.
>> Link: https://lore.kernel.org/all/[email protected]/
>> ---
>> arch/x86/kernel/cpu/mce/core.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index 6f35f724cc14..2725698268f3 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -156,6 +156,22 @@ void mce_unregister_decode_chain(struct notifier_block *nb)
>> }
>> EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);
>>
>> +/*
>> + * Kdump can exclude the HWPosion page to avoid touch the error page again,
>> + * the prerequisite is the PG_hwpoison page flag is set. However, for some
>> + * MCE fatal error cases, there are no opportunity to queue a task
>> + * for calling memory_failure(), as a result, the capture kernel panics.
>> + * This function marks the page as HWPoison before kernel panic() for MCE.
>> + */
>> +static void mce_set_page_hwpoison_now(unsigned long pfn)
>> +{
>> + struct page *p;
>> +
>> + p = pfn_to_online_page(pfn);
>> + if (p)
>> + SetPageHWPoison(p);
>> +}
> there's no need for that function - just put everything...
>
>> +
>> static void __print_mce(struct mce *m)
>> {
>> pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
>> @@ -286,6 +302,8 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>> if (!fake_panic) {
>> if (panic_timeout == 0)
>> panic_timeout = mca_cfg.panic_timeout;
>> + if (final && (final->status & MCI_STATUS_ADDRV))
>> + mce_set_page_hwpoison_now(final->addr >> PAGE_SHIFT);
> ... here, along with the comment.
>

Good idea. I'll send V3 as you said.


Best Regards,
Zhiquan

2023-10-12 14:57:29

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

On Wed, Oct 11, 2023 at 11:00:57AM +0800, Zhiquan Li wrote:
> Aha, this is the neat thing about the patch. The main task of kdump
> kernel is providing a "window" - /proc/vmcore, for the dump program to
> access old memory. A dump program running in userspace determines the
> "policy". Which pages need to be dumped is determined by the
> configuration of dump program, it reads out the pages that the
> sustaining engineer is interested in and excludes the rest. The de
> facto dump program (makedumpfile) already supports to identify those
> poisoned pages and exclude them a decade ago:

Yes, put that in your commit message.

> When we were developing the patch internally, Tony contributed a lot of
> precious ideas and guidance, not only the code change he mentioned in
> commit message.
>
> The previous V2 sent by Tony missed the merge window of v6.5, so I
> re-based it onto the latest v6.6 rc, re-validated and re-send the patch.
> And I will follow up the feedback from community.

Then you should ask Tony whether he wants Co-developed-by:. See

Documentation/process/submitting-patches.rst

for detail.

There it is also explained what an SOB chain is and how it should look
like.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-13 00:07:55

by Zhiquan Li

[permalink] [raw]
Subject: Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

On 2023/10/12 22:57, Borislav Petkov wrote:
> Then you should ask Tony whether he wants Co-developed-by:. See
> Documentation/process/submitting-patches.rst for detail. There it is
> also explained what an SOB chain is and how it should look like.

Thanks for your guidance, Boris.

Let me check with him before send V3.

Best Regards,
Zhiquan