Memory errors don't happen very often, especially the severity is fatal.
However, in large-scale scenarios, such as data centers, it might still
happen. For some MCE fatal error cases, the kernel might call
mce_panic() to terminate the production kernel directly, thus there is
no opportunity to queue a task for calling memory_failure() which will
try to make the kernel survive via memory failure handling.
Unfortunately, the capture kernel will panic for the same reason if it
touches the error memory again. The consequence is that only an
incomplete vmcore is left for sustaining engineers, it's a big headache
for them to make clear what happened in the past.
The main task of kdump kernel is providing a "window" - /proc/vmcore,
for the dump program to access old memory. A dump program running in
userspace determines the "policy". Which pages need to be dumped is
determined by the configuration of dump program, it reads out the pages
that the sustaining engineer is interested in and excludes the rest.
Likewise, the dump program can exclude the poisoned page to avoid
touching the error page again, the prerequisite is the PG_hwpoison page
flag is set correctly by kernel. The de facto dump program
(makedumpfile) already supports this function in a decade ago. To set
the PG_hwpoison flag in the production kernel just before it panics is
the only missing step to make everything work.
And it would not introduce additional overhead in capture kernel or
conflict with other hwpoision-related code in production kernel. It
leverages the already existing mechanisms to fix the issue as much as
possible, so the code changes are lightweight.
[ Tony: Changed TestSetPageHWPoison() to SetPageHWPoison() ]
[ mingo: Fixed the comments & changelog ]
Co-developed-by: Youquan Song <[email protected]>
Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Zhiquan Li <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Cc: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/#t
---
V2: https://lore.kernel.org/all/[email protected]/
Changes since V2:
- Rebased to v6.6-rc5.
- Explained full scenario in commit message per Boris's suggestion.
- Included Ingo's fixes.
Link: https://lore.kernel.org/all/ZRsUpM%[email protected]/
V1: https://lore.kernel.org/all/[email protected]/
Changes since V1:
- Revised the commit message as per Naoya's suggestion.
- Replaced "TODO" comment in code with comments based on mailing list
discussion on the lack of value in covering other page types.
- Added the tag from Naoya.
Link: https://lore.kernel.org/all/[email protected]/
---
arch/x86/kernel/cpu/mce/core.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..905e80c776b8 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -233,6 +233,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
struct llist_node *pending;
struct mce_evt_llist *l;
int apei_err = 0;
+ struct page *p;
/*
* Allow instrumentation around external facilities usage. Not that it
@@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (!fake_panic) {
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
+ /*
+ * Kdump can exclude the HWPoison page to avoid touching the error
+ * page again, the prerequisite is that the PG_hwpoison page flag is
+ * set. However, for some MCE fatal error cases, there is no
+ * opportunity to queue a task for calling memory_failure(), and as a
+ * result, the capture kernel panics. So mark the page as HWPoison
+ * before kernel panic() for MCE.
+ */
+ p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
+ if (final && (final->status & MCI_STATUS_ADDRV) && p)
+ SetPageHWPoison(p);
panic(msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
--
2.25.1
> Co-developed-by: Youquan Song <[email protected]>
> Signed-off-by: Youquan Song <[email protected]>
> Signed-off-by: Zhiquan Li <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> Reviewed-by: Naoya Horiguchi <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Link: https://lore.kernel.org/all/[email protected]/#t
This still has problems. You should have removed my Signed-off-by. You should NOT
have added Ingo's Signed-off-by (the only time to add someone else's Signed-off-by
is when paired with a Co-developed-by).
-Tony
On 2023/10/14 13:12, Luck, Tony wrote:
>> Co-developed-by: Youquan Song <[email protected]>
>> Signed-off-by: Youquan Song <[email protected]>
>> Signed-off-by: Zhiquan Li <[email protected]>
>> Signed-off-by: Tony Luck <[email protected]>
>> Signed-off-by: Ingo Molnar <[email protected]>
>> Reviewed-by: Naoya Horiguchi <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Link: https://lore.kernel.org/all/[email protected]/#t
>
> This still has problems. You should have removed my Signed-off-by. You should NOT
> have added Ingo's Signed-off-by (the only time to add someone else's Signed-off-by
> is when paired with a Co-developed-by).
>
Oh, this is V3, not RESEND, I should reset the SoB chain.
Thanks for your reminder, Tony.
If my understanding is correct, the version below fixes the tag list.
Best Regards,
Zhiquan
========>
From: Zhiquan Li <[email protected]>
Date: Sat, 14 Oct 2023 13:03:17 -0600
Subject: [PATCH v3] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic
Memory errors don't happen very often, especially the severity is fatal.
However, in large-scale scenarios, such as data centers, it might still
happen. For some MCE fatal error cases, the kernel might call
mce_panic() to terminate the production kernel directly, thus there is
no opportunity to queue a task for calling memory_failure() which will
try to make the kernel survive via memory failure handling.
Unfortunately, the capture kernel will panic for the same reason if it
touches the error memory again. The consequence is that only an
incomplete vmcore is left for sustaining engineers, it's a big headache
for them to make clear what happened in the past.
The main task of kdump kernel is providing a "window" - /proc/vmcore,
for the dump program to access old memory. A dump program running in
userspace determines the "policy". Which pages need to be dumped is
determined by the configuration of dump program, it reads out the pages
that the sustaining engineer is interested in and excludes the rest.
Likewise, the dump program can exclude the poisoned page to avoid
touching the error page again, the prerequisite is the PG_hwpoison page
flag is set correctly by kernel. The de facto dump program
(makedumpfile) already supports this function in a decade ago. To set
the PG_hwpoison flag in the production kernel just before it panics is
the only missing step to make everything work.
And it would not introduce additional overhead in capture kernel or
conflict with other hwpoision-related code in production kernel. It
leverages the already existing mechanisms to fix the issue as much as
possible, so the code changes are lightweight.
Co-developed-by: Youquan Song <[email protected]>
Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Zhiquan Li <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
V2: https://lore.kernel.org/all/[email protected]/
Changes since V2:
- Rebased to v6.6-rc5.
- Explained full scenario in commit message per Boris's suggestion.
- Included Ingo's fixes.
Link: https://lore.kernel.org/all/ZRsUpM%[email protected]/
V1: https://lore.kernel.org/all/[email protected]/
Changes since V1:
- Revised the commit message as per Naoya's suggestion.
- Replaced "TODO" comment in code with comments based on mailing list
discussion on the lack of value in covering other page types.
- Added the tag from Naoya.
Link: https://lore.kernel.org/all/[email protected]/
---
arch/x86/kernel/cpu/mce/core.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..905e80c776b8 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -233,6 +233,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
struct llist_node *pending;
struct mce_evt_llist *l;
int apei_err = 0;
+ struct page *p;
/*
* Allow instrumentation around external facilities usage. Not that it
@@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (!fake_panic) {
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
+ /*
+ * Kdump can exclude the HWPoison page to avoid touching the error
+ * page again, the prerequisite is that the PG_hwpoison page flag is
+ * set. However, for some MCE fatal error cases, there is no
+ * opportunity to queue a task for calling memory_failure(), and as a
+ * result, the capture kernel panics. So mark the page as HWPoison
+ * before kernel panic() for MCE.
+ */
+ p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
+ if (final && (final->status & MCI_STATUS_ADDRV) && p)
+ SetPageHWPoison(p);
panic(msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
--
2.25.1
On Sat, Oct 14, 2023 at 05:34:12PM +0800, Zhiquan Li wrote:
> Oh, this is V3, not RESEND, I should reset the SoB chain.
You should slow down and take the time to read the document I pointed
you at.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Sat, Oct 14, 2023 at 05:34:12PM +0800, Zhiquan Li wrote:
> Memory errors don't happen very often, especially the severity is fatal.
> However, in large-scale scenarios, such as data centers, it might still
> happen. For some MCE fatal error cases, the kernel might call
> mce_panic() to terminate the production kernel directly, thus there is
> no opportunity to queue a task for calling memory_failure() which will
> try to make the kernel survive via memory failure handling.
You can't "make the kernel survive" if the error has been deemed
critical. That's mce_severity()'s job. If it grades the error's severity
wrongly and memory_failure() should run after all, then this is
a different story.
> @@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
> if (!fake_panic) {
> if (panic_timeout == 0)
> panic_timeout = mca_cfg.panic_timeout;
This whole thing...
> + /*
> + * Kdump can exclude the HWPoison page to avoid touching the error
> + * page again, the prerequisite is that the PG_hwpoison page flag is
> + * set. However, for some MCE fatal error cases, there is no
> + * opportunity to queue a task for calling memory_failure(), and as a
> + * result, the capture kernel panics. So mark the page as HWPoison
> + * before kernel panic() for MCE.
> + */
> + p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
> + if (final && (final->status & MCI_STATUS_ADDRV) && p)
> + SetPageHWPoison(p);
... needs to be inside:
if (kexec_crash_loaded() {
...
}
otherwise it'll be useless work on the panic path.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 2023/10/16 17:11, Borislav Petkov wrote:
>> For some MCE fatal error cases, the kernel might call
>> mce_panic() to terminate the production kernel directly, thus there is
>> no opportunity to queue a task for calling memory_failure() which will
>> try to make the kernel survive via memory failure handling.
> You can't "make the kernel survive" if the error has been deemed
> critical. That's mce_severity()'s job. If it grades the error's severity
> wrongly and memory_failure() should run after all, then this is
> a different story.
>
I understand what you mean. Looks I didn't express myself well on this
point and caused ambiguity. Maybe removing the attributive clause would
make it brief and clear? Such as,
For some MCE fatal error cases, the kernel might call
mce_panic() to terminate the production kernel directly, there
is no opportunity to queue a task for calling memory_failure().
Or do you have other better suggestions? Thanks.
>> @@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>> if (!fake_panic) {
>> if (panic_timeout == 0)
>> panic_timeout = mca_cfg.panic_timeout;
> This whole thing...
>
>> + /*
>> + * Kdump can exclude the HWPoison page to avoid touching the error
>> + * page again, the prerequisite is that the PG_hwpoison page flag is
>> + * set. However, for some MCE fatal error cases, there is no
>> + * opportunity to queue a task for calling memory_failure(), and as a
>> + * result, the capture kernel panics. So mark the page as HWPoison
>> + * before kernel panic() for MCE.
>> + */
>> + p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
>> + if (final && (final->status & MCI_STATUS_ADDRV) && p)
>> + SetPageHWPoison(p);
> ... needs to be inside:
>
> if (kexec_crash_loaded() {
> ...
> }
>
> otherwise it'll be useless work on the panic path.
Good idea! The condition makes it more robust. I'll validate it and
send V4.
Best Regards,
Zhiquan
On 2023/10/14 18:18, Borislav Petkov wrote:
> You should slow down and take the time to read the document I pointed
> you at.
I'd like to revise the tag list as below for next version, and reference
rules are following each action. Please correct me if I still
understand the rules in submitting-patches.rst wrongly.
Co-developed-by: Youquan Song <[email protected]>
Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Zhiquan Li <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Cc: Borislav Petkov <[email protected]>
1) As the author of the initial patch and the person who submitting the
patch, I put my SoB following Youquan's tags per below rule:
Notably, the last Signed-off-by: must always be that of the
developer submitting the patch.
2) Remove Tony's SoB. I had confirmed with him, he needn't
Co-developed-by: tag, so the SoB added by himself in V1 and V2 should be
removed.
In fact, I'm not sure how to deal with such SoB for "RESEND" case.
While resent V2 I retained the SoB to reflect the chain. According to
my understanding, the RESEND process emphasizes "not modifying".
3) Remove Ingo's SoB. Because a new version means a new review cycle,
the SoB added in V2 should be reset to reflect the *new* real route,
unless Ingo needs a Co-developed-by: tag. Relative rule is following:
Any further SoBs (Signed-off-by:'s) following the author's SoB
are from people handling and transporting the patch, but were
not involved in its development. SoB chains should reflect the
*real* route a patch took as it was propagated to the
maintainers and ultimately to Linus, with the first SoB entry
signalling primary authorship of a single author.
I missed this point while I sent V3 :-(
4) Keep Naoya's Reviewed-by: according to below rule, because there is
no substantial change in V3.
Both Tested-by and Reviewed-by tags, once received on mailing
list from tester or reviewer, should be added by author to the
applicable patches when sending next versions. However if the
patch has changed substantially in following version, these tags
might not be applicable anymore and thus should be removed.
5) Add Cc: tag to you per below rule :-)
This is the only tag which might be added without an explicit
action by the person it names - but it should indicate that this
person was copied on the patch. This tag documents that
potentially interested parties have been included in the
discussion.
Best Regards,
Zhiquan
> I understand what you mean. Looks I didn't express myself well on this
> point and caused ambiguity. Maybe removing the attributive clause would
> make it brief and clear? Such as,
>
> For some MCE fatal error cases, the kernel might call
> mce_panic() to terminate the production kernel directly, there
> is no opportunity to queue a task for calling memory_failure().
How about:
When there is a fatal machine check Linux calls mce_panic()
without checking to see if bad data at some memory address
was reported in the machine check banks.
If kexec is enabled, check for memory errors and mark the
page as poisoned so that the kexec'd kernel can avoid accessing
the page.
-Tony
On Tue, Oct 17, 2023 at 01:24:53AM +0000, Luck, Tony wrote:
> How about:
>
> When there is a fatal machine check Linux calls mce_panic()
> without checking to see if bad data at some memory address
> was reported in the machine check banks.
... for the simple reason that the kernel cannot allow itself to do any
unnecessary work but panic immediately so that it can stop the
propagation of bad data.
Now, it's a whole different story whether that's the right thing to do
and whether the data has already propagated so that the panic is moot.
The whole point I'm trying to make is that the machine panics because
the error severity dictates it to do so. And there's no opportunity to
queue recovery work because it simply cannot in that case. So the commit
message should simply state that we're marking the page as poison for
the kexec'ed kernel's sake and not because of anything else.
> If kexec is enabled, check for memory errors and mark the
> page as poisoned so that the kexec'd kernel can avoid accessing
> the page.
Yap, yours makes sense.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 2023/10/17 19:18, Borislav Petkov wrote:
> ... for the simple reason that the kernel cannot allow itself to do any
> unnecessary work but panic immediately so that it can stop the
> propagation of bad data.
>
> Now, it's a whole different story whether that's the right thing to do
> and whether the data has already propagated so that the panic is moot.
>
> The whole point I'm trying to make is that the machine panics because
> the error severity dictates it to do so. And there's no opportunity to
> queue recovery work because it simply cannot in that case. So the commit
> message should simply state that we're marking the page as poison for
> the kexec'ed kernel's sake and not because of anything else.
>
Wonderful! Thanks for your detail explanation, Boris!
I think I got the point why you emphasized "can't make the kernel
survive" before. In such scenario the consideration for recovery
doesn’t make sense at all, even thought there is opportunity it
shouldn’t do that, the only choice is panic ASAP.
>> If kexec is enabled, check for memory errors and mark the
>> page as poisoned so that the kexec'd kernel can avoid accessing
>> the page.
> Yap, yours makes sense.
Tony, your commit message made me realize how verbose my commit message
is. May I simplify the whole commit message as following for next version?
---start---
Memory errors don't happen very often, especially the severity is fatal.
However, in large-scale scenarios, such as data centers, it might still
happen. When there is a fatal machine check Linux calls mce_panic()
without checking to see if bad data at some memory address
was reported in the machine check banks.
If kexec is enabled, check for memory errors and mark the page as
poisoned so that the kexec'ed kernel can avoid accessing the page.
---end---
It already covers the scenario, root cause and solution, and focuses on
kernel. No need to talk something else.
Thanks to both of you for great insights.
Best Regards,
Zhiquan
> Tony, your commit message made me realize how verbose my commit message
> is. May I simplify the whole commit message as following for next version?
>
> ---start---
> Memory errors don't happen very often, especially the severity is fatal.
> However, in large-scale scenarios, such as data centers, it might still
> happen. When there is a fatal machine check Linux calls mce_panic()
> without checking to see if bad data at some memory address
> was reported in the machine check banks.
>
> If kexec is enabled, check for memory errors and mark the page as
> poisoned so that the kexec'ed kernel can avoid accessing the page.
> ---end---
>
> It already covers the scenario, root cause and solution, and focuses on
> kernel. No need to talk something else.
Yes, you can use that. Being concise (but keeping the important details)
is the art of a good commit message.
-Tony
The following commit has been merged into the ras/core branch of tip:
Commit-ID: 1d11b153d23b5fd131d4ea125ff23c9e8ebc98ab
Gitweb: https://git.kernel.org/tip/1d11b153d23b5fd131d4ea125ff23c9e8ebc98ab
Author: Zhiquan Li <[email protected]>
AuthorDate: Mon, 23 Oct 2023 12:22:37 +08:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Mon, 23 Oct 2023 14:53:13 +02:00
x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel
Memory errors don't happen very often, especially fatal ones. However,
in large-scale scenarios such as data centers, that probability
increases with the amount of machines present.
When a fatal machine check happens, mce_panic() is called based on the
severity grading of that error. The page containing the error is not
marked as poison.
However, when kexec is enabled, tools like makedumpfile understand when
pages are marked as poison and do not touch them so as not to cause
a fatal machine check exception again while dumping the previous
kernel's memory.
Therefore, mark the page containing the error as poisoned so that the
kexec'ed kernel can avoid accessing the page.
[ bp: Rewrite commit message and comment. ]
Co-developed-by: Youquan Song <[email protected]>
Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Zhiquan Li <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/cpu/mce/core.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 0214d42..a25e692 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -44,6 +44,7 @@
#include <linux/sync_core.h>
#include <linux/task_work.h>
#include <linux/hardirq.h>
+#include <linux/kexec.h>
#include <asm/intel-family.h>
#include <asm/processor.h>
@@ -233,6 +234,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
struct llist_node *pending;
struct mce_evt_llist *l;
int apei_err = 0;
+ struct page *p;
/*
* Allow instrumentation around external facilities usage. Not that it
@@ -286,6 +288,18 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (!fake_panic) {
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
+
+ /*
+ * Kdump skips the poisoned page in order to avoid
+ * touching the error bits again. Poison the page even
+ * if the error is fatal and the machine is about to
+ * panic.
+ */
+ if (kexec_crash_loaded()) {
+ p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
+ if (final && (final->status & MCI_STATUS_ADDRV) && p)
+ SetPageHWPoison(p);
+ }
panic(msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);