2022-02-08 11:51:33

by Jue Wang

[permalink] [raw]
Subject: [RFC] x86/mce: Add workaround for SKX/CLX/CPX spurious machine checks

The fast string copy instructions ("rep movs*") could consume an
uncorrectable memory error in the cache line _right after_ the
desired region to copy and raise an MCE.

Bit 0 of MSR_IA32_MISC_ENABLE can be cleared to disable fast string copy
and will avoid such spurious machine checks. However, that is less
preferable due to the permanent performance impact. Considering memory
poison is rare, it's desirable to keep fast string enabled until an MCE
is seen.

Intel has confirmed the following:
1. The CPU erratum of fast string copy only applies to
SKX/CLX/CPL generations.
2. Directly return from MCE handler will result in complete execution
of the fast string copy (rep movs*) with no data loss or corruption.
3. Directly return from MCE handler will not result in another MCE
firing on the next poisoned cache line due to rep movs*.
4. Directly return from MCE handler will resume execution from a
correct point in code.
5. Directly return from MCE handler due to any other SRAR MCEs will
result in the same instruction that triggered the MCE firing a second
MCE immediately.
6. It's not safe to directly return without disabling the fast string
copy, as the next fast string copy of the same buffer on the same CPU
would result in a PANIC MCE.

The mitigation in this patch should mitigate the erratum completely with
the only caveat that the fast string copy is disabled on the affected
hyper thread thus performance degradation.

This is still better than the OS crashes on MCEs raised on an
irrelevant process due to 'rep movs*' accesses in a kernel context,
e.g., copy_page.

Since a host drain / fail-over usually starts right after the first
MCE is signaled, which results in VM migration or termination, the
performance degradation is a transient effect.

Tested:

Injected errors on 1st cache line of 8 anonymous pages of process
'proc1' and observed MCE consumption from 'proc2' with no panic
(directly returned).

Without the fix, the host panicked within a few minutes on a
random 'proc2' process due to kernel access from copy_page.

Signed-off-by: Jue Wang <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 61 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/internal.h | 5 ++-
2 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 5818b837fd4d..06001e3b2ff2 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -834,6 +834,57 @@ static void quirk_sandybridge_ifu(int bank, struct mce *m, struct pt_regs *regs)
m->cs = regs->cs;
}

+static bool is_intel_srar(u64 mci_status)
+{
+ return (mci_status &
+ (MCI_STATUS_VAL|MCI_STATUS_OVER|MCI_STATUS_UC|MCI_STATUS_EN|
+ MCI_STATUS_ADDRV|MCI_STATUS_MISCV|MCI_STATUS_PCC|
+ MCI_STATUS_AR|MCI_STATUS_S)) ==
+ (MCI_STATUS_VAL|MCI_STATUS_UC|MCI_STATUS_EN|MCI_STATUS_ADDRV|
+ MCI_STATUS_MISCV|MCI_STATUS_AR|MCI_STATUS_S);
+}
+
+/*
+ * Disable fast string copy and return from the MCE handler upon the first SRAR
+ * MCE on bank 1 due to a CPU erratum on Intel SKX/CLX/CPL CPUs.
+ * The fast string copy instructions ("rep movs*") could consume an
+ * uncorrectable memory error in the cache line _right after_ the
+ * desired region to copy and raise an MCE with RIP pointing to the
+ * instruction _after_ the "rep movs*".
+ * This mitigation addresses the issue completely with the caveat of
+ * performance degradation on the CPU affected. This is still better
+ * than the OS crashes on MCEs raised on an irrelevant process due to
+ * 'rep movs*' accesses in a kernel context (e.g., copy_page).
+ * Since a host drain / fail-over usually starts right after the first
+ * MCE is signaled, which results in VM migration or termination, the
+ * performance degradation is a transient effect.
+ *
+ * Returns true when fast string copy on cpu should be disabled.
+ */
+static bool quirk_skylake_repmov(void)
+{
+ /*
+ * State that represents if an SRAR MCE has already signaled on the DCU bank.
+ */
+ static DEFINE_PER_CPU(bool, srar_dcu_signaled);
+
+ if (unlikely(!__this_cpu_read(srar_dcu_signaled))) {
+ u64 mc1_status = mce_rdmsrl(MSR_IA32_MCx_STATUS(1));
+
+ if (is_intel_srar(mc1_status)) {
+ __this_cpu_write(srar_dcu_signaled, true);
+ msr_clear_bit(MSR_IA32_MISC_ENABLE,
+ MSR_IA32_MISC_ENABLE_FAST_STRING_BIT);
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+ mce_wrmsrl(MSR_IA32_MCx_STATUS(1), 0);
+ pr_err("First SRAR MCE on DCU, CPU: %d, disable fast string copy.\n",
+ smp_processor_id());
+ return true;
+ }
+ }
+ return false;
+}
+
/*
* Do a quick check if any of the events requires a panic.
* This decides if we keep the events around or clear them.
@@ -1403,6 +1454,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
else if (unlikely(!mca_cfg.initialized))
return unexpected_machine_check(regs);

+ if (mce_flags.skx_repmov_quirk && quirk_skylake_repmov())
+ return;
+
/*
* Establish sequential order between the CPUs entering the machine
* check handler.
@@ -1858,6 +1912,13 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)

if (c->x86 == 6 && c->x86_model == 45)
mce_flags.snb_ifu_quirk = 1;
+
+ /*
+ * Skylake, Cascacde Lake and Cooper Lake require a quirk on
+ * rep movs.
+ */
+ if (c->x86 == 6 && c->x86_model == INTEL_FAM6_SKYLAKE_X)
+ mce_flags.skx_repmov_quirk = 1;
}

if (c->x86_vendor == X86_VENDOR_ZHAOXIN) {
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 52c633950b38..cec227c25138 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -170,7 +170,10 @@ struct mce_vendor_flags {
/* SandyBridge IFU quirk */
snb_ifu_quirk : 1,

- __reserved_0 : 57;
+ /* Skylake, Cascade Lake, Cooper Lake rep movs quirk */
+ skx_repmov_quirk : 1,
+
+ __reserved_0 : 56;
};

extern struct mce_vendor_flags mce_flags;
--
2.35.0.263.gb82422642f-goog



2022-02-09 03:14:40

by Luck, Tony

[permalink] [raw]
Subject: RE: [RFC] x86/mce: Add workaround for SKX/CLX/CPX spurious machine checks

>> The erratum has made its way through to the public specification
>> update yet :-(
>
> You mean "has not"?

Curse my pathetic typing skills ... "has NOT made its way" is where we are today.
I don't know when that status will change.

> In any case, I guess you could say something like:
>
> pr_err_once("Erratum #XXX detected, disabling fast string copy instructions.\n");
>
> or so and people can search with the erratum number later where the doc
> will explain it in more detail.

When the errata (plural, there are separate lists for SKX and CLX) go public
we could update this message with the names.

-Tony

2022-02-09 06:27:06

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC] x86/mce: Add workaround for SKX/CLX/CPX spurious machine checks

On Mon, Feb 07, 2022 at 09:07:05PM +0000, Luck, Tony wrote:
> Yes. That would work. It's an extra MSR read instead of a memory read.
> But this isn't a performance path.

Exactly.

> The erratum has made its way through to the public specification
> update yet :-(

You mean "has not"?

In any case, I guess you could say something like:

pr_err_once("Erratum #XXX detected, disabling fast string copy instructions.\n");

or so and people can search with the erratum number later where the doc
will explain it in more detail.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-02-09 08:27:59

by Jue Wang

[permalink] [raw]
Subject: Re: [RFC] x86/mce: Add workaround for SKX/CLX/CPX spurious machine checks

Thanks for the feedback, Tony and Borislav.

I will send out an updated patch.

Thanks,
-Jue

On Mon, Feb 7, 2022 at 1:51 PM Luck, Tony <[email protected]> wrote:
>
> >> The erratum has made its way through to the public specification
> >> update yet :-(
> >
> > You mean "has not"?
>
> Curse my pathetic typing skills ... "has NOT made its way" is where we are today.
> I don't know when that status will change.
>
> > In any case, I guess you could say something like:
> >
> > pr_err_once("Erratum #XXX detected, disabling fast string copy instructions.\n");
> >
> > or so and people can search with the erratum number later where the doc
> > will explain it in more detail.
>
> When the errata (plural, there are separate lists for SKX and CLX) go public
> we could update this message with the names.
We've found this message in combination with logging about the
faulting process info
in do_machine_check helpful when analyzing the originating context of the MCEs.
>
> -Tony

2022-02-09 08:44:04

by Luck, Tony

[permalink] [raw]
Subject: Re: [RFC] x86/mce: Add workaround for SKX/CLX/CPX spurious machine checks

On Sun, Feb 06, 2022 at 08:36:40PM -0800, Jue Wang wrote:
> +static bool quirk_skylake_repmov(void)
> +{
> + /*
> + * State that represents if an SRAR MCE has already signaled on the DCU bank.
> + */
> + static DEFINE_PER_CPU(bool, srar_dcu_signaled);
> +
> + if (unlikely(!__this_cpu_read(srar_dcu_signaled))) {
> + u64 mc1_status = mce_rdmsrl(MSR_IA32_MCx_STATUS(1));

Jue,

When I reviewed this for you off-list, I didn't notice that you
dropped the test for mcgstatus & MCG_STATUS_LMCES as part of
moving to a helper function and expanding the test for more
bits in mc1_status.

I think that test still is still important ... knowing that this is
a *local* machine check before making decision based on just what this
CPU observes makes this a bit more robust.

> +
> + if (is_intel_srar(mc1_status)) {
> + __this_cpu_write(srar_dcu_signaled, true);
> + msr_clear_bit(MSR_IA32_MISC_ENABLE,
> + MSR_IA32_MISC_ENABLE_FAST_STRING_BIT);
> + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> + mce_wrmsrl(MSR_IA32_MCx_STATUS(1), 0);
> + pr_err("First SRAR MCE on DCU, CPU: %d, disable fast string copy.\n",
> + smp_processor_id());
> + return true;
> + }
> + }
> + return false;
> +}

-Tony

2022-02-09 12:18:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC] x86/mce: Add workaround for SKX/CLX/CPX spurious machine checks

And while you're working in Tony's request...

On Sun, Feb 06, 2022 at 08:36:40PM -0800, Jue Wang wrote:
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 5818b837fd4d..06001e3b2ff2 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -834,6 +834,57 @@ static void quirk_sandybridge_ifu(int bank, struct mce *m, struct pt_regs *regs)
> m->cs = regs->cs;
> }
>
> +static bool is_intel_srar(u64 mci_status)

You don't need this separate function - stick it all in quirk_skylake_repmov()

> + return (mci_status &
> + (MCI_STATUS_VAL|MCI_STATUS_OVER|MCI_STATUS_UC|MCI_STATUS_EN|
> + MCI_STATUS_ADDRV|MCI_STATUS_MISCV|MCI_STATUS_PCC|
> + MCI_STATUS_AR|MCI_STATUS_S)) ==
> + (MCI_STATUS_VAL|MCI_STATUS_UC|MCI_STATUS_EN|MCI_STATUS_ADDRV|
> + MCI_STATUS_MISCV|MCI_STATUS_AR|MCI_STATUS_S);
> +}
> +
> +/*
> + * Disable fast string copy and return from the MCE handler upon the first SRAR
> + * MCE on bank 1 due to a CPU erratum on Intel SKX/CLX/CPL CPUs.
> + * The fast string copy instructions ("rep movs*") could consume an
> + * uncorrectable memory error in the cache line _right after_ the
> + * desired region to copy and raise an MCE with RIP pointing to the
> + * instruction _after_ the "rep movs*".
> + * This mitigation addresses the issue completely with the caveat of
> + * performance degradation on the CPU affected. This is still better
> + * than the OS crashes on MCEs raised on an irrelevant process due to
> + * 'rep movs*' accesses in a kernel context (e.g., copy_page).
> + * Since a host drain / fail-over usually starts right after the first
> + * MCE is signaled, which results in VM migration or termination, the
> + * performance degradation is a transient effect.
> + *
> + * Returns true when fast string copy on cpu should be disabled.
> + */
> +static bool quirk_skylake_repmov(void)
> +{
> + /*
> + * State that represents if an SRAR MCE has already signaled on the DCU bank.
> + */
> + static DEFINE_PER_CPU(bool, srar_dcu_signaled);

What's that needed for?

If the MSR write below clears the CPUID bit, you clear the corresponding
X86_FEATURE flag. And this test becomes a X86_FEATURE flag test:

if (this_cpu_has(X86_FEATURE_FSRM))

I'd guess...

> + if (unlikely(!__this_cpu_read(srar_dcu_signaled))) {
> + u64 mc1_status = mce_rdmsrl(MSR_IA32_MCx_STATUS(1));
> +
> + if (is_intel_srar(mc1_status)) {
> + __this_cpu_write(srar_dcu_signaled, true);
> + msr_clear_bit(MSR_IA32_MISC_ENABLE,
> + MSR_IA32_MISC_ENABLE_FAST_STRING_BIT);
> + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> + mce_wrmsrl(MSR_IA32_MCx_STATUS(1), 0);
> + pr_err("First SRAR MCE on DCU, CPU: %d, disable fast string copy.\n",

That error message can be understood probably only by a handful dozen of
people on the planet. Is it write-only or is it supposed to be consumed
by humans and if so, what would be the use case?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette