2022-08-17 05:13:25

by Ashok Raj

[permalink] [raw]
Subject: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update

When a microcode update is in progress, several instructions and MSR's can
be patched by the update. During the update in progress, touching any of
the resources being patched could result in unpredictable results. If
thread0 is doing the update and thread1 happens to get a MCE, the handler
might read an MSR that's being patched.

In order to have predictable behavior, to avoid this scenario we set the MCIP in
all threads. Since MCE's can't be nested, HW will automatically promote to
shutdown condition.

After the update is completed, MCIP flag is cleared. The system is going to
shutdown anyway, since the MCE could be a fatal error, or even recoverable
errors in kernel space are treated as unrecoverable.

Signed-off-by: Ashok Raj <[email protected]>
---
arch/x86/include/asm/mce.h | 4 ++++
arch/x86/kernel/cpu/mce/core.c | 9 +++++++++
arch/x86/kernel/cpu/microcode/core.c | 11 +++++++++++
3 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index cc73061e7255..2aef6120e23f 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -207,12 +207,16 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c);
void mcheck_cpu_clear(struct cpuinfo_x86 *c);
int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
u64 lapic_id);
+extern void mce_set_mcip(void);
+extern void mce_clear_mcip(void);
#else
static inline int mcheck_init(void) { return 0; }
static inline void mcheck_cpu_init(struct cpuinfo_x86 *c) {}
static inline void mcheck_cpu_clear(struct cpuinfo_x86 *c) {}
static inline int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
u64 lapic_id) { return -EINVAL; }
+static inline void mce_set_mcip(void) {}
+static inline void mce_clear_mcip(void) {}
#endif

void mce_setup(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c8ec5c71712..72b49d95bb3b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -402,6 +402,15 @@ static noinstr void mce_wrmsrl(u32 msr, u64 v)
: : "c" (msr), "a"(low), "d" (high) : "memory");
}

+void mce_set_mcip(void)
+{
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0x1);
+}
+
+void mce_clear_mcip(void)
+{
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0x0);
+}
/*
* Collect all global (w.r.t. this processor) status about this machine
* check into our "mce" struct so that we can use it later to assess
diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index ad57e0e4d674..d24e1c754c27 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -39,6 +39,7 @@
#include <asm/processor.h>
#include <asm/cmdline.h>
#include <asm/setup.h>
+#include <asm/mce.h>

#define DRIVER_VERSION "2.2"

@@ -450,6 +451,14 @@ static int __reload_late(void *info)
if (__wait_for_cpus(&late_cpus_in, NSEC_PER_SEC))
return -1;

+ /*
+ * Its dangerous to let MCE while microcode update is in progress.
+ * Its extremely rare and even if happens they are fatal errors.
+ * But reading patched areas before the update is complete can be
+ * leading to unpredictable results. Setting MCIP will guarantee
+ * the platform is taken to reset predictively.
+ */
+ mce_set_mcip();
/*
* On an SMT system, it suffices to load the microcode on one sibling of
* the core because the microcode engine is shared between the threads.
@@ -457,6 +466,7 @@ static int __reload_late(void *info)
* loading attempts happen on multiple threads of an SMT core. See
* below.
*/
+
if (cpumask_first(topology_sibling_cpumask(cpu)) == cpu)
apply_microcode_local(&err);
else
@@ -473,6 +483,7 @@ static int __reload_late(void *info)
if (__wait_for_cpus(&late_cpus_out, NSEC_PER_SEC))
panic("Timeout during microcode update!\n");

+ mce_clear_mcip();
/*
* At least one thread has completed update on each core.
* For others, simply call the update to make sure the
--
2.32.0


2022-08-17 08:06:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update


* Ingo Molnar <[email protected]> wrote:

> > +void mce_set_mcip(void)
> > +{
> > + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0x1);
> > +}
> > +
> > +void mce_clear_mcip(void)
> > +{
> > + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0x0);
> > +}
>
> Instead of naming new APIs after how they are doing stuff, please name them
> after *what* they are doing at the highest level: they disable/enable MCEs.
>
> Ie. I'd suggest something like:
>
> mce_disable()
> mce_enable()
>
> I'd also suggest to at minimum add a WARN_ON_ONCE() if MSR_IA32_MCG_STATUS
> is already 1 when we disable it - because whoever wanted it disabled will
> now be surprised by us enabling them again.

Also, Boris tells me that writing 0x0 to MSR_IA32_MCG_STATUS apparently
shuts the platform down - which is not ideal...

Thanks,

Ingo

2022-08-17 08:12:19

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update

On Wed, Aug 17, 2022 at 09:58:03AM +0200, Ingo Molnar wrote:
> Also, Boris tells me that writing 0x0 to MSR_IA32_MCG_STATUS
> apparently shuts the platform down - which is not ideal...

Right, if you get an MCE raised while MCIP=0, the machine shuts down.

And frankly, I can't think of a good solution to this whole issue:

- with current hw, if you get an MCE and MCIP=0 -> shutdown

- in the future, even if you change the hardware to block MCEs from
being detected while the microcode update runs, what happens if a CPU
encounters a hw error during that update?

You raise it immediately after? What if there are multiple MCEs? Not
unheard of on a big machine...

Worse, what happens if there's a bitflip in the memory where the
to-be-updated microcode patch is?

You report the error afterwards?

Just thinking about this makes me real nervous.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-17 08:29:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update


* Ashok Raj <[email protected]> wrote:

> When a microcode update is in progress, several instructions and MSR's can
> be patched by the update. During the update in progress, touching any of
> the resources being patched could result in unpredictable results. If
> thread0 is doing the update and thread1 happens to get a MCE, the handler
> might read an MSR that's being patched.
>
> In order to have predictable behavior, to avoid this scenario we set the MCIP in
> all threads. Since MCE's can't be nested, HW will automatically promote to
> shutdown condition.
>
> After the update is completed, MCIP flag is cleared. The system is going to
> shutdown anyway, since the MCE could be a fatal error, or even recoverable
> errors in kernel space are treated as unrecoverable.
>
> Signed-off-by: Ashok Raj <[email protected]>
> ---
> arch/x86/include/asm/mce.h | 4 ++++
> arch/x86/kernel/cpu/mce/core.c | 9 +++++++++
> arch/x86/kernel/cpu/microcode/core.c | 11 +++++++++++
> 3 files changed, 24 insertions(+)
>
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index cc73061e7255..2aef6120e23f 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -207,12 +207,16 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c);
> void mcheck_cpu_clear(struct cpuinfo_x86 *c);
> int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
> u64 lapic_id);
> +extern void mce_set_mcip(void);
> +extern void mce_clear_mcip(void);
> #else
> static inline int mcheck_init(void) { return 0; }
> static inline void mcheck_cpu_init(struct cpuinfo_x86 *c) {}
> static inline void mcheck_cpu_clear(struct cpuinfo_x86 *c) {}
> static inline int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
> u64 lapic_id) { return -EINVAL; }
> +static inline void mce_set_mcip(void) {}
> +static inline void mce_clear_mcip(void) {}
> #endif
>
> void mce_setup(struct mce *m);
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 2c8ec5c71712..72b49d95bb3b 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -402,6 +402,15 @@ static noinstr void mce_wrmsrl(u32 msr, u64 v)
> : : "c" (msr), "a"(low), "d" (high) : "memory");
> }
>
> +void mce_set_mcip(void)
> +{
> + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0x1);
> +}
> +
> +void mce_clear_mcip(void)
> +{
> + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0x0);
> +}

Instead of naming new APIs after how they are doing stuff, please name them
after *what* they are doing at the highest level: they disable/enable MCEs.

Ie. I'd suggest something like:

mce_disable()
mce_enable()

I'd also suggest to at minimum add a WARN_ON_ONCE() if MSR_IA32_MCG_STATUS
is already 1 when we disable it - because whoever wanted it disabled will
now be surprised by us enabling them again.

> + /*
> + * Its dangerous to let MCE while microcode update is in progress.

s/let MCE while
/let MCEs execute while

> + * Its extremely rare and even if happens they are fatal errors.
> + * But reading patched areas before the update is complete can be
> + * leading to unpredictable results. Setting MCIP will guarantee

s/can be leading to
/can lead to

> + * the platform is taken to reset predictively.

What does 'the platform is taken to reset predictively' mean?

Did you mean 'predictibly'/'reliably'?

> + */
> + mce_set_mcip();
> /*
> * On an SMT system, it suffices to load the microcode on one sibling of
> * the core because the microcode engine is shared between the threads.
> @@ -457,6 +466,7 @@ static int __reload_late(void *info)
> * loading attempts happen on multiple threads of an SMT core. See
> * below.
> */
> +
> if (cpumask_first(topology_sibling_cpumask(cpu)) == cpu)
> apply_microcode_local(&err);
> else

Spurious newline added?

Thanks,

Ingo

2022-08-17 12:21:10

by Ashok Raj

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update

On Wed, Aug 17, 2022 at 10:09:00AM +0200, Borislav Petkov wrote:
> On Wed, Aug 17, 2022 at 09:58:03AM +0200, Ingo Molnar wrote:
> > Also, Boris tells me that writing 0x0 to MSR_IA32_MCG_STATUS
> > apparently shuts the platform down - which is not ideal...
>
> Right, if you get an MCE raised while MCIP=0, the machine shuts down.
>
> And frankly, I can't think of a good solution to this whole issue:
>
> - with current hw, if you get an MCE and MCIP=0 -> shutdown

You have this reversed. if you get an MCE and MCIP=1 -> shutdown

I'm still very reluctant, this is actually an overkill. I added what is
possible based on Boris's recommendation.

When MCE's happen during the update they are always fatal errors. But
atleast you can log them, even if some other weird error were to be
observed because they stomed over the patch area that primary is currently
working on.

What we do here by setting MCIP=1, we promote to a more severe shutdown.

Ideally I would rather let the fallout happen since its observable vs a
blind shutdown is what we are promoting to.

>
> - in the future, even if you change the hardware to block MCEs from
> being detected while the microcode update runs, what happens if a CPU
> encounters a hw error during that update?

I don't think there ever will be blocking MCE's :-)

If an error happens, it leads to shutdown.
>
> You raise it immediately after? What if there are multiple MCEs? Not
> unheard of on a big machine...

Shutdown, shutdown.. There is only 1 MCE no matter how many CPUs you have.

Exception is the Local MCE which is recoverable, but only to user space.

If you get an error in the atomic we are polling, its a fatal error since
SW can't recover and we shutdown.
>
> Worse, what happens if there's a bitflip in the memory where the
> to-be-updated microcode patch is?
>
> You report the error afterwards?
>
> Just thinking about this makes me real nervous.

Overthinking :-).. If there is concensus, if Boris feels comfortable
enough, i would drop this patch.

2022-08-17 12:21:58

by Ashok Raj

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update

On Wed, Aug 17, 2022 at 09:41:31AM +0200, Ingo Molnar wrote:
>
> > +void mce_set_mcip(void)
> > +{
> > + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0x1);
> > +}
> > +
> > +void mce_clear_mcip(void)
> > +{
> > + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0x0);
> > +}
>
> Instead of naming new APIs after how they are doing stuff, please name them
> after *what* they are doing at the highest level: they disable/enable MCEs.
>
> Ie. I'd suggest something like:
>
> mce_disable()
> mce_enable()

We actually aren't disabling MCE's we set things up to promote it to a more
severe shutdown if an MCE were to be signaled while in the ucode update
flow.

I'm struggling to find a suitable name. But I agree with what you are
saying.

promote_mce_to_fatal()? I'll take any names that seem fit.

>
> I'd also suggest to at minimum add a WARN_ON_ONCE() if MSR_IA32_MCG_STATUS
> is already 1 when we disable it - because whoever wanted it disabled will
> now be surprised by us enabling them again.

Ok, will add.

>
> > + /*
> > + * Its dangerous to let MCE while microcode update is in progress.
>
> s/let MCE while
> /let MCEs execute while
>
> > + * Its extremely rare and even if happens they are fatal errors.
> > + * But reading patched areas before the update is complete can be
> > + * leading to unpredictable results. Setting MCIP will guarantee
>
> s/can be leading to
> /can lead to
>
> > + * the platform is taken to reset predictively.
>
> What does 'the platform is taken to reset predictively' mean?

Since we are setting MCG_STATUS.MCIP=1, since MCE's aren't nestable,
if there is a hardware event trying to signal a MCE, it will turn into a
platform reset. The MCE registers that logged the event will be sticky
and preserve in a warm reset case. BIOS or OS can pickup values after the
reboot is complete.

>
> Did you mean 'predictibly'/'reliably'?

I don't know the difference, except both are a trustworthy topic :-)

I like predictable, the system is going down.. not reliable :-)
>
> > + */
> > + mce_set_mcip();
> > /*
> > * On an SMT system, it suffices to load the microcode on one sibling of
> > * the core because the microcode engine is shared between the threads.
> > @@ -457,6 +466,7 @@ static int __reload_late(void *info)
> > * loading attempts happen on multiple threads of an SMT core. See
> > * below.
> > */
> > +
> > if (cpumask_first(topology_sibling_cpumask(cpu)) == cpu)
> > apply_microcode_local(&err);
> > else
>
> Spurious newline added?

It's a gonner !

Cheers,
Ashok

2022-08-17 12:25:01

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update

On Wed, Aug 17, 2022 at 11:57:37AM +0000, Ashok Raj wrote:
> You have this reversed. if you get an MCE and MCIP=1 -> shutdown

Yeah yeah.

> When MCE's happen during the update they are always fatal errors.

How did you decide that?

Because all CPUs are executing the loop and thus no user process?

> What we do here by setting MCIP=1, we promote to a more severe shutdown.

It probably should say somewhere that a shutdown is possible. Because if
the shutdown really happens, you get a black screen and no info as to
why...

> Ideally I would rather let the fallout happen since its observable vs a
> blind shutdown is what we are promoting to.

What fallout do you mean exactly?

> Shutdown, shutdown.. There is only 1 MCE no matter how many CPUs you have.

Because all CPUs are executing the loop? Or how do you decide this?

> Exception is the Local MCE which is recoverable, but only to user space.
>
> If you get an error in the atomic we are polling, its a fatal error since
> SW can't recover and we shutdown.

Aha, I think you mean that: the MCE is fatal because if it happens on
any CPU, it will be in kernel mode.

> Overthinking :-).. If there is concensus, if Boris feels comfortable
> enough, i would drop this patch.

This is what we're doing right now - thinking about the consensus. And
Boris will feel comfortable once we've arrived at a sensible decision.

:-)

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-17 12:43:12

by Ashok Raj

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update

On Wed, Aug 17, 2022 at 02:10:51PM +0200, Borislav Petkov wrote:
> On Wed, Aug 17, 2022 at 11:57:37AM +0000, Ashok Raj wrote:
> > You have this reversed. if you get an MCE and MCIP=1 -> shutdown
>
> Yeah yeah.
>
> > When MCE's happen during the update they are always fatal errors.
>
> How did you decide that?
>
> Because all CPUs are executing the loop and thus no user process?

Correct. There can be a fatal IOMCA which is asynchronous and can fire
anytime. We removed any CPU initiated async errors like patrol-scrub and
cache errors observed during eviction (EWB) into CMCI's when we developed
LMCE.

>
> > What we do here by setting MCIP=1, we promote to a more severe shutdown.
>
> It probably should say somewhere that a shutdown is possible. Because if
> the shutdown really happens, you get a black screen and no info as to
> why...

You will find out when system returns after reboot and hopefully wasn't
promoted to a cold-boot which will loose MCE banks.

I can check with the HW team and get back..

>
> > Ideally I would rather let the fallout happen since its observable vs a
> > blind shutdown is what we are promoting to.
>
> What fallout do you mean exactly?

Meaning deal with the effect of a really rare MCE. Rather than trying to
avoid it. Taking the MCE is more important than finishing the update,
and loosing what the error signaled was trying to convey.

>
> > Shutdown, shutdown.. There is only 1 MCE no matter how many CPUs you have.
>
> Because all CPUs are executing the loop? Or how do you decide this?

Fatal errors signaled with PCC=1 in the MCAx.STATUS is *ALWAYS*
broadcast[1] to all CPUs in the system. So MCIP is set at the source of
the CPU and if any other CPU is also attempting they go down. You can't
even collect data.

[1] Broadcast is true for Intel CPUs, its legacy behavior
LMCE is the only one that isn't broadcast and truly local to the logical
thread.


>
> > Exception is the Local MCE which is recoverable, but only to user space.
> >
> > If you get an error in the atomic we are polling, its a fatal error since
> > SW can't recover and we shutdown.
>
> Aha, I think you mean that: the MCE is fatal because if it happens on
> any CPU, it will be in kernel mode.

Yep!

>
> > Overthinking :-).. If there is concensus, if Boris feels comfortable
> > enough, i would drop this patch.
>
> This is what we're doing right now - thinking about the consensus. And
> Boris will feel comfortable once we've arrived at a sensible decision.
>
> :-)
>
I'm waiting for the results. :-). And if you feel we can merge the
- Patch1 - bug fix
- Patch2 - min-rev id

I do have the comments from Ingo captured, but I'll wait for other comments
before i resend just those 2 and we can leave the NMI handling to get more
testing and review before we consider.

Cheers,
Ashok

2022-08-17 14:47:37

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update

On Wed, Aug 17, 2022 at 12:30:49PM +0000, Ashok Raj wrote:
> You will find out when system returns after reboot and hopefully wasn't
> promoted to a cold-boot which will loose MCE banks.

Not good enough!

This should issue a warning in dmesg that a potential MCE while update
is running would cause a lockup. That is if we don't disable MCE around
it.

If we decide to disable MCE, it should say shutdown.

> Meaning deal with the effect of a really rare MCE. Rather than trying to
> avoid it. Taking the MCE is more important than finishing the update,
> and loosing what the error signaled was trying to convey.

Right now I'm inclined to not do anything and warn of a potential rare
situation.

> > > Shutdown, shutdown.. There is only 1 MCE no matter how many CPUs you have.
> >
> > Because all CPUs are executing the loop? Or how do you decide this?
>
> Fatal errors signaled with PCC=1 in the MCAx.STATUS is *ALWAYS*

What does that have to do with

"There is only 1 MCE no matter how many CPUs you have."

?

That's bullsh*t. Especially if the machine can do LMCE.

> I'm waiting for the results. :-). And if you feel we can merge the
> - Patch1 - bug fix
> - Patch2 - min-rev id
>
> I do have the comments from Ingo captured, but I'll wait for other comments
> before i resend just those 2 and we can leave the NMI handling to get more
> testing and review before we consider.

No, you need to go read Documentation/process/.

I'm tired of having to explain to you how the kernel development process
works. You send your set, wait for a week, collect feedback and then you
send a new revision.

Not hammer people with patchsets every day.

This is not how that works.

If someone's breathing down your neck lemme know - I'd like to talk to
him/her.

Ok?!

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-17 15:26:03

by Ashok Raj

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update

On Wed, Aug 17, 2022 at 04:19:40PM +0200, Borislav Petkov wrote:
> On Wed, Aug 17, 2022 at 12:30:49PM +0000, Ashok Raj wrote:
> > You will find out when system returns after reboot and hopefully wasn't
> > promoted to a cold-boot which will loose MCE banks.
>
> Not good enough!

I probably misread your question.. are you suggesting we add some WARN when
we initiate late_load? I thought you were asking if the HW must signal
something and OS should log when an MCE happens if MCIP=1


>
> This should issue a warning in dmesg that a potential MCE while update
> is running would cause a lockup. That is if we don't disable MCE around
> it.
>
> If we decide to disable MCE, it should say shutdown.

Ok, that clarifies it.. "IF we choose to set MCIP=1, we should tell users
that hell can break loose, get under the table" :-)

>
> > Meaning deal with the effect of a really rare MCE. Rather than trying to
> > avoid it. Taking the MCE is more important than finishing the update,
> > and loosing what the error signaled was trying to convey.
>
> Right now I'm inclined to not do anything and warn of a potential rare
> situation.

Encouraging.. So I'll drop that patch from the list next time around.

>
> > > > Shutdown, shutdown.. There is only 1 MCE no matter how many CPUs you have.
> > >
> > > Because all CPUs are executing the loop? Or how do you decide this?
> >
> > Fatal errors signaled with PCC=1 in the MCAx.STATUS is *ALWAYS*
>
> What does that have to do with
>
> "There is only 1 MCE no matter how many CPUs you have."
>
> ?
>
> That's bullsh*t. Especially if the machine can do LMCE.

Well, not outlandish :)

LMCE is only for recoverable errors. When we have a fatal error, sometimes
the signalling and consumption of poison are going in different directions.
In order to minimize exposure of bad data from being consumed,
*ALL* Intel processors have always broadcast fatal errors. This is the
history behind why we broadcast.

BTW: This is all legacy behavior. Nothing should come as surprise.

LMCE is best effort. This is the current state.

Cheers,
Ashok

2022-08-29 14:54:01

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] x86/microcode: Avoid any chance of MCE's during microcode update

On 8/17/22 08:06, Ashok Raj wrote:
> On Wed, Aug 17, 2022 at 04:19:40PM +0200, Borislav Petkov wrote:
>> On Wed, Aug 17, 2022 at 12:30:49PM +0000, Ashok Raj wrote:
>>> You will find out when system returns after reboot and hopefully wasn't
>>> promoted to a cold-boot which will loose MCE banks.
>> Not good enough!
> I probably misread your question.. are you suggesting we add some WARN when
> we initiate late_load? I thought you were asking if the HW must signal
> something and OS should log when an MCE happens if MCIP=1
>
>
>> This should issue a warning in dmesg that a potential MCE while update
>> is running would cause a lockup. That is if we don't disable MCE around
>> it.
>>
>> If we decide to disable MCE, it should say shutdown.
> Ok, that clarifies it.. "IF we choose to set MCIP=1, we should tell users
> that hell can break loose, get under the table" :-)
>
>>> Meaning deal with the effect of a really rare MCE. Rather than trying to
>>> avoid it. Taking the MCE is more important than finishing the update,
>>> and loosing what the error signaled was trying to convey.
>> Right now I'm inclined to not do anything and warn of a potential rare
>> situation.
> Encouraging.. So I'll drop that patch from the list next time around.


If I followed all this correctly, I agree. If we set MCIP to force a
crash if we get MCE, then we are guaranteed to crash.  If we don't, then
we might crash.


An imperfect alternative would be to set a (percpu?) flag that we're
doing a ucode update and then detect that flag early in the MCE handler
and warn very loudly.  This seems like it will give us the best chance
of getting a useful diagnostic.