2013-06-26 19:08:50

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Update MCE severity condition check

On Thu, Jun 20, 2013 at 05:16:12AM -0400, Luck, Tony wrote:
> From: Chen Gong <[email protected]>
>
> Update some SRAR severity conditions check to make it clearer
> according to latest Intel SDM Vol 3B (June 2013), table 15-20.
>
> Signed-off-by: Chen Gong <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---
>
> Chen Gong wrote:
> > If this patch is OK, would you please help to update it when merging
> > it? Thanks very much
>
> This is what I plan to apply.
> 1. Changed "user land" to "in a user process" (2 places) per Boris comment
> 2. Changed "non-affected" to "unaffected" per Naveen comment
>
> Anyone wants to jump on the "Acked-by" bandwagon - speak now.
>
> -Tony
>
> arch/x86/kernel/cpu/mcheck/mce-severity.c | 15 +++++----------
> 1 file changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> index beb1f16..e2703520 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> @@ -110,22 +110,17 @@ static struct severity {
> /* known AR MCACODs: */
> #ifdef CONFIG_MEMORY_FAILURE
> MCESEV(
> - KEEP, "HT thread notices Action required: data load error",
> - SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
> - MCGMASK(MCG_STATUS_EIPV, 0)
> + KEEP, "Action required but unaffected thread is continuable",
> + SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR),

Why did we lose MCACOD_DATA from the MASK above? Was this intentional?

> + MCGMASK(MCG_STATUS_RIPV, MCG_STATUS_RIPV)

This change I can understand as restart IP is valid for thread is
continuable.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


2013-06-26 20:23:51

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH] x86/mce: Update MCE severity condition check

MCESEV(
> - KEEP, "HT thread notices Action required: data load error",
> - SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
> - MCGMASK(MCG_STATUS_EIPV, 0)
> + KEEP, "Action required but unaffected thread is continuable",
> + SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR),

> Why did we lose MCACOD_DATA from the MASK above? Was this intentional?


We used to have separate entries for "HT thread notices ... data load" and "HT thread notices ... instruction load"
because the old SDM had a complex table calling out the bit settings for each type of recoverable machine check.

Latest SDM simplifies the table making it clear that for every SRAR (software recoverable action required) error
we'll have the same bits in MCG_STATUS (EIPV=0, RIPV=1) ... so we don't need to check for the MCACOD value.
See attached snapshot of the new table.

-Tony


Attachments:
SRAR.png (16.72 kB)
SRAR.png

2013-06-26 20:36:45

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Update MCE severity condition check

On Wed, Jun 26, 2013 at 08:23:47PM +0000, Luck, Tony wrote:
> MCESEV(
> > - KEEP, "HT thread notices Action required: data load error",
> > - SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
> > - MCGMASK(MCG_STATUS_EIPV, 0)
> > + KEEP, "Action required but unaffected thread is continuable",
> > + SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR),
>
> > Why did we lose MCACOD_DATA from the MASK above? Was this intentional?
>
>
> We used to have separate entries for "HT thread notices ... data load" and "HT thread notices ... instruction load"
> because the old SDM had a complex table calling out the bit settings for each type of recoverable machine check.
>
> Latest SDM simplifies the table making it clear that for every SRAR (software recoverable action required) error
> we'll have the same bits in MCG_STATUS (EIPV=0, RIPV=1) ... so we don't need to check for the MCACOD value.
> See attached snapshot of the new table.

And this obviously is the case for the hardware too, I assume, not only
the SDM?

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-06-26 21:00:16

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH] x86/mce: Update MCE severity condition check

> And this obviously is the case for the hardware too, I assume, not only
> the SDM?

Yes - we have a magic process which reconfigures all deployed silicon whenever
a new SDM is published :-)

Actually the SDM had been collecting new features for each generation ... each
time just bolting on a new paragraph or table. I snapped when I saw the table
that was proposed for 15-20 to add "continuable" errors and complained that
it had gotten way too complicated ... and proposed the version that you see in
the current SDM.

It accurately portrays what older generations implemented, and adds the new
continuable (EIPV=1, RIPV=1) while removing many rows and columns.

-Tony
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-06-26 21:10:57

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Update MCE severity condition check

On Wed, Jun 26, 2013 at 09:00:10PM +0000, Luck, Tony wrote:
> > And this obviously is the case for the hardware too, I assume, not only
> > the SDM?
>
> Yes - we have a magic process which reconfigures all deployed silicon whenever
> a new SDM is published :-)

Haha, I wouldn't wonder if your silicon dudes come up with
reprogrammable fuses someday :-)

> Actually the SDM had been collecting new features for each generation ... each
> time just bolting on a new paragraph or table. I snapped when I saw the table
> that was proposed for 15-20 to add "continuable" errors and complained that
> it had gotten way too complicated ... and proposed the version that you see in
> the current SDM.
>
> It accurately portrays what older generations implemented, and adds the new
> continuable (EIPV=1, RIPV=1) while removing many rows and columns.

Cool :)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-06-27 06:50:43

by Chen, Gong

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Update MCE severity condition check

On Wed, Jun 26, 2013 at 11:10:52PM +0200, Borislav Petkov wrote:
> Date: Wed, 26 Jun 2013 23:10:52 +0200
> From: Borislav Petkov <[email protected]>
> To: "Luck, Tony" <[email protected]>
> Cc: "[email protected]" <[email protected]>, Chen
> Gong <[email protected]>, "Naveen N. Rao"
> <[email protected]>
> Subject: Re: [PATCH] x86/mce: Update MCE severity condition check
> User-Agent: Mutt/1.5.21 (2010-09-15)
>
> On Wed, Jun 26, 2013 at 09:00:10PM +0000, Luck, Tony wrote:
> > > And this obviously is the case for the hardware too, I assume, not only
> > > the SDM?
> >
> > Yes - we have a magic process which reconfigures all deployed silicon whenever
> > a new SDM is published :-)
>
> Haha, I wouldn't wonder if your silicon dudes come up with
> reprogrammable fuses someday :-)

If so, it must be a catastrophe and the guys who wrote SDM have a big
trouble ... ;-)


Attachments:
(No filename) (920.00 B)
signature.asc (836.00 B)
Digital signature
Download all attachments