2020-02-05 13:00:54

by Prarit Bhargava

Subject: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142

Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
"spurious corrected errors may be logged in the IA32_MC0_STATUS register
with the valid field (bit 63) set, the uncorrected error field (bit 61)
not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
an MCA Error Code (bits [15:0]) of 0x0005."

Block these spurious errors from the console and logs.

Links to Intel Specification updates:
HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html

Signed-off-by: Alexander Krupp <[email protected]>
Signed-off-by: Prarit Bhargava <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/kernel/cpu/mce/core.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c4f949611e4..d893cc764a06 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -121,6 +121,8 @@ static struct irq_work mce_irq_work;

 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);

+static int (*quirk_noprint)(struct mce *m);
+
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -232,6 +234,9 @@ struct mca_msr_regs msr_ops = {

 static void __print_mce(struct mce *m)
 {
+	if (quirk_noprint && quirk_noprint(m))
+		return;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
 		 m->extcpu,
 		 (m->mcgstatus & MCG_STATUS_MCIP ? " Exception" : ""),
@@ -1622,6 +1627,15 @@ static void quirk_sandybridge_ifu(int bank, struct mce *m, struct pt_regs *regs)
 	m->cs = regs->cs;
 }

+static int quirk_spurious_ce_noprint(struct mce *m)
+{
+	if (m->bank == 0 &&
+	    (m->status & 0xa0000000ffffffff) == 0x80000000000f0005)
+		return 1;
+
+	return 0;
+}
+
 /* Add per CPU specific workarounds here */
 static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 {
@@ -1696,6 +1710,13 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)

 		if (c->x86 == 6 && c->x86_model == 45)
 			quirk_no_way_out = quirk_sandybridge_ifu;
+
+		if ((c->x86 == 6) &&
+		    ((c->x86_model == 0x3c) || (c->x86_model == 0x3d) ||
+		     (c->x86_model == 0x45) || (c->x86_model == 0x46))) {
+			pr_info("MCE errata HSD131, HSM142, HSW131, or BDM48 enabled.\n");
+			quirk_noprint = quirk_spurious_ce_noprint;
+		}
 	}

 	if (c->x86_vendor == X86_VENDOR_ZHAOXIN) {
--
2.21.1
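
For readers checking the constants in quirk_spurious_ce_noprint(): the mask
and the compare value follow from the four fields quoted from the erratum
(valid bit 63, uncorrected bit 61, bits [31:16], bits [15:0]). The following
is a standalone userspace sketch, not kernel code. The macro names and
is_spurious_ce() are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of IA32_MC0_STATUS named in the erratum text (names are hypothetical). */
#define STATUS_VAL    (1ULL << 63)          /* valid field, bit 63 */
#define STATUS_UC     (1ULL << 61)          /* uncorrected error field, bit 61 */
#define STATUS_MSCOD  0x00000000ffff0000ULL /* Model Specific Error Code, bits [31:16] */
#define STATUS_MCACOD 0x000000000000ffffULL /* MCA Error Code, bits [15:0] */

/* Bits that are examined, and the value they must have to match the erratum. */
#define SPURIOUS_MASK  (STATUS_VAL | STATUS_UC | STATUS_MSCOD | STATUS_MCACOD)
#define SPURIOUS_VALUE (STATUS_VAL | (0x000fULL << 16) | 0x0005ULL)

/* The derived constants are the literals used in the patch. */
static_assert(SPURIOUS_MASK == 0xa0000000ffffffffULL, "mask mismatch");
static_assert(SPURIOUS_VALUE == 0x80000000000f0005ULL, "value mismatch");

/* Returns 1 when a bank/status pair has the signature described in the erratum. */
static int is_spurious_ce(unsigned int bank, uint64_t status)
{
	return bank == 0 && (status & SPURIOUS_MASK) == SPURIOUS_VALUE;
}
```

Because bit 61 is part of the mask but not of the compare value, a status
word with the uncorrected bit set never matches. Bits outside the mask, such
as the overflow bit 62, are ignored.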


2020-02-06 11:23:30

by Borislav Petkov

Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142

On Wed, Feb 05, 2020 at 07:58:31AM -0500, Prarit Bhargava wrote:

> Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142

That subject is unreadable for humans.

> Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
> "spurious corrected errors may be logged in the IA32_MC0_STATUS register
> with the valid field (bit 63) set, the uncorrected error field (bit 61)
> not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
> an MCA Error Code (bits [15:0]) of 0x0005."
>
> Block these spurious errors from the console and logs.

Are they being hit in the wild or why do we need this?

> Links to Intel Specification updates:
> HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
> HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
> HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
> BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html

Those links tend to get stale with time. If you really want to refer to
the PDFs, add a new bugzilla entry on https://bugzilla.kernel.org/, add
them there as an attachment and add the link to the entry to the commit
message.

> Signed-off-by: Alexander Krupp <[email protected]>

What's that Signed-off-by: tag supposed to mean?

> Signed-off-by: Prarit Bhargava <[email protected]>
> Cc: Tony Luck <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> arch/x86/kernel/cpu/mce/core.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)

If at all, this should be done by adding an intel_filter_mce() function
and called from filter_mce() so that such errors don't get logged.
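
A rough userspace sketch of the shape this suggestion could take, for
illustration only. struct mce_sample, intel_filter_sketch() and
filter_sketch() are hypothetical stand-ins, not the kernel's struct mce or
its real filter_mce() implementation. The model list (family 6, models 0x3c,
0x3d, 0x45, 0x46) is an assumption about the set of affected Haswell and
Broadwell parts the patch means to cover.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few fields this sketch needs. */
struct mce_sample {
	bool     is_intel; /* vendor check, simplified to a flag */
	uint8_t  family;
	uint8_t  model;
	uint8_t  bank;
	uint64_t status;   /* contents of the bank's status register */
};

/* Vendor-specific filter: true means "spurious, do not log". */
static bool intel_filter_sketch(const struct mce_sample *m)
{
	/* Assumed set of affected models, see the note above. */
	bool affected = m->family == 6 &&
			(m->model == 0x3c || m->model == 0x3d ||
			 m->model == 0x45 || m->model == 0x46);

	/* Bank 0, valid set, uncorrected clear, MSCOD 0x000f, MCACOD 0x0005. */
	return affected && m->bank == 0 &&
	       (m->status & 0xa0000000ffffffffULL) == 0x80000000000f0005ULL;
}

/* Generic entry point that dispatches to the vendor-specific filter. */
static bool filter_sketch(const struct mce_sample *m)
{
	if (m->is_intel)
		return intel_filter_sketch(m);

	return false;
}
```

Filtering at this point drops the record before it is logged at all, whereas
a check inside the print path only hides the console message.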

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-02-06 13:03:11

by Prarit Bhargava

Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142



On 2/6/20 6:10 AM, Borislav Petkov wrote:
> On Wed, Feb 05, 2020 at 07:58:31AM -0500, Prarit Bhargava wrote:
>
>> Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142
>
> That subject is unreadable for humans.

Yeah :/ I couldn't think of a better one. Maybe "Block spurious corrected
errors on some Intel processors"? Any other suggestion?

>
>> Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
>> "spurious corrected errors may be logged in the IA32_MC0_STATUS register
>> with the valid field (bit 63) set, the uncorrected error field (bit 61)
>> not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
>> an MCA Error Code (bits [15:0]) of 0x0005."
>>
>> Block these spurious errors from the console and logs.
>
> Are they being hit in the wild or why do we need this?

Alexander, cc'd, is being hit by this in the wild.

>
>> Links to Intel Specification updates:
>> HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
>> HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
>> HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
>> BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html
>
> Those links tend to get stale with time. If you really want to refer to
> the PDFs, add a new bugzilla entry on https://bugzilla.kernel.org/, add
> them there as an attachment and add the link to the entry to the commit
> message.
>
>> Signed-off-by: Alexander Krupp <[email protected]>
>
> What's that Signed-off-by: tag supposed to mean?
>
>> Signed-off-by: Prarit Bhargava <[email protected]>
>> Cc: Tony Luck <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>> arch/x86/kernel/cpu/mce/core.c | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>
> If at all, this should be done by adding an intel_filter_mce() function
> and called from filter_mce() so that such errors don't get logged.

I'll take a look over there.

P.

>
> Thx.
>

2020-02-06 13:09:08

by Prarit Bhargava

Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142



On 2/6/20 7:53 AM, Prarit Bhargava wrote:
>
>
> On 2/6/20 6:10 AM, Borislav Petkov wrote:
>> On Wed, Feb 05, 2020 at 07:58:31AM -0500, Prarit Bhargava wrote:
>>
>>> Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142
>>
>> That subject is unreadable for humans.
>
> Yeah :/ I couldn't think of a better one. Maybe "Block spurious corrected
> errors on some Intel processors"? Any other suggestion?
>
>>
>>> Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
>>> "spurious corrected errors may be logged in the IA32_MC0_STATUS register
>>> with the valid field (bit 63) set, the uncorrected error field (bit 61)
>>> not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
>>> an MCA Error Code (bits [15:0]) of 0x0005."
>>>
>>> Block these spurious errors from the console and logs.
>>
>> Are they being hit in the wild or why do we need this?
>
> Alexander, cc'd, is being hit by this in the wild.
>
>>
>>> Links to Intel Specification updates:
>>> HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
>>> HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
>>> HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
>>> BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html
>>
>> Those links tend to get stale with time. If you really want to refer to
>> the PDFs, add a new bugzilla entry on https://bugzilla.kernel.org/, add
>> them there as an attachment and add the link to the entry to the commit
>> message.
>>
>>> Signed-off-by: Alexander Krupp <[email protected]>
>>
>> What's that Signed-off-by: tag supposed to mean?

Sorry. I missed this question, but I really don't understand the question.
Alexander posted a patch in a kernel bugzilla @ Red Hat and I modified the patch
with some additional changes. I don't want him to lose credit for the work so
he's got a proper Signed-off-by tag for this patch.

P.

2020-02-06 13:29:00

by Borislav Petkov

Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142

On Thu, Feb 06, 2020 at 07:53:34AM -0500, Prarit Bhargava wrote:
> Yeah :/ I couldn't think of a better one. Maybe "Block spurious corrected
> errors on some Intel processors"? Any other suggestion?

"Do not log ..."

> Alexander, cc'd, is being hit by this in the wild.

Do say that in the commit message.

> >> Signed-off-by: Alexander Krupp <[email protected]>
> >
> > What's that Signed-off-by: tag supposed to mean?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You missed this one.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-02-06 14:09:32

by Borislav Petkov

Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142

On Thu, Feb 06, 2020 at 08:05:24AM -0500, Prarit Bhargava wrote:
> Sorry. I missed this question, but I really don't understand the question.
> Alexander posted a patch in a kernel bugzilla @ Red Hat and I modified the patch
> with some additional changes. I don't want him to lose credit for the work so
> he's got a proper Signed-off-by tag for this patch.

This is not how this is expressed. Either you write that in free text in
the commit message or you use Co-developed-by. More details in

Documentation/process/submitting-patches.rst

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette