2023-04-12 15:17:48

by Paul Menzel

Subject: AMD EPYC 25 (19h): Hardware Error: Machine Check: 0 Bank 17: d42040000000011b

Dear Linux folks,


On a Dell PowerEdge R7525 with AMD EPYC 7763 64-Core Processor, Linux
5.15.94 logs the machine check exceptions (MCE) below:

```
[5154053.127240] mce: [Hardware Error]: Machine check events logged
[5154053.133711] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 17: d42040000000011b
[5154053.141948] mce: [Hardware Error]: TSC 0 ADDR b3cbdbbc0 PPIN 2b615bef7f48098 SYND 6bd210000a801002 IPID 9600650f00
[5154053.152893] mce: [Hardware Error]: PROCESSOR 2:a00f11 TIME 1679213602 SOCKET 0 APIC 6 microcode a001173
[5608214.292978] mce: [Hardware Error]: Machine check events logged
[5608214.299771] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 17: d42040000000011b
[5608214.308066] mce: [Hardware Error]: TSC 0 ADDR b3cbdbbc0 PPIN 2b615bef7f48098 SYND 6bd210000a801002 IPID 9600650f00
[5608214.319102] mce: [Hardware Error]: PROCESSOR 2:a00f11 TIME 1679667766 SOCKET 0 APIC 6 microcode a001173
[5707500.646385] mce: [Hardware Error]: Machine check events logged
[5707500.652973] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 17: d42041000000011b
[5707500.661238] mce: [Hardware Error]: TSC 0 ADDR b3cbdbbc0 PPIN 2b615bef7f48098 SYND 6bd210000a801002 IPID 9600650f00
[5707500.672271] mce: [Hardware Error]: PROCESSOR 2:a00f11 TIME 1679767053 SOCKET 0 APIC 6 microcode a001173
[5810063.788078] mce: [Hardware Error]: Machine check events logged
[5810063.794698] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 17: d42041000000011b
[5810063.803126] mce: [Hardware Error]: TSC 0 ADDR b3cbdbbc0 PPIN 2b615bef7f48098 SYND 6bd210000a801002 IPID 9600650f00
[5810063.814264] mce: [Hardware Error]: PROCESSOR 2:a00f11 TIME 1679869617 SOCKET 0 APIC 6 microcode a001173
```

Does GNU/Linux offer a way to decode this automatically?


Kind regards,

Paul


PS:

```
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7763 64-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3529.0520
CPU min MHz: 1500.0000
BogoMIPS: 4890.81
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb
rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid
aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2
x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm
extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit
wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb
cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall
fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap
clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc
cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru
wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean
flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif
v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor
smca
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 4 MiB (128 instances)
L1i: 4 MiB (128 instances)
L2: 64 MiB (128 instances)
L3: 512 MiB (16 instances)
NUMA:
NUMA node(s): 8
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
```


2023-04-12 16:36:52

by Borislav Petkov

Subject: Re: AMD EPYC 25 (19h): Hardware Error: Machine Check: 0 Bank 17: d42040000000011b

On Wed, Apr 12, 2023 at 05:11:26PM +0200, Paul Menzel wrote:
> On a Dell PowerEdge R7525 with AMD EPYC 7763 64-Core Processor, Linux
> 5.15.94 logs the machine check exceptions (MCE) below:
>
> ```
> [5154053.127240] mce: [Hardware Error]: Machine check events logged
> [5154053.133711] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 17:
> d42040000000011b
> [5154053.141948] mce: [Hardware Error]: TSC 0 ADDR b3cbdbbc0 PPIN
> 2b615bef7f48098 SYND 6bd210000a801002 IPID 9600650f00

Build the latest kernel with CONFIG_X86_MCE_INJECT and
CONFIG_EDAC_DECODE_MCE enabled and CONFIG_RAS_CEC *disabled*. Then boot
it on that machine and do the following.

The files are in debugfs:

/sys/kernel/debug/mce-inject/
├── addr
├── bank
├── cpu
├── flags
├── ipid
├── misc
├── README
├── status
└── synd

so you go and do

echo 0xd42040000000011b > status
echo 0xb3cbdbbc0 > addr
echo 3 > cpu
echo "sw" > flags
echo 0x6bd210000a801002 > synd
echo 0x9600650f00 > ipid
echo 17 > bank

Remember to keep the bank write last because this one injects the error.

It should dump the decoded error in dmesg.
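
For several logged events, the same write sequence can be scripted. The
following is only a minimal sketch, not part of the kernel tooling: the
function name `inject_sw_mce` and its `inject_dir` parameter are made up
for illustration. It assumes debugfs is mounted at /sys/kernel/debug, that
it runs as root, and that the values are passed as the same strings one
would echo by hand.

```python
from pathlib import Path

# Order matters: the write to "bank" triggers the injection, so it goes last.
WRITE_ORDER = ("status", "addr", "cpu", "flags", "synd", "ipid", "bank")


def inject_sw_mce(fields, inject_dir="/sys/kernel/debug/mce-inject"):
    """Write one logged error signature to the mce-inject files.

    `fields` maps a file name to the string to write. "flags" defaults to
    "sw" (decode only). Returns the file names in the order written.
    """
    values = {"flags": "sw", **fields}
    missing = [name for name in WRITE_ORDER if name not in values]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    written = []
    for name in WRITE_ORDER:
        # Equivalent to: echo VALUE > inject_dir/NAME
        (Path(inject_dir) / name).write_text(f"{values[name]}\n")
        written.append(name)
    return written


# Example call with the values from the first logged event:
# inject_sw_mce({
#     "status": "0xd42040000000011b",
#     "addr": "0xb3cbdbbc0",
#     "cpu": "3",
#     "synd": "0x6bd210000a801002",
#     "ipid": "0x9600650f00",
#     "bank": "17",
# })
```

Keeping the order in one tuple makes it hard to write "bank" too early by
accident.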

Alternatively, if you have CONFIG_EDAC_DECODE_MCE enabled on the
machine and you boot with "ras=cec_disable", it would decode it
automatically so you don't have to do it yourself.

Below is the full help text on how to do the injection.

And yeah, I know, this is not a very user-friendly way to decode
those but we're working on one...

HTH.

static const char readme_msg[] =
"Description of the files and their usages:\n"
"\n"
"Note1: i refers to the bank number below.\n"
"Note2: See respective BKDGs for the exact bit definitions of the files below\n"
"as they mirror the hardware registers.\n"
"\n"
"status:\t Set MCi_STATUS: the bits in that MSR control the error type and\n"
"\t attributes of the error which caused the MCE.\n"
"\n"
"misc:\t Set MCi_MISC: provide auxiliary info about the error. It is mostly\n"
"\t used for error thresholding purposes and its validity is indicated by\n"
"\t MCi_STATUS[MiscV].\n"
"\n"
"synd:\t Set MCi_SYND: provide syndrome info about the error. Only valid on\n"
"\t Scalable MCA systems, and its validity is indicated by MCi_STATUS[SyndV].\n"
"\n"
"addr:\t Error address value to be written to MCi_ADDR. Log address information\n"
"\t associated with the error.\n"
"\n"
"cpu:\t The CPU to inject the error on.\n"
"\n"
"bank:\t Specify the bank you want to inject the error into: the number of\n"
"\t banks in a processor varies and is family/model-specific, therefore, the\n"
"\t supplied value is sanity-checked. Setting the bank value also triggers the\n"
"\t injection.\n"
"\n"
"flags:\t Injection type to be performed. Writing to this file will trigger a\n"
"\t real machine check, an APIC interrupt or invoke the error decoder routines\n"
"\t for AMD processors.\n"
"\n"
"\t Allowed error injection types:\n"
"\t - \"sw\": Software error injection. Decode error to a human-readable \n"
"\t format only. Safe to use.\n"
"\t - \"hw\": Hardware error injection. Causes the #MC exception handler to \n"
"\t handle the error. Be warned: might cause system panic if MCi_STATUS[PCC] \n"
"\t is set. Therefore, consider setting (debugfs_mountpoint)/mce/fake_panic \n"
"\t before injecting.\n"
"\t - \"df\": Trigger APIC interrupt for Deferred error. Causes deferred \n"
"\t error APIC interrupt handler to handle the error if the feature is \n"
"\t is present in hardware. \n"
"\t - \"th\": Trigger APIC interrupt for Threshold errors. Causes threshold \n"
"\t APIC interrupt handler to handle the error. \n"
"\n"
"ipid:\t IPID (AMD-specific)\n"
"\n";

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-04-14 09:33:27

by Paul Menzel

Subject: Re: AMD EPYC 25 (19h): Hardware Error: Machine Check: 0 Bank 17: d42040000000011b

Dear Borislav,


Thank you for your quick and helpful reply.

On 12.04.23 at 18:32, Borislav Petkov wrote:
> On Wed, Apr 12, 2023 at 05:11:26PM +0200, Paul Menzel wrote:
>> On a Dell PowerEdge R7525 with AMD EPYC 7763 64-Core Processor, Linux
>> 5.15.94 logs the machine check exceptions (MCE) below:
>>
>> ```
>> [5154053.127240] mce: [Hardware Error]: Machine check events logged
>> [5154053.133711] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 17: d42040000000011b
>> [5154053.141948] mce: [Hardware Error]: TSC 0 ADDR b3cbdbbc0 PPIN 2b615bef7f48098 SYND 6bd210000a801002 IPID 9600650f00
>
> Build the latest kernel with CONFIG_X86_MCE_INJECT and
> CONFIG_EDAC_DECODE_MCE enabled and CONFIG_RAS_CEC *disabled*. Then boot
> it on that machine with and do the following below.
>
> The files are in debugfs:
>
> /sys/kernel/debug/mce-inject/
> ├── addr
> ├── bank
> ├── cpu
> ├── flags
> ├── ipid
> ├── misc
> ├── README
> ├── status
> └── synd
>
> so you go and do
>
> echo 0xd42040000000011b > status
> echo 0xb3cbdbbc0 > addr
> echo 3 > cpu
> echo "sw" > flags
> echo 0x6bd210000a801002 > synd
> echo 0x9600650f00 > ipid
> echo 17 > bank
>
> Remember to keep the bank write last because this one injects the error.
>
> It should dump the decoded error in dmesg.

Yes, that worked:

```
[ 436.584741] mce: [Hardware Error]: Machine check events logged
[ 436.590638] [Hardware Error]: Corrected error, no action required.
[ 436.596869] [Hardware Error]: CPU:3 (19:1:1) MC17_STATUS[Over|CE|-|AddrV|-|SyndV|CECC|-|-|-]: 0xd42040000000011b
[ 436.607083] [Hardware Error]: Error Addr: 0x0000000b3cbdbbc0
[ 436.612763] [Hardware Error]: IPID: 0x0000009600650f00, Syndrome: 0x6bd210000a801002
[ 436.620569] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 436.628942] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
```

It says “no action required”, but out of 14 identical servers with the
same workload, this is the only one that has shown these errors three times.

Maybe the DIMM at bank 17 should just be replaced.

[…]


Kind regards,

Paul

2023-04-14 10:28:59

by Borislav Petkov

Subject: Re: AMD EPYC 25 (19h): Hardware Error: Machine Check: 0 Bank 17: d42040000000011b

On Fri, Apr 14, 2023 at 11:26:27AM +0200, Paul Menzel wrote:
> It says “no action required”,

Yes, it means you had a single bit flip in some DIMM and it got
corrected by the ECC so you don't need to do anything.
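
The "corrected" verdict can be read directly off the logged status value.
As a rough sketch (the helper names `decode_status` and `is_corrected` are
made up here; the bit positions are the MCi_STATUS ones from the kernel's
asm/mce.h, including the AMD-specific SyndV/CECC/UECC bits):

```python
# Bit positions in MCi_STATUS (generic x86 bits plus AMD-specific ones).
STATUS_BITS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "En": 60,     # error reporting enabled
    "MiscV": 59,  # MCi_MISC is valid
    "AddrV": 58,  # MCi_ADDR is valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # MCi_SYND is valid (AMD)
    "CECC": 46,   # correctable ECC error (AMD)
    "UECC": 45,   # uncorrectable ECC error (AMD)
}


def decode_status(status):
    """Return a dict of flag name -> bool for a 64-bit status value."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}


def is_corrected(status):
    """True if the error is valid and neither UC nor PCC is set."""
    flags = decode_status(status)
    return flags["Val"] and not flags["UC"] and not flags["PCC"]


status = 0xD42040000000011B
set_flags = [name for name, is_set in decode_status(status).items() if is_set]
print(set_flags)             # flags that are set in the logged value
print(is_corrected(status))
```

For the value above, CECC is set while UC, PCC and UECC are clear, which is
what the "Corrected error" line reflects.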

> but out of the identical 14 servers with the same workload this is the
> only one having shown this errors three times.

Or you could enable CONFIG_RAS_CEC and not see those errors anymore.

It all depends: a DIMM could be producing correctable errors for a long
time before going bad. If ever. If you don't want to risk whatever
you're running on that machine by a DIMM *potentially* going bad, sure,
you can replace it. That's a budget call. :)

> Maybe the DIMM at bank 17 should just be replaced.

Bank 17 is the CPU MCA bank which reports the error - not a DIMM bank.
In order to pinpoint the location, you should have amd64_edac loaded so
that it decodes the error down to the DIMM. You could try loading that
module and injecting all the errors you have to see what it says; it
should work this way too, as the error signature has everything needed
for decoding, AFAICT.
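
To collect "all errors you have" from a saved dmesg, something along these
lines could pull out the fields needed for injection. This is a sketch
only: `parse_mce_records` and `unique_signatures` are hypothetical helpers,
and the regular expression assumes the exact "mce: [Hardware Error]" line
format shown at the top of the thread (every record having ADDR, SYND and
IPID). It tolerates lines wrapped by a mail client.

```python
import re

# One record: the "CPU n: Machine Check" line followed by ADDR/SYND/IPID.
RECORD = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ Bank (?P<bank>\d+): (?P<status>[0-9a-f]+)"
    r".*?ADDR (?P<addr>[0-9a-f]+)"
    r".*?SYND (?P<synd>[0-9a-f]+)"
    r".*?IPID (?P<ipid>[0-9a-f]+)"
)


def parse_mce_records(dmesg_text):
    """Return one dict per logged MCE, with hex values prefixed by 0x."""
    # Collapse all whitespace so wrapped lines are rejoined.
    flat = " ".join(dmesg_text.split())
    records = []
    for m in RECORD.finditer(flat):
        rec = {"cpu": m["cpu"], "bank": m["bank"]}
        for key in ("status", "addr", "synd", "ipid"):
            rec[key] = "0x" + m[key]
        records.append(rec)
    return records


def unique_signatures(records):
    """Drop repeated signatures, keeping first-seen order."""
    seen = {}
    for rec in records:
        seen.setdefault(tuple(rec.items()), rec)
    return list(seen.values())
```

Each resulting dict has the same keys as the mce-inject files (minus
"flags"), so it can be written out one signature at a time.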

But Yazen can chime in here if I'm off.

HTH.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-04-14 13:49:47

by Yazen Ghannam

Subject: Re: AMD EPYC 25 (19h): Hardware Error: Machine Check: 0 Bank 17: d42040000000011b

On 4/14/23 06:24, Borislav Petkov wrote:
> On Fri, Apr 14, 2023 at 11:26:27AM +0200, Paul Menzel wrote:
>> It says “no action required”,
>
> Yes, it means you had a single bit flip in some DIMM and it got
> corrected by the ECC so you don't need to do anything.
>
>> but out of the identical 14 servers with the same workload this is the
>> only one having shown this errors three times.
>
> Or you could enable CONFIG_RAS_CEC and don't see those errors anymore.
>
> It all depends: a DIMM could be producing correctable errors for a long
> time before going bad. If ever. If you don't want to risk whatever
> you're running on that machine by a DIMM *potentially* going bad, sure,
> you can replace it. That's a budget call. :)
>
>> Maybe the DIMM at bank 17 should just be replaced.
>
> Bank 17 is the CPU MCA bank which reports the error - not a DIMM bank.
> In order to pinpoint the location, you should have amd64_edac loaded so
> that it decodes to which DIMM. You could try loading that module and
> injecting all errors you have to see what it says, it should work this
> way too as the error signature has everything needed for decoding,
> AFAICT.
>
> But Yazen can chime in here if I'm off.
>

Yes, that's right with one caveat. The info from EDAC will show the
channel/DIMM from the SoC/CPU's perspective. This may not match what is
printed on the motherboard. The board vendor will need to provide that
information.
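
To see what EDAC itself reports per DIMM, the sysfs entries can be listed.
The sketch below is an illustration under assumptions: `read_dimm_info` is
a made-up helper, and it assumes the EDAC sysfs layout described in the
kernel's RAS admin guide (mc*/dimm*/ directories with dimm_label,
dimm_location and dimm_ce_count files). Files missing on a given kernel are
reported as None. The labels shown are the SoC-side view and still need the
board vendor's mapping to match a physical slot.

```python
from pathlib import Path


def read_dimm_info(edac_root="/sys/devices/system/edac/mc"):
    """Return one dict per EDAC DIMM entry found under edac_root."""

    def read(path):
        # Not every kernel exposes every file; treat missing ones as None.
        try:
            return path.read_text().strip()
        except OSError:
            return None

    rows = []
    for dimm in sorted(Path(edac_root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue
        rows.append({
            "mc": dimm.parent.name,
            "dimm": dimm.name,
            "label": read(dimm / "dimm_label"),
            "location": read(dimm / "dimm_location"),
            "ce_count": read(dimm / "dimm_ce_count"),
        })
    return rows


# Example:
# for row in read_dimm_info():
#     print(row)
```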

Thanks,
Yazen