2013-05-10 17:01:50

by Ming Lei

[permalink] [raw]
Subject: x86_mce: mce_start uses number of phsical cores instead of logical cores


I found this bug in my edac testing on intel 56xx motherboard. I did mce-inject test on AMD 64 CPU also. Thanks


Subject: [PATCH] mce_start times out waiting for the logical cores which have never received the mce broadcast. It can potentially turn non fatal mce exception into kernel panic complaining "Some CPUs didn't answer in synchronization". Here is the example of console log before and after the fix.

mce: [Hardware Error]: CPU 16: Machine Check Exception: 4 Bank 8: be0000000001009f
mce: [Hardware Error]: TSC 2d7ff062dc4 ADDR 622eac800 MISC 1004000006040
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1368135921 SOCKET 0 APIC 13 microcode 15
mce: [Hardware Error]: Some CPUs didn't answer in synchronization
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check on current CPU

[admin@amnesiac ~]# cd /sys/devices/system/edac/mc/mc0
[admin@amnesiac mc0]# echo 2 >inject_type
[admin@amnesiac mc0]# echo 3 >inject_section
[admin@amnesiac mc0]# echo 65600 >inject_eccmask
[admin@amnesiac mc0]# echo 1 >inject_enable
[admin@amnesiac mc0]# mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 8: be0000000001009f
mce: [Hardware Error]: TSC 12541d6511e ADDR 625f2d380 MISC 1004000001181
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1368141409 SOCKET 0 APIC 0 microcode 15
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal Machine check

I saw duplicate mce entries being logged when mce exception is fatal so solution is only allow the Monarch doing mce_log.
---
arch/x86/kernel/cpu/mcheck/mce.c | 12 ++++++++++-
1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 9239504..7a40ae5 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -796,6 +796,10 @@ static int mce_start(int *no_way_out)
if (!timeout)
return -1;

+#if NR_CPUS > 1
+ cpus /= cpumask_weight(cpu_core_mask(0)) / cpu_data(0).booted_cores;
+#endif
+
atomic_add(*no_way_out, &global_nwo);
/*
* global_nwo should be updated before mce_callin
@@ -871,6 +875,10 @@ static int mce_end(int order)
/* CHECKME: Can this race with a parallel hotplug? */
int cpus = num_online_cpus();

+#if NR_CPUS > 1
+ cpus /= cpumask_weight(cpu_core_mask(0)) /
+ cpu_data(0).booted_cores;
+#endif
+
/*
* Monarch: Wait for everyone to go through their scanning
* loops.
@@ -1113,7 +1121,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
mce_ring_add(m.addr >> PAGE_SHIFT);

- mce_log(&m);
+ if (atomic_read(&mce_executing) <= 1)
+ mce_log(&m);

if (severity > worst) {
*final = m;
--


2013-05-10 18:13:53

by Luck, Tony

[permalink] [raw]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

> +#if NR_CPUS > 1
> + cpus /= cpumask_weight(cpu_core_mask(0)) / cpu_data(0).booted_cores;
> +#endif

Not entirely sure what you are trying to do here (apart from making "cpus"
be a smaller number). What is the reasoning behind the right hand side of
this expression?

Is this problem more related to how EDAC is injecting an error? When I've
used other methods (e.g. ACPI/EINJ) I end up with a machine check that is
broadcast to all processors ... so "cpus = num_online_cpus()" is the correct[1]
number of processors to wait for.

-Tony

[1] Andi may point me (again) to a fix to help deal with the case that Linux
has taken some cpus offline. In that case this code is wrong as the "offline"
cpus will still show up for machine checks. But there are troubling corner
cases with the fix.

2013-05-10 18:42:30

by Ming Lei

[permalink] [raw]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

With hyperthread turns on, the num_online_cpus reports the number of all logical cores. What I found in testing is only half the cores receives the mce broadcast, so I assume only the physical cores get broadcast. I have two sockets 5646 onboard. num_online_cpus() returns 24 and I only get 12 cores enter do_machine_check. I used both edac error injection and hardware edac error injector as well in my testing.

cpumask_weight(cpu_core_mask(0)) / cpu_data(0).booted_cores returns the ratio between logical cores and physical cores. In my case it is two.

Here is intel spec:
Processor Number E5645
# of Cores 6
# of Threads 12

Ming

-----Original Message-----
From: Luck, Tony [mailto:[email protected]]
Sent: Friday, May 10, 2013 11:14 AM
To: Ming Lei; [email protected]
Cc: [email protected]; [email protected]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

> +#if NR_CPUS > 1
> + cpus /= cpumask_weight(cpu_core_mask(0)) / cpu_data(0).booted_cores;
> +#endif

Not entirely sure what you are trying to do here (apart from making "cpus"
be a smaller number). What is the reasoning behind the right hand side of this expression?

Is this problem more related to how EDAC is injecting an error? When I've used other methods (e.g. ACPI/EINJ) I end up with a machine check that is broadcast to all processors ... so "cpus = num_online_cpus()" is the correct[1] number of processors to wait for.

-Tony

[1] Andi may point me (again) to a fix to help deal with the case that Linux has taken some cpus offline. In that case this code is wrong as the "offline"
cpus will still show up for machine checks. But there are troubling corner cases with the fix.

2013-05-10 19:09:51

by Luck, Tony

[permalink] [raw]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

> With hyperthread turns on, the num_online_cpus reports the number of all logical cores.
> What I found in testing is only half the cores receives the mce broadcast, so I assume only the physical cores get broadcast.

See Intel Software Developer Manual Volume 3B Section 15.10.4.1, 3rd bullet:

o For processors on which CPUID reports DisplayFamily_DisplayModel as 06H_0EH and onward, an MCA signal is
broadcast to all logical processors in the system

Your E-5645 processors are a lot newer than this cut-off version - so they should broadcast
to all your threads.

You are seeing something very strange. It would be interesting to know *which* 12 cpus
show up for your machine check. Perhaps you are seeing all the hyperthreads from one
socket and none from the other?

I still suspect that something is strange in the EDAC error injection side of this problem
and that you are not getting a h/w initiated INT#18 event.

-Tony

2013-05-10 20:10:15

by Ming Lei

[permalink] [raw]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

I used intel edac error injector and saw the same problem. I actually wrote down the core numbers and I saw mce got to 0-5 and 12-17, but not the others. I have 2 sockets, 24 logical cores. Below is the trace I put into mce code. The core number is after "#".

Ming

344 :344 #4 ** 802097241816 (207303152230.v1) (207303152334) 4294874599 :24::::: mce_start do_machine_check
345 :345 #16 ** 802097241876 (207303152404.v1) (207303152426) 4294874599 :12:16:1:4:4: mce_start do_machine_check
346 :346 #0 ** 802097241914 (207303152271.v1) (207303152343) 4294874599 :24::::: mce_start do_machine_check
347 :347 #1 * 802097242074 (207303152515.v1) (207303152599) 4294874599 :8:-4755801206503178081:256::: mce_no_way_out do_machine_check
348 :348 #13 * 802097242098 (207303152512.v1) (207303152552) 4294874599 :7::::: mce_no_way_out do_machine_check
349 :349 #3 * 802097242282 (207303152630.v1) (207303152679) 4294874599 :7::::: mce_no_way_out do_machine_check
350 :350 #14 ** 802097242342 (207303152452.v1) (207303152520) 4294874599 :12:16:1:4:4: mce_start do_machine_check
351 :351 #2 * 802097242366 (207303152458.v1) (207303152537) 4294874599 :8:-4755801206503178081:256::: mce_no_way_out do_machine_check
352 :352 #0 ** 802097242774 (207303152627.v1) (207303152676) 4294874599 :12:16:1:4:4: mce_start do_machine_check
353 :353 #12 ** 802097242838 (207303152829.v1) (207303152853) 4294874599 :24::::: mce_start do_machine_check
354 :354 #15 ** 802097242890 (207303152676.v1) (207303152707) 4294874599 :24::::: mce_start do_machine_check
355 :355 #4 ** 802097243056 (207303152747.v1) (207303152825) 4294874599 :12:16:1:4:4: mce_start do_machine_check
356 :356 #2 ** 802097243386 (207303152881.v1) (207303153006) 4294874599 :24::::: mce_start do_machine_check
357 :357 #17 ** 802097243546 (207303152953.v1) (207303153023) 4294874599 :24::::: mce_start do_machine_check
358 :358 #5 ** 802097243566 (207303152963.v1) (207303153041) 4294874599 :24::::: mce_start do_machine_check
359 :359 #15 ** 802097243922 (207303153107.v1) (207303153193) 4294874599 :12:21:1:9:9: mce_start do_machine_check
360 :360 #3 * 802097243994 (207303153342.v1) (207303153356) 4294874599 :8:-4755801206503178081:256::: mce_no_way_out do_machine_check
361 :361 #13 * 802097244074 (207303153175.v1) (207303153242) 4294874599 :8:-4755801206503178081:256::: mce_no_way_out do_machine_check
362 :362 #1 ** 802097244050 (207303153167.v1) (207303153229) 4294874599 :24::::: mce_start do_machine_check
363 :363 #12 ** 802097244174 (207303153212.v1) (207303153284) 4294874599 :12:22:1:9:9: mce_start do_machine_check
364 :364 #2 ** 802097244490 (207303153347.v1) (207303153419) 4294874599 :12:22:1:10:10: mce_start do_machine_check
365 :365 #1 ** 802097244746 (207303153452.v1) (207303153521) 4294874599 :12:22:1:10:10: mce_start do_machine_check
366 :366 #5 ** 802097244834 (207303153488.v1) (207303153558) 4294874599 :12:22:1:10:10: mce_start do_machine_check
367 :367 #17 ** 802097244902 (207303153645.v1) (207303153665) 4294874599 :12:22:1:10:10: mce_start do_machine_check
368 :368 #3 ** 802097245130 (207303153611.v1) (207303153680) 4294874599 :24::::: mce_start do_machine_check
369 :369 #13 ** 802097245302 (207303153681.v1) (207303153760) 4294874599 :24::::: mce_start do_machine_check
370 :370 #3 ** 802097245710 (207303153857.v1) (207303153979) 4294874599 :12:24:1:12:12: mce_start do_machine_check
371 :371 #13 ** 802097246234 (207303154072.v1) (207303154141) 4294874599 :12:24:1:12:12: mce_start do_machine_check
372 :372 #15 *** 802097246542 (207303154201.v1) (207303154283) 4294874599 :12:5:::: mce_start do_machine_check
373 :373 #3 *** 802097246614 (207303154539.v1) (207303154565) 4294874599 :12:11:::: mce_start do_machine_check
374 :374 #2 *** 802097246678 (207303154265.v1) (207303154331) 4294874599 :12:9:::: mce_start do_machine_check
375 :375 #13 *** 802097246794 (207303154313.v1) (207303154376) 4294874599 :12:12:::: mce_start do_machine_check
376 :376 #1 *** 802097246814 (207303154325.v1) (207303154388) 4294874599 :12:10:::: mce_start do_machine_check
377 :377 #0 *** 802097246898 (207303154350.v1) (207303154420) 4294874599 :12:4:::: mce_start do_machine_check
378 :378 #12 *** 802097246966 (207303154614.v1) (207303154640) 4294874599 :12:6:::: mce_start do_machine_check
379 :379 #4 *** 802097247044 (207303154416.v1) (207303154481) 4294874599 :12:3:::: mce_start do_machine_check
380 :380 #16 *** 802097247064 (207303154429.v1) (207303154494) 4294874599 :12:1:::: mce_start do_machine_check
381 :381 #17 *** 802097247226 (207303154669.v1) (207303154696) 4294874599 :12:7:::: mce_start do_machine_check
382 :382 #14 *** 802097247250 (207303154495.v1) (207303154575) 4294874599 :12:2:::: mce_start do_machine_check
383 :383 #5 *** 802097247574 (207303154632.v1) (207303154666) 4294874599 :12:8:::: mce_start do_machine_check
384 :384 #16 **** 802097247812 (207303154735.v1) (207303154768) 4294874599 :12:1:::: mce_start do_machine_check
385 :385 #16 *** 802097258184 (207303159067.v1) (207303159094) 4294874599 :8:-4755801206503178081:6::: do_machine_check machine_check
386 :386 #16 * 802097260944 (207303160222.v1) (207303160255) 4294874599 :1:2000000000:1::: mce_end do_machine_check
387 :387 #14 **** 802097261950 (207303160640.v1) (207303160714) 4294874599 :12:2:::: mce_start do_machine_check
388 :388 #16 ** 802097262056 (207303160686.v1) (207303160750) 4294874599 :12::::: mce_end do_machine_check
389 :389 #14 *** 802097263530 (207303161304.v1) (207303161334) 4294874599 :8:-4755801206503178081:6::: do_machine_check machine_check
390 :390 #14 * 802097265926 (207303162305.v1) (207303162331) 4294874599 :2:2000000000:2::: mce_end do_machine_check
391 :391 #4 **** 802097266672 (207303162615.v1) (207303162645) 4294874599 :12:3:::: mce_start do_machine_check
392 :392 #4 *** 802097267796 (207303163087.v1) (207303163119) 4294874599 :8:-4755801206503178081:6::: do_machine_check machine_check
393 :393 #4 * 802097269420 (207303163764.v1) (207303163794) 4294874599 :3:2000000000:3::: mce_end do_machine_check
394 :394 #0 **** 802097270254 (207303164111.v1) (207303164139) 4294874599 :12:4:::: mce_start do_machine_check
395 :395 #0 *** 802097271566 (207303164659.v1) (207303164726) 4294874599 :8:-4755801206503178081:6::: do_machine_check machine_check
396 :396 #0 * 802097273954 (207303165660.v1) (207303165690) 4294874599 :4:2000000000:4::: mce_end do_machine_check
397 :397 #15 **** 802097275214 (207303166183.v1) (207303166211) 4294874599 :12:5:::: mce_start do_machine_check
398 :398 #15 *** 802097276598 (207303166764.v1) (207303166826) 4294874599 :8:-4755801206503178081:6::: do_machine_check machine_check
399 :399 #15 * 802097278818 (207303167688.v1) (207303167720) 4294874599 :5:2000000000:5::: mce_end do_machine_check
400 :400 #12 **** 802097279702 (207303168057.v1) (207303168122) 4294874599 :12:6:::: mce_start do_machine_check


-----Original Message-----
From: Luck, Tony [mailto:[email protected]]
Sent: Friday, May 10, 2013 12:10 PM
To: Ming Lei; [email protected]
Cc: [email protected]; [email protected]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

> With hyperthread turns on, the num_online_cpus reports the number of all logical cores.
> What I found in testing is only half the cores receives the mce broadcast, so I assume only the physical cores get broadcast.

See Intel Software Developer Manual Volume 3B Section 15.10.4.1, 3rd bullet:

o For processors on which CPUID reports DisplayFamily_DisplayModel as 06H_0EH and onward, an MCA signal is
broadcast to all logical processors in the system

Your E-5645 processors are a lot newer than this cut-off version - so they should broadcast to all your threads.

You are seeing something very strange. It would be interesting to know *which* 12 cpus show up for your machine check. Perhaps you are seeing all the hyperthreads from one socket and none from the other?

I still suspect that something is strange in the EDAC error injection side of this problem and that you are not getting a h/w initiated INT#18 event.

-Tony

2013-05-10 21:04:47

by Luck, Tony

[permalink] [raw]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

> I used intel edac error injector and saw the same problem. I actually wrote down the core numbers
> and I saw mce got to 0-5 and 12-17, but not the others. I have 2 sockets, 24 logical cores.

Mauro: How does the EDAC injector work on E5645 (Westmere-EP)? Does it create a real
error in memory that the processor then accesses ... tripping a machine check?

The mapping of Linux logical cpu numbers to physical socket/core/thread is somewhat
as the mercy of the order that the BIOS lists things in its tables. But those numbers look
very much like you just saw the machine check on one socket.

Look at /proc/cpuinfo to be sure. If you run "grep ^physical /proc/cpuinfo" I think you'll
see output like this (confirming that only socket 0 saw the machine check):

physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1

2013-05-10 21:58:11

by Ming Lei

[permalink] [raw]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

You are right, here is the boot information confirming this:
smpboot: Booting Node 0, Processors #1 #2 #3 #4 #5 OK
smpboot: Booting Node 1, Processors #6 #7 #8 #9 #10 #11 OK
smpboot: Booting Node 0, Processors #12 #13 #14 #15 #16 #17 OK
smpboot: Booting Node 1, Processors #18 #19 #20 #21 #22 #23 OK

So only one socket gets the machine check. So is there still a problem but the fix will be different? I think the error inject creates a real machine check, but since each CPU has its own memory controller, the machine check may only send to the CPU the error happens.

Ming


-----Original Message-----
From: Luck, Tony [mailto:[email protected]]
Sent: Friday, May 10, 2013 2:05 PM
To: Ming Lei; [email protected]
Cc: [email protected]; [email protected]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

> I used intel edac error injector and saw the same problem. I actually
> wrote down the core numbers and I saw mce got to 0-5 and 12-17, but not the others. I have 2 sockets, 24 logical cores.

Mauro: How does the EDAC injector work on E5645 (Westmere-EP)? Does it create a real error in memory that the processor then accesses ... tripping a machine check?

The mapping of Linux logical cpu numbers to physical socket/core/thread is somewhat as the mercy of the order that the BIOS lists things in its tables. But those numbers look very much like you just saw the machine check on one socket.

Look at /proc/cpuinfo to be sure. If you run "grep ^physical /proc/cpuinfo" I think you'll see output like this (confirming that only socket 0 saw the machine check):

physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1

2013-05-10 22:42:03

by Luck, Tony

[permalink] [raw]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

> So only one socket gets the machine check. So is there still a problem but the fix will be different?
> I think the error inject creates a real machine check, but since each CPU has its own memory controller,
> the machine check may only send to the CPU the error happens.

If there is a real machine check, then it must go to all logical cpus. If it doesn't get there, then there
is a h/w (or possibly f/w configuration) problem. Interesting that few others have seen this. Perhaps
because it only shows up in a fatal path and the machine is crashing anyway. A Google search for
the "Some CPUs didn't answer in synchronization" message does have a few hits that look relevant,
but following a few didn't give me enough details on machine configuration to tell whether they
match what you are seeing.

If there are many machines that do this - then we may need a workaround in Linux code for them.
Who is the manufacturer of the motherboard and/or system you are using?

But the current code that expects to see the machine check on all logical cpus is correct (and works
as is on other machines that are following the specification).

-Tony

2013-05-11 06:01:59

by Ming Lei

[permalink] [raw]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

I read the section you pointed out:

15.10.4.1 Machine-Check Exception Handler for Error Recovery
When writing a machine-check exception (MCE) handler to support software recovery from Uncorrected Recover- able (UCR) errors, consider the following:
For processors on which CPUID reports DisplayFamily_DisplayModel as 06H_0EH and onward, an MCA signal is broadcast to all logical processors in the system. Due to the potentially shared machine check MSR resources among the logical processors on the same package/core, the MCE handler may be required to synchronize with the other processors that received a machine check error and serialize access to the machine check registers when analyzing, logging and clearing the information in the machine check registers.

What I understand from above in intel 64 Arch software Developer's manual are:
1) this manual is written for software developer;
2) It says that MCE handler only requires to synchronize among the logical cores in the same package/core(what I assume here is same CPU socket).

I have two CPU sockets on motherboard and total 24 logical cores(12 cores each CPU). Each CPU has its own integrated memory controller. Each memory controller controls three channels of DIMMs. I can understand that if one dimm has error, the memory controller can trigger the MCE exception to it's own CPU, but why should this memory controller sends the MCE exception to the other CPU or the rest CPUs on the motherboard? Is there any hardware standard or specification for it?

Ming

-----Original Message-----
From: Luck, Tony [mailto:[email protected]]
Sent: Friday, May 10, 2013 3:42 PM
To: Ming Lei; [email protected]
Cc: [email protected]; [email protected]
Subject: RE: x86_mce: mce_start uses number of phsical cores instead of logical cores

> So only one socket gets the machine check. So is there still a problem but the fix will be different?
> I think the error inject creates a real machine check, but since each
> CPU has its own memory controller, the machine check may only send to the CPU the error happens.

If there is a real machine check, then it must go to all logical cpus. If it doesn't get there, then there is a h/w (or possibly f/w configuration) problem. Interesting that few others have seen this. Perhaps because it only shows up in a fatal path and the machine is crashing anyway. A Google search for the "Some CPUs didn't answer in synchronization" message does have a few hits that look relevant, but following a few didn't give me enough details on machine configuration to tell whether they match what you are seeing.

If there are many machines that do this - then we may need a workaround in Linux code for them.
Who is the manufacturer of the motherboard and/or system you are using?

But the current code that expects to see the machine check on all logical cpus is correct (and works as is on other machines that are following the specification).

-Tony

2013-05-11 17:32:31

by Tony Luck

[permalink] [raw]
Subject: Re: x86_mce: mce_start uses number of phsical cores instead of logical cores

> What I understand from above in intel 64 Arch software Developer's manual are:
> 1) this manual is written for software developer;
> 2) It says that MCE handler only requires to synchronize among the logical cores in the same package/core(what I assume here is same CPU socket).
>
> I have two CPU sockets on motherboard and total 24 logical cores(12 cores each CPU). Each CPU has its own integrated memory controller. Each memory controller controls three channels of DIMMs. I can understand that if one dimm has error, the memory controller can trigger the MCE exception to it's own CPU, but why should this memory controller sends the MCE exception to the other CPU or the rest CPUs on the motherboard? Is there any hardware standard or specification for it?

The Software Developer Manual is the specification of the architecture
- there are data sheets for each processor which describe
implementation details (e.g. perhaps which types of errors are
reported in whcih banks, an MCi_STATUS.MSCOD field values providing
more information about an error).

Your "1&2" understanding is correct. Your question on "why should this
memory controller send the MCE exception ..." is a good one. The
answer is because the architecture requires it; even though you and I
can imagine that it is possible for OS to do its work if the error is
just sent to the processors on the socket where the error was found in
some cases. There may be some cases where this is less easy (e.g. a
logical processor on one socket issues a NUMA read to a location that
is on the memory controller on the other socket).

-Tony