LinuxLists.cc - [PATCH 0/3] Fix MCE handling for AMD multi-node processors

2014-12-22 19:57:00

Subject: [PATCH 0/3] Fix MCE handling for AMD multi-node processors

When a MCE happens that is to be logged onto bank 4 of AMD multi-node
processors, they are reported only to corresponding node base core of
the cpu on which the error occurred.

Refer D18F3x44[NbMcaToMstCpuEn] on BKDGs of Fam10h and later for
clarifications on the reporting of MC4 errors only to NBC MSRs.

We don't have the exception handler wired up to handle this currently.
As a consequence, do_machine_check only runs on the core on which error
occurred and (since according to the BKDGs, reads to MC4_STATUS MSR of
non-NBC will simply RAZ) the exception is ignored for the core.

This is a problem as now we have dropped MCEs.
I tested this on Fam10h and Fam15h using mce_amd_inj and by triggering
a real HW MCE using Boris' new interface; And can confirm the behavior.

This patch set fixes the issue by looking at the NBC MSRs when bank 4
errors happen on AMD multi node processors.

Patch 1: Refactor AMD cpu topology functions so that we can get some
relevant info that we need to use in EDAC, MC handler routines
Patch 2: The fix to our problem
Patch 3: Modify mce_amd_inj interfaces to write to only NBC for bank 4
errors. Only then will they be picked up for error handling.

Aravind Gopalakrishnan (3):
x86,amd: Refactor amd cpu topology functions for multi-node processors
x86, mce: Handle AMD MCE on bank4 on NBC for multi-node processors
edac, mce_amd_inj: Inject errors only on NBC for bank 4 errors

arch/x86/include/asm/processor.h | 1 +
arch/x86/kernel/cpu/amd.c | 78 ++++++++++++++----
arch/x86/kernel/cpu/mcheck/mce.c | 167 +++++++++++++++++++++++++++++++++++----
drivers/edac/mce_amd_inj.c | 21 ++++-
4 files changed, 235 insertions(+), 32 deletions(-)

--
2.0.2

2014-12-22 19:42:52

by Aravind Gopalakrishnan

[permalink] [raw]

Subject: [PATCH 2/3] x86, mce: Handle AMD MCE on bank4 on NBC for multi-node processors

The problem is that when MC4 error happens on AMD multi-node processors,
they are reported only to corresponding node base core of the cpu
on which the error occurred.

We don't have the exception handler wired up to handle this currently.
As a consequence, do_machine_check only runs on the core on which error
occurred and (since according to the BKDGs, reads to MC4_STATUS MSR of
non-NBC will simply RAZ) the exception is ignored for the core.

Now, when an exception _does_ happen on the node base core, that gets
handled properly. I tested this on Fam10h and Fam15h using mce_amd_inj
and by triggering a real HW MCE using Boris' new interface; And can
confirm the behavior. (Of course, with mce_amd_inj I could only inject
on the NBC as BKDGs state that writes are ignored. The following patch
addresses this problem)

This patch fixes the issue by reading (or writing) only to the NBC MSR
for bank 4 errors on multi node processors for Fam10h onwards.
For this, we need to rdmsrl_on_cpu and I have provided wrappers that
also supports the injection interface used by mce-inject.c

The handling of CE also has to be corrected as a result of this.
- Since MC4 errors are only processed on NBCs, we can theoritically
have multiple number of CEs in flight waiting to be handled (as many
as there are nodes on the system)
- Providing an array to save cpu on which CEs occur
- Later, when the poller runs, it only need to verify if the current
cpu is the saved one for the node on which current cpu is on. If yes,
handle the error.

Misc changes:
- Modify mce_read_aux to use bank value off existing mce struct argument
- modifying mce_clear_state to take a cpu argument as it will be needed
for clearing correct MSR STATUS register if we need_amd_nbc_handling

Testing details:
- Tested the patch on
- multi node: Fam10h, Fam15h Model 00h
- single node: Fam15h Model 60h
- Before patch, no error injected to non NBC is handled by MCE handler
- After patch, UE are handled immediately by looking at correct NBC
- For CEs cpu values are stored and handled by polling mechanism
later on the correct cpu on which error happened.

Signed-off-by: Aravind Gopalakrishnan <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 167 +++++++++++++++++++++++++++++++++++----
1 file changed, 153 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d2c6116..c5eedbd 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -87,6 +87,12 @@ static DECLARE_WAIT_QUEUE_HEAD(mce_chrdev_wait);
static DEFINE_PER_CPU(struct mce, mces_seen);
static int cpu_missing;

+/* AMD fam10h and later can only access NBC MSRs for MC4 errors */
+static int need_amd_nbc_handling;
+
+/* multiple CEs can be in flight on the per-node MSR waiting to be handled */
+static int *amd_saved_ce_on_cpu;
+
/* CMCI storm detection filter */
static DEFINE_PER_CPU(unsigned long, mce_polled_error);

@@ -428,6 +434,45 @@ static void mce_wrmsrl(u32 msr, u64 v)
wrmsrl(msr, v);
}

+/* MSR per-cpu access wrappers */
+static u64 mce_rdmsrl_on_cpu(unsigned int cpu, u32 msr)
+{
+ struct mce *reg = per_cpu_ptr(&injectm, cpu);
+ u64 v;
+
+ if (reg->finished) {
+ int offset = msr_to_offset(msr);
+
+ if (offset < 0)
+ return 0;
+ return *(u64 *)((char *)per_cpu_ptr(&injectm, cpu) + offset);
+ }
+
+ if (rdmsrl_on_cpu(cpu, msr, &v)) {
+ WARN_ONCE(1, "mce: Unable to read msr %d!\n", msr);
+ v = 0;
+ }
+
+ return v;
+}
+
+static void mce_wrmsrl_on_cpu(unsigned int cpu, u32 msr, u64 v)
+{
+ struct mce *reg = per_cpu_ptr(&injectm, cpu);
+
+ if (reg->finished) {
+ int offset = msr_to_offset(msr);
+
+ if (offset >= 0)
+ *(u64 *)((char *)per_cpu_ptr(&injectm, cpu) +
+ offset) = v;
+ return;
+ }
+
+ if (wrmsrl_on_cpu(cpu, msr, v))
+ WARN_ONCE(1, "mce: Unable to write to msr %d!\n", msr);
+}
+
/*
* Collect all global (w.r.t. this processor) status about this machine
* check into our "mce" struct so that we can use it later to assess
@@ -557,12 +602,19 @@ static void mce_report_event(struct pt_regs *regs)
/*
* Read ADDR and MISC registers.
*/
-static void mce_read_aux(struct mce *m, int i)
+static void mce_read_aux(struct mce *m)
{
if (m->status & MCI_STATUS_MISCV)
- m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
+ m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(m->bank));
if (m->status & MCI_STATUS_ADDRV) {
- m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+ if (m->bank == 4 && need_amd_nbc_handling) {
+ unsigned int nbc = amd_get_nbc_for_cpu(m->extcpu);
+
+ m->addr = mce_rdmsrl_on_cpu(nbc,
+ MSR_IA32_MCx_ADDR(m->bank));
+ } else {
+ m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(m->bank));
+ }

/*
* Mask the reported address by the reported granularity.
@@ -627,6 +679,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
{
struct mce m;
int severity;
+ u16 nid = 0;
int i;

this_cpu_inc(mce_poll_count);
@@ -643,7 +696,19 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
m.tsc = 0;

barrier();
- m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+ if (i == 4 && need_amd_nbc_handling) {
+ nid = amd_get_nb_id(m.extcpu);
+ if (amd_saved_ce_on_cpu[nid] != -1 &&
+ amd_saved_ce_on_cpu[nid] != m.extcpu)
+ continue;
+ else
+ m.status = mce_rdmsrl_on_cpu(
+ amd_get_nbc_for_cpu(m.extcpu),
+ MSR_IA32_MCx_STATUS(i));
+ } else {
+ m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+ }
+
if (!(m.status & MCI_STATUS_VAL))
continue;

@@ -658,7 +723,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
(m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC)))
continue;

- mce_read_aux(&m, i);
+ mce_read_aux(&m);

if (!(flags & MCP_TIMESTAMP))
m.tsc = 0;
@@ -686,7 +751,21 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
/*
* Clear state for this bank.
*/
- mce_wrmsrl(MSR_IA32_MCx_STATUS(i), 0);
+ if (i == 4 &&
+ need_amd_nbc_handling &&
+ amd_saved_ce_on_cpu[nid] != -1 &&
+ amd_saved_ce_on_cpu[nid] == m.extcpu) {
+ mce_wrmsrl_on_cpu(amd_get_nbc_for_cpu(m.extcpu),
+ MSR_IA32_MCx_STATUS(i),
+ 0);
+ /*
+ * We also need to reset
+ * saved_ce_on_cpu for future CEs
+ */
+ amd_saved_ce_on_cpu[nid] = -1;
+ } else {
+ mce_wrmsrl(MSR_IA32_MCx_STATUS(i), 0);
+ }
}

/*
@@ -708,7 +787,13 @@ static int mce_no_way_out(struct mce *m, char **msg, unsigned long *validp,
int i, ret = 0;

for (i = 0; i < mca_cfg.banks; i++) {
- m->status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+ if (i == 4 && need_amd_nbc_handling)
+ m->status = mce_rdmsrl_on_cpu(
+ amd_get_nbc_for_cpu(m->extcpu),
+ MSR_IA32_MCx_STATUS(i));
+ else
+ m->status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+
if (m->status & MCI_STATUS_VAL) {
__set_bit(i, validp);
if (quirk_no_way_out)
@@ -992,13 +1077,23 @@ static int mce_usable_address(struct mce *m)
return 1;
}

-static void mce_clear_state(unsigned long *toclear)
+/*
+ * If need_amd_nbc_handling is set, then we'll need a cpu
+ * value to be able to clean the status on the correct core
+ */
+static void mce_clear_state(unsigned long *toclear, unsigned int cpu)
{
int i;

for (i = 0; i < mca_cfg.banks; i++) {
- if (test_bit(i, toclear))
- mce_wrmsrl(MSR_IA32_MCx_STATUS(i), 0);
+ if (test_bit(i, toclear)) {
+ if (i == 4 && need_amd_nbc_handling)
+ mce_wrmsrl_on_cpu(cpu,
+ MSR_IA32_MCx_STATUS(i),
+ 0);
+ else
+ mce_wrmsrl(MSR_IA32_MCx_STATUS(i), 0);
+ }
}
}

@@ -1081,6 +1176,10 @@ void do_machine_check(struct pt_regs *regs, long error_code)
* error.
*/
int kill_it = 0;
+
+ /* for saving nbc value for AMD systems that need_amd_nbc_handling */
+ unsigned int nbc = 0;
+
DECLARE_BITMAP(toclear, MAX_NR_BANKS);
DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
char *msg = "Unknown";
@@ -1125,7 +1224,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
m.addr = 0;
m.bank = i;

- m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+ if (i == 4 && need_amd_nbc_handling) {
+ nbc = amd_get_nbc_for_cpu(m.extcpu);
+ m.status = mce_rdmsrl_on_cpu(nbc,
+ MSR_IA32_MCx_STATUS(i));
+ } else {
+ m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+ }
+
if ((m.status & MCI_STATUS_VAL) == 0)
continue;

@@ -1133,9 +1239,22 @@ void do_machine_check(struct pt_regs *regs, long error_code)
* Non uncorrected or non signaled errors are handled by
* machine_check_poll. Leave them alone, unless this panics.
*/
+
+ /*
+ * Also, If we are on AMD system and need_amd_nbc_handling is
+ * set, then keep track of the cpu number on which error
+ * happened so that polling mechanism can handle it on the
+ * correct core later on
+ */
if (!(m.status & (cfg->ser ? MCI_STATUS_S : MCI_STATUS_UC)) &&
- !no_way_out)
+ !no_way_out) {
+ if (i == 4 && need_amd_nbc_handling) {
+ u16 nid = amd_get_nb_id(m.extcpu);
+
+ amd_saved_ce_on_cpu[nid] = m.extcpu;
+ }
continue;
+ }

/*
* Set taint even when machine check was not enabled.
@@ -1160,7 +1279,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
continue;
}

- mce_read_aux(&m, i);
+ mce_read_aux(&m);

/*
* Action optional error. Queue address for later processing.
@@ -1184,7 +1303,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
m = *final;

if (!no_way_out)
- mce_clear_state(toclear);
+ mce_clear_state(toclear, nbc);

/*
* Do most of the synchronization with other CPUs.
@@ -1559,6 +1678,26 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)

/* This should be disabled by the BIOS, but isn't always */
if (c->x86_vendor == X86_VENDOR_AMD) {
+ if (c->x86 >= 0x10) {
+ int n_nodes = num_online_nodes();
+ int i;
+
+ /* we only need to initialize once */
+ if (!amd_saved_ce_on_cpu) {
+ amd_saved_ce_on_cpu = (int *)
+ kzalloc(sizeof(int) * n_nodes,
+ GFP_KERNEL);
+
+ if (!amd_saved_ce_on_cpu)
+ return -ENOMEM;
+
+ for (i = 0; i < n_nodes; i++)
+ amd_saved_ce_on_cpu[i] = -1;
+
+ need_amd_nbc_handling = 1;
+ }
+ }
+
if (c->x86 == 15 && cfg->banks > 4) {
/*
* disable GART TBL walk error reporting, which
--
2.0.2

2014-12-22 19:57:14

by Aravind Gopalakrishnan

[permalink] [raw]

Subject: [PATCH 3/3] edac, mce_amd_inj: Inject errors only on NBC for bank 4 errors

For Fam10h onwards, bank4 MCE are reported only to the node base
cores. This patch modifies the do_inject code path to take care
of this.

Refer D18F3x44[NbMcaToMstCpuEn] on BKDGs of Fam10h and later for
clarifications on the reporting of MC4 errors only to NBC MSRs.

Signed-off-by: Aravind Gopalakrishnan <[email protected]>
---
drivers/edac/mce_amd_inj.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/edac/mce_amd_inj.c b/drivers/edac/mce_amd_inj.c
index 0bd91a8..d4aa14f 100644
--- a/drivers/edac/mce_amd_inj.c
+++ b/drivers/edac/mce_amd_inj.c
@@ -126,6 +126,7 @@ static void do_inject(void)
{
u64 mcg_status = 0;
unsigned int cpu = i_mce.extcpu;
+ int saved_cpu = -1;
u8 b = i_mce.bank;

if (!(i_mce.inject_flags & MCJ_EXCEPTION)) {
@@ -143,22 +144,34 @@ static void do_inject(void)
if (!(i_mce.status & MCI_STATUS_PCC))
mcg_status |= MCG_STATUS_RIPV;

- toggle_hw_mce_inject(cpu, true);
-
wrmsr_on_cpu(cpu, MSR_IA32_MCG_STATUS,
(u32)mcg_status, (u32)(mcg_status >> 32));

+ /*
+ * for Fam10h+, if bank = 4, then we should write to NBC MSR
+ * for multi-node processors else to core 0 for single node processors
+ */
+ if (boot_cpu_data.x86 >= 0x10 && b == 4) {
+ saved_cpu = cpu;
+ cpu = amd_get_nbc_for_cpu(cpu);
+ }
+
+ toggle_hw_mce_inject(cpu, true);
+
wrmsr_on_cpu(cpu, MSR_IA32_MCx_STATUS(b),
(u32)i_mce.status, (u32)(i_mce.status >> 32));

wrmsr_on_cpu(cpu, MSR_IA32_MCx_ADDR(b),
(u32)i_mce.addr, (u32)(i_mce.addr >> 32));

+ toggle_hw_mce_inject(cpu, false);
+
+ if (saved_cpu != -1)
+ cpu = saved_cpu;
+
wrmsr_on_cpu(cpu, MSR_IA32_MCx_MISC(b),
(u32)i_mce.misc, (u32)(i_mce.misc >> 32));

- toggle_hw_mce_inject(cpu, false);
-
smp_call_function_single(cpu, trigger_mce, NULL, 0);

err:
--
2.0.2

2014-12-22 19:57:36

by Aravind Gopalakrishnan

[permalink] [raw]

Subject: [PATCH 1/3] x86,amd: Refactor amd cpu topology functions for multi-node processors

We would like to be able to provide a mechanism for finding the NBC for
each node in multi-node platforms. For this, we will need the number of
nodes per processor.

So, factor this calculation out of amd_get_topology() and provide separate
getter function 'amd_get_nbc_for_node' for obtaining NBC for a node.

Another useful calculation is to find the NBC of a node on which a given cpu
is on. Provide a getter for this as well.
- We are using it in EDAC drivers, MCE handlers right away, so marked
extern.
- We currently don't use amd_get_nbc_for_node outside of here, so
marking it static. We can make this extern as and when need arises.

While at it, reverse the return condition in amd_get_topology
and save an indent level.

Signed-off-by: Aravind Gopalakrishnan <[email protected]>
---
arch/x86/include/asm/processor.h | 1 +
arch/x86/kernel/cpu/amd.c | 78 ++++++++++++++++++++++++++++++++--------
2 files changed, 65 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a092a0c..6e1dc06 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -979,6 +979,7 @@ static inline int mpx_disable_management(struct task_struct *tsk)
#endif /* CONFIG_X86_INTEL_MPX */

extern u16 amd_get_nb_id(int cpu);
+extern unsigned int amd_get_nbc_for_cpu(int cpu);

static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
{
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index a220239..0c0eae5 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -281,6 +281,28 @@ static int nearby_node(int apicid)
}
#endif

+#ifdef CONFIG_X86_HT
+static u32 amd_get_num_nodes(struct cpuinfo_x86 *c)
+{
+ u32 nodes = 1;
+
+ if (cpu_has_topoext) {
+ u32 ecx;
+
+ ecx = cpuid_ecx(0x8000001e);
+ nodes = ((ecx >> 8) & 7) + 1;
+ } else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
+ u64 value;
+
+ rdmsrl(MSR_FAM10H_NODE_ID, value);
+ nodes = ((value >> 3) & 7) + 1;
+ }
+
+ return nodes;
+}
+EXPORT_SYMBOL_GPL(amd_get_num_nodes);
+#endif
+
/*
* Fixup core topology information for
* (1) AMD multi-node processors
@@ -291,15 +313,18 @@ static int nearby_node(int apicid)
static void amd_get_topology(struct cpuinfo_x86 *c)
{
u32 nodes, cores_per_cu = 1;
+ u32 cores_per_node;
+ u32 cus_per_node;
u8 node_id;
int cpu = smp_processor_id();

/* get information required for multi-node processors */
+ nodes = amd_get_num_nodes(c);
+
if (cpu_has_topoext) {
u32 eax, ebx, ecx, edx;

cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
- nodes = ((ecx >> 8) & 7) + 1;
node_id = ecx & 7;

/* get compute unit information */
@@ -310,27 +335,24 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
u64 value;

rdmsrl(MSR_FAM10H_NODE_ID, value);
- nodes = ((value >> 3) & 7) + 1;
node_id = value & 7;
} else
return;

/* fixup multi-node processor information */
- if (nodes > 1) {
- u32 cores_per_node;
- u32 cus_per_node;
+ if (nodes < 2)
+ return;

- set_cpu_cap(c, X86_FEATURE_AMD_DCM);
- cores_per_node = c->x86_max_cores / nodes;
- cus_per_node = cores_per_node / cores_per_cu;
+ set_cpu_cap(c, X86_FEATURE_AMD_DCM);
+ cores_per_node = c->x86_max_cores / nodes;
+ cus_per_node = cores_per_node / cores_per_cu;

- /* store NodeID, use llc_shared_map to store sibling info */
- per_cpu(cpu_llc_id, cpu) = node_id;
+ /* store NodeID, use llc_shared_map to store sibling info */
+ per_cpu(cpu_llc_id, cpu) = node_id;

- /* core id has to be in the [0 .. cores_per_node - 1] range */
- c->cpu_core_id %= cores_per_node;
- c->compute_unit_id %= cus_per_node;
- }
+ /* core id has to be in the [0 .. cores_per_node - 1] range */
+ c->cpu_core_id %= cores_per_node;
+ c->compute_unit_id %= cus_per_node;
}
#endif

@@ -365,6 +387,34 @@ u16 amd_get_nb_id(int cpu)
}
EXPORT_SYMBOL_GPL(amd_get_nb_id);

+#ifdef CONFIG_X86_HT
+static u32 amd_get_nbc_for_node(int node_id)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ u32 nodes, cores_per_node;
+
+ nodes = amd_get_num_nodes(c);
+ cores_per_node = c->x86_max_cores / nodes;
+
+ return cores_per_node * node_id;
+}
+EXPORT_SYMBOL_GPL(amd_get_nbc_for_node);
+#endif
+
+#ifdef CONFIG_X86_HT
+unsigned int amd_get_nbc_for_cpu(int cpu)
+{
+ unsigned int nbc = 0;
+
+ /* for multi-node */
+ if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_AMD_DCM))
+ nbc = amd_get_nbc_for_node(amd_get_nb_id(cpu));
+
+ return nbc;
+}
+EXPORT_SYMBOL_GPL(amd_get_nbc_for_cpu);
+#endif
+
static void srat_detect_node(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_NUMA
--
2.0.2

2014-12-22 20:15:46

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH 0/3] Fix MCE handling for AMD multi-node processors

On Mon, Dec 22, 2014 at 02:10:09PM -0600, Aravind Gopalakrishnan wrote:
> When a MCE happens that is to be logged onto bank 4 of AMD multi-node
> processors, they are reported only to corresponding node base core of
> the cpu on which the error occurred.
>
> Refer D18F3x44[NbMcaToMstCpuEn] on BKDGs of Fam10h and later for

Let me try to understand this correctly:

Does that mean that we could fix this by simply doing:

D18F3x44[NbMcaToMstCpuEn]=0b

on each NB?

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-12-22 20:56:57

by Aravind Gopalakrishnan

[permalink] [raw]

Subject: Re: [PATCH 0/3] Fix MCE handling for AMD multi-node processors

On 12/22/2014 2:15 PM, Borislav Petkov wrote:
> On Mon, Dec 22, 2014 at 02:10:09PM -0600, Aravind Gopalakrishnan wrote:
>> When a MCE happens that is to be logged onto bank 4 of AMD multi-node
>> processors, they are reported only to corresponding node base core of
>> the cpu on which the error occurred.
>>
>> Refer D18F3x44[NbMcaToMstCpuEn] on BKDGs of Fam10h and later for
> Let me try to understand this correctly:
>
> Does that mean that we could fix this by simply doing:
>
> D18F3x44[NbMcaToMstCpuEn]=0b
>
> on each NB?
>

Not quite..
When this field is 0, BKDG says the error may be reported to the core
that originated the request *if applicable and known*
Looking at the error signatures table for MC4 (Part 2),
we can see only some errors have 'ErrCoreId' column as valid

Besides, if IO originated the request, then it is reported only to NBC.

So, to take care of all these cases, I am just following one approach
here: and that is to look at NBC MSRs for any bank 4 errors.
(It seems to be what the BKDG recommends anyway as BIOS by default
should set D18F3x44[NbMcaToMstCpuEn])

Thanks,
-Aravind.

2014-12-22 23:19:34

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH 0/3] Fix MCE handling for AMD multi-node processors

On Mon, Dec 22, 2014 at 02:56:47PM -0600, Aravind Gopalakrishnan wrote:
> On 12/22/2014 2:15 PM, Borislav Petkov wrote:
> >On Mon, Dec 22, 2014 at 02:10:09PM -0600, Aravind Gopalakrishnan wrote:
> >>When a MCE happens that is to be logged onto bank 4 of AMD multi-node
> >>processors, they are reported only to corresponding node base core of
> >>the cpu on which the error occurred.
> >>
> >>Refer D18F3x44[NbMcaToMstCpuEn] on BKDGs of Fam10h and later for
> >Let me try to understand this correctly:
> >
> >Does that mean that we could fix this by simply doing:
> >
> >D18F3x44[NbMcaToMstCpuEn]=0b
> >
> >on each NB?
> >
>
> Not quite..
> When this field is 0, BKDG says the error may be reported to the core that
> originated the request *if applicable and known*
> Looking at the error signatures table for MC4 (Part 2),
> we can see only some errors have 'ErrCoreId' column as valid
>
> Besides, if IO originated the request, then it is reported only to NBC.
>
> So, to take care of all these cases, I am just following one approach here:
> and that is to look at NBC MSRs for any bank 4 errors.
> (It seems to be what the BKDG recommends anyway as BIOS by default should
> set D18F3x44[NbMcaToMstCpuEn])

Then in that case you have to check the case where
D18F3x44[NbMcaToMstCpuEn] is 0 for whatever reason (some BIOS forgot to
set it or whatever) and to set it again.

Then, upon a quick scan, your patches are adding a lot of vendor-specific
stuff which doesn't belong in the #MC handler, should probably be
wrapped or so, no good idea right now.

Then, you're using rd/wrmsr_on_cpu which does smp_call_function_single()
which can deadlock in atomic context and #MC is one.

Also, the math in amd_get_nbc_for_node() is too fragile and will break
the moment some BIOS renumbers cores to accomodate some other OS.

In any case, I won't be able to take a detailed look soon with the
holidays coming up.

Also, I'm wondering if this can't be solved much more elegantly
by detecting that condition (bank == 4) in the #MC handler and
issuing an IPI before exiting it using irq_work which will schedule
do_machine_check on the NBC. And that should be even easier to do since
we're moving the #MC handler out of the IST and to the normal kernel
stack for 3.20, which would make this endeavor pretty cheap.

Anyway, just a couple of thoughts...

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-12-23 19:58:00

by Aravind Gopalakrishnan

[permalink] [raw]

Subject: Re: [PATCH 0/3] Fix MCE handling for AMD multi-node processors

On 12/22/2014 5:19 PM, Borislav Petkov wrote:
> On Mon, Dec 22, 2014 at 02:56:47PM -0600, Aravind Gopalakrishnan wrote:
>> On 12/22/2014 2:15 PM, Borislav Petkov wrote:
>>> On Mon, Dec 22, 2014 at 02:10:09PM -0600, Aravind Gopalakrishnan wrote:
>>>> When a MCE happens that is to be logged onto bank 4 of AMD multi-node
>>>> processors, they are reported only to corresponding node base core of
>>>> the cpu on which the error occurred.
>>>>
>>>> Refer D18F3x44[NbMcaToMstCpuEn] on BKDGs of Fam10h and later for
>>> Let me try to understand this correctly:
>>>
>>> Does that mean that we could fix this by simply doing:
>>>
>>> D18F3x44[NbMcaToMstCpuEn]=0b
>>>
>>> on each NB?
>>>
>> Not quite..
>> When this field is 0, BKDG says the error may be reported to the core that
>> originated the request *if applicable and known*
>> Looking at the error signatures table for MC4 (Part 2),
>> we can see only some errors have 'ErrCoreId' column as valid
>>
>> Besides, if IO originated the request, then it is reported only to NBC.
>>
>> So, to take care of all these cases, I am just following one approach here:
>> and that is to look at NBC MSRs for any bank 4 errors.
>> (It seems to be what the BKDG recommends anyway as BIOS by default should
>> set D18F3x44[NbMcaToMstCpuEn])
> Then in that case you have to check the case where
> D18F3x44[NbMcaToMstCpuEn] is 0 for whatever reason (some BIOS forgot to
> set it or whatever) and to set it again.
Okay.

> Then, upon a quick scan, your patches are adding a lot of vendor-specific
> stuff which doesn't belong in the #MC handler, should probably be
> wrapped or so, no good idea right now.
>
> Then, you're using rd/wrmsr_on_cpu which does smp_call_function_single()
> which can deadlock in atomic context and #MC is one.
>
> Also, the math in amd_get_nbc_for_node() is too fragile and will break
> the moment some BIOS renumbers cores to accomodate some other OS.
>
> In any case, I won't be able to take a detailed look soon with the
> holidays coming up.
>
> Also, I'm wondering if this can't be solved much more elegantly
> by detecting that condition (bank == 4) in the #MC handler and
> issuing an IPI before exiting it using irq_work which will schedule
> do_machine_check on the NBC. And that should be even easier to do since
> we're moving the #MC handler out of the IST and to the normal kernel
> stack for 3.20, which would make this endeavor pretty cheap.

Ok. I'll look into this approach too over the holidays and we can
restart the discussion
at a more convenient time.

> Anyway, just a couple of thoughts...
>

Thanks,
-Aravind.