2012-02-28 16:11:47

by Borislav Petkov

[permalink] [raw]
Subject: [RFC PATCH 0/3] RAS: Use MCE tracepoint for decoded MCEs

From: Borislav Petkov <[email protected]>

Hi all,

this is an initial, more or less serious attempt to collect decoded
MCE info into a buffer and jettison it into userspace using the MCE
tracepoint trace_mce_record(). This initial approach needs userspace to
do

$ echo 1 > /sys/devices/system/ras/agent

and decoded MCE info gets collected into a buffer which enlarges itself
to accomodate differently-sized error messages. Then, when decoding
is finished, the tracepoint is called and the MCE info along with the
decoded information lands in the ring buffer and at possible userspace
consumers.

Also, the commit messages of the single patches contain additional info.

For example, the data looks like this:

mcegen.py-2318 [001] .N.. 580.902409: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|PCC|AddrV|CECC]: 0xd604c00006080a41 MC4_ADDR: 0x0000000000000016
[Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[Hardware Error]: ERR_ADDR: 0x16 row: 0, channel: 0
[Hardware Error]: cache level: L1, mem/io: MEM, mem-tx: DWR, part-proc: RES (no timeout)
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: d604c00006080a41, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)

mcegen.py-2326 [001] .N.. 598.795494: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|UE|MiscV|PCC|-|UECC]: 0xfa002000001c011b
[Hardware Error]: Northbridge Error (node 0): L3 ECC data cache error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: fa002000001c011b, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)

mcegen.py-2343 [013] .N.. 619.620698: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[-|UE|MiscV|PCC|-|UECC]: 0xba002100000f001b[HardwareError]: Northbridge Error (node 0): GART Table Walk data error.
[Hardware Error]: cache level: L3/GEN, tx: GEN
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: ba002100000f001b, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)

As always, reviews and comments are welcome.

Thanks.


2012-02-28 16:11:48

by Borislav Petkov

[permalink] [raw]
Subject: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

From: Borislav Petkov <[email protected]>

The idea here is to pass an additional decoded MCE message through
the tracepoint and into the ring buffer for userspace to consume. The
designated consumers are RAS daemons and other tools collecting RAS
information.

Drop unneeded fields while at it, thus saving some room in the ring
buffer.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
include/trace/events/mce.h | 22 +++++++---------------
2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 5a11ae2e9e91..072e020ecaf3 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -144,7 +144,7 @@ void mce_log(struct mce *mce)
int ret = 0;

/* Emit the trace record: */
- trace_mce_record(mce);
+ trace_mce_record("", mce);

ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
if (ret == NOTIFY_STOP)
diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 4cbbcef6baa8..5fada9a40708 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -10,11 +10,12 @@

TRACE_EVENT(mce_record,

- TP_PROTO(struct mce *m),
+ TP_PROTO(const char *msg, struct mce *m),

- TP_ARGS(m),
+ TP_ARGS(msg, m),

TP_STRUCT__entry(
+ __string( msg, msg )
__field( u64, mcgcap )
__field( u64, mcgstatus )
__field( u64, status )
@@ -24,15 +25,12 @@ TRACE_EVENT(mce_record,
__field( u64, tsc )
__field( u64, walltime )
__field( u32, cpu )
- __field( u32, cpuid )
- __field( u32, apicid )
- __field( u32, socketid )
__field( u8, cs )
__field( u8, bank )
- __field( u8, cpuvendor )
),

TP_fast_assign(
+ __assign_str(msg, msg);
__entry->mcgcap = m->mcgcap;
__entry->mcgstatus = m->mcgstatus;
__entry->status = m->status;
@@ -42,25 +40,19 @@ TRACE_EVENT(mce_record,
__entry->tsc = m->tsc;
__entry->walltime = m->time;
__entry->cpu = m->extcpu;
- __entry->cpuid = m->cpuid;
- __entry->apicid = m->apicid;
- __entry->socketid = m->socketid;
__entry->cs = m->cs;
__entry->bank = m->bank;
- __entry->cpuvendor = m->cpuvendor;
),

- TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
+ TP_printk("%s" HW_ERR "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, TIME: %llu)",
+ __get_str(msg),
__entry->cpu,
__entry->mcgcap, __entry->mcgstatus,
__entry->bank, __entry->status,
__entry->addr, __entry->misc,
__entry->cs, __entry->ip,
__entry->tsc,
- __entry->cpuvendor, __entry->cpuid,
- __entry->walltime,
- __entry->socketid,
- __entry->apicid)
+ __entry->walltime)
);

#endif /* _TRACE_MCE_H */
--
1.7.8.rc0

2012-02-28 16:11:49

by Borislav Petkov

[permalink] [raw]
Subject: [PATCH 2/3] x86, RAS: Add a decoded msg buffer

From: Borislav Petkov <[email protected]>

Echoing 1 into /sys/devices/system/ras/agent causes the ras_printk()
function to buffer a string describing a hardware error. This is meant
for userspace daemons which are running on the system and are going to
consume decoded information through the MCE tracepoint.

Also, upon first use, the buffer enlarges itself from its initial size
so that it can accomodate longer messages.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/Kconfig | 9 +++
arch/x86/Makefile | 3 +
arch/x86/include/asm/ras.h | 13 ++++
arch/x86/ras/Makefile | 1 +
arch/x86/ras/ras.c | 162 ++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 188 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/ras.h
create mode 100644 arch/x86/ras/Makefile
create mode 100644 arch/x86/ras/ras.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e189fa..bda1480241b2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -657,6 +657,15 @@ config X86_CYCLONE_TIMER
def_bool y
depends on X86_SUMMIT

+config X86_RAS
+ def_bool y
+ prompt "X86 RAS features"
+ ---help---
+ A collection of Reliability, Availability and Serviceability
+ software features which aim to enable hardware error logging
+ and reporting. Leave it at 'y' unless you really know what
+ you're doing
+
source "arch/x86/Kconfig.cpu"

config HPET_TIMER
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 209ba1294592..a6b6bb1f308b 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -146,6 +146,9 @@ drivers-$(CONFIG_OPROFILE) += arch/x86/oprofile/
# suspend and hibernation support
drivers-$(CONFIG_PM) += arch/x86/power/

+# RAS support
+core-y += arch/x86/ras/
+
drivers-$(CONFIG_FB) += arch/x86/video/

####
diff --git a/arch/x86/include/asm/ras.h b/arch/x86/include/asm/ras.h
new file mode 100644
index 000000000000..17aa2679032b
--- /dev/null
+++ b/arch/x86/include/asm/ras.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_X86_RAS_H
+#define _ASM_X86_RAS_H
+
+extern bool ras_agent;
+
+#define PR_EMERG BIT(0)
+#define PR_WARNING BIT(1)
+#define PR_CONT BIT(2)
+
+extern const char *ras_get_decoded_err(void);
+extern void ras_printk(unsigned long flags, const char *fmt, ...);
+
+#endif /* _ASM_X86_RAS_H */
diff --git a/arch/x86/ras/Makefile b/arch/x86/ras/Makefile
new file mode 100644
index 000000000000..7a70bb5cd057
--- /dev/null
+++ b/arch/x86/ras/Makefile
@@ -0,0 +1 @@
+obj-y := ras.o
diff --git a/arch/x86/ras/ras.c b/arch/x86/ras/ras.c
new file mode 100644
index 000000000000..38e24908e682
--- /dev/null
+++ b/arch/x86/ras/ras.c
@@ -0,0 +1,162 @@
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/ras.h>
+
+static size_t err_str_sz, dec_len;
+static char *err_str;
+
+/*
+ * If true, userspace has an agent running and eating all the
+ * tracing data we're sending out so there's no dmesg output
+ */
+bool ras_agent;
+EXPORT_SYMBOL_GPL(ras_agent);
+
+/* getting the string implies the current buffer is emptied */
+const char *ras_get_decoded_err(void)
+{
+ dec_len = 0;
+ return err_str;
+}
+
+void ras_printk(unsigned long flags, const char *fmt, ...)
+{
+ va_list args;
+ char *buf;
+ size_t left;
+ int i;
+
+ /* add a HW_ERR prefix to a newly started line */
+ if (!(flags & PR_CONT)) {
+ strcpy(err_str + dec_len, HW_ERR);
+ dec_len += strlen(HW_ERR);
+ }
+
+ left = err_str_sz - dec_len - 1;
+
+ if (left <= 50) {
+ /* enlarge arbitrarily by 50 chars */
+ err_str_sz += 50;
+ left += 50;
+
+ err_str = krealloc(err_str, err_str_sz, GFP_KERNEL);
+ if (!err_str) {
+ pr_err("Error enlarging decode buffer.\n");
+ return;
+ }
+ }
+
+ /* use err_str _after_ the realloc because it can change */
+ buf = err_str + dec_len;
+
+ va_start(args, fmt);
+ i = vsnprintf(buf, left, fmt, args);
+ va_end(args);
+
+ if (i >= left) {
+ pr_err("Error decode buffer truncated.\n");
+ dec_len = err_str_sz-1;
+ err_str[dec_len] = '\n';
+ } else
+ dec_len += i;
+
+ if (!ras_agent) {
+ if (flags & PR_EMERG)
+ pr_emerg("%s", buf);
+ if (flags & PR_WARNING)
+ pr_warning("%s", buf);
+ else if (flags & PR_CONT)
+ pr_cont("%s", buf);
+
+ dec_len = 0;
+ }
+}
+EXPORT_SYMBOL_GPL(ras_printk);
+
+struct bus_type ras_subsys = {
+ .name = "ras",
+ .dev_name = "ras",
+};
+
+struct ras_attr {
+ struct attribute attr;
+ ssize_t (*show) (struct kobject *kobj, struct ras_attr *attr, char *bf);
+ ssize_t (*store)(struct kobject *kobj, struct ras_attr *attr,
+ const char *buf, size_t count);
+};
+
+#define RAS_ATTR(_name, _mode, _show, _store) \
+static struct ras_attr ras_attr_##_name = __ATTR(_name, _mode, _show, _store)
+
+static ssize_t ras_agent_show(struct kobject *kobj,
+ struct ras_attr *attr,
+ char *buf)
+{
+ return sprintf(buf, "%.1d\n", ras_agent);
+}
+
+static ssize_t ras_agent_store(struct kobject *kobj,
+ struct ras_attr *attr,
+ const char *buf, size_t count)
+{
+ int ret = 0;
+ unsigned long value;
+
+ ret = kstrtoul(buf, 10, &value);
+ if (ret < 0) {
+ printk(KERN_ERR "Wrong value for ras_agent field.\n");
+ return ret;
+ }
+
+ ras_agent = !!value;
+
+ return count;
+}
+
+RAS_ATTR(agent, 0644, ras_agent_show, ras_agent_store);
+
+static struct attribute *ras_root_attrs[] = {
+ &ras_attr_agent.attr,
+ NULL
+};
+
+static const struct attribute_group ras_root_attr_group = {
+ .attrs = ras_root_attrs,
+};
+
+static const struct attribute_group *ras_root_attr_groups[] = {
+ &ras_root_attr_group,
+ NULL,
+};
+
+static int __init ras_init(void)
+{
+ int err = 0;
+
+ err = subsys_system_register(&ras_subsys, ras_root_attr_groups);
+ if (err) {
+ printk(KERN_ERR "Error registering toplevel RAS sysfs node.\n");
+ return err;
+ }
+
+ /* initial string size, will be potentially enlarged if needed */
+ err_str_sz = 200;
+
+ /* no freeing of this since it ras.c is compiled-on only */
+ err_str = kzalloc(err_str_sz, GFP_KERNEL);
+ if (!err_str) {
+ err = -ENOMEM;
+ goto err_alloc;
+ }
+
+ return 0;
+
+err_alloc:
+ bus_unregister(&ras_subsys);
+
+ return err;
+}
+subsys_initcall(ras_init);
--
1.7.8.rc0

2012-02-28 16:12:23

by Borislav Petkov

[permalink] [raw]
Subject: [PATCH 3/3] EDAC: Convert AMD EDAC pieces to use RAS printk buffer

From: Borislav Petkov <[email protected]>

This is an initial version of the patch which converts MCE decoding
facilities to use the RAS printk buffer. When there's no userspace agent
running (i.e., /sys/devices/system/ras/agent == 0), we fall back to the
default printk's into dmesg which is what we've been doing so far.

Signed-off-by: Borislav Petkov <[email protected]>
---
drivers/edac/amd64_edac.c | 8 ++-
drivers/edac/edac_mc.c | 42 ++++++---
drivers/edac/mce_amd.c | 215 ++++++++++++++++++++++++---------------------
3 files changed, 147 insertions(+), 118 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index c9eee6d33e9a..2d9a64b01ed5 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1,6 +1,7 @@
-#include "amd64_edac.h"
#include <asm/amd_nb.h>
+#include <asm/ras.h>

+#include "amd64_edac.h"
static struct edac_pci_ctl_info *amd64_ctl_pci;

static int report_gart_errors;
@@ -1901,7 +1902,10 @@ static void amd64_handle_ce(struct mem_ctl_info *mci, struct mce *m)
sys_addr = get_error_address(m);
syndrome = extract_syndrome(m->status);

- amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
+ if (ras_agent)
+ ras_printk(PR_EMERG, "ERR_ADDR: 0x%llx", sys_addr);
+ else
+ amd64_mc_err(mci, "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);

pvt->ops->map_sysaddr_to_csrow(mci, sys_addr, syndrome);
}
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index ca6c04d350ee..3b3db477b5d0 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -30,8 +30,10 @@
#include <asm/uaccess.h>
#include <asm/page.h>
#include <asm/edac.h>
+#include <asm/ras.h>
#include "edac_core.h"
#include "edac_module.h"
+#include "mce_amd.h"

/* lock to memory controller's control array */
static DEFINE_MUTEX(mem_ctls_mutex);
@@ -701,14 +703,22 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
return;
}

- if (edac_mc_get_log_ce())
- /* FIXME - put in DIMM location */
- edac_mc_printk(mci, KERN_WARNING,
- "CE page 0x%lx, offset 0x%lx, grain %d, syndrome "
- "0x%lx, row %d, channel %d, label \"%s\": %s\n",
- page_frame_number, offset_in_page,
- mci->csrows[row].grain, syndrome, row, channel,
- mci->csrows[row].channels[channel].label, msg);
+ if (edac_mc_get_log_ce()) {
+ if (ras_agent)
+ ras_printk(PR_CONT, ", row: %d, channel: %d\n",
+ row, channel);
+ else
+ /* FIXME - put in DIMM location */
+ edac_mc_printk(mci, KERN_WARNING,
+ "CE page 0x%lx, offset 0x%lx, grain %d,"
+ " syndrome 0x%lx, row %d, channel %d,"
+ " label \"%s\": %s\n",
+ page_frame_number, offset_in_page,
+ mci->csrows[row].grain, syndrome,
+ row, channel,
+ mci->csrows[row].channels[channel].label,
+ msg);
+ }

mci->ce_count++;
mci->csrows[row].ce_count++;
@@ -780,12 +790,16 @@ void edac_mc_handle_ue(struct mem_ctl_info *mci,
pos += chars;
}

- if (edac_mc_get_log_ue())
- edac_mc_printk(mci, KERN_EMERG,
- "UE page 0x%lx, offset 0x%lx, grain %d, row %d, "
- "labels \"%s\": %s\n", page_frame_number,
- offset_in_page, mci->csrows[row].grain, row,
- labels, msg);
+ if (edac_mc_get_log_ue()) {
+ if (ras_agent)
+ ras_printk(PR_CONT, "row: %d\n", row);
+ else
+ edac_mc_printk(mci, KERN_EMERG,
+ "UE page 0x%lx, offset 0x%lx, grain %d,"
+ " row %d, labels \"%s\": %s\n",
+ page_frame_number, offset_in_page,
+ mci->csrows[row].grain, row, labels, msg);
+ }

if (edac_mc_get_panic_on_ue())
panic("EDAC MC%d: UE page 0x%lx, offset 0x%lx, grain %d, "
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index bd926ea2e00c..92591c433e35 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1,5 +1,7 @@
#include <linux/module.h>
#include <linux/slab.h>
+#include <trace/events/mce.h>
+#include <asm/ras.h>

#include "mce_amd.h"

@@ -137,9 +139,9 @@ static bool f12h_dc_mce(u16 ec, u8 xec)
ret = true;

if (ll == LL_L2)
- pr_cont("during L1 linefill from L2.\n");
+ ras_printk(PR_CONT, "during L1 linefill from L2.\n");
else if (ll == LL_L1)
- pr_cont("Data/Tag %s error.\n", R4_MSG(ec));
+ ras_printk(PR_CONT, "Data/Tag %s error.\n", R4_MSG(ec));
else
ret = false;
}
@@ -149,7 +151,7 @@ static bool f12h_dc_mce(u16 ec, u8 xec)
static bool f10h_dc_mce(u16 ec, u8 xec)
{
if (R4(ec) == R4_GEN && LL(ec) == LL_L1) {
- pr_cont("during data scrub.\n");
+ ras_printk(PR_CONT, "during data scrub.\n");
return true;
}
return f12h_dc_mce(ec, xec);
@@ -158,7 +160,7 @@ static bool f10h_dc_mce(u16 ec, u8 xec)
static bool k8_dc_mce(u16 ec, u8 xec)
{
if (BUS_ERROR(ec)) {
- pr_cont("during system linefill.\n");
+ ras_printk(PR_CONT, "during system linefill.\n");
return true;
}

@@ -178,14 +180,14 @@ static bool f14h_dc_mce(u16 ec, u8 xec)
switch (r4) {
case R4_DRD:
case R4_DWR:
- pr_cont("Data/Tag parity error due to %s.\n",
+ ras_printk(PR_CONT, "Data/Tag parity error due to %s.\n",
(r4 == R4_DRD ? "load/hw prf" : "store"));
break;
case R4_EVICT:
- pr_cont("Copyback parity error on a tag miss.\n");
+ ras_printk(PR_CONT, "Copyback parity error on a tag miss.\n");
break;
case R4_SNOOP:
- pr_cont("Tag parity error during snoop.\n");
+ ras_printk(PR_CONT, "Tag parity error during snoop.\n");
break;
default:
ret = false;
@@ -195,17 +197,17 @@ static bool f14h_dc_mce(u16 ec, u8 xec)
if ((II(ec) != II_MEM && II(ec) != II_IO) || LL(ec) != LL_LG)
return false;

- pr_cont("System read data error on a ");
+ ras_printk(PR_CONT, "System read data error on a ");

switch (r4) {
case R4_RD:
- pr_cont("TLB reload.\n");
+ ras_printk(PR_CONT, "TLB reload.\n");
break;
case R4_DWR:
- pr_cont("store.\n");
+ ras_printk(PR_CONT, "store.\n");
break;
case R4_DRD:
- pr_cont("load.\n");
+ ras_printk(PR_CONT, "load.\n");
break;
default:
ret = false;
@@ -225,28 +227,29 @@ static bool f15h_dc_mce(u16 ec, u8 xec)

switch (xec) {
case 0x0:
- pr_cont("Data Array access error.\n");
+ ras_printk(PR_CONT, "Data Array access error.\n");
break;

case 0x1:
- pr_cont("UC error during a linefill from L2/NB.\n");
+ ras_printk(PR_CONT, "UC error during a linefill "
+ "from L2/NB.\n");
break;

case 0x2:
case 0x11:
- pr_cont("STQ access error.\n");
+ ras_printk(PR_CONT, "STQ access error.\n");
break;

case 0x3:
- pr_cont("SCB access error.\n");
+ ras_printk(PR_CONT, "SCB access error.\n");
break;

case 0x10:
- pr_cont("Tag error.\n");
+ ras_printk(PR_CONT, "Tag error.\n");
break;

case 0x12:
- pr_cont("LDQ access error.\n");
+ ras_printk(PR_CONT, "LDQ access error.\n");
break;

default:
@@ -255,9 +258,9 @@ static bool f15h_dc_mce(u16 ec, u8 xec)
} else if (BUS_ERROR(ec)) {

if (!xec)
- pr_cont("during system linefill.\n");
+ ras_printk(PR_CONT, "during system linefill.\n");
else
- pr_cont(" Internal %s condition.\n",
+ ras_printk(PR_CONT, " Internal %s condition.\n",
((xec == 1) ? "livelock" : "deadlock"));
} else
ret = false;
@@ -270,12 +273,12 @@ static void amd_decode_dc_mce(struct mce *m)
u16 ec = EC(m->status);
u8 xec = XEC(m->status, xec_mask);

- pr_emerg(HW_ERR "Data Cache Error: ");
+ ras_printk(PR_EMERG, "Data Cache Error: ");

/* TLB error signatures are the same across families */
if (TLB_ERROR(ec)) {
if (TT(ec) == TT_DATA) {
- pr_cont("%s TLB %s.\n", LL_MSG(ec),
+ ras_printk(PR_CONT, "%s TLB %s.\n", LL_MSG(ec),
((xec == 2) ? "locked miss"
: (xec ? "multimatch" : "parity")));
return;
@@ -283,7 +286,7 @@ static void amd_decode_dc_mce(struct mce *m)
} else if (fam_ops->dc_mce(ec, xec))
;
else
- pr_emerg(HW_ERR "Corrupted DC MCE info?\n");
+ ras_printk(PR_EMERG, "Corrupted DC MCE info?\n");
}

static bool k8_ic_mce(u16 ec, u8 xec)
@@ -295,19 +298,19 @@ static bool k8_ic_mce(u16 ec, u8 xec)
return false;

if (ll == 0x2)
- pr_cont("during a linefill from L2.\n");
+ ras_printk(PR_CONT, "during a linefill from L2.\n");
else if (ll == 0x1) {
switch (R4(ec)) {
case R4_IRD:
- pr_cont("Parity error during data load.\n");
+ ras_printk(PR_CONT, "Parity error during data load.\n");
break;

case R4_EVICT:
- pr_cont("Copyback Parity/Victim error.\n");
+ ras_printk(PR_CONT, "Copyback Parity/Victim error.\n");
break;

case R4_SNOOP:
- pr_cont("Tag Snoop error.\n");
+ ras_printk(PR_CONT, "Tag Snoop error.\n");
break;

default:
@@ -330,9 +333,9 @@ static bool f14h_ic_mce(u16 ec, u8 xec)
ret = false;

if (r4 == R4_IRD)
- pr_cont("Data/tag array parity error for a tag hit.\n");
+ ras_printk(PR_CONT, "Data/tag array parity error for a tag hit.\n");
else if (r4 == R4_SNOOP)
- pr_cont("Tag error during snoop/victimization.\n");
+ ras_printk(PR_CONT, "Tag error during snoop/victimization.\n");
else
ret = false;
}
@@ -348,15 +351,16 @@ static bool f15h_ic_mce(u16 ec, u8 xec)

switch (xec) {
case 0x0 ... 0xa:
- pr_cont("%s.\n", f15h_ic_mce_desc[xec]);
+ ras_printk(PR_CONT, "%s.\n", f15h_ic_mce_desc[xec]);
break;

case 0xd:
- pr_cont("%s.\n", f15h_ic_mce_desc[xec-2]);
+ ras_printk(PR_CONT, "%s.\n", f15h_ic_mce_desc[xec-2]);
break;

case 0x10 ... 0x14:
- pr_cont("Decoder %s parity error.\n", f15h_ic_mce_desc[xec-4]);
+ ras_printk(PR_CONT, "Decoder %s parity error.\n",
+ f15h_ic_mce_desc[xec-4]);
break;

default:
@@ -370,19 +374,20 @@ static void amd_decode_ic_mce(struct mce *m)
u16 ec = EC(m->status);
u8 xec = XEC(m->status, xec_mask);

- pr_emerg(HW_ERR "Instruction Cache Error: ");
+ ras_printk(PR_EMERG, "Instruction Cache Error: ");

if (TLB_ERROR(ec))
- pr_cont("%s TLB %s.\n", LL_MSG(ec),
+ ras_printk(PR_CONT, "%s TLB %s.\n", LL_MSG(ec),
(xec ? "multimatch" : "parity error"));
else if (BUS_ERROR(ec)) {
bool k8 = (boot_cpu_data.x86 == 0xf && (m->status & BIT_64(58)));

- pr_cont("during %s.\n", (k8 ? "system linefill" : "NB data read"));
+ ras_printk(PR_CONT, "during %s.\n", (k8 ? "system linefill"
+ : "NB data read"));
} else if (fam_ops->ic_mce(ec, xec))
;
else
- pr_emerg(HW_ERR "Corrupted IC MCE info?\n");
+ ras_printk(PR_EMERG, "Corrupted IC MCE info?\n");
}

static void amd_decode_bu_mce(struct mce *m)
@@ -390,30 +395,33 @@ static void amd_decode_bu_mce(struct mce *m)
u16 ec = EC(m->status);
u8 xec = XEC(m->status, xec_mask);

- pr_emerg(HW_ERR "Bus Unit Error");
+ ras_printk(PR_EMERG, "Bus Unit Error");

if (xec == 0x1)
- pr_cont(" in the write data buffers.\n");
+ ras_printk(PR_CONT, " in the write data buffers.\n");
else if (xec == 0x3)
- pr_cont(" in the victim data buffers.\n");
+ ras_printk(PR_CONT, " in the victim data buffers.\n");
else if (xec == 0x2 && MEM_ERROR(ec))
- pr_cont(": %s error in the L2 cache tags.\n", R4_MSG(ec));
+ ras_printk(PR_CONT, ": %s error in the L2 cache tags.\n",
+ R4_MSG(ec));
else if (xec == 0x0) {
if (TLB_ERROR(ec))
- pr_cont(": %s error in a Page Descriptor Cache or "
- "Guest TLB.\n", TT_MSG(ec));
+ ras_printk(PR_CONT, ": %s error in a Page Descriptor "
+ "Cache or Guest TLB.\n",
+ TT_MSG(ec));
else if (BUS_ERROR(ec))
- pr_cont(": %s/ECC error in data read from NB: %s.\n",
- R4_MSG(ec), PP_MSG(ec));
+ ras_printk(PR_CONT, ": %s/ECC error in data read from NB: %s.\n",
+ R4_MSG(ec), PP_MSG(ec));
else if (MEM_ERROR(ec)) {
u8 r4 = R4(ec);

if (r4 >= 0x7)
- pr_cont(": %s error during data copyback.\n",
- R4_MSG(ec));
+ ras_printk(PR_CONT, ": %s error during data copyback.\n",
+ R4_MSG(ec));
else if (r4 <= 0x1)
- pr_cont(": %s parity/ECC error during data "
- "access from L2.\n", R4_MSG(ec));
+ ras_printk(PR_CONT, ": %s parity/ECC error "
+ "during data access from L2.\n",
+ R4_MSG(ec));
else
goto wrong_bu_mce;
} else
@@ -424,7 +432,7 @@ static void amd_decode_bu_mce(struct mce *m)
return;

wrong_bu_mce:
- pr_emerg(HW_ERR "Corrupted BU MCE info?\n");
+ ras_printk(PR_EMERG, "Corrupted BU MCE info?\n");
}

static void amd_decode_cu_mce(struct mce *m)
@@ -432,28 +440,28 @@ static void amd_decode_cu_mce(struct mce *m)
u16 ec = EC(m->status);
u8 xec = XEC(m->status, xec_mask);

- pr_emerg(HW_ERR "Combined Unit Error: ");
+ ras_printk(PR_EMERG, "Combined Unit Error: ");

if (TLB_ERROR(ec)) {
if (xec == 0x0)
- pr_cont("Data parity TLB read error.\n");
+ ras_printk(PR_CONT, "Data parity TLB read error.\n");
else if (xec == 0x1)
- pr_cont("Poison data provided for TLB fill.\n");
+ ras_printk(PR_CONT, "Poison data provided for TLB fill.\n");
else
goto wrong_cu_mce;
} else if (BUS_ERROR(ec)) {
if (xec > 2)
goto wrong_cu_mce;

- pr_cont("Error during attempted NB data read.\n");
+ ras_printk(PR_CONT, "Error during attempted NB data read.\n");
} else if (MEM_ERROR(ec)) {
switch (xec) {
case 0x4 ... 0xc:
- pr_cont("%s.\n", f15h_cu_mce_desc[xec - 0x4]);
+ ras_printk(PR_CONT, "%s.\n", f15h_cu_mce_desc[xec - 0x4]);
break;

case 0x10 ... 0x14:
- pr_cont("%s.\n", f15h_cu_mce_desc[xec - 0x7]);
+ ras_printk(PR_CONT, "%s.\n", f15h_cu_mce_desc[xec - 0x7]);
break;

default:
@@ -464,7 +472,7 @@ static void amd_decode_cu_mce(struct mce *m)
return;

wrong_cu_mce:
- pr_emerg(HW_ERR "Corrupted CU MCE info?\n");
+ ras_printk(PR_EMERG, "Corrupted CU MCE info?\n");
}

static void amd_decode_ls_mce(struct mce *m)
@@ -473,12 +481,12 @@ static void amd_decode_ls_mce(struct mce *m)
u8 xec = XEC(m->status, xec_mask);

if (boot_cpu_data.x86 >= 0x14) {
- pr_emerg("You shouldn't be seeing an LS MCE on this cpu family,"
- " please report on LKML.\n");
+ ras_printk(PR_EMERG, "You shouldn't be seeing an LS MCE on this"
+ " cpu family, please report on LKML.\n");
return;
}

- pr_emerg(HW_ERR "Load Store Error");
+ ras_printk(PR_EMERG, "Load Store Error");

if (xec == 0x0) {
u8 r4 = R4(ec);
@@ -486,14 +494,14 @@ static void amd_decode_ls_mce(struct mce *m)
if (!BUS_ERROR(ec) || (r4 != R4_DRD && r4 != R4_DWR))
goto wrong_ls_mce;

- pr_cont(" during %s.\n", R4_MSG(ec));
+ ras_printk(PR_CONT, " during %s.\n", R4_MSG(ec));
} else
goto wrong_ls_mce;

return;

wrong_ls_mce:
- pr_emerg(HW_ERR "Corrupted LS MCE info?\n");
+ ras_printk(PR_EMERG, "Corrupted LS MCE info?\n");
}

static bool k8_nb_mce(u16 ec, u8 xec)
@@ -502,15 +510,15 @@ static bool k8_nb_mce(u16 ec, u8 xec)

switch (xec) {
case 0x1:
- pr_cont("CRC error detected on HT link.\n");
+ ras_printk(PR_CONT, "CRC error detected on HT link.\n");
break;

case 0x5:
- pr_cont("Invalid GART PTE entry during GART table walk.\n");
+ ras_printk(PR_CONT, "Invalid GART PTE entry during GART table walk.\n");
break;

case 0x6:
- pr_cont("Unsupported atomic RMW received from an IO link.\n");
+ ras_printk(PR_CONT, "Unsupported atomic RMW received from an IO link.\n");
break;

case 0x0:
@@ -518,11 +526,11 @@ static bool k8_nb_mce(u16 ec, u8 xec)
if (boot_cpu_data.x86 == 0x11)
return false;

- pr_cont("DRAM ECC error detected on the NB.\n");
+ ras_printk(PR_CONT, "DRAM ECC error detected on the NB.\n");
break;

case 0xd:
- pr_cont("Parity error on the DRAM addr/ctl signals.\n");
+ ras_printk(PR_CONT, "Parity error on the DRAM addr/ctl signals.\n");
break;

default:
@@ -552,9 +560,9 @@ static bool f10h_nb_mce(u16 ec, u8 xec)

case 0xf:
if (TLB_ERROR(ec))
- pr_cont("GART Table Walk data error.\n");
+ ras_printk(PR_CONT, "GART Table Walk data error.\n");
else if (BUS_ERROR(ec))
- pr_cont("DMA Exclusion Vector Table Walk error.\n");
+ ras_printk(PR_CONT, "DMA Exclusion Vector Table Walk error.\n");
else
ret = false;

@@ -563,7 +571,7 @@ static bool f10h_nb_mce(u16 ec, u8 xec)

case 0x19:
if (boot_cpu_data.x86 == 0x15)
- pr_cont("Compute Unit Data Error.\n");
+ ras_printk(PR_CONT, "Compute Unit Data Error.\n");
else
ret = false;

@@ -581,7 +589,7 @@ static bool f10h_nb_mce(u16 ec, u8 xec)
break;
}

- pr_cont("%s.\n", f10h_nb_mce_desc[xec - offset]);
+ ras_printk(PR_CONT, "%s.\n", f10h_nb_mce_desc[xec - offset]);

out:
return ret;
@@ -599,27 +607,27 @@ void amd_decode_nb_mce(struct mce *m)
u16 ec = EC(m->status);
u8 xec = XEC(m->status, 0x1f);

- pr_emerg(HW_ERR "Northbridge Error (node %d): ", node_id);
+ ras_printk(PR_EMERG, "Northbridge Error (node %d): ", node_id);

switch (xec) {
case 0x2:
- pr_cont("Sync error (sync packets on HT link detected).\n");
+ ras_printk(PR_CONT, "Sync error (sync packets on HT link detected).\n");
return;

case 0x3:
- pr_cont("HT Master abort.\n");
+ ras_printk(PR_CONT, "HT Master abort.\n");
return;

case 0x4:
- pr_cont("HT Target abort.\n");
+ ras_printk(PR_CONT, "HT Target abort.\n");
return;

case 0x7:
- pr_cont("NB Watchdog timeout.\n");
+ ras_printk(PR_CONT, "NB Watchdog timeout.\n");
return;

case 0x9:
- pr_cont("SVM DMA Exclusion Vector error.\n");
+ ras_printk(PR_CONT, "SVM DMA Exclusion Vector error.\n");
return;

default:
@@ -636,7 +644,7 @@ void amd_decode_nb_mce(struct mce *m)
return;

wrong_nb_mce:
- pr_emerg(HW_ERR "Corrupted NB MCE info?\n");
+ ras_printk(PR_EMERG, "Corrupted NB MCE info?\n");
}
EXPORT_SYMBOL_GPL(amd_decode_nb_mce);

@@ -651,80 +659,80 @@ static void amd_decode_fr_mce(struct mce *m)
if (c->x86 != 0x15 && xec != 0x0)
goto wrong_fr_mce;

- pr_emerg(HW_ERR "%s Error: ",
+ ras_printk(PR_EMERG, "%s Error: ",
(c->x86 == 0x15 ? "Execution Unit" : "FIROB"));

if (xec == 0x0 || xec == 0xc)
- pr_cont("%s.\n", fr_ex_mce_desc[xec]);
+ ras_printk(PR_CONT, "%s.\n", fr_ex_mce_desc[xec]);
else if (xec < 0xd)
- pr_cont("%s parity error.\n", fr_ex_mce_desc[xec]);
+ ras_printk(PR_CONT, "%s parity error.\n", fr_ex_mce_desc[xec]);
else
goto wrong_fr_mce;

return;

wrong_fr_mce:
- pr_emerg(HW_ERR "Corrupted FR MCE info?\n");
+ ras_printk(PR_EMERG, "Corrupted FR MCE info?\n");
}

static void amd_decode_fp_mce(struct mce *m)
{
u8 xec = XEC(m->status, xec_mask);

- pr_emerg(HW_ERR "Floating Point Unit Error: ");
+ ras_printk(PR_EMERG, "Floating Point Unit Error: ");

switch (xec) {
case 0x1:
- pr_cont("Free List");
+ ras_printk(PR_CONT, "Free List");
break;

case 0x2:
- pr_cont("Physical Register File");
+ ras_printk(PR_CONT, "Physical Register File");
break;

case 0x3:
- pr_cont("Retire Queue");
+ ras_printk(PR_CONT, "Retire Queue");
break;

case 0x4:
- pr_cont("Scheduler table");
+ ras_printk(PR_CONT, "Scheduler table");
break;

case 0x5:
- pr_cont("Status Register File");
+ ras_printk(PR_CONT, "Status Register File");
break;

default:
goto wrong_fp_mce;
break;
}
-
- pr_cont(" parity error.\n");
+ ras_printk(PR_CONT, " parity error.\n");

return;

wrong_fp_mce:
- pr_emerg(HW_ERR "Corrupted FP MCE info?\n");
+ ras_printk(PR_EMERG, "Corrupted FP MCE info?\n");
}

static inline void amd_decode_err_code(u16 ec)
{

- pr_emerg(HW_ERR "cache level: %s", LL_MSG(ec));
+ ras_printk(PR_EMERG, "cache level: %s", LL_MSG(ec));

if (BUS_ERROR(ec))
- pr_cont(", mem/io: %s", II_MSG(ec));
+ ras_printk(PR_CONT, ", mem/io: %s", II_MSG(ec));
else
- pr_cont(", tx: %s", TT_MSG(ec));
+ ras_printk(PR_CONT, ", tx: %s", TT_MSG(ec));

if (MEM_ERROR(ec) || BUS_ERROR(ec)) {
- pr_cont(", mem-tx: %s", R4_MSG(ec));
+ ras_printk(PR_CONT, ", mem-tx: %s", R4_MSG(ec));

if (BUS_ERROR(ec))
- pr_cont(", part-proc: %s (%s)", PP_MSG(ec), TO_MSG(ec));
+ ras_printk(PR_CONT, ", part-proc: %s (%s)",
+ PP_MSG(ec), TO_MSG(ec));
}

- pr_cont("\n");
+ ras_printk(PR_CONT, "\n");
}

/*
@@ -752,7 +760,7 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
if (amd_filter_mce(m))
return NOTIFY_STOP;

- pr_emerg(HW_ERR "CPU:%d\tMC%d_STATUS[%s|%s|%s|%s|%s",
+ ras_printk(PR_EMERG, "CPU:%d MC%d_STATUS[%s|%s|%s|%s|%s",
m->extcpu, m->bank,
((m->status & MCI_STATUS_OVER) ? "Over" : "-"),
((m->status & MCI_STATUS_UC) ? "UE" : "CE"),
@@ -761,19 +769,20 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
((m->status & MCI_STATUS_ADDRV) ? "AddrV" : "-"));

if (c->x86 == 0x15)
- pr_cont("|%s|%s",
+ ras_printk(PR_CONT, "|%s|%s",
((m->status & BIT_64(44)) ? "Deferred" : "-"),
((m->status & BIT_64(43)) ? "Poison" : "-"));

/* do the two bits[14:13] together */
ecc = (m->status >> 45) & 0x3;
if (ecc)
- pr_cont("|%sECC", ((ecc == 2) ? "C" : "U"));
+ ras_printk(PR_CONT, "|%sECC", ((ecc == 2) ? "C" : "U"));

- pr_cont("]: 0x%016llx\n", m->status);
+ ras_printk(PR_CONT, "]: 0x%016llx", m->status);

if (m->status & MCI_STATUS_ADDRV)
- pr_emerg(HW_ERR "\tMC%d_ADDR: 0x%016llx\n", m->bank, m->addr);
+ ras_printk(PR_CONT, " MC%d_ADDR: 0x%016llx\n",
+ m->bank, m->addr);

switch (m->bank) {
case 0:
@@ -813,6 +822,8 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)

amd_decode_err_code(m->status & 0xffff);

+ trace_mce_record(ras_get_decoded_err(), m);
+
return NOTIFY_STOP;
}
EXPORT_SYMBOL_GPL(amd_decode_mce);
@@ -882,10 +893,10 @@ static int __init mce_amd_init(void)
return -EINVAL;
}

- pr_info("MCE: In-kernel MCE decoding enabled.\n");
-
mce_register_decode_chain(&amd_mce_dec_nb);

+ pr_info("MCE: In-kernel MCE decoding enabled.\n");
+
return 0;
}
early_initcall(mce_amd_init);
--
1.7.8.rc0

2012-02-28 22:43:56

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH 2/3] x86, RAS: Add a decoded msg buffer

+ if (left <= 50) {
+ /* enlarge arbitrarily by 50 chars */
+ err_str_sz += 50;
+ left += 50;
+
+ err_str = krealloc(err_str, err_str_sz, GFP_KERNEL);
+ if (!err_str) {
+ pr_err("Error enlarging decode buffer.\n");
+ return;
+ }
+ }

This looks worrying to me. Some bad stuff has happened, we are perhaps
in machine check context, and you want to start allocating memory!

Is there some upper bound for how long this string can get? Why not
just allocate a generous amount at boot time?

-Tony

2012-02-29 01:15:18

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

(2012/02/29 1:11), Borislav Petkov wrote:
> From: Borislav Petkov <[email protected]>
>
> The idea here is to pass an additional decoded MCE message through
> the tracepoint and into the ring buffer for userspace to consume. The
> designated consumers are RAS daemons and other tools collecting RAS
> information.

I could not catch the point... Why you need this msg field?

I think that all of information about the error is already packed in
the record and that we can make a string from the bits in the record
soon afterward. From my point of view it seems that what you are
doing here is just consuming the ring buffer by repeating same
contents in another format with dynamic length which might be short
but otherwise could be too long.

And one more unacceptable point is that filling this msg field is
expected to be done in machine check context where have many
limitations in kernel's subsystems such as use of memory allocators.

Suggestion; How about having a kind of translator function for
userland, e.g. an exported function named mce_record_to_msg()?
Tool obtains raw data from the record in the tracepoint's ring buffer,
and if it likes, optionally it can pass the record to the translator
function to get some accomplished string.

>
> Drop unneeded fields while at it, thus saving some room in the ring
> buffer.

Really unneeded and should be killed?


Thanks,
H.Seto

2012-02-29 10:11:34

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

On Wed, Feb 29, 2012 at 10:14:33AM +0900, Hidetoshi Seto wrote:
> (2012/02/29 1:11), Borislav Petkov wrote:
> > From: Borislav Petkov <[email protected]>
> >
> > The idea here is to pass an additional decoded MCE message through
> > the tracepoint and into the ring buffer for userspace to consume. The
> > designated consumers are RAS daemons and other tools collecting RAS
> > information.
>
> I could not catch the point... Why you need this msg field?
>
> I think that all of information about the error is already packed in
> the record and that we can make a string from the bits in the record
> soon afterward. From my point of view it seems that what you are
> doing here is just consuming the ring buffer by repeating same
> contents in another format with dynamic length which might be short
> but otherwise could be too long.

Right, to answer your immediate question: we've already decoded the MCE
so we carry that decoded info to userspace.

To address your indirect question: why aren't we using the MCE fields
to decode the MCE in userspace? Well, this has been a long discussion
already and one of the strong arguments for decoding hardware errors in
the kernel is that the kernel simply knows its hardware better. Imagine
a big server farm with heterogeneous hw configurations - if you get an
MCE there you have to also have collected the hardware platform details
so that you are able to decode it. If the kernel can do that for ya, you
don't have to do anything!

Or the case where you get an uncorrectable error and the machine panics
- it is much more convenient to see the decoded error on the screen
before the machine dies instead of some MCA register dumps which you
have to jot down and go and decode them by hand.

> And one more unacceptable point is that filling this msg field is
> expected to be done in machine check context where have many
> limitations in kernel's subsystems such as use of memory allocators.

Doh, I should've seen that, thanks to you and Tony for pointing that
out.

> Suggestion; How about having a kind of translator function for
> userland, e.g. an exported function named mce_record_to_msg()?
> Tool obtains raw data from the record in the tracepoint's ring buffer,
> and if it likes, optionally it can pass the record to the translator
> function to get some accomplished string.

Either that or I could simply allocate a large enough buffer from the
get-go, as Tony suggests. I'll experiment with my MCE generation script
and see how large a buffer can become.

> > Drop unneeded fields while at it, thus saving some room in the ring
> > buffer.
>
> Really unneeded and should be killed?

Right, so this is me suggesting to remove those because I don't see
why we'd need them, I'm expecting other people to come and say either
"Boris, no no, this is needed in... " or "Yeah, go ahead and remove
them, no one uses those." So feel free to argue either way.

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-02-29 10:12:05

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/3] x86, RAS: Add a decoded msg buffer

On Tue, Feb 28, 2012 at 10:43:52PM +0000, Luck, Tony wrote:
> + if (left <= 50) {
> + /* enlarge arbitrarily by 50 chars */
> + err_str_sz += 50;
> + left += 50;
> +
> + err_str = krealloc(err_str, err_str_sz, GFP_KERNEL);
> + if (!err_str) {
> + pr_err("Error enlarging decode buffer.\n");
> + return;
> + }
> + }
>
> This looks worrying to me. Some bad stuff has happened, we are perhaps
> in machine check context, and you want to start allocating memory!

Yes, absolutely. See my other mail to Seto-san.

> Is there some upper bound for how long this string can get? Why not
> just allocate a generous amount at boot time?

Yep, this sounds like the easiest thing to do, let me play with MCE
injection a bit.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-02-29 12:05:28

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

Em 29-02-2012 07:10, Borislav Petkov escreveu:
> On Wed, Feb 29, 2012 at 10:14:33AM +0900, Hidetoshi Seto wrote:
>> (2012/02/29 1:11), Borislav Petkov wrote:
>>> From: Borislav Petkov <[email protected]>
>>>
>>> The idea here is to pass an additional decoded MCE message through
>>> the tracepoint and into the ring buffer for userspace to consume. The
>>> designated consumers are RAS daemons and other tools collecting RAS
>>> information.
>>
>> I could not catch the point... Why you need this msg field?
>>
>> I think that all of information about the error is already packed in
>> the record and that we can make a string from the bits in the record
>> soon afterward. From my point of view it seems that what you are
>> doing here is just consuming the ring buffer by repeating same
>> contents in another format with dynamic length which might be short
>> but otherwise could be too long.

Not all information is packed in the record. The record packs only what it
is inside the MCE registers. However, for certain errors, it is needed to
parse other hardware registers to decode the error (for example, on Sandy
Bridge, the MCE registers don't contain the affected dimms).

> Right, to answer your immediate question: we've already decoded the MCE
> so we carry that decoded info to userspace.
>
> To address your indirect question: why aren't we using the MCE fields
> to decode the MCE in userspace? Well, this has been a long discussion
> already and one of the strong arguments for decoding hardware errors in
> the kernel is that the kernel simply knows its hardware better. Imagine
> a big server farm with heterogeneous hw configurations - if you get an
> MCE there you have to also have collected the hardware platform details
> so that you are able to decode it. If the kernel can do that for ya, you
> don't have to do anything!
>
> Or the case where you get an uncorrectable error and the machine panics
> - it is much more convenient to see the decoded error on the screen
> before the machine dies instead of some MCA register dumps which you
> have to jot down and go and decode them by hand.
>
>> And one more unacceptable point is that filling this msg field is
>> expected to be done in machine check context where have many
>> limitations in kernel's subsystems such as use of memory allocators.
>
> Doh, I should've seen that, thanks to you and Tony for pointing that
> out.
>
>> Suggestion; How about having a kind of translator function for
>> userland, e.g. an exported function named mce_record_to_msg()?
>> Tool obtains raw data from the record in the tracepoint's ring buffer,
>> and if it likes, optionally it can pass the record to the translator
>> function to get some accomplished string.
>
> Either that or I could simply allocate a large enough buffer from the
> get-go, as Tony suggests. I'll experiment with my MCE generation script
> and see how large a buffer can become.

Just allocate one page. 4096 should be enough even for the most hungry needs.

>>> Drop unneeded fields while at it, thus saving some room in the ring
>>> buffer.
>>
>> Really unneeded and should be killed?
>
> Right, so this is me suggesting to remove those because I don't see
> why we'd need them, I'm expecting other people to come and say either
> "Boris, no no, this is needed in... " or "Yeah, go ahead and remove
> them, no one uses those." So feel free to argue either way.

IMHO, before removing those fields, it would be better to first implement
what is there at the mcelog userspace parser for the Intel machines into
kernelspace (or to look into its source code), and check what registers
aren't used by either AMD 64 MCE decoder or by the Intel MCE decoder.

Tony,

Is there anyone at Intel working on porting it to kernelspace?

Regards,
Mauro

2012-02-29 12:19:32

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

On Wed, Feb 29, 2012 at 09:04:46AM -0300, Mauro Carvalho Chehab wrote:
> Not all information is packed in the record. The record packs only what it
> is inside the MCE registers. However, for certain errors, it is needed to
> parse other hardware registers to decode the error (for example, on Sandy
> Bridge, the MCE registers don't contain the affected dimms).

If SB is not using MCA to report the error, it should use either a
generic TP like the trace_hw_error() example I gave last week, or rather
a TP which matches the hw registers of the reporting hardware scheme.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-02-29 12:52:56

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

Em 28-02-2012 13:11, Borislav Petkov escreveu:
> From: Borislav Petkov <[email protected]>
>
> The idea here is to pass an additional decoded MCE message through
> the tracepoint and into the ring buffer for userspace to consume. The
> designated consumers are RAS daemons and other tools collecting RAS
> information.
>
> Drop unneeded fields while at it, thus saving some room in the ring
> buffer.
>
> Signed-off-by: Borislav Petkov <[email protected]>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
> include/trace/events/mce.h | 22 +++++++---------------
> 2 files changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 5a11ae2e9e91..072e020ecaf3 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -144,7 +144,7 @@ void mce_log(struct mce *mce)
> int ret = 0;
>
> /* Emit the trace record: */
> - trace_mce_record(mce);
> + trace_mce_record("", mce);
>
> ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
> if (ret == NOTIFY_STOP)
> diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
> index 4cbbcef6baa8..5fada9a40708 100644
> --- a/include/trace/events/mce.h
> +++ b/include/trace/events/mce.h
> @@ -10,11 +10,12 @@
>
> TRACE_EVENT(mce_record,
>
> - TP_PROTO(struct mce *m),
> + TP_PROTO(const char *msg, struct mce *m),

While is very easy to fold everything into a single errror message field,
doing that makes harder for userspace to break the string into the components,
as already discussed.

The big advantage of using perf, instead of printk is that it allows to pass
the fields into a binary form, instead of passing a single string to userspace.
We should use this feature, otherwise, there's no sense on using perf at all.

Also, even if we agree on some string format to make parsing easier, we're not
good on preserving the kernel-userspace API for string messages, so regressions
can (and will likely) be introduced by a simple change at the "msg" field.

On a complex environment, the network farm is monitored by a Network Management
System. On such systems, there are some fields that are very relevant:
- how critical is the error;
- the location of the error;
- the error message.

So, we should, at least, break it into 3 fields. Yet, for location, there are
two ways of reporting it: The silkscreen label of the affected component
(a CPU socket label, a memory slot label, a PCI slot label, etc),
and the location of the affected element, which generally more granular, like
the cache level, the memory bank, etc.

What I'm saying is that, we should, at least, add 4 fields there, instead of one:

- severity
- location
- silkscreen_label
- error_msg

The severity should be an enum type.

>
> - TP_ARGS(m),
> + TP_ARGS(msg, m),
>
> TP_STRUCT__entry(
> + __string( msg, msg )
> __field( u64, mcgcap )
> __field( u64, mcgstatus )
> __field( u64, status )
> @@ -24,15 +25,12 @@ TRACE_EVENT(mce_record,
> __field( u64, tsc )
> __field( u64, walltime )
> __field( u32, cpu )

This one can be merged into the location field, for the errors where
it is pertinent.

> - __field( u32, cpuid )
> - __field( u32, apicid )

It seems safe to get rid of them.

> - __field( u32, socketid )

For CPU errors, the socket ID is the location of the error. This is
probably the most important "ID" field, as it represents the component
that may need to be replaced.

Yet, by adding a silkscreen_label field, it seems safe to get rid of it.

> __field( u8, cs )
> __field( u8, bank )
> - __field( u8, cpuvendor )

Assuming that some daemon would collect those errors on the monitored machine
and pass them to another machine that will store it and parse (this is a typical
scenario for monitored environments), I don't think we can get rid of the
cpuvendor, as this field is required to decode the MCE registers. It is likely
that we need to add also the CPU family and step here, as the MCE decoding
depends on them.

On the other hand, the userspace daemon can easily get it by parsing /proc/cpuinfo.

So, it is probably ok to remove it.

> ),
>
> TP_fast_assign(
> + __assign_str(msg, msg);
> __entry->mcgcap = m->mcgcap;
> __entry->mcgstatus = m->mcgstatus;
> __entry->status = m->status;
> @@ -42,25 +40,19 @@ TRACE_EVENT(mce_record,
> __entry->tsc = m->tsc;
> __entry->walltime = m->time;
> __entry->cpu = m->extcpu;
> - __entry->cpuid = m->cpuid;
> - __entry->apicid = m->apicid;
> - __entry->socketid = m->socketid;
> __entry->cs = m->cs;
> __entry->bank = m->bank;
> - __entry->cpuvendor = m->cpuvendor;
> ),
>
> - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
> + TP_printk("%s" HW_ERR "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, TIME: %llu)",
> + __get_str(msg),
> __entry->cpu,
> __entry->mcgcap, __entry->mcgstatus,
> __entry->bank, __entry->status,
> __entry->addr, __entry->misc,
> __entry->cs, __entry->ip,
> __entry->tsc,
> - __entry->cpuvendor, __entry->cpuid,
> - __entry->walltime,
> - __entry->socketid,
> - __entry->apicid)
> + __entry->walltime)
> );
>
> #endif /* _TRACE_MCE_H */

2012-02-29 13:06:40

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

Em 29-02-2012 09:19, Borislav Petkov escreveu:
> On Wed, Feb 29, 2012 at 09:04:46AM -0300, Mauro Carvalho Chehab wrote:
>> Not all information is packed in the record. The record packs only what it
>> is inside the MCE registers. However, for certain errors, it is needed to
>> parse other hardware registers to decode the error (for example, on Sandy
>> Bridge, the MCE registers don't contain the affected dimms).
>
> If SB is not using MCA to report the error, it should use either a
> generic TP like the trace_hw_error() example I gave last week, or rather
> a TP which matches the hw registers of the reporting hardware scheme.

This is not what I said. On intel, both SB and Nehalem use MCA to report errors.
Older chipsets don't use MCA.

However, there's a fundamental difference between SB and Nehalem:

- on Nehalem, the MCE status register encodes not only the error message; it
also encodes the DIMM that generated the error. So, it is possible to
completely decode the error on userspace, using only the MCE registers.

- on SB, the MCE status register only has the error message. In order to get
the DIMM location, the driver needs to parse the registers that describe
how the DIMM's are organized (this is spread on dozens of PCI devices, and
200+ registers), and how they're interlaced, in order to convert the error
address reported by the MCA into a DIMM location.

So, just storing the values of the MCE registers is not enough to completely
decode the error.

Regards,
Mauro

2012-02-29 13:37:59

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

On Wed, Feb 29, 2012 at 10:05:53AM -0300, Mauro Carvalho Chehab wrote:
> Em 29-02-2012 09:19, Borislav Petkov escreveu:
> > On Wed, Feb 29, 2012 at 09:04:46AM -0300, Mauro Carvalho Chehab wrote:
> >> Not all information is packed in the record. The record packs only what it
> >> is inside the MCE registers. However, for certain errors, it is needed to
> >> parse other hardware registers to decode the error (for example, on Sandy
> >> Bridge, the MCE registers don't contain the affected dimms).
> >
> > If SB is not using MCA to report the error, it should use either a
> > generic TP like the trace_hw_error() example I gave last week, or rather
> > a TP which matches the hw registers of the reporting hardware scheme.
>
> This is not what I said. On intel, both SB and Nehalem use MCA to report errors.
> Older chipsets don't use MCA.
>
> However, there's a fundamental difference between SB and Nehalem:
>
> - on Nehalem, the MCE status register encodes not only the error message; it
> also encodes the DIMM that generated the error. So, it is possible to
> completely decode the error on userspace, using only the MCE registers.

Well, depending on what Tony wants to do there, either decode the error
in the kernel and pass it on with the 'msg' arg or do the whole decoding
in userspace.

> - on SB, the MCE status register only has the error message. In order to get
> the DIMM location, the driver needs to parse the registers that describe
> how the DIMM's are organized (this is spread on dozens of PCI devices, and
> 200+ registers), and how they're interlaced, in order to convert the error
> address reported by the MCA into a DIMM location.

As I already said, amd64_edac does a similar thing does already so I
don't see any difference in the solutions there: decode to the DIMM and
pass the info through 'msg'.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-02-29 13:46:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

On Wed, Feb 29, 2012 at 09:52:33AM -0300, Mauro Carvalho Chehab wrote:

[snip a bunch of crap I'm tired of correcting again, for the 100th time]

> What I'm saying is that, we should, at least, add 4 fields there, instead of one:
>
> - severity
> - location

I've already explained to you why this format doesn't fit the MCE
tracepoint - go and read my replies to you again.

> - silkscreen_label

ditto.

> - error_msg

[ snip more bullshit ]

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-02-29 14:04:34

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

Em 29-02-2012 10:45, Borislav Petkov escreveu:
> On Wed, Feb 29, 2012 at 09:52:33AM -0300, Mauro Carvalho Chehab wrote:
>
> [snip a bunch of crap I'm tired of correcting again, for the 100th time]
>
>> What I'm saying is that, we should, at least, add 4 fields there, instead of one:
>>
>> - severity
>> - location
>
> I've already explained to you why this format doesn't fit the MCE
> tracepoint - go and read my replies to you again.

No, you didn't. Every time i touch on this point, you just say that it doesn't
fit without giving any explanation why not.

The severity is there, even on your patches, but encoded as string:

+#define PR_EMERG BIT(0)
+#define PR_WARNING BIT(1)

There are, actually, 3 variants, on Intel MCA:

- corrected error;
- uncorrected error (bit 61 of status);
- uncorrected, recoverable error (bit 61 + bit 56 of status).

The location is also there encoded at the MCA. It is specific to the type
of the error, but all errors happen somewhere: a CPU, a BUS, a memory, ...


>> - silkscreen_label
>
> ditto.

ditto: all errors happen somewhere. The silkscreen label is the first
replaceable unit to be replaced by the user, in order to get rid of the
error.

>
>> - error_msg
>
> [ snip more bullshit ]
>

Running away from this discussion won't help at all.

Regards,
Mauro

2012-02-29 14:41:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

On Wed, Feb 29, 2012 at 11:04:09AM -0300, Mauro Carvalho Chehab wrote:
> No, you didn't. Every time i touch on this point, you just say that it
> doesn't fit without giving any explanation why not.

Let me explain it to you one _last_ time:

- severity: No real need for it. If the error is severe enough, the
kernel handles automatically, i.e. memory poisoning and recovery. In all
the other cases it is not severe enough.

- location: this is contained in the ->cpu field.

- silkscreen_label: <sarcasm> yeah, I'm getting a, say, a Data
Cache error during an L1 linefill from L2, what the f*ck does the
silkscreen label mean for such an error?! Well, nobody knows wtf it
means!</sarcasm>

- error_msg: already there in my patch.

So go and read and try _understanding_ this before you come back with
more crap, ok?

> Running away from this discussion won't help at all.

Not running away - trying not to waste time with bullshit.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-02-29 16:58:24

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

> - severity: No real need for it. If the error is severe enough, the
> kernel handles automatically, i.e. memory poisoning and recovery. In all
> the other cases it is not severe enough.

We'll never see fatal errors via the perf/tracepoint (no way the RAS daemon
will run to pull them). But we will see both corrected error chatter and
recovered uncorrectable errors. I would be able to tell these apart.
Corrected errors in small doses are normal and don't require any
action beyond logging so you can see whether there are enough to cross
a threshold and cause alarm. Recovered uncorrectable errors are going
to be much rarer, and I think deserve closer scrutiny - even when there
is just one of them.
If you drop the severity field, is there some other way to make this
distinction?

> - silkscreen_label: <sarcasm> yeah, I'm getting a, say, a Data
> Cache error during an L1 linefill from L2, what the f*ck does the
> silkscreen label mean for such an error?! Well, nobody knows wtf it
> means!</sarcasm>

Cache error should point to a cpu socket - I'd like to have a silk
screen label for that (are they numbered "0, 1, 2 ..." on the motherboard
or "1, 2, 3 ..."?) No idea where we'd get that information from. dmidecode
shows "Socket Designation: CPU 1" (and "2") for my current Sandy Bridge
system. I'd have to pull the system apart to see if those are helpful
in identifying which physical cpu is which.

-Tony

2012-02-29 17:11:34

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

>> - on Nehalem, the MCE status register encodes not only the error message; it
>> also encodes the DIMM that generated the error. So, it is possible to
>> completely decode the error on userspace, using only the MCE registers.
>
> Well, depending on what Tony wants to do there, either decode the error
> in the kernel and pass it on with the 'msg' arg or do the whole decoding
> in userspace.

For best results - we should decode right away in the kernel. Decoding later
requires that we carry a lot of additional information about the system
configuration at the time of the error. Consider the case of a hard error
(either fatal or recoverable). If the system reboots, then the DIMM
with the error should fail self test - and thus be mapped out of the system.
If the error analyzer doesn't realize that this has happened, it will be
very confused. Even if it does notice - the Sandy bridge decoder won't be
able to check that the right DIMM was mapped out (since the configuration
registers it reads to map addresses to DIMMS will now be set for the new
configuration, with different mappings).

-Tony

2012-02-29 17:16:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

On Wed, Feb 29, 2012 at 04:58:09PM +0000, Luck, Tony wrote:
> > - severity: No real need for it. If the error is severe enough, the
> > kernel handles automatically, i.e. memory poisoning and recovery. In all
> > the other cases it is not severe enough.
>
> We'll never see fatal errors via the perf/tracepoint (no way the RAS daemon
> will run to pull them). But we will see both corrected error chatter and
> recovered uncorrectable errors. I would be able to tell these apart.
> Corrected errors in small doses are normal and don't require any
> action beyond logging so you can see whether there are enough to cross
> a threshold and cause alarm. Recovered uncorrectable errors are going
> to be much rarer, and I think deserve closer scrutiny - even when there
> is just one of them.
> If you drop the severity field, is there some other way to make this
> distinction?

Err, MCi_STATUS bits like bit 55 (Action Required) and 56 (Signaled #MC)
in your case...?

> > - silkscreen_label: <sarcasm> yeah, I'm getting a, say, a Data
> > Cache error during an L1 linefill from L2, what the f*ck does the
> > silkscreen label mean for such an error?! Well, nobody knows wtf it
> > means!</sarcasm>
>
> Cache error should point to a cpu socket - I'd like to have a silk
> screen label for that (are they numbered "0, 1, 2 ..." on the motherboard
> or "1, 2, 3 ..."?) No idea where we'd get that information from. dmidecode
> shows "Socket Designation: CPU 1" (and "2") for my current Sandy Bridge
> system. I'd have to pull the system apart to see if those are helpful
> in identifying which physical cpu is which.

First of all, silkscreen label denotes DIMM slots in this context
AFAICT. Concerning CPU sockets, I'm not aware of a method to read out
the silkscreen labels at the CPU sockets, are you? Or am I missing
something?

IOW, we want to assume that cores 0, 1, 2 ... k-1 are on node 0; k, k+1
... 2k-1 belong to node 1, etc., where k is the number of cores on a
socket and thus we have a regular core enumeration on the box.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-02-29 17:17:42

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

Em 29-02-2012 11:40, Borislav Petkov escreveu:
> On Wed, Feb 29, 2012 at 11:04:09AM -0300, Mauro Carvalho Chehab wrote:
>> No, you didn't. Every time i touch on this point, you just say that it
>> doesn't fit without giving any explanation why not.
>
> Let me explain it to you one _last_ time:

Thanks! Your view is now clear.
>
> - severity: No real need for it. If the error is severe enough, the
> kernel handles automatically, i.e. memory poisoning and recovery. In all
> the other cases it is not severe enough.

I see your point, but opone thing is to recover from severe errors;
another thing is to properly report it.

In general, when an error occurs, what users do is to account
them, taking other measures (like replace the affected hardware)
only if the error count is above a certain threshold.

The threshold criteria for non-severe errors is generally different from
the criteria for severe ones. That's why the severity information should
be reported.

>
> - location: this is contained in the ->cpu field.

Assuming that the CPU field contains the location of the error is a bad
assumption. On SB, the CPU that reports a memory error is the CPU where
some code tried to access the RAM, and not the CPU where the memory
controller is. So, the value for CPU is bogus for those error types.
This is also true, at least on Intel, for other types of errors, like
bus/interconnect ones: the CPU that reports the error is the one trying
to access the bus.

Also, on almost all memory error cases, the location of the affected
component is not the memory controller at the CPU. Instead, it is the
DRAM chip, located inside a DIMM.

So, while there are several error types where the location is cpu field,
there are also several other cases where location != cpu.

The kernel decoder knows the error location, on most cases. So, instead
of letting the userspace to guess the error location, it should report
what it was decoded.

> - silkscreen_label: <sarcasm> yeah, I'm getting a, say, a Data
> Cache error during an L1 linefill from L2, what the f*ck does the
> silkscreen label mean for such an error?! Well, nobody knows wtf it
> means!</sarcasm>

It means what component needs to be replaced, because there are too
many errors there, and it is likely damaged.

Silkscreen label for a L1 error: CPU0, CPU1, CPU2, CPU3 (the socket "name"
for the CPU socket at the motherboard, as labeled at the silkscreen).
Again, this is not the CPU field, as one CPU socket has several cores,
and the core/socket ID order can be different than the actual CPU slots
(btw, this is explicitly noticed on a few Intel datasheets).

Of course, for both location and silkscreen label fields, if, for any
reason, the location of the affected component can't be identified, those
fields should be filled with an empty string (or with something like "unknown").

Silkscreen label for a memory error: DIMM1A, DIMM2B, etc.

> - error_msg: already there in my patch.
>
> So go and read and try _understanding_ this before you come back with
> more crap, ok?

Ok, but please do the same and try to see the question from the users
perspective that needs to know what is the damn broken FRU
(Field Replaceable Unit) that needs a replacement on their
critical systems.
>
>> Running away from this discussion won't help at all.
>
> Not running away - trying not to waste time with bullshit.
>

Thanks,
Mauro

2012-02-29 17:19:24

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

On Wed, Feb 29, 2012 at 05:11:08PM +0000, Luck, Tony wrote:
> >> - on Nehalem, the MCE status register encodes not only the error message; it
> >> also encodes the DIMM that generated the error. So, it is possible to
> >> completely decode the error on userspace, using only the MCE registers.
> >
> > Well, depending on what Tony wants to do there, either decode the error
> > in the kernel and pass it on with the 'msg' arg or do the whole decoding
> > in userspace.
>
> For best results - we should decode right away in the kernel. Decoding later
> requires that we carry a lot of additional information about the system
> configuration at the time of the error. Consider the case of a hard error
> (either fatal or recoverable). If the system reboots, then the DIMM
> with the error should fail self test - and thus be mapped out of the system.
> If the error analyzer doesn't realize that this has happened, it will be
> very confused. Even if it does notice - the Sandy bridge decoder won't be
> able to check that the right DIMM was mapped out (since the configuration
> registers it reads to map addresses to DIMMS will now be set for the new
> configuration, with different mappings).

Absolutely!

And also, you've lost the moment the system reboots and you haven't
gotten all the info needed for decoding the error.


--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-02-29 17:28:18

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

> IMHO, before removing those fields, it would be better to first implement
> what is there at the mcelog userspace parser for the Intel machines into
> kernelspace (or to look into its source code), and check what registers
> aren't used by either AMD 64 MCE decoder or by the Intel MCE decoder.
>
> Tony,
>
> Is there anyone at Intel working on porting it to kernelspace?

The mcelog code just looks at model specific fields in MCi_STATUS
and MCi_MISC. We could move it to the kernel - but I don't see
much value in doing so. In this case all the information we need
is carried in status/misc - so as long as we keep all of those
bits (and the cpu family/model) we can safely decode/analyze later.

-Tony

2012-02-29 17:34:21

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

> IOW, we want to assume that cores 0, 1, 2 ... k-1 are on node 0; k, k+1
> ... 2k-1 belong to node 1, etc., where k is the number of cores on a
> socket and thus we have a regular core enumeration on the box.

Sounds dubious:

Booting Node 0, Processors #1 #2 #3 #4 #5 #6 #7 Ok.
Booting Node 1, Processors #8 #9 #10 #11 #12 #13 #14 #15 Ok.
Booting Node 0, Processors #16 #17 #18 #19 #20 #21 #22 #23 Ok.
Booting Node 1, Processors #24 #25 #26 #27 #28 #29 #30 #31
Brought up 32 CPUs

Now those are logical cpu numbers, and we brought up the first HT
thread on each core first, and then came around for a 2nd pass
bringing up the other HT thread. This order is determined by
how the BIOS lists the cpus (and in this case it seems to be
doing so according to recommendations) - so here our core numbers
will match what you said. But the BIOS could do something
strange and list logical cpus alternating between sockets. In
which case cores 0, 2, 4, 6 ... would be on node 0, and cores
1, 3, 5, 7, ... on node 1.

-Tony

2012-02-29 17:46:06

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

Em 29-02-2012 14:16, Borislav Petkov escreveu:
> On Wed, Feb 29, 2012 at 04:58:09PM +0000, Luck, Tony wrote:
>>> - severity: No real need for it. If the error is severe enough, the
>>> kernel handles automatically, i.e. memory poisoning and recovery. In all
>>> the other cases it is not severe enough.
>>
>> We'll never see fatal errors via the perf/tracepoint (no way the RAS daemon
>> will run to pull them).

With the current approach, that's true.

I remember you've mentioned an idea of storing fatal errors on an APEI
non-volatile memory for them to be sent to userspace after a machine
reboot. If this is implemented some day, those type of errors could
also be reported via trace, depending on how such feature would be
implemented, as one possibility would be to just store there the contents
of the last dmesg content.

>> But we will see both corrected error chatter and
>> recovered uncorrectable errors. I would be able to tell these apart.
>> Corrected errors in small doses are normal and don't require any
>> action beyond logging so you can see whether there are enough to cross
>> a threshold and cause alarm. Recovered uncorrectable errors are going
>> to be much rarer, and I think deserve closer scrutiny - even when there
>> is just one of them.
>> If you drop the severity field, is there some other way to make this
>> distinction?
>
> Err, MCi_STATUS bits like bit 55 (Action Required) and 56 (Signaled #MC)
> in your case...?

That would force all userspace tools that handle such errors to have
some MCA-specific logic inside, which is one of the things we're trying
to avoid. Also, non-MCA drivers will have a severity that aren't present
at the MCA status.

Assuming that the same tool can work with both MCA and non-MCA drivers,
for API consistency, we should try to use the same way to describe
severity (and label/location) on both MCA and non-MCA cases.

With regards to Intel, as far as I know, there are some cpu-family
specific stuff for recovered uncorrectable errors.

>>> - silkscreen_label: <sarcasm> yeah, I'm getting a, say, a Data
>>> Cache error during an L1 linefill from L2, what the f*ck does the
>>> silkscreen label mean for such an error?! Well, nobody knows wtf it
>>> means!</sarcasm>
>>
>> Cache error should point to a cpu socket - I'd like to have a silk
>> screen label for that (are they numbered "0, 1, 2 ..." on the motherboard
>> or "1, 2, 3 ..."?) No idea where we'd get that information from. dmidecode
>> shows "Socket Designation: CPU 1" (and "2") for my current Sandy Bridge
>> system. I'd have to pull the system apart to see if those are helpful
>> in identifying which physical cpu is which.
>
> First of all, silkscreen label denotes DIMM slots in this context
> AFAICT.

No. I'm referring to the Silkscreen label (and the location) of the
affected component, and not to just DIMMs.

> Concerning CPU sockets, I'm not aware of a method to read out
> the silkscreen labels at the CPU sockets, are you? Or am I missing
> something?

The same strategy used by edac can be used there: add a 'label' node to
/sys/devices/system/cpu/cpu?/

To allow userspace to fill it/override it.

> IOW, we want to assume that cores 0, 1, 2 ... k-1 are on node 0; k, k+1
> ... 2k-1 belong to node 1, etc., where k is the number of cores on a
> socket and thus we have a regular core enumeration on the box.

Initially, the RAS code could fill the 'label' using the above criteria,
while allowing an userspace tool to get the labels from dmidecode and
use it there.

Regards,
Mauro

2012-02-29 18:01:00

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

Em 29-02-2012 14:20, Luck, Tony escreveu:
>> IMHO, before removing those fields, it would be better to first implement
>> what is there at the mcelog userspace parser for the Intel machines into
>> kernelspace (or to look into its source code), and check what registers
>> aren't used by either AMD 64 MCE decoder or by the Intel MCE decoder.
>>
>> Tony,
>>
>> Is there anyone at Intel working on porting it to kernelspace?
>
> The mcelog code just looks at model specific fields in MCi_STATUS
> and MCi_MISC. We could move it to the kernel - but I don't see
> much value in doing so.

I see a few reasons:

- it would be consistent with what's being done at AMD. So, all x86
arch will report errors at the same way;

- userspace won't need to run an extra daemon/tool to decode the
errors;

- fatal errors won't be lost (well, in fact, you have there already
a parser for fatal errors. Not sure if all possible fatal errors
are covered here, nor what else is needed, as you have already
there part of mcelog decoder);

- a single place to maintain, when new cpu families are added;

- it makes easier to centralize the hardware error information,
as there's no need to enrich the error on userspace.

> In this case all the information we need
> is carried in status/misc - so as long as we keep all of those
> bits (and the cpu family/model) we can safely decode/analyze later.

Hmm... the mcelog tool opens the /proc/cpuinfo:

$ grep cpuinfo *
mcelog.c: f = fopen("/proc/cpuinfo","r");
mcelog.c: Eprintf("warning: Cannot parse /proc/cpuinfo\n");
mcelog.c: Eprintf("warning: Cannot open /proc/cpuinfo\n");
tsc.c: asprintf(&fn, "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
tsc.c: /* /sys exists, but no cpufreq -- use value from cpuinfo */

That probably means that the needed cpu family/model info is not (or may not) be
stored at the MCE structure.

>
> -Tony

2012-02-29 18:12:53

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

> That probably means that the needed cpu family/model info is not (or may not) be
> stored at the MCE structure.

mcelog can run as a daemon, or one-off to process reports from earlier (or
even from other systems). It picks up cpu/family from /proc/cpuinfo - but can
also take a command line argument. These get checked against the cpuid
field in the structure that is read from /dev/mcelog (or decoded from a
console log with the --ascii option).

In some cases the checks may be redundant (e.g. running in daemon mode,
there is no way that /dev/mcelog will not have a cpuid that matches the
cpu that you are running on!) But better to have too many checks than too
few.

-Tony

2012-03-01 02:24:08

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

(2012/02/29 22:37), Borislav Petkov wrote:
> On Wed, Feb 29, 2012 at 10:05:53AM -0300, Mauro Carvalho Chehab wrote:
>> Em 29-02-2012 09:19, Borislav Petkov escreveu:
>> - on SB, the MCE status register only has the error message. In order to get
>> the DIMM location, the driver needs to parse the registers that describe
>> how the DIMM's are organized (this is spread on dozens of PCI devices, and
>> 200+ registers), and how they're interlaced, in order to convert the error
>> address reported by the MCA into a DIMM location.
>
> As I already said, amd64_edac does a similar thing does already so I
> don't see any difference in the solutions there: decode to the DIMM and
> pass the info through 'msg'.

My concern is; on Sandy Bridge, is it safe to gather info about the DIMM
location in/from machine check context in a reasonable time span?
I know that for corrected errors which is handled in normal context it is
safe to refer the vast PCI configuration space...

Or is it really possible to determine the erroneous DIMM location from OS?
It looks like that how to get the location is highly depending on the
hardware, processor's vendor/family/model and firmware configuration etc..
Even if OS tells me "please replace memory seated on slot#3 at node#5" or
so, I'm not sure whether these numbers are consistent over reboot if
there are some hot-plugged node and/or memory. Order of numbering can
be changed by how firmware enumerate ACPI namespace or so...
Actually in these days we usually use firmware's system event log to
determine which module should be replaced, assuming that firmware knows
hardware better than OSes running on that machine.

Getting back to the "msg" I think it is not necessary if it does not
contain any new data which is not available in the mce_record today.
If you just want to add field about physical memory location, I think
string "msg" is not only way to do so.


Thanks,
H.Seto