2010-01-21 22:17:20

by Andi Kleen

Subject: [PATCH] x86: mce: Xeon75xx specific interface to get corrected memory error information

x86: mce: Xeon75xx specific interface to get corrected memory error information

Xeon 75xx doesn't log physical addresses on corrected machine check
events in the standard architectural MSRs. Instead the address has to
be retrieved in a model specific way. This makes it impossible
to do predictive failure analysis.

Implement cpu model specific code to do this in mce-xeon75xx.c using a new hook
that is called from the generic poll code. The code retrieves
the physical address/DIMM of the last corrected error from the platform
and makes the address look like a standard architectural MCA address for
further processing.

In addition the DIMM information is retrieved and put into two new aux0/aux1
fields in struct mce. These fields are specific to a given CPU.
These fields can then be decoded by mcelog into specific DIMM information.
The latest mcelog version has support for this.
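
For illustration, a consumer that knows it is running on this CPU model could
decode aux0/aux1 roughly like the sketch below (based on the struct
aux_pfa_dimm layout added in this patch; the actual mcelog decoder may look
different):

#include <stdio.h>
#include <string.h>

struct aux_pfa_dimm {
        unsigned char fbd_channel_id;
        unsigned char ddr_channel_id;
        unsigned char ddr_dimm_id;
        unsigned char ddr_rank_id;
        unsigned char ddr_dimm_bank_id;
        unsigned char ddr_dimm_row_id;
        unsigned char ddr_dimm_column_id;
        unsigned char valid;                    /* DIMM_VALID_* bits */
} __attribute__((packed));

/* Decode one of the model specific aux fields of a logged struct mce */
void decode_aux(unsigned long long aux)
{
        struct aux_pfa_dimm d;

        memcpy(&d, &aux, sizeof(d));
        if (d.valid & (1 << 2))                 /* DIMM_VALID_DDR_DIMM */
                printf("DIMM %u (FBD channel %u, DDR channel %u, rank %u)\n",
                       d.ddr_dimm_id, d.fbd_channel_id, d.ddr_channel_id,
                       d.ddr_rank_id);
}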

Longer term this will likely be in a different output format, but
short term that seemed like the least intrusive solution. Older
mcelog versions can deal with an extended record.

There's no code to print this information on a panic because this only
works for corrected errors, and corrected errors do not usually result in
panics.

The act of retrieving the DIMM/PA information can take some time, so this
code has a rate limit to avoid taking too much CPU time on an error flood.

The whole thing can be loaded as a module and has suitable
PCI-IDs so that it can be auto-loaded by a distribution.
The code also checks explicitly for the expected CPU model
number to make sure this code doesn't run anywhere else.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/Kconfig | 8
arch/x86/include/asm/mce.h | 2
arch/x86/kernel/cpu/mcheck/Makefile | 1
arch/x86/kernel/cpu/mcheck/mce-internal.h | 1
arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c | 427 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mcheck/mce.c | 12
arch/x86/kernel/e820.c | 3
7 files changed, 453 insertions(+), 1 deletion(-)

Index: linux/arch/x86/kernel/cpu/mcheck/mce.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce.c
+++ linux/arch/x86/kernel/cpu/mcheck/mce.c
@@ -88,6 +88,9 @@ static DECLARE_WAIT_QUEUE_HEAD(mce_wait)
static DEFINE_PER_CPU(struct mce, mces_seen);
static int cpu_missing;

+void (*cpu_specific_poll)(struct mce *);
+EXPORT_SYMBOL_GPL(cpu_specific_poll);
+
/*
* CPU/chipset specific EDAC code can register a notifier call here to print
* MCE errors in a human-readable form.
@@ -364,6 +367,11 @@ static void mce_wrmsrl(u32 msr, u64 v)
wrmsrl(msr, v);
}

+static int under_injection(void)
+{
+ return __get_cpu_var(injectm).finished;
+}
+
/*
* Simple lockless ring to communicate PFNs from the exception handler with the
* process context work function. This is vastly simplified because there's
@@ -575,6 +583,10 @@ void machine_check_poll(enum mcp_flags f

if (!(flags & MCP_TIMESTAMP))
m.tsc = 0;
+
+ if (cpu_specific_poll && !under_injection() && !mce_dont_log_ce)
+ cpu_specific_poll(&m);
+
/*
* Don't get the IP here because it's unlikely to
* have anything to do with the actual error location.
Index: linux/arch/x86/kernel/cpu/mcheck/mce-internal.h
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ linux/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -28,3 +28,4 @@ extern int mce_ser;

extern struct mce_bank *mce_banks;

+extern void (*cpu_specific_poll)(struct mce *);
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h
+++ linux/arch/x86/include/asm/mce.h
@@ -67,6 +67,8 @@ struct mce {
__u32 socketid; /* CPU socket ID */
__u32 apicid; /* CPU initial apic ID */
__u64 mcgcap; /* MCGCAP MSR: machine check capabilities of CPU */
+ __u64 aux0; /* model specific */
+ __u64 aux1; /* model specific */
};

/*
Index: linux/arch/x86/kernel/cpu/mcheck/Makefile
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/Makefile
+++ linux/arch/x86/kernel/cpu/mcheck/Makefile
@@ -2,6 +2,7 @@ obj-y = mce.o mce-severity.o

obj-$(CONFIG_X86_ANCIENT_MCE) += winchip.o p5.o
obj-$(CONFIG_X86_MCE_INTEL) += mce_intel.o
+obj-$(CONFIG_X86_MCE_XEON75XX) += mce-xeon75xx.o
obj-$(CONFIG_X86_MCE_AMD) += mce_amd.o
obj-$(CONFIG_X86_MCE_THRESHOLD) += threshold.o
obj-$(CONFIG_X86_MCE_INJECT) += mce-inject.o
Index: linux/arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c
===================================================================
--- /dev/null
+++ linux/arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c
@@ -0,0 +1,427 @@
+/*
+ * Xeon 7500 series specific machine check support code.
+ * Copyright 2009, 2010 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ *
+ * Implement Xeon 7500 series specific code to retrieve the physical address
+ * and DIMM information for corrected memory errors.
+ *
+ * Interface: mce->aux0/aux1 is mapped to a struct pfa_dimm with pad
+ * redefined to DIMM valid bits. Consumers check CPUID and bank and
+ * then interpret aux0/aux1
+ */
+
+/* #define DEBUG 1 */ /* disable for production */
+#define pr_fmt(x) "MCE: " x
+
+#include <linux/moduleparam.h>
+#include <linux/pci_ids.h>
+#include <linux/hrtimer.h>
+#include <linux/string.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include <asm/processor.h>
+#include <asm/e820.h>
+#include <asm/mce.h>
+#include <asm/io.h>
+
+#include "mce-internal.h"
+
+#define PFA_SIG "$PFA"
+#define PFA_SIG_LEN 4
+
+/* DIMM description */
+struct aux_pfa_dimm {
+ u8 fbd_channel_id;
+ u8 ddr_channel_id;
+ u8 ddr_dimm_id;
+ u8 ddr_rank_id;
+ u8 ddr_dimm_bank_id;
+ u8 ddr_dimm_row_id;
+ u8 ddr_dimm_column_id;
+ u8 valid;
+} __attribute__((packed));
+
+struct pfa_dimm {
+ u8 fbd_channel_id;
+ u8 ddr_channel_id;
+ u8 ddr_dimm_id;
+ u8 ddr_rank_id;
+ u8 ddr_dimm_bank_id;
+ u32 ddr_dimm_row_id;
+ u32 ddr_dimm_column_id;
+} __attribute__((packed));
+
+/* Memory translation table in memory. */
+struct pfa_table {
+ u8 sig[PFA_SIG_LEN]; /* Signature: '$PFA' */
+ u16 len; /* total length */
+ u16 revision; /* 0x11 */
+ u8 checksum; /* 8bit sum to zero */
+ u8 db_value; /* mailbox port command value */
+ u8 db_port; /* mailbox port */
+ /* end of header; end of checksum */
+ u8 command; /* input command */
+ u32 valid; /* valid input/output bits */
+ u16 status; /* output status */
+ u8 socket_id; /* input socket id*/
+ u8 bank_id; /* input MCE bank id */
+ u32 pad1;
+ u64 mbox_address;
+ u64 physical_addr; /* physical address */
+ struct pfa_dimm dimm[2];
+ /*
+ * topology information follows: not used for now.
+ */
+} __attribute__((packed));
+
+/* DIMM valid bits in valid: DIMM0: 8..12; DIMM1 16..20 */
+#define DIMM_VALID_BITS(val, num) (((val) >> (4 + (num) * 8)) & DIMM_VALID_ALL)
+#define DIMM_SET_VALID(val, num) ((val) << (4 + (num) * 8))
+
+enum {
+ MCE_BANK_MBOX0 = 8,
+ MCE_BANK_MBOX1 = 9,
+
+ PFA_REVISION = 0x11, /* v1.1 */
+
+ /* Status bits for valid field */
+ PFA_VALID_MA = (1 << 0),
+ PFA_VALID_SOCKETID = (1 << 1),
+ PFA_VALID_BANKID = (1 << 2),
+ PFA_VALID_PA = (1 << 3),
+
+ /* DIMM valid bits in valid */
+ /* use with DIMM_VALID_BITS/DIMM_SET_VALID for pfa->valid */
+ DIMM_VALID_FBD_CHAN = (1 << 0),
+ DIMM_VALID_DDR_CHAN = (1 << 1),
+ DIMM_VALID_DDR_DIMM = (1 << 2),
+ DIMM_VALID_DDR_RANK = (1 << 3),
+ DIMM_VALID_DIMM_BANK = (1 << 4),
+ DIMM_VALID_DIMM_ROW = (1 << 5),
+ DIMM_VALID_DIMM_COLUMN = (1 << 6),
+ DIMM_VALID_ALL = 0x7f,
+
+ PFA_DIMM_VALID_MASK = DIMM_SET_VALID(DIMM_VALID_ALL, 0)
+ | DIMM_SET_VALID(DIMM_VALID_ALL, 1),
+
+ /* Values for status field */
+ PFA_STATUS_SUCCESS = 0,
+ PFA_STATUS_SOCKET_INVALID = (1 << 1),
+ PFA_STATUS_MBOX_INVALID = (1 << 2),
+ PFA_STATUS_MA_INVALID = (1 << 3),
+ PFA_STATUS_PA_INVALID = (1 << 4),
+
+ /* Values for command field */
+ PFA_CMD_GET_MEM_CORR_ERR_PA = 0,
+ PFA_CMD_PA_TO_DIMM_ADDR = 1,
+ PFA_CMD_DIMM_TO_PA = 2,
+ PFA_CMD_GET_TOPOLOGY = 3,
+
+ /* PCI device IDs and the base register */
+ ICH_PFA_CFG = 0x8c, /* SCRATCH4 */
+ PCI_DEVICE_ID_BXB_ICH_LEGACY0 = 0x3422,
+};
+
+static struct pfa_table *pfa_table __read_mostly;
+static int memerr_max_conv_rate __read_mostly = 100;
+static int memerr_min_interval __read_mostly = 500;
+static int pfa_lost; /* for diagnosis */
+
+enum {
+ RATE_LIMIT_PERIOD = USEC_PER_SEC, /* in us; period of rate limit */
+};
+
+module_param(memerr_max_conv_rate, int, 0644);
+MODULE_PARM_DESC(memerr_max_conv_rate,
+ "Maximum number of memory error conversions each second; 0 to disable");
+module_param(memerr_min_interval, int, 0644);
+MODULE_PARM_DESC(memerr_min_interval,
+ "Minimum time delta between two memory conversions; in us; default 500");
+
+static int notest;
+static int nocsum;
+module_param(notest, int, 0);
+module_param(nocsum, int, 0);
+
+static u64 encode_dimm(struct pfa_dimm *d, u8 valid)
+{
+ union {
+ struct aux_pfa_dimm d;
+ u64 v;
+ } p;
+
+ BUILD_BUG_ON(sizeof(struct aux_pfa_dimm) != sizeof(u64));
+ p.d.fbd_channel_id = d->fbd_channel_id;
+ p.d.ddr_channel_id = d->ddr_channel_id;
+ p.d.ddr_dimm_id = d->ddr_dimm_id;
+ p.d.ddr_rank_id = d->ddr_rank_id;
+ p.d.ddr_dimm_bank_id = d->ddr_dimm_bank_id;
+ p.d.ddr_dimm_row_id = d->ddr_dimm_row_id;
+ if (p.d.ddr_dimm_row_id != d->ddr_dimm_row_id) /* truncated? */
+ valid &= ~DIMM_VALID_DIMM_ROW;
+ p.d.ddr_dimm_column_id = d->ddr_dimm_column_id;
+ if (p.d.ddr_dimm_column_id != d->ddr_dimm_column_id)
+ valid &= ~DIMM_VALID_DIMM_COLUMN;
+ p.d.valid = valid;
+ pr_debug("PFA fbd_ch %u ddr_ch %u dimm %u rank %u bank %u valid %x\n",
+ d->fbd_channel_id,
+ d->ddr_channel_id,
+ d->ddr_dimm_id,
+ d->ddr_rank_id,
+ d->ddr_dimm_bank_id,
+ valid);
+ return p.v;
+}
+
+static u8 csum(u8 *table, u16 len)
+{
+ u8 sum = 0;
+ int i;
+ for (i = 0; i < len; i++)
+ sum += *table++;
+ return sum;
+}
+
+/*
+ * Execute a command through the mailbox interface.
+ */
+static int
+pfa_command(unsigned bank, unsigned socketid, unsigned command, unsigned valid)
+{
+ pfa_table->bank_id = bank;
+ pfa_table->socket_id = socketid;
+ pfa_table->valid = valid | PFA_VALID_SOCKETID;
+ pfa_table->command = command;
+
+ outb(pfa_table->db_value, pfa_table->db_port);
+
+ mb(); /* Reread fields after they got changed */
+
+ if (pfa_table->status != PFA_STATUS_SUCCESS) {
+ pr_debug("Memory PFA command %d failed: socket:%d bank:%d status:%x\n",
+ command, socketid, bank, pfa_table->status);
+ return -pfa_table->status;
+ }
+ return 0;
+}
+
+/*
+ * Retrieve physical address and DIMMs.
+ */
+static int translate_memory_error(struct mce *m)
+{
+ struct pfa_table *pfa = pfa_table;
+ u64 status;
+ int ret;
+ u32 valid;
+ int cpu = smp_processor_id();
+
+ /* Make sure our structures match the specification */
+ BUILD_BUG_ON(offsetof(struct pfa_table, physical_addr) != 0x20);
+ BUILD_BUG_ON(offsetof(struct pfa_table, status) != 0x10);
+ BUILD_BUG_ON(offsetof(struct pfa_table, physical_addr) != 0x20);
+ BUILD_BUG_ON(offsetof(struct pfa_table, dimm[1].ddr_dimm_column_id) !=
+ 0x3e);
+
+ /* Ask for PA/DIMMs of last error */
+ if (pfa_command(m->bank, m->socketid,
+ PFA_CMD_GET_MEM_CORR_ERR_PA, PFA_VALID_BANKID) < 0)
+ return -1;
+
+ /*
+ * Recheck machine check bank. If the overflow bit was set
+ * there was a race. Don't use the information in this case.
+ */
+ rdmsrl(MSR_IA32_MCx_STATUS(m->bank), status);
+ if (status & MCI_STATUS_OVER) {
+ pr_debug("%d: overflow race on bank %d\n", cpu, m->bank);
+ return -1;
+ }
+
+ ret = -1;
+ valid = pfa->valid;
+ if (valid & PFA_VALID_PA) {
+ m->status |= MCI_STATUS_ADDRV;
+ m->addr = pfa_table->physical_addr;
+ pr_debug("%d: got physical address %llx valid %x\n",
+ cpu, m->addr, valid);
+ ret = 0;
+ }
+
+ /* When DIMM information was supplied pass it out */
+ if (valid & PFA_DIMM_VALID_MASK) {
+ m->aux0 = encode_dimm(&pfa->dimm[0], DIMM_VALID_BITS(valid, 0));
+ m->aux1 = encode_dimm(&pfa->dimm[1], DIMM_VALID_BITS(valid, 1));
+ ret = 0;
+ }
+
+ return ret;
+}
+
+/*
+ * Xeon 75xx specific mce poll method to retrieve the physical address
+ * and DIMM information.
+ */
+static void xeon75xx_mce_poll(struct mce *m)
+{
+ static DEFINE_SPINLOCK(convert_lock); /* Protect table and static */
+ static unsigned long cperm;
+ static ktime_t last, last_int;
+ unsigned long flags;
+ ktime_t now;
+ s64 delta;
+
+ /* Memory error? */
+ if (m->bank != MCE_BANK_MBOX0 && m->bank != MCE_BANK_MBOX1)
+ return;
+ if (m->status & MCI_STATUS_OVER)
+ return;
+ if (memerr_max_conv_rate == 0)
+ return;
+
+ spin_lock_irqsave(&convert_lock, flags);
+ /*
+ * Rate limit conversions. The conversion takes some time,
+ * but it's not good to use all the CPU time during a error
+ * flood.
+ * Enforce maximum number per second and minimum interval.
+ * The ktime call should use TSC on this machine and be fast.
+ */
+ now = ktime_get();
+ delta = ktime_us_delta(now, last);
+ if (delta >= RATE_LIMIT_PERIOD) {
+ cperm = 0;
+ last = now;
+ }
+ if (ktime_us_delta(now, last_int) >= memerr_min_interval &&
+ ++cperm <= memerr_max_conv_rate) {
+ if (translate_memory_error(m) < 0) {
+ /* On error stop converting for the next second */
+ cperm = memerr_max_conv_rate;
+ pr_debug("PFA translation failed\n");
+ }
+ } else
+ pfa_lost++;
+ last_int = now;
+ spin_unlock_irqrestore(&convert_lock, flags);
+}
+
+static struct pci_device_id bxb_mce_pciids[] = {
+ { PCI_VDEVICE(INTEL, PCI_DEVICE_ID_BXB_ICH_LEGACY0) },
+ {}
+};
+
+static int __init xeon75xx_mce_init(void)
+{
+ u32 addr = 0;
+ struct pci_dev *dev;
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+ boot_cpu_data.x86 != 6 ||
+ boot_cpu_data.x86_model != 0x2e)
+ return -ENODEV;
+
+ /*
+ * Get table address from register in IOH.
+ * This just looks up the device, because we don't want to "own" it.
+ */
+ dev = NULL;
+ while ((dev = pci_get_device(PCI_VENDOR_ID_INTEL, PCI_ANY_ID, dev))
+ != NULL) {
+ if (!pci_match_id(bxb_mce_pciids, dev))
+ continue;
+ pci_read_config_dword(dev, ICH_PFA_CFG, &addr);
+ if (addr)
+ break;
+ }
+ pci_dev_put(dev);
+ if (!addr)
+ return -ENODEV;
+
+ if (!e820_all_mapped(addr, addr + PAGE_SIZE, E820_RESERVED)) {
+ pr_info("PFA table at %x not e820 reserved\n", addr);
+ return -ENODEV;
+ }
+
+ pfa_table = (__force struct pfa_table *)ioremap_cache(addr, PAGE_SIZE);
+ if (!pfa_table) {
+ pr_err("Cannot map PFA table at %x\n", addr);
+ return -EIO;
+ }
+
+ if (memcmp(&pfa_table->sig, PFA_SIG, PFA_SIG_LEN) ||
+ pfa_table->len < sizeof(struct pfa_table) ||
+ /* assume newer versions are compatible */
+ pfa_table->revision < PFA_REVISION) {
+ pr_info("PFA table at %x invalid\n", addr);
+ goto error_unmap;
+ }
+
+ if (!nocsum && csum((u8 *)pfa_table,
+ offsetof(struct pfa_table, command))) {
+ pr_info("PFA table at %x length %u has invalid checksum\n",
+ addr, pfa_table->len);
+ goto error_unmap;
+ }
+
+ /* Not strictly needed today */
+ if (pfa_table->len > PAGE_SIZE) {
+ unsigned len = roundup(pfa_table->len, PAGE_SIZE);
+ iounmap(pfa_table);
+ pfa_table = (__force void *)ioremap_cache(addr, len);
+ if (!pfa_table) {
+ pr_err("Cannot remap %u bytes PFA table at %x\n",
+ len, addr);
+ return -EIO;
+ }
+ }
+
+ if (!notest) {
+ int status = pfa_command(0, 0, PFA_CMD_GET_TOPOLOGY, 0);
+ if (status < 0) {
+ pr_err("Test of PFA table failed: %x\n", -status);
+ goto error_unmap;
+ }
+ }
+
+ pr_info("Found Xeon75xx PFA memory error translation table at %x\n",
+ addr);
+ mb();
+ cpu_specific_poll = xeon75xx_mce_poll;
+ return 0;
+
+error_unmap:
+ iounmap(pfa_table);
+ return -ENODEV;
+}
+
+MODULE_DEVICE_TABLE(pci, bxb_mce_pciids);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Andi Kleen");
+MODULE_DESCRIPTION("Intel Xeon 75xx specific DIMM error reporting");
+
+#ifdef CONFIG_MODULE
+static void __exit xeon75xx_mce_exit(void)
+{
+ cpu_specific_poll = NULL;
+ wmb();
+ /* Wait for all machine checks to finish before really unloading */
+ synchronize_rcu();
+ iounmap(pfa_table);
+}
+
+module_init(xeon75xx_mce_init);
+module_exit(xeon75xx_mce_exit);
+#else
+/* When built-in run as soon as the PCI subsystem is up */
+fs_initcall(xeon75xx_mce_init);
+#endif
Index: linux/arch/x86/kernel/e820.c
===================================================================
--- linux.orig/arch/x86/kernel/e820.c
+++ linux/arch/x86/kernel/e820.c
@@ -79,7 +79,7 @@ EXPORT_SYMBOL_GPL(e820_any_mapped);
* Note: this function only works correct if the e820 table is sorted and
* not-overlapping, which is the case
*/
-int __init e820_all_mapped(u64 start, u64 end, unsigned type)
+int e820_all_mapped(u64 start, u64 end, unsigned type)
{
int i;

@@ -106,6 +106,7 @@ int __init e820_all_mapped(u64 start, u6
}
return 0;
}
+EXPORT_SYMBOL_GPL(e820_all_mapped);

/*
* Add a memory region to the kernel e820 map.
Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -831,6 +831,14 @@ config X86_MCE_INTEL
Additional support for intel specific MCE features such as
the thermal monitor.

+config X86_MCE_XEON75XX
+ tristate "Intel Xeon 7500 series corrected memory error driver"
+ depends on X86_MCE_INTEL
+ ---help---
+ Add support for an Intel Xeon 7500 series specific memory error driver.
+ This allows reporting the DIMM and physical address on a corrected
+ memory error machine check event.
+
config X86_MCE_AMD
def_bool y
prompt "AMD MCE features"


2010-01-22 10:52:20

by Andi Kleen

Subject: [tip:x86/mce] x86, mce: Xeon75xx specific interface to get corrected memory error information

Commit-ID: c773f70fd6b53ee646727f871833e53649907264
Gitweb: http://git.kernel.org/tip/c773f70fd6b53ee646727f871833e53649907264
Author: Andi Kleen <[email protected]>
AuthorDate: Thu, 21 Jan 2010 23:17:12 +0100
Committer: H. Peter Anvin <[email protected]>
CommitDate: Thu, 21 Jan 2010 17:38:20 -0800

x86, mce: Xeon75xx specific interface to get corrected memory error information

Xeon 75xx doesn't log physical addresses on corrected machine check
events in the standard architectural MSRs. Instead the address has to
be retrieved in a model specific way. This makes it impossible to do
predictive failure analysis.

Implement cpu model specific code to do this in mce-xeon75xx.c using a
new hook that is called from the generic poll code. The code retrieves
the physical address/DIMM of the last corrected error from the
platform and makes the address look like a standard architectural MCA
address for further processing.

In addition the DIMM information is retrieved and put into two new
aux0/aux1 fields in struct mce. These fields are specific to a given
CPU. These fields can then be decoded by mcelog into specific DIMM
information. The latest mcelog version has support for this.

Longer term this will likely be in a different output format, but
short term that seemed like the least intrusive solution. Older mcelog
versions can deal with an extended record.

There's no code to print this information on a panic because this only
works for corrected errors, and corrected errors do not usually result
in panics.

The act of retrieving the DIMM/PA information can take some time, so
this code has a rate limit to avoid taking too much CPU time on an
error flood.

The whole thing can be loaded as a module and has suitable PCI-IDs so
that it can be auto-loaded by a distribution. The code also checks
explicitly for the expected CPU model number to make sure this code
doesn't run anywhere else.

Signed-off-by: Andi Kleen <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/Kconfig | 8 +
arch/x86/include/asm/mce.h | 2 +
arch/x86/kernel/cpu/mcheck/Makefile | 1 +
arch/x86/kernel/cpu/mcheck/mce-internal.h | 1 +
arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c | 427 +++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mcheck/mce.c | 12 +
arch/x86/kernel/e820.c | 3 +-
7 files changed, 453 insertions(+), 1 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cbcbfde..f62db24 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -832,6 +832,14 @@ config X86_MCE_INTEL
Additional support for intel specific MCE features such as
the thermal monitor.

+config X86_MCE_XEON75XX
+ tristate "Intel Xeon 7500 series corrected memory error driver"
+ depends on X86_MCE_INTEL
+ ---help---
+ Add support for an Intel Xeon 7500 series specific memory error driver.
+ This allows reporting the DIMM and physical address on a corrected
+ memory error machine check event.
+
config X86_MCE_AMD
def_bool y
prompt "AMD MCE features"
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 6c3fdd6..d2cf043 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -67,6 +67,8 @@ struct mce {
__u32 socketid; /* CPU socket ID */
__u32 apicid; /* CPU initial apic ID */
__u64 mcgcap; /* MCGCAP MSR: machine check capabilities of CPU */
+ __u64 aux0; /* model specific */
+ __u64 aux1; /* model specific */
};

/*
diff --git a/arch/x86/kernel/cpu/mcheck/Makefile b/arch/x86/kernel/cpu/mcheck/Makefile
index 4ac6d48..16606f4 100644
--- a/arch/x86/kernel/cpu/mcheck/Makefile
+++ b/arch/x86/kernel/cpu/mcheck/Makefile
@@ -2,6 +2,7 @@ obj-y = mce.o mce-severity.o

obj-$(CONFIG_X86_ANCIENT_MCE) += winchip.o p5.o
obj-$(CONFIG_X86_MCE_INTEL) += mce_intel.o
+obj-$(CONFIG_X86_MCE_XEON75XX) += mce-xeon75xx.o
obj-$(CONFIG_X86_MCE_AMD) += mce_amd.o
obj-$(CONFIG_X86_MCE_THRESHOLD) += threshold.o
obj-$(CONFIG_X86_MCE_INJECT) += mce-inject.o
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 32996f9..d5b7eec 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -28,3 +28,4 @@ extern int mce_ser;

extern struct mce_bank *mce_banks;

+extern void (*cpu_specific_poll)(struct mce *);
diff --git a/arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c b/arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c
new file mode 100644
index 0000000..67ad39b
--- /dev/null
+++ b/arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c
@@ -0,0 +1,427 @@
+/*
+ * Xeon 7500 series specific machine check support code.
+ * Copyright 2009, 2010 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ *
+ * Implement Xeon 7500 series specific code to retrieve the physical address
+ * and DIMM information for corrected memory errors.
+ *
+ * Interface: mce->aux0/aux1 is mapped to a struct pfa_dimm with pad
+ * redefined to DIMM valid bits. Consumers check CPUID and bank and
+ * then interpret aux0/aux1
+ */
+
+/* #define DEBUG 1 */ /* disable for production */
+#define pr_fmt(x) "MCE: " x
+
+#include <linux/moduleparam.h>
+#include <linux/pci_ids.h>
+#include <linux/hrtimer.h>
+#include <linux/string.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include <asm/processor.h>
+#include <asm/e820.h>
+#include <asm/mce.h>
+#include <asm/io.h>
+
+#include "mce-internal.h"
+
+#define PFA_SIG "$PFA"
+#define PFA_SIG_LEN 4
+
+/* DIMM description */
+struct aux_pfa_dimm {
+ u8 fbd_channel_id;
+ u8 ddr_channel_id;
+ u8 ddr_dimm_id;
+ u8 ddr_rank_id;
+ u8 ddr_dimm_bank_id;
+ u8 ddr_dimm_row_id;
+ u8 ddr_dimm_column_id;
+ u8 valid;
+} __attribute__((packed));
+
+struct pfa_dimm {
+ u8 fbd_channel_id;
+ u8 ddr_channel_id;
+ u8 ddr_dimm_id;
+ u8 ddr_rank_id;
+ u8 ddr_dimm_bank_id;
+ u32 ddr_dimm_row_id;
+ u32 ddr_dimm_column_id;
+} __attribute__((packed));
+
+/* Memory translation table in memory. */
+struct pfa_table {
+ u8 sig[PFA_SIG_LEN]; /* Signature: '$PFA' */
+ u16 len; /* total length */
+ u16 revision; /* 0x11 */
+ u8 checksum; /* 8bit sum to zero */
+ u8 db_value; /* mailbox port command value */
+ u8 db_port; /* mailbox port */
+ /* end of header; end of checksum */
+ u8 command; /* input command */
+ u32 valid; /* valid input/output bits */
+ u16 status; /* output status */
+ u8 socket_id; /* input socket id*/
+ u8 bank_id; /* input MCE bank id */
+ u32 pad1;
+ u64 mbox_address;
+ u64 physical_addr; /* physical address */
+ struct pfa_dimm dimm[2];
+ /*
+ * topology information follows: not used for now.
+ */
+} __attribute__((packed));
+
+/* DIMM valid bits in valid: DIMM0: 8..12; DIMM1 16..20 */
+#define DIMM_VALID_BITS(val, num) (((val) >> (4 + (num) * 8)) & DIMM_VALID_ALL)
+#define DIMM_SET_VALID(val, num) ((val) << (4 + (num) * 8))
+
+enum {
+ MCE_BANK_MBOX0 = 8,
+ MCE_BANK_MBOX1 = 9,
+
+ PFA_REVISION = 0x11, /* v1.1 */
+
+ /* Status bits for valid field */
+ PFA_VALID_MA = (1 << 0),
+ PFA_VALID_SOCKETID = (1 << 1),
+ PFA_VALID_BANKID = (1 << 2),
+ PFA_VALID_PA = (1 << 3),
+
+ /* DIMM valid bits in valid */
+ /* use with DIMM_VALID_BITS/DIMM_SET_VALID for pfa->valid */
+ DIMM_VALID_FBD_CHAN = (1 << 0),
+ DIMM_VALID_DDR_CHAN = (1 << 1),
+ DIMM_VALID_DDR_DIMM = (1 << 2),
+ DIMM_VALID_DDR_RANK = (1 << 3),
+ DIMM_VALID_DIMM_BANK = (1 << 4),
+ DIMM_VALID_DIMM_ROW = (1 << 5),
+ DIMM_VALID_DIMM_COLUMN = (1 << 6),
+ DIMM_VALID_ALL = 0x7f,
+
+ PFA_DIMM_VALID_MASK = DIMM_SET_VALID(DIMM_VALID_ALL, 0)
+ | DIMM_SET_VALID(DIMM_VALID_ALL, 1),
+
+ /* Values for status field */
+ PFA_STATUS_SUCCESS = 0,
+ PFA_STATUS_SOCKET_INVALID = (1 << 1),
+ PFA_STATUS_MBOX_INVALID = (1 << 2),
+ PFA_STATUS_MA_INVALID = (1 << 3),
+ PFA_STATUS_PA_INVALID = (1 << 4),
+
+ /* Values for command field */
+ PFA_CMD_GET_MEM_CORR_ERR_PA = 0,
+ PFA_CMD_PA_TO_DIMM_ADDR = 1,
+ PFA_CMD_DIMM_TO_PA = 2,
+ PFA_CMD_GET_TOPOLOGY = 3,
+
+ /* PCI device IDs and the base register */
+ ICH_PFA_CFG = 0x8c, /* SCRATCH4 */
+ PCI_DEVICE_ID_BXB_ICH_LEGACY0 = 0x3422,
+};
+
+static struct pfa_table *pfa_table __read_mostly;
+static int memerr_max_conv_rate __read_mostly = 100;
+static int memerr_min_interval __read_mostly = 500;
+static int pfa_lost; /* for diagnosis */
+
+enum {
+ RATE_LIMIT_PERIOD = USEC_PER_SEC, /* in us; period of rate limit */
+};
+
+module_param(memerr_max_conv_rate, int, 0644);
+MODULE_PARM_DESC(memerr_max_conv_rate,
+ "Maximum number of memory error conversions each second; 0 to disable");
+module_param(memerr_min_interval, int, 0644);
+MODULE_PARM_DESC(memerr_min_interval,
+ "Minimum time delta between two memory conversions; in us; default 500");
+
+static int notest;
+static int nocsum;
+module_param(notest, int, 0);
+module_param(nocsum, int, 0);
+
+static u64 encode_dimm(struct pfa_dimm *d, u8 valid)
+{
+ union {
+ struct aux_pfa_dimm d;
+ u64 v;
+ } p;
+
+ BUILD_BUG_ON(sizeof(struct aux_pfa_dimm) != sizeof(u64));
+ p.d.fbd_channel_id = d->fbd_channel_id;
+ p.d.ddr_channel_id = d->ddr_channel_id;
+ p.d.ddr_dimm_id = d->ddr_dimm_id;
+ p.d.ddr_rank_id = d->ddr_rank_id;
+ p.d.ddr_dimm_bank_id = d->ddr_dimm_bank_id;
+ p.d.ddr_dimm_row_id = d->ddr_dimm_row_id;
+ if (p.d.ddr_dimm_row_id != d->ddr_dimm_row_id) /* truncated? */
+ valid &= ~DIMM_VALID_DIMM_ROW;
+ p.d.ddr_dimm_column_id = d->ddr_dimm_column_id;
+ if (p.d.ddr_dimm_column_id != d->ddr_dimm_column_id)
+ valid &= ~DIMM_VALID_DIMM_COLUMN;
+ p.d.valid = valid;
+ pr_debug("PFA fbd_ch %u ddr_ch %u dimm %u rank %u bank %u valid %x\n",
+ d->fbd_channel_id,
+ d->ddr_channel_id,
+ d->ddr_dimm_id,
+ d->ddr_rank_id,
+ d->ddr_dimm_bank_id,
+ valid);
+ return p.v;
+}
+
+static u8 csum(u8 *table, u16 len)
+{
+ u8 sum = 0;
+ int i;
+ for (i = 0; i < len; i++)
+ sum += *table++;
+ return sum;
+}
+
+/*
+ * Execute a command through the mailbox interface.
+ */
+static int
+pfa_command(unsigned bank, unsigned socketid, unsigned command, unsigned valid)
+{
+ pfa_table->bank_id = bank;
+ pfa_table->socket_id = socketid;
+ pfa_table->valid = valid | PFA_VALID_SOCKETID;
+ pfa_table->command = command;
+
+ outb(pfa_table->db_value, pfa_table->db_port);
+
+ mb(); /* Reread fields after they got changed */
+
+ if (pfa_table->status != PFA_STATUS_SUCCESS) {
+ pr_debug("Memory PFA command %d failed: socket:%d bank:%d status:%x\n",
+ command, socketid, bank, pfa_table->status);
+ return -pfa_table->status;
+ }
+ return 0;
+}
+
+/*
+ * Retrieve physical address and DIMMs.
+ */
+static int translate_memory_error(struct mce *m)
+{
+ struct pfa_table *pfa = pfa_table;
+ u64 status;
+ int ret;
+ u32 valid;
+ int cpu = smp_processor_id();
+
+ /* Make sure our structures match the specification */
+ BUILD_BUG_ON(offsetof(struct pfa_table, physical_addr) != 0x20);
+ BUILD_BUG_ON(offsetof(struct pfa_table, status) != 0x10);
+ BUILD_BUG_ON(offsetof(struct pfa_table, physical_addr) != 0x20);
+ BUILD_BUG_ON(offsetof(struct pfa_table, dimm[1].ddr_dimm_column_id) !=
+ 0x3e);
+
+ /* Ask for PA/DIMMs of last error */
+ if (pfa_command(m->bank, m->socketid,
+ PFA_CMD_GET_MEM_CORR_ERR_PA, PFA_VALID_BANKID) < 0)
+ return -1;
+
+ /*
+ * Recheck machine check bank. If the overflow bit was set
+ * there was a race. Don't use the information in this case.
+ */
+ rdmsrl(MSR_IA32_MCx_STATUS(m->bank), status);
+ if (status & MCI_STATUS_OVER) {
+ pr_debug("%d: overflow race on bank %d\n", cpu, m->bank);
+ return -1;
+ }
+
+ ret = -1;
+ valid = pfa->valid;
+ if (valid & PFA_VALID_PA) {
+ m->status |= MCI_STATUS_ADDRV;
+ m->addr = pfa_table->physical_addr;
+ pr_debug("%d: got physical address %llx valid %x\n",
+ cpu, m->addr, valid);
+ ret = 0;
+ }
+
+ /* When DIMM information was supplied pass it out */
+ if (valid & PFA_DIMM_VALID_MASK) {
+ m->aux0 = encode_dimm(&pfa->dimm[0], DIMM_VALID_BITS(valid, 0));
+ m->aux1 = encode_dimm(&pfa->dimm[1], DIMM_VALID_BITS(valid, 1));
+ ret = 0;
+ }
+
+ return ret;
+}
+
+/*
+ * Xeon 75xx specific mce poll method to retrieve the physical address
+ * and DIMM information.
+ */
+static void xeon75xx_mce_poll(struct mce *m)
+{
+ static DEFINE_SPINLOCK(convert_lock); /* Protect table and static */
+ static unsigned long cperm;
+ static ktime_t last, last_int;
+ unsigned long flags;
+ ktime_t now;
+ s64 delta;
+
+ /* Memory error? */
+ if (m->bank != MCE_BANK_MBOX0 && m->bank != MCE_BANK_MBOX1)
+ return;
+ if (m->status & MCI_STATUS_OVER)
+ return;
+ if (memerr_max_conv_rate == 0)
+ return;
+
+ spin_lock_irqsave(&convert_lock, flags);
+ /*
+ * Rate limit conversions. The conversion takes some time,
+ * but it's not good to use all the CPU time during a error
+ * flood.
+ * Enforce maximum number per second and minimum interval.
+ * The ktime call should use TSC on this machine and be fast.
+ */
+ now = ktime_get();
+ delta = ktime_us_delta(now, last);
+ if (delta >= RATE_LIMIT_PERIOD) {
+ cperm = 0;
+ last = now;
+ }
+ if (ktime_us_delta(now, last_int) >= memerr_min_interval &&
+ ++cperm <= memerr_max_conv_rate) {
+ if (translate_memory_error(m) < 0) {
+ /* On error stop converting for the next second */
+ cperm = memerr_max_conv_rate;
+ pr_debug("PFA translation failed\n");
+ }
+ } else
+ pfa_lost++;
+ last_int = now;
+ spin_unlock_irqrestore(&convert_lock, flags);
+}
+
+static struct pci_device_id bxb_mce_pciids[] = {
+ { PCI_VDEVICE(INTEL, PCI_DEVICE_ID_BXB_ICH_LEGACY0) },
+ {}
+};
+
+static int __init xeon75xx_mce_init(void)
+{
+ u32 addr = 0;
+ struct pci_dev *dev;
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+ boot_cpu_data.x86 != 6 ||
+ boot_cpu_data.x86_model != 0x2e)
+ return -ENODEV;
+
+ /*
+ * Get table address from register in IOH.
+ * This just looks up the device, because we don't want to "own" it.
+ */
+ dev = NULL;
+ while ((dev = pci_get_device(PCI_VENDOR_ID_INTEL, PCI_ANY_ID, dev))
+ != NULL) {
+ if (!pci_match_id(bxb_mce_pciids, dev))
+ continue;
+ pci_read_config_dword(dev, ICH_PFA_CFG, &addr);
+ if (addr)
+ break;
+ }
+ pci_dev_put(dev);
+ if (!addr)
+ return -ENODEV;
+
+ if (!e820_all_mapped(addr, addr + PAGE_SIZE, E820_RESERVED)) {
+ pr_info("PFA table at %x not e820 reserved\n", addr);
+ return -ENODEV;
+ }
+
+ pfa_table = (__force struct pfa_table *)ioremap_cache(addr, PAGE_SIZE);
+ if (!pfa_table) {
+ pr_err("Cannot map PFA table at %x\n", addr);
+ return -EIO;
+ }
+
+ if (memcmp(&pfa_table->sig, PFA_SIG, PFA_SIG_LEN) ||
+ pfa_table->len < sizeof(struct pfa_table) ||
+ /* assume newer versions are compatible */
+ pfa_table->revision < PFA_REVISION) {
+ pr_info("PFA table at %x invalid\n", addr);
+ goto error_unmap;
+ }
+
+ if (!nocsum && csum((u8 *)pfa_table,
+ offsetof(struct pfa_table, command))) {
+ pr_info("PFA table at %x length %u has invalid checksum\n",
+ addr, pfa_table->len);
+ goto error_unmap;
+ }
+
+ /* Not strictly needed today */
+ if (pfa_table->len > PAGE_SIZE) {
+ unsigned len = roundup(pfa_table->len, PAGE_SIZE);
+ iounmap(pfa_table);
+ pfa_table = (__force void *)ioremap_cache(addr, len);
+ if (!pfa_table) {
+ pr_err("Cannot remap %u bytes PFA table at %x\n",
+ len, addr);
+ return -EIO;
+ }
+ }
+
+ if (!notest) {
+ int status = pfa_command(0, 0, PFA_CMD_GET_TOPOLOGY, 0);
+ if (status < 0) {
+ pr_err("Test of PFA table failed: %x\n", -status);
+ goto error_unmap;
+ }
+ }
+
+ pr_info("Found Xeon75xx PFA memory error translation table at %x\n",
+ addr);
+ mb();
+ cpu_specific_poll = xeon75xx_mce_poll;
+ return 0;
+
+error_unmap:
+ iounmap(pfa_table);
+ return -ENODEV;
+}
+
+MODULE_DEVICE_TABLE(pci, bxb_mce_pciids);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Andi Kleen");
+MODULE_DESCRIPTION("Intel Xeon 75xx specific DIMM error reporting");
+
+#ifdef CONFIG_MODULE
+static void __exit xeon75xx_mce_exit(void)
+{
+ cpu_specific_poll = NULL;
+ wmb();
+ /* Wait for all machine checks to finish before really unloading */
+ synchronize_rcu();
+ iounmap(pfa_table);
+}
+
+module_init(xeon75xx_mce_init);
+module_exit(xeon75xx_mce_exit);
+#else
+/* When built-in run as soon as the PCI subsystem is up */
+fs_initcall(xeon75xx_mce_init);
+#endif
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a8aacd4..1c1e4aa 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -88,6 +88,9 @@ static DECLARE_WAIT_QUEUE_HEAD(mce_wait);
static DEFINE_PER_CPU(struct mce, mces_seen);
static int cpu_missing;

+void (*cpu_specific_poll)(struct mce *);
+EXPORT_SYMBOL_GPL(cpu_specific_poll);
+
/*
* CPU/chipset specific EDAC code can register a notifier call here to print
* MCE errors in a human-readable form.
@@ -364,6 +367,11 @@ static void mce_wrmsrl(u32 msr, u64 v)
wrmsrl(msr, v);
}

+static int under_injection(void)
+{
+ return __get_cpu_var(injectm).finished;
+}
+
/*
* Simple lockless ring to communicate PFNs from the exception handler with the
* process context work function. This is vastly simplified because there's
@@ -567,6 +575,10 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)

if (!(flags & MCP_TIMESTAMP))
m.tsc = 0;
+
+ if (cpu_specific_poll && !under_injection() && !mce_dont_log_ce)
+ cpu_specific_poll(&m);
+
/*
* Don't get the IP here because it's unlikely to
* have anything to do with the actual error location.
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index a1a7876..65ebdb6 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -79,7 +79,7 @@ EXPORT_SYMBOL_GPL(e820_any_mapped);
* Note: this function only works correct if the e820 table is sorted and
* not-overlapping, which is the case
*/
-int __init e820_all_mapped(u64 start, u64 end, unsigned type)
+int e820_all_mapped(u64 start, u64 end, unsigned type)
{
int i;

@@ -106,6 +106,7 @@ int __init e820_all_mapped(u64 start, u64 end, unsigned type)
}
return 0;
}
+EXPORT_SYMBOL_GPL(e820_all_mapped);

/*
* Add a memory region to the kernel e820 map.

2010-01-22 10:52:30

by H. Peter Anvin

Subject: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

Commit-ID: f91c4d2649531cc36e10c6bc0f92d0f99116b209
Gitweb: http://git.kernel.org/tip/f91c4d2649531cc36e10c6bc0f92d0f99116b209
Author: H. Peter Anvin <[email protected]>
AuthorDate: Thu, 21 Jan 2010 18:31:54 -0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Thu, 21 Jan 2010 18:31:54 -0800

x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

cpu_specific_poll is a global variable, and it should have a global
namespace name. Since it is MCE-specific (it takes a struct mce *),
rename it mce_cpu_specific_poll.

Signed-off-by: H. Peter Anvin <[email protected]>
Cc: Andi Kleen <[email protected]>
LKML-Reference: <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce-internal.h | 2 +-
arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c | 4 ++--
arch/x86/kernel/cpu/mcheck/mce.c | 8 ++++----
3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index d5b7eec..8e7c2f4 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -28,4 +28,4 @@ extern int mce_ser;

extern struct mce_bank *mce_banks;

-extern void (*cpu_specific_poll)(struct mce *);
+extern void (*mce_cpu_specific_poll)(struct mce *);
diff --git a/arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c b/arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c
index 67ad39b..e09b736 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c
@@ -396,7 +396,7 @@ static int __init xeon75xx_mce_init(void)
pr_info("Found Xeon75xx PFA memory error translation table at %x\n",
addr);
mb();
- cpu_specific_poll = xeon75xx_mce_poll;
+ mce_cpu_specific_poll = xeon75xx_mce_poll;
return 0;

error_unmap:
@@ -412,7 +412,7 @@ MODULE_DESCRIPTION("Intel Xeon 75xx specific DIMM error reporting");
#ifdef CONFIG_MODULE
static void __exit xeon75xx_mce_exit(void)
{
- cpu_specific_poll = NULL;
+ mce_cpu_specific_poll = NULL;
wmb();
/* Wait for all machine checks to finish before really unloading */
synchronize_rcu();
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 1c1e4aa..a277d34 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -88,8 +88,8 @@ static DECLARE_WAIT_QUEUE_HEAD(mce_wait);
static DEFINE_PER_CPU(struct mce, mces_seen);
static int cpu_missing;

-void (*cpu_specific_poll)(struct mce *);
-EXPORT_SYMBOL_GPL(cpu_specific_poll);
+void (*mce_cpu_specific_poll)(struct mce *);
+EXPORT_SYMBOL_GPL(mce_cpu_specific_poll);

/*
* CPU/chipset specific EDAC code can register a notifier call here to print
@@ -576,8 +576,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
if (!(flags & MCP_TIMESTAMP))
m.tsc = 0;

- if (cpu_specific_poll && !under_injection() && !mce_dont_log_ce)
- cpu_specific_poll(&m);
+ if (mce_cpu_specific_poll && !under_injection() && !mce_dont_log_ce)
+ mce_cpu_specific_poll(&m);

/*
* Don't get the IP here because it's unlikely to

2010-01-23 05:17:39

by Ingo Molnar

Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll


* tip-bot for H. Peter Anvin <[email protected]> wrote:

> Commit-ID: f91c4d2649531cc36e10c6bc0f92d0f99116b209
> Gitweb: http://git.kernel.org/tip/f91c4d2649531cc36e10c6bc0f92d0f99116b209
> Author: H. Peter Anvin <[email protected]>
> AuthorDate: Thu, 21 Jan 2010 18:31:54 -0800
> Committer: H. Peter Anvin <[email protected]>
> CommitDate: Thu, 21 Jan 2010 18:31:54 -0800
>
> x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll
>
> cpu_specific_poll is a global variable, and it should have a global
> namespace name. Since it is MCE-specific (it takes a struct mce *),
> rename it mce_cpu_specific_poll.
>
> Signed-off-by: H. Peter Anvin <[email protected]>
> Cc: Andi Kleen <[email protected]>
> LKML-Reference: <[email protected]>

FYI, this commit triggered a -tip test failure:

arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c: In function 'xeon75xx_mce_init':
arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c:340: error: implicit declaration of function 'pci_match_id'

I'm excluding it from tip:master.

But the bigger problem with this commit is structure of it - or the lack
thereof.

It just blindly goes into the direction the MCE code has been going for some
time, minimally enabling the hardware, ignoring both the new EDAC design and
the new performance monitoring related design i outlined some time ago.

Newly introduced MCE code in the x86 tree should be integrated much better
into existing higher level facilities, not just limp along ignoring life
outside of its little box. This will have to be addressed before we can
reintroduce this commit, even if this trivial build failure is addressed.

I've Cc:-ed a few more interested parties.

Thanks,

Ingo

2010-01-23 08:06:07

by Borislav Petkov

Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

(Adding some more interested parties to Cc:)

On Sat, Jan 23, 2010 at 06:17:17AM +0100, Ingo Molnar wrote:
>
> * tip-bot for H. Peter Anvin <[email protected]> wrote:
>
> > Commit-ID: f91c4d2649531cc36e10c6bc0f92d0f99116b209
> > Gitweb: http://git.kernel.org/tip/f91c4d2649531cc36e10c6bc0f92d0f99116b209
> > Author: H. Peter Anvin <[email protected]>
> > AuthorDate: Thu, 21 Jan 2010 18:31:54 -0800
> > Committer: H. Peter Anvin <[email protected]>
> > CommitDate: Thu, 21 Jan 2010 18:31:54 -0800
> >
> > x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll
> >
> > cpu_specific_poll is a global variable, and it should have a global
> > namespace name. Since it is MCE-specific (it takes a struct mce *),
> > rename it mce_cpu_specific_poll.
> >
> > Signed-off-by: H. Peter Anvin <[email protected]>
> > Cc: Andi Kleen <[email protected]>
> > LKML-Reference: <[email protected]>
>
> FYI, this commit triggered a -tip test failure:
>
> arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c: In function 'xeon75xx_mce_init':
> arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c:340: error: implicit declaration of function 'pci_match_id'
>
> I'm excluding it from tip:master.
>
> But the bigger problem with this commit is structure of it - or the lack
> thereof.
>
> It just blindly goes into the direction the MCE code has been going for some
> time, minimally enabling the hardware, ignoring both the new EDAC design and
> the new performance monitoring related design i outlined some time ago.

I completely agree - from what I see this is adding vendor- or rather
vendor-and-machine-specific hooks to read out (1) the position of the
memory translation table from PCI config space (0x8c), (2) then to
read out the offset from the first MCA status register in order to (3)
rdmsr the status information.

In AMD's case, we need similar hooks too, in order to evaluate
correctable MCEs for different RAS reasons, for example L3 cache or
data array errors for disabling L3 indices. I was looking into adding
hooks into machine_check_poll(), and the cpu_specific_poll() interface
could work.
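
Something along these lines is what I have in mind - just a rough sketch
against the mce_cpu_specific_poll hook from Andi's patch (made-up bank
number, the actual decoding left out):

#include <linux/module.h>
#include <linux/rcupdate.h>
#include <asm/mce.h>

#include "mce-internal.h"       /* for the mce_cpu_specific_poll hook */

static void amd_l3_mce_poll(struct mce *m)
{
        if (m->bank != 4)       /* made-up bank number */
                return;
        /* ... evaluate m->status/m->addr, e.g. for L3 index disabling ... */
}

static int __init amd_l3_poll_init(void)
{
        /* a real module would check vendor/family here first */
        mce_cpu_specific_poll = amd_l3_mce_poll;
        return 0;
}

static void __exit amd_l3_poll_exit(void)
{
        mce_cpu_specific_poll = NULL;
        /* wait for polls in flight, like the Xeon75xx driver's exit path */
        synchronize_rcu();
}

module_init(amd_l3_poll_init);
module_exit(amd_l3_poll_exit);
MODULE_LICENSE("GPL");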

Furthermore, lets leave mcheck be mcheck and do error decoding in EDAC
modules. For example, there was a core i7 EDAC module submission from
Mauro and the Xeon75xx-specific decoding bits could be added to it or
even as a new machine-specific module instead of mcelog.

With the ever-growing complexity of memory controller design I don't
think that the userspace mcelog approach will scale - you need the
whole decoding in the kernel, where the module knows the exact memory
controller setup and which DRAM addresses belong to which nodes and
whether you do memory hoisting and whether you interleave, if yes, how
and on what granularity you interleave and on and on...

I believe Ingo also had some ideas about perf_event integration and this
is something we could add to the MCE polling routine too. Ingo?

--
Regards/Gruss,
Boris.

2010-01-23 09:00:38

by Ingo Molnar

Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll


* Borislav Petkov <[email protected]> wrote:

> (Adding some more interested parties to Cc:)
>
> On Sat, Jan 23, 2010 at 06:17:17AM +0100, Ingo Molnar wrote:
> >
> > * tip-bot for H. Peter Anvin <[email protected]> wrote:
> >
> > > Commit-ID: f91c4d2649531cc36e10c6bc0f92d0f99116b209
> > > Gitweb: http://git.kernel.org/tip/f91c4d2649531cc36e10c6bc0f92d0f99116b209
> > > Author: H. Peter Anvin <[email protected]>
> > > AuthorDate: Thu, 21 Jan 2010 18:31:54 -0800
> > > Committer: H. Peter Anvin <[email protected]>
> > > CommitDate: Thu, 21 Jan 2010 18:31:54 -0800
> > >
> > > x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll
> > >
> > > cpu_specific_poll is a global variable, and it should have a global
> > > namespace name. Since it is MCE-specific (it takes a struct mce *),
> > > rename it mce_cpu_specific_poll.
> > >
> > > Signed-off-by: H. Peter Anvin <[email protected]>
> > > Cc: Andi Kleen <[email protected]>
> > > LKML-Reference: <[email protected]>
> >
> > FYI, this commit triggered a -tip test failure:
> >
> > arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c: In function 'xeon75xx_mce_init':
> > arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c:340: error: implicit declaration of function 'pci_match_id'
> >
> > I'm excluding it from tip:master.
> >
> > But the bigger problem with this commit is structure of it - or the lack
> > thereof.
> >
> > It just blindly goes into the direction the MCE code has been going for some
> > time, minimally enabling the hardware, ignoring both the new EDAC design and
> > the new performance monitoring related design i outlined some time ago.
>
> I completely agree - from what I see this is adding vendor- or rather
> vendor-and-machine-specific hooks to read out (1) the position of the
> memory translation table from PCI config space (0x8c), (2) then to read out
> the offset from the first MCA status register in order to (3) rdmsr the
> status information.
>
> In AMD's case, we need similar hooks too, in order to evaluate correctable
> MCEs for different RAS reasons, for example L3 cache or data array
> errors for disabling L3 indices. I was looking into adding hooks into
> machine_check_poll(), and the cpu_specific_poll() interface could work.
>
> Furthermore, lets leave mcheck be mcheck and do error decoding in EDAC
> modules. For example, there was a core i7 EDAC module submission from Mauro
> and the Xeon75xx-specific decoding bits could be added to it or even as a
> new machine-specific module instead of mcelog.
>
> With the ever-growing complexity of memory controller design I don't think
> that the userspace mcelog approach will scale - you need the whole decoding
> in the kernel, where the module knows the exact memory controller setup and
> which DRAM addresses belong to which nodes and whether you do memory
> hoisting and whether you interleave, if yes, how and on what granularity you
> interleave and on and on...

Yep. Could you give a few pointers to Andi where exactly you'd like to see the
Intel Xeon functionality added to the EDAC code? There is some Intel
functionality there already, but the current upstream code does not look very
up to date. I've looked at e752x_edac.c. (there's some Core i7 work pending,
right?) In any case there's a lot of fixing to be done to the Intel code
there.

Memory controller functionality integrated into the kernel proper is a good
idea partly for similar reasons an on-die memory controller is a good idea in
the hardware space.

> I believe Ingo also had some ideas about perf_event integration and this is
> something we could add to the MCE polling routine too. Ingo?

Yes, my initial thoughts on that are in the lkml mail below from a few months
ago. We basically want to enumerate the hardware and its events intelligently
- and integrate that nicely with other sources of events. That will give us a
boatload of new performance monitoring and analysis features that we could not
have dreamt of before.

Certain events can be 'richer' and 'more special' than others (they can cause
things like signals - on correctable memory faults), but so far there's little
that deviates from the view that these are all system events, and that we want
a good in-kernel enumeration and handling of them. Exposing it on the low
level a'la mcelog is a fundamentally bad idea as it pushes hardware complexity
into user-space (handling hardware functionality and building good
abstractions on it is the task of the kernel - every time we push that to
user-space the kernel becomes a little bit poorer).

Note that this very much plugs into the whole problem space of how to
enumerate CPU cache hierarchies - something that i think Andreas is keenly
interested in. We want one unified enumeration of hardware [and software]
components and one enumeration of the events that originate from there.

Right now we are mostly focused on software component enumeration via
/debug/tracing/events, but that does not (and should not) remain so. It's not
a small task to implement all aspects of that, but it can be done gradually
and it will be very rewarding all along the way in my opinion.

[ Furthermore, if there's interest i wouldn't mind a 'perf mce' (or more
generally a 'perf edac') subcommand to perf either, which would specifically
be centered on all things EDAC/MCE policy. (but of course other tooling
can make use of it too - it doesn't 'have' to be within tools/perf/ per se -
it's just a convenient and friendly place for kernel developers and makes it
easy to backtest any new kernel code in this area.)

We already have subsystem specific perf subcommands: perf kmem, perf lock,
perf sched - this kind of spread-out, subsystem specific support is one
of the strong sides of perf. ]

Andi's patch goes in the exact opposite direction of all that, which is a step
backwards - and i'm not accepting that. Please at minimum come up with a good
drivers/edac/ solution. (and better yet, make it all plug into perf event
logging/monitoring and help bring hardware diagnostics and monitoring to all
Linux users.)

Thanks,

Ingo

----- Forwarded message from Ingo Molnar <[email protected]> -----

Date: Tue, 22 Sep 2009 15:39:49 +0200
From: Ingo Molnar <[email protected]>
To: Huang Ying <[email protected]>
Cc: Borislav Petkov <[email protected]>,
Frédéric Weisbecker <[email protected]>,
Li Zefan <[email protected]>,
Steven Rostedt <[email protected]>,
"H. Peter Anvin" <[email protected]>, Andi Kleen <[email protected]>,
Hidetoshi Seto <[email protected]>,
"[email protected]" <[email protected]>,
Arjan van de Ven <[email protected]>,
Thomas Gleixner <[email protected]>
Subject: [PATCH] x86: mce: New MCE logging design


* Huang Ying <[email protected]> wrote:

> Hi, Ingo,
>
> I think it is an interesting idea to replace the original MCE logging
> part implementation (MCE log ring buffer + /dev/mcelog misc device)
> with the tracing infrastructure. The tracing infrastructure has definitely
> more features than the original method, but it is more complex than
> the original method too.

TRACE_EVENT() can be used to define the tracepoint itself - but most of
the benefit comes from using the performance events subsystem for event
reporting. It's a perfect match for MCE events - see below.
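
For reference, a TRACE_EVENT definition for one of these events looks roughly
like this (a simplified sketch - the exact field type in the real header may
differ):

#include <linux/tracepoint.h>

TRACE_EVENT(mce_thermal_throttle_hot,

        TP_PROTO(unsigned long throttle_count),

        TP_ARGS(throttle_count),

        TP_STRUCT__entry(
                __field(unsigned long, throttle_count)
        ),

        TP_fast_assign(
                __entry->throttle_count = throttle_count;
        ),

        TP_printk("throttle_count=%lu", __entry->throttle_count)
);

Plus the usual TRACE_SYSTEM define and define_trace.h boilerplate around it
in the header.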

> The tracing infrastructure is good at tracing. But I think it may be
> too complex to be used as a (hardware) error processing mechanism. I
> think the error processing should be simple and relatively independent
> of other parts of the kernel to avoid triggering more (hardware)
> errors during error processing. What do you think about this?

The fact that event reporting is used for other things as well might
even make it more fault-resilient than open-coded MCE code: just due to
it possibly being hot in the cache already (and thus not needing extra
DRAM cycles to a possibly faulty piece of hardware) while the MCE code,
not used by anything else, has to be brought into the cache for error
processing - triggering more faults.

But ... that argument is even largely irrelevant: the overriding core
kernel principle was always that of reuse of facilities. We want to keep
performance event reporting simple and robust too, so re-using
facilities is a good thing and the needs of MCE logging will improve
perf events - and vice versa.

Furthermore, to me as x86 maintainer it matters a lot that a piece of
code i maintain is clean, well-designed and maintainable in the long
run. The current /dev/mcelog code is not clean and not maintainable,
and that needs to be fixed before it can be extended.

That MCE logging also becomes more robust and more feature-rich as a
result is a happy side-effect.

> Even if we reach consensus at some point to rewrite the MCE logging
> part totally, we still need to fix the bugs in the original implementation
> before the new implementation is ready. Do you agree?

Well, this is all 2.6.33 stuff and that's plenty of time to do the
generic events + perf events approach.

Anyway, i think we can work on these MCE enhancements much better if we
move it to the level of actual patches. Let me give you a (very small)
demonstration of my requests.

The patch attached below adds two new events to the MCE code, to the
thermal throttling code:

mce_thermal_throttle_hot
mce_thermal_throttle_cold

If you apply the patch below to latest -tip and boot that kernel and do:

cd tools/perf/
make -j install

And type 'perf list', those two new MCE events will show up in the list
of available events:

# perf list
[...]
mce:mce_thermal_throttle_hot [Tracepoint event]
mce:mce_thermal_throttle_cold [Tracepoint event]
[...]

Initially you can check whether those events are occurring:

# perf stat -a -e mce:mce_thermal_throttle_hot sleep 1

Performance counter stats for 'sleep 1':

0 mce:mce_thermal_throttle_hot # 0.000 M/sec

1.000672214 seconds time elapsed

None on a normal system. Now, i tried it on a system that gets frequent
thermal events, and there it gives:

# perf stat -a -e mce:mce_thermal_throttle_hot sleep 1

Performance counter stats for 'sleep 1':

73 mce:mce_thermal_throttle_hot # 0.000 M/sec

1.000765260 seconds time elapsed

Now i can observe the frequency of those thermal throttling events:

# perf stat --repeat 10 -a -e mce:mce_thermal_throttle_hot sleep 1

Performance counter stats for 'sleep 1' (10 runs):

184 mce:mce_thermal_throttle_hot # 0.000 M/sec ( +- 1.971% )

1.006488821 seconds time elapsed ( +- 0.203% )

184/sec is pretty hefty, with a relatively stable rate.

Now i try to see exactly which code is affected by those thermal
throttling events. For that i use 'perf record', and profile the whole
system for 10 seconds:

# perf record -f -e mce:mce_thermal_throttle_hot -c 1 -a sleep 10
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.802 MB perf.data (~35026 samples) ]

And i used 'perf report' to analyze where the MCE events occurred:

# Samples: 2495
#
# Overhead Command Shared Object Symbol
# ........ .............. ........................ ......
#
55.39% hackbench [vdso] [.] 0x000000f77e1430
35.27% hackbench f77c2430 [.] 0x000000f77c2430
3.05% hackbench [kernel] [k] ia32_setup_sigcontext
0.88% bash [vdso] [.] 0x000000f776f430
0.64% bash f7731430 [.] 0x000000f7731430
0.52% hackbench ./hackbench [.] sender
0.48% nice f7753430 [.] 0x000000f7753430
0.36% hackbench ./hackbench [.] receiver
0.24% perf /lib64/libc-2.5.so [.] __GI_read
0.24% hackbench /lib/libc-2.5.so [.] __libc_read

Most of the events concentrate around the syscall entry
points. That's not because the system is doing that most of the time -
it's probably because the fast syscall entry point is a critical path of
the CPU and overheating events likely concentrate around
that place.

Now let's see how we can do very detailed MCE logs, with per-CPU data,
timestamps and other events mixed in. For that i use perf record again:

# perf record -R -f -e mce:mce_thermal_throttle_hot:r -e mce:mce_thermal_throttle_cold:r -e irq:irq_handler_entry:r -M -c 1 -a sleep 10
[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 1.167 MB perf.data (~50990 samples) ]

And i use 'perf trace' to look at the result:

hackbench-5070 [000] 1542.044184301: mce_thermal_throttle_hot: throttle_count=150852
hackbench-5070 [000] 1542.044672817: mce_thermal_throttle_cold: throttle_count=150852
hackbench-5056 [000] 1542.047029212: mce_thermal_throttle_hot: throttle_count=150853
hackbench-5057 [000] 1542.047518733: mce_thermal_throttle_cold: throttle_count=150853
hackbench-5058 [000] 1542.049822672: mce_thermal_throttle_hot: throttle_count=150854
hackbench-5046 [000] 1542.050312612: mce_thermal_throttle_cold: throttle_count=150854
hackbench-5072 [000] 1542.052720357: mce_thermal_throttle_hot: throttle_count=150855
hackbench-5072 [000] 1542.053210612: mce_thermal_throttle_cold: throttle_count=150855
:5074-5074 [000] 1542.055689468: mce_thermal_throttle_hot: throttle_count=150856
hackbench-5011 [000] 1542.056179849: mce_thermal_throttle_cold: throttle_count=150856
:5061-5061 [000] 1542.056937152: mce_thermal_throttle_hot: throttle_count=150857
hackbench-5048 [000] 1542.057428970: mce_thermal_throttle_cold: throttle_count=150857
hackbench-5053 [000] 1542.060372255: mce_thermal_throttle_hot: throttle_count=150858
hackbench-5054 [000] 1542.060863006: mce_thermal_throttle_cold: throttle_count=150858

The first thing we can see from the trace is that only the first core
reports thermal events (the box is a dual-core box). The other thing is
the timing: the CPU stays in the 'hot' (throttled) state for about ~490
microseconds, then in the 'cold' state for roughly 2.5 milliseconds.

So, going by this trace, the CPU spends roughly 15-20% of its time in the
hot-throttled state, cooling down enough to switch back to cold mode again.

Also, most of the overheating events hit the 'hackbench' app (which is
no surprise on this box, as it is only running that workload - but it might
be interesting on other systems with more mixed workloads).

So even in its completely MCE-unaware form, and with this small patch
that took 5 minutes to write, the performance events subsystem and
tooling can already give far greater insight into characteristics of
these events than the MCE subsystem did before.

Obviously the reported events can be used for different, MCE-specific
things as well (a minimal consumer sketch follows this list):

- Report an alert on the desktop (without having to scan the syslog all
the time)

- Mail the sysadmin

- Do some specific policy action such as soft-throttle the workload
until the situation and the log data can be examined.
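
For illustration, here is a minimal sketch of such a consumer: it watches
the new tracepoint through the ftrace debugfs interface and raises an
alert once a made-up threshold is crossed. The debugfs path and the
threshold are assumptions, and a real tool would more likely use the
perf_event_open() interface or perf scripting instead:

/* alert-on-throttle.c: minimal sketch, not a production tool */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TRACE_DIR "/sys/kernel/debug/tracing"
#define ALERT_EVERY 100			/* arbitrary example threshold */

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	char line[4096];
	unsigned long hot = 0;
	FILE *pipe;

	/* enable only the tracepoint we care about */
	write_file(TRACE_DIR "/events/mce/mce_thermal_throttle_hot/enable", "1");

	pipe = fopen(TRACE_DIR "/trace_pipe", "r");
	if (!pipe) {
		perror("trace_pipe");
		return 1;
	}
	/* trace_pipe blocks until new events arrive */
	while (fgets(line, sizeof(line), pipe)) {
		if (!strstr(line, "mce_thermal_throttle_hot:"))
			continue;
		if (++hot % ALERT_EVERY == 0)
			fprintf(stderr, "ALERT: %lu thermal throttle events seen\n", hot);
		/* here one could notify the desktop, mail the admin, throttle the job, ... */
	}
	return 0;
}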

As I outlined in my previous mail, the perf events subsystem provides
a wide range of facilities that can be used for that purpose.

If there are aspects of MCE reporting use cases that necessitate the
extension of perf events, then we can do that - both sides will benefit
from it. What we don't want to do is to prolong the pain caused by the
inferior, buggy and poorly designed /dev/mcelog ABI and code. It's not
serving our users well, and people are spending time on fixing something
that has been misdesigned from the get-go instead of concentrating on
improving the right facilities.

So this is really how I see the future of the MCE logging subsystem: use
meaningful generic events that empower users, admins and tools to do
something intelligent with the available stream of MCE data. The
hardware tells us a _lot_ of information - let's use and expose that in a
structured form.

Ingo

---
arch/x86/kernel/cpu/mcheck/therm_throt.c | 9 +++++
include/trace/events/mce.h | 47 +++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+), 1 deletion(-)

Index: linux/arch/x86/kernel/cpu/mcheck/therm_throt.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ linux/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -31,6 +31,9 @@
#include <asm/mce.h>
#include <asm/msr.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/mce.h>
+
/* How long to wait between reporting thermal events */
#define CHECK_INTERVAL (300 * HZ)

@@ -118,8 +121,12 @@ static int therm_throt_process(bool is_t
was_throttled = state->is_throttled;
state->is_throttled = is_throttled;

- if (is_throttled)
+ if (is_throttled) {
state->throttle_count++;
+ trace_mce_thermal_throttle_hot(state->throttle_count);
+ } else {
+ trace_mce_thermal_throttle_cold(state->throttle_count);
+ }

if (time_before64(now, state->next_check) &&
state->throttle_count != state->last_throttle_count)
Index: linux/include/trace/events/mce.h
===================================================================
--- /dev/null
+++ linux/include/trace/events/mce.h
@@ -0,0 +1,47 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mce
+
+#if !defined(_TRACE_MCE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MCE_H
+
+#include <linux/ktime.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(mce_thermal_throttle_hot,
+
+ TP_PROTO(unsigned int throttle_count),
+
+ TP_ARGS(throttle_count),
+
+ TP_STRUCT__entry(
+ __field( u64, throttle_count )
+ ),
+
+ TP_fast_assign(
+ __entry->throttle_count = throttle_count;
+ ),
+
+ TP_printk("throttle_count=%Lu", __entry->throttle_count)
+);
+
+TRACE_EVENT(mce_thermal_throttle_cold,
+
+ TP_PROTO(unsigned int throttle_count),
+
+ TP_ARGS(throttle_count),
+
+ TP_STRUCT__entry(
+ __field( u64, throttle_count )
+ ),
+
+ TP_fast_assign(
+ __entry->throttle_count = throttle_count;
+ ),
+
+ TP_printk("throttle_count=%Lu", __entry->throttle_count)
+);
+
+#endif /* _TRACE_MCE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

----- End forwarded message -----

2010-01-23 11:34:09

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

> FYI, this commit triggered a -tip test failure:
>
> arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c: In function 'xeon75xx_mce_init':
> arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c:340: error: implicit declaration of function 'pci_match_id'

Does this patch fix it? (I assume your config is missing CONFIG_PCI)

-Andi

---

commit 748e9c041da32f173f20289c71ef0462b8e97234
Author: Andi Kleen <[email protected]>
Date: Sat Jan 23 12:34:10 2010 +0100

Make xeon75xx memory driver dependent on PCI

Found by Ingo Molnar's automated tester

Signed-off-by: Andi Kleen <[email protected]>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f62db24..ab2d0a8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -834,7 +834,7 @@ config X86_MCE_INTEL

config X86_MCE_XEON75XX
tristate "Intel Xeon 7500 series corrected memory error driver"
- depends on X86_MCE_INTEL
+ depends on X86_MCE_INTEL && PCI
---help---
Add support for a Intel Xeon 7500 series specific memory error driver.
This allows to report the DIMM and physical address on a corrected

2010-01-24 10:08:27

by Borislav Petkov

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

On Sat, Jan 23, 2010 at 10:00:03AM +0100, Ingo Molnar wrote:

[..]

> Yep. Could you give a few pointers to Andi where exactly you'd like to see the
> Intel Xeon functionality added to the EDAC code? There is some Intel
> functionality there already, but the current upstream code does not look very
> uptodate. I've looked at e752x_edac.c. (there's some Corei7 work pending,
> right?) In any case there's a lot of fixing to be done to the Intel code
> there.

Basically you've named them all - I'd go for a new module/c file though
if the Xeon75 stuff is completely new hw and cannot reuse existing EDAC
modules.

> Yes, my initial thoughts on that are in the lkml mail below from a few months
> ago. We basically want to enumerate the hardware and its events intelligently
> - and integrate that nicely with other sources of events. That will give us a
> boatload of new performance monitoring and analysis features that we could not
> have dreamt of before.
>
> Certain events can be 'richer' and 'more special' than others (they can cause
> things like signals - on correctable memory faults), but so far there's little
> that deviates from the view that these are all system events, and that we want
> a good in-kernel enumeration and handling of them. Exposing it on the low
> level a'la mcelog is a fundamentally bad idea as it pushes hardware complexity
> into user-space (handling hardware functionality and building good
> abstractions on it is the task of the kernel - every time we push that to
> user-space the kernel becomes a little bit poorer).
>
> Note that this very much plugs into the whole problem space of how to
> enumerate CPU cache hierarchies - something that i think Andreas is keenly
> interested in.

Oh yes, he's interested in that all right :)

> We want one unified enumeration of hardware [and software] components
> and one enumeration of the events that originate from there.
> Right now we are mostly focused on software component enumeration via
> /debug/tracing/events, but that does not (and should not) remain so. It's not
> a small task to implement all aspects of that, but it can be done gradually
> and it will be very rewarding all along the way in my opinion.

Yes, this is very interesting. How do we represent that in the kernel
space as one contiguous "tree" or "library" or whatever without adding
overhead and opening that info to userspace?

Because this is one thing that has been bugging us for a long time. We
don't have a centralized smart utility with lots of small subcommands,
like perf or git if you like, which can dump the whole hw configuration
of the machine or parts of it - something like cache sizes
and hierarchy, CPU capabilities from CPUID flags, memory controller
configuration, DRAM type and sizes, NUMA info, processor PCI config
space along with decoded register and bit values, ... (where do I
stop)...

Currently, we have a ragged collection of tools, each with its own syntax
and output formatting, like numactl, x86info, /proc/cpuinfo (or eyeballing
dmesg output - which is not even a tool :), and it is very annoying when
you have a bunch of machines and you start pulling those tools in, one
after another, before you can even get to the hw information.

So it would be much, much more useful if we had such a tool that can
give you precise hw information without disrupting the kernel (I
remember several bugs with ide-cd last year where some udev helpers
were querying the drive for capabilities before the drive was ready,
and as a result the drive was getting confused so badly that it wouldn't
load properly). Its subcommands could each cover a subsystem or a hw
component, and you could do something like the following example (values
in {} are actual settings read from the hardware):

<tool> pcicfg -f 18.3 -r 0xe8
F3x0e8 (Northbridge Capabilities Register): {0x02073f99}

...

L3Capable: [25]: {1}
1=Specifies that an L3 cache is present. See
CPUID Fn8000_0006_EDX.

...

LnkRtryCap: [11]: {1}
Link error-retry capable.
HTC_capable: [10]: {1}
This affects F3x64 and F3x68.
SVM_capable: [9]: {1}

MctCap: [8]: {1}
memory controller (on the processor) capable.
DdrMaxRate: [7:5]: {0x4}
Specifies the maximum DRAM data rate that the
processor is designed to support.
Bits DDR limit Bits DDR limit
==== ========= ==== =========
000b No limit 100b 800 MT/s
001b Reserved 101b 667 MT/s
010b 1333 MT/s 110b 533 MT/s
011b 1067 MT/s 111b 400 MT/s

Chipkill_ECC_capable: [4]: {1}

ECC_capable: [3]: {1}

Eight_node_multi_processor_capable: [2]: {0}

Dual_node_multi_processor_capable: [1]: {0}

DctDualCap: [0]: {1}
two-channel DRAM capable (i.e., 128 bit).
0=Single channel (64-bit) only.


And yes, this is very detailed output, but it simply serves the purpose
of showing how detailed we can get.
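
For what it's worth, the register-reading part of such a pcicfg
subcommand is tiny. Here is a minimal sketch, assuming the node-0
northbridge function 3 sits at 0000:00:18.3 (as on typical single-node
AMD boxes) and decoding only a handful of the bits shown above;
everything else - the register database, pretty-printing, multi-node
handling - is where the real work would be:

/* pcicfg-sketch.c: read F3xE8 (NB Capabilities) via sysfs, decode a few bits */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	const char *dev = "/sys/bus/pci/devices/0000:00:18.3/config";
	uint32_t val;
	int fd = open(dev, O_RDONLY);

	if (fd < 0 || pread(fd, &val, sizeof(val), 0xe8) != sizeof(val)) {
		perror(dev);
		return 1;
	}
	close(fd);
	printf("F3x0e8 (Northbridge Capabilities Register): {%#010x}\n", val);
	printf("  L3Capable:            [25]: {%u}\n", (val >> 25) & 1);
	printf("  Chipkill_ECC_capable: [4]:  {%u}\n", (val >> 4) & 1);
	printf("  ECC_capable:          [3]:  {%u}\n", (val >> 3) & 1);
	printf("  DctDualCap:           [0]:  {%u}\n", val & 1);
	return 0;
}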

The same thing can output MSR registers like lsmsr does:

MC4_CTL = 0x000000003fffffff (CECCEn=0x1, UECCEn=0x1, CrcErr0En=0x1, CrcErr1En=0x1, CrcErr2En=0x1, SyncPkt0En=0x1, SyncPkt1En=0x1, SyncPkt2En=0x1, MstrAbrtEn=0x1, TgtAbrtEn=0x1, GartTblWkEn=0x1, AtomicRMWEn=0x1, WDTRptEn=0x1, DevErrEn=0x1, L3ArrayCorEn=0x1, L3ArrayUCEn=0x1, HtProtEn=0x1, HtDataEn=0x1, DramParEn=0x1, RtryHt0En=0x1, RtryHt1En=0x1, RtryHt2En=0x1, RtryHt3En=0x1, CrcErr3En=0x1, SyncPkt3En=0x1, McaUsPwDatErrEn=0x1, NbArrayParEn=0x1, TblWlkDatErrEn=0x1)

but in a more human-readable form, without the need to open the hw
manual for that. And this is pretty low-level. How about nodes and cores
on each node, HT siblings, NUMA proximity, DIMM distribution
across NBs, and which northbridge is connected to the southbridge on
a multinode system, etc.? I know we have parts of that in /sysfs, but it
should be easier to get that info.
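
The MSR side would look much the same through the msr driver - again
only a sketch, with MC4_CTL assumed to be MSR 0x410 and the per-bit
decoding (the actually useful part) left to tables generated from the
BKDG:

/* lsmsr-sketch.c: dump one MSR via /dev/cpu/N/msr (needs the msr module) */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

static int rdmsr(int cpu, uint32_t msr, uint64_t *val)
{
	char path[64];
	int fd, ok;

	snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* the msr driver uses the file offset as the MSR number */
	ok = pread(fd, val, sizeof(*val), msr) == sizeof(*val);
	close(fd);
	return ok ? 0 : -1;
}

int main(void)
{
	uint64_t v;

	if (rdmsr(0, 0x410, &v)) {		/* 0x410 = MC4_CTL */
		perror("rdmsr");
		return 1;
	}
	printf("MC4_CTL = 0x%016llx\n", (unsigned long long)v);
	return 0;
}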

You can have a gazillion examples like those, and the use cases are not
few in number: ask a user for a specific hw configuration when debugging;
let output from this tool drive automatic tuning suggestions, powertop-style,
for 'perf stat' runs where the machine spends too much time in a function
because, for example, the HT link has been configured to a lower speed
for power savings while the app being profiled is generating a
bunch of threads doing parallel computations and causing a lot of
cross-node traffic which slows it down, etc. etc. etc.

> [ Furthermore, if there's interest i wouldnt mind a 'perf mce' (or
> more generally a 'perf edac') subcommand to perf either, which would
> specifically be centered about all things EDAC/MCE policy. (but of
> course other tooling can make use of it too - it doesnt 'have' to be
> within tools/perf/ per se - it's just a convenient and friendly place
> for kernel developers and makes it easy to backtest any new kernel
> code in this area.)
>
> We already have subsystem specific perf subcommands: perf kmem, perf
> lock, perf sched - this kind of spread out and subsystem specific
> support it's one of the strong sides of perf. ]

The example below (which I cut for brevity) is a perfect example of how
it should be done. Let me first, however, go a step back and give you my
opinion of how I think this whole MCEs catching and decoding should be
done before we think of tooling:

1. We need to notify userspace, as you've said earlier, and not scan
the syslog all the time. And EDAC, although it decodes the correctable
ECCs, spews them into the syslog too, causing more parsing (there's edac-utils,
which polls /sysfs, but this is just another tool with the problems
outlined above).

What is more, the notification mechanism we come up with should push
the error as early as possible and be able to send it over the network
to a monitor (think data center with thousands of compute nodes here
where CECCs happen every day at least) - something like a more resilient
netconsole which sends out decoded MCE info to the monitor.

2. Another very good point you had is to go into maintenance mode by
throttling or even suspending all userspace processes and starting a restricted
maintenance shell after an MCE happens. This should be done based on the
severity of the MCE, and the shell should run on a core that _didn't_
observe the MCE.

3. All the hw events like correctable ECCs should be thresholded, so that
errors exceeding a preset threshold (below that is normal operation,
and they get corrected by the ECC codes in the hardware anyway) warn
of a slowly failing DIMM or an L3 subcache index, for the sysop to take
action on if the machine cannot do failover itself. For example,
in the L3 cache case, the machine can initially disable at most 2 subcache
indices and notify the user that it has done so, but the user should be
warned that the hw is slowly failing (a minimal sketch of such
thresholding follows below).
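
To make point 3 a bit more concrete, the thresholding itself is simple -
something like a leaky bucket per component, as sketched below. The
names, the leak interval and the alarm level are completely made up
here; the real policy would live in whatever agent ends up consuming
the decoded errors:

/* threshold-sketch.c: per-component leaky-bucket threshold, illustration only */
#include <stdio.h>
#include <time.h>

struct err_bucket {
	const char *label;	/* e.g. the DIMM silk-screen label */
	unsigned long count;	/* corrected errors currently in the bucket */
	time_t last_leak;
};

#define LEAK_INTERVAL	(24 * 60 * 60)	/* forget one error per day */
#define ALARM_LEVEL	50		/* alarm above this many in the window */

static void account_ce(struct err_bucket *b, time_t now)
{
	/* leak old errors so a constant low background rate never alarms */
	while (b->count && now - b->last_leak >= LEAK_INTERVAL) {
		b->count--;
		b->last_leak += LEAK_INTERVAL;
	}
	if (++b->count > ALARM_LEVEL)
		printf("WARN: %s: %lu corrected errors, component may be failing\n",
		       b->label, b->count);
}

int main(void)
{
	struct err_bucket dimm = { .label = "DIMM_A1", .last_leak = time(NULL) };
	int i;

	/* simulate a burst of corrected errors being reported */
	for (i = 0; i < 60; i++)
		account_ce(&dimm, time(NULL));
	return 0;
}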

The current decoding needs more loving too since now it says something
like the following:

EDAC DEBUG: in drivers/edac/amd64_edac_inj.c, line at 170: section=0x80000002 word_bits=0x10020001
EDAC DEBUG: in drivers/edac/amd64_edac_inj.c, line at 170: section=0x80000002 word_bits=0x10020001
Northbridge Error, node 0, core: -1
K8 ECC error.
EDAC amd64 MC0: CE ERROR_ADDRESS= 0x33574910
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1572: (dram=0) Base=0x0 SystemAddr= 0x33574910 Limit=0x12fffffff
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1583: HoleOffset=0x3000 HoleValid=0x1 IntlvSel=0x0
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1627: (ChannelAddrLong=0x19aba480) >> 8 becomes InputAddr=0x19aba4
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1515: InputAddr=0x19aba4 channelselect=0
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1537: CSROW=0 CSBase=0x0 RAW CSMask=0x783ee0
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1541: Final CSMask=0x7ffeff
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1544: (InputAddr & ~CSMask)=0x100 (CSBase & ~CSMask)=0x0
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1537: CSROW=1 CSBase=0x100 RAW CSMask=0x783ee0
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1541: Final CSMask=0x7ffeff
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1544: (InputAddr & ~CSMask)=0x100 (CSBase & ~CSMask)=0x100
EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1549: MATCH csrow=1
EDAC MC0: CE page 0x33574, offset 0x910, grain 0, syndrome 0xbe01, row 1, channel 0, label "": amd64_edac
EDAC MC0: CE - no information available: amd64_edacError Overflow
EDAC DEBUG: in drivers/edac/amd64_edac_inj.c, line at 170: section=0x80000002 word_bits=0x10020001

and this is only the chip select row, but we need to map that to the
actual DIMM and tell the admin: "the DIMM with label "BLA" on your
motherboard seems to be failing" - without the admin first having to map
all DIMMs to their silk-screen labels through /sysfs.

And yes, it is a lot of work but we can at least start talking about it
and gradually getting it done. What do the others think?

Thanks.

--
Regards/Gruss,
Boris.

2010-01-25 13:19:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

Hi,

> Because this is one thing that has been bugging us for a long time. We
> don't have a centralized smart utility with lots of small subcommands
> like perf or git, if you like, which can dump you the whole or parts

PC configuration is all in dmidecode, CPU/node information in lscpu
these days (part of utils-linux)

The dmidecode information could be perhaps presented nicer, but
I don't think we need any fundamental new tools.

> 1. We need to notify userspace, as you've said earlier, and not scan
> the syslog all the time. And EDAC, although decoding the correctable

mcelog never scanned the syslog all the time. This is just
EDAC misdesign.

But yes syslog is exactly the wrong interface for these kinds of errors.

> 2. Also another very good point you had is go into maintenance mode by
> throttling or even suspend all uspace processes and start a restricted
> maintenance shell after an MCE happens. This should be done based on the

When you have an unrecoverable MCE this is not safe, because you
can't write anything to disk: there are uncontained errors somewhere
in the hardware (and usually the system is unstable and will crash soon).
The most important thing to do in this situation
is to *NOT* write anything to disk (and that is the reason
why the hardware raised the unrecoverable MCE in the first place).
Having a shell without being able to write to disk doesn't make sense.

When you have a recoverable MCE with contained errors this is not needed,
because it, well, just recovers.

> 3. All the hw events like correctable ECCs should be thresholded so that
> all errors exceeding a preset threshold (below that is normal operation

Agreed. Corrected errors without thresholds are useless (that is one
of the main reasons why syslog is a bad idea for them)

See also my plumbers presentation on the topic:

http://halobates.de/plumbers-error.pdf

One key part is that for most interesting reactions to thresholds
you need user space, kernel space is too limited.

My current direction has been to implement this in mcelog, which
maintains threshold counters and already does a couple of direct (user-space
based) threshold reactions, like offlining cores and pages and reporting
short, user-friendly error summaries when thresholds are exceeded.
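
For reference, the page-offlining reaction mentioned above boils down,
in its simplest form, to writing the failing physical address to the
hwpoison soft-offline file in sysfs. This is only a sketch and not
mcelog's actual code; the sysfs path is quoted from memory, so
double-check it on your kernel:

/* soft-offline-sketch.c: ask the kernel to soft-offline one page */
#include <stdio.h>
#include <stdint.h>

static int soft_offline_page(uint64_t paddr)
{
	/* path provided by the hwpoison code - verify on your kernel */
	FILE *f = fopen("/sys/devices/system/memory/soft_offline_page", "w");

	if (!f)
		return -1;
	fprintf(f, "%#llx\n", (unsigned long long)paddr);
	return fclose(f);
}

int main(void)
{
	/* example address; in practice it comes from the decoded error record */
	if (soft_offline_page(0x33574910ULL))
		perror("soft_offline_page");
	return 0;
}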

Longer term I hope to move to a more generic (user) error infrastructure
that handles more kinds of errors. This needs some infrastructure
work, but not too much.

>
> The current decoding needs more loving too since now it says something
> like the following:

Yes, see the slide set above on thoughts how a good error looks like.

The big problem with EDAC currently is that it doesn't give
the information actually needed (like mainboard labels), but does give
a lot of irrelevant low-level information. And since it's kernel
based it cannot do most of the interesting reactions. And it doesn't
have a usable interface to add user events.

And yes having all that crap in syslog is completely useless, unless
you're debugging code.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-26 06:33:55

by Borislav Petkov

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

Hi,

On Mon, Jan 25, 2010 at 02:19:15PM +0100, Andi Kleen wrote:
> > Because this is one thing that has been bugging us for a long time. We
> > don't have a centralized smart utility with lots of small subcommands
> > like perf or git, if you like, which can dump you the whole or parts
>
> PC configuration is all in dmidecode, CPU/node information in lscpu
> these days (part of utils-linux)
>
> The dmidecode information could be perhaps presented nicer, but
> I don't think we need any fundamental new tools.

Uuh, dmidecode doesn't even start to look usable in my book because you
have to rely on BIOS vendors to fill out the information for you. Here
are some assorted excerpts from dmidecode on my machines:

1. Incomplete info:

Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: System manufacturer
Product Name: System Product Name
Version: System Version
Serial Number: System Serial Number
UUID: 201EE116-055E-DD11-8B0E-002215FDB1C6
Wake-up Type: Power Switch
SKU Number: To Be Filled By O.E.M.
Family: To Be Filled By O.E.M.

2. Wrong(!) info:

Handle 0x0007, DMI type 7, 19 bytes
Cache Information
Socket Designation: L3-Cache
Configuration: Enabled, Not Socketed, Level 3
Operational Mode: Varies With Memory Address
Location: Internal
Installed Size: 6144 KB
Maximum Size: 6144 KB
Supported SRAM Types:
Pipeline Burst
Installed SRAM Type: Pipeline Burst
Speed: Unknown

why?

Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: 4-way Set-associative

how is my L3 4-way set-associative and how do they come up with that???

[ by the way, it says the same on my old P4 box' L2 so this could mean
anything besides the actual cache assoc. ]


Here's what the dmidecode manpage says:

"...

BUGS
More often than not, information contained in the DMI tables is
inaccurate, incomplete or simply wrong.
...
"

so I guess I'm not the only one :)

In the end, even if the info were correct, it would still cover nowhere
near all the information you might need from a system. So you end up
pulling in a dozen different tools just to get the info you need. So
yes, I really do think we need a tool that gets the job done right, and
on any system. And this tool should be distributed with the kernel
sources like perf is, so that you don't have to jump through hoops to
pull the stuff in (esp. if you have to build everything every time like
Andreas does :)).

> > 1. We need to notify userspace, as you've said earlier, and not scan
> > the syslog all the time. And EDAC, although decoding the correctable
>
> mcelog never scanned the syslog all the time. This is just
> EDAC misdesign.

Oh yes, EDAC has the edac-utils too, which access /sysfs files, but even
so it is suboptimal, and we really need a single interface/output
channel/whatever you call a beast like that to reliably transfer
human-readable hw error info to userspace and/or the network. And this has to be
pushed from kernel space outwards as early as the gravity of the error
suggests, IMO.

> But yes syslog is exactly the wrong interface for these kinds of errors.

Agreed completely.

> > 2. Also another very good point you had is go into maintenance mode by
> > throttling or even suspend all uspace processes and start a restricted
> > maintenance shell after an MCE happens. This should be done based on the
>
> When you have a unrecoverable MCE this is not safe because you
> can't write anything to disk (and usually the system is unstable
> and will crash soon) because there are uncontained errors somewhere
> in the hardware. The most important thing to do in this situation
> is to *NOT* write anything to disk (and that is the reason
> why the hardware raised the unrecoverable MCE in the first place)
> Having a shell without being able to write to disk doesn't make sense.

Hmm, not necessarily. First of all, not all UC errors are absolutely
valid reasons to panic the machine. Imagine, for example, you encounter
(as unlikely as it might be) a multibit error during L1 data cache
scrubbing which hasn't been consumed yet. Now, technically, no data
corruption has taken place yet so you can easily start the shell on
another core which doesn't have that datum in its cache, decode the
error for the user to see what it was, and even allow her/him to power off
the machine properly.

Or imagine you have a L2 TLB multimatch - also UC but you can still
invalidate the two entries, maybe kill the processes that have caused
those mappings and start the shell.

So no, not all UC errors absolutely have to cause data corruption, and
you can still prepare for a clean exit by warning the user that her/his
data might be compromised and asking whether (s)he wants to write to disk or
power off the machine immediately, SysRq-O style.

And even if a UC error causes data corruption, panicking the system doesn't
mean that the error has been contained. Nothing can assure you that by
the time do_machine_check() has run, the corrupted data hasn't left the
CPU core and landed in another core's cache (maybe even on a different
node) and then on disk through an outstanding write request. That's why
we syncflood the HT links on certain error types, since an MCE is not
enough to stop that propagation.

> > 3. All the hw events like correctable ECCs should be thresholded so that
> > all errors exceeding a preset threshold (below that is normal operation
>
> Agreed. Corrected errors without thresholds are useless (that is one
> of the main reasons why syslog is a bad idea for them)
>
> See also my plumbers presentation on the topic:
>
> http://halobates.de/plumbers-error.pdf
>
> One key part is that for most interesting reactions to thresholds
> you need user space, kernel space is too limited.
>
> My current direction was implementing this in mcelog which
> maintains threshold counters and already does a couple of direct (user
> based) threshold reactions, like offlining cores and pages and reporting
> short user friendly error summaries when thresholds are exceeded.

Yep, sounds good.

> Longer term I hope to move to a more generic (user) error infrastructure
> that handles more kinds of errors. This needs some infrastructure
> work, but not too much.

Yep, I think this is something we should definitely talk about since our
error reporting right now needs a bunch of work to even start becoming
really usable.

> > The current decoding needs more loving too since now it says something
> > like the following:
>
> Yes, see the slide set above on thoughts how a good error looks like.
>
> The big problem with EDAC currently is that it neither gives
> the information actually needed (like mainboard labels), but gives
> a lot of irrelevant low level information.

Yes, I'm very well aware of that. I'm currently working on a solution.
It's just an idea now but I might be able to read DIMM configuration
on the SPD ROM on the DIMM along with their labels and position on the
motherboard in order to be able to pinpoint the correct DIMM... Stay
tuned...

> And since it's kernel
> based it cannot do most of the interesting reactions. And it doesn't
> have a usable interface to add user events.
>
> And yes having all that crap in syslog is completely useless, unless
> you're debugging code.

So basically, IMHO we need:

1. Resilient error reporting that reliably pushes decoded error info to
userspace and/or network. That one might be tricky to do but we'll get
there.

2. Error severity grading and acting upon each type accordingly. This
might need to be vendor-specific.

3. Proper error format suiting all types of errors.

4. Vendor-specific hooks where it is needed for in-kernel handling of
certain errors (L3 cache index disable, for example).

5. Error thresholding, representation, etc all done in userspace (maybe
even on a different machine).

6. Last but not least, and maybe this is wishful thinking, a good tool
to dump hwinfo from the kernel. We do a great job of detecting that info
already - we should do something with it, at least report it...

Let's see what the others think.

Thanks.

--
Regards/Gruss,
Boris.

2010-01-26 09:07:19

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

(2010/01/26 15:33), Borislav Petkov wrote:
> In the end, even if the info were correct, it is still not nearly enough
> for all the information you might need from a system. So you end up
> pulling a dozen of different tools just to get the info you need. So
> yes, I really do think we need a tool to get do the job done right and
> on any system. And this tool should be distributed with the kernel
> sources like perf is, so that you don't have to jump through hoops to
> pull the stuff (Esp. if you have to build everything everytime like
> Andreas does :)).

How about having a system file which can be maintained with the kernel,
e.g. /proc/hwinfo, /sys/devices/platform/hwinfo, or a directory
with some files like /somewhere/hwinfo/{dmi,acpi,pci,...}, etc.?

>> And since it's kernel
>> based it cannot do most of the interesting reactions. And it doesn't
>> have a usable interface to add user events.
>>
>> And yes having all that crap in syslog is completely useless, unless
>> you're debugging code.
>
> So basically, IMHO we need:
>
> 1. Resilient error reporting that reliably pushes decoded error info to
> userspace and/or network. That one might be tricky to do but we'll get
> there.

I think it would be better to think of "error" as a subset of "event",
which can be reported if one is interested but otherwise filtered out.
The use of TRACE_EVENT() for the mce events aims at such an approach, at least.

> 2. Error severity grading and acting upon each type accordingly. This
> might need to be vendor-specific.

I think you mean severity grading in the kernel.
Even if the hardware reported an error and graded it as corrected, the kernel
can escalate the severity, likely based on some threshold.

> 3. Proper error format suiting all types of errors.

As mentioned in Andi's PDF, the CPER format is one good candidate
available today, I think.
However, we could invent a more suitable one if needed.

> 4. Vendor-specific hooks where it is needed for in-kernel handling of
> certain errors (L3 cache index disable, for example).

There would be some difficulty in adding such a hook in the UE handling path,
but we can have it for the CE path anyway. It just needs to be organized.

> 5. Error thresholding, representation, etc all done in userspace (maybe
> even on a different machine).

(...BTW, how about putting mcelog tree under the /tools, Andi?)

> 6. Last but not least, and maybe this is wishful thinking, a good tool
> to dump hwinfo from the kernel. We do a great job of detecting that info
> already - we should do something with it, at least report it...

Of course I want to have a tool to get a summary (not a full dump) of
the current hardware status too, e.g.:
$ cat ./hwinfo/faulty
WARN: DIMM @ slot X on node Y: 208 errors corrected in last 3 days
INFO: PCI 0000:NN:01.1: 1 error recovered 37 hours ago

> Let's see what the others think.
>
> Thanks.

Thanks,
H.Seto

2010-01-26 15:36:55

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

On Tue, Jan 26, 2010 at 07:33:43AM +0100, Borislav Petkov wrote:
> Uuh, dmidecode doesn't even start to look usable in my book because you
> have to rely on BIOS vendors to fill out the information for you. Here
> are some assorted excerpts from dmidecode on my machines:

For most of the information in DMI decode you can't even
find it any other way. If you can't get it from the BIOS
it's simply not there.

One example is silkscreen labels, which are extremely important
for any kind of hardware error handling.

On my server-class systems the information is largely correct. If it's not
on your system, perhaps you need to complain to the vendor.

On non-server-class systems there are a lot of BIOS problems, but typically
those platforms don't have enough support for good error
handling anyway.

> how is my L3 4-way set-associative and how do they come up with that???

Cache/CPU information is in lscpu. The important part is the motherboard
resources.

> on any system. And this tool should be distributed with the kernel
> sources like perf is, so that you don't have to jump through hoops to

Most distributions have some kind of summary tool to aggregate
complete system configuration.

It's not the same everywhere, but that's one of the strengths
of Linux imho, not a weakness.

> Oh yes, EDAC has the edac-utils too which access /sysfs files but even
> so, it is suboptimal and we really need a single interface/output
> channel/whatever you call a beast like that to reliably transfer human
> readable hw error info to userspace and/or network. And this has to be
> pushed from kernel space outwards as early as the gravity of the error
> suggests, IMO.

You just reinvented mcelog, congratulations.

> valid reasons to panic the machine. Imagine, for example, you encounter
> (as unlikely as it might be) a multibit error during L1 data cache
> scrubbing which hasn't been consumed yet. Now, technically, no data
> corruption has taken place yet so you can easily start the shell on

When no data corruption has taken place it's not a UC error.

A UC error in this case is defined as something that the hardware
tells us is a UC error - worse even, an uncontained UC error.

The reason the hardware tells us about that is that it wants us to prevent
further damage. And the primary way to do that is to NOT write anything
to disk, because that would risk corrupting it.

For contained memory UC errors I wrote all the infrastructure last year to
handle them. Making hwpoison better is still an ongoing
project, but it's already quite usable.

> And even if an UC causes data corruption, panicking the system doesn't
> mean that the error has been contained. Nothing can assure you that by
> the time do_machine_check() has run the corrupted data hasn't left the
> CPU core and landed in another core's cache (maybe even on a different

That is why the panic stops all cpus to prevent that.

But yes if the disk write happens at exactly the wrong point the error
could still escape, but we try to keep the window as small as possible.
Typically there's also some hardware help with that to catch currently
in flight transactions. It depends on the platform how well it works.

> Yes, I'm very well aware of that. I'm currently working on a solution.
> It's just an idea now but I might be able to read DIMM configuration
> on the SPD ROM on the DIMM along with their labels and position on the

The SPD ROM doesn't have labels. The only entity that knows them
is the BIOS (or someone who has studied the semantics of the motherboard,
but I don't think we can rely on that).

> 1. Resilient error reporting that reliably pushes decoded error info to
> userspace and/or network. That one might be tricky to do but we'll get
> there.

Not at all tricky. At least on modern Intel platforms mcelog
already does it.

>
> 2. Error severity grading and acting upon each type accordingly. This
> might need to be vendor-specific.


mcelog does it mostly. It's not perfect yet, but not too bad.


> 3. Proper error format suiting all types of errors.

I plan to look into that.

But "suiting all types of errors" is probably a mistake,
I don't think it makes sense to try to invent the one
perfect error that covers everything. People have tried
that in the past and it was always a spectacular failure.

I suspect the better goal is rather a range of error formats
for common situations with a lot of flexibility.


> 5. Error thresholding, representation, etc all done in userspace (maybe
> even on a different machine).

mcelog does that for memory errors on modern systems.

> 6. Last but not least, and maybe this is wishful thinking, a good tool
> to dump hwinfo from the kernel. We do a great job of detecting that info
> already - we should do something with it, at least report it...

IMHO there are already enough of them.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-26 16:09:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

On Tue, Jan 26, 2010 at 06:06:26PM +0900, Hidetoshi Seto wrote:
> How about having a system file which can be maintained with kernel,
> e.g. like /proc/hwinfo, /sys/devices/platform/hwinfo, or directory
> with some files like /somewhere/hwinfo/{dmi,acpi,pci,...} etc.?

Why not do that in user space?

In fact it's often already done.

Just because we're kernel programmers doesn't mean that everything
needs to be solved inside the kernel.

> >> And since it's kernel
> >> based it cannot do most of the interesting reactions. And it doesn't
> >> have a usable interface to add user events.
> >>
> >> And yes having all that crap in syslog is completely useless, unless
> >> you're debugging code.
> >
> > So basically, IMHO we need:
> >
> > 1. Resilient error reporting that reliably pushes decoded error info to
> > userspace and/or network. That one might be tricky to do but we'll get
> > there.
>
> I think it would be better to think "error" is a subset of "event",
> which could be reported if interested but otherwise be filtered.
> Use of TRACE_EVENT() for mce event aim such approach at least.

The whole trace event infrastructure right now is not really
aimed at, or useful for, "always on, low overhead background monitoring" like
standard error handling requires.

In principle it could probably be fixed (although I'm a bit
sceptical on the "low overhead" part), but I suspect the result
would be neither optimized for error handling nor optimized
for performance monitoring anymore. They simply have
very different requirements.

When you do full event tracing anyway it makes some sense to get events
for errors too, but that's quite a different use case.

For "standard" error handling I think we're better off with
something optimized for the job.

> > 2. Error severity grading and acting upon each type accordingly. This
> > might need to be vendor-specific.
>
> I think you mean severity grading in kernel.
> Even if hardware reported an error and graded it as corrected, kernel
> can escalate the severity, likely based on some threshold.

I don't think the kernel should do that; it's very much a policy
decision, and those are best kept as near the administrator
as possible (= user space).

That is, for some cases it might make sense to have limited thresholds
in the kernel, but I suspect those cases are few. Mostly it would
be the case when the hardware itself already keeps these counters.

>
> > 3. Proper error format suiting all types of errors.
>
> As mentioned in Andi's PDF, CPER format is one of good candidate
> available today, I think.

Yes for hardware errors. It's definitely not perfect and somewhat
overdesigned, but I'm not sure we could come up with a much better one.
At least a subset of it with some extensions might do. Also in some
cases the error is already in this format.

The advantage of it is that it's at least well understood and documented.

> > 4. Vendor-specific hooks where it is needed for in-kernel handling of
> > certain errors (L3 cache index disable, for example).
>
> Some difficulty would be there to add such hook in the UE handling path,
> but anyway we can have it for the CE path. Just need to be organized.
>
> > 5. Error thresholding, representation, etc all done in userspace (maybe
> > even on a different machine).
>
> (...BTW, how about putting mcelog tree under the /tools, Andi?)

I don't see the advantage. Linux has always been a collection
of packages, not a single unified big tree. Also, my current
impression is that the in-tree user space builds don't work
very well.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-27 12:35:38

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

Ingo Molnar wrote:
> * Borislav Petkov <[email protected]> wrote:
>
>> (Adding some more interested parties to Cc:)
>>
>> On Sat, Jan 23, 2010 at 06:17:17AM +0100, Ingo Molnar wrote:
>>> * tip-bot for H. Peter Anvin <[email protected]> wrote:
>>>
>>>> Commit-ID: f91c4d2649531cc36e10c6bc0f92d0f99116b209
>>>> Gitweb: http://git.kernel.org/tip/f91c4d2649531cc36e10c6bc0f92d0f99116b209
>>>> Author: H. Peter Anvin <[email protected]>
>>>> AuthorDate: Thu, 21 Jan 2010 18:31:54 -0800
>>>> Committer: H. Peter Anvin <[email protected]>
>>>> CommitDate: Thu, 21 Jan 2010 18:31:54 -0800
>>>>
>>>> x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll
>>>>
>>>> cpu_specific_poll is a global variable, and it should have a global
>>>> namespace name. Since it is MCE-specific (it takes a struct mce *),
>>>> rename it mce_cpu_specific_poll.
>>>>
>>>> Signed-off-by: H. Peter Anvin <[email protected]>
>>>> Cc: Andi Kleen <[email protected]>
>>>> LKML-Reference: <[email protected]>
>>> FYI, this commit triggered a -tip test failure:
>>>
>>> arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c: In function 'xeon75xx_mce_init':
>>> arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c:340: error: implicit declaration of function 'pci_match_id'
>>>
>>> I'm excluding it from tip:master.
>>>
>>> But the bigger problem with this commit is structure of it - or the lack
>>> thereof.
>>>
>>> It just blindly goes into the direction the MCE code has been going for some
>>> time, minimally enabling the hardware, ignoring both the new EDAC design and
>>> the new performance monitoring related design i outlined some time ago.
>> I completely agree - from what I see this is adding vendor- or rather
>> vendor-and-machine-specific hooks to read out (1) the position of the the
>> memory translation table from PCI config space (0x8c), (2) then to read out
>> the offset from the first MCA status register in order to (3) rdmsr the
>> status information.
>>
>> In AMD's case, we need similar hooks too, in order to evaluate correctable
>> MCEs for different RAS reasons like for example L3 cache or data arrays
>> errors for disabling L3 indices. I was looking into adding hooks into
>> machine_check_poll() and cpu_specific_poll() interface could work.
>>
>> Furthermore, lets leave mcheck be mcheck and do error decoding in EDAC
>> modules. For example, there was a core i7 EDAC module submission from Mauro
>> and the Xeon75xx-specific decoding bits could be added to it or even as a
>> new machine-specific module instead of mcelog.
>>
>> With the evergrowing complexity of memory controller design I don't think
>> that the userspace mcelog approach will scale - you need the whole decoding
>> in the kernel where the module knows the exact memory controllers setup and
>> which DRAM addresses belong to which nodes and whether you do memory
>> hoisting and whether you interleave, if yes, how and on what granilarity you
>> interleave and on and on...
>
> Yep. Could you give a few pointers to Andi where exactly you'd like to see the
> Intel Xeon functionality added to the EDAC code? There is some Intel
> functionality there already, but the current upstream code does not look very
> uptodate. I've looked at e752x_edac.c. (there's some Corei7 work pending,
> right?) In any case there's a lot of fixing to be done to the Intel code
> there.

The i7core_edac driver covers the Nehalem chipsets. It is at:
http://git.kernel.org/?p=linux/kernel/git/mchehab/i7core.git

The error decoding is done by i7core_mce_output_error(). If Xeon 75xx needs to do
something different, it is just a matter of calling a function there, to retrieve
the values from the proper registers.

In order to avoid duplicating code, and to avoid having two drivers fighting to
access the same error registers at the same time, this driver relies on the mce
driver to send it the error events.

I expect to allocate some time to put the driver into a better shape for
upstream submission soon.

> Memory controller functionality integrated into the kernel proper is a good
> idea partly for similar reasons an on-die memory controller is a good idea in
> the hardware space.

I agree. The EDAC module handles memory errors for several memory controllers,
including some on non-x86 architectures. Letting this decoding happen outside
the kernel, just for a small set of memory controllers, seems like a bad idea.

Cheers,
Mauro.

2010-01-27 14:40:09

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

> The i7core_edac driver covers the Nehalem chipsets. It is at:

Erm no - first, Nehalem isn't a chipset, and second, it only covers
parts of the space (some subset of Nehalems with a specific uncore).

This does not include the Xeon 75xx, where all of this is quite different.

> I agree. The EDAC module handles memory errors for several memory controllers,
> including some on non-x86 architectures. Letting this decode to happen outside
> the kernel, just for a very few set of memory controllers, seems a bad idea.

Again you can't do all the things mcelog does in kernel space only.

-Andi
--
[email protected] -- Speaking for myself only.

2010-01-27 15:06:15

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

Andi Kleen wrote:
> it only covers
> parts of the space (some subset of Nehalems with a specific uncore).

The changes for Xeon 55xx are minimal. I did some tests with the driver on it already.

> This does not include Xeon 75xx where all this is quite different.

I don't have the datasheets for the 75xx, so I can't say for sure whether it would be
better to use the same driver or to fork it.

>> I agree. The EDAC module handles memory errors for several memory controllers,
>> including some on non-x86 architectures. Letting this decode to happen outside
>> the kernel, just for a very few set of memory controllers, seems a bad idea.
>
> Again you can't do all the things mcelog does in kernel space only.

Well, the error parsing can be done in kernel space in a standard way, provided
by the edac interface.

I don't see why the mcelog userspace shouldn't use the EDAC interface as one
of its sources, getting memory errors from it and avoiding the need to re-parse
the errors.

Cheers,
Mauro.

2010-01-27 16:36:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

On Wed, Jan 27, 2010 at 01:04:55PM -0200, Mauro Carvalho Chehab wrote:
> I haven't the datasheets for 75xx, so I can't say for sure if it would be better to
> use the same driver or to fork it.

You can't use the same driver.

> Well, the error parsing can be done in kernel space in a standard way provided
> by the edac interface.
>
> I don't see why not the mcelog userspace shouldn't use the EDAC interface as one
> of its source, getting memory errors from it, avoiding the need of re-parsing
> the errors.

The errors are just numbers which are printed. If by "parsing" you mean
splitting up the bitfields, that's not really a very interesting case.

Essentially, to get the terminology clear, for corrected errors there are multiple
steps (uncorrected errors are quite different):

1) Getting the error from hardware registers
2) Accounting them
3) Presenting them to users
4) Reacting to events
which can be separated into
4a) protocol to communicate with event handler
4a1) interface to wake up event handler
4a2) communication
4b) event handler itself

Some parts of these need to be in kernel space, but that's
pretty much only (1).

Some parts of these need to be in user space: in particular
(4b), and (3) for any non-trivial presentation (the kernel can
do some very limited presentation, but it's not good at anything non-trivial
here).

(4b) needs to be in user space; it's deeply about policy, and the most interesting
advanced reactions to errors cannot be done in kernel space alone.

i7core does (1), some (but not all) of (2), and (4a).
I don't really count EDAC as (3), because fishing the numbers out of
sysfs by hand is not user friendly. In EDAC that's typically done
with the EDAC utils, which are user space.

EDAC doesn't really solve (4a1), unless you count "write a syslog scanner"
as solving it.

The xeon75xx mce driver only does (1) and uses the standard
MCE event passing mechanism (4a) to pass it to mcelog.

mcelog just does the other parts, most of which have to be in user space
anyway.
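
To make (4a) concrete: on the consumer side the protocol is essentially
"poll /dev/mcelog, then read the whole record buffer". A stripped-down
sketch follows - the MCE_GET_* ioctls are taken from the exported
asm/mce.h, the "full buffer read" requirement is from memory, and all
the actual decoding of the struct mce records (what mcelog really
spends its time on) is omitted:

/* mce-read-sketch.c: skeleton of the /dev/mcelog consumer side */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <asm/mce.h>		/* MCE_GET_RECORD_LEN, MCE_GET_LOG_LEN */

int main(void)
{
	int fd = open("/dev/mcelog", O_RDONLY);
	int reclen, loglen;
	char *buf;

	if (fd < 0 ||
	    ioctl(fd, MCE_GET_RECORD_LEN, &reclen) < 0 ||
	    ioctl(fd, MCE_GET_LOG_LEN, &loglen) < 0) {
		perror("/dev/mcelog");
		return 1;
	}
	/* the device expects the whole log buffer to be read in one go */
	buf = malloc((size_t)reclen * loglen);

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		ssize_t n;

		poll(&pfd, 1, -1);	/* 4a1): wait until records are available */
		n = read(fd, buf, (size_t)reclen * loglen);
		if (n > 0)
			/* 4a2): hand the raw struct mce records to accounting/decoding */
			printf("got %zd bytes = %zd records\n", n, n / reclen);
	}
	return 0;
}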

The only thing you could probably argue about is whether it should do accounting
or not. Right now it does, and EDAC does too. At least
for advanced accounting (per page), where you can have a lot of
data (it can be larger than a struct page per 4K page),
I personally prefer that data to be swappable.

Hope this helps,

-Andi


--
[email protected] -- Speaking for myself only.

2010-02-05 23:35:32

by Andi Kleen

[permalink] [raw]
Subject: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

Commit-ID: 757fd770c649b0dfa6eeefc2d5e2ea3119b6be9c
Gitweb: http://git.kernel.org/tip/757fd770c649b0dfa6eeefc2d5e2ea3119b6be9c
Author: Andi Kleen <[email protected]>
AuthorDate: Sat, 23 Jan 2010 12:33:59 +0100
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 5 Feb 2010 14:51:53 -0800

x86, mce: Make xeon75xx memory driver dependent on PCI

Found by Ingo Molnar's automated tester.

Reported-by: Ingo Molnar <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/Kconfig | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f62db24..ab2d0a8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -834,7 +834,7 @@ config X86_MCE_INTEL

config X86_MCE_XEON75XX
tristate "Intel Xeon 7500 series corrected memory error driver"
- depends on X86_MCE_INTEL
+ depends on X86_MCE_INTEL && PCI
---help---
Add support for a Intel Xeon 7500 series specific memory error driver.
This allows to report the DIMM and physical address on a corrected

2010-02-16 20:47:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI


* tip-bot for Andi Kleen <[email protected]> wrote:

> Commit-ID: 757fd770c649b0dfa6eeefc2d5e2ea3119b6be9c
> Gitweb: http://git.kernel.org/tip/757fd770c649b0dfa6eeefc2d5e2ea3119b6be9c
> Author: Andi Kleen <[email protected]>
> AuthorDate: Sat, 23 Jan 2010 12:33:59 +0100
> Committer: H. Peter Anvin <[email protected]>
> CommitDate: Fri, 5 Feb 2010 14:51:53 -0800
>
> x86, mce: Make xeon75xx memory driver dependent on PCI
>
> Found by Ingo Molnar's automated tester.
>
> Reported-by: Ingo Molnar <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>
> LKML-Reference: <[email protected]>
> Signed-off-by: H. Peter Anvin <[email protected]>

As an x86 maintainer I'm NAK-ing your Nehalem MCE patches. I really don't want
to see this code in v2.6.34; please work on it some more - this should be
implemented _properly_ and _cleanly_.

Andi, you've ignored my repeated complaints about the cleanliness of the MCE
code:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2010-01/msg08454.html

( There have been earlier remarks from me on this topic, months ago - the first
one more than a year ago - so it's not like you didn't have any advance
warning. I suspect I should have NAK-ed earlier. )

and you've ignored what the EDAC developers such as Mauro tried to explain to
you in that same thread. Your design to expose Intel hardware features sucks
(because it's essentially non-existent, and because it exposes so little and
gives users so few benefits) and you should know it.

Please work with Mauro on the Nehalem EDAC bits; they seem advanced enough to
me for v2.6.34, and _far_ cleaner and more capable as well. See those Intel
support bits at:

git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core.git master

Documentation/edac.txt | 151 +++
arch/x86/include/asm/pci_x86.h | 2 +
arch/x86/kernel/cpu/mcheck/mce_64.c | 1200 +++++++++++++++++++++
arch/x86/pci/legacy.c | 43 +-
drivers/edac/Kconfig | 19 +
drivers/edac/Makefile | 7 +
drivers/edac/edac_core.h | 5 +
drivers/edac/edac_mc_sysfs.c | 4 +
drivers/edac/edac_mce.c | 61 ++
drivers/edac/i7core_edac.c | 1966 +++++++++++++++++++++++++++++++++++
include/linux/edac_mce.h | 31 +
include/linux/pci.h | 1 +
include/linux/pci_ids.h | 19 +
13 files changed, 3492 insertions(+), 17 deletions(-)

It's a big step forward for Intel CPU features in my opinion, as it is a
proper, clean driver interface, and I find it highly curious why you are not
coding for that feature instead.

Once that is done the EDAC code can be pushed by all distros as a common
interface, as recent Intel CPUs would be supported by it.

I really do not understand why you are trying to keep Intel CPUs in the dark
ages while most other CPUs are properly supported by the EDAC code ... It
seems hugely counter-intuitive to me.

Note, any missing functionality on the EDAC side needs a proper driver
interface to arch/x86 - not the kind of butchered-in, 700-line piece of
ugly crap that you are trying to push here ... I'd welcome patches for such
interfaces, as they'd further simplify (and also enhance) arch/x86.

( And as I suggested, you might also want to work on representing machine
events as part of the perf events framework. )

Ingo

2010-02-16 21:02:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll


* Borislav Petkov <[email protected]> wrote:

> > Yes, my initial thoughts on that are in the lkml mail below from a few
> > months ago. We basically want to enumerate the hardware and its events
> > intelligently - and integrate that nicely with other sources of events.
> > That will give us a boatload of new performance monitoring and analysis
> > features that we could not have dreamt of before.
> >
> > Certain events can be 'richer' and 'more special' than others (they can
> > cause things like signals - on correctable memory faults), but so far
> > there's little that deviates from the view that these are all system
> > events, and that we want a good in-kernel enumeration and handling of
> > them. Exposing it on the low level a'la mcelog is a fundamentally bad
> > idea as it pushes hardware complexity into user-space (handling hardware
> > functionality and building good abstractions on it is the task of the
> > kernel - every time we push that to user-space the kernel becomes a
> > little bit poorer).
> >
> > Note that this very much plugs into the whole problem space of how to
> > enumerate CPU cache hierarchies - something that i think Andreas is
> > keenly interested in.
>
> Oh yes, he's interested in that allright :)
>
> > We want one unified enumeration of hardware [and software] components and
> > one enumeration of the events that originate from there. Right now we are
> > mostly focused on software component enumeration via
> > /debug/tracing/events, but that does not (and should not) remain so. It's
> > not a small task to implement all aspects of that, but it can be done
> > gradually and it will be very rewarding all along the way in my opinion.
>
> Yes, this is very interesting. How do we represent that in the kernel space
> as one contiguous "tree" or "library" or whatever without adding overhead
> and opening that info to userspace?

If you do it within perf, or at least share code with it, you'd basically use
its accessor methods as a library. There are parsers for event descriptors, so
if the kernel exposes events it's all rather straightforward to code for.

For a more dynamic set of events we could enhance the ftrace event interface
some more as well - or possibly create a separate 'events' filesystem that
enumerates all sorts of event sources in some nice unified hierarchy.

This is something that has come up before in other contexts as well.
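
Just to illustrate how simple the consumer side of such an enumeration
already is today: walking the ftrace events directory gives a tool the
full list of subsystem:event pairs the kernel exposes. A minimal
sketch, assuming debugfs is mounted at the usual place:

/* list-events-sketch.c: enumerate /sys/kernel/debug/tracing/events */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

#define EVENTS_DIR "/sys/kernel/debug/tracing/events"

int main(void)
{
	DIR *top = opendir(EVENTS_DIR);
	struct dirent *sub, *ev;

	if (!top) {
		perror(EVENTS_DIR);
		return 1;
	}
	while ((sub = readdir(top))) {
		char path[512];
		DIR *d;

		if (sub->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), EVENTS_DIR "/%s", sub->d_name);
		d = opendir(path);
		if (!d)
			continue;	/* skip plain files like "enable" */
		while ((ev = readdir(d))) {
			if (ev->d_name[0] == '.' ||
			    !strcmp(ev->d_name, "enable") ||
			    !strcmp(ev->d_name, "filter"))
				continue;
			/* each event directory also has a "format" file describing its fields */
			printf("%s:%s\n", sub->d_name, ev->d_name);
		}
		closedir(d);
	}
	closedir(top);
	return 0;
}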

> Because this is one thing that has been bugging us for a long time. We
> don't have a centralized smart utility with lots of small subcommands like
> perf or git, if you like, which can dump you the whole or parts of the hw
> configuration of the machine - something like cache sizes and hierarchy,
> CPU capabilities from CPUID flags, memory controllers configuration, DRAM
> type and sizes, NUMA info, processor PCI config space along with decoded
> register and bit values, ... (where do I stop)...
>
> Currently, we have a ragged collection of tools with their own syntax and
> output formatting like numactl, x86info, /proc/cpuinfo, (eyeballing dmesg
> output - which is not even a tool :) and it is very annoying when you have
> a bunch of machines and you start pulling them tools in, one after another,
> before you can even get to the hw information.
>
> So, it would be much much more useful if we had such a tool that can give
> you precise hw information without disrupting the kernel (I remember
> several bugs with ide-cd last year where some udev helpers were querying
> the drive for capabilities but the drive wasn't ready yet and, as a result,
> was getting puzzled so much that it wouldn't load properly). Its
> subcommands could each cover a subsystem or a hw component and you could do
> something like the following example (values in {} are actual settings read
> from the hardware):
>
> <tool> pcicfg -f 18.3 -r 0xe8
> F3x0e8 (Northbridge Capabilities Register): {0x02073f99}
>
> ...
>
> L3Capable: [25]: {1}
> 1=Specifies that an L3 cache is present. See
> CPUID Fn8000_0006_EDX.
>
> ...
>
> LnkRtryCap: [11]: {1}
> Link error-retry capable.
> HTC_capable: [10]: {1}
> This affects F3x64 and F3x68.
> SVM_capable: [9]: {1}
>
> MctCap: [8]: {1}
> memory controller (on the processor) capable.
> DdrMaxRate: [7:5]: {0x4}
> Specifies the maximum DRAM data rate that the
> processor is designed to support.
> Bits DDR limit Bits DDR limit
> ==== ========= ==== =========
> 000b No limit 100b 800 MT/s
> 001b Reserved 101b 667 MT/s
> 010b 1333 MT/s 110b 533 MT/s
> 011b 1067 MT/s 111b 400 MT/s
>
> Chipkill_ECC_capable: [4]: {1}
>
> ECC_capable: [3]: {1}
>
> Eight_node_multi_processor_capable: [2]: {0}
>
> Dual_node_multi_processor_capable: [1]: {0}
>
> DctDualCap: [0]: {1}
> two-channel DRAM capable (i.e., 128 bit).
> 0=Single channel (64-bit) only.
>
>
> And yes, this is very detailed output but it simply serves the purpose to
> show how detailed we can get.
>
> The same thing can output MSR registers like lsmsr does:
>
> MC4_CTL = 0x000000003fffffff (CECCEn=0x1, UECCEn=0x1, CrcErr0En=0x1, CrcErr1En=0x1, CrcErr2En=0x1, SyncPkt0En=0x1, SyncPkt1En=0x1, SyncPkt2En=0x1, MstrAbrtEn=0x1, TgtAbrtEn=0x1, GartTblWkEn=0x1, AtomicRMWEn=0x1, WDTRptEn=0x1, DevErrEn=0x1, L3ArrayCorEn=0x1, L3ArrayUCEn=0x1, HtProtEn=0x1, HtDataEn=0x1, DramParEn=0x1, RtryHt0En=0x1, RtryHt1En=0x1, RtryHt2En=0x1, RtryHt3En=0x1, CrcErr3En=0x1, SyncPkt3En=0x1, McaUsPwDatErrEn=0x1, NbArrayParEn=0x1, TblWlkDatErrEn=0x1)
>
> but in a more human-readable form, without the need to open the hw
> manual for that. And this is pretty lowlevel. How about nodes and cores on
> each node and HT siblings and NUMA proximity and DIMM distribution across
> NBs and which northbridge is connected to the southbridge on a multinode
> system, etc? I know, we have parts of that in /sysfs but it should be
> easier to get that info.
>
> You can have a gazillion examples like those, and the use cases are not a
> small number: ask a user for a specific hw configuration when debugging;
> have output from this tool drive automatic tuning suggestions, powertop-style,
> for 'perf stat' runs where the machine spends too much time in a function
> because, for example, the HT link has been configured to a lower speed for
> power savings but the app that is being profiled is generating a bunch of
> threads doing parallel computations and causing a bunch of cross-node
> traffic which slows it down, etc. etc. etc.
>
> > [ Furthermore, if there's interest i wouldn't mind a 'perf mce' (or more
> > generally a 'perf edac') subcommand to perf either, which would
> > specifically be centered about all things EDAC/MCE policy. (but of course
> > other tooling can make use of it too - it doesn't 'have' to be within
> > tools/perf/ per se - it's just a convenient and friendly place for kernel
> > developers and makes it easy to backtest any new kernel code in this
> > area.)
> >
> > We already have subsystem specific perf subcommands: perf kmem, perf
> > lock, perf sched - this kind of spread-out and subsystem-specific
> > support is one of the strong sides of perf. ]
>
> The example below (which I cut for brevity) is a perfect example of how it
> should be done. Let me first, however, go a step back and give you my
> opinion of how I think this whole MCE catching and decoding should be done
> before we think of tooling:
>
> 1. We need to notify userspace, as you've said earlier, and not scan the
> syslog all the time. And EDAC, although decoding the correctable ECC, spews
> it into the syslog too, causing more parsing (there's edac-utils which polls
> /sysfs but this is just another tool with problems as outlined above).

Via perf events you can get a super-fast mmap()-ed ring-buffer and get the
info in a lightweight way, and possibly squeeze in some policy action before
the system dies.
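
As a rough illustration of that path (not code from this thread, and making a
few assumptions: the raw perf_event_open() syscall, debugfs mounted at
/sys/kernel/debug, and the mce_record tracepoint as the event source), a
minimal userspace reader could look like this:

/*
 * Minimal sketch: open the mce_record tracepoint via the raw
 * perf_event_open() syscall and mmap its ring buffer.  The tracepoint id
 * path and buffer size are illustrative assumptions; a real daemon would
 * also parse the records it reads out of the buffer.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
        struct perf_event_attr attr;
        struct perf_event_mmap_page *meta;
        size_t pgsz = sysconf(_SC_PAGESIZE);
        size_t len = (1 + 8) * pgsz;            /* meta page + 8 data pages */
        unsigned int id;
        FILE *f;
        int fd;

        /* Assumed path; debugfs must be mounted at /sys/kernel/debug. */
        f = fopen("/sys/kernel/debug/tracing/events/mce/mce_record/id", "r");
        if (!f || fscanf(f, "%u", &id) != 1)
                return 1;
        fclose(f);

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_TRACEPOINT;
        attr.size = sizeof(attr);
        attr.config = id;
        attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_TIME;
        attr.sample_period = 1;                 /* every record */
        attr.wakeup_events = 1;                 /* wake up poll() per record */

        fd = perf_event_open(&attr, -1, 0, -1, 0);      /* all tasks, CPU 0 */
        if (fd < 0)
                return 1;

        meta = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (meta == MAP_FAILED)
                return 1;

        for (;;) {
                struct pollfd pfd = { .fd = fd, .events = POLLIN };

                poll(&pfd, 1, -1);
                /* New records live between data_tail and data_head. */
                printf("ring head now at %llu\n",
                       (unsigned long long)meta->data_head);
                meta->data_tail = meta->data_head;      /* mark consumed */
        }
}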

> What is more, the notification mechanism we come up with should push the
> error as early as possible and be able to send it over the network to a
> monitor (think data center with thousands of compute nodes here where CECCs
> happen every day at least) - something like a more resilient netconsole
> which sends out decoded MCE info to the monitor.

Yep.

> 2. Also another very good point you had is to go into maintenance mode by
> throttling or even suspending all userspace processes and starting a restricted
> maintenance shell after an MCE happens. This should be done based on the
> severity of the MCE and the shell should run on a core that _didn't_
> observe the MCE.
>
> 3. All the hw events like correctable ECCs should be thresholded so that
> all errors exceeding a preset threshold (below that is normal operation and
> they get corrected by ECC codes in the hardware anyway) warn of a
> slowly failing DIMM or an L3 subcache index, so the sysop can take action
> if the machine cannot do the failover itself. For example, in the L3
> cache case, the machine can initially disable at most 2 subcache indices and
> notify the user that it has done so, but the user should be warned that the
> hw is failing slowly.
>
> The current decoding needs more loving too since now it says something
> like the following:
>
> EDAC DEBUG: in drivers/edac/amd64_edac_inj.c, line at 170: section=0x80000002 word_bits=0x10020001
> EDAC DEBUG: in drivers/edac/amd64_edac_inj.c, line at 170: section=0x80000002 word_bits=0x10020001
> Northbridge Error, node 0, core: -1
> K8 ECC error.
> EDAC amd64 MC0: CE ERROR_ADDRESS= 0x33574910
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1572: (dram=0) Base=0x0 SystemAddr= 0x33574910 Limit=0x12fffffff
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1583: HoleOffset=0x3000 HoleValid=0x1 IntlvSel=0x0
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1627: (ChannelAddrLong=0x19aba480) >> 8 becomes InputAddr=0x19aba4
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1515: InputAddr=0x19aba4 channelselect=0
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1537: CSROW=0 CSBase=0x0 RAW CSMask=0x783ee0
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1541: Final CSMask=0x7ffeff
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1544: (InputAddr & ~CSMask)=0x100 (CSBase & ~CSMask)=0x0
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1537: CSROW=1 CSBase=0x100 RAW CSMask=0x783ee0
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1541: Final CSMask=0x7ffeff
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1544: (InputAddr & ~CSMask)=0x100 (CSBase & ~CSMask)=0x100
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 1549: MATCH csrow=1
> EDAC MC0: CE page 0x33574, offset 0x910, grain 0, syndrome 0xbe01, row 1, channel 0, label "": amd64_edac
> EDAC MC0: CE - no information available: amd64_edacError Overflow
> EDAC DEBUG: in drivers/edac/amd64_edac_inj.c, line at 170: section=0x80000002 word_bits=0x10020001
>
> and this is only the chip select row, but we need to map that to the actual
> DIMM and tell the admin: "the DIMM with label "BLA" on your motherboard
> seems to be failing", without first having to map all DIMMs through /sysfs
> to their silk-screen labels.
>
> And yes, it is a lot of work but we can at least start talking about it and
> gradually getting it done. What do the others think?

I like it.

You can do it as a 'perf hw' subcommand - or start off a fork as the 'hw'
utility, if you'd like to maintain it separately. It would have a daemon
component as well, to receive and log hardware events continuously, to
trigger policy action, etc.

I'd suggest you start to do it in small steps, always having something that
works - and extend it gradually.

Ingo
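
As a concrete illustration of the raw reads such an 'hw' tool would wrap (a
sketch only: the sysfs config path, the 18.3 device address and the
register/MSR numbers are taken from the AMD examples quoted above, and the
decode into named bit fields is left out):

/*
 * Rough sketch: read AMD node 0 northbridge F3xE8 (Northbridge
 * Capabilities) via the sysfs PCI config file, and the bank-4 MCA control
 * MSR via the msr character device.  Device address, register offset and
 * MSR number follow the example output quoted above and are AMD-specific.
 */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

static int read_pci_cfg32(const char *cfg, off_t reg, uint32_t *val)
{
        int fd = open(cfg, O_RDONLY);

        if (fd < 0)
                return -1;
        if (pread(fd, val, 4, reg) != 4) {
                close(fd);
                return -1;
        }
        close(fd);
        return 0;
}

static int read_msr(int cpu, uint32_t msr, uint64_t *val)
{
        char path[64];
        int fd;

        snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
        fd = open(path, O_RDONLY);      /* needs the msr driver loaded */
        if (fd < 0)
                return -1;
        if (pread(fd, val, 8, msr) != 8) {
                close(fd);
                return -1;
        }
        close(fd);
        return 0;
}

int main(void)
{
        uint32_t nbcap;
        uint64_t mc4_ctl;

        /* F3xE8: node 0 northbridge = device 18h, function 3 */
        if (!read_pci_cfg32("/sys/bus/pci/devices/0000:00:18.3/config",
                            0xe8, &nbcap))
                printf("F3x0e8 = 0x%08x\n", nbcap);

        /* MC4_CTL is MSR 0x410 (MCA bank 4) */
        if (!read_msr(0, 0x410, &mc4_ctl))
                printf("MC4_CTL = 0x%016llx\n", (unsigned long long)mc4_ctl);

        return 0;
}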

2010-02-16 22:29:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI


>
> Please work with Mauro on the Nehalem EDAC bits, they seem rather advanced to
> me for v2.6.34, and _far_ cleaner and more capable as well. See those Intel
> support bits at:

Hi Ingo,

core_i7 and EDAC have nothing to do with this code, and
they have nothing to do with the problem this patch is
solving.

This is for a different chip (xeon75xx)
which has a completely different memory subsystem
and reports memory errors in a completely different way
than xeon55xx/core_i7.

For core_i7/xeon55xx there is no additional event interface needed;
it's all supplied by the hardware on the existing interfaces.

The point of this code is to annotate the CE events on Xeon 75xx
and to implement specific backend actions (page offlining, triggers)
based on specific events. These backend actions are already implemented
on 55xx without additional changes (no need for EDAC).

EDAC does not provide an event interface that can
be polled, just counts, so this cannot be done with EDAC.
It's simply a topology enumeration with error counts.
mcelog is not a topology interface, it's an event
notification mechanism.

EDAC and mcelog are orthogonal, they don't solve the same
problem.

So your nack is based on incorrect assumptions and doesn't make
sense. What you're asking for cannot be done with current
EDAC as far as I know.

-Andi

2010-02-19 10:51:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

On Tue, 16 Feb 2010, Andi Kleen wrote:
> >
> > Please work with Mauro on the Nehalem EDAC bits, they seem rather advanced
> > to
> > me for v2.6.34, and _far_ cleaner and more capable as well. See those Intel
> > support bits at:
>
> Hi Ingo,
>
> core_i7 and EDAC have nothing to do with this code, and
> they have nothing to do with the problem this patch is
> solving.
>
> This is for a different chip (xeon75xx)
> which has a completely different memory subsystem
> and reports memory errors in a completely different way
> than xeon55xx/core_i7.
>
> For core_i7/xeon55xx there is no additional event interface needed;
> it's all supplied by the hardware on the existing interfaces.
>
> The point of this code is to annotate the CE events on Xeon 75xx
> and to implement specific backend actions (page offlining, triggers)
> based on specific events. These backend actions are already implemented
> on 55xx without additional changes (no need for EDAC).
>
> EDAC does not provide an event interface that can
> be polled, just counts, so this cannot be done with EDAC.
> It's simply a topology enumeration with error counts.
> mcelog is not a topology interface, it's an event
> notification mechanism.
>
> EDAC and mcelog are orthogonal, they don't solve the same
> problem.
>
> So your nack is based on incorrect assumptions and doesn't make
> sense. What you're asking for cannot be done with current
> EDAC as far as I know.

It does not matter at all that current EDAC cannot do that right
now. Fact is that you are stubbornly ignoring any request from the x86
maintainers to rework MCE, consolidate it with EDAC and integrate it
into perf as the suitable event logging mechanism.

MCE has no design at all, it's a specialized hack which is limited to
a specific subset of the overall machine health monitoring and
reporting facilities.

You refuse to even think about consolidating the handling of all
health monitoring and reporting facilities into a well designed and
integrated framework.

Your sole argument is that mce can do it and EDAC or whatever can
not. That's not a technical argument at all. MCE does not become a
better design just because you hacked another feature into it.

Ingo's NAK is completely correct and he has my full support for it.

We do not want new crap in the already horrible MCE code. We simply
request a consolidation of machine health monitoring/reporting
facilities before adding new stuff.

You have been ignoring our technical requests for more than a
year. You are refusing to work with other people on a well designed
solution. You just follow your own agenda and try to squeeze more
stuff into MCE.

tglx

2010-02-19 12:17:42

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

Hi Thomas,

I would appreciate it if you could read the whole email
and ideally the references too before replying. I apologize for the length,
but this is a complicated topic.

> and integrate it
> into perf as the suitable event logging mechanism.

The main reason I didn't react to that proposal is that
I don't see a clear path to make perf a good error mechanism.

I know there's a tendency, when you're working on something
that you think is cool, to try to force everything else
you're seeing into that model too (I occasionally have such tendencies
too :-)

But if you take a step back and look at the requirements with a sceptical
eye, that's not always the best thing to do.

Requirements for error handling are very different from performance monitoring.

Let me walk you through some of these differences:

USER TOOLS:

The current perf user tools are not suitable for errors: they are not
"always on, running in the background" like you need for errors.

They are aimed at an interactive user model, which is fine
for performance monitoring (well, at least some forms of performance
monitoring), but not for errors.

Yes, they could probably be reworked for an "always on" daemon model, but the
result would be

a) completely different than what you have today in terms of interface
(it would be a lot more like what you have with oprofile, and as I understand
it one of the main motivations for perf was widespread dislike of the oprofile
daemon model)
b) likely worse for performance monitoring (unless you fork them into two)
The requirements are simply very different.
c) a lot like what mcelog is today. mcelog today is an always-on
error daemon optimized for error handling, nothing else.

There's no associated error-oriented infrastructure like triggers etc.
in perf. Yes, that could all be implemented, but (b) and (c) above apply.

So yes it could be probably done, but I suspect the result would
not make you happy for performance monitoring.

EVENT INTERFACE I:

The perf interface is aimed at a specific way of filtering events, which
is not the right interface for errors, because you usually need
all errors in most (not all) cases. Basically, in performance
monitoring typically most events are off and you sometimes
turn them on; in error handling it's exactly the other way around.

Also, errors tend to have different behavior from performance
counters; for example, a model for an error on an object
is more the "leaky bucket", which is not a good fit for performance.

(I have more details on this in http://halobates.de/plumbers-error.pdf)
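
For illustration, a minimal sketch of such a leaky bucket for per-DIMM
corrected-error accounting might look as follows (the decay period and
threshold are arbitrary example values, not taken from mcelog or from this
thread):

/*
 * Illustration of the "leaky bucket" error model mentioned above: each
 * object (a DIMM, say) has a counter that corrected errors fill and time
 * drains.  The decay period and alarm threshold below are arbitrary
 * example values, not anything specified in this thread.
 */
#include <stdio.h>
#include <time.h>

struct leaky_bucket {
        unsigned int level;             /* current error count */
        time_t last_decay;              /* when we last leaked */
};

#define DECAY_PERIOD    (24 * 60 * 60)  /* leak one error per day */
#define ALARM_THRESHOLD 10              /* complain at 10 errors in-window */

/* Returns 1 if the object has crossed the threshold and needs attention. */
static int bucket_account_error(struct leaky_bucket *b, time_t now)
{
        /* Leak: subtract one error for every full decay period elapsed. */
        while (b->level && now - b->last_decay >= DECAY_PERIOD) {
                b->level--;
                b->last_decay += DECAY_PERIOD;
        }
        if (!b->level)
                b->last_decay = now;

        b->level++;
        return b->level >= ALARM_THRESHOLD;
}

int main(void)
{
        struct leaky_bucket dimm0 = { 0, 0 };

        /* Pretend a corrected error just came in for dimm0. */
        if (bucket_account_error(&dimm0, time(NULL)))
                printf("dimm0: corrected error rate above threshold\n");
        return 0;
}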

OVERHEAD:

The perf subsystem has relatively high overhead today (in terms
of memory size and code size) and is IMHO not suitable
to be always active because of this.

Errors are very fundamental and error reporting
mechanisms have to be always active, so it's extremely important
that they have very low overhead by default. That's not what
perf's model is: it trades memory size and code size for more
performance. That is fine for optional monitoring (if you
can afford it), but not the right model for a fundamental
"always on" mechanism. For "always on" infrastructure it's better
to be slim.

That said, I suspect perf events could likely be put on a serious diet, but it's
unclear if the result would work as well as it does today
for performance monitoring. You would likely lose some features
optimized for it, at least.

EVENT INTERFACE II:

Partly that's because it has a lot of functionality that is not needed
for errors anyway. For example, error handling just needs some very simple
error buffers that can be straightforwardly implemented using kfifos (I did
that already, in fact). That's just a few lines; all the functionality
in kernel/perf/* is not really needed.
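
As an illustration of that kind of buffer (a sketch only, using the
type-aware kfifo macros of recent kernels - the 2.6.33-era macro forms
differ - and with made-up names; this is not the code referred to above):

#include <linux/kernel.h>
#include <linux/kfifo.h>
#include <asm/mce.h>

/*
 * A single producer (the poll handler) and a single consumer need no
 * locking with kfifo; multiple polling CPUs would need a lock or
 * per-CPU fifos.  The size must be a power of two.
 */
static DEFINE_KFIFO(ce_fifo, struct mce, 64);
static unsigned long ce_dropped;

/* Called from the machine check poll path with a filled-in record. */
static void ce_queue(struct mce *m)
{
        if (!kfifo_put(&ce_fifo, *m))   /* returns 0 when the fifo is full */
                ce_dropped++;
}

/* Called from process context (work queue, read() handler, ...). */
static void ce_drain(void (*report)(struct mce *))
{
        struct mce rec;

        while (kfifo_get(&ce_fifo, &rec))
                report(&rec);
}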

There's no good way to throttle events per source, as is needed
for errors.

EVENT INTERFACE III:

Then one of the current issues with mcelog is that it's not straightforward
to add variable-length record types with typing etc. This isn't too
big a problem for MCEs (although the DIMM error reporting would have been
slightly nicer with it) but for some other types of errors it's a bigger
issue.

Now the funny thing is (and I keep waiting for Ingo to figure that out :-):
the perf record format has exactly the same problem as mcelog in this regard.
It's an untyped binary format where it's only possible to add something
at the end, not a fully extensible typed format with sub-records etc.

A better match would be either netlink with its sub-records
(although for various other reasons I don't think it's the best model either)
or the ASCII-based udev sysfs interfaces.

In fact that is what Ingo asked for some time ago (before he
moved to the "everything must be perf" model). He wanted an ASCII
interface (so more like the udev model). I'm not completely happy with
that either, but it's probably still one of the better models and could be made
to work.

It's definitely not perf though.

> year. You are refusing to work with other people on a well designed

First, I work with a lot of people on error handling, even if you're
not always on Cc.

We would need to agree to disagree on EDAC being a "well designed
solution". IMHO it has a lot of problems (not just in my opinion -
if you read some of the mails, e.g. from Borislav, he's stating the same)
and it's definitely not the general framework you're asking for.
In fact, in many ways EDAC is far more specialized to some specific subset of
errors than mcelog.

A generic error framework (that would be neither EDAC nor perf nor
mcelog on the interface level) could probably be done and I have
some ideas on how to do that properly (e.g. see the link below),
but it's not a short term project. It needs a lot of design
work to be done properly and would also likely need to evolve
for some time. It would also need a suitable user level infrastructure,
which is actually a larger project than the kernel interfaces.

The patch above was simply intended to solve a specific problem on a specific
chip. I don't claim the interface is the best I ever did (definitely not),
but at least it solves an existing problem in a relatively straightforward
way, and I claim there's no clearly better solution with today's infrastructure.

How do you suggest solving the DIMM error reporting in the short
term (let's say the 2.6.34/35 time frame, without major redesigns)?

-Andi

References:
- Thoughts on future error handling model:
http://halobates.de/plumbers-error.pdf
- mcelog kernel and userland design today:
http://halobates.de/mce-lc09-2.pdf

--
[email protected] -- Speaking for myself only.

Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

On Fri, Feb 19, 2010 at 01:17:34PM +0100, Andi Kleen wrote:

<snip all the perf analysis>

I think you're missing the point - it doesn't have to be
perf. It could just as well be some other tool which _shares_
functionality with perf. See this last mail from Ingo:
http://marc.info/?l=linux-kernel&m=126635420607236 on which you were
also CCed, by the way, where he suggests that we could have a tool
called 'hw' which reuses perf functionality but concentrates on
error handling and all that RAS functionality in modern CPUs. It should
also have a daemon component etc...

> > year. You are refusing to work with other people on a well designed

Sorry, but from our last discussion on attempting to work towards such
an infrastructure solution I got the same impression as Thomas and Ingo
that you're simply not willing to work together on getting a real thing
done. That's why I stopped bothering - it simply made no sense to me to
waste time in fruitless discussions.

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

2010-02-19 13:21:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

Borislav,

> <snip all the perf analysis>
>
> I think you're missing the point - it doesn't have to be
> perf. It could just as well be some other tool which _shares_

That was one of my points, but there were others too, about
the suitability of the kernel infrastructure and the interfaces.

I don't think the perf syscall interface (nor the internal
implementation) in its current form is good for errors, and I also see no
clear path to make it so (without basically adding another, different interface).

> functionality with perf. See this last mail from Ingo:
> http://marc.info/?l=linux-kernel&m=126635420607236 on which you were
> also CCed, by the way, where he suggests that we could have a tool
> called 'hw' which reuses functionality with perf but concentrates on
> error handling and all that RAS functionality in modern CPUs. It should
> also have a daemon component etc...

So you would have different interfaces: you don't really
need new syscalls for this (read/write is fine) and perf_event_open()
in its current form is completely unsuitable for errors
(unless your error reporting registers look like an x86 PMU --
mine don't, at least). And different userland. And the kernel internal
requirements are very different too (see previous email).

Where is the commonality with perf?

On the tool side:

I've been working on such a tool for quite some time already. It's
called mcelog. It uses an older interface today, but at some
point I would expect it to move to other interfaces too
(e.g. the next step for that would be APEI errors).

If you only knew mcelog from a few years ago: it's quite
different today than it was, so please look again.

That is, the end result will likely be called something different
(it doesn't make much sense to call something that handles
all kinds of errors "mcelog") and also some stuff needs
to be more generic, but I suspect it'll share quite a few
concepts.

If the only problem is the naming, we can probably work something
out? In principle it could be called "hw", but the name
seems awfully generic, especially for a daemon. I was tending
more towards something like "errord" or so.

On the topology: I was not trying to replace existing
topology tools (like lscpu, lspci etc.). I don't see any
major problems (apart from some details that don't
deserve a redesign) with them.

>
>>> year. You are refusing to work with other people on a well designed
>
> Sorry, but from our last discussion on attempting to work towards such
> an infrastructure solution I got the same impression as Thomas and Ingo
> that you're simply not willing to work together on getting a real thing
> done. That's why I stopped bothering - it simply made no sense to me to
> waste time in fruitless discussions.

Well I keep ignoring suggestions to put more stuff into EDAC,
mostly because I think the EDAC design needs to be thrown out
instead of extended. Are you referring to that?

My impression was that you got to the same conclusion (at least
for parts of current EDAC like the events) based on your earlier emails.

The current issue is less in enumeration/topology anyways but more
in event handling I would say. In the end topology/enumeration is
the easier part, and most of it is already working quite well.

I'm trying to do things step by step, including for short term
problems extending current interfaces if possible and then longer
term moving to new better interfaces.

-Andi

2010-02-19 15:18:48

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

Andi,

Andi Kleen wrote:
> Borislav,
>>>> year. You are refusing to work with other people on a well designed
>>
>> Sorry, but from our last discussion on attempting to work towards such
>> an infrastructure solution I got the same impression as Thomas and Ingo
>> that you're simply not willing to work together on getting a real thing
>> done. That's why I stopped bothering - it simply made no sense to me to
>> waste time in fruitless discussions.
>
> Well I keep ignoring suggestions to put more stuff into EDAC,
> mostly because I think the EDAC design needs to be thrown out
> instead of extended. Are you referring to that?
>
> My impression was that you got to the same conclusion (at least
> for parts of current EDAC like the events) based on your earlier emails.
>
> The current issue is less in enumeration/topology anyways but more
> in event handling I would say. In the end topology/enumeration is
> the easier part, and most of it is already working quite well.
>
> I'm trying to do things step by step, including for short term
> problems extending current interfaces if possible and then longer
> term moving to new better interfaces.

Ingo and Thomas are right: we need to work together to improve
error reporting, and this means that we shouldn't ignore putting more stuff
into EDAC.

EDAC is generic enough to work with different types of memory and memory
controllers, and to provide a consistent interface to describe them in a way
that userspace doesn't need to know what the error registers in
the hardware are, nor how to decode a "magic" error number into something
that has a meaning.

As Boris properly pointed out, EDAC has room for improvement, and part of
the perf logic can be used as a starting point to bring in some fresh ideas.

The main issue I see with MCE is at the interface level. I think if we
all work together, we can converge on a proper interface that will
be accepted upstream.

--

Cheers,
Mauro

2010-02-19 15:37:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI


>
> EDAC is generic enough to work with different type of memory and memory
> controllers, and to provide a consistent interface to describe it on a way
> that userspace doesn't need to know what are the error registers at
> the hardware, nor how to decode a "magic" error number into something
> that has a meaning.

Well, the main problem I have with EDAC is that it has far too much information
(e.g. down to ranks/banks), and also too much information on
the internal topology of the memory controller - and it can't even
express some current designs.

To me it looks like it was designed by someone staring at a motherboard/DIMM
schematic, and I don't think that's the right level to think about
these things.

Going that deep typically
requires very hardware-specific information and in some cases
it's not even possible. I also don't think it's useful information
to present (and it's really the opposite of "abstraction").

I also have yet to see a useful use case where you need to look "inside" a DIMM
on the reporting level. The useful level is typically the "FRU" (something
you can replace), with only some very specific extensions for special
use cases.

There's also no generic way to do the necessary enumeration down to
the level EDAC needs. For some cases hardware-specific drivers can be written,
but it's always better if the generic case works in an architectural way.

Then it does all the enumeration in the kernel, but there
are no useful facilities to sync that with a user level representation.
And most of the useful advanced & interesting RAS features I'm interested in
need user level support.

I prefer at least for MCE to stay on the architectural level
with only minor extensions for specific use cases.

Now, to address these problems, you could throw large parts of EDAC
out (do you mean that with 'flexible enough'?) and then add an actual
event interface (working on the latter is my plan).

> As Boris properly pointed out, EDAC has room for improvement, and part of
> the perf logic can be used as a starting point to bring in some fresh ideas.

See my analysis several mails up. Which parts of perf do you want
to actually use? I don't see any that are directly usable
without major changes.

> The main issue I see with MCE is at the interface level. I think if we
> all work together, we can converge on a proper interface that will
> be accepted upstream.

Just so we're on the same level, could you spell out in detail
what problems you're seeing with it?

[I'm not claiming there are none, I'm just curious what you think they are]

-Andi

2010-02-19 15:48:54

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

Andi,

On Fri, 19 Feb 2010, Andi Kleen wrote:
> > and integrate it
> > into perf as the suitable event logging mechanism.
>
> The main reason I didn't react to that proposal is
> I don't see a clear path to make perf a good error mechanism.

Thanks for confirming the main problem here:

_You_ do not react, because _you_ do not see a clear path.

So _you_ ignore any request for a major cleanup of that whole MCE mess
and just keep on tinkering with the existing crap.

And you really expect us to tolerate your ignorance forever and just
apply happily more crap to something which should have never been
merged in the first place.

<snip>

> The patch above was simply intended to solve a specific problem on a specific
> chip. I don't claim the interface is the best I ever did (definitely not),
> but at least it solves an existing problem in a relatively straight forward
> way and I claim there's no clear better solution with today's infrastructure.

The patch does not solve a specific problem. It merely adds extra info
retrieved from this specific chip. This is simply a new feature which
you hacked into the existing mce code.

There will always be a next generation chip with a specific new
feature, and it does _NOT_ matter at all whether support for such a new
feature is straightforward in your opinion or not. Straightforward
crap is still crap.

The whole point is that we are no longer accepting wacky add-ons to
MCE w/o seeing any collaborative reaction from your side to our
technical requests to fix up the whole thing for real.

You can point me to as many PDFs as you want. I'm not interested in
stuff you are breeding on your own w/o talking to EDAC folks and
trying to address the concerns which were raised by Ingo, myself and
others.

Thanks,

tglx

2010-02-20 00:14:42

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

I was thinking about a way we could work on the EDAC/MCE issues and
move ahead. My proposal is to organize an EDAC/MCE BoF session
or a mini-summit during the Collaboration Summit in San Francisco, with
the interested parties. I suspect that some of us will be there already.

It shouldn't be hard to find some place there for us to take a look at the
EDAC architecture and come up with a proposal.

As today is the last day for the CFP, I've also submitted a proposal there for
a panel. If approved, we can use it to collect data from hardware error users
(sysops and other users that require high availability for their services),
for us to discuss some strategies to address the issue or to summarize what
will be discussed at the event.

Comments?

Cheers,
Mauro

2010-02-20 09:01:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

On Fri, Feb 19, 2010 at 10:14:17PM -0200, Mauro Carvalho Chehab wrote:

Mauro,

> I was thinking on a way where we could work with the EDAC/MCE issues, and
> a way for us to move ahead. My proposal is to organize an EDAC/MCE BoF session
> or a mini-summit during the Collaboration Summit, in San Francisco, with
> the interested parties. I suspect that some of us will be there already.

I didn't plan to be there so far. A BoF is probably a good idea,
and also looking closely at EDAC together,
but it would be better at some more kernel-focussed conference.

If everyone else is at that summit I can try to come,
but it would likely be difficult.

We could probably do some kind of online BoF on a shorter time frame
(e.g. using some chat setup or on the phone).


> It shouldn't be hard to find some place there for us to take a look at the
> EDAC architecture and come with some proposal.

Proposal for what, exactly?

Is this for an event interface or a topology interface, or both,
or something else entirely?

My personal plan so far was to work on the APEI interface
and then possibly look at migrating MCE to that infrastructure too,
while updating mcelog to talk to it. This would be mostly
addressing events so far.

> As today is the last day for CFP, I've also submitted there a proposal for a
> panel. If approved, we can use it to collect data from hardware error users
> (sysops and other users that require high availability on their services),
> for us to discuss some strategies to address the issue or to summarize what
> will be discussed on the event.

Do you want to collect error rates or collect use cases?

For a serious collection, doing it online would probably
give better coverage.

I do both to some degree already.

-Andi

--
[email protected] -- Speaking for myself only.

2010-02-20 10:15:21

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI


----- "Andi Kleen" <[email protected]> escreveu:

> On Fri, Feb 19, 2010 at 10:14:17PM -0200, Mauro Carvalho Chehab
> wrote:
>
> Mauro,
>
> > I was thinking on a way where we could work with the EDAC/MCE
> issues, and
> > a way for us to move ahead. My proposal is to organize an EDAC/MCE
> BoF session
> > or a mini-summit during the Collaboration Summit, in San Francisco,
> with
> > the interested parties. I suspect that some of us will be there
> already.
>
> I didn't plan to be there so far. A BoF is probably a good idea,
> and also looking closely at EDAC together,
> but it would be better at some more kernel focussed conference.
>
> If everyone else is at that summit I can try to come,
> but it would be likely difficult.

I'm proposing the Collaboration Summit due to its date: it will
happen in April. The next kernel conference is a longer time away.

> We could probably do some kind of online BoF on a shorter time frame
> (e.g. using some chat setup or on the phone)

We may try to do some discussions via chat before it, but I still
think that having us all in the same room with a whiteboard
will work better.

> > It shouldn't be hard to find some place there for us to take a look
> at the
> > EDAC architecture and come with some proposal.
>
> Proposal for what exactly?
>
> Is this for a event interface or for a topology interface or both
> or something else entirely?

We should define the topics; I think we should discuss both.
I agree that it is better to represent the hardware per FRU. So,
maybe we can find a better topology representation.

> My personal plan so far was to work on the APEI interface
> and then possibly look at migrating MCE to that infrastructure too,
> while updating mcelog to talk to it. This would be mostly
> addressing events so far.

It would be better to have some discussions before using APEI or any other
interface for it. We should take into account that hardware errors
should be presented in a consistent way, not only for newer processors
and memory controllers but also for the already supported ones.

> > As today is the last day for CFP, I've also submitted there a
> proposal for a
> > panel. If approved, we can use it to collect data from hardware
> error users
> > (sysops and other users that require high availability on their
> services),
> > for us to discuss some strategies to address the issue or to
> summarize what
> > will be discussed on the event.
>
> You want to collect error rates or want to collect use cases?
>
> For a serious collection doing it online would probably
> give better coverage.
>
> I do both to some degree already.

Both kinds of data collection are important (and I also do them to some degree),
and we can do them via other means, but a face-to-face meeting may
help to validate any ideas we may have about improvements/changes
to the interfaces and features.

If we succeed in having such discussions at the planning phase, this will
save us a lot of development time and will help us get an easier
upstream adoption.

So, I think it is worth a try.

Cheers,
Mauro

2010-02-22 07:39:06

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

(2010/02/17 7:29), Andi Kleen wrote:
> This is for a different chip (xeon75xx)
> which has a completely different memory subsystem
> and reports memory errors in a completely different way
> than xeon75xx/core_i7.
>
> For core_i7/xeon55xx there is no additional event interface needed;
> it's all supplied by the hardware on the existing interfaces.
>
> The point of this code is to annotate the CE events on Xeon 75xx
> and to implement specific backend actions (page offlining, triggers)
> based on specific events. These backend actions are already implemented
> on 55xx without additional changes (no need for EDAC)

If my understanding is correct, your patch doesn't interact with the xeon75xx
processor itself, but with the associated chip (the I/O hub, aka Boxboro,
which you abbreviated to BXB), at least at first.

Then, since the MCE code contains rather "generic, processor oriented" stuff
while the EDAC code contains relatively "specific, chipset oriented" stuff,
people can suppose that this "arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c"
could be implemented in a different way and the file could have a different
name like "drivers/edac/i7500_edac.c" or so.

However, the real issue is that I couldn't figure out why this code
requires a new hook in the polling handler, what PFA is, and what kind of
restriction led you to do so. You noted that "The act of retrieving
the DIMM/PA information can take some time" but I'm not sure it is safe
even if the poll handler is called via CMCI.

Anyway, Andi, could you point to the proper specification or datasheet so we
can know/check what you are going to do here?
Otherwise I cannot distinguish your work from black magic...


Thanks,
H.Seto


2010-02-22 08:28:52

by Borislav Petkov

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

From: Ingo Molnar <[email protected]>
Date: Tue, Feb 16, 2010 at 10:02:15PM +0100
Hi,

> I like it.
>
> You can do it as a 'perf hw' subcommand - or start off a fork as the 'hw'
> utility, if you'd like to maintain it separately. It would have a daemon
> component as well, to receive and log hardware events continuously, to
> trigger policy action, etc.
>
> I'd suggest you start to do it in small steps, always having something that
> works - and extend it gradually.

I had the chance to meditate over the weekend a bit more on the whole
RAS thing after rereading all the discussion points more carefully.
Here are some aspects I think are important which I'd like to drop here
rather sooner than later so that we're in sync and don't waste time
implementing the wrong stuff:

* Critical errors: we need to switch to a console and dump the decoded error
there at least, before panicking. Nowadays, almost everyone has a camera
with which that information can be extracted from the screen. I'm afraid
we won't be able to send the error over a network since climbing up the
TCP stack takes relatively long and we cannot risk error propagation...?
We could try to do it on a core which is not affected by the error
though, as a last step in the sequence...

I think this is much more user-friendly than the current panicking,
which is never seen when running X except when the user has a
serial/netconsole sending to some other machine.

All other not-that-critical errors are copied to userspace over an
mmapped buffer and then the userspace daemon is poked with a uevent
to dump the error/signal it over the network/parse its contents and do policy
stuff.

* receive commands by syscall, also for hw config: I like the idea
of sending commands to the kernel over a syscall, we can reuse perf
functionality here and make those reused bits generic.

* do not bind to error format etc: not a big fan of slaving to an error
format - just dump error info into the buffer and let userspace format
it. We can do the formatting if we absolutely have to.

* can also configure hw: The tool can also send commands over the
syscall to configure certain aspects of the hardware, like:

- disable L3 cache indices which are faulty
- enable/disable MCE error sources: toggle MCi_CTL, MCi_CTL_MASK bits
- disable whole DIMMs: F2x[1, 0][5C:40][CSEnable]
- control ECC checking
- enable/disable powering down of DRAM regions for power savings
- set memory clock frequency
- some other relevant aspects of hw/CPU configuration

* keep all info in sysfs so that no tool is needed for accessing it,
similar to ftrace: All knobs needed for user interaction should appear
redundantly as sysfs files/dirs so that configuration/query can be done
"by hand" even when the hw tool is missing

* gradually move pieces of RAS code into kernel proper: important
codepaths/aspects from the HW which are being queried often (e.g., DIMM
population and config) should be moved gradually into the kernel proper.


Anyway, this is by no means complete and still as alpha as it can
be. However, I'd like to discuss it as early as possible and in small,
incremental steps, omitting trial and error as much as possible. So,
feel free to throw all your crazy ideas at me and correct (or kill) all
those crappy points above.

Thanks.

--
Regards/Gruss,
Boris.

2010-02-22 09:48:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll


* Borislav Petkov <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Tue, Feb 16, 2010 at 10:02:15PM +0100
> Hi,
>
> > I like it.
> >
> > You can do it as a 'perf hw' subcommand - or start off a fork as the 'hw'
> > utility, if you'd like to maintain it separately. It would have a daemon
> > component as well, to receive and log hardware events continuously, to
> > trigger policy action, etc.
> >
> > I'd suggest you start to do it in small steps, always having something that
> > works - and extend it gradually.
>
> I had the chance to meditate over the weekend a bit more on the whole
> RAS thing after rereading all the discussion points more carefully.
> Here are some aspects I think are important which I'd like to drop here
> rather sooner than later so that we're in sync and don't waste time
> implementing the wrong stuff:
>
> * Critical errors: we need to switch to a console and dump decoded error
> there at least, before panicking. Nowadays, almost everyone has a camera
> with which that information can be extracted from the screen. I'm afraid we
> won't be able to send the error over a network since climbing up the TCP
> stack takes relatively long and we cannot risk error propagation...? We
> could try to do it on a core which is not affected by the error though as a
> last step in the sequence...
>
> I think this is much more user-friendly than the current panicking which is
> never seen when running X except when the user has a serial/netconsole
> sending to some other machine.

Yep.

> All other non-that-critical errors are copied to userspace over a mmapped
> buffer and then the uspace daemon is being poked with a uevent to dump the
> error/signal over network/parse its contents and do policy stuff.

If you use perf here you get the events and can poll() the event channel.
User-space can decide which events to listen in on. uevent/user-notifier is a
bit clumsy for that.

> * receive commands by syscall, also for hw config: I like the idea of
> sending commands to the kernel over a syscall, we can reuse perf
> functionality here and make those reused bits generic.
>
> * do not bind to error format etc: not a big fan of slaving to an error
> format - just dump error info into the buffer and let userspace format it.
> We can do the formatting if we absolutely have to.


If you use perf and tracepoints to shape the event log format, then this is all
taken care of already: you get structured event format descriptors in
/debug/tracing/events/*. For example, there's already an MCE tracepoint in the
upstream kernel today (for thermal events):

phoenix:/home/mingo> cat /debug/tracing/events/mce/mce_record/format
name: mce_record
ID: 28
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_lock_depth; offset:8; size:4; signed:1;

field:u64 mcgcap; offset:16; size:8; signed:0;
field:u64 mcgstatus; offset:24; size:8; signed:0;
field:u8 bank; offset:32; size:1; signed:0;
field:u64 status; offset:40; size:8; signed:0;
field:u64 addr; offset:48; size:8; signed:0;
field:u64 misc; offset:56; size:8; signed:0;
field:u64 ip; offset:64; size:8; signed:0;
field:u8 cs; offset:72; size:1; signed:0;
field:u64 tsc; offset:80; size:8; signed:0;
field:u64 walltime; offset:88; size:8; signed:0;
field:u32 cpu; offset:96; size:4; signed:0;
field:u32 cpuid; offset:100; size:4; signed:0;
field:u32 apicid; offset:104; size:4; signed:0;
field:u32 socketid; offset:108; size:4; signed:0;
field:u8 cpuvendor; offset:112; size:1; signed:0;

print fmt: "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", REC->cpu, REC->mcgcap, REC->mcgstatus, REC->bank, REC->status, REC->addr, REC->misc, REC->cs, REC->ip, REC->tsc, REC->cpuvendor, REC->cpuid, REC->walltime, REC->socketid, REC->apicid

tools/perf/util/trace-event-parse.c contains the above structured format
descriptor parsing code, and can turn it into records that you can read out
from C code - and provides all sorts of standard functionality over it.
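
For illustration, a toy parser for that format file might look as follows (a
sketch only, not the trace-event-parse.c code; the /debug mount point follows
the example above, while many systems use /sys/kernel/debug instead):

/*
 * Toy parser for the format descriptor shown above: pick the "field:"
 * lines out of the mce_record format file so a RAS tool can discover
 * field offsets/sizes instead of hard-coding a record layout.
 */
#include <stdio.h>
#include <string.h>

struct field {
        char name[64];
        int offset;
        int size;
        int is_signed;
};

static int parse_format(const char *path, struct field *f, int max)
{
        char line[256];
        int n = 0;
        FILE *fp = fopen(path, "r");

        if (!fp)
                return -1;
        while (n < max && fgets(line, sizeof(line), fp)) {
                char decl[128];

                /* Lines look like:
                 *   field:u64 addr; offset:48; size:8; signed:0;
                 */
                if (sscanf(line,
                           " field:%127[^;]; offset:%d; size:%d; signed:%d;",
                           decl, &f[n].offset, &f[n].size,
                           &f[n].is_signed) != 4)
                        continue;
                /* The field name is the last word of the declaration. */
                snprintf(f[n].name, sizeof(f[n].name), "%s",
                         strrchr(decl, ' ') ? strrchr(decl, ' ') + 1 : decl);
                n++;
        }
        fclose(fp);
        return n;
}

int main(void)
{
        struct field f[32];
        int i, n;

        n = parse_format("/debug/tracing/events/mce/mce_record/format", f, 32);
        for (i = 0; i < n; i++)
                printf("%-16s offset=%-3d size=%d\n",
                       f[i].name, f[i].offset, f[i].size);
        return n < 0;
}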

I'd strongly suggest to reuse that - we _really_ want health monitoring and
general system performance monitoring to share a single facility: as they are
both one and the same thing, just from different viewpoints.

In other words: 'system component failure' is another metric of 'system
performance', so there's strong synergies all around.

> * can also configure hw: The tool can also send commands over the syscall to
> configure certain aspects of the hardware, like:
>
> - disable L3 cache indices which are faulty
> - enable/disable MCE error sources: toggle MCi_CTL, MCi_CTL_MASK bits
> - disable whole DIMMs: F2x[1, 0][5C:40][CSEnable]
> - control ECC checking
> - enable/disable powering down of DRAM regions for power savings
> - set memory clock frequency
> - some other relevant aspects of hw/CPU configuration

Once the hardware's structure is enumerated (into a tree/hierarchy), and events
are attached to individual components, then 'commands' are the next logical
step: they are methods of a given component/object.

One such method could be 'injection' functionality, btw: to simulate rare
hardware failures and to make sure policy logic is ready for all
eventualities.

But ... while that is clearly the 'big grand' end goal, the panacea of RAS
design, I'd suggest to start with a small but useful base and pick up low
hanging fruit - then work towards this end goal. This is how perf is
developed/maintained as well.

So I'd suggest to start with _something_ that other people can try and have a
look at and extend, for example something that replaces basic mcelog
functionality. That alone should be fairly easy and immediately gives it a
short-term purpose. It would also be highly beneficial to the x86 code to get
rid of the mcelog abomination.

> * keep all info in sysfs so that no tool is needed for accessing it,
> similar to ftrace: All knobs needed for user interaction should appear
> redundantly as sysfs files/dirs so that configuration/query can be done
> "by hand" even when the hw tool is missing

Please share this code with perf. Profiling needs the same kind of 'hardware
structure' enumeration - combined with 'software component enumeration'.

Currently we have that info in /debug/tracing/events/. Some hw structure is in
there as well, but not much - most of it is kernel subsystem event structure.

sysfs would be an option, but IMO it's even better to put ftrace's
/debug/tracing/events/ hierarchy into a separate eventfs - and extend it with
'hardware structure' details.

This would not only crystallise the RAS purpose, but would nicely extend perf
as well. With every hardware component you add from the RAS angle we'd get new
events for tracing/profiling use as well - and vice versa. There's no reason
why RAS should be limited to hw component failure events: a RAS policy action
could be defined over OOM events too, for example, or over checksum failures in
network packets - etc.

RAS is not just about hardware, and profiling isn't just about software. We
want event logging to be a unified design - there are big advantages to that.

So please go for an integrated design. The easiest and most useful way for
that would be to factor out /debug/tracing/events/ into /eventfs.

> * gradually move pieces of RAS code into kernel proper: important
> codepaths/aspects from the HW which are being queried often (e.g., DIMM
> population and config) should be moved gradually into the kernel proper.

Yeah. Good plans.

Ingo

2010-02-22 12:04:24

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

Ingo Molnar wrote:
> * Borislav Petkov <[email protected]> wrote:
>
>> From: Ingo Molnar <[email protected]>
>> Date: Tue, Feb 16, 2010 at 10:02:15PM +0100
>> Hi,
>>
>>> I like it.
>>>
>>> You can do it as a 'perf hw' subcommand - or start off a fork as the 'hw'
>>> utility, if you'd like to maintain it separately. It would have a daemon
>>> component as well, to receive and log hardware events continuously, to
>>> trigger policy action, etc.
>>>
>>> I'd suggest you start to do it in small steps, always having something that
>>> works - and extend it gradually.
>> I had the chance to meditate over the weekend a bit more on the whole
>> RAS thing after rereading all the discussion points more carefully.
>> Here are some aspects I think are important which I'd like to drop here
>> rather sooner than later so that we're in sync and don't waste time
>> implementing the wrong stuff:
>>
>> * Critical errors: we need to switch to a console and dump decoded error
>> there at least, before panicking. Nowadays, almost everyone has a camera
>> with which that information can be extracted from the screen. I'm afraid we
>> won't be able to send the error over a network since climbing up the TCP
>> stack takes relatively long and we cannot risk error propagation...? We
>> could try to do it on a core which is not affected by the error though as a
>> last step in the sequence...
>>
>> I think this is much more user-friendly than the current panicking which is
>> never seen when running X except when the user has a serial/netconsole
>> sending to some other machine.
>
> Yep.

If the user has set up kexec/kdump, currently the error is dumped
to the crash log on the local machine or to another machine, depending on
the config.

I like the idea of switching to the console before panicking, but this is
sometimes problematic, as there are some video drivers with which switching to
the console doesn't always work. Also, this could interfere with the kdump
setup, if configured. So, if kdump or serial/netconsole is enabled, it is
better not to try to switch the video mode. So, I think it is better to allow
userspace to select one mode or the other.

>> All other non-that-critical errors are copied to userspace over a mmapped
>> buffer and then the uspace daemon is being poked with a uevent to dump the
>> error/signal over network/parse its contents and do policy stuff.

It seems interesting, but if the userspace daemon is not running, it should
fall back to writing the errors via dmesg.

> If you use perf here you get the events and can poll() the event channel.
> User-space can decide which events to listen in on. uevent/user-notifier is a
> bit clumsy for that.
>
>> * receive commands by syscall, also for hw config: I like the idea of
>> sending commands to the kernel over a syscall, we can reuse perf
>> functionality here and make those reused bits generic.

Why? I think we should keep using sysfs for hw config, and it seems that you
also agree. Sysfs works fine and fits nicely for enumerating hierarchical data.
What's the rationale for also adding a syscall API?

The EDAC data model needs some discussion, as, currently, the memory is
represented per csrow, and modern MCUs don't allow that level of control (and
it doesn't make much sense to represent it this way, as you can't replace a
csrow). It would be better to use the DIMM as the minimum unit.
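
To make the current csrow-centric layout concrete, here is a small sketch that
walks the existing EDAC sysfs tree (the attribute names are the usual csrow
ones exposed by today's drivers; the single-controller assumption and fixed
loop bounds are just for brevity):

/*
 * Illustration of the csrow-based view EDAC exposes today: walk
 * /sys/devices/system/edac/mc/mc0/csrow*/ and print labels and CE counts.
 */
#include <stdio.h>

static int read_str(const char *path, char *buf, size_t len)
{
        FILE *f = fopen(path, "r");

        if (!f)
                return -1;
        if (!fgets(buf, len, f)) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return 0;
}

int main(void)
{
        char path[128], label[64], count[32];
        int row, ch;

        for (row = 0; row < 8; row++) {
                snprintf(path, sizeof(path),
                         "/sys/devices/system/edac/mc/mc0/csrow%d/ce_count",
                         row);
                if (read_str(path, count, sizeof(count)))
                        break;          /* no more chip-select rows */
                printf("csrow%d: ce_count %s", row, count);

                for (ch = 0; ch < 2; ch++) {
                        snprintf(path, sizeof(path),
                                 "/sys/devices/system/edac/mc/mc0/csrow%d/ch%d_dimm_label",
                                 row, ch);
                        if (!read_str(path, label, sizeof(label)))
                                printf("  ch%d label: %s", ch, label);
                }
        }
        return 0;
}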

>>
>> * do not bind to error format etc: not a big fan of slaving to an error
>> format - just dump error info into the buffer and let userspace format it.
>> We can do the formatting if we absolutely have to.

The format code is needed for critical errors anyway.

> If you use perf and tracepoints to shape the event log format then this is all
> taken care of already, you get structured event format descriptors in
> /debug/tracing/events/*. For example there's already an MCE tracepoint in the
> upstream kernel today (for thermal events):
>
> phoenix:/home/mingo> cat /debug/tracing/events/mce/mce_record/format
> name: mce_record
> ID: 28
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
> field:int common_lock_depth; offset:8; size:4; signed:1;
>
> field:u64 mcgcap; offset:16; size:8; signed:0;
> field:u64 mcgstatus; offset:24; size:8; signed:0;
> field:u8 bank; offset:32; size:1; signed:0;
> field:u64 status; offset:40; size:8; signed:0;
> field:u64 addr; offset:48; size:8; signed:0;
> field:u64 misc; offset:56; size:8; signed:0;
> field:u64 ip; offset:64; size:8; signed:0;
> field:u8 cs; offset:72; size:1; signed:0;
> field:u64 tsc; offset:80; size:8; signed:0;
> field:u64 walltime; offset:88; size:8; signed:0;
> field:u32 cpu; offset:96; size:4; signed:0;
> field:u32 cpuid; offset:100; size:4; signed:0;
> field:u32 apicid; offset:104; size:4; signed:0;
> field:u32 socketid; offset:108; size:4; signed:0;
> field:u8 cpuvendor; offset:112; size:1; signed:0;
>
> print fmt: "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", REC->cpu, REC->mcgcap, REC->mcgstatus, REC->bank, REC->status, REC->addr, REC->misc, REC->cs, REC->ip, REC->tsc, REC->cpuvendor, REC->cpuid, REC->walltime, REC->socketid, REC->apicid
>
> tools/perf/util/trace-event-parse.c contains the above structured format
> descriptor parsing code, and can turn it into records that you can read out
> from C code - and provides all sorts of standard functionality over it.
>
> I'd strongly suggest to reuse that - we _really_ want health monitoring and
> general system performance monitoring to share a single facility: as they are
> both one and the same thing, just from different viewpoints.
>
> In other words: 'system component failure' is another metric of 'system
> performance', so there's strong synergies all around.

Agreed.

>
>> * can also configure hw: The tool can also send commands over the syscall to
>> configure certain aspects of the hardware, like:
>>
>> - disable L3 cache indices which are faulty
>> - enable/disable MCE error sources: toggle MCi_CTL, MCi_CTL_MASK bits
>> - disable whole DIMMs: F2x[1, 0][5C:40][CSEnable]
>> - control ECC checking
>> - enable/disable powering down of DRAM regions for power savings
>> - set memory clock frequency
>> - some other relevant aspects of hw/CPU configuration

I agree that configuring the hw is interesting, though I still think that we
should use sysfs for it.

> Once the hardware's structure is enumerated (into a tree/hiearchy), and events
> are attached to individual components, then 'commands' are the next logical
> step: they are methods of a given component/object.
>
> One such method could be 'injection' functionality btw: to simulate rare
> hardware failures and to make sure policy logic is ready for all
> eventualities.

The i7core, amd64 and a few other EDAC drivers already implement memory
error injection via sysfs. The point with error injection is that this
feature is hardware dependent.

So, while you can use the same hierarchy/tree for hw description, things like
error injection will require a specific hierarchy. That's one of the reasons
why I think we should keep using sysfs: it can easily be used to represent
data that is hardware dependent.

The way I mapped this on i7core_edac is that I kept using the standard EDAC
sysfs hierarchy for memory, and added generic code that allows describing
the error injection bits found in Nehalem.
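
To make that concrete, a minimal sketch (not the actual i7core_edac code;
the attribute name and backing variable are made up) of how one such
injection knob can be exposed as a plain sysfs device attribute:

#include <linux/device.h>
#include <linux/kernel.h>

static long inject_channel = -1;	/* -1 == match any channel */

static ssize_t inject_channel_show(struct device *dev,
				   struct device_attribute *attr, char *buf)
{
	return sprintf(buf, "%ld\n", inject_channel);
}

static ssize_t inject_channel_store(struct device *dev,
				    struct device_attribute *attr,
				    const char *buf, size_t count)
{
	long val;

	if (strict_strtol(buf, 10, &val))
		return -EINVAL;
	/* a real driver would program the MC injection registers here */
	inject_channel = val;
	return count;
}

/* registered with device_create_file(dev, &dev_attr_inject_channel) */
static DEVICE_ATTR(inject_channel, 0644, inject_channel_show,
		   inject_channel_store);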

> But ... while that is clearly the 'big grand' end goal, the panacea of RAS
> design, i'd suggest to start with a small but useful base and pick up low
> hanging fruits - then work towards this end goal. This is how perf is
> developed/maintained as well.
>
> So i'd suggest to start with _something_ that other people can try and have a
> look at and extend, for example something that replaces basic mcelog
> functionality. That alone should be fairly easy and immediately gives it a
> short-term purpose. It would also be highly beneficial to the x86 code to get
> rid of the mcelog abomination.
>
>> * keep all info in sysfs so that no tool is needed for accessing it,
>> similar to ftrace: All knobs needed for user interaction should appear
>> redundantly as sysfs files/dirs so that configuration/query can be done
>> "by hand" even when the hw tool is missing
>
> Please share this code with perf. Profiling needs the same kind of 'hardware
> structure' enumeration - combined with 'software component enumeration'.
>
> Currently we have that info in /debug/tracing/events/. Some hw structure is in
> there as well, but not much - most of it is kernel subsystem event structure.
>
> sysfs would be an option but IMO it's even better to put ftrace's
> /debug/tracing/events/ hierarchy into a separate eventfs - and extend it with
> 'hardware structure' details.
>
> This would not only crystalise the RAS purpose, but would nicely extend perf
> as well. With every hardware component you add from the RAS angle we'd get new
> events for tracing/profiling use as well - and vice versa. There's no reason
> why RAS should be limited to hw component failure events: a RAS policy action
> could be defined over OOM events too for example, or over checksum failures in
> network packets - etc.
>
> RAS is not just about hardware, and profiling isn't just about software. We
> want event logging to be a unified design - there's big advantages to that.
>
> So please go for an integrated design. The easiest and most useful way for
> that would be to factor out /debug/tracing/events/ into /eventfs.
>
>> * gradually move pieces of RAS code into kernel proper: important
>> codepaths/aspects from the HW which are being queried often (e.g., DIMM
>> population and config) should be moved gradually into the kernel proper.
>
> Yeah. Good plans.
>
> Ingo


--

Cheers,
Mauro

2010-02-24 17:43:05

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

Mauro Carvalho Chehab wrote:
> The EDAC data model needs some discussion, as, currently, the memory is represented
> per csrow, and modern MCUs don't allow such a level of control (and it doesn't
> make much sense to represent memory this way, as you can't replace a csrow). It
> would be better to use the DIMM as the minimum unit.

Just to start the data model, this is what a typical EDAC driver presents:

/sys/devices/system/edac/mc/mc0/
|-- ce_count
|-- ce_noinfo_count
|-- csrow0
| |-- ce_count
| |-- ch0_ce_count
| |-- ch0_dimm_label
| |-- ch1_ce_count
| |-- ch1_dimm_label
| |-- ch2_ce_count
| |-- ch2_dimm_label
| |-- ch3_ce_count
| |-- ch3_dimm_label
| |-- dev_type
| |-- edac_mode
| |-- mem_type
| |-- size_mb
| `-- ue_count
|-- csrow1
| |-- ce_count
| |-- ch0_ce_count
| |-- ch0_dimm_label
| |-- ch1_ce_count
| |-- ch1_dimm_label
| |-- ch2_ce_count
| |-- ch2_dimm_label
| |-- ch3_ce_count
| |-- ch3_dimm_label
| |-- dev_type
| |-- edac_mode
| |-- mem_type
| |-- size_mb
| `-- ue_count
|-- device -> ../../../../pci0000:3f/0000:3f:03.0
|-- mc_name
|-- reset_counters
|-- sdram_scrub_rate
|-- seconds_since_reset
|-- size_mb
|-- ue_count
`-- ue_noinfo_count

In the case of i7core_edac, there's no way to identify csrows by using
the public registers (I have no idea whether there is any undocumented
register for it). So, the driver maps one DIMM per "edac csrow".

It would be good to see a better struct for it.
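
As a strawman only (none of this exists in the kernel; the field names are
made up), a DIMM-centric record could look like:

struct ras_dimm_info {
	unsigned int	socket;		/* CPU socket or external MC */
	unsigned int	channel;
	unsigned int	slot;
	const char	*label;		/* e.g. motherboard silkscreen label */
	unsigned int	size_mb;
	unsigned long	ce_count;	/* corrected errors */
	unsigned long	ue_count;	/* uncorrected errors */
};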

With respect to error injection, this is the way i7core maps it:

|-- inject_addrmatch
| |-- bank
| |-- channel
| |-- col
| |-- dimm
| |-- page
| `-- rank
|-- inject_eccmask
|-- inject_enable
|-- inject_section
|-- inject_type

The inject_addrmatch leaf files control a match filter that selects where
the error will be injected. I doubt we would find a way to standardize it.
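
For illustration, this is roughly how the current interface is driven from
user space (the base path is assumed to be the mc0 directory shown earlier;
the values are arbitrary):

#include <stdio.h>

static int write_sysfs(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	const char *base = "/sys/devices/system/edac/mc/mc0";
	char path[256];

	snprintf(path, sizeof(path), "%s/inject_addrmatch/channel", base);
	write_sysfs(path, "1");		/* only inject on channel 1 */

	snprintf(path, sizeof(path), "%s/inject_enable", base);
	write_sysfs(path, "1");		/* arm the injection */
	return 0;
}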

Cheers,
Mauro

2010-02-24 20:28:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

On Wed, Feb 24, 2010 at 02:42:24PM -0300, Mauro Carvalho Chehab wrote:
> Mauro Carvalho Chehab wrote:
> > The EDAC data model needs some discussion, as, currently, the memory is represented
> > per csrow, and modern MCUs don't allow such a level of control (and it doesn't
> > make much sense to represent memory this way, as you can't replace a csrow). It
> > would be better to use the DIMM as the minimum unit.
>
> Just to start the data model, this is what a typical EDAC driver presents:

First I suspect the result wouldn't be too compatible with current EDAC,
so it might be less confusing to give it a new name.

>
> /sys/devices/system/edac/mc/mc0/
> |-- ce_count
> |-- ce_noinfo_count
> |-- csrow0
> | |-- ce_count
> | |-- ch0_ce_count
> | |-- ch0_dimm_label
> | |-- ch1_ce_count
> | |-- ch1_dimm_label
> | |-- ch2_ce_count
> | |-- ch2_dimm_label
> | |-- ch3_ce_count
> | |-- ch3_dimm_label
> | |-- dev_type
> | |-- edac_mode
> | |-- mem_type
> | |-- size_mb
> | `-- ue_count
> |-- csrow1
> | |-- ce_count
> | |-- ch0_ce_count
> | |-- ch0_dimm_label
> | |-- ch1_ce_count
> | |-- ch1_dimm_label
> | |-- ch2_ce_count
> | |-- ch2_dimm_label
> | |-- ch3_ce_count
> | |-- ch3_dimm_label
> | |-- dev_type
> | |-- edac_mode
> | |-- mem_type
> | |-- size_mb
> | `-- ue_count
> |-- device -> ../../../../pci0000:3f/0000:3f:03.0
> |-- mc_name
> |-- reset_counters
> |-- sdram_scrub_rate
> |-- seconds_since_reset
> |-- size_mb
> |-- ue_count
> `-- ue_noinfo_count
>
> In the case of i7core_edac, there's no way to identify csrows by using
> the public registers (I have no idea whether there is any undocumented
> register for it). So, the driver maps one DIMM per "edac csrow".

Some thoughts on this:

One of my goals would be that a fallback driver can create this information
just from the SMBIOS (that is, without error counts, although in some cases
those can be mapped from corrected MCEs).
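
A rough sketch of that fallback (hypothetical; it only counts SMBIOS type 17
"Memory Device" entries, where a real driver would also decode size, locator
and so on):

#include <linux/dmi.h>
#include <linux/init.h>
#include <linux/kernel.h>

static void count_mem_device(const struct dmi_header *dh, void *priv)
{
	int *ndimms = priv;

	if (dh->type == 17)	/* SMBIOS type 17: Memory Device */
		(*ndimms)++;
}

static int __init smbios_dimm_scan(void)
{
	int ndimms = 0;

	dmi_walk(count_mem_device, &ndimms);
	pr_info("SMBIOS reports %d memory device entries\n", ndimms);
	return 0;
}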

I think I would prefer a flat model:

socket <-> DIMM

socket could be arbitrary: either a CPU socket or an external memory controller

and then have a "path" string per DIMM that can be an arbitrary path to the
DIMM, depending on the system: e.g. on a system with on-board buffers it
could include the buffer, the channel to the buffer and the channel behind
the buffer. The string could be free form, like a path name, e.g.

FOO-CH1/DDR-CH2/DIMMa

I think it doesn't make sense to have that in the file system hierarchy
itself (that would just make it harder to parse without any real benefit,
since hardware changes could completely change the directory structure);
I would just keep it as an attribute file in the leaf nodes.

I don't think it makes much sense to have anything below the DIMM (e.g.
ranks). These are typically hard to get at in a generic way, and the most
important information at this level is which units to replace.

Then the other important part is a good way to manage fallback counters
for when you cannot identify the target DIMM (which happens occasionally
for various reasons).

In this case we would need an "aggregator object" with a truncated path
that still maintains the same counts.
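
Again purely as a sketch (nothing like this exists yet; the names are made
up), the flat model plus aggregator could be as simple as:

#define MAX_DIMMS_PER_SOCKET	16	/* arbitrary */

struct ras_mem_object {
	char		path[64];	/* e.g. "FOO-CH1/DDR-CH2/DIMMa" */
	unsigned long	ce_count;
	unsigned long	ue_count;
};

struct ras_socket {
	struct ras_mem_object	dimm[MAX_DIMMS_PER_SOCKET];
	/* aggregator: truncated path, same counters, used when the
	 * target DIMM cannot be identified */
	struct ras_mem_object	unattributed;
};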

Then of course there still needs to be an event interface to let someone
actually do something with all these counts without having to poll.

Most interesting cases need user space support anyway, so of course user
space needs raw events too.

-Andi

--
[email protected] -- Speaking for myself only.