2012-05-17 21:28:19

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

Add a new tracepoint-based hardware events report method for
reporting Memory Controller events.

Part of the description bellow is shamelessly copied from Tony
Luck's notes about the Hardware Error BoF during LPC 2010 [1].
Tony, thanks for your notes and discussions to generate the
h/w error reporting requirements.

[1] http://lwn.net/Articles/416669/

We have several subsystems & methods for reporting hardware errors:

1) EDAC ("Error Detection and Correction"). In its original form
this consisted of a platform specific driver that read topology
information and error counts from chipset registers and reported
the results via a sysfs interface.

2) mcelog - x86 specific decoding of machine check bank registers
reporting in binary form via /dev/mcelog. Recent additions make use
of the APEI extensions that were documented in version 4.0a of the
ACPI specification to acquire more information about errors without
having to rely reading chipset registers directly. A user level
programs decodes into somewhat human readable format.

3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
decodes errors reported via machine check bank registers in AMD
processors to the console log using printk();

Each of these mechanisms has a band of followers ... and none
of them appear to meet all the needs of all users.

As part of a RAS subsystem, let's encapsulate the memory error hardware
events into a trace facility.

The tracepoint printk will be displayed like:

mc_event: (Corrected|Uncorrected|Fatal) error:[error msg] on memory stick "[label]" ([location] [edac_mc detail] [driver_detail])

Where:
[error msg] is the driver-specific error message
(e. g. "memory read", "bus error", ...);
[location] is the location in terms of memory controller and
branch/channel/slot, channel/slot or csrow/channel;
[label] is the memory stick label;
[edac_mc detail] describes the address location of the error
and the syndrome;
[driver detail] is driver-specifig error message details,
when needed/provided (e. g. "area:DMA", ...)

For example:

mc_event: Corrected error:memory read on memory stick "DIMM_1A" (mc:0 channel:0 slot:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)

Of course, any userspace tools meant to handle errors should not parse
the above data. They should, instead, use the binary fields provided by
the tracepoint, mapping them directly into their MIBs.

NOTE: The original patch was providing an additional mechanism for
MCA-based trace events that also contained MCA error register data.
Hoever, as no agreement was reached so far for the MCA-based trace
events, for now, let's add events only for memory errors.
A latter patch is planned to change the tracepoint, for those types
of event.

Cc: Aristeu Rozanski <[email protected]>
Cc: Doug Thompson <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
---

v24b: Remove extra whitespaces and improve messages on amd64_edac.

drivers/edac/amd64_edac.c | 29 ++++++++--------
drivers/edac/amd64_edac.h | 3 +-
drivers/edac/edac_core.h | 2 +-
drivers/edac/edac_mc.c | 56 +++++++++++++++++++++++++------
include/ras/ras_event.h | 80 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 143 insertions(+), 27 deletions(-)
create mode 100644 include/ras/ras_event.h

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 60479f9..5be52e1 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1052,8 +1052,8 @@ static void k8_map_sysaddr_to_csrow(struct mem_ctl_info *mci, u64 sys_addr,
edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
page, offset, syndrome,
-1, -1, -1,
- EDAC_MOD_STR,
"failed to map error addr to a node",
+ EDAC_MOD_STR,
NULL);
return;
}
@@ -1064,8 +1064,8 @@ static void k8_map_sysaddr_to_csrow(struct mem_ctl_info *mci, u64 sys_addr,
edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
page, offset, syndrome,
-1, -1, -1,
- EDAC_MOD_STR,
"failed to map error addr to a csrow",
+ EDAC_MOD_STR,
NULL);
return;
}
@@ -1085,8 +1085,8 @@ static void k8_map_sysaddr_to_csrow(struct mem_ctl_info *mci, u64 sys_addr,
edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
page, offset, syndrome,
csrow, -1, -1,
- EDAC_MOD_STR,
"unknown syndrome - possible error reporting race",
+ EDAC_MOD_STR,
NULL);
return;
}
@@ -1105,7 +1105,7 @@ static void k8_map_sysaddr_to_csrow(struct mem_ctl_info *mci, u64 sys_addr,
edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, src_mci,
page, offset, syndrome,
csrow, channel, -1,
- EDAC_MOD_STR, "", NULL);
+ "", EDAC_MOD_STR, NULL);
}

static int ddr2_cs_size(unsigned i, bool dct_width)
@@ -1595,8 +1595,8 @@ static void f1x_map_sysaddr_to_csrow(struct mem_ctl_info *mci, u64 sys_addr,
edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
page, offset, syndrome,
-1, -1, -1,
- EDAC_MOD_STR,
"failed to map error addr to a csrow",
+ EDAC_MOD_STR,
NULL);
return;
}
@@ -1612,7 +1612,7 @@ static void f1x_map_sysaddr_to_csrow(struct mem_ctl_info *mci, u64 sys_addr,
edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
page, offset, syndrome,
csrow, chan, -1,
- EDAC_MOD_STR, "", NULL);
+ "", EDAC_MOD_STR, NULL);
}

/*
@@ -1896,8 +1896,8 @@ static void amd64_handle_ce(struct mem_ctl_info *mci, struct mce *m)
edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
0, 0, 0,
-1, -1, -1,
- EDAC_MOD_STR,
"HW has no ERROR_ADDRESS available",
+ EDAC_MOD_STR,
NULL);
return;
}
@@ -1925,8 +1925,8 @@ static void amd64_handle_ue(struct mem_ctl_info *mci, struct mce *m)
edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
0, 0, 0,
-1, -1, -1,
- EDAC_MOD_STR,
"HW has no ERROR_ADDRESS available",
+ EDAC_MOD_STR,
NULL);
return;
}
@@ -1945,8 +1945,9 @@ static void amd64_handle_ue(struct mem_ctl_info *mci, struct mce *m)
edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
page, offset, 0,
-1, -1, -1,
+ "ERROR ADDRESS NOT mapped to a MC",
EDAC_MOD_STR,
- "ERROR ADDRESS NOT mapped to a MC", NULL);
+ NULL);
return;
}

@@ -1959,14 +1960,14 @@ static void amd64_handle_ue(struct mem_ctl_info *mci, struct mce *m)
edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
page, offset, 0,
-1, -1, -1,
- EDAC_MOD_STR,
"ERROR ADDRESS NOT mapped to CS",
+ EDAC_MOD_STR,
NULL);
} else {
edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
page, offset, 0,
csrow, -1, -1,
- EDAC_MOD_STR, "", NULL);
+ "", EDAC_MOD_STR, NULL);
}
}

@@ -2471,7 +2472,7 @@ static void setup_mci_misc_attrs(struct mem_ctl_info *mci,
mci->edac_ctl_cap |= EDAC_FLAG_S4ECD4ED;

mci->edac_cap = amd64_determine_edac_cap(pvt);
- mci->mod_name = EDAC_MOD_STR;
+ mci->mod_name = EDAC_DRIVER_STR;
mci->mod_ver = EDAC_AMD64_VERSION;
mci->ctl_name = fam->ctl_name;
mci->dev_name = pci_name(pvt->F2);
@@ -2731,7 +2732,7 @@ static const struct pci_device_id amd64_pci_table[] __devinitdata = {
MODULE_DEVICE_TABLE(pci, amd64_pci_table);

static struct pci_driver amd64_pci_driver = {
- .name = EDAC_MOD_STR,
+ .name = EDAC_DRIVER_STR,
.probe = amd64_probe_one_instance,
.remove = __devexit_p(amd64_remove_one_instance),
.id_table = amd64_pci_table,
@@ -2750,7 +2751,7 @@ static void setup_pci_device(void)

pvt = mci->pvt_info;
amd64_ctl_pci =
- edac_pci_create_generic_ctl(&pvt->F2->dev, EDAC_MOD_STR);
+ edac_pci_create_generic_ctl(&pvt->F2->dev, EDAC_DRIVER_STR);

if (!amd64_ctl_pci) {
pr_warning("%s(): Unable to create PCI control\n",
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 9a666cb..acea42c 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -145,7 +145,8 @@
*/

#define EDAC_AMD64_VERSION "3.4.0"
-#define EDAC_MOD_STR "amd64_edac"
+#define EDAC_DRIVER_STR "amd64_edac"
+#define EDAC_MOD_STR "driver:" EDAC_DRIVER_STR

/* Extended Model from CPUID, for CPU Revision numbers */
#define K8_REV_D 1
diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index f06ce9a..eee7360 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -468,7 +468,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
const int layer2,
const char *msg,
const char *other_detail,
- const void *mcelog);
+ const void *arch_log);

/*
* edac_device APIs
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 7246a3c..461f6f8 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -33,6 +33,10 @@
#include "edac_core.h"
#include "edac_module.h"

+#define CREATE_TRACE_POINTS
+#define TRACE_INCLUDE_PATH ../../include/ras
+#include <ras/ras_event.h>
+
/* lock to memory controller's control array */
static DEFINE_MUTEX(mem_ctls_mutex);
static LIST_HEAD(mc_devices);
@@ -384,6 +388,7 @@ struct mem_ctl_info *edac_mc_alloc(unsigned mc_num,
* which will perform kobj unregistration and the actual free
* will occur during the kobject callback operation
*/
+
return mci;
}
EXPORT_SYMBOL_GPL(edac_mc_alloc);
@@ -909,12 +914,12 @@ static void edac_ce_error(struct mem_ctl_info *mci,
if (edac_mc_get_log_ce()) {
if (other_detail && *other_detail)
edac_mc_printk(mci, KERN_WARNING,
- "CE %s on %s (%s%s - %s)\n",
+ "CE %s on %s (%s %s - %s)\n",
msg, label, location,
detail, other_detail);
else
edac_mc_printk(mci, KERN_WARNING,
- "CE %s on %s (%s%s)\n",
+ "CE %s on %s (%s %s)\n",
msg, label, location,
detail);
}
@@ -953,12 +958,12 @@ static void edac_ue_error(struct mem_ctl_info *mci,
if (edac_mc_get_log_ue()) {
if (other_detail && *other_detail)
edac_mc_printk(mci, KERN_WARNING,
- "UE %s on %s (%s%s - %s)\n",
+ "UE %s on %s (%s %s - %s)\n",
msg, label, location, detail,
other_detail);
else
edac_mc_printk(mci, KERN_WARNING,
- "UE %s on %s (%s%s)\n",
+ "UE %s on %s (%s %s)\n",
msg, label, location, detail);
}

@@ -975,6 +980,27 @@ static void edac_ue_error(struct mem_ctl_info *mci,
}

#define OTHER_LABEL " or "
+
+/**
+ * edac_mc_handle_error - reports a memory event to userspace
+ *
+ * @type: severity of the error (CE/UE/Fatal)
+ * @mci: a struct mem_ctl_info pointer
+ * @page_frame_number: mem page where the error occurred
+ * @offset_in_page: offset of the error inside the page
+ * @syndrome: ECC syndrome
+ * @layer0: Memory layer0 position
+ * @layer1: Memory layer2 position
+ * @layer2: Memory layer3 position
+ * @msg: Message meaningful to the end users that
+ * explains the event
+ * @other_detail: Technical details about the event that
+ * may help hardware manufacturers and
+ * EDAC developers to analyse the event
+ * @arch_log: Architecture-specific struct that can
+ * be used to add extended information to the
+ * tracepoint, like dumping MCE registers.
+ */
void edac_mc_handle_error(const enum hw_event_mc_err_type type,
struct mem_ctl_info *mci,
const unsigned long page_frame_number,
@@ -985,7 +1011,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
const int layer2,
const char *msg,
const char *other_detail,
- const void *mcelog)
+ const void *arch_log)
{
/* FIXME: too much for stack: move it to some pre-alocated area */
char detail[80], location[80];
@@ -1120,23 +1146,31 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
edac_layer_name[mci->layers[i].type],
pos[i]);
}
+ if (p > location)
+ *(p - 1) = '\0';

/* Memory type dependent details about the error */
- if (type == HW_EVENT_ERR_CORRECTED) {
+ if (type == HW_EVENT_ERR_CORRECTED)
snprintf(detail, sizeof(detail),
"page:0x%lx offset:0x%lx grain:%d syndrome:0x%lx",
page_frame_number, offset_in_page,
grain, syndrome);
- edac_ce_error(mci, pos, msg, location, label, detail,
- other_detail, enable_per_layer_report,
- page_frame_number, offset_in_page, grain);
- } else {
+ else
snprintf(detail, sizeof(detail),
"page:0x%lx offset:0x%lx grain:%d",
page_frame_number, offset_in_page, grain);

+ /* Report the error via the trace interface */
+ trace_mc_event(type, mci->mc_idx, msg, label, location,
+ detail, other_detail);
+
+ /* Report the error via the edac_mc_printk() interface */
+ if (type == HW_EVENT_ERR_CORRECTED)
+ edac_ce_error(mci, pos, msg, location, label, detail,
+ other_detail, enable_per_layer_report,
+ page_frame_number, offset_in_page, grain);
+ else
edac_ue_error(mci, pos, msg, location, label, detail,
other_detail, enable_per_layer_report);
- }
}
EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
new file mode 100644
index 0000000..2f9069d
--- /dev/null
+++ b/include/ras/ras_event.h
@@ -0,0 +1,80 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM ras
+#define TRACE_INCLUDE_FILE ras_event
+
+#if !defined(_TRACE_HW_EVENT_MC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HW_EVENT_MC_H
+
+#include <linux/tracepoint.h>
+#include <linux/edac.h>
+#include <linux/ktime.h>
+
+/*
+ * Hardware Events Report
+ *
+ * Those events are generated when hardware detected a corrected or
+ * uncorrected event, and are meant to replace the current API to report
+ * errors defined on both EDAC and MCE subsystems.
+ *
+ * FIXME: Add events for handling memory errors originated from the
+ * MCE subsystem.
+ */
+
+/*
+ * Hardware-independent Memory Controller specific events
+ */
+
+/*
+ * Default error mechanisms for Memory Controller errors (CE and UE)
+ */
+TRACE_EVENT(mc_event,
+
+ TP_PROTO(const unsigned int err_type,
+ const unsigned int mc_index,
+ const char *error_msg,
+ const char *label,
+ const char *location,
+ const char *core_detail,
+ const char *driver_detail),
+
+ TP_ARGS(err_type, mc_index, error_msg, label, location,
+ core_detail, driver_detail),
+
+ TP_STRUCT__entry(
+ __field( unsigned int, err_type )
+ __field( unsigned int, mc_index )
+ __string( msg, error_msg )
+ __string( label, label )
+ __string( detail, core_detail )
+ __string( location, location )
+ __string( driver_detail, driver_detail )
+ ),
+
+ TP_fast_assign(
+ __entry->err_type = err_type;
+ __entry->mc_index = mc_index;
+ __assign_str(msg, error_msg);
+ __assign_str(label, label);
+ __assign_str(location, location);
+ __assign_str(detail, core_detail);
+ __assign_str(driver_detail, driver_detail);
+ ),
+
+ TP_printk("%s error:%s%s on memory stick \"%s\" (mc:%d %s %s%s%s)",
+ (__entry->err_type == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
+ ((__entry->err_type == HW_EVENT_ERR_FATAL) ?
+ "Fatal" : "Uncorrected"),
+ ((char *)__get_str(msg))[0] ? " " : "",
+ __get_str(msg),
+ __get_str(label),
+ __entry->mc_index,
+ __get_str(location),
+ __get_str(detail),
+ ((char *)__get_str(driver_detail))[0] ? " " : "",
+ __get_str(driver_detail))
+);
+
+#endif /* _TRACE_HW_EVENT_MC_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
1.7.8


2012-05-17 21:49:19

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Thu, May 17, 2012 at 05:41:17PM -0300, Mauro Carvalho Chehab wrote:
> Add a new tracepoint-based hardware events report method for
> reporting Memory Controller events.
>
> Part of the description bellow is shamelessly copied from Tony
> Luck's notes about the Hardware Error BoF during LPC 2010 [1].
> Tony, thanks for your notes and discussions to generate the
> h/w error reporting requirements.
>
> [1] http://lwn.net/Articles/416669/
>
> We have several subsystems & methods for reporting hardware errors:
>
> 1) EDAC ("Error Detection and Correction"). In its original form
> this consisted of a platform specific driver that read topology
> information and error counts from chipset registers and reported
> the results via a sysfs interface.
>
> 2) mcelog - x86 specific decoding of machine check bank registers
> reporting in binary form via /dev/mcelog. Recent additions make use
> of the APEI extensions that were documented in version 4.0a of the
> ACPI specification to acquire more information about errors without
> having to rely reading chipset registers directly. A user level
> programs decodes into somewhat human readable format.
>
> 3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
> decodes errors reported via machine check bank registers in AMD
> processors to the console log using printk();
>
> Each of these mechanisms has a band of followers ... and none
> of them appear to meet all the needs of all users.
>
> As part of a RAS subsystem, let's encapsulate the memory error hardware
> events into a trace facility.
>
> The tracepoint printk will be displayed like:
>
> mc_event: (Corrected|Uncorrected|Fatal) error:[error msg] on memory stick "[label]" ([location] [edac_mc detail] [driver_detail])
>
> Where:
> [error msg] is the driver-specific error message
> (e. g. "memory read", "bus error", ...);
> [location] is the location in terms of memory controller and
> branch/channel/slot, channel/slot or csrow/channel;
> [label] is the memory stick label;
> [edac_mc detail] describes the address location of the error
> and the syndrome;
> [driver detail] is driver-specifig error message details,
> when needed/provided (e. g. "area:DMA", ...)
>
> For example:
>
> mc_event: Corrected error:memory read on memory stick "DIMM_1A" (mc:0 channel:0 slot:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)
>
> Of course, any userspace tools meant to handle errors should not parse
> the above data. They should, instead, use the binary fields provided by
> the tracepoint, mapping them directly into their MIBs.

Nacked-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-18 07:12:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events


* Borislav Petkov <[email protected]> wrote:

> On Thu, May 17, 2012 at 05:41:17PM -0300, Mauro Carvalho Chehab wrote:
> > Add a new tracepoint-based hardware events report method for
> > reporting Memory Controller events.
> >
> > Part of the description bellow is shamelessly copied from Tony
> > Luck's notes about the Hardware Error BoF during LPC 2010 [1].
> > Tony, thanks for your notes and discussions to generate the
> > h/w error reporting requirements.
> >
> > [1] http://lwn.net/Articles/416669/
> >
> > We have several subsystems & methods for reporting hardware errors:
> >
> > 1) EDAC ("Error Detection and Correction"). In its original form
> > this consisted of a platform specific driver that read topology
> > information and error counts from chipset registers and reported
> > the results via a sysfs interface.
> >
> > 2) mcelog - x86 specific decoding of machine check bank registers
> > reporting in binary form via /dev/mcelog. Recent additions make use
> > of the APEI extensions that were documented in version 4.0a of the
> > ACPI specification to acquire more information about errors without
> > having to rely reading chipset registers directly. A user level
> > programs decodes into somewhat human readable format.
> >
> > 3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
> > decodes errors reported via machine check bank registers in AMD
> > processors to the console log using printk();
> >
> > Each of these mechanisms has a band of followers ... and none
> > of them appear to meet all the needs of all users.
> >
> > As part of a RAS subsystem, let's encapsulate the memory error hardware
> > events into a trace facility.
> >
> > The tracepoint printk will be displayed like:
> >
> > mc_event: (Corrected|Uncorrected|Fatal) error:[error msg] on memory stick "[label]" ([location] [edac_mc detail] [driver_detail])
> >
> > Where:
> > [error msg] is the driver-specific error message
> > (e. g. "memory read", "bus error", ...);
> > [location] is the location in terms of memory controller and
> > branch/channel/slot, channel/slot or csrow/channel;
> > [label] is the memory stick label;
> > [edac_mc detail] describes the address location of the error
> > and the syndrome;
> > [driver detail] is driver-specifig error message details,
> > when needed/provided (e. g. "area:DMA", ...)
> >
> > For example:
> >
> > mc_event: Corrected error:memory read on memory stick "DIMM_1A" (mc:0 channel:0 slot:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)
> >
> > Of course, any userspace tools meant to handle errors should not parse
> > the above data. They should, instead, use the binary fields provided by
> > the tracepoint, mapping them directly into their MIBs.
>
> Nacked-by: Borislav Petkov <[email protected]>

Just wondering why this got nacked, and what the
suggestions/plans are to improve the situation: I assume Mauro
is working on these things to solve problems, or to add
features, Mauro could you please give a higher level list of
those problems or features? There must be more to it than just a
new tracepoint! :-)

Thanks,

Ingo

2012-05-18 09:56:59

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Fri, May 18, 2012 at 09:12:44AM +0200, Ingo Molnar wrote:
> > > Of course, any userspace tools meant to handle errors should not parse
> > > the above data. They should, instead, use the binary fields provided by
> > > the tracepoint, mapping them directly into their MIBs.
> >
> > Nacked-by: Borislav Petkov <[email protected]>
>
> Just wondering why this got nacked, and what the
> suggestions/plans are to improve the situation:

Basically this is the thread which lead to it: http://marc.info/?l=linux-kernel&m=133709477524773&w=2

> I assume Mauro is working on these things to solve problems, or to
> add features, Mauro could you please give a higher level list of
> those problems or features? There must be more to it than just a new
> tracepoint! :-)

My main objection was that the tracepoint to report errors from edac
contains the following prototype:

+ TP_PROTO(const unsigned int err_type,
+ const unsigned int mc_index,
+ const char *error_msg,
+ const char *label,
+ const char *location,
+ const char *core_detail,
+ const char *driver_detail),

and that the last args should be merged simply into one 'const char
*detail' which every driver can populate as it sees fit.

But Mauro did not want to parse the string in userspace but feed it
straight into a MIB (which could mean "Men In Black" for all I know),
right from the tracepoint:

> Of course, any userspace tools meant to handle errors should not parse
> the above data. They should, instead, use the binary fields provided by
> the tracepoint, mapping them directly into their MIBs.

And I wanted to have a generic, usable-for-all tracepoint output
which anyone in userspace can parse, decode, cut, paste as she sees
fit without forcing kernel output formatting into any abstract error
management hierarchy or whatever.

As Tony put it, we need to hammer that out properly now before it
becomes an ABI.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-18 11:00:22

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

Em 18-05-2012 06:56, Borislav Petkov escreveu:
> On Fri, May 18, 2012 at 09:12:44AM +0200, Ingo Molnar wrote:
>>>> Of course, any userspace tools meant to handle errors should not parse
>>>> the above data. They should, instead, use the binary fields provided by
>>>> the tracepoint, mapping them directly into their MIBs.
>>>
>>> Nacked-by: Borislav Petkov <[email protected]>
>>
>> Just wondering why this got nacked, and what the
>> suggestions/plans are to improve the situation:
>
> Basically this is the thread which lead to it: http://marc.info/?l=linux-kernel&m=133709477524773&w=2
>
>> I assume Mauro is working on these things to solve problems, or to
>> add features, Mauro could you please give a higher level list of
>> those problems or features? There must be more to it than just a new
>> tracepoint! :-)
>
> My main objection was that the tracepoint to report errors from edac
> contains the following prototype:
>
> + TP_PROTO(const unsigned int err_type,
> + const unsigned int mc_index,
> + const char *error_msg,
> + const char *label,
> + const char *location,
> + const char *core_detail,
> + const char *driver_detail),
>
> and that the last args should be merged simply into one 'const char
> *detail' which every driver can populate as it sees fit.
>
> But Mauro did not want to parse the string in userspace

The "detail" field above is, in fact, a group of the following integer
values:

- physical address of the error, provided as 3 integers: page, offset an
the error offset granularity;

- ECC syndrome, with has a XOR value used to determine how many and what
bit(s) have errors.

"driver_detail" contains other driver-specific details about the errors.

The "location" field is also given by 3 integers: layer0, layer1, layer2 with
contains the branch/slot/channel or slot/channel or chip select row/channel
position of the affected DIMM.

On my last comment, I said that I'm convinced that those two fields should
be broken into their integer components.

The fields at "detail" and "location" are generated by a snprintf() logic
inside the EDAC core. "driver_detail" is filled by the drivers.

I did that merge in order to provide a clearer output message, but this
doesn't help userspace tools that could get the binary information at
the trace directly.

There's a conceptual issue here, and it seems that Boris is not understanding,
probably because he doesn't have any experience on handling errors on userspace.

In this point, my previous experience on writing network management
and my work at the userspace EDAC tool helps me to see both faces of
the coin.

Currently, EDAC prints an error message, and increments some Kernel counters
to indicate where the error is located.

Userspace tools can either get the dmesg logs or read the error counters.
The edac-tools (with is the only public tool for EDAC that I'm aware of)
prefers to rely on the error counters, instead of parsing the error logs.

I suspect that one of the reasons is because printk messages are not a
consistent ABI, as changing them are not considered as a Kernel regression.

There are several issues on relying at the Kernel error counters: they don't
have decay or any other kind of logic that would allow detecting bursts.

So, on a machine running for 30 days with, let's say, 10 corrected errors,
it is not possible to tell if they were all generated on a burst (because
of some temporary interference) or if they are sparse errors generated at
the same DRAM, indicating a bad memory.

The big advantage (maybe the only advantage) of using a tracepoint is that
the binary data can be exported directly.

If the location and memory location integers are exported as integers, it
is simple to change the edac-tools to work with tracepoints, instead of working
with the error counters.

With this simple change, the tool can be improved to provide a more
consistent memory error report, as userspace should not be worried on
parsing fields that may change in future without notice.

Also, doing that will avoid the extra effort of joining everything into
a single string, and then breaking them back into their individual fields
on userspace.

> but feed it
> straight into a MIB (which could mean "Men In Black" for all I know),
> right from the tracepoint:

I can do a s/MIB/Management Information Base/. We can also avoid all other
acronyms that have more than one sense, like RAS (with all network
people will associate it with "Remote Access Service").

>> Of course, any userspace tools meant to handle errors should not parse
>> the above data. They should, instead, use the binary fields provided by
>> the tracepoint, mapping them directly into their MIBs.
>
> And I wanted to have a generic, usable-for-all tracepoint output
> which anyone in userspace can parse, decode, cut, paste as she sees
> fit without forcing kernel output formatting into any abstract error
> management hierarchy or whatever.

s/printk/trace/? That doesn't sound a good idea. The advantage of
tracepoints of printk's is that they provide an ABI that allows passing
strucutured information to userspace.

> As Tony put it, we need to hammer that out properly now before it
> becomes an ABI.

Yes, sure.

Regards,
Mauro

2012-05-18 12:43:59

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Fri, May 18, 2012 at 07:59:55AM -0300, Mauro Carvalho Chehab wrote:
> The "detail" field above is, in fact, a group of the following integer
> values:
>
> - physical address of the error, provided as 3 integers: page, offset an
> the error offset granularity;
>
> - ECC syndrome, with has a XOR value used to determine how many and what
> bit(s) have errors.
>
> "driver_detail" contains other driver-specific details about the errors.
>
> The "location" field is also given by 3 integers: layer0, layer1, layer2 with
> contains the branch/slot/channel or slot/channel or chip select row/channel
> position of the affected DIMM.
>
> On my last comment, I said that I'm convinced that those two fields should
> be broken into their integer components.
>
> The fields at "detail" and "location" are generated by a snprintf() logic
> inside the EDAC core. "driver_detail" is filled by the drivers.
>
> I did that merge in order to provide a clearer output message, but this
> doesn't help userspace tools that could get the binary information at
> the trace directly.
>
> There's a conceptual issue here, and it seems that Boris is not understanding,
> probably because he doesn't have any experience on handling errors on userspace.

Probably you're too stuck up with your error handling "experience" and
you fail to see that I'm trying to make this as generic as possible and
not fit it into anything so that EVERYONE can use it.

> In this point, my previous experience on writing network management
> and my work at the userspace EDAC tool helps me to see both faces of
> the coin.
>
> Currently, EDAC prints an error message, and increments some Kernel counters
> to indicate where the error is located.
>
> Userspace tools can either get the dmesg logs or read the error counters.
> The edac-tools (with is the only public tool for EDAC that I'm aware of)
> prefers to rely on the error counters, instead of parsing the error logs.
>
> I suspect that one of the reasons is because printk messages are not a
> consistent ABI, as changing them are not considered as a Kernel regression.

Oh yeah, teach me more about that, that's like not appearing maybe once
a week on lkml but what do I know...

> There are several issues on relying at the Kernel error counters: they
> don't have decay or any other kind of logic that would allow detecting
> bursts.

Guess what, you can do decay very successfully in userspace.

> So, on a machine running for 30 days with, let's say, 10 corrected
> errors, it is not possible to tell if they were all generated on a
> burst (because of some temporary interference) or if they are sparse
> errors generated at the same DRAM, indicating a bad memory.

Why not, you don't have timestamps?

> The big advantage (maybe the only advantage) of using a tracepoint is that
> the binary data can be exported directly.

Really?? You don't say. Teach me more about tracepoints and why it is a
good thing to use them for error reporting.

> If the location and memory location integers are exported as integers,
> it is simple to change the edac-tools to work with tracepoints,
> instead of working with the error counters.

Yeah, so?

> With this simple change, the tool can be improved to provide a more
> consistent memory error report, as userspace should not be worried on
> parsing fields that may change in future without notice.

Which userspace, your userspace? What happens if another userspace comes
with slightly different error formatting requirements? We change the
kernel again or we can't because this tracepoint is being used and we
add another tracepoint and let the bloat begin?

> Also, doing that will avoid the extra effort of joining everything into
> a single string, and then breaking them back into their individual fields
> on userspace.

I'm being told this is very easy to do in userspace in your garden
variety language.

> > but feed it
> > straight into a MIB (which could mean "Men In Black" for all I know),
> > right from the tracepoint:
>
> I can do a s/MIB/Management Information Base/. We can also avoid all other
> acronyms that have more than one sense, like RAS (with all network
> people will associate it with "Remote Access Service").

Oh, I know, MIB=Mauro Is Best.

> >> Of course, any userspace tools meant to handle errors should not parse
> >> the above data. They should, instead, use the binary fields provided by
> >> the tracepoint, mapping them directly into their MIBs.
> >
> > And I wanted to have a generic, usable-for-all tracepoint output
> > which anyone in userspace can parse, decode, cut, paste as she sees
> > fit without forcing kernel output formatting into any abstract error
> > management hierarchy or whatever.
>
> s/printk/trace/? That doesn't sound a good idea. The advantage of
> tracepoints of printk's is that they provide an ABI that allows passing
> strucutured information to userspace.

WTF? Maybe you don't understand what I'm talking about at all and I'm
wasting my time completely.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-18 13:23:58

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

Em 18-05-2012 09:43, Borislav Petkov escreveu:

(offensive/ironic comments skipped)

> you fail to see that I'm trying to make this as generic as possible and
> not fit it into anything so that EVERYONE can use it.

Tracepoints are generic enough. Everyone can use it. If need new fields are
needed, just add a new tracepoint. Converting it into a new printk way
won't make it usable by everyone. It will just make the ABI unstable.

For sure the fields required for memory errors are different than the ones
required by PCI, CPU or other types of hardware.

>> There are several issues on relying at the Kernel error counters: they
>> don't have decay or any other kind of logic that would allow detecting
>> bursts.
>
> Guess what, you can do decay very successfully in userspace.

Yes, if and only if userspace receives the errors on a proper way.

>> So, on a machine running for 30 days with, let's say, 10 corrected
>> errors, it is not possible to tell if they were all generated on a
>> burst (because of some temporary interference) or if they are sparse
>> errors generated at the same DRAM, indicating a bad memory.
>
> Why not, you don't have timestamps?

At the error counters? No.

There are timestamps at printk's, but printk's can't be reliably parsed, as
formats are not consistent among drivers, and the "ABI" can break on every
Kernel release.

>> With this simple change, the tool can be improved to provide a more
>> consistent memory error report, as userspace should not be worried on
>> parsing fields that may change in future without notice.
>
> Which userspace, your userspace? What happens if another userspace comes
> with slightly different error formatting requirements? We change the
> kernel again or we can't because this tracepoint is being used and we
> add another tracepoint and let the bloat begin?

As said before, this is not about error formatting requirements! TP_printk() macro
does formatting. I'm not discussing TP_printk() output.

All Userspace tools require access to the error fields.

The real issue with tracepoint is to provide access to the structured data that
userspace needs (so, what fields should be exported), and not about what printk
formats. With regards to the printk format there's already an agreement that
seems to satisfy you, me and Tony.

By adding all fields there at the tracepoint, no changes will be needed inside
Kernel to satisfy any userspace requirements, and the ABI will be reliable and
stable.

>> Also, doing that will avoid the extra effort of joining everything into
>> a single string, and then breaking them back into their individual fields
>> on userspace.
>
> I'm being told this is very easy to do in userspace in your garden
> variety language.

Well, ask the one that told you that to write a parser that will get all
EDAC error reports on all drivers, on several kernel versions using dmesg.
This is not easy, as the messages are not consistent, the right fields are
sometimes at the wrong places, and different kernel versions have different
printk's.

Just to give you an example, this is how sb_edac currently outputs an error:

EDAC MC1: CE row 10, channel 0, label "DIMM_5": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=6 Err=0008:00c2 (ch=2), addr = 0x22719d0780 => socket=1, Channel=3(mask=8), rank=4

The above error happened at the memory controlled by the second CPU socket,
at slot #1 of channel 3.

A similar report for amd64_edac (if amd64 would have 4 channels) would be:

EDAC MC1: amd64_edac: row 1, channel 3, label "DIMM_5" [...]

E. g. the error location fields are on different places. Writing a parser for
that would require to look inside each driver, and will require parsers
rewrite on each kernel version (for example, on this patch series, the '='
delimiter were replaced by ':' due to upstream feedback).

I think we merged some printk change for sb_edac on 3.4. If so, this driver,
that started on 3.3, will have 3 different printk ABI variants: one for
3.3, one for 3.4 and another for 3.5.

Also, as the "mask" needs to be properly parsed, 3.6 will also have yet another
printk format.

Regards,
Mauro

2012-05-18 14:05:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Fri, May 18, 2012 at 10:23:39AM -0300, Mauro Carvalho Chehab wrote:
> Em 18-05-2012 09:43, Borislav Petkov escreveu:
>
> (offensive/ironic comments skipped)
>
> > you fail to see that I'm trying to make this as generic as possible and
> > not fit it into anything so that EVERYONE can use it.
>
> Tracepoints are generic enough. Everyone can use it. If need new fields are
> needed, just add a new tracepoint.

This is exactly what I'm talking about - this is wrong on so many levels
that it ain't even funny! You can add as many tracepoints if you want
to to your pet project but in the kernel we add one tracepoint for one
thing which fits all users.

I want to see you be successful at adding a new tracepoint
to, say, prepare_task_switch() in kernel/sched/core.c because
trace_sched_switch() does not fit your needs.

> Converting it into a new printk way won't make it usable by everyone.
> It will just make the ABI unstable.

WTF? We're not even talking about converting it into a new printk!

[ … ]

> > Guess what, you can do decay very successfully in userspace.
>
> Yes, if and only if userspace receives the errors on a proper way.

Doh, am I talking to the wall here, do you even read what I'm sayin?

> >> So, on a machine running for 30 days with, let's say, 10 corrected
> >> errors, it is not possible to tell if they were all generated on a
> >> burst (because of some temporary interference) or if they are sparse
> >> errors generated at the same DRAM, indicating a bad memory.
> >
> > Why not, you don't have timestamps?
>
> At the error counters? No.
>
> There are timestamps at printk's, but printk's can't be reliably parsed, as
> formats are not consistent among drivers, and the "ABI" can break on every
> Kernel release.

Yes, enough already! We know about the printks! The whole world knows.

> >> With this simple change, the tool can be improved to provide a more
> >> consistent memory error report, as userspace should not be worried on
> >> parsing fields that may change in future without notice.
> >
> > Which userspace, your userspace? What happens if another userspace comes
> > with slightly different error formatting requirements? We change the
> > kernel again or we can't because this tracepoint is being used and we
> > add another tracepoint and let the bloat begin?
>
> As said before, this is not about error formatting requirements! TP_printk() macro
> does formatting. I'm not discussing TP_printk() output.
>
> All Userspace tools require access to the error fields.
>
> The real issue with tracepoint is to provide access to the structured data that
> userspace needs (so, what fields should be exported), and not about what printk
> formats. With regards to the printk format there's already an agreement that
> seems to satisfy you, me and Tony.
>
> By adding all fields there at the tracepoint, no changes will be needed inside
> Kernel to satisfy any userspace requirements, and the ABI will be reliable and
> stable.

Right, and this is why I'm asking you to have the following tracepoint proto:

+ TP_PROTO(const unsigned int err_type,
+ const unsigned int mc_index,
+ const char *error_msg,
+ const char *label,
+ const char *location,
+ const char *detail)

where detail contains all the crap one driver adds for technical people
to pinpoint where the error is.

And not have _TWO_ detail arguments!

Btw, the output looks like this here:

<...>-2723 [001] .N.. 89.107045: mc_event: Corrected error: on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0xfc5b driver:amd64_edac)

Come to think of it, the "driver:amd64_edac" is not really needed
because on every single system there's only one EDAC driver running and
I don't think the fact that we're telling in the tracepoint who detected
the error is meaningfull information.

Which means, you can remove the EDAC_MOD_STR argument you're passing to
edac_mc_handle_error() and have one less argument.

> >> Also, doing that will avoid the extra effort of joining everything into
> >> a single string, and then breaking them back into their individual fields
> >> on userspace.
> >
> > I'm being told this is very easy to do in userspace in your garden
> > variety language.
>
> Well, ask the one that told you that to write a parser that will get all
> EDAC error reports on all drivers, on several kernel versions using dmesg.

Nobody is talking about dmesg - I'm talking about the "const char
*detail" string in the tracepoint above and how I want it to be a single
char * so that userspace can read it out and fumble with it the way it
sees fit.

Having "detail" and "other_detail" is simply misleading and unneeded and
a clear sign that this ABI is not designed properly yet.

That's it.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-18 14:32:10

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

Em 18-05-2012 11:05, Borislav Petkov escreveu:
> On Fri, May 18, 2012 at 10:23:39AM -0300, Mauro Carvalho Chehab wrote:
>> Em 18-05-2012 09:43, Borislav Petkov escreveu:
>>
>> (offensive/ironic comments skipped)
>>
>>> you fail to see that I'm trying to make this as generic as possible and
>>> not fit it into anything so that EVERYONE can use it.
>>
>> Tracepoints are generic enough. Everyone can use it. If need new fields are
>> needed, just add a new tracepoint.
>
> This is exactly what I'm talking about - this is wrong on so many levels
> that it ain't even funny! You can add as many tracepoints if you want
> to to your pet project but in the kernel we add one tracepoint for one
> thing which fits all users.
>
> I want to see you be successful at adding a new tracepoint
> to, say, prepare_task_switch() in kernel/sched/core.c because
> trace_sched_switch() does not fit your needs.

Ok, but you won't use trace_sched_switch() as a memory tracepoint, as
they represent different things.

Memory errors are different than CPU errors. So, their tracepoints
will be different.

<skip>

> Right, and this is why I'm asking you to have the following tracepoint proto:
>
> + TP_PROTO(const unsigned int err_type,
> + const unsigned int mc_index,
> + const char *error_msg,
> + const char *label,
> + const char *location,
> + const char *detail)
>
> where detail contains all the crap one driver adds for technical people
> to pinpoint where the error is.
>
> And not have _TWO_ detail arguments!

And what I'm saying is that this should be, instead:

+ TP_PROTO(const unsigned int err_type,
+ const unsigned int mc_index,
+ const char *error_msg,
+ const char *label,
+ int layer0,
+ int layer1,
+ int layer2,
+ unsigned long pfn,
+ unsigned long offset,
+ unsigned long grain,
+ unsigned long syndrome,
+ const char *driver_detail),

So, having just one detail argument, filled by the driver, and not
folding "location" and core "details" into strings, but keeping as they
are.

>
> Btw, the output looks like this here:
>
> <...>-2723 [001] .N.. 89.107045: mc_event: Corrected error: on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0xfc5b driver:amd64_edac)
>
> Come to think of it, the "driver:amd64_edac" is not really needed
> because on every single system there's only one EDAC driver running and
> I don't think the fact that we're telling in the tracepoint who detected
> the error is meaningfull information.
>
> Which means, you can remove the EDAC_MOD_STR argument you're passing to
> edac_mc_handle_error() and have one less argument.

That's what I said you, but you didn't seem to agree, as I understood that
you've required to keep "amd64_edac" at the trace, due to:
http://markmail.org/message/nr3ooep7gc7mhgdl.

If you're ok, I'll remove EDAC_MOD_STR argument from the amd64_edac calls
on a separate patch (with can be merged latter with the patch that converted
amd64_edac to the new function calls).

>
>>>> Also, doing that will avoid the extra effort of joining everything into
>>>> a single string, and then breaking them back into their individual fields
>>>> on userspace.
>>>
>>> I'm being told this is very easy to do in userspace in your garden
>>> variety language.
>>
>> Well, ask the one that told you that to write a parser that will get all
>> EDAC error reports on all drivers, on several kernel versions using dmesg.
>
> Nobody is talking about dmesg - I'm talking about the "const char
> *detail" string in the tracepoint above and how I want it to be a single
> char * so that userspace can read it out and fumble with it the way it
> sees fit.
>
> Having "detail" and "other_detail" is simply misleading and unneeded and
> a clear sign that this ABI is not designed properly yet.
>
> That's it.

I agree with that.

It follows a version removing the "core_detail" parameter from the trace.

I removed the changes at amd64_edac from it. I'll address them on a latter
patch, removing the EDAC_MOD_STR argument from the calls, after we've
done with this patch.

Regards,
Mauro

-

RAS: Add a tracepoint for reporting memory controller events

From: Mauro Carvalho Chehab <[email protected]>

Add a new tracepoint-based hardware events report method for
reporting Memory Controller events.

Part of the description bellow is shamelessly copied from Tony
Luck's notes about the Hardware Error BoF during LPC 2010 [1].
Tony, thanks for your notes and discussions to generate the
h/w error reporting requirements.

[1] http://lwn.net/Articles/416669/

We have several subsystems & methods for reporting hardware errors:

1) EDAC ("Error Detection and Correction"). In its original form
this consisted of a platform specific driver that read topology
information and error counts from chipset registers and reported
the results via a sysfs interface.

2) mcelog - x86 specific decoding of machine check bank registers
reporting in binary form via /dev/mcelog. Recent additions make use
of the APEI extensions that were documented in version 4.0a of the
ACPI specification to acquire more information about errors without
having to rely reading chipset registers directly. A user level
programs decodes into somewhat human readable format.

3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
decodes errors reported via machine check bank registers in AMD
processors to the console log using printk();

Each of these mechanisms has a band of followers ... and none
of them appear to meet all the needs of all users.

As part of a RAS subsystem, let's encapsulate the memory error hardware
events into a trace facility.

The tracepoint printk will be displayed like:

mc_event: (Corrected|Uncorrected|Fatal) error:[error msg] on memory stick "[label]" ([location] [edac_mc detail] [driver_detail])

Where:
[error msg] is the driver-specific error message
(e. g. "memory read", "bus error", ...);
[location] is the location in terms of memory controller and
branch/channel/slot, channel/slot or csrow/channel;
[label] is the memory stick label;
[edac_mc detail] describes the address location of the error
and the syndrome;
[driver detail] is driver-specifig error message details,
when needed/provided (e. g. "area:DMA", ...)

For example:

mc_event: Corrected error:memory read on memory stick "DIMM_1A" (mc:0 channel:0 slot:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)

Of course, any userspace tools meant to handle errors should not parse
the above data. They should, instead, use the binary fields provided by
the tracepoint, mapping them directly into their Management Information
Base.

NOTE: The original patch was providing an additional mechanism for
MCA-based trace events that also contained MCA error register data.
However, as no agreement was reached so far for the MCA-based trace
events, for now, let's add events only for memory errors.
A latter patch is planned to change the tracepoint, for those types
of event.

Cc: Aristeu Rozanski <[email protected]>
Cc: Doug Thompson <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index f06ce9a..eee7360 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -468,7 +468,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
const int layer2,
const char *msg,
const char *other_detail,
- const void *mcelog);
+ const void *arch_log);

/*
* edac_device APIs
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 10f3750..5006461 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -33,6 +33,10 @@
#include "edac_core.h"
#include "edac_module.h"

+#define CREATE_TRACE_POINTS
+#define TRACE_INCLUDE_PATH ../../include/ras
+#include <ras/ras_event.h>
+
/* lock to memory controller's control array */
static DEFINE_MUTEX(mem_ctls_mutex);
static LIST_HEAD(mc_devices);
@@ -384,6 +388,7 @@ struct mem_ctl_info *edac_mc_alloc(unsigned mc_num,
* which will perform kobj unregistration and the actual free
* will occur during the kobject callback operation
*/
+
return mci;
}
EXPORT_SYMBOL_GPL(edac_mc_alloc);
@@ -909,12 +914,12 @@ static void edac_ce_error(struct mem_ctl_info *mci,
if (edac_mc_get_log_ce()) {
if (other_detail && *other_detail)
edac_mc_printk(mci, KERN_WARNING,
- "CE %s on %s (%s%s - %s)\n",
+ "CE %s on %s (%s %s - %s)\n",
msg, label, location,
detail, other_detail);
else
edac_mc_printk(mci, KERN_WARNING,
- "CE %s on %s (%s%s)\n",
+ "CE %s on %s (%s %s)\n",
msg, label, location,
detail);
}
@@ -953,12 +958,12 @@ static void edac_ue_error(struct mem_ctl_info *mci,
if (edac_mc_get_log_ue()) {
if (other_detail && *other_detail)
edac_mc_printk(mci, KERN_WARNING,
- "UE %s on %s (%s%s - %s)\n",
+ "UE %s on %s (%s %s - %s)\n",
msg, label, location, detail,
other_detail);
else
edac_mc_printk(mci, KERN_WARNING,
- "UE %s on %s (%s%s)\n",
+ "UE %s on %s (%s %s)\n",
msg, label, location, detail);
}

@@ -975,6 +980,27 @@ static void edac_ue_error(struct mem_ctl_info *mci,
}

#define OTHER_LABEL " or "
+
+/**
+ * edac_mc_handle_error - reports a memory event to userspace
+ *
+ * @type: severity of the error (CE/UE/Fatal)
+ * @mci: a struct mem_ctl_info pointer
+ * @page_frame_number: mem page where the error occurred
+ * @offset_in_page: offset of the error inside the page
+ * @syndrome: ECC syndrome
+ * @layer0: Memory layer0 position
+ * @layer1: Memory layer2 position
+ * @layer2: Memory layer3 position
+ * @msg: Message meaningful to the end users that
+ * explains the event
+ * @other_detail: Technical details about the event that
+ * may help hardware manufacturers and
+ * EDAC developers to analyse the event
+ * @arch_log: Architecture-specific struct that can
+ * be used to add extended information to the
+ * tracepoint, like dumping MCE registers.
+ */
void edac_mc_handle_error(const enum hw_event_mc_err_type type,
struct mem_ctl_info *mci,
const unsigned long page_frame_number,
@@ -985,7 +1011,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
const int layer2,
const char *msg,
const char *other_detail,
- const void *mcelog)
+ const void *arch_log)
{
/* FIXME: too much for stack: move it to some pre-alocated area */
char detail[80], location[80];
@@ -1120,6 +1146,13 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
edac_layer_name[mci->layers[i].type],
pos[i]);
}
+ if (p > location)
+ *(p - 1) = '\0';
+
+ /* Report the error via the trace interface */
+ trace_mc_event(type, mci->mc_idx, msg, label, layer0, layer1, layer2,
+ page_frame_number, offset_in_page, grain, syndrome,
+ other_detail);

/* Memory type dependent details about the error */
if (type == HW_EVENT_ERR_CORRECTED) {
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
new file mode 100644
index 0000000..65835df
--- /dev/null
+++ b/include/ras/ras_event.h
@@ -0,0 +1,100 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM ras
+#define TRACE_INCLUDE_FILE ras_event
+
+#if !defined(_TRACE_HW_EVENT_MC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HW_EVENT_MC_H
+
+#include <linux/tracepoint.h>
+#include <linux/edac.h>
+#include <linux/ktime.h>
+
+/*
+ * Hardware Events Report
+ *
+ * Those events are generated when hardware detected a corrected or
+ * uncorrected event, and are meant to replace the current API to report
+ * errors defined on both EDAC and MCE subsystems.
+ *
+ * FIXME: Add events for handling memory errors originated from the
+ * MCE subsystem.
+ */
+
+/*
+ * Hardware-independent Memory Controller specific events
+ */
+
+/*
+ * Default error mechanisms for Memory Controller errors (CE and UE)
+ */
+TRACE_EVENT(mc_event,
+
+ TP_PROTO(const unsigned int err_type,
+ const unsigned int mc_index,
+ const char *error_msg,
+ const char *label,
+ int layer0,
+ int layer1,
+ int layer2,
+ unsigned long pfn,
+ unsigned long offset,
+ unsigned long grain,
+ unsigned long syndrome,
+ const char *driver_detail),
+
+ TP_ARGS(err_type, mc_index, error_msg, label, layer0, layer1, layer2,
+ pfn, offset, grain, syndrome, driver_detail),
+
+ TP_STRUCT__entry(
+ __field( unsigned int, err_type )
+ __field( unsigned int, mc_index )
+ __string( msg, error_msg )
+ __string( label, label )
+ __field( int, layer0 )
+ __field( int, layer1 )
+ __field( int, layer2 )
+ __field( int, pfn )
+ __field( int, offset )
+ __field( int, grain )
+ __field( int, syndrome )
+ __string( driver_detail, driver_detail )
+ ),
+
+ TP_fast_assign(
+ __entry->err_type = err_type;
+ __entry->mc_index = mc_index;
+ __assign_str(msg, error_msg);
+ __assign_str(label, label);
+ __entry->layer0 = layer0;
+ __entry->layer1 = layer1;
+ __entry->layer2 = layer2;
+ __entry->pfn = pfn;
+ __entry->offset = offset;
+ __entry->grain = grain;
+ __entry->syndrome = syndrome;
+ __assign_str(driver_detail, driver_detail);
+ ),
+
+ TP_printk("%s error:%s%s on memory stick \"%s\" (mc:%d location:%d:%d:%d page:0x%08x offset:0x%08x grain:%d syndrome:0x%08x%s%s)",
+ (__entry->err_type == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
+ ((__entry->err_type == HW_EVENT_ERR_FATAL) ?
+ "Fatal" : "Uncorrected"),
+ ((char *)__get_str(msg))[0] ? " " : "",
+ __get_str(msg),
+ __get_str(label),
+ __entry->mc_index,
+ __entry->layer0,
+ __entry->layer1,
+ __entry->layer2,
+ __entry->pfn,
+ __entry->offset,
+ __entry->grain,
+ __entry->syndrome,
+ ((char *)__get_str(driver_detail))[0] ? " " : "",
+ __get_str(driver_detail))
+);
+
+#endif /* _TRACE_HW_EVENT_MC_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

2012-05-18 16:40:36

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Fri, May 18, 2012 at 11:31:46AM -0300, Mauro Carvalho Chehab wrote:
> Ok, but you won't use trace_sched_switch() as a memory tracepoint, as
> they represent different things.
>
> Memory errors are different than CPU errors. So, their tracepoints
> will be different.

WTF? A tracepoint is a tracepoint.

> > Right, and this is why I'm asking you to have the following tracepoint proto:
> >
> > + TP_PROTO(const unsigned int err_type,
> > + const unsigned int mc_index,
> > + const char *error_msg,
> > + const char *label,
> > + const char *location,
> > + const char *detail)
> >
> > where detail contains all the crap one driver adds for technical people
> > to pinpoint where the error is.
> >
> > And not have _TWO_ detail arguments!
>
> And what I'm saying is that this should be, instead:
>
> + TP_PROTO(const unsigned int err_type,
> + const unsigned int mc_index,
> + const char *error_msg,
> + const char *label,
> + int layer0,
> + int layer1,
> + int layer2,
> + unsigned long pfn,
> + unsigned long offset,
> + unsigned long grain,
> + unsigned long syndrome,
> + const char *driver_detail),
>
> So, having just one detail argument, filled by the driver, and not
> folding "location" and core "details" into strings, but keeping as they
> are.

And this way you're enforcing an interface that all drivers will have
to adhere to. What if "grain" doesn't mean a thing for a driver, or
"syndrome" or whatever? What if some other entity wants to use that
tracepoint?

See what I'm sayin?

Having

TP_PROTO(const unsigned int err_type,
const unsigned int mc_index,
const char *error_msg,
const char *label,
const char *location,
const char *detail)

is a bit more generic and userspace can parse it however it likes.

Actually, I'd slim this up even more:

TP_PROTO(const unsigned int mc_index,
const char *error_msg,
const char *label,
const char *location,
const char *detail)

and have error_msg contain the "Corrected/Uncorrected/Fatal" things
and this way you can drop all the ternary operators in the tracepoint
definition.

> > Btw, the output looks like this here:
> >
> > <...>-2723 [001] .N.. 89.107045: mc_event: Corrected error: on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0xfc5b driver:amd64_edac)
> >
> > Come to think of it, the "driver:amd64_edac" is not really needed
> > because on every single system there's only one EDAC driver running and
> > I don't think the fact that we're telling in the tracepoint who detected
> > the error is meaningfull information.
> >
> > Which means, you can remove the EDAC_MOD_STR argument you're passing to
> > edac_mc_handle_error() and have one less argument.
>
> That's what I said you, but you didn't seem to agree, as I understood that
> you've required to keep "amd64_edac" at the trace, due to:
> http://markmail.org/message/nr3ooep7gc7mhgdl.
>
> If you're ok, I'll remove EDAC_MOD_STR argument from the amd64_edac calls
> on a separate patch (with can be merged latter with the patch that converted
> amd64_edac to the new function calls).

Ok.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-18 17:27:35

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

Em 18-05-2012 13:40, Borislav Petkov escreveu:
> On Fri, May 18, 2012 at 11:31:46AM -0300, Mauro Carvalho Chehab wrote:
>> Ok, but you won't use trace_sched_switch() as a memory tracepoint, as
>> they represent different things.
>>
>> Memory errors are different than CPU errors. So, their tracepoints
>> will be different.
>
> WTF? A tracepoint is a tracepoint.

This is the event you've mentioned:

TRACE_EVENT(sched_switch,

TP_PROTO(struct task_struct *prev,
struct task_struct *next),

It doesn't have the same arguments of a memory tracepoint, like:

TRACE_EVENT(mm_vmscan_wakeup_kswapd,

TP_PROTO(int nid, int zid, int order),

trace events for different things will have different arguments.

>
>>> Right, and this is why I'm asking you to have the following tracepoint proto:
>>>
>>> + TP_PROTO(const unsigned int err_type,
>>> + const unsigned int mc_index,
>>> + const char *error_msg,
>>> + const char *label,
>>> + const char *location,
>>> + const char *detail)
>>>
>>> where detail contains all the crap one driver adds for technical people
>>> to pinpoint where the error is.
>>>
>>> And not have _TWO_ detail arguments!
>>
>> And what I'm saying is that this should be, instead:
>>
>> + TP_PROTO(const unsigned int err_type,
>> + const unsigned int mc_index,
>> + const char *error_msg,
>> + const char *label,
>> + int layer0,
>> + int layer1,
>> + int layer2,
>> + unsigned long pfn,
>> + unsigned long offset,
>> + unsigned long grain,
>> + unsigned long syndrome,
>> + const char *driver_detail),
>>
>> So, having just one detail argument, filled by the driver, and not
>> folding "location" and core "details" into strings, but keeping as they
>> are.
>
> And this way you're enforcing an interface that all drivers will have
> to adhere to. What if "grain" doesn't mean a thing for a driver, or
> "syndrome" or whatever? What if some other entity wants to use that
> tracepoint?

This enforcement is already there at the EDAC API/ABI. This patch series
won't change that.

Memory controllers that can't get the error address (page/offset/grain)
or the syndrome fills those information with 0.

Maybe there is a driver deficiency (or a bad hardware implementation)
that would not allow to report where the error happened, but that doesn't
mean that we should create a crappy API due to bad hardware.

>
> See what I'm sayin?
>
> Having
>
> TP_PROTO(const unsigned int err_type,
> const unsigned int mc_index,
> const char *error_msg,
> const char *label,
> const char *location,
> const char *detail)
>
> is a bit more generic and userspace can parse it however it likes.
>
> Actually, I'd slim this up even more:
>
> TP_PROTO(const unsigned int mc_index,
> const char *error_msg,
> const char *label,
> const char *location,
> const char *detail)

Well, this even more generic:
TP_PROTO(const char *msg)

but the more we convert other fields into strings, the harder is for
userspace to handle it, as fields may be on different order, string
conversion patches might change or break the parsing behavior,
different drivers will require different parsing expressions, etc.

> and have error_msg contain the "Corrected/Uncorrected/Fatal" things
> and this way you can drop all the ternary operators in the tracepoint
> definition.
>
>>> Btw, the output looks like this here:
>>>
>>> <...>-2723 [001] .N.. 89.107045: mc_event: Corrected error: on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0xfc5b driver:amd64_edac)
>>>
>>> Come to think of it, the "driver:amd64_edac" is not really needed
>>> because on every single system there's only one EDAC driver running and
>>> I don't think the fact that we're telling in the tracepoint who detected
>>> the error is meaningfull information.
>>>
>>> Which means, you can remove the EDAC_MOD_STR argument you're passing to
>>> edac_mc_handle_error() and have one less argument.
>>
>> That's what I said you, but you didn't seem to agree, as I understood that
>> you've required to keep "amd64_edac" at the trace, due to:
>> http://markmail.org/message/nr3ooep7gc7mhgdl.
>>
>> If you're ok, I'll remove EDAC_MOD_STR argument from the amd64_edac calls
>> on a separate patch (with can be merged latter with the patch that converted
>> amd64_edac to the new function calls).
>
> Ok.

Ok, I'll prepare a patch for it after we finish discussing this patch.

Regards,
Mauro
>

2012-05-18 18:53:19

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Fri, May 18, 2012 at 02:27:10PM -0300, Mauro Carvalho Chehab wrote:
> Em 18-05-2012 13:40, Borislav Petkov escreveu:
> > On Fri, May 18, 2012 at 11:31:46AM -0300, Mauro Carvalho Chehab wrote:
> >> Ok, but you won't use trace_sched_switch() as a memory tracepoint, as
> >> they represent different things.
> >>
> >> Memory errors are different than CPU errors. So, their tracepoints
> >> will be different.
> >
> > WTF? A tracepoint is a tracepoint.
>
> This is the event you've mentioned:
>
> TRACE_EVENT(sched_switch,
>
> TP_PROTO(struct task_struct *prev,
> struct task_struct *next),
>
> It doesn't have the same arguments of a memory tracepoint, like:
>
> TRACE_EVENT(mm_vmscan_wakeup_kswapd,
>
> TP_PROTO(int nid, int zid, int order),
>
> trace events for different things will have different arguments.

It seems you and I speak two different english languages. Here's what I meant:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0533a688ce22..96947fb50328 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1918,6 +1918,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
trace_sched_switch(prev, next);
+ trace_sched_switch_rq(prev, next, rq);
}

because for some reason we want to dump something from the runqueue
pointer now. We can't change the current tracepoint because userspace
scripts use it.

Now imagine the bloat levels when people start adding
tracepoints left and right...

> >
> >>> Right, and this is why I'm asking you to have the following tracepoint proto:
> >>>
> >>> + TP_PROTO(const unsigned int err_type,
> >>> + const unsigned int mc_index,
> >>> + const char *error_msg,
> >>> + const char *label,
> >>> + const char *location,
> >>> + const char *detail)
> >>>
> >>> where detail contains all the crap one driver adds for technical people
> >>> to pinpoint where the error is.
> >>>
> >>> And not have _TWO_ detail arguments!
> >>
> >> And what I'm saying is that this should be, instead:
> >>
> >> + TP_PROTO(const unsigned int err_type,
> >> + const unsigned int mc_index,
> >> + const char *error_msg,
> >> + const char *label,
> >> + int layer0,
> >> + int layer1,
> >> + int layer2,
> >> + unsigned long pfn,
> >> + unsigned long offset,
> >> + unsigned long grain,
> >> + unsigned long syndrome,
> >> + const char *driver_detail),
> >>
> >> So, having just one detail argument, filled by the driver, and not
> >> folding "location" and core "details" into strings, but keeping as they
> >> are.
> >
> > And this way you're enforcing an interface that all drivers will have
> > to adhere to. What if "grain" doesn't mean a thing for a driver, or
> > "syndrome" or whatever? What if some other entity wants to use that
> > tracepoint?
>
> This enforcement is already there at the EDAC API/ABI. This patch series
> won't change that.
>
> Memory controllers that can't get the error address (page/offset/grain)
> or the syndrome fills those information with 0.

Setting something to 0 because it is N/A is the first sign that
something is wrong with your ABI. Having an arbitrary string which each
driver parses as it sees fit is perfectly fine.

> > See what I'm sayin?
> >
> > Having
> >
> > TP_PROTO(const unsigned int err_type,
> > const unsigned int mc_index,
> > const char *error_msg,
> > const char *label,
> > const char *location,
> > const char *detail)
> >
> > is a bit more generic and userspace can parse it however it likes.
> >
> > Actually, I'd slim this up even more:
> >
> > TP_PROTO(const unsigned int mc_index,
> > const char *error_msg,
> > const char *label,
> > const char *location,
> > const char *detail)
>
> Well, this even more generic:
> TP_PROTO(const char *msg)
>
> but the more we convert other fields into strings, the harder is for
> userspace to handle it,

For the millionth time, it is not hard for userspace to parse anything!

> as fields may be on different order, string conversion patches might
> change or break the parsing behavior, different drivers will require
> different parsing expressions, etc.

That's why _each_ _driver_ will have its format and the userspace tools
parsing it will know about it!

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-18 19:10:46

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

> That's why _each_ _driver_ will have its format and the userspace tools
> parsing it will know about it!

Sounds like a full employment program for parser writers.

There are some interesting fields that should be common to all
drivers ... so have twenty parsers that can handle:

paddr: 0x1234
PADDR: 0x1234
Paddr = 0x1234
Phys = 1234
addr: 0x1234
Address: 0x000000001234

looks like a lot of make-work ... when the OS can standardize in the ABI
that there is a 64-bit binary value that is the physical address of the
error (and another 64-bit mask saying which, if any, bits are valid).

So we should be looking for the set of always relevant values that can
be kept explicitly separate in fields in TP_PROTO, and per-driver specific
stuff (grain/syndrome??) bits that will have to go into the "details"
string and require some driver specific user-mode parsing to use.

-Tony

2012-05-18 21:12:32

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Fri, May 18, 2012 at 07:10:42PM +0000, Luck, Tony wrote:
> > That's why _each_ _driver_ will have its format and the userspace tools
> > parsing it will know about it!
>
> Sounds like a full employment program for parser writers.
>
> There are some interesting fields that should be common to all
> drivers ... so have twenty parsers that can handle:
>
> paddr: 0x1234
> PADDR: 0x1234
> Paddr = 0x1234
> Phys = 1234
> addr: 0x1234
> Address: 0x000000001234
>
> looks like a lot of make-work ... when the OS can standardize in the ABI
> that there is a 64-bit binary value that is the physical address of the
> error (and another 64-bit mask saying which, if any, bits are valid).

Makes sense, I gotta say :)

> So we should be looking for the set of always relevant values that
> can be kept explicitly separate in fields in TP_PROTO, and per-driver
> specific stuff (grain/syndrome??) bits that will have to go into the
> "details" string and require some driver specific user-mode parsing to
> use.

How about we put all the values which are globally valid for all drivers
in separate fields and leave the "(...)" brackets for details where each
driver can put its own relevant stuff?

For the record, I very much like this reasoning :).

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-19 09:26:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Fri, May 18, 2012 at 11:12:11PM +0200, Borislav Petkov wrote:
> On Fri, May 18, 2012 at 07:10:42PM +0000, Luck, Tony wrote:
> > > That's why _each_ _driver_ will have its format and the userspace tools
> > > parsing it will know about it!
> >
> > Sounds like a full employment program for parser writers.
> >
> > There are some interesting fields that should be common to all
> > drivers ... so have twenty parsers that can handle:
> >
> > paddr: 0x1234
> > PADDR: 0x1234
> > Paddr = 0x1234
> > Phys = 1234
> > addr: 0x1234
> > Address: 0x000000001234
> >
> > looks like a lot of make-work ... when the OS can standardize in the ABI
> > that there is a 64-bit binary value that is the physical address of the
> > error (and another 64-bit mask saying which, if any, bits are valid).
>
> Makes sense, I gotta say :)
>
> > So we should be looking for the set of always relevant values that
> > can be kept explicitly separate in fields in TP_PROTO, and per-driver
> > specific stuff (grain/syndrome??) bits that will have to go into the
> > "details" string and require some driver specific user-mode parsing to
> > use.
>
> How about we put all the values which are globally valid for all drivers
> in separate fields and leave the "(...)" brackets for details where each
> driver can put its own relevant stuff?
>
> For the record, I very much like this reasoning :).

On a second thought, filtering out the globally valid fields for all
drivers might require a lot of careful driver auditing.

What would be better and easier is to add those single fields to the
tracepoint which are relevant to the user (and which are more or less
globally valid for all drivers by inferrence) and leave the rest lumped
together in a single char *.

Which is basically what I'm suggesting for a couple of days now :-)

TP_PROTO(const unsigned int err_type,
const unsigned int mc_index,
const char *error_msg,
const char *label,
const char *location,
const char *detail)

and I'm really not sure about err_type - this is an edac-specific define
and it means nothing outside the kernel so its string representation
could very well could be merged with error_msg and we can drop the ( ? :
) ugliness in the tracepoint definition itself.

IOW:


TP_PROTO(const unsigned int mc_index,
const char *error_msg,
const char *label,
const char *location,
const char *detail)

Now I get all warm and cosy simply I'm staring at this :-).

Hmmm.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-21 15:35:01

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

Em 19-05-2012 06:26, Borislav Petkov escreveu:
> On Fri, May 18, 2012 at 11:12:11PM +0200, Borislav Petkov wrote:
>> On Fri, May 18, 2012 at 07:10:42PM +0000, Luck, Tony wrote:
>>>> That's why _each_ _driver_ will have its format and the userspace tools
>>>> parsing it will know about it!
>>>
>>> Sounds like a full employment program for parser writers.
>>>
>>> There are some interesting fields that should be common to all
>>> drivers ... so have twenty parsers that can handle:
>>>
>>> paddr: 0x1234
>>> PADDR: 0x1234
>>> Paddr = 0x1234
>>> Phys = 1234
>>> addr: 0x1234
>>> Address: 0x000000001234
>>>
>>> looks like a lot of make-work ... when the OS can standardize in the ABI
>>> that there is a 64-bit binary value that is the physical address of the
>>> error (and another 64-bit mask saying which, if any, bits are valid).

The EDAC API works with the address/address mask as 3 values: address page,
address offset inside the page and grain.

The values for address page/offset/grain and syndrome are mandatory on the EDAC
error calls. Drivers that can't obtain that fill with zero. EDAC prints those
values via the current API.

So, passing those values for the tracepoint makes sense.

In order to use a cleaner API, we may add a flags bitfield there, and add a few
flags to mark if the address (page, offset, grain) is valid and if the syndrome
is valid.

>>
>> Makes sense, I gotta say :)
>>
>>> So we should be looking for the set of always relevant values that
>>> can be kept explicitly separate in fields in TP_PROTO, and per-driver
>>> specific stuff (grain/syndrome??) bits that will have to go into the
>>> "details" string and require some driver specific user-mode parsing to
>>> use.
>>
>> How about we put all the values which are globally valid for all drivers
>> in separate fields and leave the "(...)" brackets for details where each
>> driver can put its own relevant stuff?

You're again concerned about the printk formatting, and not about the fields.
I don't care that much about that. Whatever you/Tony wants for the printk
is fine for me.

The TP_printk logic can change on newer patches, as printk output is not
stable in Linux Kernel.

>>
>> For the record, I very much like this reasoning :).
>
> On a second thought, filtering out the globally valid fields for all
> drivers might require a lot of careful driver auditing.

If you want to explode the driver details, then yes, but the address and
syndrome are part of the EDAC internal ABI.

Most drivers are capable of getting address and syndrome, for CE.
For UE, syndrome is generally not calculated.

Anyway, on _all_ internal ABI calls, when syndrome or address aren't
calculated, the drivers pass a value of 0 for them.

So, it is easy to add a logic inside the code that will rise a flag to
indicate userspace when those fields are filled or not. All the EDAC
core needs to do is to check if the value is 0 or not.

> What would be better and easier is to add those single fields to the
> tracepoint which are relevant to the user (and which are more or less
> globally valid for all drivers by inferrence) and leave the rest lumped
> together in a single char *.

Address and syndrome can be relevant for a proper edac handling daemon, as
some advanced error check mechanism could use those values to be able
to filter out some random interference from an error that is occurring at
the same memory grain, e. g. if a machine is producing an error (even being
very sparsed in time) at the same DRAM, and no errors outside it,
it is very likely that such DRAM chip is damaged.

>
> Which is basically what I'm suggesting for a couple of days now :-)
>
> TP_PROTO(const unsigned int err_type,
> const unsigned int mc_index,
> const char *error_msg,
> const char *label,
> const char *location,
> const char *detail)
>
> and I'm really not sure about err_type - this is an edac-specific define
> and it means nothing outside the kernel so its string representation
> could very well could be merged with error_msg and we can drop the ( ? :
> ) ugliness in the tracepoint definition itself.

All error management systems use a mandatory field for the error severity.
On all implementations I know, this is the first field at the management
base, and, together with the error message, the most important one.

Btw, even the Linux Kernel printk logic starts with severity (error level).

Regards,
Mauro

Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Mon, May 21, 2012 at 12:29:29PM -0300, Mauro Carvalho Chehab wrote:
> Em 19-05-2012 06:26, Borislav Petkov escreveu:
> > On Fri, May 18, 2012 at 11:12:11PM +0200, Borislav Petkov wrote:
> >> On Fri, May 18, 2012 at 07:10:42PM +0000, Luck, Tony wrote:
> >>>> That's why _each_ _driver_ will have its format and the userspace tools
> >>>> parsing it will know about it!
> >>>
> >>> Sounds like a full employment program for parser writers.
> >>>
> >>> There are some interesting fields that should be common to all
> >>> drivers ... so have twenty parsers that can handle:
> >>>
> >>> paddr: 0x1234
> >>> PADDR: 0x1234
> >>> Paddr = 0x1234
> >>> Phys = 1234
> >>> addr: 0x1234
> >>> Address: 0x000000001234
> >>>
> >>> looks like a lot of make-work ... when the OS can standardize in the ABI
> >>> that there is a 64-bit binary value that is the physical address of the
> >>> error (and another 64-bit mask saying which, if any, bits are valid).
>
> The EDAC API works with the address/address mask as 3 values: address page,
> address offset inside the page and grain.
>
> The values for address page/offset/grain and syndrome are mandatory on the EDAC
> error calls. Drivers that can't obtain that fill with zero. EDAC prints those
> values via the current API.
>
> So, passing those values for the tracepoint makes sense.
>
> In order to use a cleaner API, we may add a flags bitfield there, and add a few
> flags to mark if the address (page, offset, grain) is valid and if the syndrome
> is valid.

This is exactly the issue - we don't want to overcomplicate the
tracepoint! Simply output the single address value and userspace can cut
and split and paste it however it wants.

<snip some bullshit which I'm tired of addressing everytime>

> >> For the record, I very much like this reasoning :).
> >
> > On a second thought, filtering out the globally valid fields for all
> > drivers might require a lot of careful driver auditing.
>
> If you want to explode the driver details, then yes, but the address and
> syndrome are part of the EDAC internal ABI.
>
> Most drivers are capable of getting address and syndrome, for CE.
> For UE, syndrome is generally not calculated.
>
> Anyway, on _all_ internal ABI calls, when syndrome or address aren't
> calculated, the drivers pass a value of 0 for them.
>
> So, it is easy to add a logic inside the code that will rise a flag to
> indicate userspace when those fields are filled or not. All the EDAC
> core needs to do is to check if the value is 0 or not.

No, we're not going to do that! Adding a bit about the validity of a bit
is wrong design and is screaming B0RKED.

We're going to have single fields for EDAC-global valid values and leave
the driver-specific stuff lumped in one char * string.

Then, if a driver cannot specify sindrome or whatever, it simply DOESN'T
PUT IT IN THE STRING instead of adding validity bits.

[ … ]

> > Which is basically what I'm suggesting for a couple of days now :-)
> >
> > TP_PROTO(const unsigned int err_type,
> > const unsigned int mc_index,
> > const char *error_msg,
> > const char *label,
> > const char *location,
> > const char *detail)
> >
> > and I'm really not sure about err_type - this is an edac-specific define
> > and it means nothing outside the kernel so its string representation
> > could very well could be merged with error_msg and we can drop the ( ? :
> > ) ugliness in the tracepoint definition itself.
>
> All error management systems use a mandatory field for the error severity.

... I'm pretty sure the defines you're exporting are not the same as
your error management systems require. So some kind of mapping is
still needed. Which shows that you're going to need to massage that
information in userspace after all.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-21 17:28:36

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

Em 21-05-2012 13:00, Borislav Petkov escreveu:
> On Mon, May 21, 2012 at 12:29:29PM -0300, Mauro Carvalho Chehab wrote:
>> Em 19-05-2012 06:26, Borislav Petkov escreveu:
>>> On Fri, May 18, 2012 at 11:12:11PM +0200, Borislav Petkov wrote:
>>>> On Fri, May 18, 2012 at 07:10:42PM +0000, Luck, Tony wrote:
>>>>>> That's why _each_ _driver_ will have its format and the userspace tools
>>>>>> parsing it will know about it!
>>>>>
>>>>> Sounds like a full employment program for parser writers.
>>>>>
>>>>> There are some interesting fields that should be common to all
>>>>> drivers ... so have twenty parsers that can handle:
>>>>>
>>>>> paddr: 0x1234
>>>>> PADDR: 0x1234
>>>>> Paddr = 0x1234
>>>>> Phys = 1234
>>>>> addr: 0x1234
>>>>> Address: 0x000000001234
>>>>>
>>>>> looks like a lot of make-work ... when the OS can standardize in the ABI
>>>>> that there is a 64-bit binary value that is the physical address of the
>>>>> error (and another 64-bit mask saying which, if any, bits are valid).
>>
>> The EDAC API works with the address/address mask as 3 values: address page,
>> address offset inside the page and grain.
>>
>> The values for address page/offset/grain and syndrome are mandatory on the EDAC
>> error calls. Drivers that can't obtain that fill with zero. EDAC prints those
>> values via the current API.
>>
>> So, passing those values for the tracepoint makes sense.
>>
>> In order to use a cleaner API, we may add a flags bitfield there, and add a few
>> flags to mark if the address (page, offset, grain) is valid and if the syndrome
>> is valid.
>
> This is exactly the issue - we don't want to overcomplicate the
> tracepoint! Simply output the single address value and userspace can cut
> and split and paste it however it wants.

That's exactly what the latest version of this patch does.

> <snip some bullshit which I'm tired of addressing everytime>
>
>>>> For the record, I very much like this reasoning :).
>>>
>>> On a second thought, filtering out the globally valid fields for all
>>> drivers might require a lot of careful driver auditing.
>>
>> If you want to explode the driver details, then yes, but the address and
>> syndrome are part of the EDAC internal ABI.
>>
>> Most drivers are capable of getting address and syndrome, for CE.
>> For UE, syndrome is generally not calculated.
>>
>> Anyway, on _all_ internal ABI calls, when syndrome or address aren't
>> calculated, the drivers pass a value of 0 for them.
>>
>> So, it is easy to add a logic inside the code that will rise a flag to
>> indicate userspace when those fields are filled or not. All the EDAC
>> core needs to do is to check if the value is 0 or not.
>
> No, we're not going to do that! Adding a bit about the validity of a bit
> is wrong design and is screaming B0RKED.

Huh? Nobody said to add a bit to tell about the validity of another bit.

I'm saying is that a bitfield could be used to indicate that certain integer
values (and not bits) are filled or not. Btw, MCE has it: depending on the
MCE status registers, reading other registers may be needed to evaluate
fully the error.

> We're going to have single fields for EDAC-global valid values and leave
> the driver-specific stuff lumped in one char * string.
>
> Then, if a driver cannot specify sindrome or whatever, it simply DOESN'T
> PUT IT IN THE STRING instead of adding validity bits.
>
> [ … ]

Parsing strings that can be changed from Kernel versions and drivers is a
maintenance nightmare. Both I and Tony pointed it with different examples.

>>> Which is basically what I'm suggesting for a couple of days now :-)
>>>
>>> TP_PROTO(const unsigned int err_type,
>>> const unsigned int mc_index,
>>> const char *error_msg,
>>> const char *label,
>>> const char *location,
>>> const char *detail)
>>>
>>> and I'm really not sure about err_type - this is an edac-specific define
>>> and it means nothing outside the kernel so its string representation
>>> could very well could be merged with error_msg and we can drop the ( ? :
>>> ) ugliness in the tracepoint definition itself.
>>
>> All error management systems use a mandatory field for the error severity.
>
> ... I'm pretty sure the defines you're exporting are not the same as
> your error management systems require. So some kind of mapping is
> still needed. Which shows that you're going to need to massage that
> information in userspace after all.

Yes, field mapping is needed at the management systems, but this is not the
point. It is not Kernel's task to help mapping fields inside userspace apps,
but it is the task of a properly designed API to avoid userspace to guess about
what the kernel is reporting.

A printk/sprintk-designed API, as you're proposing, requires userspace to guess
what the events mean, with the help of a very careful designed parsing rules
that needs to take into account every single EDAC driver (as the (s)printk message
message will vary among them), and needs to be reviewed on every single new kernel
version, as the (s)printk output can change among different Kernel versions. Even
-stable versions might require changing the parsers, as a fixup patch might be
changing the (s)printk arguments.

This is a b0rked design.

Regards,
Mauro





Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Mon, May 21, 2012 at 01:40:08PM -0300, Mauro Carvalho Chehab wrote:
> That's exactly what the latest version of this patch does.

Really, where is the address field?

+ TP_PROTO(const unsigned int err_type,
+ const unsigned int mc_index,
+ const char *error_msg,
+ const char *label,
+ const char *location,
+ const char *core_detail,
+ const char *driver_detail),


[ … ]

> Huh? Nobody said to add a bit to tell about the validity of another bit.
>
> I'm saying is that a bitfield could be used to indicate that certain integer
> values (and not bits) are filled or not. Btw, MCE has it: depending on the
> MCE status registers, reading other registers may be needed to evaluate
> fully the error.

Bullshit, there'll be no bits showing the validity of other fields.

> > We're going to have single fields for EDAC-global valid values and leave
> > the driver-specific stuff lumped in one char * string.
> >
> > Then, if a driver cannot specify sindrome or whatever, it simply DOESN'T
> > PUT IT IN THE STRING instead of adding validity bits.
> >
> > [ … ]
>
> Parsing strings that can be changed from Kernel versions and drivers is a
> maintenance nightmare. Both I and Tony pointed it with different examples.

Bullshit, stop making up reasons to prove your point.

> >>> Which is basically what I'm suggesting for a couple of days now :-)
> >>>
> >>> TP_PROTO(const unsigned int err_type,
> >>> const unsigned int mc_index,
> >>> const char *error_msg,
> >>> const char *label,
> >>> const char *location,
> >>> const char *detail)
> >>>
> >>> and I'm really not sure about err_type - this is an edac-specific define
> >>> and it means nothing outside the kernel so its string representation
> >>> could very well could be merged with error_msg and we can drop the ( ? :
> >>> ) ugliness in the tracepoint definition itself.
> >>
> >> All error management systems use a mandatory field for the error severity.
> >
> > ... I'm pretty sure the defines you're exporting are not the same as
> > your error management systems require. So some kind of mapping is
> > still needed. Which shows that you're going to need to massage that
> > information in userspace after all.
>
> Yes, field mapping is needed at the management systems, but this is not the
> point. It is not Kernel's task to help mapping fields inside userspace apps,
> but it is the task of a properly designed API to avoid userspace to guess about
> what the kernel is reporting.
>
> A printk/sprintk-designed API, as you're proposing, requires userspace to guess
> what the events mean, with the help of a very careful designed parsing rules
> that needs to take into account every single EDAC driver (as the (s)printk message
> message will vary among them), and needs to be reviewed on every single new kernel
> version, as the (s)printk output can change among different Kernel versions. Even
> -stable versions might require changing the parsers, as a fixup patch might be
> changing the (s)printk arguments.

Enough with the crap already - now you're really desperately looking for
bogus arguments to prove your point.

We're going to have single fields for EDAC-global valid values and leave
the driver-specific stuff lumped in one char * string.


--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-22 03:05:01

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

<offensive comments skipped>

Em 21-05-2012 17:40, Borislav Petkov escreveu:
> On Mon, May 21, 2012 at 01:40:08PM -0300, Mauro Carvalho Chehab wrote:
>> That's exactly what the latest version of this patch does.
>
> Really, where is the address field?
>
> + TP_PROTO(const unsigned int err_type,
> + const unsigned int mc_index,
> + const char *error_msg,
> + const char *label,
> + const char *location,
> + const char *core_detail,
> + const char *driver_detail),
>
>
> [ … ]

The above is not the latest version of it. The latest version is:
http://www.spinics.net/lists/kernel/msg1343822.html

The definition there is:

+/*
+ * Default error mechanisms for Memory Controller errors (CE and UE)
+ */
+TRACE_EVENT(mc_event,
+
+ TP_PROTO(const unsigned int err_type,
+ const unsigned int mc_index,
+ const char *error_msg,
+ const char *label,
+ int layer0,
+ int layer1,
+ int layer2,
+ unsigned long pfn,
+ unsigned long offset,
+ unsigned long grain,
+ unsigned long syndrome,
+ const char *driver_detail),
+
+ TP_ARGS(err_type, mc_index, error_msg, label, layer0, layer1, layer2,
+ pfn, offset, grain, syndrome, driver_detail),

The address is there using the edac way to represent it (page, offset, grain).

> We're going to have single fields for EDAC-global valid values and leave
> the driver-specific stuff lumped in one char * string.

That's exactly what I said.

See above. driver_detail is a char string, with
the driver specific stuff. The EDAC global values are represented as-is
without being converted to integers.

Regards,
Mauro

Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Tue, May 22, 2012 at 12:04:48AM -0300, Mauro Carvalho Chehab wrote:
> +TRACE_EVENT(mc_event,
> +
> + TP_PROTO(const unsigned int err_type,
> + const unsigned int mc_index,
> + const char *error_msg,
> + const char *label,
> + int layer0,
> + int layer1,
> + int layer2,

Those are EDAC-internal layer representation, why are they exported to
userspace? Userspace needs only the location and label AFAICT.

If you export them to userspace, they need much more meaningful names -
layer{0,1,2} mean nothing outside of the kernel.

> + unsigned long pfn,
> + unsigned long offset,
> + unsigned long grain,

Why aren't those a single 'unsigned long address' since they all are
computed from it?

> + unsigned long syndrome,
> + const char *driver_detail),
> +
> + TP_ARGS(err_type, mc_index, error_msg, label, layer0, layer1, layer2,
> + pfn, offset, grain, syndrome, driver_detail),
>
> The address is there using the edac way to represent it (page, offset, grain).

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-22 10:18:36

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

Em 22-05-2012 06:28, Borislav Petkov escreveu:
> On Tue, May 22, 2012 at 12:04:48AM -0300, Mauro Carvalho Chehab wrote:
>> +TRACE_EVENT(mc_event,
>> +
>> + TP_PROTO(const unsigned int err_type,
>> + const unsigned int mc_index,
>> + const char *error_msg,
>> + const char *label,
>> + int layer0,
>> + int layer1,
>> + int layer2,
>
> Those are EDAC-internal layer representation, why are they exported to
> userspace? Userspace needs only the location and label AFAICT.

Those are not the EDAC internal layer representation. They're the physical
location of the DIMM or rank.

> If you export them to userspace, they need much more meaningful names -
> layer{0,1,2} mean nothing outside of the kernel.

Ok. Do you have a better naming suggestion?

What about layer0_pos, layer1_pos, layer2_pos?

>
>> + unsigned long pfn,
>> + unsigned long offset,
>> + unsigned long grain,
>
> Why aren't those a single 'unsigned long address' since they all are
> computed from it?

We can merge pfn and offset into "unsigned long address".

With regards to the grain, it is an address mask, written with a "short" way.
So, grain 32, for example, means:
ffff:ffff:ffff:fffe0

As the current EDAC API exports it as grain, IMO, it is better to keep it as-is,
but it won't be hard to do:
unsigned long mask = ((unsigned long) -1) && (1 - grain)

What do you think?

>> + unsigned long syndrome,
>> + const char *driver_detail),
>> +
>> + TP_ARGS(err_type, mc_index, error_msg, label, layer0, layer1, layer2,
>> + pfn, offset, grain, syndrome, driver_detail),
>>
>> The address is there using the edac way to represent it (page, offset, grain).
>

Regards,
Mauro

2012-05-22 13:05:14

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

On Tue, May 22, 2012 at 07:18:21AM -0300, Mauro Carvalho Chehab wrote:
> Em 22-05-2012 06:28, Borislav Petkov escreveu:
> > On Tue, May 22, 2012 at 12:04:48AM -0300, Mauro Carvalho Chehab wrote:
> >> +TRACE_EVENT(mc_event,
> >> +
> >> + TP_PROTO(const unsigned int err_type,
> >> + const unsigned int mc_index,
> >> + const char *error_msg,
> >> + const char *label,
> >> + int layer0,
> >> + int layer1,
> >> + int layer2,
> >
> > Those are EDAC-internal layer representation, why are they exported to
> > userspace? Userspace needs only the location and label AFAICT.
>
> Those are not the EDAC internal layer representation. They're the physical
> location of the DIMM or rank.

Ok, you've replaced the location char * with the layers.

> > If you export them to userspace, they need much more meaningful names -
> > layer{0,1,2} mean nothing outside of the kernel.
>
> Ok. Do you have a better naming suggestion?
>
> What about layer0_pos, layer1_pos, layer2_pos?

Actually, I'd like them to be called channel/rank/row or something. Having them
numbered I don't know which layer is the top layer (channel/branch/slot)
and the lowest (rank/csrow/...)

Maybe top_layer, middle_layer, lowest_layer? Or something like that...

> >
> >> + unsigned long pfn,
> >> + unsigned long offset,
> >> + unsigned long grain,
> >
> > Why aren't those a single 'unsigned long address' since they all are
> > computed from it?
>
> We can merge pfn and offset into "unsigned long address".

Just have a single "unsigned long address" field and userspace can pick
out the stuff it needs from it.

> With regards to the grain, it is an address mask, written with a "short" way.
> So, grain 32, for example, means:
> ffff:ffff:ffff:fffe0
>
> As the current EDAC API exports it as grain, IMO, it is better to keep it as-is,
> but it won't be hard to do:
> unsigned long mask = ((unsigned long) -1) && (1 - grain)
>
> What do you think?

Why are we even exporting grain actually with each tracepoint
invocation? This is the granularity of reported error in bytes, and it,
as such, is statically assigned to a value in each driver. Userspace can
certainly figure out that value in a different way.

But the more important question is: does the grain help us when handling
the error info in userspace?

It tells us that at this physical address with "grain" granularity we
had an error. So?

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551