Hi, Len,
This is used by APEI ERST and GHES, but it is a generic hardware
error reporting mechanism and can be used by other hardware error
reporting components such as EDAC, PCIe AER, Machine Check, etc.
The patchset is split out of the original APEI patchset to make it
explicit that this is a generic mechanism, not APEI-specific bits.
[PATCH 1/2] Generic hardware error reporting mechanism
[PATCH 2/2] Hardware error record persistent support
Best Regards,
Huang Ying
Normally, corrected hardware error records go through kernel
processing and are eventually logged to disk or network. But for
uncorrected errors, the system may panic directly for better error
containment, and disk or network is not usable in such a half-working
system. To avoid losing these valuable hardware error records, the
records are saved into some kind of simple persistent storage, such
as flash, before the panic, so that they can be read out after the
system reboots successfully.
Different kinds of simple persistent storage are available on
different platforms, so an abstract interface for persistent storage
is defined, and different implementations of the interface can be
registered.
Even after a successful reboot, the error records must be saved to
disk or network before being erased from the simple persistent
storage. Peek and clear operations on the simple persistent storage
are implemented to support the following transaction semantics:
- Peek an error record from simple persistent storage
- Save the error record into disk or network
- Sync the disk file or get ACK from network
- Clear the error record in simple persistent storage
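In user space, the transaction above maps onto the HERR_PERSIST_PEEK and
HERR_PERSIST_CLEAR ioctls added by this patch. The sketch below only
illustrates the ordering; the record layout (the id assumed to sit in the
first 8 bytes) and the fsync()-based "sync" step are assumptions, and the
real herr_record layout is defined in herror_record.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Mirrors struct herr_persist_buffer from <linux/herror.h> */
struct herr_persist_buffer {
	void *buf;
	unsigned int buf_size;
};

#define HERR_PERSIST_PEEK	_IOW('H', 1, struct herr_persist_buffer)
#define HERR_PERSIST_CLEAR	_IOW('H', 2, uint64_t)

/*
 * Drain records left in persistent storage from before the last
 * panic: peek, log, sync, then clear -- in that order, so a crash at
 * any point never loses a record.  err_fd is /dev/error/error,
 * log_fd is the daemon's on-disk log.
 */
static int drain_persistent_records(int err_fd, int log_fd)
{
	char rcd[4096];
	struct herr_persist_buffer pb = { .buf = rcd, .buf_size = sizeof(rcd) };
	long len;

	while ((len = ioctl(err_fd, HERR_PERSIST_PEEK, &pb)) > 0) {
		uint64_t record_id;

		/* Assumption: the record id is the first 8 bytes */
		memcpy(&record_id, rcd, sizeof(record_id));
		if (write(log_fd, rcd, len) != len)
			return -1;
		if (fsync(log_fd))		/* "sync the disk file" */
			return -1;
		if (ioctl(err_fd, HERR_PERSIST_CLEAR, &record_id))
			return -1;
	}
	return 0;
}
```

A real daemon would run this once at startup, before entering its normal
read loop on the same file descriptor.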
This patch is designed by Andi Kleen and Huang Ying.
Signed-off-by: Huang Ying <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
---
drivers/herror/Makefile | 2
drivers/herror/herr-core.c | 39 ++++++++-
drivers/herror/herr-internal.h | 12 ++
drivers/herror/herr-persist.c | 174 +++++++++++++++++++++++++++++++++++++++++
include/linux/Kbuild | 1
include/linux/herror.h | 48 +++++++++++
6 files changed, 271 insertions(+), 5 deletions(-)
create mode 100644 drivers/herror/herr-internal.h
create mode 100644 drivers/herror/herr-persist.c
--- a/drivers/herror/Makefile
+++ b/drivers/herror/Makefile
@@ -1 +1 @@
-obj-y += herr-core.o
+obj-y += herr-core.o herr-persist.o
--- a/drivers/herror/herr-core.c
+++ b/drivers/herror/herr-core.c
@@ -38,9 +38,9 @@
#include <linux/genalloc.h>
#include <linux/herror.h>
-#define HERR_NOTIFY_BIT 0
+#include "herr-internal.h"
-static unsigned long herr_flags;
+unsigned long herr_flags;
/*
* Record list management and error reporting
@@ -413,6 +413,7 @@ static ssize_t herr_mix_read(struct file
{
int rc;
static DEFINE_MUTEX(read_mutex);
+ u64 record_id;
if (*off != 0)
return -EINVAL;
@@ -420,7 +421,14 @@ static ssize_t herr_mix_read(struct file
rc = mutex_lock_interruptible(&read_mutex);
if (rc)
return rc;
+ rc = herr_persist_peek_user(&record_id, ubuf, usize);
+ if (rc > 0) {
+ herr_persist_clear(record_id);
+ goto out;
+ }
+
rc = herr_rcd_lists_read(ubuf, usize, &read_mutex);
+out:
mutex_unlock(&read_mutex);
return rc;
@@ -429,15 +437,40 @@ static ssize_t herr_mix_read(struct file
static unsigned int herr_mix_poll(struct file *file, poll_table *wait)
{
poll_wait(file, &herr_mix_wait, wait);
- if (!herr_rcd_lists_is_empty())
+ if (!herr_rcd_lists_is_empty() || !herr_persist_read_done())
return POLLIN | POLLRDNORM;
return 0;
}
+static long herr_mix_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
+{
+ void __user *p = (void __user *)arg;
+ int rc;
+ u64 record_id;
+ struct herr_persist_buffer buf;
+
+ switch (cmd) {
+ case HERR_PERSIST_PEEK:
+ rc = copy_from_user(&buf, p, sizeof(buf));
+ if (rc)
+ return -EFAULT;
+ return herr_persist_peek_user(&record_id, buf.buf,
+ buf.buf_size);
+ case HERR_PERSIST_CLEAR:
+ rc = copy_from_user(&record_id, p, sizeof(record_id));
+ if (rc)
+ return -EFAULT;
+ return herr_persist_clear(record_id);
+ default:
+ return -ENOTTY;
+ }
+}
+
static const struct file_operations herr_mix_dev_fops = {
.owner = THIS_MODULE,
.read = herr_mix_read,
.poll = herr_mix_poll,
+ .unlocked_ioctl = herr_mix_ioctl,
};
static int __init herr_mix_dev_init(void)
--- /dev/null
+++ b/drivers/herror/herr-internal.h
@@ -0,0 +1,12 @@
+#ifndef HERR_INTERNAL_H
+#define HERR_INTERNAL_H
+
+#define HERR_NOTIFY_BIT 0
+
+extern unsigned long herr_flags;
+
+int herr_persist_read_done(void);
+ssize_t herr_persist_peek_user(u64 *record_id, char __user *ercd,
+ size_t bufsiz);
+int herr_persist_clear(u64 record_id);
+#endif /* HERR_INTERNAL_H */
--- /dev/null
+++ b/drivers/herror/herr-persist.c
@@ -0,0 +1,174 @@
+/*
+ * Hardware error record persistent support
+ *
+ * Normally, corrected hardware error records go through kernel
+ * processing and are eventually logged to disk or network. But for
+ * uncorrected errors, the system may panic directly for better
+ * error containment, and disk or network is not usable in such a
+ * half-working system. To avoid losing these valuable hardware
+ * error records, the records are saved into some kind of simple
+ * persistent storage, such as flash, before the panic, so that
+ * they can be read out after the system reboots successfully.
+ *
+ * Copyright 2010 Intel Corp.
+ * Author: Huang Ying <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation;
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/rculist.h>
+#include <linux/mutex.h>
+
+#include <linux/herror.h>
+
+#include "herr-internal.h"
+
+/*
+ * Simple persistent storage provider list, herr_persists_mutex is
+ * used for writer side mutual exclusion, RCU is used to implement
+ * lock-less reader side.
+ */
+static LIST_HEAD(herr_persists);
+static DEFINE_MUTEX(herr_persists_mutex);
+
+int herr_persist_register(struct herr_persist *persist)
+{
+ if (!persist->peek_user)
+ return -EINVAL;
+ persist->read_done = 0;
+ if (mutex_lock_interruptible(&herr_persists_mutex))
+ return -EINTR;
+ list_add_rcu(&persist->list, &herr_persists);
+ mutex_unlock(&herr_persists_mutex);
+ /*
+ * There may be hardware error records from a previous
+ * boot in persistent storage; notify the user space
+ * error daemon to check.
+ */
+ set_bit(HERR_NOTIFY_BIT, &herr_flags);
+ herr_notify();
+ return 0;
+}
+EXPORT_SYMBOL_GPL(herr_persist_register);
+
+void herr_persist_unregister(struct herr_persist *persist)
+{
+ mutex_lock(&herr_persists_mutex);
+ list_del_rcu(&persist->list);
+ mutex_unlock(&herr_persists_mutex);
+ synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(herr_persist_unregister);
+
+/* Can be used in atomic context including NMI */
+int herr_persist_write(const struct herr_record *ercd)
+{
+ struct herr_persist *persist;
+ int rc = -ENODEV;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(persist, &herr_persists, list) {
+ if (!persist->write)
+ continue;
+ rc = persist->write(ercd);
+ if (!rc)
+ break;
+ }
+ rcu_read_unlock();
+ return rc;
+}
+EXPORT_SYMBOL_GPL(herr_persist_write);
+
+int herr_persist_read_done(void)
+{
+ struct herr_persist *persist;
+ int rc = 1;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(persist, &herr_persists, list) {
+ if (!persist->read_done) {
+ rc = 0;
+ break;
+ }
+ }
+ rcu_read_unlock();
+ return rc;
+}
+
+/* Read the next error record from persistent storage without removing it */
+ssize_t herr_persist_peek_user(u64 *record_id, char __user *ercd,
+ size_t bufsiz)
+{
+ struct herr_persist *persist;
+ ssize_t rc = 0;
+
+ if (mutex_lock_interruptible(&herr_persists_mutex))
+ return -EINTR;
+ list_for_each_entry(persist, &herr_persists, list) {
+ if (persist->read_done)
+ continue;
+ rc = persist->peek_user(record_id, ercd, bufsiz);
+ if (rc > 0)
+ break;
+ else if (rc != -EINTR && rc != -EAGAIN && rc != -EINVAL)
+ persist->read_done = 1;
+ }
+ mutex_unlock(&herr_persists_mutex);
+ return rc;
+}
+
+/* Clear the specified error record from persistent storage */
+int herr_persist_clear(u64 record_id)
+{
+ struct herr_persist *persist;
+ int rc = -ENOENT;
+
+ if (mutex_lock_interruptible(&herr_persists_mutex))
+ return -EINTR;
+ list_for_each_entry(persist, &herr_persists, list) {
+ if (!persist->clear)
+ continue;
+ rc = persist->clear(record_id);
+ if (!rc)
+ break;
+ /*
+ * Failed to clear with a permanent error; mark
+ * read_done so we do not keep returning this record
+ */
+ else if (rc != -EINTR && rc != -EAGAIN && rc != -ENOENT)
+ persist->read_done = 1;
+ }
+ mutex_unlock(&herr_persists_mutex);
+ return rc;
+}
+
+static int herr_persist_record(struct herr_record *ercd, void *data)
+{
+ int *severity = data;
+
+ if (ercd->severity == *severity)
+ return herr_persist_write(ercd);
+ return 0;
+}
+
+void herr_persist_all_records(void)
+{
+ int severity;
+
+ for (severity = HERR_SEV_FATAL; severity >= HERR_SEV_NONE; severity--)
+ herr_for_each_record(herr_persist_record, &severity);
+}
+EXPORT_SYMBOL_GPL(herr_persist_all_records);
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -142,6 +142,7 @@ header-y += hdlc.h
header-y += hdlcdrv.h
header-y += hdreg.h
header-y += herror_record.h
+header-y += herror.h
header-y += hid.h
header-y += hiddev.h
header-y += hidraw.h
--- a/include/linux/herror.h
+++ b/include/linux/herror.h
@@ -1,10 +1,22 @@
#ifndef LINUX_HERROR_H
#define LINUX_HERROR_H
+#include <linux/ioctl.h>
+#include <linux/herror_record.h>
+
+struct herr_persist_buffer {
+ void __user *buf;
+ unsigned int buf_size;
+};
+
+#define HERR_PERSIST_PEEK _IOW('H', 1, struct herr_persist_buffer)
+#define HERR_PERSIST_CLEAR _IOW('H', 2, u64)
+
+#ifdef __KERNEL__
+
#include <linux/types.h>
#include <linux/list.h>
#include <linux/device.h>
-#include <linux/herror_record.h>
/*
* Hardware error reporting
@@ -32,4 +44,38 @@ void herr_notify(void);
/* Traverse all error records not consumed by user space */
typedef int (*herr_traverse_func_t)(struct herr_record *ercd, void *data);
int herr_for_each_record(herr_traverse_func_t func, void *data);
+
+
+/*
+ * Simple Persistent Storage
+ */
+
+struct herr_persist;
+/* Put an error record into simple persistent storage */
+int herr_persist_write(const struct herr_record *ercd);
+/* Save all error records not yet consumed in persistent storage */
+void herr_persist_all_records(void);
+
+/*
+ * Simple Persistent Storage Provider Management
+ */
+struct herr_persist {
+ struct list_head list;
+ char *name;
+ unsigned int read_done:1;
+ /* Write an error record into storage, must be NMI-safe */
+ int (*write)(const struct herr_record *ercd);
+ /*
+ * Read an error record from storage into user space without
+ * removing it; HERR_RCD_PERSIST must be set in the record flags
+ */
+ ssize_t (*peek_user)(u64 *record_id, char __user *ubuf, size_t usize);
+ /* Clear an error record */
+ int (*clear)(u64 record_id);
+};
+
+/* Register (un-register) simple persistent storage provider */
+int herr_persist_register(struct herr_persist *persist);
+void herr_persist_unregister(struct herr_persist *persist);
+#endif
#endif
There are many hardware error detecting and reporting components in
the kernel, including x86 Machine Check, PCIe AER, EDAC, APEI GHES,
etc. Each has its own error reporting implementation, including a
user space interface, an error record format, an in-kernel buffer,
etc. This patch provides a generic hardware error reporting mechanism
to reduce the duplicated effort and to add more common services.
A highly extensible generic hardware error record data structure is
defined to accommodate hardware error information from various
hardware error sources. The overall structure of an error record is
as follows:
-----------------------------------------------------------------
| rcd hdr | sec 1 hdr | sec 1 data | sec 2 hdr | sec 2 data | ...
-----------------------------------------------------------------
Several error sections can be incorporated into one error record to
accumulate information from multiple hardware components related to
one error. For example, for an error on a device behind a PCIe
bridge, it is useful to record error information from both the PCIe
bridge and the PCIe device. Multiple sections can also be used to
hold both the cooked and the raw error information, so that the
abstract information is provided by the cooked section while no
information is lost, because the raw section is provided too.
There are "revision" (rev) and "length" fields in the record header
and "type" and "length" fields in the section header, so the user
space error daemon can skip unrecognized error records or error
sections. This lets an old version of the error daemon work with a
newer kernel. New error section types can be added to support new
error types and error sources.
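As a sketch of what this length-based skipping allows, the following
walks one record and counts the sections a daemon understands. The two
header layouts here are assumptions for illustration only; the real
ones are defined in herror_record.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed minimal header layouts, for illustration only */
struct rcd_hdr { uint16_t rev; uint16_t length; };	/* whole record */
struct sec_hdr { uint16_t type; uint16_t length; };	/* one section */

#define SEC_TYPE_KNOWN	1

/*
 * Count the sections in one record that the daemon understands,
 * skipping sections with unknown types via their length field.
 */
static int count_known_sections(const char *rcd)
{
	struct rcd_hdr rh;
	size_t off = sizeof(rh);
	int known = 0;

	memcpy(&rh, rcd, sizeof(rh));
	while (off + sizeof(struct sec_hdr) <= rh.length) {
		struct sec_hdr sh;

		memcpy(&sh, rcd + off, sizeof(sh));
		if (sh.type == SEC_TYPE_KNOWN)
			known++;
		off += sizeof(sh) + sh.length;	/* skip data either way */
	}
	return known;
}
```

The same walk, applied at the record level with the record header's
length field, is what lets an old daemon step over whole records it
does not recognize.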
The hardware error reporting mechanism designed by this patch
integrates well with the kernel device model. struct dev_herr_info
is defined and pointed to by the "error" field of struct device; it
holds error reporting related information for each device. One sysfs
directory "error" is created for each hardware error reporting
device, with files for error reporting statistics and control inside
it. For example, the "error" directory for APEI GHES is as follows:
/sys/devices/platform/GHES.0/error/logs
/sys/devices/platform/GHES.0/error/overflows
/sys/devices/platform/GHES.0/error/throttles
Here "logs" is the number of error records logged; "throttles" is
the number of error records not logged because the reporting rate was
too high; "overflows" is the number of error records not logged
because no space was available.
Not all devices report errors, so struct dev_herr_info and the sysfs
directory/files are only allocated/created for devices that
explicitly enable error reporting. So to enumerate the error sources
of the system, you just need to look for an "error" directory in each
device directory under /sys/devices.
One device file (/dev/error/error), which mixes error records from
all hardware error reporting devices, is created to convey error
records from kernel space to user space. Because hardware devices are
being dealt with, a device file is the most natural way to do that.
Because hardware error reporting should not hurt system performance,
the throughput of the interface is kept at a low level (done by the
user space error daemon), so an ordinary "read" is sufficient from a
performance point of view.
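On the daemon side this reduces to an ordinary poll()/read() loop; one
read() may return several concatenated records, each delimited by the
length field in its header. A minimal sketch, with the length-field
offset an assumption (the real layout is in herror_record.h):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Assumption: record length is a 16-bit field at offset 2 of the header */
static uint16_t rcd_len(const char *p)
{
	uint16_t len;

	memcpy(&len, p + 2, sizeof(len));
	return len;
}

/* Example no-op per-record callback */
static void count_cb(const char *rcd, uint16_t len)
{
	(void)rcd;
	(void)len;
}

/* Read once from the error device and invoke cb for each whole record */
static int read_records(int fd, void (*cb)(const char *rcd, uint16_t len))
{
	char buf[4096];
	ssize_t n = read(fd, buf, sizeof(buf));
	int count = 0;

	for (ssize_t off = 0; n > 0 && off + 4 <= n; ) {
		uint16_t len = rcd_len(buf + off);

		if (len < 4 || off + len > n)
			break;		/* truncated or garbled record */
		cb(buf + off, len);
		count++;
		off += len;		/* next record starts right after */
	}
	return count;
}
```

A real daemon would block in poll() waiting for POLLIN on the device
before each read; the function itself works on any readable fd.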
The patch also provides common services for hardware error reporting
devices.
A lock-less hardware error record allocator is provided, so for
hardware errors that can be ignored (such as corrected errors), there
is no need to pre-allocate the error record or allocate it on the
stack. Because the probability of two hardware parts going wrong
simultaneously is very small, one big unified memory pool for
hardware errors is better than one memory pool or buffer per device.
After filling in all necessary fields of the hardware error record,
reporting is quite straightforward: just call herr_record_report,
with the error record itself and the corresponding struct device as
parameters.
Hardware errors may burst; for example, the same hardware error may
be reported at a high rate within a short interval. This can use up
all the pre-allocated memory for error reporting, so that other
hardware errors, from the same or a different hardware device, cannot
be logged. To deal with this issue, a throttling algorithm is
implemented: the logging rate for errors from one hardware error
device is throttled based on the available pre-allocated memory for
error reporting. In this way we can log as many kinds of errors from
as many devices as possible.
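Concretely, the throttle arithmetic in this patch leaves records
unthrottled while less than a third of the per-CPU pool is used, then
grows the minimum interval between records exponentially with pool
usage. A user-space model of the same computation, with the constants
copied from the patch:

```c
#include <assert.h>
#include <stdint.h>

#define BASE_INTVL	1000ULL		/* ns, HERR_THROTTLE_BASE_INTVL */
#define MAX_RATIO	21		/* HERR_THROTTLE_MAX_RATIO */
#define SPARE_RATIO	3		/* HERR_THROTTLE_SPARE_RATIO */

/*
 * Minimum nanoseconds required between two records, given pool usage.
 * Returns 0 when usage is low enough that no throttling applies.
 */
static uint64_t min_interval_ns(unsigned int used, unsigned int size)
{
	unsigned int ratio;

	if (SPARE_RATIO * used < size)	/* less than 1/3 used: no limit */
		return 0;
	ratio = (used * SPARE_RATIO - size) * MAX_RATIO;
	ratio = ratio / (size * SPARE_RATIO - size) + 1;
	return (1ULL << ratio) * BASE_INTVL;
}
```

At one-third usage the minimum interval is just 2 us; with a full pool
it reaches (1 << 22) us, i.e. a few seconds between records.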
This patch is designed by Andi Kleen and Huang Ying.
Signed-off-by: Huang Ying <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
---
drivers/Kconfig | 2
drivers/Makefile | 1
drivers/base/Makefile | 1
drivers/base/herror.c | 98 ++++++++
drivers/herror/Kconfig | 5
drivers/herror/Makefile | 1
drivers/herror/herr-core.c | 488 ++++++++++++++++++++++++++++++++++++++++++
include/linux/Kbuild | 1
include/linux/device.h | 14 +
include/linux/herror.h | 35 +++
include/linux/herror_record.h | 100 ++++++++
kernel/Makefile | 1
12 files changed, 747 insertions(+)
create mode 100644 drivers/base/herror.c
create mode 100644 drivers/herror/Kconfig
create mode 100644 drivers/herror/Makefile
create mode 100644 drivers/herror/herr-core.c
create mode 100644 include/linux/herror.h
create mode 100644 include/linux/herror_record.h
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
source "drivers/staging/Kconfig"
source "drivers/platform/Kconfig"
+
+source "drivers/herror/Kconfig"
endmenu
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -115,3 +115,4 @@ obj-$(CONFIG_VLYNQ) += vlynq/
obj-$(CONFIG_STAGING) += staging/
obj-y += platform/
obj-y += ieee802154/
+obj-$(CONFIG_HERR_CORE) += herror/
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -18,6 +18,7 @@ ifeq ($(CONFIG_SYSFS),y)
obj-$(CONFIG_MODULES) += module.o
endif
obj-$(CONFIG_SYS_HYPERVISOR) += hypervisor.o
+obj-$(CONFIG_HERR_CORE) += herror.o
ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
--- /dev/null
+++ b/drivers/base/herror.c
@@ -0,0 +1,98 @@
+/*
+ * Hardware error reporting related functions
+ *
+ * Copyright 2010 Intel Corp.
+ * Author: Huang Ying <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation;
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+
+#define HERR_COUNTER_ATTR(_name) \
+ static ssize_t herr_##_name##_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+ { \
+ int counter; \
+ \
+ counter = atomic_read(&dev->error->_name); \
+ return sprintf(buf, "%d\n", counter); \
+ } \
+ static ssize_t herr_##_name##_store(struct device *dev, \
+ struct device_attribute *attr, \
+ const char *buf, \
+ size_t count) \
+ { \
+ atomic_set(&dev->error->_name, 0); \
+ return count; \
+ } \
+ static struct device_attribute herr_attr_##_name = \
+ __ATTR(_name, 0600, herr_##_name##_show, \
+ herr_##_name##_store)
+
+HERR_COUNTER_ATTR(logs);
+HERR_COUNTER_ATTR(overflows);
+HERR_COUNTER_ATTR(throttles);
+
+static struct attribute *herr_attrs[] = {
+ &herr_attr_logs.attr,
+ &herr_attr_overflows.attr,
+ &herr_attr_throttles.attr,
+ NULL,
+};
+
+static struct attribute_group herr_attr_group = {
+ .name = "error",
+ .attrs = herr_attrs,
+};
+
+static void device_herr_init(struct device *dev)
+{
+ atomic_set(&dev->error->logs, 0);
+ atomic_set(&dev->error->overflows, 0);
+ atomic_set(&dev->error->throttles, 0);
+ atomic64_set(&dev->error->timestamp, 0);
+}
+
+int device_enable_error_reporting(struct device *dev)
+{
+ int rc;
+
+ BUG_ON(dev->error);
+ dev->error = kzalloc(sizeof(*dev->error), GFP_KERNEL);
+ if (!dev->error)
+ return -ENOMEM;
+ device_herr_init(dev);
+ rc = sysfs_create_group(&dev->kobj, &herr_attr_group);
+ if (rc)
+ goto err;
+ return 0;
+err:
+ kfree(dev->error);
+ dev->error = NULL;
+ return rc;
+}
+EXPORT_SYMBOL_GPL(device_enable_error_reporting);
+
+void device_disable_error_reporting(struct device *dev)
+{
+ if (dev->error) {
+ sysfs_remove_group(&dev->kobj, &herr_attr_group);
+ kfree(dev->error);
+ }
+}
+EXPORT_SYMBOL_GPL(device_disable_error_reporting);
--- /dev/null
+++ b/drivers/herror/Kconfig
@@ -0,0 +1,5 @@
+config HERR_CORE
+ bool "Hardware error reporting"
+ depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
+ select LLIST
+ select GENERIC_ALLOCATOR
--- /dev/null
+++ b/drivers/herror/Makefile
@@ -0,0 +1 @@
+obj-y += herr-core.o
--- /dev/null
+++ b/drivers/herror/herr-core.c
@@ -0,0 +1,488 @@
+/*
+ * Generic hardware error reporting support
+ *
+ * This file provides some common services for hardware error
+ * reporting, including hardware error record lock-less allocator,
+ * error reporting mechanism, user space interface etc.
+ *
+ * Copyright 2010 Intel Corp.
+ * Author: Huang Ying <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation;
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/rculist.h>
+#include <linux/mutex.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/trace_clock.h>
+#include <linux/uaccess.h>
+#include <linux/poll.h>
+#include <linux/ratelimit.h>
+#include <linux/nmi.h>
+#include <linux/llist.h>
+#include <linux/genalloc.h>
+#include <linux/herror.h>
+
+#define HERR_NOTIFY_BIT 0
+
+static unsigned long herr_flags;
+
+/*
+ * Record list management and error reporting
+ */
+
+struct herr_node {
+ struct llist_node llist;
+ struct herr_record ercd __attribute__((aligned(HERR_MIN_ALIGN)));
+};
+
+#define HERR_NODE_LEN(rcd_len) \
+ ((rcd_len) + sizeof(struct herr_node) - sizeof(struct herr_record))
+
+#define HERR_MIN_ALLOC_ORDER HERR_MIN_ALIGN_ORDER
+#define HERR_CHUNKS_PER_CPU 2
+#define HERR_RCD_LIST_NUM 2
+
+struct herr_rcd_lists {
+ struct llist_head *write;
+ struct llist_head *read;
+ struct llist_head heads[HERR_RCD_LIST_NUM];
+};
+
+static DEFINE_PER_CPU(struct herr_rcd_lists, herr_rcd_lists);
+
+static DEFINE_PER_CPU(struct gen_pool *, herr_gen_pool);
+
+static void herr_rcd_lists_init(void)
+{
+ int cpu, i;
+ struct herr_rcd_lists *lists;
+
+ for_each_possible_cpu(cpu) {
+ lists = per_cpu_ptr(&herr_rcd_lists, cpu);
+ for (i = 0; i < HERR_RCD_LIST_NUM; i++)
+ init_llist_head(&lists->heads[i]);
+ lists->write = &lists->heads[0];
+ lists->read = &lists->heads[1];
+ }
+}
+
+static void herr_pool_fini(void)
+{
+ struct gen_pool *pool;
+ struct gen_pool_chunk *chunk;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ pool = per_cpu(herr_gen_pool, cpu);
+ gen_pool_for_each_chunk(chunk, pool)
+ free_page(chunk->start_addr);
+ gen_pool_destroy(pool);
+ }
+}
+
+static int herr_pool_init(void)
+{
+ struct gen_pool **pool;
+ int cpu, rc, nid, i;
+ unsigned long addr;
+
+ for_each_possible_cpu(cpu) {
+ pool = per_cpu_ptr(&herr_gen_pool, cpu);
+ rc = -ENOMEM;
+ nid = cpu_to_node(cpu);
+ *pool = gen_pool_create(HERR_MIN_ALLOC_ORDER, nid);
+ if (!*pool)
+ goto err_pool_fini;
+ for (i = 0; i < HERR_CHUNKS_PER_CPU; i++) {
+ rc = -ENOMEM;
+ addr = __get_free_page(GFP_KERNEL);
+ if (!addr)
+ goto err_pool_fini;
+ rc = gen_pool_add(*pool, addr, PAGE_SIZE, nid);
+ if (rc)
+ goto err_pool_fini;
+ }
+ }
+
+ return 0;
+err_pool_fini:
+ herr_pool_fini();
+ return rc;
+}
+
+/* Max interval: about 2 seconds */
+#define HERR_THROTTLE_BASE_INTVL NSEC_PER_USEC
+#define HERR_THROTTLE_MAX_RATIO 21
+#define HERR_THROTTLE_MAX_INTVL \
+ ((1ULL << HERR_THROTTLE_MAX_RATIO) * HERR_THROTTLE_BASE_INTVL)
+/*
+ * Below this used/size ratio the pool has spare room and the
+ * interval between error reports is ignored; above it, the minimal
+ * required interval grows exponentially up to the max interval.
+ */
+#define HERR_THROTTLE_SPARE_RATIO 3
+
+static int herr_throttle(struct device *dev)
+{
+ struct gen_pool *pool;
+ unsigned long long last, now, min_intvl;
+ unsigned int size, used, ratio;
+
+ pool = __get_cpu_var(herr_gen_pool);
+ size = gen_pool_size(pool);
+ used = size - gen_pool_avail(pool);
+ if (HERR_THROTTLE_SPARE_RATIO * used < size)
+ goto pass;
+ now = trace_clock_local();
+ last = atomic64_read(&dev->error->timestamp);
+ ratio = (used * HERR_THROTTLE_SPARE_RATIO - size) * \
+ HERR_THROTTLE_MAX_RATIO;
+ ratio = ratio / (size * HERR_THROTTLE_SPARE_RATIO - size) + 1;
+ min_intvl = (1ULL << ratio) * HERR_THROTTLE_BASE_INTVL;
+ if ((long long)(now - last) > min_intvl)
+ goto pass;
+ atomic_inc(&dev->error->throttles);
+ return 0;
+pass:
+ return 1;
+}
+
+static u64 herr_record_next_id(void)
+{
+ static atomic64_t seq = ATOMIC64_INIT(0);
+
+ if (!atomic64_read(&seq))
+ atomic64_set(&seq, (u64)get_seconds() << 32);
+
+ return atomic64_inc_return(&seq);
+}
+
+void herr_record_init(struct herr_record *ercd)
+{
+ ercd->flags = 0;
+ ercd->rev = HERR_RCD_REV1_0;
+ ercd->id = herr_record_next_id();
+ ercd->timestamp = trace_clock_local();
+}
+EXPORT_SYMBOL_GPL(herr_record_init);
+
+struct herr_record *herr_record_alloc(unsigned int len, struct device *dev,
+ unsigned int flags)
+{
+ struct gen_pool *pool;
+ struct herr_node *enode;
+ struct herr_record *ercd = NULL;
+
+ BUG_ON(!dev->error);
+ preempt_disable();
+ if (!(flags & HERR_ALLOC_NO_THROTTLE)) {
+ if (!herr_throttle(dev)) {
+ preempt_enable_no_resched();
+ return NULL;
+ }
+ }
+
+ pool = __get_cpu_var(herr_gen_pool);
+ enode = (struct herr_node *)gen_pool_alloc(pool, HERR_NODE_LEN(len));
+ if (enode) {
+ ercd = &enode->ercd;
+ herr_record_init(ercd);
+ ercd->length = len;
+
+ atomic64_set(&dev->error->timestamp, trace_clock_local());
+ atomic_inc(&dev->error->logs);
+ } else
+ atomic_inc(&dev->error->overflows);
+ preempt_enable_no_resched();
+
+ return ercd;
+}
+EXPORT_SYMBOL_GPL(herr_record_alloc);
+
+int herr_record_report(struct herr_record *ercd, struct device *dev)
+{
+ struct herr_rcd_lists *lists;
+ struct herr_node *enode;
+
+ preempt_disable();
+ lists = this_cpu_ptr(&herr_rcd_lists);
+ enode = container_of(ercd, struct herr_node, ercd);
+ llist_add(&enode->llist, lists->write);
+ preempt_enable_no_resched();
+
+ set_bit(HERR_NOTIFY_BIT, &herr_flags);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(herr_record_report);
+
+void herr_record_free(struct herr_record *ercd)
+{
+ struct herr_node *enode;
+ struct gen_pool *pool;
+
+ enode = container_of(ercd, struct herr_node, ercd);
+ pool = get_cpu_var(herr_gen_pool);
+ gen_pool_free(pool, (unsigned long)enode,
+ HERR_NODE_LEN(enode->ercd.length));
+ put_cpu_var(pool);
+}
+EXPORT_SYMBOL_GPL(herr_record_free);
+
+/*
+ * The low 16 bits are the freeze count and the high 16 bits the thaw
+ * count. If they are not equal, someone is freezing the reader
+ */
+static u32 herr_freeze_thaw;
+
+/*
+ * Stop the reader from consuming error records, so that the error
+ * records can be checked safely in kernel space.
+ */
+static void herr_freeze_reader(void)
+{
+ u32 old, new;
+
+ do {
+ new = old = herr_freeze_thaw;
+ new = ((new + 1) & 0xffff) | (old & 0xffff0000);
+ } while (cmpxchg(&herr_freeze_thaw, old, new) != old);
+}
+
+static void herr_thaw_reader(void)
+{
+ u32 old, new;
+
+ do {
+ old = herr_freeze_thaw;
+ new = old + 0x10000;
+ } while (cmpxchg(&herr_freeze_thaw, old, new) != old);
+}
+
+static int herr_reader_is_frozen(void)
+{
+ u32 freeze_thaw = herr_freeze_thaw;
+ return (freeze_thaw & 0xffff) != (freeze_thaw >> 16);
+}
+
+int herr_for_each_record(herr_traverse_func_t func, void *data)
+{
+ int i, cpu, rc = 0;
+ struct herr_rcd_lists *lists;
+ struct herr_node *enode;
+
+ preempt_disable();
+ herr_freeze_reader();
+ for_each_possible_cpu(cpu) {
+ lists = per_cpu_ptr(&herr_rcd_lists, cpu);
+ for (i = 0; i < HERR_RCD_LIST_NUM; i++) {
+ struct llist_head *head = &lists->heads[i];
+ llist_for_each_entry(enode, head->first, llist) {
+ rc = func(&enode->ercd, data);
+ if (rc)
+ goto out;
+ }
+ }
+ }
+out:
+ herr_thaw_reader();
+ preempt_enable_no_resched();
+ return rc;
+}
+EXPORT_SYMBOL_GPL(herr_for_each_record);
+
+static ssize_t herr_rcd_lists_read(char __user *ubuf, size_t usize,
+ struct mutex *read_mutex)
+{
+ int cpu, rc = 0, read;
+ struct herr_rcd_lists *lists;
+ struct gen_pool *pool;
+ ssize_t len, rsize = 0;
+ struct herr_node *enode;
+ struct llist_head *old_read;
+ struct llist_node *to_read;
+
+ do {
+ read = 0;
+ for_each_possible_cpu(cpu) {
+ lists = per_cpu_ptr(&herr_rcd_lists, cpu);
+ pool = per_cpu(herr_gen_pool, cpu);
+ if (llist_empty(lists->read)) {
+ if (llist_empty(lists->write))
+ continue;
+ /*
+ * Error records are output in batches, so older
+ * error records are output before newer ones.
+ */
+ old_read = lists->read;
+ lists->read = lists->write;
+ lists->write = old_read;
+ }
+ rc = rsize ? 0 : -EBUSY;
+ if (herr_reader_is_frozen())
+ goto out;
+ to_read = llist_del_first(lists->read);
+ if (herr_reader_is_frozen())
+ goto out_readd;
+ enode = llist_entry(to_read, struct herr_node, llist);
+ len = enode->ercd.length;
+ rc = rsize ? 0 : -EINVAL;
+ if (len > usize - rsize)
+ goto out_readd;
+ rc = -EFAULT;
+ if (copy_to_user(ubuf + rsize, &enode->ercd, len))
+ goto out_readd;
+ gen_pool_free(pool, (unsigned long)enode,
+ HERR_NODE_LEN(len));
+ rsize += len;
+ read = 1;
+ }
+ if (need_resched()) {
+ mutex_unlock(read_mutex);
+ cond_resched();
+ mutex_lock(read_mutex);
+ }
+ } while (read);
+ rc = 0;
+out:
+ return rc ? rc : rsize;
+out_readd:
+ llist_add(to_read, lists->read);
+ goto out;
+}
+
+static int herr_rcd_lists_is_empty(void)
+{
+ int cpu, i;
+ struct herr_rcd_lists *lists;
+
+ for_each_possible_cpu(cpu) {
+ lists = per_cpu_ptr(&herr_rcd_lists, cpu);
+ for (i = 0; i < HERR_RCD_LIST_NUM; i++) {
+ if (!llist_empty(&lists->heads[i]))
+ return 0;
+ }
+ }
+ return 1;
+}
+
+
+/*
+ * Hardware Error Mix Reporting Device
+ */
+
+static int herr_major;
+static DECLARE_WAIT_QUEUE_HEAD(herr_mix_wait);
+
+static char *herr_devnode(struct device *dev, mode_t *mode)
+{
+ return kasprintf(GFP_KERNEL, "error/%s", dev_name(dev));
+}
+
+struct class herr_class = {
+ .name = "error",
+ .devnode = herr_devnode,
+};
+EXPORT_SYMBOL_GPL(herr_class);
+
+void herr_notify(void)
+{
+ if (test_and_clear_bit(HERR_NOTIFY_BIT, &herr_flags))
+ wake_up_interruptible(&herr_mix_wait);
+}
+EXPORT_SYMBOL_GPL(herr_notify);
+
+static ssize_t herr_mix_read(struct file *filp, char __user *ubuf,
+ size_t usize, loff_t *off)
+{
+ int rc;
+ static DEFINE_MUTEX(read_mutex);
+
+ if (*off != 0)
+ return -EINVAL;
+
+ rc = mutex_lock_interruptible(&read_mutex);
+ if (rc)
+ return rc;
+ rc = herr_rcd_lists_read(ubuf, usize, &read_mutex);
+ mutex_unlock(&read_mutex);
+
+ return rc;
+}
+
+static unsigned int herr_mix_poll(struct file *file, poll_table *wait)
+{
+ poll_wait(file, &herr_mix_wait, wait);
+ if (!herr_rcd_lists_is_empty())
+ return POLLIN | POLLRDNORM;
+ return 0;
+}
+
+static const struct file_operations herr_mix_dev_fops = {
+ .owner = THIS_MODULE,
+ .read = herr_mix_read,
+ .poll = herr_mix_poll,
+};
+
+static int __init herr_mix_dev_init(void)
+{
+ struct device *dev;
+ dev_t devt;
+
+ devt = MKDEV(herr_major, 0);
+ dev = device_create(&herr_class, NULL, devt, NULL, "error");
+ if (IS_ERR(dev))
+ return PTR_ERR(dev);
+
+ return 0;
+}
+device_initcall(herr_mix_dev_init);
+
+static int __init herr_core_init(void)
+{
+ int rc;
+
+ BUILD_BUG_ON(sizeof(struct herr_node) % HERR_MIN_ALIGN);
+ BUILD_BUG_ON(sizeof(struct herr_record) % HERR_MIN_ALIGN);
+ BUILD_BUG_ON(sizeof(struct herr_section) % HERR_MIN_ALIGN);
+
+ herr_rcd_lists_init();
+
+ rc = herr_pool_init();
+ if (rc)
+ goto err;
+
+ rc = class_register(&herr_class);
+ if (rc)
+ goto err_free_pool;
+
+ rc = herr_major = register_chrdev(0, "error", &herr_mix_dev_fops);
+ if (rc < 0)
+ goto err_free_class;
+
+ return 0;
+err_free_class:
+ class_unregister(&herr_class);
+err_free_pool:
+ herr_pool_fini();
+err:
+ return rc;
+}
+/* Initialize data structures used by device drivers, hence subsys_initcall */
+subsys_initcall(herr_core_init);
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -141,6 +141,7 @@ header-y += gigaset_dev.h
header-y += hdlc.h
header-y += hdlcdrv.h
header-y += hdreg.h
+header-y += herror_record.h
header-y += hid.h
header-y += hiddev.h
header-y += hidraw.h
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -394,6 +394,14 @@ extern int devres_release_group(struct d
extern void *devm_kzalloc(struct device *dev, size_t size, gfp_t gfp);
extern void devm_kfree(struct device *dev, void *p);
+/* Device hardware error reporting related information */
+struct dev_herr_info {
+ atomic_t logs;
+ atomic_t overflows;
+ atomic_t throttles;
+ atomic64_t timestamp;
+};
+
struct device_dma_parameters {
/*
* a low level driver may set these to teach IOMMU code about
@@ -422,6 +430,9 @@ struct device {
void *platform_data; /* Platform specific data, device
core doesn't touch it */
struct dev_pm_info power;
+#ifdef CONFIG_HERR_CORE
+ struct dev_herr_info *error; /* Hardware error reporting info */
+#endif
#ifdef CONFIG_NUMA
int numa_node; /* NUMA node this device is close to */
@@ -523,6 +534,9 @@ static inline bool device_async_suspend_
return !!dev->power.async_suspend;
}
+extern int device_enable_error_reporting(struct device *dev);
+extern void device_disable_error_reporting(struct device *dev);
+
static inline void device_lock(struct device *dev)
{
mutex_lock(&dev->mutex);
--- /dev/null
+++ b/include/linux/herror.h
@@ -0,0 +1,35 @@
+#ifndef LINUX_HERROR_H
+#define LINUX_HERROR_H
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/device.h>
+#include <linux/herror_record.h>
+
+/*
+ * Hardware error reporting
+ */
+
+#define HERR_ALLOC_NO_THROTTLE 0x0001
+
+struct herr_dev;
+
+/* allocate a herr_record lock-lessly */
+struct herr_record *herr_record_alloc(unsigned int len,
+ struct device *dev,
+ unsigned int flags);
+void herr_record_init(struct herr_record *ercd);
+/* report error */
+int herr_record_report(struct herr_record *ercd, struct device *dev);
+/* free the herr_record allocated before */
+void herr_record_free(struct herr_record *ercd);
+/*
+ * Notify the waiting user space hardware error daemon of new error
+ * records; must not be called from NMI context
+ */
+void herr_notify(void);
+
+/* Traverse all error records not consumed by user space */
+typedef int (*herr_traverse_func_t)(struct herr_record *ercd, void *data);
+int herr_for_each_record(herr_traverse_func_t func, void *data);
+#endif
--- /dev/null
+++ b/include/linux/herror_record.h
@@ -0,0 +1,100 @@
+#ifndef LINUX_HERROR_RECORD_H
+#define LINUX_HERROR_RECORD_H
+
+#include <linux/types.h>
+
+/*
+ * Hardware Error Record Definition
+ */
+enum herr_severity {
+ HERR_SEV_NONE,
+ HERR_SEV_CORRECTED,
+ HERR_SEV_RECOVERABLE,
+ HERR_SEV_FATAL,
+};
+
+#define HERR_RCD_REV1_0 0x0100
+#define HERR_MIN_ALIGN_ORDER 3
+#define HERR_MIN_ALIGN (1 << HERR_MIN_ALIGN_ORDER)
+
+enum herr_record_flags {
+ HERR_RCD_PREV = 0x0001, /* record is for previous boot */
+ HERR_RCD_PERSIST = 0x0002, /* record is from flash, needs to be
+ * cleared after being written to disk */
+};
+
+/*
+ * sizeof(struct herr_record) and sizeof(struct herr_section) should
+ * be multiples of HERR_MIN_ALIGN to make error record packing easier.
+ */
+struct herr_record {
+ __u16 length;
+ __u16 flags;
+ __u16 rev;
+ __u8 severity;
+ __u8 pad1;
+ __u64 id;
+ __u64 timestamp;
+ __u8 data[0];
+};
+
+/* Section type IDs are allocated here */
+enum herr_section_type_id {
+ /* 0x0 - 0xff are reserved by core */
+ /* 0x100 - 0x1ff are allocated to CPER */
+ HERR_TYPE_CPER = 0x0100,
+ HERR_TYPE_GESR = 0x0110, /* acpi_hest_generic_status */
+ /* 0x200 - 0x2ff are allocated to PCI/PCIe subsystem */
+ HERR_TYPE_PCIE_AER = 0x0200,
+};
+
+struct herr_section {
+ __u16 length;
+ __u16 flags;
+ __u32 type;
+ __u8 data[0];
+};
+
+#define herr_record_for_each_section(ercd, esec) \
+ for ((esec) = (struct herr_section *)(ercd)->data; \
+ (void *)(esec) - (void *)(ercd) < (ercd)->length; \
+ (esec) = (void *)(esec) + (esec)->length)
+
+#define HERR_SEC_LEN_ROUND(len) \
+ (((len) + HERR_MIN_ALIGN - 1) & ~(HERR_MIN_ALIGN - 1))
+#define HERR_SEC_LEN(type) \
+ (sizeof(struct herr_section) + HERR_SEC_LEN_ROUND(sizeof(type)))
+
+#define HERR_RECORD_LEN_ROUND1(sec_len1) \
+ (sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1))
+#define HERR_RECORD_LEN_ROUND2(sec_len1, sec_len2) \
+ (sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1) + \
+ HERR_SEC_LEN_ROUND(sec_len2))
+#define HERR_RECORD_LEN_ROUND3(sec_len1, sec_len2, sec_len3) \
+ (sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1) + \
+ HERR_SEC_LEN_ROUND(sec_len2) + HERR_SEC_LEN_ROUND(sec_len3))
+
+#define HERR_RECORD_LEN1(sec_type1) \
+ (sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1))
+#define HERR_RECORD_LEN2(sec_type1, sec_type2) \
+ (sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1) + \
+ HERR_SEC_LEN(sec_type2))
+#define HERR_RECORD_LEN3(sec_type1, sec_type2, sec_type3) \
+ (sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1) + \
+ HERR_SEC_LEN(sec_type2) + HERR_SEC_LEN(sec_type3))
+
+static inline struct herr_section *herr_first_sec(struct herr_record *ercd)
+{
+ return (struct herr_section *)(ercd + 1);
+}
+
+static inline struct herr_section *herr_next_sec(struct herr_section *esec)
+{
+ return (void *)esec + esec->length;
+}
+
+static inline void *herr_sec_data(struct herr_section *esec)
+{
+ return (void *)(esec + 1);
+}
+#endif
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_FUNCTION_TRACER) += trace/
obj-$(CONFIG_TRACING) += trace/
obj-$(CONFIG_X86_DS) += trace/
obj-$(CONFIG_RING_BUFFER) += trace/
+obj-$(CONFIG_HERR_CORE) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-$(CONFIG_PERF_EVENTS) += perf_event.o
On Fri, 2010-11-19 at 16:10 +0800, Huang, Ying wrote:
> Hi, Len,
>
> This is used by APEI ERST and GEHS. But it is a generic hardware
> error reporting mechanism and can be used by other hardware error
> reporting mechanisms such as EDAC, PCIe AER, Machine Check, etc.
>
> The patchset is split from the original APEI patchset to make it
> explicit that this is a generic mechanism, not APEI specific bits.
>
> [PATCH 1/2] Generic hardware error reporting mechanism
> [PATCH 2/2] Hardware error record persistent support
This patchset depends on version 5 of the lockless memory allocator
and list patchset as follows:
[PATCH -v5 0/3] Lockless memory allocator and list
Best Regards,
Huang Ying
Sorry, I forgot to Cc: Greg for the device model part.
Best Regards,
Huang Ying
On Fri, 2010-11-19 at 16:10 +0800, Huang, Ying wrote:
> There are many hardware error detecting and reporting components in
> the kernel, including x86 Machine Check, PCIe AER, EDAC, APEI GHES,
> etc. Each one has its own error reporting implementation, including a
> user space interface, error record format, in-kernel buffer, etc.
> This patch provides a generic hardware error reporting mechanism to
> reduce the duplicated effort and add more common services.
>
>
> A highly extensible generic hardware error record data structure is
> defined to accommodate hardware error information from various
> hardware error sources. The overall structure of an error record is
> as follows:
>
> -----------------------------------------------------------------
> | rcd hdr | sec 1 hdr | sec 1 data | sec 2 hdr | sec2 data | ...
> -----------------------------------------------------------------
>
> Several error sections can be incorporated into one error record to
> accumulate information from multiple hardware components related to
> one error. For example, for an error on a device on the secondary
> side of a PCIe bridge, it is useful to record error information from
> both the PCIe bridge and the PCIe device. Multiple sections can also
> be used to hold both the cooked and the raw error information, so
> that abstract information is available from the cooked section while
> no information is lost, because the raw section is provided too.
>
> There are "revision" (rev) and "length" fields in the record header
> and "type" and "length" fields in the section header, so the user
> space error daemon can skip unrecognized error records or error
> sections. This allows an old-version error daemon to work with a
> newer kernel.
>
> New error section types can be added to support new error types and
> error sources.
>
>
> The hardware error reporting mechanism designed by the patch
> integrates well with the kernel device model. struct dev_herr_info
> is defined and pointed to by the "error" field of struct device; it
> holds the error reporting related information for each device. One
> sysfs directory "error" is created for each hardware error reporting
> device, with files for error reporting statistics and control inside
> it. For example, the "error" directory for APEI GHES is as follows:
>
> /sys/devices/platform/GHES.0/error/logs
> /sys/devices/platform/GHES.0/error/overflows
> /sys/devices/platform/GHES.0/error/throttles
>
> Here "logs" is the number of error records logged; "throttles" is
> the number of error records not logged because the reporting rate was
> too high; "overflows" is the number of error records not logged
> because no space was available.
>
> Not all devices will report errors, so struct dev_herr_info and the
> sysfs directory/files are only allocated/created for devices that
> explicitly enable it. So to enumerate the error sources of the
> system, you just need to look for an "error" directory under each
> device directory in /sys/devices.
>
>
> One device file (/dev/error/error), which mixes error records from
> all hardware error reporting devices, is created to convey error
> records from kernel space to user space. Because hardware devices
> are being dealt with, a device file is the most natural way to do
> that. Because hardware error reporting should not hurt system
> performance, the throughput of the interface is kept low (enforced
> by the user space error daemon), so an ordinary "read" is sufficient
> from a performance point of view.
>
>
> The patch provides common services for hardware error reporting
> devices too.
>
> A lock-less hardware error record allocator is provided, so for
> hardware errors that can be ignored (such as corrected errors), there
> is no need to pre-allocate the error record or allocate it on the
> stack. Because the probability of two hardware components failing
> simultaneously is very small, one big unified memory pool for
> hardware errors is better than one memory pool or buffer per device.
>
> After filling in all necessary fields of the hardware error record,
> reporting is quite straightforward: just call herr_record_report with
> the error record itself and the corresponding struct device as
> parameters.
>
> Hardware errors may burst; for example, the same hardware error may
> be reported at a high rate within a short interval. This can use up
> all pre-allocated memory for error reporting, so that other hardware
> errors, coming from the same or a different hardware device, can not
> be logged. To deal with this issue, a throttling algorithm is
> implemented: the logging rate for errors coming from one hardware
> error device is throttled based on the available pre-allocated memory
> for error reporting. In this way, as many kinds of errors from as
> many devices as possible can be logged.
>
>
> This patch is designed by Andi Kleen and Huang Ying.
>
> Signed-off-by: Huang Ying <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> ---
> drivers/Kconfig | 2
> drivers/Makefile | 1
> drivers/base/Makefile | 1
> drivers/base/herror.c | 98 ++++++++
> drivers/herror/Kconfig | 5
> drivers/herror/Makefile | 1
> drivers/herror/herr-core.c | 488 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/Kbuild | 1
> include/linux/device.h | 14 +
> include/linux/herror.h | 35 +++
> include/linux/herror_record.h | 100 ++++++++
> kernel/Makefile | 1
> 12 files changed, 747 insertions(+)
> create mode 100644 drivers/base/herror.c
> create mode 100644 drivers/herror/Kconfig
> create mode 100644 drivers/herror/Makefile
> create mode 100644 drivers/herror/herr-core.c
> create mode 100644 include/linux/herror.h
> create mode 100644 include/linux/herror_record.h
>
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
> source "drivers/staging/Kconfig"
>
> source "drivers/platform/Kconfig"
> +
> +source "drivers/herror/Kconfig"
> endmenu
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -115,3 +115,4 @@ obj-$(CONFIG_VLYNQ) += vlynq/
> obj-$(CONFIG_STAGING) += staging/
> obj-y += platform/
> obj-y += ieee802154/
> +obj-$(CONFIG_HERR_CORE) += herror/
> --- a/drivers/base/Makefile
> +++ b/drivers/base/Makefile
> @@ -18,6 +18,7 @@ ifeq ($(CONFIG_SYSFS),y)
> obj-$(CONFIG_MODULES) += module.o
> endif
> obj-$(CONFIG_SYS_HYPERVISOR) += hypervisor.o
> +obj-$(CONFIG_HERR_CORE) += herror.o
>
> ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
>
> --- /dev/null
> +++ b/drivers/base/herror.c
> @@ -0,0 +1,98 @@
> +/*
> + * Hardware error reporting related functions
> + *
> + * Copyright 2010 Intel Corp.
> + * Author: Huang Ying <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation;
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +
> +#define HERR_COUNTER_ATTR(_name) \
> + static ssize_t herr_##_name##_show(struct device *dev, \
> + struct device_attribute *attr, \
> + char *buf) \
> + { \
> + int counter; \
> + \
> + counter = atomic_read(&dev->error->_name); \
> + return sprintf(buf, "%d\n", counter); \
> + } \
> + static ssize_t herr_##_name##_store(struct device *dev, \
> + struct device_attribute *attr, \
> + const char *buf, \
> + size_t count) \
> + { \
> + atomic_set(&dev->error->_name, 0); \
> + return count; \
> + } \
> + static struct device_attribute herr_attr_##_name = \
> + __ATTR(_name, 0600, herr_##_name##_show, \
> + herr_##_name##_store)
> +
> +HERR_COUNTER_ATTR(logs);
> +HERR_COUNTER_ATTR(overflows);
> +HERR_COUNTER_ATTR(throttles);
> +
> +static struct attribute *herr_attrs[] = {
> + &herr_attr_logs.attr,
> + &herr_attr_overflows.attr,
> + &herr_attr_throttles.attr,
> + NULL,
> +};
> +
> +static struct attribute_group herr_attr_group = {
> + .name = "error",
> + .attrs = herr_attrs,
> +};
> +
> +static void device_herr_init(struct device *dev)
> +{
> + atomic_set(&dev->error->logs, 0);
> + atomic_set(&dev->error->overflows, 0);
> + atomic_set(&dev->error->throttles, 0);
> + atomic64_set(&dev->error->timestamp, 0);
> +}
> +
> +int device_enable_error_reporting(struct device *dev)
> +{
> + int rc;
> +
> + BUG_ON(dev->error);
> + dev->error = kzalloc(sizeof(*dev->error), GFP_KERNEL);
> + if (!dev->error)
> + return -ENOMEM;
> + device_herr_init(dev);
> + rc = sysfs_create_group(&dev->kobj, &herr_attr_group);
> + if (rc)
> + goto err;
> + return 0;
> +err:
> + kfree(dev->error);
> + dev->error = NULL;
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(device_enable_error_reporting);
> +
> +void device_disable_error_reporting(struct device *dev)
> +{
> + if (dev->error) {
> + sysfs_remove_group(&dev->kobj, &herr_attr_group);
> + kfree(dev->error);
> + }
> +}
> +EXPORT_SYMBOL_GPL(device_disable_error_reporting);
> --- /dev/null
> +++ b/drivers/herror/Kconfig
> @@ -0,0 +1,5 @@
> +config HERR_CORE
> + bool "Hardware error reporting"
> + depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
> + select LLIST
> + select GENERIC_ALLOCATOR
> --- /dev/null
> +++ b/drivers/herror/Makefile
> @@ -0,0 +1 @@
> +obj-y += herr-core.o
> --- /dev/null
> +++ b/drivers/herror/herr-core.c
> @@ -0,0 +1,488 @@
> +/*
> + * Generic hardware error reporting support
> + *
> + * This file provides some common services for hardware error
> + * reporting, including a lock-less hardware error record allocator,
> + * the error reporting mechanism, the user space interface, etc.
> + *
> + * Copyright 2010 Intel Corp.
> + * Author: Huang Ying <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation;
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/rculist.h>
> +#include <linux/mutex.h>
> +#include <linux/percpu.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/trace_clock.h>
> +#include <linux/uaccess.h>
> +#include <linux/poll.h>
> +#include <linux/ratelimit.h>
> +#include <linux/nmi.h>
> +#include <linux/llist.h>
> +#include <linux/genalloc.h>
> +#include <linux/herror.h>
> +
> +#define HERR_NOTIFY_BIT 0
> +
> +static unsigned long herr_flags;
> +
> +/*
> + * Record list management and error reporting
> + */
> +
> +struct herr_node {
> + struct llist_node llist;
> + struct herr_record ercd __attribute__((aligned(HERR_MIN_ALIGN)));
> +};
> +
> +#define HERR_NODE_LEN(rcd_len) \
> + ((rcd_len) + sizeof(struct herr_node) - sizeof(struct herr_record))
> +
> +#define HERR_MIN_ALLOC_ORDER HERR_MIN_ALIGN_ORDER
> +#define HERR_CHUNKS_PER_CPU 2
> +#define HERR_RCD_LIST_NUM 2
> +
> +struct herr_rcd_lists {
> + struct llist_head *write;
> + struct llist_head *read;
> + struct llist_head heads[HERR_RCD_LIST_NUM];
> +};
> +
> +static DEFINE_PER_CPU(struct herr_rcd_lists, herr_rcd_lists);
> +
> +static DEFINE_PER_CPU(struct gen_pool *, herr_gen_pool);
> +
> +static void herr_rcd_lists_init(void)
> +{
> + int cpu, i;
> + struct herr_rcd_lists *lists;
> +
> + for_each_possible_cpu(cpu) {
> + lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> + for (i = 0; i < HERR_RCD_LIST_NUM; i++)
> + init_llist_head(&lists->heads[i]);
> + lists->write = &lists->heads[0];
> + lists->read = &lists->heads[1];
> + }
> +}
> +
> +static void herr_pool_fini(void)
> +{
> + struct gen_pool *pool;
> + struct gen_pool_chunk *chunk;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + pool = per_cpu(herr_gen_pool, cpu);
> + gen_pool_for_each_chunk(chunk, pool)
> + free_page(chunk->start_addr);
> + gen_pool_destroy(pool);
> + }
> +}
> +
> +static int herr_pool_init(void)
> +{
> + struct gen_pool **pool;
> + int cpu, rc, nid, i;
> + unsigned long addr;
> +
> + for_each_possible_cpu(cpu) {
> + pool = per_cpu_ptr(&herr_gen_pool, cpu);
> + rc = -ENOMEM;
> + nid = cpu_to_node(cpu);
> + *pool = gen_pool_create(HERR_MIN_ALLOC_ORDER, nid);
> + if (!*pool)
> + goto err_pool_fini;
> + for (i = 0; i < HERR_CHUNKS_PER_CPU; i++) {
> + rc = -ENOMEM;
> + addr = __get_free_page(GFP_KERNEL);
> + if (!addr)
> + goto err_pool_fini;
> + rc = gen_pool_add(*pool, addr, PAGE_SIZE, nid);
> + if (rc)
> + goto err_pool_fini;
> + }
> + }
> +
> + return 0;
> +err_pool_fini:
> + herr_pool_fini();
> + return rc;
> +}
> +
> +/* Max interval: about 2 seconds */
> +#define HERR_THROTTLE_BASE_INTVL NSEC_PER_USEC
> +#define HERR_THROTTLE_MAX_RATIO 21
> +#define HERR_THROTTLE_MAX_INTVL \
> + ((1ULL << HERR_THROTTLE_MAX_RATIO) * HERR_THROTTLE_BASE_INTVL)
> +/*
> + * Pool used/size ratio below which the pool is considered spare; up
> + * to this point the interval between error reports is ignored.
> + * Beyond it, the minimal required interval increases exponentially
> + * up to the max interval.
> + */
> +#define HERR_THROTTLE_SPARE_RATIO 3
> +
> +static int herr_throttle(struct device *dev)
> +{
> + struct gen_pool *pool;
> + unsigned long long last, now, min_intvl;
> + unsigned int size, used, ratio;
> +
> + pool = __get_cpu_var(herr_gen_pool);
> + size = gen_pool_size(pool);
> + used = size - gen_pool_avail(pool);
> + if (HERR_THROTTLE_SPARE_RATIO * used < size)
> + goto pass;
> + now = trace_clock_local();
> + last = atomic64_read(&dev->error->timestamp);
> + ratio = (used * HERR_THROTTLE_SPARE_RATIO - size) * \
> + HERR_THROTTLE_MAX_RATIO;
> + ratio = ratio / (size * HERR_THROTTLE_SPARE_RATIO - size) + 1;
> + min_intvl = (1ULL << ratio) * HERR_THROTTLE_BASE_INTVL;
> + if ((long long)(now - last) > min_intvl)
> + goto pass;
> + atomic_inc(&dev->error->throttles);
> + return 0;
> +pass:
> + return 1;
> +}
> +
> +static u64 herr_record_next_id(void)
> +{
> + static atomic64_t seq = ATOMIC64_INIT(0);
> +
> + if (!atomic64_read(&seq))
> + atomic64_set(&seq, (u64)get_seconds() << 32);
> +
> + return atomic64_inc_return(&seq);
> +}
> +
> +void herr_record_init(struct herr_record *ercd)
> +{
> + ercd->flags = 0;
> + ercd->rev = HERR_RCD_REV1_0;
> + ercd->id = herr_record_next_id();
> + ercd->timestamp = trace_clock_local();
> +}
> +EXPORT_SYMBOL_GPL(herr_record_init);
> +
> +struct herr_record *herr_record_alloc(unsigned int len, struct device *dev,
> + unsigned int flags)
> +{
> + struct gen_pool *pool;
> + struct herr_node *enode;
> + struct herr_record *ercd = NULL;
> +
> + BUG_ON(!dev->error);
> + preempt_disable();
> + if (!(flags & HERR_ALLOC_NO_THROTTLE)) {
> + if (!herr_throttle(dev)) {
> + preempt_enable_no_resched();
> + return NULL;
> + }
> + }
> +
> + pool = __get_cpu_var(herr_gen_pool);
> + enode = (struct herr_node *)gen_pool_alloc(pool, HERR_NODE_LEN(len));
> + if (enode) {
> + ercd = &enode->ercd;
> + herr_record_init(ercd);
> + ercd->length = len;
> +
> + atomic64_set(&dev->error->timestamp, trace_clock_local());
> + atomic_inc(&dev->error->logs);
> + } else
> + atomic_inc(&dev->error->overflows);
> + preempt_enable_no_resched();
> +
> + return ercd;
> +}
> +EXPORT_SYMBOL_GPL(herr_record_alloc);
> +
> +int herr_record_report(struct herr_record *ercd, struct device *dev)
> +{
> + struct herr_rcd_lists *lists;
> + struct herr_node *enode;
> +
> + preempt_disable();
> + lists = this_cpu_ptr(&herr_rcd_lists);
> + enode = container_of(ercd, struct herr_node, ercd);
> + llist_add(&enode->llist, lists->write);
> + preempt_enable_no_resched();
> +
> + set_bit(HERR_NOTIFY_BIT, &herr_flags);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(herr_record_report);
> +
> +void herr_record_free(struct herr_record *ercd)
> +{
> + struct herr_node *enode;
> + struct gen_pool *pool;
> +
> + enode = container_of(ercd, struct herr_node, ercd);
> + pool = get_cpu_var(herr_gen_pool);
> + gen_pool_free(pool, (unsigned long)enode,
> + HERR_NODE_LEN(enode->ercd.length));
> + put_cpu_var(pool);
> +}
> +EXPORT_SYMBOL_GPL(herr_record_free);
> +
> +/*
> + * The low 16 bits are the freeze count, the high 16 bits the thaw
> + * count. If they are not equal, someone has frozen the reader.
> + */
> +static u32 herr_freeze_thaw;
> +
> +/*
> + * Stop the reader to consume error records, so that the error records
> + * can be checked in kernel space safely.
> + */
> +static void herr_freeze_reader(void)
> +{
> + u32 old, new;
> +
> + do {
> + new = old = herr_freeze_thaw;
> + new = ((new + 1) & 0xffff) | (old & 0xffff0000);
> + } while (cmpxchg(&herr_freeze_thaw, old, new) != old);
> +}
> +
> +static void herr_thaw_reader(void)
> +{
> + u32 old, new;
> +
> + do {
> + old = herr_freeze_thaw;
> + new = old + 0x10000;
> + } while (cmpxchg(&herr_freeze_thaw, old, new) != old);
> +}
> +
> +static int herr_reader_is_frozen(void)
> +{
> + u32 freeze_thaw = herr_freeze_thaw;
> + return (freeze_thaw & 0xffff) != (freeze_thaw >> 16);
> +}
> +
> +int herr_for_each_record(herr_traverse_func_t func, void *data)
> +{
> + int i, cpu, rc = 0;
> + struct herr_rcd_lists *lists;
> + struct herr_node *enode;
> +
> + preempt_disable();
> + herr_freeze_reader();
> + for_each_possible_cpu(cpu) {
> + lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> + for (i = 0; i < HERR_RCD_LIST_NUM; i++) {
> + struct llist_head *head = &lists->heads[i];
> + llist_for_each_entry(enode, head->first, llist) {
> + rc = func(&enode->ercd, data);
> + if (rc)
> + goto out;
> + }
> + }
> + }
> +out:
> + herr_thaw_reader();
> + preempt_enable_no_resched();
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(herr_for_each_record);
> +
> +static ssize_t herr_rcd_lists_read(char __user *ubuf, size_t usize,
> + struct mutex *read_mutex)
> +{
> + int cpu, rc = 0, read;
> + struct herr_rcd_lists *lists;
> + struct gen_pool *pool;
> + ssize_t len, rsize = 0;
> + struct herr_node *enode;
> + struct llist_head *old_read;
> + struct llist_node *to_read;
> +
> + do {
> + read = 0;
> + for_each_possible_cpu(cpu) {
> + lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> + pool = per_cpu(herr_gen_pool, cpu);
> + if (llist_empty(lists->read)) {
> + if (llist_empty(lists->write))
> + continue;
> + /*
> + * Error records are output in batches, so that old
> + * error records are output before new ones.
> + */
> + old_read = lists->read;
> + lists->read = lists->write;
> + lists->write = old_read;
> + }
> + rc = rsize ? 0 : -EBUSY;
> + if (herr_reader_is_frozen())
> + goto out;
> + to_read = llist_del_first(lists->read);
> + if (herr_reader_is_frozen())
> + goto out_readd;
> + enode = llist_entry(to_read, struct herr_node, llist);
> + len = enode->ercd.length;
> + rc = rsize ? 0 : -EINVAL;
> + if (len > usize - rsize)
> + goto out_readd;
> + rc = -EFAULT;
> + if (copy_to_user(ubuf + rsize, &enode->ercd, len))
> + goto out_readd;
> + gen_pool_free(pool, (unsigned long)enode,
> + HERR_NODE_LEN(len));
> + rsize += len;
> + read = 1;
> + }
> + if (need_resched()) {
> + mutex_unlock(read_mutex);
> + cond_resched();
> + mutex_lock(read_mutex);
> + }
> + } while (read);
> + rc = 0;
> +out:
> + return rc ? rc : rsize;
> +out_readd:
> + llist_add(to_read, lists->read);
> + goto out;
> +}
> +
> +static int herr_rcd_lists_is_empty(void)
> +{
> + int cpu, i;
> + struct herr_rcd_lists *lists;
> +
> + for_each_possible_cpu(cpu) {
> + lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> + for (i = 0; i < HERR_RCD_LIST_NUM; i++) {
> + if (!llist_empty(&lists->heads[i]))
> + return 0;
> + }
> + }
> + return 1;
> +}
> +
> +
> +/*
> + * Hardware Error Mix Reporting Device
> + */
> +
> +static int herr_major;
> +static DECLARE_WAIT_QUEUE_HEAD(herr_mix_wait);
> +
> +static char *herr_devnode(struct device *dev, mode_t *mode)
> +{
> + return kasprintf(GFP_KERNEL, "error/%s", dev_name(dev));
> +}
> +
> +struct class herr_class = {
> + .name = "error",
> + .devnode = herr_devnode,
> +};
> +EXPORT_SYMBOL_GPL(herr_class);
> +
> +void herr_notify(void)
> +{
> + if (test_and_clear_bit(HERR_NOTIFY_BIT, &herr_flags))
> + wake_up_interruptible(&herr_mix_wait);
> +}
> +EXPORT_SYMBOL_GPL(herr_notify);
> +
> +static ssize_t herr_mix_read(struct file *filp, char __user *ubuf,
> + size_t usize, loff_t *off)
> +{
> + int rc;
> + static DEFINE_MUTEX(read_mutex);
> +
> + if (*off != 0)
> + return -EINVAL;
> +
> + rc = mutex_lock_interruptible(&read_mutex);
> + if (rc)
> + return rc;
> + rc = herr_rcd_lists_read(ubuf, usize, &read_mutex);
> + mutex_unlock(&read_mutex);
> +
> + return rc;
> +}
> +
> +static unsigned int herr_mix_poll(struct file *file, poll_table *wait)
> +{
> + poll_wait(file, &herr_mix_wait, wait);
> + if (!herr_rcd_lists_is_empty())
> + return POLLIN | POLLRDNORM;
> + return 0;
> +}
> +
> +static const struct file_operations herr_mix_dev_fops = {
> + .owner = THIS_MODULE,
> + .read = herr_mix_read,
> + .poll = herr_mix_poll,
> +};
> +
> +static int __init herr_mix_dev_init(void)
> +{
> + struct device *dev;
> + dev_t devt;
> +
> + devt = MKDEV(herr_major, 0);
> + dev = device_create(&herr_class, NULL, devt, NULL, "error");
> + if (IS_ERR(dev))
> + return PTR_ERR(dev);
> +
> + return 0;
> +}
> +device_initcall(herr_mix_dev_init);
> +
> +static int __init herr_core_init(void)
> +{
> + int rc;
> +
> + BUILD_BUG_ON(sizeof(struct herr_node) % HERR_MIN_ALIGN);
> + BUILD_BUG_ON(sizeof(struct herr_record) % HERR_MIN_ALIGN);
> + BUILD_BUG_ON(sizeof(struct herr_section) % HERR_MIN_ALIGN);
> +
> + herr_rcd_lists_init();
> +
> + rc = herr_pool_init();
> + if (rc)
> + goto err;
> +
> + rc = class_register(&herr_class);
> + if (rc)
> + goto err_free_pool;
> +
> + rc = herr_major = register_chrdev(0, "error", &herr_mix_dev_fops);
> + if (rc < 0)
> + goto err_free_class;
> +
> + return 0;
> +err_free_class:
> + class_unregister(&herr_class);
> +err_free_pool:
> + herr_pool_fini();
> +err:
> + return rc;
> +}
> +/* Initialize data structure used by device driver, so subsys_initcall */
> +subsys_initcall(herr_core_init);
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -141,6 +141,7 @@ header-y += gigaset_dev.h
> header-y += hdlc.h
> header-y += hdlcdrv.h
> header-y += hdreg.h
> +header-y += herror_record.h
> header-y += hid.h
> header-y += hiddev.h
> header-y += hidraw.h
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -394,6 +394,14 @@ extern int devres_release_group(struct d
> extern void *devm_kzalloc(struct device *dev, size_t size, gfp_t gfp);
> extern void devm_kfree(struct device *dev, void *p);
>
> +/* Device hardware error reporting related information */
> +struct dev_herr_info {
> + atomic_t logs;
> + atomic_t overflows;
> + atomic_t throttles;
> + atomic64_t timestamp;
> +};
> +
> struct device_dma_parameters {
> /*
> * a low level driver may set these to teach IOMMU code about
> @@ -422,6 +430,9 @@ struct device {
> void *platform_data; /* Platform specific data, device
> core doesn't touch it */
> struct dev_pm_info power;
> +#ifdef CONFIG_HERR_CORE
> + struct dev_herr_info *error; /* Hardware error reporting info */
> +#endif
>
> #ifdef CONFIG_NUMA
> int numa_node; /* NUMA node this device is close to */
> @@ -523,6 +534,9 @@ static inline bool device_async_suspend_
> return !!dev->power.async_suspend;
> }
>
> +extern int device_enable_error_reporting(struct device *dev);
> +extern void device_disable_error_reporting(struct device *dev);
> +
> static inline void device_lock(struct device *dev)
> {
> mutex_lock(&dev->mutex);
> --- /dev/null
> +++ b/include/linux/herror.h
> @@ -0,0 +1,35 @@
> +#ifndef LINUX_HERROR_H
> +#define LINUX_HERROR_H
> +
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/device.h>
> +#include <linux/herror_record.h>
> +
> +/*
> + * Hardware error reporting
> + */
> +
> +#define HERR_ALLOC_NO_THROTTLE 0x0001
> +
> +struct herr_dev;
> +
> +/* allocate a herr_record lock-lessly */
> +struct herr_record *herr_record_alloc(unsigned int len,
> + struct device *dev,
> + unsigned int flags);
> +void herr_record_init(struct herr_record *ercd);
> +/* report error */
> +int herr_record_report(struct herr_record *ercd, struct device *dev);
> +/* free the herr_record allocated before */
> +void herr_record_free(struct herr_record *ercd);
> +/*
> + * Notify waited user space hardware error daemon for the new error
> + * record, can not be used in NMI context
> + */
> +void herr_notify(void);
> +
> +/* Traverse all error records not consumed by user space */
> +typedef int (*herr_traverse_func_t)(struct herr_record *ercd, void *data);
> +int herr_for_each_record(herr_traverse_func_t func, void *data);
> +#endif
> --- /dev/null
> +++ b/include/linux/herror_record.h
> @@ -0,0 +1,100 @@
> +#ifndef LINUX_HERROR_RECORD_H
> +#define LINUX_HERROR_RECORD_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * Hardware Error Record Definition
> + */
> +enum herr_severity {
> + HERR_SEV_NONE,
> + HERR_SEV_CORRECTED,
> + HERR_SEV_RECOVERABLE,
> + HERR_SEV_FATAL,
> +};
> +
> +#define HERR_RCD_REV1_0 0x0100
> +#define HERR_MIN_ALIGN_ORDER 3
> +#define HERR_MIN_ALIGN (1 << HERR_MIN_ALIGN_ORDER)
> +
> +enum herr_record_flags {
> + HERR_RCD_PREV = 0x0001, /* record is for previous boot */
> + HERR_RCD_PERSIST = 0x0002, /* record is from flash, need to be
> + * cleared after writing to disk */
> +};
> +
> +/*
> + * sizeof(struct herr_record) and sizeof(struct herr_section) should
> + * be a multiple of HERR_MIN_ALIGN to make error record packing easier.
> + */
> +struct herr_record {
> + __u16 length;
> + __u16 flags;
> + __u16 rev;
> + __u8 severity;
> + __u8 pad1;
> + __u64 id;
> + __u64 timestamp;
> + __u8 data[0];
> +};
> +
> +/* Section type IDs are allocated here */
> +enum herr_section_type_id {
> + /* 0x0 - 0xff are reserved by core */
> + /* 0x100 - 0x1ff are allocated to CPER */
> + HERR_TYPE_CPER = 0x0100,
> + HERR_TYPE_GESR = 0x0110, /* acpi_hest_generic_status */
> + /* 0x200 - 0x2ff are allocated to PCI/PCIe subsystem */
> + HERR_TYPE_PCIE_AER = 0x0200,
> +};
> +
> +struct herr_section {
> + __u16 length;
> + __u16 flags;
> + __u32 type;
> + __u8 data[0];
> +};
> +
> +#define herr_record_for_each_section(ercd, esec) \
> + for ((esec) = (struct herr_section *)(ercd)->data; \
> + (void *)(esec) - (void *)(ercd) < (ercd)->length; \
> + (esec) = (void *)(esec) + (esec)->length)
> +
> +#define HERR_SEC_LEN_ROUND(len) \
> + (((len) + HERR_MIN_ALIGN - 1) & ~(HERR_MIN_ALIGN - 1))
> +#define HERR_SEC_LEN(type) \
> + (sizeof(struct herr_section) + HERR_SEC_LEN_ROUND(sizeof(type)))
> +
> +#define HERR_RECORD_LEN_ROUND1(sec_len1) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1))
> +#define HERR_RECORD_LEN_ROUND2(sec_len1, sec_len2) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1) + \
> + HERR_SEC_LEN_ROUND(sec_len2))
> +#define HERR_RECORD_LEN_ROUND3(sec_len1, sec_len2, sec_len3) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1) + \
> + HERR_SEC_LEN_ROUND(sec_len2) + HERR_SEC_LEN_ROUND(sec_len3))
> +
> +#define HERR_RECORD_LEN1(sec_type1) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1))
> +#define HERR_RECORD_LEN2(sec_type1, sec_type2) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1) + \
> + HERR_SEC_LEN(sec_type2))
> +#define HERR_RECORD_LEN3(sec_type1, sec_type2, sec_type3) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1) + \
> + HERR_SEC_LEN(sec_type2) + HERR_SEC_LEN(sec_type3))
> +
> +static inline struct herr_section *herr_first_sec(struct herr_record *ercd)
> +{
> + return (struct herr_section *)(ercd + 1);
> +}
> +
> +static inline struct herr_section *herr_next_sec(struct herr_section *esec)
> +{
> +	return (void *)esec + esec->length;
> +}
> +
> +static inline void *herr_sec_data(struct herr_section *esec)
> +{
> + return (void *)(esec + 1);
> +}
> +#endif
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -100,6 +100,7 @@ obj-$(CONFIG_FUNCTION_TRACER) += trace/
> obj-$(CONFIG_TRACING) += trace/
> obj-$(CONFIG_X86_DS) += trace/
> obj-$(CONFIG_RING_BUFFER) += trace/
> +obj-$(CONFIG_HERR_CORE) += trace/
> obj-$(CONFIG_SMP) += sched_cpupri.o
> obj-$(CONFIG_IRQ_WORK) += irq_work.o
> obj-$(CONFIG_PERF_EVENTS) += perf_event.o
On Fri, 2010-11-19 at 16:10 +0800, Huang Ying wrote:
> Hi, Len,
>
> This is used by APEI ERST and GEHS. But it is a generic hardware
> error reporting mechanism and can be used by other hardware error
> reporting mechanisms such as EDAC, PCIe AER, Machine Check, etc.
>
> The patchset is split from the original APEI patchset to make it
> explicit that this is a generic mechanism, not APEI specific bits.
>
> [PATCH 1/2] Generic hardware error reporting mechanism
> [PATCH 2/2] Hardware error record persistent support
You call it generic, does that mean the EDAC guys agree, does it work on
AMD and IA64?
If not, Tony could you please apply a cluebat? I thought Intel was going
to sit around the table with all hardware error people and come up with
a unified thing at LPC?
On Fri, Nov 19, 2010 at 7:22 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2010-11-19 at 16:10 +0800, Huang Ying wrote:
>> Hi, Len,
>>
>> This is used by APEI ERST and GEHS. But it is a generic hardware
>> error reporting mechanism and can be used by other hardware error
>> reporting mechanisms such as EDAC, PCIe AER, Machine Check, etc.
>>
>> The patchset is split from the original APEI patchset to make it
>> explicit that this is a generic mechanism, not APEI specific bits.
>>
>> [PATCH 1/2] Generic hardware error reporting mechanism
>> [PATCH 2/2] Hardware error record persistent support
>
> You call it generic, does that mean the EDAC guys agree, does it work on
> AMD and IA64?
I call it "generic", because it can be used by EDAC, PCIe AER, Machine
Check, etc to report hardware errors, and it can work on AMD and
IA64.
> If not, Tony could you please apply a cluebat? I thought Intel was going
> to sit around the table with all hardware error people and come up with
> a unified thing at LPC?
Best Regards,
Huang Ying
On Fri, 2010-11-19 at 19:54 +0800, huang ying wrote:
>
> On Fri, Nov 19, 2010 at 7:22 PM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, 2010-11-19 at 16:10 +0800, Huang Ying wrote:
> >> Hi, Len,
> >>
> >> This is used by APEI ERST and GEHS. But it is a generic hardware
> >> error reporting mechanism and can be used by other hardware error
> >> reporting mechanisms such as EDAC, PCIe AER, Machine Check, etc.
> >>
> >> The patchset is split from the original APEI patchset to make it
> >> explicit that this is a generic mechanism, not APEI specific bits.
> >>
> >> [PATCH 1/2] Generic hardware error reporting mechanism
> >> [PATCH 2/2] Hardware error record persistent support
> >
> > You call it generic, does that mean the EDAC guys agree, does it work on
> > AMD and IA64?
>
> I call it "generic", because it can be used by EDAC, PCIe AER, Machine
> Check, etc to report hardware errors, and it can work on AMD and
> IA64.
1) that's not a complete answer -- I asked: do the EDAC guys agree? Have
you even tried talking to them?
and 2) 'can' is not the right word here, I thought we'd all agreed to
talk about this and agree on some approach before littering the kernel
with tiny special case interfaces.. which this will be if EDAC and
others don't agree with you.
So I want a firm agreement of all parties interested that this is the
way forward, if not you already have my NAK.
On Fri, Nov 19, 2010 at 8:02 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2010-11-19 at 19:54 +0800, huang ying wrote:
>>
>> On Fri, Nov 19, 2010 at 7:22 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Fri, 2010-11-19 at 16:10 +0800, Huang Ying wrote:
>> >> Hi, Len,
>> >>
>> >> This is used by APEI ERST and GEHS. But it is a generic hardware
>> >> error reporting mechanism and can be used by other hardware error
>> >> reporting mechanisms such as EDAC, PCIe AER, Machine Check, etc.
>> >>
>> >> The patchset is split from the original APEI patchset to make it
>> >> explicit that this is a generic mechanism, not APEI specific bits.
>> >>
>> >> [PATCH 1/2] Generic hardware error reporting mechanism
>> >> [PATCH 2/2] Hardware error record persistent support
>> >
>> > You call it generic, does that mean the EDAC guys agree, does it work on
>> > AMD and IA64?
>>
>> I call it "generic", because it can be used by EDAC, PCIe AER, Machine
>> Check, etc to report hardware errors, and it can work on AMD and
>> IA64.
>
> 1) that's not a complete answer -- I asked: do the EDAC guys agree? Have
> you even tried talking to them?
I have talked with Mauro during LPC. We all agree that we need a
generic hardware error reporting interface. And now, I want to talk
with all interested parties with code.
> and 2) 'can' is not the right word here, I thought we'd all agreed to
> talk about this and agree on some approach before littering the kernel
> with tiny special case interfaces.. which this will be if EDAC and
> others don't agree with you.
>
> So I want a firm agreement of all parties interested that this is the
> way forward, if not you already have my NAK.
Why not talk about the code? Which it can do, which it can not do? Why
not talk about the requirement? What do we need?
Best Regards,
Huang Ying
On Fri, 2010-11-19 at 20:48 +0800, huang ying wrote:
>
> I have talked with Mauro during LPC. We all agree that we need a
> generic hardware error reporting interface. And now, I want to talk
> with all interested parties with code.
talking with code could maybe be done.. but definitely not in the form
of a 'pull' request of your approach.
Also, I would think that if you do an RFC (the usual way to talk with
code) you would list the various requirements and how you address them
with your code, and ask the other parties if they agree with the
approach you've taken.
You've done none of that, instead you ask Len to merge your muck without
even mentioning what the other parties think.
On Fri, Nov 19, 2010 at 8:55 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2010-11-19 at 20:48 +0800, huang ying wrote:
>>
>> I have talked with Mauro during LPC. We all agree that we need a
>> generic hardware error reporting interface. And now, I want to talk
>> with all interested parties with code.
>
> talking with code could maybe be done.. but definitely not in the form
> of a 'pull' request of your approach.
>
> Also, I would think that if you do an RFC (the usual way to talk with
> code) you would list the various requirements and how you address them
> with your code, and ask the other parties if they agree with the
> approach you've taken.
I listed the requirements and my solutions in the patch descriptions of
[1/2] and [2/2]. If you have any questions, please let me know.
Best Regards,
Huang Ying
On Fri, 2010-11-19 at 21:06 +0800, huang ying wrote:
>
> I listed the requirements and my solutions in the patch descriptions of
> [1/2] and [2/2]. If you have any questions, please let me know.
Those talk about _your_ requirements; they do not touch upon any of the
concerns of the other parties and don't even mention them beyond
listing their existence.
It also, very specifically, does not mention if you did talk to any of
those parties and what their thought on the matter was, nor does it ask
them about their opinion.
Instead you present your work as a fait accompli, not a work in flux and
subject to change. You ask for it to be merged -- this does not come
across like a discussion, much less a request for co-operation.
On Fri, Nov 19, 2010 at 9:18 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2010-11-19 at 21:06 +0800, huang ying wrote:
>>
>> I listed the requirements and my solutions in the patch descriptions of
>> [1/2] and [2/2]. If you have any questions, please let me know.
>
> Those talk about _your_ requirements; they do not touch upon any of the
> concerns of the other parties and don't even mention them beyond
> listing their existence.
They are welcome to talk about their requirements and opinions in the thread.
> It also, very specifically, does not mention if you did talk to any of
> those parties and what their thought on the matter was, nor does it ask
> them about their opinion.
Again, opinions are welcome. I just want to talk with code.
> Instead you present your work as a fait accompli, not a work in flux and
> subject to change. You ask for it to be merged -- this does not come
> across like a discussion, much less a request for co-operation.
I heard that LKML likes to talk with code instead of ideas. Am I wrong?
Best Regards,
Huang Ying
On Fri, 2010-11-19 at 21:28 +0800, huang ying wrote:
>
> > Instead you present your work as a fait accompli, not a work in flux and
> > subject to change. You ask for it to be merged -- this does not come
> > across like a discussion, much less a request for co-operation.
>
> I heard that LKML likes to talk with code instead of ideas. Am I wrong?
No, it's your presentation of said code that's wrong.
You're missing the [RFC] tags and open questions in your Changelogs
like: EDAC could use it like so and so, does that sound acceptable
Mauro?
There is no mention you actually did talk to Mauro and Tony and what its
outcome was.
On Fri, Nov 19, 2010 at 9:37 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2010-11-19 at 21:28 +0800, huang ying wrote:
>>
>> > Instead you present your work as a fait accompli, not a work in flux and
>> > subject to change. You ask for it to be merged -- this does not come
>> > across like a discussion, much less a request for co-operation.
>>
>> I heard that LKML likes to talk with code instead of ideas. Am I wrong?
>
> No, it's your presentation of said code that's wrong.
>
> You're missing the [RFC] tags and open questions in your Changelogs
> like: EDAC could use it like so and so, does that sound acceptable
> Mauro?
Maybe. But can we not talk with code without that tag?
> There is no mention you actually did talk to Mauro and Tony and what its
> outcome was.
The main outcome is that we need a generic hardware error reporting
interface that can be used by all hardware error reporting
mechanisms.
Best Regards,
Huang Ying
Putting aside the fact that you've ignored all the requests from the
last round...
On Fri, Nov 19, 2010 at 04:10:32PM +0800, Huang Ying wrote:
> There are many hardware error detecting and reporting components in
> kernel, including x86 Machine Check, PCIe AER, EDAC, APEI GHES,
> etc. Each one has its own error reporting implementation, including
> user space interface, error record format, in-kernel buffer, etc. This
> patch provides a generic hardware error reporting mechanism to reduce
> the duplicated effort and add more common services.
>
>
> A highly extensible generic hardware error record data structure is
> defined to accommodate various hardware error information from various
> hardware error sources. The overall structure of error record is as
> follow:
>
> -----------------------------------------------------------------
> | rcd hdr | sec 1 hdr | sec 1 data | sec 2 hdr | sec2 data | ...
> -----------------------------------------------------------------
>
> Several error sections can be incorporated into one error record to
> accumulate information from multiple hardware components related to
> one error. For example, for an error on a device on the secondary
> side of a PCIe bridge, it is useful to record error information from
> the PCIe bridge and the PCIe device. Multiple sections can be used to
> hold both the cooked and the raw error information, so that abstract
> information is provided by the cooked section while no information is
> lost, because the raw section is provided too.
>
> There are "revision" (rev) and "length" fields in the record header and
> "type" and "length" fields in the section header, so the user space error
> daemon can skip unrecognized error records or error sections. This
> lets an old-version error daemon work with a newer kernel.
>
> New error section types can be added to support new error types and
> error sources.
Yes, we need a generic error reporting format. Wait a second, this
error format structure looks very much like a tracepoint record to me -
it has common fields and record-specific fields. And we have all that
infrastructure in the kernel and yet you've decided to reimplement it
anew. Remind me again why?
> The hardware error reporting mechanism designed by the patch
> integrates well with device model in kernel. struct dev_herr_info is
> defined and pointed to by "error" field of struct device. This is
> used to hold error reporting related information for each device. One
> sysfs directory "error" will be created for each hardware error
> reporting device. Some files for error reporting statistics and
> control are created in sysfs "error" directory.
But all this APEI crap sounds suspiciously bloated - why do we need an
error field for _every_ device on the system? Looks like a bunch of
hoopla for only a few error types...
> For example, the
> "error" directory for APEI GHES is as follow.
>
> /sys/devices/platform/GHES.0/error/logs
> /sys/devices/platform/GHES.0/error/overflows
> /sys/devices/platform/GHES.0/error/throttles
>
> Where "logs" is the number of error records logged; "throttles" is the
> number of error records not logged because the reporting rate is too
> high; "overflows" is the number of error records not logged because
> there is no space available.
>
> Not all devices will report errors, so struct dev_herr_info and the sysfs
> directory/files are only allocated/created for devices that explicitly
> enable it. So to enumerate the error sources of the system, you just need
> to enumerate the "error" directory for each device directory in
> /sys/devices.
So how do you say which devices should report and which shouldn't report
errors, from userspace with a tool? Default actions? What if you forget
to enable error reporting of a device which actually is important?
> One device file (/dev/error/error) which mixed error records from all
> hardware error reporting devices is created to convey error records
> from kernel space to user space. Because hardware devices are dealt
> with, a device file is the most natural way to do that. Because
> hardware error reporting should not hurt system performance, the
> throughput of the interface should be controlled to a low level (done
> by the user space error daemon), and an ordinary "read" is sufficient
> from a performance point of view.
Right, so no need for a daemon but who does the read? cat? How are you
going to collect all those errors? How do you enforce policies? How
do you inject errors? How do you perform actions based on the error
type like disabling or reconfiguring a hw device based on the errors it
generates?
> The patch provides common services for hardware error reporting
> devices too.
>
> A lock-less hardware error record allocator is provided. So for
> hardware error that can be ignored (such as corrected errors), it is
> not needed to pre-allocate the error record or allocate the error
> record on stack. Because the possibility for two hardware parts to go
> error simultaneously is very small, one big unified memory pool for
> hardware errors is better than one memory pool or buffer for each
> device.
Yet another bloat winner. Why do we need a memory allocator for error
records? You either get a single critical error which shuts down the
system and you log it to persistent storage, if possible, or you work at
those uncritical errors one at a time.
IOW, do you have a real-life usecase which justifies the dire need for a
lockless memory allocator specifically for hw errors?
> After filling in all necessary fields in hardware error record, the
> error reporting is quite straightforward, just calling
> herr_record_report, parameters are the error record itself and the
> corresponding struct device.
>
> Hardware errors may burst; for example, the same hardware error may be
> reported at a high rate within a short interval. This will use up all
> pre-allocated memory for error reporting, so that other hardware
> errors coming from the same or a different hardware device cannot be
> logged. To deal with this issue, a throttle algorithm is implemented.
> The logging rate for errors coming from one hardware error device is
> throttled based on the available pre-allocated memory for error
> reporting. In this way we can log as many kinds of errors as possible
> coming from as many devices as possible.
>
>
> This patch is designed by Andi Kleen and Huang Ying.
>
> Signed-off-by: Huang Ying <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> ---
> drivers/Kconfig | 2
> drivers/Makefile | 1
> drivers/base/Makefile | 1
> drivers/base/herror.c | 98 ++++++++
> drivers/herror/Kconfig | 5
> drivers/herror/Makefile | 1
> drivers/herror/herr-core.c | 488 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/Kbuild | 1
> include/linux/device.h | 14 +
> include/linux/herror.h | 35 +++
> include/linux/herror_record.h | 100 ++++++++
> kernel/Makefile | 1
> 12 files changed, 747 insertions(+)
> create mode 100644 drivers/base/herror.c
> create mode 100644 drivers/herror/Kconfig
> create mode 100644 drivers/herror/Makefile
> create mode 100644 drivers/herror/herr-core.c
> create mode 100644 include/linux/herror.h
> create mode 100644 include/linux/herror_record.h
And as it was said a countless times already, this whole thing, if at
all accepted should go to drivers/edac/ or drivers/ras/ or whatever.
--
Regards/Gruss,
Boris.
On Fri, Nov 19, 2010 at 12:10 AM, Huang Ying <[email protected]> wrote:
>
> This is used by APEI ERST and GEHS. But it is a generic hardware
> error reporting mechanism and can be used by other hardware error
> reporting mechanisms such as EDAC, PCIe AER, Machine Check, etc.
Yeah, no.
Really.
We don't want some specific hardware error reporting mechanism.
Hardware errors are way less common than other errors, so making
something that is special to them just isn't very interesting.
I seriously suggest that the only _sane_ way to handle hardware errors is to
(a) admit that they are rare
(b) not try to use some odd special mechanism for them
(c) just 'printk' them so that you can use the absolutely most
standard way to report them, and one that administrators are already
used to and has support for network logging with existing tools etc.
(d) and if you want to make them persistent and NMI-safe, just do
that on the _printk_ level. That way, any NMI-safeness or persistency
helps everybody.
I really see _zero_ point to some hw-error-specific model.
Linus
On Fri, Nov 19, 2010 at 12:10 AM, Huang Ying <[email protected]> wrote:
> Normally, corrected hardware error records will go through the kernel
> processing and be logged to disk or network finally. But for
> uncorrected errors, system may go panic directly for better error
> containment, disk or network is not usable in this half-working
> system. To avoid losing these valuable hardware error records, the
> error records are saved into some kind of simple persistent storage
> such as flash before panic, so that they can be read out after system
> reboot successfully.
I think this is totally the wrong thing to do. TOTALLY.
The fact is, concentrating about "hardware errors" makes this
something that I refuse to merge. It's such an idiotic approach that
it's disgusting.
Now, if this was designed to be a "hardware-backed persistent 'printk'
buffer", and was explicitly meant to save not just some special
hardware error, but catch all printk's (which may be due to hardware
errors or oopses or warnings or whatever), that would be useful.
But limiting it to just some special source of errors makes this
pointless and not ever worth merging.
Linus
On Fri, 19 Nov 2010 07:52:08 -0800
Linus Torvalds <[email protected]> wrote:
> On Fri, Nov 19, 2010 at 12:10 AM, Huang Ying <[email protected]> wrote:
> > Normally, corrected hardware error records will go through the kernel
> > processing and be logged to disk or network finally. But for
> > uncorrected errors, system may go panic directly for better error
> > containment, disk or network is not usable in this half-working
> > system. To avoid losing these valuable hardware error records, the
> > error records are saved into some kind of simple persistent storage
> > such as flash before panic, so that they can be read out after system
> > reboot successfully.
>
> I think this is totally the wrong thing to do. TOTALLY.
>
> The fact is, concentrating about "hardware errors" makes this
> something that I refuse to merge. It's such an idiotic approach that
> it's disgusting.
>
> Now, if this was designed to be a "hardware-backed persistent 'printk'
> buffer", and was explicitly meant to save not just some special
> hardware error, but catch all printk's (which may be due to hardware
> errors or oopses or warnings or whatever), that would be useful.
>
> But limiting it to just some special source of errors makes this
> pointless and not ever worth merging.
>
yep. We already have bits and pieces in place for this: kmsg_dump,
ramoops, mtdoops, etc. If your hardware has a non-volatile memory then
just hook it up as a backend driver for kmsg_dump.
On Fri, Nov 19, 2010 at 11:52 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Nov 19, 2010 at 12:10 AM, Huang Ying <[email protected]> wrote:
>> Normally, corrected hardware error records will go through the kernel
>> processing and be logged to disk or network finally. But for
>> uncorrected errors, system may go panic directly for better error
>> containment, disk or network is not usable in this half-working
>> system. To avoid losing these valuable hardware error records, the
>> error records are saved into some kind of simple persistent storage
>> such as flash before panic, so that they can be read out after system
>> reboot successfully.
>
> I think this is totally the wrong thing to do. TOTALLY.
>
> The fact is, concentrating about "hardware errors" makes this
> something that I refuse to merge. It's such an idiotic approach that
> it's disgusting.
>
> Now, if this was designed to be a "hardware-backed persistent 'printk'
> buffer", and was explicitly meant to save not just some special
> hardware error, but catch all printk's (which may be due to hardware
> errors or oopses or warnings or whatever), that would be useful.
>
> But limiting it to just some special source of errors makes this
> pointless and not ever worth merging.
Yes. APEI ERST can be used to back a persistent 'printk', and that is in
our plan too. But APEI ERST is not limited to that; it can be used
for printk, hardware errors, and maybe some other users. When we designed
APEI ERST support, we had multiple users in mind.
Best Regards,
Huang Ying
On Fri, Nov 19, 2010 at 11:56 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Nov 19, 2010 at 12:10 AM, Huang Ying <[email protected]> wrote:
>>
>> This is used by APEI ERST and GEHS. But it is a generic hardware
>> error reporting mechanism and can be used by other hardware error
>> reporting mechanisms such as EDAC, PCIe AER, Machine Check, etc.
>
> Yeah, no.
>
> Really.
>
> We don't want some specific hardware error reporting mechanism.
> Hardware errors are way less common than other errors, so making
> something that is special to them just isn't very interesting.
>
> I seriously suggest that the only _sane_ way to handle hardware errors is to
> (a) admit that they are rare
> (b) not try to use some odd special mechanism for them
> (c) just 'printk' them so that you can use the absolutely most
> standard way to report them, and one that administrators are already
> used to and has support for network logging with existing tools etc.
We thought about 'printk' for hardware errors before, but it has some
issues too.
1) It mixes software errors and hardware errors. When Andi Kleen
maintained the Machine Check code, he found many users reported
hardware errors as software bugs to the software vendor instead of as
hardware errors to the hardware vendor. An explicit hardware error
reporting interface may help these users.
2) Hardware error reporting may flush other information out of the printk
buffer. Consider one pin of your ECC DIMM being broken: tons of 1-bit
corrected memory errors will be reported. Although we can enforce some
kind of throttling, your printk buffer may eventually be full of
hardware error reports.
3) We need some kind of user space hardware error daemon, which is
used to enforce some policy. For example, if the number of corrected
memory errors reported on one page exceeds a threshold, we can
offline the page to prevent a fatal error from occurring in the future,
because fatal errors may begin as corrected errors in reality. printk
is good for the administrator, but may not be good enough for the hardware
error daemon.
But yes, printk is convenient for the administrator or end user. So we
plan to printk some summary information about hardware errors too, but
leave the full hardware error information to the hardware-error-specific
interface, so that the administrator can get some clue and hardware
error information will not flush the printk buffer.
> (d) and if you want to make them persistent and NMI-safe, just do
> that on the _printk_ level. That way, any NMI-safeness or persistency
> helps everybody.
>
> I really see _zero_ point to some hw-error-specific model.
Best Regards,
Huang Ying
On Fri, Nov 19, 2010 at 6:04 PM, huang ying
<[email protected]> wrote:
>
> We thought about 'printk' for hardware errors before, but it has some
> issues too.
>
> 1) It mixes software errors and hardware errors. When Andi Kleen
> maintained the Machine Check code, he found many users reported
> hardware errors as software bugs to the software vendor instead of as
> hardware errors to the hardware vendor. An explicit hardware error
> reporting interface may help these users.
Bah. Many machine checks _were_ software errors. They were things like
the BIOS not clearing some old pending state etc.
The confusion came not from printk, but simply from ambiguous errors.
When is a machine check hardware-related? It's not at all always
obvious.
Sometimes machine checks are from uninitialized hardware state, where
_software_ hasn't initialized it. Is it a hardware bug? No.
> 2) Hardware error reporting may flush other information out of the printk
> buffer. Consider one pin of your ECC DIMM being broken: tons of 1-bit
> corrected memory errors will be reported. Although we can enforce some
> kind of throttling, your printk buffer may eventually be full of
> hardware error reports.
Sure. That doesn't change the fact that finding the data in your
/var/log/messages with your regular logging tools is still a lot more
useful than having some random tool that is specialized and that most
IT people won't know about. And that won't be good at doing network
reporting etc etc.
The thing is, hardware errors aren't that special. Sure, hardware
people always think so. But to anybody else, a hardware error is "just
another source of issues".
Anybody who thinks that hardware errors are special and needs a
special interface is missing that point totally.
And I really do understand why people inside Intel would miss that
point. To YOU guys the hardware errors you report are magical and
special. But that's always true. To _everybody_, the errors _they_
report is special. Like snowflakes, we're all unique. And we're all
the same.
> 3) We need some kind of user space hardware error daemon, which is
> used to enforce some policy. For example, if the number of corrected
> memory errors reported on one page exceeds a threshold, we can
> offline the page to prevent a fatal error from occurring in the future,
> because fatal errors may begin as corrected errors in reality. printk
> is good for the administrator, but may not be good enough for the hardware
> error daemon.
And by "we", who do you mean exactly? The fact is, "we" covers a lot
of ground, and I don't think your statement is in the least true.
Yes, IT people want to know. When they start seeing hardware errors,
they'll start replacing the machine as soon as they can. Whether that
replacement is then "in five minutes" or "four months from now" is up
to their management, their replacement policy, and based on how
critical that machine is.
IT HAS NOTHING WHAT-SO-EVER TO DO WITH HOW OFTEN THE ERRORS HAPPEN.
And yes, Intel can do guidelines, but when you say there should be
some "enforced policy" by some tool, you're simply just wrong.
Linus
On Fri, Nov 19, 2010 at 9:56 PM, <[email protected]> wrote:
> Yes, we need a generic error reporting format. Wait a second, this
> error format structure looks very much like a tracepoint record to me -
> it has common fields and record-specific fields. And we have all that
> infrastructure in the kernel and yet you've decided to reimplement it
> anew. Remind me again why?
You mean "struct trace_entry"? They are quite different in design. The
record format in this patch can incorporate multiple sections into one
record, which is meaningful for hardware error reporting. And we do
not need the fancy
"/sys/kernel/debug/tracing/events/<xxx>/<xxx>/format": the user space
error daemon consumes all error records it recognizes and blindly
logs all other records.
>> The hardware error reporting mechanism designed by the patch
>> integrates well with device model in kernel. struct dev_herr_info is
>> defined and pointed to by "error" field of struct device. This is
>> used to hold error reporting related information for each device. One
>> sysfs directory "error" will be created for each hardware error
>> reporting device. Some files for error reporting statistics and
>> control are created in sysfs "error" directory.
>
> But all this APEI crap sounds suspiciously bloated
There is no APEI specific code in this patch.
> - why do we need an
> error field for _every_ device on the system? Looks like a bunch of
> hoopla for only a few error types...
Because every device may report hardware errors, but not every device
will do it. So just a pointer is added to "struct device", and the
corresponding data structure is only created when needed.
>> For example, the
>> "error" directory for APEI GHES is as follow.
>>
>> /sys/devices/platform/GHES.0/error/logs
>> /sys/devices/platform/GHES.0/error/overflows
>> /sys/devices/platform/GHES.0/error/throttles
>>
>> Where "logs" is number of error records logged; "throttles" is number
>> of error records not logged because the reporting rate is too high;
>> "overflows" is number of error records not logged because there is no
>> space available.
>>
>> Not all devices will report errors, so struct dev_herr_info and sysfs
>> directory/files are only allocated/created for devices explicitly
>> enable it. So to enumerate the error sources of system, you just need
>> to enumerate "error" directory for each device directory in
>> /sys/devices.
>
> So how do you say which devices should report and which shouldn't report
> errors, from userspace with a tool? Default actions? What if you forget
> to enable error reporting of a device which actually is important?
In general all hardware errors should be reported and logged.
>> One device file (/dev/error/error) which mixed error records from all
>> hardware error reporting devices is created to convey error records
>> from kernel space to user space. Because hardware devices are dealt
>> with, a device file is the most natural way to do that. Because
>> hardware error reporting should not hurts system performance, the
>> throughput of the interface should be controlled to a low level (done
>> by user space error daemon), ordinary "read" is sufficient from
>> performance point of view.
>
> Right, so no need for a daemon but who does the read? cat? How are you
> going to collect all those errors? How do you enforce policies?
Some summary hardware error information can be put into printk. An error
daemon is needed because we need not only to log the error but also to
do predictive recovery. If you really have no daemon, cat can be used
to log the errors. I don't fully understand your point: do you want to
enforce policies without an error daemon?
> How do you inject errors?
We can use another device file to inject errors, for example
/dev/error/error_inject: just write the needed information to this
file. The format can be the same as the error record defined above,
because it is highly extensible.
> How do you perform actions based on the error
> type like disabling or reconfiguring a hw device based on the errors it
> generates?
These are policies and will be done in the user space error daemon.
Some emergency error recovery actions will be done in the kernel.
>> The patch provides common services for hardware error reporting
>> devices too.
>>
>> A lock-less hardware error record allocator is provided. So for
>> hardware error that can be ignored (such as corrected errors), it is
>> not needed to pre-allocate the error record or allocate the error
>> record on stack. Because the possibility for two hardware parts to go
>> error simultaneously is very small, one big unified memory pool for
>> hardware errors is better than one memory pool or buffer for each
>> device.
>
> Yet another bloat winner. Why do we need a memory allocator for error
> records?
The point is lockless, not the memory allocator. The lockless memory
allocator is not specific to hardware error reporting; it can be used
by other parts of the kernel too.
> You either get a single critical error which shuts down the
> system and you log it to persistent storage, if possible, or you work at
> those uncritical errors one at a time.
Uncritical errors can be reported in the NMI handler too, so we need
lockless data structures for them.
> IOW, do you have a real-life usecase which justifies the dire need for a
> lockless memory allocator specifically for hw errors?
No, it is not special to hw errors. It can be used by other parts of
the kernel.
>> After filling in all necessary fields in hardware error record, the
>> error reporting is quite straightforward, just calling
>> herr_record_report, parameters are the error record itself and the
>> corresponding struct device.
>>
>> Hardware errors may burst, for example, same hardware errors may be
>> reported at high rate within a short interval, this will use up all
>> pre-allocated memory for error reporting, so that other hardware
>> errors come from same or different hardware device can not be logged.
>> To deal with this issue, a throttle algorithm is implemented. The
>> logging rate for errors come from one hardware error device is
>> throttled based on the available pre-allocated memory for error
>> reporting. In this way we can log as many kinds of errors as possible
>> comes from as many devices as possible.
>>
>>
>> This patch is designed by Andi Kleen and Huang Ying.
>>
>> Signed-off-by: Huang Ying <[email protected]>
>> Reviewed-by: Andi Kleen <[email protected]>
>> ---
>> drivers/Kconfig | 2
>> drivers/Makefile | 1
>> drivers/base/Makefile | 1
>> drivers/base/herror.c | 98 ++++++++
>> drivers/herror/Kconfig | 5
>> drivers/herror/Makefile | 1
>> drivers/herror/herr-core.c | 488
> ++++++++++++++++++++++++++++++++++++++++++
>> include/linux/Kbuild | 1
>> include/linux/device.h | 14 +
>> include/linux/herror.h | 35 +++
>> include/linux/herror_record.h | 100 ++++++++
>> kernel/Makefile | 1
>> 12 files changed, 747 insertions(+)
>> create mode 100644 drivers/base/herror.c
>> create mode 100644 drivers/herror/Kconfig
>> create mode 100644 drivers/herror/Makefile
>> create mode 100644 drivers/herror/herr-core.c
>> create mode 100644 include/linux/herror.h
>> create mode 100644 include/linux/herror_record.h
>
> And as it was said a countless times already, this whole thing, if at
> all accepted should go to drivers/edac/ or drivers/ras/ or whatever.
You think drivers/herror is not a good name? We can rename it to
"drivers/ras" if that is the consensus.
Best Regards,
Huang Ying
On Sat, Nov 20, 2010 at 10:15 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Nov 19, 2010 at 6:04 PM, huang ying
> <[email protected]> wrote:
>>
>> We thought about 'printk' for hardware errors before, but it has some
>> issues too.
>>
>> 1) It mixes software errors and hardware errors. When Andi Kleen
>> maintained the Machine Check code, he found many users report the
>> hardware errors as software bug to software vendor instead of as
>> hardware error to hardware vendor. Having explicit hardware error
>> reporting interface may help these users.
>
> Bah. Many machine checks _were_ software errors. They were things like
> the BIOS not clearing some old pending state etc.
I think BIOS errors should be reported to the hardware vendor instead
of the software vendor. Don't you think so?
> The confusion came not from printk, but simply from ambiguous errors.
> When is a machine check hardware-related? It's not at all always
> obvious.
>
> Sometimes machine checks are from uninitialized hardware state, where
> _software_ hasn't initialized it. Is it a hardware bug? No.
That could be possible.
>> 2) Hardware error reporting may flush other information in printk
>> buffer. Considering one pin of your ECC DIMM is broken, tons of 1 bit
>> corrected memory error will be reported. Although we can enforce some
>> kind of throttling, your printk buffer may be full of the hardware
>> error reporting eventually.
>
> Sure. That doesn't change the fact that finding the data is your
> /var/log/messages and your regular logging tools is still a lot more
> useful than having some random tool that is specialized and that most
> IT people won't know about. And that won't be good at doing network
> reporting etc etc.
>
> The thing is, hardware errors aren't that special. Sure, hardware
> people always think so. But to anybody else, a hardware error is "just
> another source of issues".
>
> Anybody who thinks that hardware errors are special and needs a
> special interface is missing that point totally.
>
> And I really do understand why people inside Intel would miss that
> point. To YOU guys the hardware errors you report are magical and
> special. But that's always true. To _everybody_, the errors _they_
> report is special. Like snowflakes, we're all unique. And we're all
> the same.
Yes. Hardware errors and software errors are just two types of errors;
hardware errors are not so special. So I agree that we need to report
hardware error information with printk, which is a mainly
human-oriented interface. But we need a tool-oriented interface too, to
let the user space error daemon do things like counting errors per
hardware component, offlining/hot-removing failing components
automatically based on some policy, etc.
>> 3) We need some kind of user space hardware error daemon, which is
>> used to enforce some policy. For example, if the number of corrected
>> memory errors reported on one page exceeds the threshold, we can
>> offline the page to prevent some fatal error to occur in the future,
>> because fatal error may begin with corrected errors in reality. printk
>> is good for administrator, and may be not good enough for the hardware
>> error daemon.
>
> And by "we", who do you mean exactly? The fact is, "we" covers a lot
> of ground, and I don't think your statement is in the least true.
>
> Yes, IT people want to know. When they start seeing hardware errors,
> they'll start replacing the machine as soon as they can. Whether that
> replacement is then "in five minutes" or "four months from now" is up
> to their management, their replacement policy, and based on how
> critical that machine is.
>
> IT HAS NOTHING WHAT-SO-EVER TO DO WITH HOW OFTEN THE ERRORS HAPPEN.
Because some external causes like cosmic rays and electromagnetic
interference can produce hardware errors too, we need error counting to
distinguish externally caused hardware errors from real hardware
faults.
Usually, a hardware component reporting corrected errors can keep
working for a while. But if the corrected-error reporting rate goes up,
the probability of the hardware failing (because of some fatal error)
goes up too. Error counting can help IT people judge the urgency.
And a user space error daemon can help IT people perform some recovery
operations automatically, for example triggering memory or CPU
offline/hot-remove based on a policy set by IT people.
> And yes, Intel can do guidelines, but when you say there should be
> some "enforced policy" by some tool, you're simply just wrong.
Yes. The replacement policy should be determined by IT people; my
previous wording was confusing. We need to provide some mechanism in
the user space error daemon to help IT people do that automatically.
For example, we provide error counting for each hardware component,
and let IT people set the threshold.
So, do you agree that we need some tool oriented interface in addition
to printk?
Best Regards,
Huang Ying
On Sat, Nov 20, 2010 at 10:52:56AM +0800, huang ying wrote:
> On Fri, Nov 19, 2010 at 9:56 PM, <[email protected]> wrote:
> > Yes, we need a generic error reporting format. Wait a second, this
> > error format structure looks very much like a tracepoint record to me -
> > it has common fields and record-specific fields. And we have all that
> > infrastructure in the kernel and yet you've decided to reimplement it
> > anew. Remind me again why?
>
> You mean "struct trace_entry"? They are quite different on design. The
> record format in patch can incorporate multiple sections into one
> record, which is meaningful for hardware error reporting.
Nope, I mean a tracepoint and it can do all those things.
> And we do not need the fancy
> "/sys/kernel/debug/tracing/events/<xxx>/<xxx>/format", user space
> error daemon only consumes all error record it recognized and blindly
> log all other records.
Nobody said you needed that - the tracepoint contains all the
information you need.
[..]
> > - why do we need an
> > error field for _every_ device on the system? Looks like a bunch of
> > hoopla for only a few error types...
>
> Because every device may report hardware errors, but not every device
> will do it. So just a pointer is added to "struct device" and
> corresponding data structure is only created when needed.
>
> >> For example, the
> >> "error" directory for APEI GHES is as follow.
> >>
> >> /sys/devices/platform/GHES.0/error/logs
> >> /sys/devices/platform/GHES.0/error/overflows
> >> /sys/devices/platform/GHES.0/error/throttles
> >>
> >> Where "logs" is number of error records logged; "throttles" is number
> >> of error records not logged because the reporting rate is too high;
> >> "overflows" is number of error records not logged because there is no
> >> space available.
> >>
> >> Not all devices will report errors, so struct dev_herr_info and sysfs
> >> directory/files are only allocated/created for devices explicitly
> >> enable it. So to enumerate the error sources of system, you just need
> >> to enumerate "error" directory for each device directory in
> >> /sys/devices.
> >
> > So how do you say which devices should report and which shouldn't report
> > errors, from userspace with a tool? Default actions? What if you forget
> > to enable error reporting of a device which actually is important?
>
> In general all hardware errors should be reported and logged.
So why the need to enable/disable them? Why add all that code to
enable/disable them when all devices can report hw errors but not all
do it but all should do it... (I can go on forever). Do you see the
spaghetti statement?
> >> One device file (/dev/error/error) which mixed error records from all
> >> hardware error reporting devices is created to convey error records
> >> from kernel space to user space. Because hardware devices are dealt
> >> with, a device file is the most natural way to do that. Because
> >> hardware error reporting should not hurts system performance, the
> >> throughput of the interface should be controlled to a low level (done
> >> by user space error daemon), ordinary "read" is sufficient from
> >> performance point of view.
> >
> > Right, so no need for a daemon but who does the read? cat? How are you
> > going to collect all those errors? How do you enforce policies?
>
> Some summary hardware error information can be put into printk. Error
> daemon is needed because we need not only log the the error but the
> predictive recovery. If you really have no daemon, cat can be used to
> log the error. I don't fully understand your words, you want to
> enforce policies without error daemon?
Sorry, I misread your original statement. So it is clear that we need
some kind of daemon to do some error post-processing.
>
> > How do you inject errors?
>
> We can use another device file to inject error, for example
> /dev/error/error_inject. Just write the needed information to this
> file. The format can be same as the error record defined as above,
> because it is highly extensible.
Same argument as above - you can do that with tracepoints without
duplicating functionality.
[.. ]
> >> A lock-less hardware error record allocator is provided. So for
> >> hardware error that can be ignored (such as corrected errors), it is
> >> not needed to pre-allocate the error record or allocate the error
> >> record on stack. Because the possibility for two hardware parts to go
> >> error simultaneously is very small, one big unified memory pool for
> >> hardware errors is better than one memory pool or buffer for each
> >> device.
> >
> > Yet another bloat winner. Why do we need a memory allocator for error
> > records?
>
> The point is lockless not the memory allocator. The lockless memory
> allocator is not hardware error reporting specific, it can be used by
> other part of the kernel too.
Wait a second, are we talking about hardware errors or memory management
here? If you want to push your lockless memory allocator, send it in to
linux-mm and let the guys there look at it, but not in conjunction with
hw errors. That's like I'm going for a run and, btw, while I'm at it, I
can buy a coffee machine.
> > You either get a single critical error which shuts down the
> > system and you log it to persistent storage, if possible, or you work at
> > those uncritical errors one at a time.
>
> Uncritical errors can be reported in NMI handler too. So we need
> lockless data structure for them.
Why? What's wrong with using a single struct on the stack? Are you
afraid that we might blow the NMI stack although NMIs don't nest?
[.. ]
Dude, let me save you the trouble: all everybody is trying to say is
that you can achieve all that with stuff already available in the
kernel. And HW errors are not that special to need a special subsystem
grown for them - you just need to handle them properly. The only thing
you should provide is the backend to persistent storage so that error
info can be put there - everything else is bloat.
--
Regards/Gruss,
Boris.
On Sat, Nov 20, 2010 at 5:00 PM, Borislav Petkov <[email protected]> wrote:
> On Sat, Nov 20, 2010 at 10:52:56AM +0800, huang ying wrote:
>> On Fri, Nov 19, 2010 at 9:56 PM, <[email protected]> wrote:
>> > Yes, we need a generic error reporting format. Wait a second, this
>> > error format structure looks very much like a tracepoint record to me -
>> > it has common fields and record-specific fields. And we have all that
>> > infrastructure in the kernel and yet you've decided to reimplement it
>> > anew. Remind me again why?
>>
>> You mean "struct trace_entry"? They are quite different on design. The
>> record format in patch can incorporate multiple sections into one
>> record, which is meaningful for hardware error reporting.
>
> Nope, I mean a tracepoint and it can do all those things.
Can you use a tracepoint to output error information from multiple
devices in one error record? That is useful for hardware error
reporting.
>> And we do not need the fancy
>> "/sys/kernel/debug/tracing/events/<xxx>/<xxx>/format", user space
>> error daemon only consumes all error record it recognized and blindly
>> log all other records.
>
> Nobody said you needed that - the tracepoint contains all the
> information you need.
Yes. We do not need that; it is only a side effect.
> [..]
>
>> > - why do we need an
>> > error field for _every_ device on the system? Looks like a bunch of
>> > hoopla for only a few error types...
>>
>> Because every device may report hardware errors, but not every device
>> will do it. So just a pointer is added to "struct device" and
>> corresponding data structure is only created when needed.
>>
>> >> For example, the
>> >> "error" directory for APEI GHES is as follow.
>> >>
>> >> /sys/devices/platform/GHES.0/error/logs
>> >> /sys/devices/platform/GHES.0/error/overflows
>> >> /sys/devices/platform/GHES.0/error/throttles
>> >>
>> >> Where "logs" is number of error records logged; "throttles" is number
>> >> of error records not logged because the reporting rate is too high;
>> >> "overflows" is number of error records not logged because there is no
>> >> space available.
>> >>
>> >> Not all devices will report errors, so struct dev_herr_info and sysfs
>> >> directory/files are only allocated/created for devices explicitly
>> >> enable it. So to enumerate the error sources of system, you just need
>> >> to enumerate "error" directory for each device directory in
>> >> /sys/devices.
>> >
>> > So how do you say which devices should report and which shouldn't report
>> > errors, from userspace with a tool? Default actions? What if you forget
>> > to enable error reporting of a device which actually is important?
>>
>> In general all hardware errors should be reported and logged.
>
> So why the need to enable/disable them? Why add all that code to
> enable/disable them when all devices can report hw errors but not all
> do it but all should do it... (I can go on forever). Do you see the
> spaghetti statement?
Because some error reporting devices may themselves not work properly,
we need a way to deal with such devices.
>> >> One device file (/dev/error/error) which mixed error records from all
>> >> hardware error reporting devices is created to convey error records
>> >> from kernel space to user space. Because hardware devices are dealt
>> >> with, a device file is the most natural way to do that. Because
>> >> hardware error reporting should not hurts system performance, the
>> >> throughput of the interface should be controlled to a low level (done
>> >> by user space error daemon), ordinary "read" is sufficient from
>> >> performance point of view.
>> >
>> > Right, so no need for a daemon but who does the read? cat? How are you
>> > going to collect all those errors? How do you enforce policies?
>>
>> Some summary hardware error information can be put into printk. Error
>> daemon is needed because we need not only log the the error but the
>> predictive recovery. If you really have no daemon, cat can be used to
>> log the error. I don't fully understand your words, you want to
>> enforce policies without error daemon?
>
> Sorry, I misread your original statement. So it is clear that we need
> some kind of daemon to do some error post-processing.
>
>>
>> > How do you inject errors?
>>
>> We can use another device file to inject error, for example
>> /dev/error/error_inject. Just write the needed information to this
>> file. The format can be same as the error record defined as above,
>> because it is highly extensible.
>
> Same argument as above - you can do that with tracepoints without
> duplicating functionality.
>
> [.. ]
>
>> >> A lock-less hardware error record allocator is provided. So for
>> >> hardware error that can be ignored (such as corrected errors), it is
>> >> not needed to pre-allocate the error record or allocate the error
>> >> record on stack. Because the possibility for two hardware parts to go
>> >> error simultaneously is very small, one big unified memory pool for
>> >> hardware errors is better than one memory pool or buffer for each
>> >> device.
>> >
>> > Yet another bloat winner. Why do we need a memory allocator for error
>> > records?
>>
>> The point is lockless not the memory allocator. The lockless memory
>> allocator is not hardware error reporting specific, it can be used by
>> other part of the kernel too.
>
> Wait a second, are we talking about hardware errors or memory management
> here? If you want to push your lockless memory allocator, send it in to
> linux-mm and let the guys there look at it, but not in conjunction with
> hw errors. That's like I'm going for a run and, btw, while I'm at it, I
> can buy a coffee machine.
The lockless memory allocator is not in this patch; I will push it in a
separate patch. The error record allocator here is just a thin wrapper
with throttling support integrated.
>> > You either get a single critical error which shuts down the
>> > system and you log it to persistent storage, if possible, or you work at
>> > those uncritical errors one at a time.
>>
>> Uncritical errors can be reported in NMI handler too. So we need
>> lockless data structure for them.
>
> Why? What's wrong with using a single struct on the stack? Are you
> afraid that we might blow the NMI stack although NMIs don't nest?
The lockless memory allocator is not used to save space on the stack.
It is part of the lockless data structures needed by hardware error
reporting.
> [.. ]
>
> Dude, let me save you the trouble: all everybody is trying to say is
> that you can achieve all that with stuff already available in the
> kernel.
Tracepoints are not printk; they were not designed for error reporting.
> And HW errors are not that special to need a special subsystem
> grown for them - you just need to handle them properly. The only thing
> you should provide is the backend to persistent storage so that error
> info can be put there - everything else is bloat.
It seems the main difference of opinion between us is that you want to
implement hardware error reporting inside tracepoints, while I want to
implement it outside tracepoints. Your point is code sharing; my point
is code simplicity. Tracepoints themselves are already quite complex;
adding hardware error reporting would make them even more complex. In
fact, hardware error reporting is quite simple: as you can see, this
patch is quite small. Even the code added to tracepoints to support
hardware error reporting might be more complex than this patch.
As for code sharing, the main parts that could be shared between
tracepoints and hardware error reporting are the lockless data
structures and some NMI handling facilities. But we do not need to
implement hardware error reporting inside tracepoints to share them.
Maybe we can refactor the lockless data structures and NMI handling
facilities inside tracepoints into general ones, and use those in both
tracepoints and hardware error reporting. For example, "irq_work" can
be used in hardware error reporting too.
Best Regards,
Huang Ying
On 10-11-20 02:11 AM, huang ying wrote:
>
> I think the BIOS error should be reported to hardware vendor instead
> of software vendor. Do you think so?
If you (and the code) are absolutely certain that a particular error instance
is totally due to the BIOS, then stick the words "BIOS ERROR" into the printk().
Problem solved.
And in the event that the diagnosis is wrong, the rest of us will still
have the complete picture of what happened from dmesg, rather than seeing
random kernel errors (from other code) happen later without knowing there
was some kind of BIOS or hardware fault that triggered it.
Having them all in one place is rather useful.
And you can still configure rsyslogd to _also_ send the BIOS/hardware
errors to a separate destination, if that turns out to be useful.
Cheers
On Sat, Nov 20, 2010 at 9:39 PM, Mark Lord <[email protected]> wrote:
> On 10-11-20 02:11 AM, huang ying wrote:
>>
>> I think the BIOS error should be reported to hardware vendor instead
>> of software vendor. Do you think so?
>
> If you (and the code) are absolutely certain that a particular error
> instance
> is totally due to the BIOS, then stick the words "BIOS ERROR" into the
> printk().
>
> Problem solved.
>
> And in the even that the diagnosis is wrong, the rest of us will still
> have the complete picture of what happened from dmesg, rather than seeing
> random kernel errors (from other code) happen later without knowing there
> was some kind of BIOS or hardware fault that triggered it.
>
> Having them all in one place is rather useful.
> And you can still configure rsyslogd to _also_ send the BIOS/hardware
> errors to a separate destination, if that turns out to be useful.
I have no objection to reporting hardware errors with printk too. But
we need a user space hardware error daemon as well, which needs a
tool-oriented interface. Do you think printk is a good interface for a
tool to extract and parse error records? I think it is mainly human
oriented.
Best Regards,
Huang Ying
Hmm. This seems to have gotten bounced by a bad smtp setup here
locally. Sorry if you get it twice..
Linus
On Sat, Nov 20, 2010 at 8:04 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Nov 19, 2010 at 11:11 PM, huang ying
> <[email protected]> wrote:
>> On Sat, Nov 20, 2010 at 10:15 AM, Linus Torvalds
>>> Bah. Many machine checks _were_ software errors. They were things like
>>> the BIOS not clearing some old pending state etc.
>>
>> I think the BIOS error should be reported to hardware vendor instead
>> of software vendor. Do you think so?
>
> They won't care. The only people who care are _us_. Software people.
> We may be able to work around a broken BIOS.
>
> Also, sometimes the machine checks are really our fault. Read the
> Intel documentation on page tables etc, it says that you can get
> machine checks if you inconsistent page attributes. Or maybe that was
> AMD.
>
> The point is, it's simply not _true_ that hardware errors are always a
> hardware bug. It never has been.
>
> And it's not _true_ that people care about them the same way. The only
> thing that is true is that a sysadmin wants to see them, but he wants
> to see them _exactly_ the same way he wants to see a kernel oops etc.
>
>>> IT HAS NOTHING WHAT-SO-EVER TO DO WITH HOW OFTEN THE ERRORS HAPPEN.
>>
>> Because some external cause like cosmic rays and electromagnetic
>> interference can cause hardware errors too. We need error counting to
>> distinguish between external caused hardware errors and real hardware
>> errors.
>
> Do you really think that a system administrator is too stupid to count to three?
>
> Yes, admittedly I've met some people like that. But no, "cosmic rays"
> do not change anything.
>
> People have had this for _ages_ with simple parity-protected RAM (with
> ECC just being another fancier form of it). People _know_.
>
> If you get an ECC report randomly once a month per machine, you know
> it's something like cosmic rays.
>
> And if you notice that _one_ of your machines gets five ECC errors per
> minute, you know it's something else. As an MIS person you might still
> decide keep the dang thing, because it's just the print server for the
> admin people, and you know that your paycheck is handled by another
> machine. But if it's the Quake server, you realize that it needs to be
> replaced _today_.
>
> See? That's not the kind of rational decision that some automated
> program can make.
>
> It really is that simple. No amount of "automatic counting" will ever
> help you. Quite the reverse. It will just complicate the thing.
>
>> So, do you agree that we need some tool oriented interface in addition
>> to printk?
>
> No. Any such tool will just _hide_ the information from the MIS people
> who don't even know about it.
>
> But you could certainly make a simple agreed-upon format. We have BUG:
> and WARNING: in the kernel logs. Why not HWPROBLEM: or something?
>
> MIS people love their perl scripts. And the people who can't do perl
> can still use the standard log tools.
>
>                               Linus
>
On Sun, Nov 21, 2010 at 7:57 AM, Linus Torvalds
<[email protected]> wrote:
[...]
> On Sat, Nov 20, 2010 at 8:04 AM, Linus Torvalds
> <[email protected]> wrote:
>> On Fri, Nov 19, 2010 at 11:11 PM, huang ying
>> <[email protected]> wrote:
>>> On Sat, Nov 20, 2010 at 10:15 AM, Linus Torvalds
>>>> Bah. Many machine checks _were_ software errors. They were things like
>>>> the BIOS not clearing some old pending state etc.
>>>
>>> I think the BIOS error should be reported to hardware vendor instead
>>> of software vendor. Do you think so?
>>
>> They won't care. The only people who care are _us_. Software people.
>> We may be able to work around a broken BIOS.
>>
>> Also, sometimes the machine checks are really our fault. Read the
>> Intel documentation on page tables etc, it says that you can get
>> machine checks if you have inconsistent page attributes. Or maybe that was
>> AMD.
>>
>> The point is, it's simply not _true_ that hardware errors are always a
>> hardware bug. It never has been.
>>
>> And it's not _true_ that people care about them the same way. The only
>> thing that is true is that a sysadmin wants to see them, but he wants
>> to see them _exactly_ the same way he wants to see a kernel oops etc.
>>
>>>> IT HAS NOTHING WHAT-SO-EVER TO DO WITH HOW OFTEN THE ERRORS HAPPEN.
>>>
>>> Because some external cause like cosmic rays and electromagnetic
>>> interference can cause hardware errors too. We need error counting to
>>> distinguish between external caused hardware errors and real hardware
>>> errors.
>>
>> Do you really think that a system administrator is too stupid to count to three?
Yes, they can. But people like tools. For example, I can calculate, but
sometimes I use a calculator. :)
>> Yes, admittedly I've met some people like that. But no, "cosmic rays"
>> do not change anything.
>>
>> People have had this for _ages_ with simple parity-protected RAM (with
>> ECC just being another fancier form of it). People _know_.
>>
>> If you get an ECC report randomly once a month per machine, you know
>> it's something like cosmic rays.
>>
>> And if you notice that _one_ of your machines gets five ECC errors per
>> minute, you know it's something else. As an MIS person you might still
>> decide to keep the dang thing, because it's just the print server for the
>> admin people, and you know that your paycheck is handled by another
>> machine. But if it's the Quake server, you realize that it needs to be
>> replaced _today_.
>>
>> See? That's not the kind of rational decision that some automated
>> program can make.
We just provide the mechanism in the automated program and let the MIS
person fill in the policy. They can set up the automated program on the
print server to just email them if errors exceed a threshold, and set up
the Quake server to hot-remove the failing DIMM if errors exceed a
threshold.
Some server machines can do more than just have the whole machine
replaced. Some hardware components like DIMMs, CPUs, etc. can be
hot-removed, and that can be done by a tool instead of a human. We can
trigger these operations automatically and in a more timely way if we
have automated tools. After errors exceed a threshold, an administrator
may need several hours to notice, but an automated tool can react almost
immediately.
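As an illustration of that mechanism/policy split, here is a minimal
sketch of a threshold policy in user space (the thresholds, window, and
actions are all made up; a real policy would be supplied by the
administrator):

```python
import time
from collections import defaultdict

class ErrorPolicy:
    """Count errors per component; fire an admin-chosen action on overflow.

    The kernel-side mechanism only delivers error records; the threshold,
    time window, and action (email, hot-remove, ...) are pure policy.
    """

    def __init__(self, threshold, window, action):
        self.threshold = threshold          # errors tolerated per window
        self.window = window                # window length in seconds
        self.action = action                # callback chosen by the admin
        self.events = defaultdict(list)     # component -> event timestamps

    def report(self, component, when=None):
        """Record one error; return True if the action was triggered."""
        now = time.time() if when is None else when
        # Keep only events still inside the sliding window.
        events = [t for t in self.events[component] if now - t <= self.window]
        events.append(now)
        self.events[component] = events
        if len(events) > self.threshold:
            self.action(component, len(events))
            return True
        return False

def email_admin(component, count):
    # Placeholder for the print-server policy: only notify a human.
    print("mail: %s logged %d errors" % (component, count))

def offline_component(component, count):
    # Placeholder for the Quake-server policy: hot-remove the failing part.
    print("hot-remove requested for %s after %d errors" % (component, count))

print_server = ErrorPolicy(threshold=5, window=3600, action=email_admin)
quake_server = ErrorPolicy(threshold=5, window=3600, action=offline_component)
```

The same counting mechanism serves both machines; only the callback
differs, which is exactly the policy decision left to the MIS person.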
And the user-space tool can help us identify the failing hardware
components too. For example, there is no common way to identify which
DIMM has gone bad from the physical address reported by hardware.
Sometimes a very tricky method is used: the EDAC people use a
motherboard-specific table to map the address to the DIMM slot. On some
machines the SMBIOS table can be used, but on other machines the SMBIOS
table is just crap. I think it is not good to do all this dirty and
possibly machine-specific work in the kernel.
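For what it's worth, the user-space side of that mapping could be as
simple as a board-specific lookup table shipped with the tool. A sketch
(the address ranges and slot labels are entirely invented):

```python
import bisect

# Hypothetical, board-specific table mapping physical address ranges to
# DIMM slots -- exactly the kind of machine-specific data (EDAC layout
# tables, SMBIOS) that is awkward to carry inside the kernel.
ADDRESS_MAP = [
    # (start, end, slot label), sorted by start address
    (0x000000000, 0x0FFFFFFFF, "DIMM_A1"),
    (0x100000000, 0x1FFFFFFFF, "DIMM_A2"),
    (0x200000000, 0x2FFFFFFFF, "DIMM_B1"),
]

def addr_to_dimm(addr, table=ADDRESS_MAP):
    """Map an error's physical address to a DIMM slot, or None."""
    starts = [start for start, _end, _slot in table]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, end, slot = table[i]
        if start <= addr <= end:
            return slot
    return None
```

Swapping in a different table per board keeps all the dirty,
machine-specific knowledge out of the kernel.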
>> It really is that simple. No amount of "automatic counting" will ever
>> help you. Quite the reverse. It will just complicate the thing.
>>
>>> So, do you agree that we need some tool oriented interface in addition
>>> to printk?
>>
>> No. Any such tool will just _hide_ the information from the MIS people
>> who don't even know about it.
I don't want to hide the information from the MIS people with the
tool. I want to show the information to MIS people in a better way.
For example, we can email MIS people in certain situations. And we can
implement an SNMP agent inside the tool, so that the MIS people can
monitor the hardware status remotely. This can be integrated with the
MIS people's other administration tools.
>> But you could certainly make a simple agreed-upon format. We have BUG:
>> and WARNING: in the kernel logs. Why not HWPROBLEM: or something?
There is a "[Hardware Error]: " prefix for printk in the kernel. We can
use that to mark hardware errors. It is already used by Machine Check.
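A tool (or perl script) scraping the logs then only has to key on that
prefix; a minimal sketch (the sample log lines are invented):

```python
import re

# The kernel tags hardware error messages with a "[Hardware Error]: "
# printk prefix (used by Machine Check, among others).
HWERR = re.compile(r"\[Hardware Error\]:\s*(?P<msg>.*)")

def extract_hw_errors(lines):
    """Return the message part of every hardware-error log line."""
    out = []
    for line in lines:
        m = HWERR.search(line)
        if m:
            out.append(m.group("msg"))
    return out

sample = [
    "kernel: [Hardware Error]: Machine check events logged",
    "kernel: usb 1-1: new high-speed USB device",
    "kernel: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 2",
]
print(extract_hw_errors(sample))
```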
>> MIS people love their perl scripts. And the people who can't do perl
>> can still use the standard log tools.
Perl scripts are just another kind of user-space tool for hardware
errors. We just want to write a better tool for them with the help of
a tool-oriented error reporting interface.
Best Regards,
Huang Ying
2010/11/19 Linus Torvalds <[email protected]>:
> On Fri, Nov 19, 2010 at 12:10 AM, Huang Ying <[email protected]> wrote:
>>
>> This is used by APEI ERST and GEHS. But it is a generic hardware
>> error reporting mechanism and can be used by other hardware error
>> reporting mechanisms such as EDAC, PCIe AER, Machine Check, etc.
>
> Yeah, no.
>
> Really.
>
> We don't want some specific hardware error reporting mechanism.
> Hardware errors are way less common than other errors, so making
> something that is special to them just isn't very interesting.
Reading the following Google paper on memory errors:
http://www.google.com/research/pubs/pub35162.html
I suppose they weren't really reporting memory errors with printk,
because of this:
"The scale of the system and the data being collected make the
analysis non-trivial. Each one of many ten-thousands of machines in
the fleet logs every ten minutes hundreds of parameters, adding up to
many TBytes."
This would add up to gigabytes of generated data per machine within a
short time. It seems to me that printk isn't really suited to reporting
such large amounts of raw data.
On Sat, Nov 20, 2010 at 4:42 PM, huang ying
<[email protected]> wrote:
>
> I don't want to hide the information from the MIS people with the
> tool. I want to show the information to MIS people in a better way.
You really don't understand, do you?
People won't even _know_ about your tool. It's too f*cking
specialized. They'll have come from other Unixes, they'll have come
from older Linux versions, they don't know, they don't care.
They _do_ know about system logs.
The most common kind of "system admin" is the random end-user. Now,
admittedly Intel seems to have its head up its arse on the whole
"regular people care about ECC and random memory corruption", and it
may be that consumer chips simply won't support the whole magic error
handling code, but the point remains: we don't want yet another
obscure error reporting tool that almost nobody knows about.
Especially for errors that are so rare that you'll never notice if you
are missing them.
Linus
On Sun, Nov 21, 2010 at 8:50 AM, Linus Torvalds
<[email protected]> wrote:
> On Sat, Nov 20, 2010 at 4:42 PM, huang ying
> <[email protected]> wrote:
>>
>> I don't want to hide the information from the MIS people with the
>> tool. I want to show the information to MIS people in a better way.
>
> You really don't understand, do you?
I mean the tool can cook the raw error information from the kernel and
report it in a better way. Yes, you are right that the user-space error
daemon is not popular now. But every tool has its beginning, doesn't it?
I know it is impossible for this tool to become popular among desktop
users, because hardware errors are really rare for them. But it may
become popular with server farm administrators, for whom hardware
errors are common and who really care about RAS.
> People won't even _know_ about your tool. It's too f*cking
> specialized. They'll have come from other Unixes, they'll have come
> from older Linux versions, they don't know, they don't care.
>
> They _do_ know about system logs.
I have no objection to reporting hardware errors in system logs too,
so these people can get the information as well. I just want to add
another, tool-oriented interface, so that some other users (like
cluster administrators) can get their work done better too.
> The most common kind of "system admin" is the random end-user. Now,
> admittedly Intel seems to have its head up its arse on the whole
> "regular people care about ECC and random memory corruption", and it
> may be that consumer chips simply won't support the whole magic error
> handling code, but the point remains: we don't want yet another
> obscure error reporting tool that almost nobody knows about.
> Especially for errors that are so rare that you'll never notice if you
> are missing them.
For desktop users, that is true. But for cluster administrators,
hardware errors are really common. An engineer at a local search-engine
vendor told me that they have broken DIMMs every day.
Best Regards,
Huang Ying
On 10-11-20 08:06 PM, huang ying wrote:
>
> I have no objection to reporting hardware errors in system logs too,
> so these people can get the information as well. I just want to add
> another, tool-oriented interface, so that some other users (like
> cluster administrators) can get their work done better too.
So, use the standard interface for the tool: syslog.
Have this obscure new tool simply parse the log messages,
and then send/save the data whatever way you like.
No new specialized kernel API required.
On Tue, 2010-11-23 at 07:43 +0800, Mark Lord wrote:
> On 10-11-20 08:06 PM, huang ying wrote:
> >
> > I have no objection to reporting hardware errors in system logs too,
> > so these people can get the information as well. I just want to add
> > another, tool-oriented interface, so that some other users (like
> > cluster administrators) can get their work done better too.
>
> So, use the standard interface for the tool: syslog.
Although it may be possible to extract some information from syslog and
parse it in a fault-tolerant way, can we only use that human-oriented
interface for a tool? That sounds like a hack.
> Have this obscure new tool simply parse the log messages,
> and then send/save the data whatever way you like.
>
> No new specialized kernel API required.
Would adding a new device file be seen as a new kernel API?
Best Regards,
Huang Ying
On 10-11-22 07:29 PM, Huang Ying wrote:
> On Tue, 2010-11-23 at 07:43 +0800, Mark Lord wrote:
>> On 10-11-20 08:06 PM, huang ying wrote:
>>>
>>> I have no objection to reporting hardware errors in system logs too,
>>> so these people can get the information as well. I just want to add
>>> another, tool-oriented interface, so that some other users (like
>>> cluster administrators) can get their work done better too.
>>
>> So, use the standard interface for the tool: syslog.
>
> Although it may be possible to extract some information from syslog and
> parse it in a fault-tolerant way, can we only use that human-oriented
> interface for a tool? That sounds like a hack.
No, that sounds like the *NIX programming philosophy.
You may have already noticed that most *NIX tools store
and manage data in _text_ form. That makes it easy to
understand, easy to parse/process, and generally better
in almost every respect.
Other platforms (GNOME, MS-Windows) prefer a binary format
that requires special tools to view/access. Ugh.
Cheers
> If you (and the code) are absolutely certain that a particular error instance
> is totally due to the BIOS, then stick the words "BIOS ERROR" into the
> printk().
FYI, trenn added some printk prefixes for this case a while back:
#define FW_BUG "[Firmware Bug]: "
#define FW_WARN "[Firmware Warn]: "
#define FW_INFO "[Firmware Info]: "
cheers,
-Len Brown, Intel Open Source Technology Center
> ... most *NIX tools store and manage data in _text_ form.
If the hardware error dump is complicated there is a trade-off
between making things human readable and putting a lot of
complicated parsing code into the kernel. Maybe the kernel
should just dump hex "text" in some cases and let a user program
parse the syslog?
What to do if the hardware error log is very large?
Is there a limit on how much is practical to send through syslog?
thanks,
-Len Brown, Intel Open Source Technology Center
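The hex-"text" idea could work roughly like this on the user side: the
kernel prints the raw record as hex behind an agreed tag, and a trivial
user program recovers and unpacks the bytes. A sketch (the tag and
record layout are invented for illustration):

```python
import binascii
import struct

# Invented example: a log line carrying a raw error record as hex text.
line = "kernel: HWERR RAW: 0200000034120000"

def decode_raw_record(log_line, tag="HWERR RAW: "):
    """Recover the raw bytes and unpack a (made-up) two-field record."""
    i = log_line.find(tag)
    if i < 0:
        return None
    raw = binascii.unhexlify(log_line[i + len(tag):].strip())
    # Hypothetical layout: u32 error type, u32 bank (little-endian).
    err_type, bank = struct.unpack("<II", raw)
    return {"type": err_type, "bank": bank}

print(decode_raw_record(line))
```

The kernel stays free of parsing code, while the user program can grow
as complicated a decoder as the record format requires.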
On Thu, 2010-11-25 at 10:41 +0800, Mark Lord wrote:
> On 10-11-22 07:29 PM, Huang Ying wrote:
> > On Tue, 2010-11-23 at 07:43 +0800, Mark Lord wrote:
> >> On 10-11-20 08:06 PM, huang ying wrote:
> >>>
> >>> I have no objection to reporting hardware errors in system logs too,
> >>> so these people can get the information as well. I just want to add
> >>> another, tool-oriented interface, so that some other users (like
> >>> cluster administrators) can get their work done better too.
> >>
> >> So, use the standard interface for the tool: syslog.
> >
> > Although it may be possible to extract some information from syslog and
> > parse it in a fault-tolerant way, can we only use that human-oriented
> > interface for a tool? That sounds like a hack.
>
> No, that sounds like the *NIX programming philosophy.
>
> You may have already noticed that most *NIX tools store
> and manage data in _text_ form. That makes it easy to
> understand, easy to parse/process, and generally better
> in almost every respect.
I have no objection to a text-form interface. I said printk is not a
tool-oriented interface not because it is a text-form interface but
because of some other issues. For example, messages from different
CPUs/contexts may be interleaved; all kinds of information are mixed
together, without an overall format or meta-data; etc.
> Other platforms (GNOME, MS-Windows) prefer a binary format
> that requires special tools to view/access. Ugh.
The Linux kernel uses binary-format interfaces too.
Best Regards,
Huang Ying
On 25-11-2010 02:27, Len Brown wrote:
>> ... most *NIX tools store and manage data in _text_ form.
>
> If the hardware error dump is complicated there is a trade-off
> between making things human readable and putting a lot of
> complicated parsing code into the kernel. Maybe the kernel
> should just dump hex "text" in some cases and let a user-program
> parse the syslog?
>
> What to do if the hardware error log is very large?
> Is there a limit on how much is practical to send through syslog?
If you look at what sysadmins do with Unix logs, you'll see that they
use one of the following approaches:
1) have something looking at syslog (and/or serial console logs), and
storing the messages for their analysis, in text format;
2) convert syslog errors into SNMP object OIDs, in machine-readable
form, in order to manage them via some SNMP management system.
In both cases, the approach has been there for a long time.
If an error "magic" code is added, both ways will break: sysadmins won't
be able to understand the meaning of the magic number, and the SNMP
conversion tools won't be ready to convert that magic code into
something else.
Of course, with time, the SNMP parsers will eventually add the decoders
needed to convert the magic numbers into a MIB representation.
So, even though it is a number, such a code is not machine readable (at
least not for the right tools): it is not an SNMP object, so the
management systems won't catch it without a parser.
So, IMO, the best option is to keep providing a text message.
We might think about adding a way to directly output an SNMP OID from
the kernel, but this seems overkill to me, and anything else would just
be meaningless for most sysadmins.
Thanks,
Mauro.