Subject: Re: [PATCH 1/2] Generic hardware error reporting mechanism
From: Huang Ying
To: Greg KH
Cc: "linux-kernel@vger.kernel.org", Andi Kleen, "linux-acpi@vger.kernel.org", Peter Zijlstra, Andrew Morton, Linus Torvalds, Ingo Molnar, Mauro Carvalho Chehab, Borislav Petkov, Thomas Gleixner, Len Brown
In-Reply-To: <1290154233-28695-2-git-send-email-ying.huang@intel.com>
References: <1290154233-28695-1-git-send-email-ying.huang@intel.com> <1290154233-28695-2-git-send-email-ying.huang@intel.com>
Date: Fri, 19 Nov 2010 16:45:11 +0800
Message-ID: <1290156311.2903.84.camel@yhuang-dev>

Sorry, I forgot to Cc: Greg for the device model part.

Best Regards,
Huang Ying

On Fri, 2010-11-19 at 16:10 +0800, Huang, Ying wrote:
> There are many hardware error detecting and reporting components in
> the kernel, including x86 Machine Check, PCIe AER, EDAC, APEI GHES,
> etc. Each one has its own error reporting implementation, including
> user space interface, error record format, in-kernel buffer, etc. This
> patch provides a generic hardware error reporting mechanism to reduce
> the duplicated effort and add more common services.
>
>
> A highly extensible generic hardware error record data structure is
> defined to accommodate various hardware error information from various
> hardware error sources.
> The overall structure of an error record is as follows:
>
> -----------------------------------------------------------------
> | rcd hdr | sec 1 hdr | sec 1 data | sec 2 hdr | sec 2 data | ...
> -----------------------------------------------------------------
>
> Several error sections can be incorporated into one error record to
> accumulate information from multiple hardware components related to
> one error. For example, for an error on a device on the secondary
> side of a PCIe bridge, it is useful to record error information from
> both the PCIe bridge and the PCIe device. Multiple sections can also
> be used to hold both the cooked and the raw error information, so
> that the abstract information is provided by the cooked section and
> no information is lost because the raw section is provided too.
>
> There are "revision" (rev) and "length" fields in the record header
> and "type" and "length" fields in the section header, so the user
> space error daemon can skip unrecognized error records or error
> sections. This lets an older error daemon work with a newer kernel.
>
> New error section types can be added to support new error types and
> error sources.
>
>
> The hardware error reporting mechanism designed by the patch
> integrates well with the device model in the kernel. struct
> dev_herr_info is defined and pointed to by the "error" field of
> struct device. This is used to hold error reporting related
> information for each device. One sysfs directory "error" will be
> created for each hardware error reporting device. Some files for
> error reporting statistics and control are created in the sysfs
> "error" directory. For example, the "error" directory for APEI GHES
> is as follows.
>
> /sys/devices/platform/GHES.0/error/logs
> /sys/devices/platform/GHES.0/error/overflows
> /sys/devices/platform/GHES.0/error/throttles
>
> Where "logs" is the number of error records logged; "throttles" is
> the number of error records not logged because the reporting rate is
> too high; "overflows" is the number of error records not logged
> because there is no space available.
>
> Not all devices will report errors, so struct dev_herr_info and the
> sysfs directory/files are only allocated/created for devices that
> explicitly enable error reporting. So to enumerate the error sources
> of the system, you just need to enumerate the "error" directory for
> each device directory in /sys/devices.
>
>
> One device file (/dev/error/error), which mixes error records from all
> hardware error reporting devices, is created to convey error records
> from kernel space to user space. Because hardware devices are dealt
> with, a device file is the most natural way to do that. Because
> hardware error reporting should not hurt system performance, the
> throughput of the interface should be kept low (this is done by the
> user space error daemon), so ordinary "read" is sufficient from a
> performance point of view.
>
>
> The patch provides common services for hardware error reporting
> devices too.
>
> A lock-less hardware error record allocator is provided. So for
> hardware errors that can be ignored (such as corrected errors), there
> is no need to pre-allocate the error record or allocate the error
> record on the stack. Because the probability of two hardware parts
> failing simultaneously is very small, one big unified memory pool for
> hardware errors is better than one memory pool or buffer for each
> device.
>
> After filling in all necessary fields in the hardware error record,
> error reporting is quite straightforward: just call
> herr_record_report with the error record itself and the
> corresponding struct device as parameters.
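For illustration only (this is not part of the patch): a minimal user-space sketch of how an error daemon might walk a buffer obtained by read() from /dev/error/error, using the "length", "rev" and "type" fields to skip what it does not recognize. The two struct layouts and the constants are copied from include/linux/herror_record.h in the patch below. The function names herr_count_sections and herr_build_example are hypothetical, and treating every rev other than HERR_RCD_REV1_0 as unrecognized is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layouts and constants copied from include/linux/herror_record.h. */
struct herr_record {
	uint16_t length;	/* whole record, header included */
	uint16_t flags;
	uint16_t rev;
	uint8_t severity;
	uint8_t pad1;
	uint64_t id;
	uint64_t timestamp;
};

struct herr_section {
	uint16_t length;	/* whole section, header included */
	uint16_t flags;
	uint32_t type;
};

#define HERR_RCD_REV1_0		0x0100
#define HERR_TYPE_PCIE_AER	0x0200

/*
 * Hypothetical helper: count the sections of type "want" in buf[0..len).
 * Records with an unknown rev and sections with another type are skipped
 * via their length fields. Returns -1 if a length field is inconsistent.
 */
int herr_count_sections(const unsigned char *buf, size_t len, uint32_t want)
{
	size_t off = 0;
	int found = 0;

	while (off < len) {
		struct herr_record rcd;
		size_t sec_off = sizeof(rcd);

		if (len - off < sizeof(rcd))
			return -1;
		memcpy(&rcd, buf + off, sizeof(rcd)); /* no unaligned access */
		if (rcd.length < sizeof(rcd) || rcd.length > len - off)
			return -1;
		/* Assumption: any other rev is skipped as a whole record. */
		while (rcd.rev == HERR_RCD_REV1_0 && sec_off < rcd.length) {
			struct herr_section sec;

			if (rcd.length - sec_off < sizeof(sec))
				return -1;
			memcpy(&sec, buf + off + sec_off, sizeof(sec));
			if (sec.length < sizeof(sec) ||
			    sec.length > rcd.length - sec_off)
				return -1;
			if (sec.type == want)
				found++;
			sec_off += sec.length;
		}
		off += rcd.length;
	}
	return found;
}

/*
 * Hypothetical helper for trying the walker: builds one record holding a
 * PCIe AER section and a section of a made-up type, 8 payload bytes each.
 * Returns the number of bytes written, or 0 if "cap" is too small.
 */
size_t herr_build_example(unsigned char *buf, size_t cap)
{
	struct herr_record rcd = { .rev = HERR_RCD_REV1_0 };
	struct herr_section aer = { .type = HERR_TYPE_PCIE_AER };
	struct herr_section unk = { .type = 0xffff };
	size_t sec_len = sizeof(struct herr_section) + 8;
	size_t total = sizeof(rcd) + 2 * sec_len;

	if (cap < total)
		return 0;
	rcd.length = (uint16_t)total;
	aer.length = (uint16_t)sec_len;
	unk.length = (uint16_t)sec_len;
	memset(buf, 0, total);
	memcpy(buf, &rcd, sizeof(rcd));
	memcpy(buf + sizeof(rcd), &aer, sizeof(aer));
	memcpy(buf + sizeof(rcd) + sec_len, &unk, sizeof(unk));
	return total;
}
```

A daemon would call herr_count_sections (or a variant that dispatches on the section type) on whatever read() returned; the example builder only exists to exercise the walker without the device file.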
>
> Hardware errors may come in bursts: for example, the same hardware
> error may be reported at a high rate within a short interval. This
> will use up all pre-allocated memory for error reporting, so that
> other hardware errors coming from the same or a different hardware
> device cannot be logged. To deal with this issue, a throttle
> algorithm is implemented. The logging rate for errors coming from one
> hardware error device is throttled based on the available
> pre-allocated memory for error reporting. In this way we can log as
> many kinds of errors as possible coming from as many devices as
> possible.
>
>
> This patch is designed by Andi Kleen and Huang Ying.
>
> Signed-off-by: Huang Ying
> Reviewed-by: Andi Kleen
> ---
>  drivers/Kconfig               |    2
>  drivers/Makefile              |    1
>  drivers/base/Makefile         |    1
>  drivers/base/herror.c         |   98 ++++++++
>  drivers/herror/Kconfig        |    5
>  drivers/herror/Makefile       |    1
>  drivers/herror/herr-core.c    |  488 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/Kbuild          |    1
>  include/linux/device.h        |   14 +
>  include/linux/herror.h        |   35 +++
>  include/linux/herror_record.h |  100 ++++++++
>  kernel/Makefile               |    1
>  12 files changed, 747 insertions(+)
>  create mode 100644 drivers/base/herror.c
>  create mode 100644 drivers/herror/Kconfig
>  create mode 100644 drivers/herror/Makefile
>  create mode 100644 drivers/herror/herr-core.c
>  create mode 100644 include/linux/herror.h
>  create mode 100644 include/linux/herror_record.h
>
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
>  source "drivers/staging/Kconfig"
>
>  source "drivers/platform/Kconfig"
> +
> +source "drivers/herror/Kconfig"
>  endmenu
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -115,3 +115,4 @@ obj-$(CONFIG_VLYNQ) += vlynq/
>  obj-$(CONFIG_STAGING) += staging/
>  obj-y += platform/
>  obj-y += ieee802154/
> +obj-$(CONFIG_HERR_CORE) += herror/
> --- a/drivers/base/Makefile
> +++ b/drivers/base/Makefile
> @@ -18,6 +18,7 @@ ifeq ($(CONFIG_SYSFS),y)
>  obj-$(CONFIG_MODULES) += module.o
>  endif
>  obj-$(CONFIG_SYS_HYPERVISOR) += hypervisor.o
> +obj-$(CONFIG_HERR_CORE) += herror.o
>
>  ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
>
> --- /dev/null
> +++ b/drivers/base/herror.c
> @@ -0,0 +1,98 @@
> +/*
> + * Hardware error reporting related functions
> + *
> + * Copyright 2010 Intel Corp.
> + * Author: Huang Ying
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation;
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include
> +#include
> +#include
> +
> +#define HERR_COUNTER_ATTR(_name)	\
> +	static ssize_t herr_##_name##_show(struct device *dev,	\
> +					   struct device_attribute *attr, \
> +					   char *buf)	\
> +	{	\
> +		int counter;	\
> +	\
> +		counter = atomic_read(&dev->error->_name);	\
> +		return sprintf(buf, "%d\n", counter);	\
> +	}	\
> +	static ssize_t herr_##_name##_store(struct device *dev,	\
> +					    struct device_attribute *attr, \
> +					    const char *buf,	\
> +					    size_t count)	\
> +	{	\
> +		atomic_set(&dev->error->_name, 0);	\
> +		return count;	\
> +	}	\
> +	static struct device_attribute herr_attr_##_name =	\
> +		__ATTR(_name, 0600, herr_##_name##_show,	\
> +		       herr_##_name##_store)
> +
> +HERR_COUNTER_ATTR(logs);
> +HERR_COUNTER_ATTR(overflows);
> +HERR_COUNTER_ATTR(throttles);
> +
> +static struct attribute *herr_attrs[] = {
> +	&herr_attr_logs.attr,
> +	&herr_attr_overflows.attr,
> +	&herr_attr_throttles.attr,
> +	NULL,
> +};
> +
> +static
struct attribute_group herr_attr_group = {
> +	.name = "error",
> +	.attrs = herr_attrs,
> +};
> +
> +static void device_herr_init(struct device *dev)
> +{
> +	atomic_set(&dev->error->logs, 0);
> +	atomic_set(&dev->error->overflows, 0);
> +	atomic_set(&dev->error->throttles, 0);
> +	atomic64_set(&dev->error->timestamp, 0);
> +}
> +
> +int device_enable_error_reporting(struct device *dev)
> +{
> +	int rc;
> +
> +	BUG_ON(dev->error);
> +	dev->error = kzalloc(sizeof(*dev->error), GFP_KERNEL);
> +	if (!dev->error)
> +		return -ENOMEM;
> +	device_herr_init(dev);
> +	rc = sysfs_create_group(&dev->kobj, &herr_attr_group);
> +	if (rc)
> +		goto err;
> +	return 0;
> +err:
> +	kfree(dev->error);
> +	dev->error = NULL;
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(device_enable_error_reporting);
> +
> +void device_disable_error_reporting(struct device *dev)
> +{
> +	if (dev->error) {
> +		sysfs_remove_group(&dev->kobj, &herr_attr_group);
> +		kfree(dev->error);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(device_disable_error_reporting);
> --- /dev/null
> +++ b/drivers/herror/Kconfig
> @@ -0,0 +1,5 @@
> +config HERR_CORE
> +	bool "Hardware error reporting"
> +	depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
> +	select LLIST
> +	select GENERIC_ALLOCATOR
> --- /dev/null
> +++ b/drivers/herror/Makefile
> @@ -0,0 +1 @@
> +obj-y += herr-core.o
> --- /dev/null
> +++ b/drivers/herror/herr-core.c
> @@ -0,0 +1,488 @@
> +/*
> + * Generic hardware error reporting support
> + *
> + * This file provides some common services for hardware error
> + * reporting, including hardware error record lock-less allocator,
> + * error reporting mechanism, user space interface etc.
> + *
> + * Copyright 2010 Intel Corp.
> + * Author: Huang Ying
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation;
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#define HERR_NOTIFY_BIT	0
> +
> +static unsigned long herr_flags;
> +
> +/*
> + * Record list management and error reporting
> + */
> +
> +struct herr_node {
> +	struct llist_node llist;
> +	struct herr_record ercd __attribute__((aligned(HERR_MIN_ALIGN)));
> +};
> +
> +#define HERR_NODE_LEN(rcd_len)	\
> +	((rcd_len) + sizeof(struct herr_node) - sizeof(struct herr_record))
> +
> +#define HERR_MIN_ALLOC_ORDER	HERR_MIN_ALIGN_ORDER
> +#define HERR_CHUNKS_PER_CPU	2
> +#define HERR_RCD_LIST_NUM	2
> +
> +struct herr_rcd_lists {
> +	struct llist_head *write;
> +	struct llist_head *read;
> +	struct llist_head heads[HERR_RCD_LIST_NUM];
> +};
> +
> +static DEFINE_PER_CPU(struct herr_rcd_lists, herr_rcd_lists);
> +
> +static DEFINE_PER_CPU(struct gen_pool *, herr_gen_pool);
> +
> +static void herr_rcd_lists_init(void)
> +{
> +	int cpu, i;
> +	struct herr_rcd_lists *lists;
> +
> +	for_each_possible_cpu(cpu) {
> +		lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> +		for (i = 0; i < HERR_RCD_LIST_NUM; i++)
> +			init_llist_head(&lists->heads[i]);
> +		lists->write = &lists->heads[0];
> +		lists->read = &lists->heads[1];
> +	}
> +}
> +
> +static void herr_pool_fini(void)
> +{
> +	struct gen_pool *pool;
> +	struct gen_pool_chunk *chunk;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		pool = per_cpu(herr_gen_pool, cpu);
> +		gen_pool_for_each_chunk(chunk, pool)
> +			free_page(chunk->start_addr);
> +		gen_pool_destroy(pool);
> +	}
> +}
> +
> +static int herr_pool_init(void)
> +{
> +	struct gen_pool **pool;
> +	int cpu, rc, nid, i;
> +	unsigned long addr;
> +
> +	for_each_possible_cpu(cpu) {
> +		pool = per_cpu_ptr(&herr_gen_pool, cpu);
> +		rc = -ENOMEM;
> +		nid = cpu_to_node(cpu);
> +		*pool = gen_pool_create(HERR_MIN_ALLOC_ORDER, nid);
> +		if (!*pool)
> +			goto err_pool_fini;
> +		for (i = 0; i < HERR_CHUNKS_PER_CPU; i++) {
> +			rc = -ENOMEM;
> +			addr = __get_free_page(GFP_KERNEL);
> +			if (!addr)
> +				goto err_pool_fini;
> +			rc = gen_pool_add(*pool, addr, PAGE_SIZE, nid);
> +			if (rc)
> +				goto err_pool_fini;
> +		}
> +	}
> +
> +	return 0;
> +err_pool_fini:
> +	herr_pool_fini();
> +	return rc;
> +}
> +
> +/* Max interval: about 2 second */
> +#define HERR_THROTTLE_BASE_INTVL	NSEC_PER_USEC
> +#define HERR_THROTTLE_MAX_RATIO	21
> +#define HERR_THROTTLE_MAX_INTVL	\
> +	((1ULL << HERR_THROTTLE_MAX_RATIO) * HERR_THROTTLE_BASE_INTVL)
> +/*
> + * Pool size/used ratio considered spare, before this, interval
> + * between error reporting is ignored. After this, minimal interval
> + * needed is increased exponentially to max interval.
> + */
> +#define HERR_THROTTLE_SPARE_RATIO	3
> +
> +static int herr_throttle(struct device *dev)
> +{
> +	struct gen_pool *pool;
> +	unsigned long long last, now, min_intvl;
> +	unsigned int size, used, ratio;
> +
> +	pool = __get_cpu_var(herr_gen_pool);
> +	size = gen_pool_size(pool);
> +	used = size - gen_pool_avail(pool);
> +	if (HERR_THROTTLE_SPARE_RATIO * used < size)
> +		goto pass;
> +	now = trace_clock_local();
> +	last = atomic64_read(&dev->error->timestamp);
> +	ratio = (used * HERR_THROTTLE_SPARE_RATIO - size) * \
> +		HERR_THROTTLE_MAX_RATIO;
> +	ratio = ratio / (size * HERR_THROTTLE_SPARE_RATIO - size) + 1;
> +	min_intvl = (1ULL << ratio) * HERR_THROTTLE_BASE_INTVL;
> +	if ((long long)(now - last) > min_intvl)
> +		goto pass;
> +	atomic_inc(&dev->error->throttles);
> +	return 0;
> +pass:
> +	return 1;
> +}
> +
> +static u64 herr_record_next_id(void)
> +{
> +	static atomic64_t seq = ATOMIC64_INIT(0);
> +
> +	if (!atomic64_read(&seq))
> +		atomic64_set(&seq, (u64)get_seconds() << 32);
> +
> +	return atomic64_inc_return(&seq);
> +}
> +
> +void herr_record_init(struct herr_record *ercd)
> +{
> +	ercd->flags = 0;
> +	ercd->rev = HERR_RCD_REV1_0;
> +	ercd->id = herr_record_next_id();
> +	ercd->timestamp = trace_clock_local();
> +}
> +EXPORT_SYMBOL_GPL(herr_record_init);
> +
> +struct herr_record *herr_record_alloc(unsigned int len, struct device *dev,
> +				      unsigned int flags)
> +{
> +	struct gen_pool *pool;
> +	struct herr_node *enode;
> +	struct herr_record *ercd = NULL;
> +
> +	BUG_ON(!dev->error);
> +	preempt_disable();
> +	if (!(flags & HERR_ALLOC_NO_THROTTLE)) {
> +		if (!herr_throttle(dev)) {
> +			preempt_enable_no_resched();
> +			return NULL;
> +		}
> +	}
> +
> +	pool = __get_cpu_var(herr_gen_pool);
> +	enode = (struct herr_node *)gen_pool_alloc(pool, HERR_NODE_LEN(len));
> +	if (enode) {
> +		ercd = &enode->ercd;
> +		herr_record_init(ercd);
> +		ercd->length = len;
> +
> +		atomic64_set(&dev->error->timestamp, trace_clock_local());
> +		atomic_inc(&dev->error->logs);
> +	} else
> +		atomic_inc(&dev->error->overflows);
> +	preempt_enable_no_resched();
> +
> +	return ercd;
> +}
> +EXPORT_SYMBOL_GPL(herr_record_alloc);
> +
> +int herr_record_report(struct herr_record *ercd, struct device *dev)
> +{
> +	struct herr_rcd_lists *lists;
> +	struct herr_node *enode;
> +
> +	preempt_disable();
> +	lists = this_cpu_ptr(&herr_rcd_lists);
> +	enode = container_of(ercd, struct herr_node, ercd);
> +	llist_add(&enode->llist, lists->write);
> +	preempt_enable_no_resched();
> +
> +	set_bit(HERR_NOTIFY_BIT, &herr_flags);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(herr_record_report);
> +
> +void herr_record_free(struct herr_record *ercd)
> +{
> +	struct herr_node *enode;
> +	struct gen_pool *pool;
> +
> +	enode = container_of(ercd, struct herr_node, ercd);
> +	pool = get_cpu_var(herr_gen_pool);
> +	gen_pool_free(pool, (unsigned long)enode,
> +		      HERR_NODE_LEN(enode->ercd.length));
> +	put_cpu_var(pool);
> +}
> +EXPORT_SYMBOL_GPL(herr_record_free);
> +
> +/*
> + * The low 16 bit is freeze count, high 16 bit is thaw count. If they
> + * are not equal, someone is freezing the reader
> + */
> +static u32 herr_freeze_thaw;
> +
> +/*
> + * Stop the reader to consume error records, so that the error records
> + * can be checked in kernel space safely.
> + */
> +static void herr_freeze_reader(void)
> +{
> +	u32 old, new;
> +
> +	do {
> +		new = old = herr_freeze_thaw;
> +		new = ((new + 1) & 0xffff) | (old & 0xffff0000);
> +	} while (cmpxchg(&herr_freeze_thaw, old, new) != old);
> +}
> +
> +static void herr_thaw_reader(void)
> +{
> +	u32 old, new;
> +
> +	do {
> +		old = herr_freeze_thaw;
> +		new = old + 0x10000;
> +	} while (cmpxchg(&herr_freeze_thaw, old, new) != old);
> +}
> +
> +static int herr_reader_is_frozen(void)
> +{
> +	u32 freeze_thaw = herr_freeze_thaw;
> +	return (freeze_thaw & 0xffff) != (freeze_thaw >> 16);
> +}
> +
> +int herr_for_each_record(herr_traverse_func_t func, void *data)
> +{
> +	int i, cpu, rc = 0;
> +	struct herr_rcd_lists *lists;
> +	struct herr_node *enode;
> +
> +	preempt_disable();
> +	herr_freeze_reader();
> +	for_each_possible_cpu(cpu) {
> +		lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> +		for (i = 0; i < HERR_RCD_LIST_NUM; i++) {
> +			struct llist_head *head = &lists->heads[i];
> +			llist_for_each_entry(enode, head->first, llist) {
> +				rc = func(&enode->ercd, data);
> +				if (rc)
> +					goto out;
> +			}
> +		}
> +	}
> +out:
> +	herr_thaw_reader();
> +	preempt_enable_no_resched();
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(herr_for_each_record);
> +
> +static ssize_t herr_rcd_lists_read(char __user *ubuf, size_t usize,
> +				   struct mutex *read_mutex)
> +{
> +	int cpu, rc = 0, read;
> +	struct herr_rcd_lists *lists;
> +	struct gen_pool *pool;
> +	ssize_t len, rsize = 0;
> +	struct herr_node *enode;
> +	struct llist_head *old_read;
> +	struct llist_node *to_read;
> +
> +	do {
> +		read = 0;
> +		for_each_possible_cpu(cpu) {
> +			lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> +			pool = per_cpu(herr_gen_pool, cpu);
> +			if (llist_empty(lists->read)) {
> +				if (llist_empty(lists->write))
> +					continue;
> +				/*
> +				 * Error records are output in batch, so old
> +				 * error records can be output before new ones.
> +				 */
> +				old_read = lists->read;
> +				lists->read = lists->write;
> +				lists->write = old_read;
> +			}
> +			rc = rsize ? 0 : -EBUSY;
> +			if (herr_reader_is_frozen())
> +				goto out;
> +			to_read = llist_del_first(lists->read);
> +			if (herr_reader_is_frozen())
> +				goto out_readd;
> +			enode = llist_entry(to_read, struct herr_node, llist);
> +			len = enode->ercd.length;
> +			rc = rsize ? 0 : -EINVAL;
> +			if (len > usize - rsize)
> +				goto out_readd;
> +			rc = -EFAULT;
> +			if (copy_to_user(ubuf + rsize, &enode->ercd, len))
> +				goto out_readd;
> +			gen_pool_free(pool, (unsigned long)enode,
> +				      HERR_NODE_LEN(len));
> +			rsize += len;
> +			read = 1;
> +		}
> +		if (need_resched()) {
> +			mutex_unlock(read_mutex);
> +			cond_resched();
> +			mutex_lock(read_mutex);
> +		}
> +	} while (read);
> +	rc = 0;
> +out:
> +	return rc ? rc : rsize;
> +out_readd:
> +	llist_add(to_read, lists->read);
> +	goto out;
> +}
> +
> +static int herr_rcd_lists_is_empty(void)
> +{
> +	int cpu, i;
> +	struct herr_rcd_lists *lists;
> +
> +	for_each_possible_cpu(cpu) {
> +		lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> +		for (i = 0; i < HERR_RCD_LIST_NUM; i++) {
> +			if (!llist_empty(&lists->heads[i]))
> +				return 0;
> +		}
> +	}
> +	return 1;
> +}
> +
> +
> +/*
> + * Hardware Error Mix Reporting Device
> + */
> +
> +static int herr_major;
> +static DECLARE_WAIT_QUEUE_HEAD(herr_mix_wait);
> +
> +static char *herr_devnode(struct device *dev, mode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "error/%s", dev_name(dev));
> +}
> +
> +struct class herr_class = {
> +	.name = "error",
> +	.devnode = herr_devnode,
> +};
> +EXPORT_SYMBOL_GPL(herr_class);
> +
> +void herr_notify(void)
> +{
> +	if (test_and_clear_bit(HERR_NOTIFY_BIT, &herr_flags))
> +		wake_up_interruptible(&herr_mix_wait);
> +}
> +EXPORT_SYMBOL_GPL(herr_notify);
> +
> +static ssize_t herr_mix_read(struct file *filp, char __user *ubuf,
> +			     size_t usize, loff_t *off)
> +{
> +	int rc;
> +	static DEFINE_MUTEX(read_mutex);
> +
> +	if (*off != 0)
> +		return -EINVAL;
> +
> +	rc = mutex_lock_interruptible(&read_mutex);
> +	if (rc)
> +		return rc;
> +	rc = herr_rcd_lists_read(ubuf, usize, &read_mutex);
> +	mutex_unlock(&read_mutex);
> +
> +	return rc;
> +}
> +
> +static unsigned int herr_mix_poll(struct file *file, poll_table *wait)
> +{
> +	poll_wait(file, &herr_mix_wait, wait);
> +	if (!herr_rcd_lists_is_empty())
> +		return POLLIN | POLLRDNORM;
> +	return 0;
> +}
> +
> +static const struct file_operations herr_mix_dev_fops = {
> +	.owner = THIS_MODULE,
> +	.read = herr_mix_read,
> +	.poll = herr_mix_poll,
> +};
> +
> +static int __init herr_mix_dev_init(void)
> +{
> +	struct device *dev;
> +	dev_t devt;
> +
> +	devt = MKDEV(herr_major, 0);
> +	dev = device_create(&herr_class, NULL, devt, NULL, "error");
> +	if (IS_ERR(dev))
> +		return PTR_ERR(dev);
> +
> +	return 0;
> +}
> +device_initcall(herr_mix_dev_init);
> +
> +static int __init herr_core_init(void)
> +{
> +	int rc;
> +
> +	BUILD_BUG_ON(sizeof(struct herr_node) % HERR_MIN_ALIGN);
> +	BUILD_BUG_ON(sizeof(struct herr_record) % HERR_MIN_ALIGN);
> +	BUILD_BUG_ON(sizeof(struct herr_section) % HERR_MIN_ALIGN);
> +
> +	herr_rcd_lists_init();
> +
> +	rc = herr_pool_init();
> +	if (rc)
> +		goto err;
> +
> +	rc = class_register(&herr_class);
> +	if (rc)
> +		goto err_free_pool;
> +
> +	rc = herr_major = register_chrdev(0, "error", &herr_mix_dev_fops);
> +	if (rc < 0)
> +		goto err_free_class;
> +
> +	return 0;
> +err_free_class:
> +	class_unregister(&herr_class);
> +err_free_pool:
> +	herr_pool_fini();
> +err:
> +	return rc;
> +}
> +/* Initialize data structure used by device driver, so subsys_initcall */
> +subsys_initcall(herr_core_init);
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -141,6 +141,7 @@ header-y += gigaset_dev.h
>  header-y += hdlc.h
>  header-y += hdlcdrv.h
>  header-y += hdreg.h
> +header-y += herror_record.h
>  header-y += hid.h
>  header-y += hiddev.h
>  header-y += hidraw.h
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -394,6 +394,14 @@
extern int devres_release_group(struct d
>  extern void *devm_kzalloc(struct device *dev, size_t size, gfp_t gfp);
>  extern void devm_kfree(struct device *dev, void *p);
>
> +/* Device hardware error reporting related information */
> +struct dev_herr_info {
> +	atomic_t logs;
> +	atomic_t overflows;
> +	atomic_t throttles;
> +	atomic64_t timestamp;
> +};
> +
>  struct device_dma_parameters {
>  	/*
>  	 * a low level driver may set these to teach IOMMU code about
> @@ -422,6 +430,9 @@ struct device {
>  	void *platform_data;	/* Platform specific data, device
>  				   core doesn't touch it */
>  	struct dev_pm_info power;
> +#ifdef CONFIG_HERR_CORE
> +	struct dev_herr_info *error;	/* Hardware error reporting info */
> +#endif
>
>  #ifdef CONFIG_NUMA
>  	int numa_node;	/* NUMA node this device is close to */
> @@ -523,6 +534,9 @@ static inline bool device_async_suspend_
>  	return !!dev->power.async_suspend;
>  }
>
> +extern int device_enable_error_reporting(struct device *dev);
> +extern void device_disable_error_reporting(struct device *dev);
> +
>  static inline void device_lock(struct device *dev)
>  {
>  	mutex_lock(&dev->mutex);
> --- /dev/null
> +++ b/include/linux/herror.h
> @@ -0,0 +1,35 @@
> +#ifndef LINUX_HERROR_H
> +#define LINUX_HERROR_H
> +
> +#include
> +#include
> +#include
> +#include
> +
> +/*
> + * Hardware error reporting
> + */
> +
> +#define HERR_ALLOC_NO_THROTTLE	0x0001
> +
> +struct herr_dev;
> +
> +/* allocate a herr_record lock-lessly */
> +struct herr_record *herr_record_alloc(unsigned int len,
> +				      struct device *dev,
> +				      unsigned int flags);
> +void herr_record_init(struct herr_record *ercd);
> +/* report error */
> +int herr_record_report(struct herr_record *ercd, struct device *dev);
> +/* free the herr_record allocated before */
> +void herr_record_free(struct herr_record *ercd);
> +/*
> + * Notify waited user space hardware error daemon for the new error
> + * record, can not be used in NMI context
> + */
> +void herr_notify(void);
> +
> +/* Traverse all error
records not consumed by user space */
> +typedef int (*herr_traverse_func_t)(struct herr_record *ercd, void *data);
> +int herr_for_each_record(herr_traverse_func_t func, void *data);
> +#endif
> --- /dev/null
> +++ b/include/linux/herror_record.h
> @@ -0,0 +1,100 @@
> +#ifndef LINUX_HERROR_RECORD_H
> +#define LINUX_HERROR_RECORD_H
> +
> +#include
> +
> +/*
> + * Hardware Error Record Definition
> + */
> +enum herr_severity {
> +	HERR_SEV_NONE,
> +	HERR_SEV_CORRECTED,
> +	HERR_SEV_RECOVERABLE,
> +	HERR_SEV_FATAL,
> +};
> +
> +#define HERR_RCD_REV1_0	0x0100
> +#define HERR_MIN_ALIGN_ORDER	3
> +#define HERR_MIN_ALIGN	(1 << HERR_MIN_ALIGN_ORDER)
> +
> +enum herr_record_flags {
> +	HERR_RCD_PREV = 0x0001,	/* record is for previous boot */
> +	HERR_RCD_PERSIST = 0x0002,	/* record is from flash, need to be
> +					 * cleared after writing to disk */
> +};
> +
> +/*
> + * sizeof(struct herr_record) and sizeof(struct herr_section) should
> + * be multiple of HERR_MIN_ALIGN to make error record packing easier.
> + */
> +struct herr_record {
> +	__u16 length;
> +	__u16 flags;
> +	__u16 rev;
> +	__u8 severity;
> +	__u8 pad1;
> +	__u64 id;
> +	__u64 timestamp;
> +	__u8 data[0];
> +};
> +
> +/* Section type ID are allocated here */
> +enum herr_section_type_id {
> +	/* 0x0 - 0xff are reserved by core */
> +	/* 0x100 - 0x1ff are allocated to CPER */
> +	HERR_TYPE_CPER = 0x0100,
> +	HERR_TYPE_GESR = 0x0110,	/* acpi_hest_generic_status */
> +	/* 0x200 - 0x2ff are allocated to PCI/PCIe subsystem */
> +	HERR_TYPE_PCIE_AER = 0x0200,
> +};
> +
> +struct herr_section {
> +	__u16 length;
> +	__u16 flags;
> +	__u32 type;
> +	__u8 data[0];
> +};
> +
> +#define herr_record_for_each_section(ercd, esec)	\
> +	for ((esec) = (struct herr_section *)(ercd)->data;	\
> +	     (void *)(esec) - (void *)(ercd) < (ercd)->length;	\
> +	     (esec) = (void *)(esec) + (esec)->length)
> +
> +#define HERR_SEC_LEN_ROUND(len)	\
> +	(((len) + HERR_MIN_ALIGN - 1) & ~(HERR_MIN_ALIGN - 1))
> +#define HERR_SEC_LEN(type)	\
> +	(sizeof(struct herr_section) + HERR_SEC_LEN_ROUND(sizeof(type)))
> +
> +#define HERR_RECORD_LEN_ROUND1(sec_len1)	\
> +	(sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1))
> +#define HERR_RECORD_LEN_ROUND2(sec_len1, sec_len2)	\
> +	(sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1) +	\
> +	 HERR_SEC_LEN_ROUND(sec_len2))
> +#define HERR_RECORD_LEN_ROUND3(sec_len1, sec_len2, sec_len3)	\
> +	(sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1) +	\
> +	 HERR_SEC_LEN_ROUND(sec_len2) + HERR_SEC_LEN_ROUND(sec_len3))
> +
> +#define HERR_RECORD_LEN1(sec_type1)	\
> +	(sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1))
> +#define HERR_RECORD_LEN2(sec_type1, sec_type2)	\
> +	(sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1) +	\
> +	 HERR_SEC_LEN(sec_type2))
> +#define HERR_RECORD_LEN3(sec_type1, sec_type2, sec_type3)	\
> +	(sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1) +	\
> +	 HERR_SEC_LEN(sec_type2) + HERR_SEC_LEN(sec_type3))
> +
> +static inline struct herr_section
*herr_first_sec(struct herr_record *ercd)
> +{
> +	return (struct herr_section *)(ercd + 1);
> +}
> +
> +static inline struct herr_section *herr_next_sec(struct herr_section *esrc)
> +{
> +	return (void *)esrc + esrc->length;
> +}
> +
> +static inline void *herr_sec_data(struct herr_section *esec)
> +{
> +	return (void *)(esec + 1);
> +}
> +#endif
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -100,6 +100,7 @@ obj-$(CONFIG_FUNCTION_TRACER) += trace/
>  obj-$(CONFIG_TRACING) += trace/
>  obj-$(CONFIG_X86_DS) += trace/
>  obj-$(CONFIG_RING_BUFFER) += trace/
> +obj-$(CONFIG_HERR_CORE) += trace/
>  obj-$(CONFIG_SMP) += sched_cpupri.o
>  obj-$(CONFIG_IRQ_WORK) += irq_work.o
>  obj-$(CONFIG_PERF_EVENTS) += perf_event.o
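
As a worked example of the throttle arithmetic in herr_throttle() (a stand-alone user-space sketch, not kernel code): the function below repeats the integer computation of the minimum interval from the pool size and the bytes in use. The name herr_min_interval_ns is hypothetical, NSEC_PER_USEC is written out as 1000, and the pool figures used in the note afterwards assume 4096-byte pages (HERR_CHUNKS_PER_CPU = 2 pages per CPU, so 8192 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from drivers/herror/herr-core.c in the patch. */
#define HERR_THROTTLE_BASE_INTVL	1000ULL	/* NSEC_PER_USEC */
#define HERR_THROTTLE_MAX_RATIO		21
#define HERR_THROTTLE_SPARE_RATIO	3

/*
 * Hypothetical helper: minimum interval in ns that must have passed since
 * the last logged error of a device before the next one is accepted.
 * Returns 0 while less than a third of the pool is used (no throttling).
 */
uint64_t herr_min_interval_ns(unsigned int size, unsigned int used)
{
	unsigned int ratio;

	assert(size > 0 && used <= size);
	if (HERR_THROTTLE_SPARE_RATIO * used < size)
		return 0;
	/* Same integer arithmetic as herr_throttle(), division truncates. */
	ratio = (used * HERR_THROTTLE_SPARE_RATIO - size) *
		HERR_THROTTLE_MAX_RATIO;
	ratio = ratio / (size * HERR_THROTTLE_SPARE_RATIO - size) + 1;
	return (1ULL << ratio) * HERR_THROTTLE_BASE_INTVL;
}
```

With an 8192-byte pool, 2048 bytes in use gives no throttling, 4096 bytes in use gives ratio 6 and a 64 us minimum interval, and a completely full pool gives ratio 22, that is 2^22 us or roughly 4.2 seconds. The last value is larger than the "about 2 second" named in the comment above HERR_THROTTLE_MAX_INTVL, because of the "+ 1" applied to the ratio.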