2011-05-17 08:08:47

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 0/9] ACPI, APEI patches for 2.6.40

[PATCH 1/9] Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG
[PATCH 2/9] lib, Add lock-less NULL terminated single list
[PATCH 3/9] lib, Make gen_pool memory allocator lockless
[PATCH 4/9] ACPI, APEI, GHES, printk support for recoverable error via NMI
[PATCH 5/9] HWPoison: add memory_failure_queue()
[PATCH 6/9] ACPI, APEI, GHES: Add hardware memory error recovery support
[PATCH 7/9] PCIe, AER, add aer_recover_queue
[PATCH 8/9] ACPI, APEI, GHES: Add PCIe AER recovery support
[PATCH 9/9] ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled


2011-05-17 08:09:03

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 1/9] Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG

cmpxchg() is widely used by lockless code, including NMI-safe lockless
code. But on some architectures, the cmpxchg() implementation is not
NMI-safe, on these architectures the lockless code may need to a
spin_trylock_irqsave() based implementation.

This patch adds a Kconfig option: ARCH_HAVE_NMI_SAFE_CMPXCHG, so that
NMI-safe lockless code can depend on it or provide different
implementation according to it.

On many architectures, cmpxchg is only NMI-safe for several specific
operand sizes. So, ARCH_HAVE_NMI_SAFE_CMPXCHG define in this patch
only guarantees cmpxchg is NMI-safe for sizeof(unsigned long).

Signed-off-by: Huang Ying <[email protected]>
Acked-by: Mike Frysinger <[email protected]>
Acked-by: Paul Mundt <[email protected]>
Acked-by: Hans-Christian Egtvedt <[email protected]>
Acked-by: Benjamin Herrenschmidt <[email protected]>
Acked-by: Chris Metcalf <[email protected]>
CC: Richard Henderson <[email protected]>
CC: Mikael Starvik <[email protected]>
CC: David Howells <[email protected]>
CC: Yoshinori Sato <[email protected]>
CC: Tony Luck <[email protected]>
CC: Hirokazu Takata <[email protected]>
CC: Geert Uytterhoeven <[email protected]>
CC: Michal Simek <[email protected]>
CC: Ralf Baechle <[email protected]>
CC: Kyle McMartin <[email protected]>
CC: Martin Schwidefsky <[email protected]>
CC: Chen Liqin <[email protected]>
CC: "David S. Miller" <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Chris Zankel <[email protected]>
---
arch/Kconfig | 3 +++
arch/alpha/Kconfig | 1 +
arch/avr32/Kconfig | 1 +
arch/frv/Kconfig | 1 +
arch/ia64/Kconfig | 1 +
arch/m68k/Kconfig | 1 +
arch/parisc/Kconfig | 1 +
arch/powerpc/Kconfig | 1 +
arch/s390/Kconfig | 1 +
arch/sh/Kconfig | 1 +
arch/sparc/Kconfig | 1 +
arch/tile/Kconfig | 1 +
arch/x86/Kconfig | 1 +
13 files changed, 15 insertions(+)

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -178,4 +178,7 @@ config HAVE_ARCH_JUMP_LABEL
config HAVE_ARCH_MUTEX_CPU_RELAX
bool

+config ARCH_HAVE_NMI_SAFE_CMPXCHG
+ bool
+
source "kernel/gcov/Kconfig"
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -12,6 +12,7 @@ config ALPHA
select GENERIC_IRQ_PROBE
select AUTO_IRQ_AFFINITY if SMP
select GENERIC_IRQ_SHOW
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG
help
The Alpha is a 64-bit general-purpose processor designed and
marketed by the Digital Equipment Corporation of blessed memory,
--- a/arch/avr32/Kconfig
+++ b/arch/avr32/Kconfig
@@ -10,6 +10,7 @@ config AVR32
select GENERIC_IRQ_PROBE
select HARDIRQS_SW_RESEND
select GENERIC_IRQ_SHOW
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG
help
AVR32 is a high-performance 32-bit RISC microprocessor core,
designed for cost-sensitive embedded applications, with particular
--- a/arch/frv/Kconfig
+++ b/arch/frv/Kconfig
@@ -7,6 +7,7 @@ config FRV
select HAVE_PERF_EVENTS
select HAVE_GENERIC_HARDIRQS
select GENERIC_IRQ_SHOW
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG

config ZONE_DMA
bool
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -27,6 +27,7 @@ config IA64
select GENERIC_PENDING_IRQ if SMP
select IRQ_PER_CPU
select GENERIC_IRQ_SHOW
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG
default y
help
The Itanium Processor Family is Intel's 64-bit successor to
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -5,6 +5,7 @@ config M68K
select HAVE_AOUT if MMU
select GENERIC_ATOMIC64 if MMU
select HAVE_GENERIC_HARDIRQS if !MMU
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG if RMW_INSNS

config RWSEM_GENERIC_SPINLOCK
bool
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -15,6 +15,7 @@ config PARISC
select HAVE_GENERIC_HARDIRQS
select GENERIC_IRQ_PROBE
select IRQ_PER_CPU
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG

help
The PA-RISC microprocessor is designed by Hewlett-Packard and used
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
select IRQ_PER_CPU
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG

config EARLY_PRINTK
bool
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -81,6 +81,7 @@ config S390
select INIT_ALL_POSSIBLE
select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_BZIP2
select HAVE_KERNEL_LZMA
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -11,6 +11,7 @@ config SUPERH
select HAVE_DMA_ATTRS
select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG if (GUSA_RB || CPU_SH4A)
select PERF_USE_VMALLOC
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_BZIP2
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -53,6 +53,7 @@ config SPARC64
select HAVE_GENERIC_HARDIRQS
select GENERIC_IRQ_SHOW
select IRQ_PREFLOW_FASTEOI
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG

config ARCH_DEFCONFIG
string
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -12,6 +12,7 @@ config TILE
select GENERIC_IRQ_PROBE
select GENERIC_PENDING_IRQ if SMP
select GENERIC_IRQ_SHOW
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG if !M386

# FIXME: investigate whether we need/want these options.
# select HAVE_IOREMAP_PROT
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -72,6 +72,7 @@ config X86
select IRQ_FORCED_THREADING
select USE_GENERIC_SMP_HELPERS if SMP
select ARCH_NO_SYSDEV_OPS
+ select ARCH_HAVE_NMI_SAFE_CMPXCHG

config INSTRUCTION_DECODER
def_bool (KPROBES || PERF_EVENTS)

2011-05-17 08:08:59

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 2/9] lib, Add lock-less NULL terminated single list

Cmpxchg is used to implement adding new entry to the list, deleting
all entries from the list, deleting first entry of the list and some
other operations.

Because this is a single list, so the tail can not be accessed in O(1).

If there are multiple producers and multiple consumers, llist_add can
be used in producers and llist_del_all can be used in consumers. They
can work simultaneously without lock. But llist_del_first can not be
used here. Because llist_del_first depends on list->first->next does
not changed if list->first is not changed during its operation, but
llist_del_first, llist_add, llist_add (or llist_del_all, llist_add,
llist_add) sequence in another consumer may violate that.

If there are multiple producers and one consumer, llist_add can be
used in producers and llist_del_all or llist_del_first can be used in
the consumer.

This can be summarized as follow:

| add | del_first | del_all
add | - | - | -
del_first | | L | L
del_all | | | -

Where "-" stands for no lock is needed, while "L" stands for lock is
needed.

The list entries deleted via llist_del_all can be traversed with
traversing function such as llist_for_each etc. But the list entries
can not be traversed safely before deleted from the list. The order
of deleted entries is from the newest to the oldest added one. If you
want to traverse from the oldest to the newest, you must reverse the
order by yourself before traversing.

The basic atomic operation of this list is cmpxchg on long. On
architectures that don't have NMI-safe cmpxchg implementation, the
list can NOT be used in NMI handler. So code uses the list in NMI
handler should depend on CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.

Signed-off-by: Huang Ying <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Cc: Andrew Morton <[email protected]>
---
include/linux/llist.h | 126 ++++++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig | 3 +
lib/Makefile | 2
lib/llist.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 260 insertions(+)
create mode 100644 include/linux/llist.h
create mode 100644 lib/llist.c

--- /dev/null
+++ b/include/linux/llist.h
@@ -0,0 +1,126 @@
+#ifndef LLIST_H
+#define LLIST_H
+/*
+ * Lock-less NULL terminated single linked list
+ *
+ * If there are multiple producers and multiple consumers, llist_add
+ * can be used in producers and llist_del_all can be used in
+ * consumers. They can work simultaneously without lock. But
+ * llist_del_first can not be used here. Because llist_del_first
+ * depends on list->first->next does not changed if list->first is not
+ * changed during its operation, but llist_del_first, llist_add,
+ * llist_add (or llist_del_all, llist_add, llist_add) sequence in
+ * another consumer may violate that.
+ *
+ * If there are multiple producers and one consumer, llist_add can be
+ * used in producers and llist_del_all or llist_del_first can be used
+ * in the consumer.
+ *
+ * This can be summarized as follow:
+ *
+ * | add | del_first | del_all
+ * add | - | - | -
+ * del_first | | L | L
+ * del_all | | | -
+ *
+ * Where "-" stands for no lock is needed, while "L" stands for lock
+ * is needed.
+ *
+ * The list entries deleted via llist_del_all can be traversed with
+ * traversing function such as llist_for_each etc. But the list
+ * entries can not be traversed safely before deleted from the list.
+ * The order of deleted entries is from the newest to the oldest added
+ * one. If you want to traverse from the oldest to the newest, you
+ * must reverse the order by yourself before traversing.
+ *
+ * The basic atomic operation of this list is cmpxchg on long. On
+ * architectures that don't have NMI-safe cmpxchg implementation, the
+ * list can NOT be used in NMI handler. So code uses the list in NMI
+ * handler should depend on CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.
+ */
+
+struct llist_head {
+ struct llist_node *first;
+};
+
+struct llist_node {
+ struct llist_node *next;
+};
+
+#define LLIST_HEAD_INIT(name) { NULL }
+#define LLIST_HEAD(name) struct llist_head name = LLIST_HEAD_INIT(name)
+
+/**
+ * init_llist_head - initialize lock-less list head
+ * @head: the head for your lock-less list
+ */
+static inline void init_llist_head(struct llist_head *list)
+{
+ list->first = NULL;
+}
+
+/**
+ * llist_entry - get the struct of this entry
+ * @ptr: the &struct llist_node pointer.
+ * @type: the type of the struct this is embedded in.
+ * @member: the name of the llist_node within the struct.
+ */
+#define llist_entry(ptr, type, member) \
+ container_of(ptr, type, member)
+
+/**
+ * llist_for_each - iterate over some deleted entries of a lock-less list
+ * @pos: the &struct llist_node to use as a loop cursor
+ * @node: the first entry of deleted list entries
+ *
+ * In general, some entries of the lock-less list can be traversed
+ * safely only after being deleted from list, so start with an entry
+ * instead of list head.
+ *
+ * If being used on entries deleted from lock-less list directly, the
+ * traverse order is from the newest to the oldest added entry. If
+ * you want to traverse from the oldest to the newest, you must
+ * reverse the order by yourself before traversing.
+ */
+#define llist_for_each(pos, node) \
+ for ((pos) = (node); pos; (pos) = (pos)->next)
+
+/**
+ * llist_for_each_entry - iterate over some deleted entries of lock-less list of given type
+ * @pos: the type * to use as a loop cursor.
+ * @node: the fist entry of deleted list entries.
+ * @member: the name of the llist_node with the struct.
+ *
+ * In general, some entries of the lock-less list can be traversed
+ * safely only after being removed from list, so start with an entry
+ * instead of list head.
+ *
+ * If being used on entries deleted from lock-less list directly, the
+ * traverse order is from the newest to the oldest added entry. If
+ * you want to traverse from the oldest to the newest, you must
+ * reverse the order by yourself before traversing.
+ */
+#define llist_for_each_entry(pos, node, member) \
+ for ((pos) = llist_entry((node), typeof(*(pos)), member); \
+ &(pos)->member != NULL; \
+ (pos) = llist_entry((pos)->member.next, typeof(*(pos)), member))
+
+/**
+ * llist_empty - tests whether a lock-less list is empty
+ * @head: the list to test
+ *
+ * Not guaranteed to be accurate or up to date. Just a quick way to
+ * test whether the list is empty without deleting something from the
+ * list.
+ */
+static inline int llist_empty(const struct llist_head *head)
+{
+ return ACCESS_ONCE(head->first) == NULL;
+}
+
+void llist_add(struct llist_node *new, struct llist_head *head);
+void llist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
+ struct llist_head *head);
+struct llist_node *llist_del_first(struct llist_head *head);
+struct llist_node *llist_del_all(struct llist_head *head);
+#endif /* LLIST_H */
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -272,4 +272,7 @@ config AVERAGE

If unsure, say N.

+config LLIST
+ bool
+
endmenu
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -115,6 +115,8 @@ obj-$(CONFIG_AVERAGE) += average.o

obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o

+obj-$(CONFIG_LLIST) += llist.o
+
hostprogs-y := gen_crc32table
clean-files := crc32table.h

--- /dev/null
+++ b/lib/llist.c
@@ -0,0 +1,129 @@
+/*
+ * Lock-less NULL terminated single linked list
+ *
+ * The basic atomic operation of this list is cmpxchg on long. On
+ * architectures that don't have NMI-safe cmpxchg implementation, the
+ * list can NOT be used in NMI handler. So code uses the list in NMI
+ * handler should depend on CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.
+ *
+ * Copyright 2010,2011 Intel Corp.
+ * Author: Huang Ying <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation;
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/llist.h>
+
+#include <asm/system.h>
+
+/**
+ * llist_add - add a new entry
+ * @new: new entry to be added
+ * @head: the head for your lock-less list
+ */
+void llist_add(struct llist_node *new, struct llist_head *head)
+{
+ struct llist_node *entry, *old_entry;
+
+#ifndef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+ BUG_ON(in_nmi());
+#endif
+
+ entry = head->first;
+ do {
+ old_entry = entry;
+ new->next = entry;
+ cpu_relax();
+ } while ((entry = cmpxchg(&head->first, old_entry, new)) != old_entry);
+}
+EXPORT_SYMBOL_GPL(llist_add);
+
+/**
+ * llist_add_batch - add several linked entries in batch
+ * @new_first: first entry in batch to be added
+ * @new_last: last entry in batch to be added
+ * @head: the head for your lock-less list
+ */
+void llist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
+ struct llist_head *head)
+{
+ struct llist_node *entry, *old_entry;
+
+#ifndef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+ BUG_ON(in_nmi());
+#endif
+
+ entry = head->first;
+ do {
+ old_entry = entry;
+ new_last->next = entry;
+ cpu_relax();
+ } while ((entry = cmpxchg(&head->first, old_entry, new_first)) != old_entry);
+}
+EXPORT_SYMBOL_GPL(llist_add_batch);
+
+/**
+ * llist_del_first - delete the first entry of lock-less list
+ * @head: the head for your lock-less list
+ *
+ * If list is empty, return NULL, otherwise, return the first entry
+ * deleted, this is the newest added one.
+ *
+ * Only one llist_del_first user can be used simultaneously with
+ * multiple llist_add users without lock. Because otherwise
+ * llist_del_first, llist_add, llist_add (or llist_del_all, llist_add,
+ * llist_add) sequence in another user may change @head->first->next,
+ * but keep @head->first. If multiple consumers are needed, please
+ * use llist_del_all or use lock between consumers.
+ */
+struct llist_node *llist_del_first(struct llist_head *head)
+{
+ struct llist_node *entry, *old_entry, *next;
+
+#ifndef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+ BUG_ON(in_nmi());
+#endif
+
+ entry = head->first;
+ do {
+ if (entry == NULL)
+ return NULL;
+ old_entry = entry;
+ next = entry->next;
+ cpu_relax();
+ } while ((entry = cmpxchg(&head->first, old_entry, next)) != old_entry);
+
+ return entry;
+}
+EXPORT_SYMBOL_GPL(llist_del_first);
+
+/**
+ * llist_del_all - delete all entries from lock-less list
+ * @head: the head of lock-less list to delete all entries
+ *
+ * If list is empty, return NULL, otherwise, delete all entries and
+ * return the pointer to the first entry. The order of entries
+ * deleted is from the newest to the oldest added one.
+ */
+struct llist_node *llist_del_all(struct llist_head *head)
+{
+#ifndef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+ BUG_ON(in_nmi());
+#endif
+
+ return xchg(&head->first, NULL);
+}
+EXPORT_SYMBOL_GPL(llist_del_all);

2011-05-17 08:09:06

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 3/9] lib, Make gen_pool memory allocator lockless

This version of the gen_pool memory allocator supports lockless
operation.

This makes it safe to use in NMI handlers and other special
unblockable contexts that could otherwise deadlock on locks. This is
implemented by using atomic operations and retries on any conflicts.
The disadvantage is that there may be livelocks in extreme cases. For
better scalability, one gen_pool allocator can be used for each CPU.

The lockless operation only works if there is enough memory available.
If new memory is added to the pool a lock has to be still taken. So
any user relying on locklessness has to ensure that sufficient memory
is preallocated.

The basic atomic operation of this allocator is cmpxchg on long. On
architectures that don't have NMI-safe cmpxchg implementation, the
allocator can NOT be used in NMI handler. So code uses the allocator
in NMI handler should depend on CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.

Signed-off-by: Huang Ying <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Cc: Andrew Morton <[email protected]>
---
include/linux/bitmap.h | 1
include/linux/genalloc.h | 37 +++++-
lib/bitmap.c | 2
lib/genalloc.c | 284 ++++++++++++++++++++++++++++++++++++++---------
4 files changed, 267 insertions(+), 57 deletions(-)

--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -142,6 +142,7 @@ extern void bitmap_release_region(unsign
extern int bitmap_allocate_region(unsigned long *bitmap, int pos, int order);
extern void bitmap_copy_le(void *dst, const unsigned long *src, int nbits);

+#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) % BITS_PER_LONG))
#define BITMAP_LAST_WORD_MASK(nbits) \
( \
((nbits) % BITS_PER_LONG) ? \
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -1,8 +1,28 @@
+#ifndef GENALLOC_H
+#define GENALLOC_H
/*
- * Basic general purpose allocator for managing special purpose memory
- * not managed by the regular kmalloc/kfree interface.
- * Uses for this includes on-device special memory, uncached memory
- * etc.
+ * Basic general purpose allocator for managing special purpose
+ * memory, for example, memory that is not managed by the regular
+ * kmalloc/kfree interface. Uses for this includes on-device special
+ * memory, uncached memory etc.
+ *
+ * It is safe to use the allocator in NMI handlers and other special
+ * unblockable contexts that could otherwise deadlock on locks. This
+ * is implemented by using atomic operations and retries on any
+ * conflicts. The disadvantage is that there may be livelocks in
+ * extreme cases. For better scalability, one allocator can be used
+ * for each CPU.
+ *
+ * The lockless operation only works if there is enough memory
+ * available. If new memory is added to the pool a lock has to be
+ * still taken. So any user relying on locklessness has to ensure
+ * that sufficient memory is preallocated.
+ *
+ * The basic atomic operation of this allocator is cmpxchg on long.
+ * On architectures that don't have NMI-safe cmpxchg implementation,
+ * the allocator can NOT be used in NMI handler. So code uses the
+ * allocator in NMI handler should depend on
+ * CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.
*
* This source code is licensed under the GNU General Public License,
* Version 2. See the file COPYING for more details.
@@ -13,7 +33,7 @@
* General purpose special memory pool descriptor.
*/
struct gen_pool {
- rwlock_t lock;
+ spinlock_t lock;
struct list_head chunks; /* list of chunks in this pool */
int min_alloc_order; /* minimum allocation order */
};
@@ -22,8 +42,8 @@ struct gen_pool {
* General purpose special memory pool chunk descriptor.
*/
struct gen_pool_chunk {
- spinlock_t lock;
struct list_head next_chunk; /* next chunk in pool */
+ atomic_t avail;
unsigned long start_addr; /* starting address of memory chunk */
unsigned long end_addr; /* ending address of memory chunk */
unsigned long bits[0]; /* bitmap for allocating memory chunk */
@@ -34,3 +54,8 @@ extern int gen_pool_add(struct gen_pool
extern void gen_pool_destroy(struct gen_pool *);
extern unsigned long gen_pool_alloc(struct gen_pool *, size_t);
extern void gen_pool_free(struct gen_pool *, unsigned long, size_t);
+extern void gen_pool_for_each_chunk(struct gen_pool *,
+ void (*)(struct gen_pool *, struct gen_pool_chunk *, void *), void *);
+extern size_t gen_pool_avail(struct gen_pool *);
+extern size_t gen_pool_size(struct gen_pool *);
+#endif /* GENALLOC_H */
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -271,8 +271,6 @@ int __bitmap_weight(const unsigned long
}
EXPORT_SYMBOL(__bitmap_weight);

-#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) % BITS_PER_LONG))
-
void bitmap_set(unsigned long *map, int start, int nr)
{
unsigned long *p = map + BIT_WORD(start);
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -1,8 +1,26 @@
/*
- * Basic general purpose allocator for managing special purpose memory
- * not managed by the regular kmalloc/kfree interface.
- * Uses for this includes on-device special memory, uncached memory
- * etc.
+ * Basic general purpose allocator for managing special purpose
+ * memory, for example, memory that is not managed by the regular
+ * kmalloc/kfree interface. Uses for this includes on-device special
+ * memory, uncached memory etc.
+ *
+ * It is safe to use the allocator in NMI handlers and other special
+ * unblockable contexts that could otherwise deadlock on locks. This
+ * is implemented by using atomic operations and retries on any
+ * conflicts. The disadvantage is that there may be livelocks in
+ * extreme cases. For better scalability, one allocator can be used
+ * for each CPU.
+ *
+ * The lockless operation only works if there is enough memory
+ * available. If new memory is added to the pool a lock has to be
+ * still taken. So any user relying on locklessness has to ensure
+ * that sufficient memory is preallocated.
+ *
+ * The basic atomic operation of this allocator is cmpxchg on long.
+ * On architectures that don't have NMI-safe cmpxchg implementation,
+ * the allocator can NOT be used in NMI handler. So code uses the
+ * allocator in NMI handler should depend on
+ * CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.
*
* Copyright 2005 (C) Jes Sorensen <[email protected]>
*
@@ -13,8 +31,109 @@
#include <linux/slab.h>
#include <linux/module.h>
#include <linux/bitmap.h>
+#include <linux/rculist.h>
+#include <linux/interrupt.h>
#include <linux/genalloc.h>

+static int set_bits_ll(unsigned long *addr, unsigned long mask_to_set)
+{
+ unsigned long val, nval;
+
+ nval = *addr;
+ do {
+ val = nval;
+ if (val & mask_to_set)
+ return -EBUSY;
+ cpu_relax();
+ } while ((nval = cmpxchg(addr, val, val | mask_to_set)) != val);
+
+ return 0;
+}
+
+static int clear_bits_ll(unsigned long *addr, unsigned long mask_to_clear)
+{
+ unsigned long val, nval;
+
+ nval = *addr;
+ do {
+ val = nval;
+ if ((val & mask_to_clear) != mask_to_clear)
+ return -EBUSY;
+ cpu_relax();
+ } while ((nval = cmpxchg(addr, val, val & ~mask_to_clear)) != val);
+
+ return 0;
+}
+
+/*
+ * bitmap_set_ll - set the specified number of bits at the specified position
+ * @map: pointer to a bitmap
+ * @start: a bit position in @map
+ * @nr: number of bits to set
+ *
+ * Set @nr bits start from @start in @map lock-lessly. Several users
+ * can set/clear the same bitmap simultaneously without lock. If two
+ * users set the same bit, one user will return remain bits, otherwise
+ * return 0.
+ */
+static int bitmap_set_ll(unsigned long *map, int start, int nr)
+{
+ unsigned long *p = map + BIT_WORD(start);
+ const int size = start + nr;
+ int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+ unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+ while (nr - bits_to_set >= 0) {
+ if (set_bits_ll(p, mask_to_set))
+ return nr;
+ nr -= bits_to_set;
+ bits_to_set = BITS_PER_LONG;
+ mask_to_set = ~0UL;
+ p++;
+ }
+ if (nr) {
+ mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+ if (set_bits_ll(p, mask_to_set))
+ return nr;
+ }
+
+ return 0;
+}
+
+/*
+ * bitmap_clear_ll - clear the specified number of bits at the specified position
+ * @map: pointer to a bitmap
+ * @start: a bit position in @map
+ * @nr: number of bits to set
+ *
+ * Clear @nr bits start from @start in @map lock-lessly. Several users
+ * can set/clear the same bitmap simultaneously without lock. If two
+ * users clear the same bit, one user will return remain bits,
+ * otherwise return 0.
+ */
+static int bitmap_clear_ll(unsigned long *map, int start, int nr)
+{
+ unsigned long *p = map + BIT_WORD(start);
+ const int size = start + nr;
+ int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+ unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+ while (nr - bits_to_clear >= 0) {
+ if (clear_bits_ll(p, mask_to_clear))
+ return nr;
+ nr -= bits_to_clear;
+ bits_to_clear = BITS_PER_LONG;
+ mask_to_clear = ~0UL;
+ p++;
+ }
+ if (nr) {
+ mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+ if (clear_bits_ll(p, mask_to_clear))
+ return nr;
+ }
+
+ return 0;
+}

/**
* gen_pool_create - create a new special memory pool
@@ -30,7 +149,7 @@ struct gen_pool *gen_pool_create(int min

pool = kmalloc_node(sizeof(struct gen_pool), GFP_KERNEL, nid);
if (pool != NULL) {
- rwlock_init(&pool->lock);
+ spin_lock_init(&pool->lock);
INIT_LIST_HEAD(&pool->chunks);
pool->min_alloc_order = min_alloc_order;
}
@@ -58,15 +177,15 @@ int gen_pool_add(struct gen_pool *pool,

chunk = kmalloc_node(nbytes, GFP_KERNEL | __GFP_ZERO, nid);
if (unlikely(chunk == NULL))
- return -1;
+ return -ENOMEM;

- spin_lock_init(&chunk->lock);
chunk->start_addr = addr;
chunk->end_addr = addr + size;
+ atomic_set(&chunk->avail, size);

- write_lock(&pool->lock);
- list_add(&chunk->next_chunk, &pool->chunks);
- write_unlock(&pool->lock);
+ spin_lock(&pool->lock);
+ list_add_rcu(&chunk->next_chunk, &pool->chunks);
+ spin_unlock(&pool->lock);

return 0;
}
@@ -86,7 +205,6 @@ void gen_pool_destroy(struct gen_pool *p
int order = pool->min_alloc_order;
int bit, end_bit;

-
list_for_each_safe(_chunk, _next_chunk, &pool->chunks) {
chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
list_del(&chunk->next_chunk);
@@ -108,44 +226,50 @@ EXPORT_SYMBOL(gen_pool_destroy);
* @size: number of bytes to allocate from the pool
*
* Allocate the requested number of bytes from the specified pool.
- * Uses a first-fit algorithm.
+ * Uses a first-fit algorithm. Can not be used in NMI handler on
+ * architectures without NMI-safe cmpxchg implementation.
*/
unsigned long gen_pool_alloc(struct gen_pool *pool, size_t size)
{
- struct list_head *_chunk;
struct gen_pool_chunk *chunk;
- unsigned long addr, flags;
+ unsigned long addr = 0;
int order = pool->min_alloc_order;
- int nbits, start_bit, end_bit;
+ int nbits, start_bit = 0, end_bit, remain;
+
+#ifndef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+ BUG_ON(in_nmi());
+#endif

if (size == 0)
return 0;

nbits = (size + (1UL << order) - 1) >> order;
-
- read_lock(&pool->lock);
- list_for_each(_chunk, &pool->chunks) {
- chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
+ rcu_read_lock();
+ list_for_each_entry_rcu(chunk, &pool->chunks, next_chunk) {
+ if (size > atomic_read(&chunk->avail))
+ continue;

end_bit = (chunk->end_addr - chunk->start_addr) >> order;
-
- spin_lock_irqsave(&chunk->lock, flags);
- start_bit = bitmap_find_next_zero_area(chunk->bits, end_bit, 0,
- nbits, 0);
- if (start_bit >= end_bit) {
- spin_unlock_irqrestore(&chunk->lock, flags);
+retry:
+ start_bit = bitmap_find_next_zero_area(chunk->bits, end_bit,
+ start_bit, nbits, 0);
+ if (start_bit >= end_bit)
continue;
+ remain = bitmap_set_ll(chunk->bits, start_bit, nbits);
+ if (remain) {
+ remain = bitmap_clear_ll(chunk->bits, start_bit,
+ nbits - remain);
+ BUG_ON(remain);
+ goto retry;
}

addr = chunk->start_addr + ((unsigned long)start_bit << order);
-
- bitmap_set(chunk->bits, start_bit, nbits);
- spin_unlock_irqrestore(&chunk->lock, flags);
- read_unlock(&pool->lock);
- return addr;
+ size = nbits << order;
+ atomic_sub(size, &chunk->avail);
+ break;
}
- read_unlock(&pool->lock);
- return 0;
+ rcu_read_unlock();
+ return addr;
}
EXPORT_SYMBOL(gen_pool_alloc);

@@ -155,33 +279,95 @@ EXPORT_SYMBOL(gen_pool_alloc);
* @addr: starting address of memory to free back to pool
* @size: size in bytes of memory to free
*
- * Free previously allocated special memory back to the specified pool.
+ * Free previously allocated special memory back to the specified
+ * pool. Can not be used in NMI handler on architectures without
+ * NMI-safe cmpxchg implementation.
*/
void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size)
{
- struct list_head *_chunk;
struct gen_pool_chunk *chunk;
- unsigned long flags;
int order = pool->min_alloc_order;
- int bit, nbits;
-
- nbits = (size + (1UL << order) - 1) >> order;
+ int start_bit, nbits, remain;

- read_lock(&pool->lock);
- list_for_each(_chunk, &pool->chunks) {
- chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
+#ifndef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+ BUG_ON(in_nmi());
+#endif

+ nbits = (size + (1UL << order) - 1) >> order;
+ rcu_read_lock();
+ list_for_each_entry_rcu(chunk, &pool->chunks, next_chunk) {
if (addr >= chunk->start_addr && addr < chunk->end_addr) {
BUG_ON(addr + size > chunk->end_addr);
- spin_lock_irqsave(&chunk->lock, flags);
- bit = (addr - chunk->start_addr) >> order;
- while (nbits--)
- __clear_bit(bit++, chunk->bits);
- spin_unlock_irqrestore(&chunk->lock, flags);
- break;
+ start_bit = (addr - chunk->start_addr) >> order;
+ remain = bitmap_clear_ll(chunk->bits, start_bit, nbits);
+ BUG_ON(remain);
+ size = nbits << order;
+ atomic_add(size, &chunk->avail);
+ rcu_read_unlock();
+ return;
}
}
- BUG_ON(nbits > 0);
- read_unlock(&pool->lock);
+ rcu_read_unlock();
+ BUG();
}
EXPORT_SYMBOL(gen_pool_free);
+
+/**
+ * gen_pool_for_each_chunk - call func for every chunk of generic memory pool
+ * @pool: the generic memory pool
+ * @func: func to call
+ * @data: additional data used by @func
+ *
+ * Call @func for every chunk of generic memory pool. The @func is
+ * called with rcu_read_lock held.
+ */
+void gen_pool_for_each_chunk(struct gen_pool *pool,
+ void (*func)(struct gen_pool *pool, struct gen_pool_chunk *chunk, void *data),
+ void *data)
+{
+ struct gen_pool_chunk *chunk;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(chunk, &(pool)->chunks, next_chunk)
+ func(pool, chunk, data);
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL(gen_pool_for_each_chunk);
+
+/**
+ * gen_pool_avail - get available free space of the pool
+ * @pool: pool to get available free space
+ *
+ * Return available free space of the specified pool.
+ */
+size_t gen_pool_avail(struct gen_pool *pool)
+{
+ struct gen_pool_chunk *chunk;
+ size_t avail = 0;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(chunk, &pool->chunks, next_chunk)
+ avail += atomic_read(&chunk->avail);
+ rcu_read_unlock();
+ return avail;
+}
+EXPORT_SYMBOL_GPL(gen_pool_avail);
+
+/**
+ * gen_pool_size - get size in bytes of memory managed by the pool
+ * @pool: pool to get size
+ *
+ * Return size in bytes of memory managed by the pool.
+ */
+size_t gen_pool_size(struct gen_pool *pool)
+{
+ struct gen_pool_chunk *chunk;
+ size_t size = 0;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(chunk, &pool->chunks, next_chunk)
+ size += chunk->end_addr - chunk->start_addr;
+ rcu_read_unlock();
+ return size;
+}
+EXPORT_SYMBOL_GPL(gen_pool_size);

2011-05-17 08:10:59

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 4/9] ACPI, APEI, GHES, printk support for recoverable error via NMI

Some APEI GHES recoverable errors are reported via NMI, but printk is
not safe in NMI context.

To solve the issue, a lock-less memory allocator is used to allocate
memory in NMI handler, save the error record into the allocated
memory, put the error record into a lock-less list. On the other
hand, a irq_work is used to delay the operation from NMI context to
IRQ context. The irq_work IRQ handler will remove nodes from
lock-less list, printk the error record and do some further processing
include recovery operation, then free the memory.

Signed-off-by: Huang Ying <[email protected]>
---
drivers/acpi/apei/Kconfig | 2
drivers/acpi/apei/ghes.c | 193 ++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 181 insertions(+), 14 deletions(-)

--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -12,6 +12,8 @@ config ACPI_APEI_GHES
tristate "APEI Generic Hardware Error Source"
depends on ACPI_APEI && X86
select ACPI_HED
+ select LLIST
+ select GENERIC_ALLOCATOR
help
Generic Hardware Error Source provides a way to report
platform hardware errors (such as that from chipset). It
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -12,7 +12,7 @@
* For more information about Generic Hardware Error Source, please
* refer to ACPI Specification version 4.0, section 17.3.2.6
*
- * Copyright 2010 Intel Corp.
+ * Copyright 2010,2011 Intel Corp.
* Author: Huang Ying <[email protected]>
*
* This program is free software; you can redistribute it and/or
@@ -42,6 +42,9 @@
#include <linux/mutex.h>
#include <linux/ratelimit.h>
#include <linux/vmalloc.h>
+#include <linux/irq_work.h>
+#include <linux/llist.h>
+#include <linux/genalloc.h>
#include <acpi/apei.h>
#include <acpi/atomicio.h>
#include <acpi/hed.h>
@@ -53,6 +56,15 @@
#define GHES_PFX "GHES: "

#define GHES_ESTATUS_MAX_SIZE 65536
+#define GHES_ESOURCE_PREALLOC_MAX_SIZE 65536
+
+#define GHES_ESTATUS_POOL_MIN_ALLOC_ORDER 3
+
+#define GHES_ESTATUS_NODE_LEN(estatus_len) \
+ (sizeof(struct ghes_estatus_node) + (estatus_len))
+#define GHES_ESTATUS_FROM_NODE(estatus_node) \
+ ((struct acpi_hest_generic_status *) \
+ ((struct ghes_estatus_node *)(estatus_node) + 1))

/*
* One struct ghes is created for each generic hardware error source.
@@ -77,6 +89,11 @@ struct ghes {
};
};

+struct ghes_estatus_node {
+ struct llist_node llnode;
+ struct acpi_hest_generic *generic;
+};
+
static int ghes_panic_timeout __read_mostly = 30;

/*
@@ -121,6 +138,19 @@ static struct vm_struct *ghes_ioremap_ar
static DEFINE_RAW_SPINLOCK(ghes_ioremap_lock_nmi);
static DEFINE_SPINLOCK(ghes_ioremap_lock_irq);

+/*
+ * printk is not safe in NMI context. So in NMI handler, we allocate
+ * required memory from lock-less memory allocator
+ * (ghes_estatus_pool), save estatus into it, put them into lock-less
+ * list (ghes_estatus_llist), then delay printk into IRQ context via
+ * irq_work (ghes_proc_irq_work). ghes_estatus_size_request record
+ * required pool size by all NMI error source.
+ */
+static struct gen_pool *ghes_estatus_pool;
+static unsigned long ghes_estatus_pool_size_request;
+static struct llist_head ghes_estatus_llist;
+static struct irq_work ghes_proc_irq_work;
+
static int ghes_ioremap_init(void)
{
ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
@@ -180,6 +210,55 @@ static void ghes_iounmap_irq(void __iome
__flush_tlb_one(vaddr);
}

+static int ghes_estatus_pool_init(void)
+{
+ ghes_estatus_pool = gen_pool_create(GHES_ESTATUS_POOL_MIN_ALLOC_ORDER, -1);
+ if (!ghes_estatus_pool)
+ return -ENOMEM;
+ return 0;
+}
+
+static void ghes_estatus_pool_free_chunk_page(struct gen_pool *pool,
+ struct gen_pool_chunk *chunk,
+ void *data)
+{
+ free_page(chunk->start_addr);
+}
+
+static void ghes_estatus_pool_exit(void)
+{
+ gen_pool_for_each_chunk(ghes_estatus_pool,
+ ghes_estatus_pool_free_chunk_page, NULL);
+ gen_pool_destroy(ghes_estatus_pool);
+}
+
+static int ghes_estatus_pool_expand(unsigned long len)
+{
+ unsigned long i, pages, size, addr;
+ int ret;
+
+ ghes_estatus_pool_size_request += PAGE_ALIGN(len);
+ size = gen_pool_size(ghes_estatus_pool);
+ if (size >= ghes_estatus_pool_size_request)
+ return 0;
+ pages = (ghes_estatus_pool_size_request - size) / PAGE_SIZE;
+ for (i = 0; i < pages; i++) {
+ addr = __get_free_page(GFP_KERNEL);
+ if (!addr)
+ return -ENOMEM;
+ ret = gen_pool_add(ghes_estatus_pool, addr, PAGE_SIZE, -1);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static void ghes_estatus_pool_shrink(unsigned long len)
+{
+ ghes_estatus_pool_size_request -= PAGE_ALIGN(len);
+}
+
static struct ghes *ghes_new(struct acpi_hest_generic *generic)
{
struct ghes *ghes;
@@ -341,13 +420,13 @@ static void ghes_clear_estatus(struct gh
ghes->flags &= ~GHES_TO_CLEAR;
}

-static void ghes_do_proc(struct ghes *ghes)
+static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
{
int sev, processed = 0;
struct acpi_hest_generic_data *gdata;

- sev = ghes_severity(ghes->estatus->error_severity);
- apei_estatus_for_each_section(ghes->estatus, gdata) {
+ sev = ghes_severity(estatus->error_severity);
+ apei_estatus_for_each_section(estatus, gdata) {
#ifdef CONFIG_X86_MCE
if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
CPER_SEC_PLATFORM_MEM)) {
@@ -360,13 +439,15 @@ static void ghes_do_proc(struct ghes *gh
}
}

-static void ghes_print_estatus(const char *pfx, struct ghes *ghes)
+static void ghes_print_estatus(const char *pfx,
+ const struct acpi_hest_generic *generic,
+ const struct acpi_hest_generic_status *estatus)
{
/* Not more than 2 messages every 5 seconds */
static DEFINE_RATELIMIT_STATE(ratelimit, 5*HZ, 2);

if (pfx == NULL) {
- if (ghes_severity(ghes->estatus->error_severity) <=
+ if (ghes_severity(estatus->error_severity) <=
GHES_SEV_CORRECTED)
pfx = KERN_WARNING HW_ERR;
else
@@ -375,8 +456,8 @@ static void ghes_print_estatus(const cha
if (__ratelimit(&ratelimit)) {
printk(
"%s""Hardware error from APEI Generic Hardware Error Source: %d\n",
- pfx, ghes->generic->header.source_id);
- apei_estatus_print(pfx, ghes->estatus);
+ pfx, generic->header.source_id);
+ apei_estatus_print(pfx, estatus);
}
}

@@ -387,8 +468,8 @@ static int ghes_proc(struct ghes *ghes)
rc = ghes_read_estatus(ghes, 0);
if (rc)
goto out;
- ghes_print_estatus(NULL, ghes);
- ghes_do_proc(ghes);
+ ghes_print_estatus(NULL, ghes->generic, ghes->estatus);
+ ghes_do_proc(ghes->estatus);

out:
ghes_clear_estatus(ghes);
@@ -447,6 +528,40 @@ static int ghes_notify_sci(struct notifi
return ret;
}

+static void ghes_proc_in_irq(struct irq_work *irq_work)
+{
+ struct llist_node *llnode, *next, *tail = NULL;
+ struct ghes_estatus_node *estatus_node;
+ struct acpi_hest_generic_status *estatus;
+ u32 len, node_len;
+
+ /*
+ * Because the time order of estatus in list is reversed,
+ * revert it back to proper order.
+ */
+ llnode = llist_del_all(&ghes_estatus_llist);
+ while (llnode) {
+ next = llnode->next;
+ llnode->next = tail;
+ tail = llnode;
+ llnode = next;
+ }
+ llnode = tail;
+ while (llnode) {
+ next = llnode->next;
+ estatus_node = llist_entry(llnode, struct ghes_estatus_node,
+ llnode);
+ estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
+ len = apei_estatus_len(estatus);
+ node_len = GHES_ESTATUS_NODE_LEN(len);
+ ghes_do_proc(estatus);
+ ghes_print_estatus(NULL, estatus_node->generic, estatus);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);
+ llnode = next;
+ }
+}
+
static int ghes_notify_nmi(struct notifier_block *this,
unsigned long cmd, void *data)
{
@@ -476,7 +591,8 @@ static int ghes_notify_nmi(struct notifi

if (sev_global >= GHES_SEV_PANIC) {
oops_begin();
- ghes_print_estatus(KERN_EMERG HW_ERR, ghes_global);
+ ghes_print_estatus(KERN_EMERG HW_ERR, ghes_global->generic,
+ ghes_global->estatus);
/* reboot to log the error! */
if (panic_timeout == 0)
panic_timeout = ghes_panic_timeout;
@@ -484,12 +600,31 @@ static int ghes_notify_nmi(struct notifi
}

list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
+#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+ u32 len, node_len;
+ struct ghes_estatus_node *estatus_node;
+ struct acpi_hest_generic_status *estatus;
+#endif
if (!(ghes->flags & GHES_TO_CLEAR))
continue;
- /* Do not print estatus because printk is not NMI safe */
- ghes_do_proc(ghes);
+#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+ /* Save estatus for further processing in IRQ context */
+ len = apei_estatus_len(ghes->estatus);
+ node_len = GHES_ESTATUS_NODE_LEN(len);
+ estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool,
+ node_len);
+ if (estatus_node) {
+ estatus_node->generic = ghes->generic;
+ estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
+ memcpy(estatus, ghes->estatus, len);
+ llist_add(&estatus_node->llnode, &ghes_estatus_llist);
+ }
+#endif
ghes_clear_estatus(ghes);
}
+#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+ irq_work_queue(&ghes_proc_irq_work);
+#endif

out:
raw_spin_unlock(&ghes_nmi_lock);
@@ -504,10 +639,26 @@ static struct notifier_block ghes_notifi
.notifier_call = ghes_notify_nmi,
};

+static unsigned long ghes_esource_prealloc_size(
+ const struct acpi_hest_generic *generic)
+{
+ unsigned long block_length, prealloc_records, prealloc_size;
+
+ block_length = min_t(unsigned long, generic->error_block_length,
+ GHES_ESTATUS_MAX_SIZE);
+ prealloc_records = max_t(unsigned long,
+ generic->records_to_preallocate, 1);
+ prealloc_size = min_t(unsigned long, block_length * prealloc_records,
+ GHES_ESOURCE_PREALLOC_MAX_SIZE);
+
+ return prealloc_size;
+}
+
static int __devinit ghes_probe(struct platform_device *ghes_dev)
{
struct acpi_hest_generic *generic;
struct ghes *ghes = NULL;
+ unsigned long len;
int rc = -EINVAL;

generic = *(struct acpi_hest_generic **)ghes_dev->dev.platform_data;
@@ -573,6 +724,8 @@ static int __devinit ghes_probe(struct p
mutex_unlock(&ghes_list_mutex);
break;
case ACPI_HEST_NOTIFY_NMI:
+ len = ghes_esource_prealloc_size(generic);
+ ghes_estatus_pool_expand(len);
mutex_lock(&ghes_list_mutex);
if (list_empty(&ghes_nmi))
register_die_notifier(&ghes_notifier_nmi);
@@ -597,6 +750,7 @@ static int __devexit ghes_remove(struct
{
struct ghes *ghes;
struct acpi_hest_generic *generic;
+ unsigned long len;

ghes = platform_get_drvdata(ghes_dev);
generic = ghes->generic;
@@ -627,6 +781,8 @@ static int __devexit ghes_remove(struct
* freed after NMI handler finishes.
*/
synchronize_rcu();
+ len = ghes_esource_prealloc_size(generic);
+ ghes_estatus_pool_shrink(len);
break;
default:
BUG();
@@ -662,15 +818,23 @@ static int __init ghes_init(void)
return -EINVAL;
}

+ init_irq_work(&ghes_proc_irq_work, ghes_proc_in_irq);
+
rc = ghes_ioremap_init();
if (rc)
goto err;

- rc = platform_driver_register(&ghes_platform_driver);
+ rc = ghes_estatus_pool_init();
if (rc)
goto err_ioremap_exit;

+ rc = platform_driver_register(&ghes_platform_driver);
+ if (rc)
+ goto err_pool_exit;
+
return 0;
+err_pool_exit:
+ ghes_estatus_pool_exit();
err_ioremap_exit:
ghes_ioremap_exit();
err:
@@ -680,6 +844,7 @@ err:
static void __exit ghes_exit(void)
{
platform_driver_unregister(&ghes_platform_driver);
+ ghes_estatus_pool_exit();
ghes_ioremap_exit();
}

2011-05-17 08:10:58

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 5/9] HWPoison: add memory_failure_queue()

memory_failure() is the entry point for HWPoison memory error
recovery. It must be called in process context. But commonly
hardware memory errors are notified via MCE or NMI, so some delayed
execution mechanism must be used. In MCE handler, a work queue + ring
buffer mechanism is used.

In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
(Generic Hardware Error Source) can be used to report memory errors
too. To add support to APEI GHES memory recovery, a mechanism similar
to that of MCE is implemented. memory_failure_queue() is the new
entry point that can be called in IRQ context. The next step is to
make MCE handler uses this interface too.

Signed-off-by: Huang Ying <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Andrew Morton <[email protected]>
---
include/linux/mm.h | 1
mm/memory-failure.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 93 insertions(+)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1664,6 +1664,7 @@ enum mf_flags {
};
extern void memory_failure(unsigned long pfn, int trapno);
extern int __memory_failure(unsigned long pfn, int trapno, int flags);
+extern void memory_failure_queue(unsigned long pfn, int trapno, int flags);
extern int unpoison_memory(unsigned long pfn);
extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -52,6 +52,7 @@
#include <linux/swapops.h>
#include <linux/hugetlb.h>
#include <linux/memory_hotplug.h>
+#include <linux/kfifo.h>
#include "internal.h"

int sysctl_memory_failure_early_kill __read_mostly = 0;
@@ -1182,6 +1183,97 @@ void memory_failure(unsigned long pfn, i
__memory_failure(pfn, trapno, 0);
}

+#define MEMORY_FAILURE_FIFO_ORDER 4
+#define MEMORY_FAILURE_FIFO_SIZE (1 << MEMORY_FAILURE_FIFO_ORDER)
+
+struct memory_failure_entry {
+ unsigned long pfn;
+ int trapno;
+ int flags;
+};
+
+struct memory_failure_cpu {
+ DECLARE_KFIFO(fifo, struct memory_failure_entry,
+ MEMORY_FAILURE_FIFO_SIZE);
+ spinlock_t lock;
+ struct work_struct work;
+};
+
+static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
+
+/**
+ * memory_failure_queue - Schedule handling memory failure of a page.
+ * @pfn: Page Number of the corrupted page
+ * @trapno: Trap number reported in the signal to user space.
+ * @flags: Flags for memory failure handling
+ *
+ * This function is called by the low level hardware error handler
+ * when it detects hardware memory corruption of a page. It schedules
+ * the recovering of error page, including dropping pages, killing
+ * processes etc.
+ *
+ * The function is primarily of use for corruptions that
+ * happen outside the current execution context (e.g. when
+ * detected by a background scrubber)
+ *
+ * Can run in IRQ context.
+ */
+void memory_failure_queue(unsigned long pfn, int trapno, int flags)
+{
+ struct memory_failure_cpu *mf_cpu;
+ unsigned long proc_flags;
+ struct memory_failure_entry entry = {
+ .pfn = pfn,
+ .trapno = trapno,
+ .flags = flags,
+ };
+
+ mf_cpu = &get_cpu_var(memory_failure_cpu);
+ spin_lock_irqsave(&mf_cpu->lock, proc_flags);
+ if (kfifo_put(&mf_cpu->fifo, &entry))
+ schedule_work_on(smp_processor_id(), &mf_cpu->work);
+ else
+ pr_err("Memory failure: buffer overflow when recovering memory failure at 0x%#lx\n",
+ pfn);
+ spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
+ put_cpu_var(memory_failure_cpu);
+}
+EXPORT_SYMBOL_GPL(memory_failure_queue);
+
+static void memory_failure_work_func(struct work_struct *work)
+{
+ struct memory_failure_cpu *mf_cpu;
+ struct memory_failure_entry entry = { 0, };
+ unsigned long proc_flags;
+ int gotten;
+
+ mf_cpu = &__get_cpu_var(memory_failure_cpu);
+ for (;;) {
+ spin_lock_irqsave(&mf_cpu->lock, proc_flags);
+ gotten = kfifo_get(&mf_cpu->fifo, &entry);
+ spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
+ if (!gotten)
+ break;
+ __memory_failure(entry.pfn, entry.trapno, entry.flags);
+ }
+}
+
+static int __init memory_failure_init(void)
+{
+ struct memory_failure_cpu *mf_cpu;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ mf_cpu = &per_cpu(memory_failure_cpu, cpu);
+ spin_lock_init(&mf_cpu->lock);
+ INIT_KFIFO(mf_cpu->fifo);
+ INIT_WORK(&mf_cpu->work, memory_failure_work_func);
+ }
+
+ return 0;
+}
+core_initcall(memory_failure_init);
+
/**
* unpoison_memory - Unpoison a previously poisoned page
* @pfn: Page number of the to be unpoisoned page

2011-05-17 08:10:55

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 6/9] ACPI, APEI, GHES: Add hardware memory error recovery support

memory_failure_queue() is called when recoverable memory errors are
notified by firmware to do the recovery work.

Signed-off-by: Huang Ying <[email protected]>
---
drivers/acpi/apei/Kconfig | 7 +++++++
drivers/acpi/apei/ghes.c | 24 +++++++++++++++++-------
2 files changed, 24 insertions(+), 7 deletions(-)

--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -31,6 +31,13 @@ config ACPI_APEI_PCIEAER
PCIe AER errors may be reported via APEI firmware first mode.
Turn on this option to enable the corresponding support.

+config ACPI_APEI_MEMORY_FAILURE
+ bool "APEI memory error recovering support"
+ depends on ACPI_APEI && MEMORY_FAILURE
+ help
+ Memory errors may be reported via APEI firmware first mode.
+ Turn on this option to enable the memory recovering support.
+
config ACPI_APEI_EINJ
tristate "APEI Error INJection (EINJ)"
depends on ACPI_APEI && DEBUG_FS
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -422,20 +422,30 @@ static void ghes_clear_estatus(struct gh

static void ghes_do_proc(const struct acpi_hest_generic_status *estatus)
{
- int sev, processed = 0;
+ int sev, sec_sev;
struct acpi_hest_generic_data *gdata;

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
-#ifdef CONFIG_X86_MCE
+ sec_sev = ghes_severity(gdata->error_severity);
if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
CPER_SEC_PLATFORM_MEM)) {
- apei_mce_report_mem_error(
- sev == GHES_SEV_CORRECTED,
- (struct cper_sec_mem_err *)(gdata+1));
- processed = 1;
- }
+ struct cper_sec_mem_err *mem_err;
+ mem_err = (struct cper_sec_mem_err *)(gdata+1);
+#ifdef CONFIG_X86_MCE
+ apei_mce_report_mem_error(sev == GHES_SEV_CORRECTED,
+ mem_err);
#endif
+#ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
+ if (sev == GHES_SEV_RECOVERABLE &&
+ sec_sev == GHES_SEV_RECOVERABLE &&
+ mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
+ unsigned long pfn;
+ pfn = mem_err->physical_addr >> PAGE_SHIFT;
+ memory_failure_queue(pfn, 0, 0);
+ }
+#endif
+ }
}
}

2011-05-17 08:10:17

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 7/9] PCIe, AER, add aer_recover_queue

In addition to native PCIe AER, now APEI (ACPI Platform Error
Interface) GHES (Generic Hardware Error Source) can be used to report
PCIe AER errors too. To add support to APEI GHES PCIe AER recovery,
aer_recover_queue is added to export the recovery function in native
PCIe AER driver.

Recoverable PCIe AER errors are reported via NMI in APEI GHES. Then
APEI GHES uses irq_work to delay the error processing into an IRQ
handler. But PCIe AER recovery can be very time-consuming, so
aer_recover_queue, which can be used in IRQ handler, delays the real
recovery action into the process context, that is, work queue.

Signed-off-by: Huang Ying <[email protected]>
CC: Jesse Barnes <[email protected]>
CC: Zhang Yanmin <[email protected]>
---
drivers/pci/pcie/aer/aerdrv_core.c | 76 +++++++++++++++++++++++++++++----
drivers/pci/pcie/aer/aerdrv_errprint.c | 3 -
include/linux/aer.h | 3 +
3 files changed, 74 insertions(+), 8 deletions(-)

--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -24,6 +24,7 @@
#include <linux/suspend.h>
#include <linux/delay.h>
#include <linux/slab.h>
+#include <linux/kfifo.h>
#include "aerdrv.h"

static int forceload;
@@ -445,8 +446,7 @@ static struct pcie_port_service_driver *
return drv;
}

-static pci_ers_result_t reset_link(struct pcie_device *aerdev,
- struct pci_dev *dev)
+static pci_ers_result_t reset_link(struct pci_dev *dev)
{
struct pci_dev *udev;
pci_ers_result_t status;
@@ -486,7 +486,6 @@ static pci_ers_result_t reset_link(struc

/**
* do_recovery - handle nonfatal/fatal error recovery process
- * @aerdev: pointer to a pcie_device data structure of root port
* @dev: pointer to a pci_dev data structure of agent detecting an error
* @severity: error severity type
*
@@ -494,8 +493,7 @@ static pci_ers_result_t reset_link(struc
* error detected message to all downstream drivers within a hierarchy in
* question and return the returned code.
*/
-static void do_recovery(struct pcie_device *aerdev, struct pci_dev *dev,
- int severity)
+static void do_recovery(struct pci_dev *dev, int severity)
{
pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
enum pci_channel_state state;
@@ -511,7 +509,7 @@ static void do_recovery(struct pcie_devi
report_error_detected);

if (severity == AER_FATAL) {
- result = reset_link(aerdev, dev);
+ result = reset_link(dev);
if (result != PCI_ERS_RESULT_RECOVERED)
goto failed;
}
@@ -576,9 +574,73 @@ static void handle_error_source(struct p
pci_write_config_dword(dev, pos + PCI_ERR_COR_STATUS,
info->status);
} else
- do_recovery(aerdev, dev, info->severity);
+ do_recovery(dev, info->severity);
}

+#ifdef CONFIG_ACPI_APEI_PCIEAER
+static void aer_recover_work_func(struct work_struct *work);
+
+#define AER_RECOVER_RING_ORDER 4
+#define AER_RECOVER_RING_SIZE (1 << AER_RECOVER_RING_ORDER)
+
+struct aer_recover_entry
+{
+ u8 bus;
+ u8 devfn;
+ u16 domain;
+ int severity;
+};
+
+static DEFINE_KFIFO(aer_recover_ring, struct aer_recover_entry,
+ AER_RECOVER_RING_SIZE);
+/*
+ * Mutual exclusion for writers of aer_recover_ring, reader side don't
+ * need lock, because there is only one reader and lock is not needed
+ * between reader and writer.
+ */
+static DEFINE_SPINLOCK(aer_recover_ring_lock);
+static DECLARE_WORK(aer_recover_work, aer_recover_work_func);
+
+void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
+ int severity)
+{
+ unsigned long flags;
+ struct aer_recover_entry entry = {
+ .bus = bus,
+ .devfn = devfn,
+ .domain = domain,
+ .severity = severity,
+ };
+
+ spin_lock_irqsave(&aer_recover_ring_lock, flags);
+ if (kfifo_put(&aer_recover_ring, &entry))
+ schedule_work(&aer_recover_work);
+ else
+ pr_err("AER recover: Buffer overflow when recovering AER for %04x:%02x:%02x:%x\n",
+ domain, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
+ spin_unlock_irqrestore(&aer_recover_ring_lock, flags);
+}
+EXPORT_SYMBOL_GPL(aer_recover_queue);
+
+static void aer_recover_work_func(struct work_struct *work)
+{
+ struct aer_recover_entry entry;
+ struct pci_dev *pdev;
+
+ while (kfifo_get(&aer_recover_ring, &entry)) {
+ pdev = pci_get_domain_bus_and_slot(entry.domain, entry.bus,
+ entry.devfn);
+ if (!pdev) {
+ pr_err("AER recover: Can not find pci_dev for %04x:%02x:%02x:%x\n",
+ entry.domain, entry.bus,
+ PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
+ continue;
+ }
+ do_recovery(pdev, entry.severity);
+ }
+}
+#endif
+
/**
* get_device_error_info - read error status from dev and store it to info
* @dev: pointer to the device expected to have a error record
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -204,7 +204,7 @@ void aer_print_port_info(struct pci_dev
}

#ifdef CONFIG_ACPI_APEI_PCIEAER
-static int cper_severity_to_aer(int cper_severity)
+int cper_severity_to_aer(int cper_severity)
{
switch (cper_severity) {
case CPER_SEV_RECOVERABLE:
@@ -215,6 +215,7 @@ static int cper_severity_to_aer(int cper
return AER_CORRECTABLE;
}
}
+EXPORT_SYMBOL_GPL(cper_severity_to_aer);

void cper_print_aer(const char *prefix, int cper_severity,
struct aer_capability_regs *aer)
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -51,5 +51,8 @@ static inline int pci_cleanup_aer_uncorr

extern void cper_print_aer(const char *prefix, int cper_severity,
struct aer_capability_regs *aer);
+extern int cper_severity_to_aer(int cper_severity);
+extern void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
+ int severity);
#endif //_AER_H_

2011-05-17 08:09:57

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 8/9] ACPI, APEI, GHES: Add PCIe AER recovery support

aer_recover_queue() is called when recoverable PCIe AER errors are
notified by firmware to do the recovery work.

Signed-off-by: Huang Ying <[email protected]>
---
drivers/acpi/apei/ghes.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -45,6 +45,8 @@
#include <linux/irq_work.h>
#include <linux/llist.h>
#include <linux/genalloc.h>
+#include <linux/pci.h>
+#include <linux/aer.h>
#include <acpi/apei.h>
#include <acpi/atomicio.h>
#include <acpi/hed.h>
@@ -446,6 +448,27 @@ static void ghes_do_proc(const struct ac
}
#endif
}
+#ifdef CONFIG_ACPI_APEI_PCIEAER
+ else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
+ CPER_SEC_PCIE)) {
+ struct cper_sec_pcie *pcie_err;
+ pcie_err = (struct cper_sec_pcie *)(gdata+1);
+ if (sev == GHES_SEV_RECOVERABLE &&
+ sec_sev == GHES_SEV_RECOVERABLE &&
+ pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+ pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
+ unsigned int devfn;
+ int aer_severity;
+ devfn = PCI_DEVFN(pcie_err->device_id.device,
+ pcie_err->device_id.function);
+ aer_severity = cper_severity_to_aer(sev);
+ aer_recover_queue(pcie_err->device_id.segment,
+ pcie_err->device_id.bus,
+ devfn, aer_severity);
+ }
+
+ }
+#endif
}
}

2011-05-17 08:09:23

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 9/9] ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled

erst_dbg module can not work with ERST is disabled. So disable module
loading to provide clearer information to user.

Signed-off-by: Huang Ying <[email protected]>
---
drivers/acpi/apei/erst-dbg.c | 4 ++++
1 file changed, 4 insertions(+)

--- a/drivers/acpi/apei/erst-dbg.c
+++ b/drivers/acpi/apei/erst-dbg.c
@@ -213,6 +213,10 @@ static struct miscdevice erst_dbg_dev =

static __init int erst_dbg_init(void)
{
+ if (erst_disable) {
+ pr_info(ERST_DBG_PFX "ERST support is disabled.\n");
+ return -ENODEV;
+ }
return misc_register(&erst_dbg_dev);
}

2011-05-17 08:46:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()


* Huang Ying <[email protected]> wrote:

> memory_failure() is the entry point for HWPoison memory error
> recovery. It must be called in process context. But commonly
> hardware memory errors are notified via MCE or NMI, so some delayed
> execution mechanism must be used. In MCE handler, a work queue + ring
> buffer mechanism is used.
>
> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
> (Generic Hardware Error Source) can be used to report memory errors
> too. To add support to APEI GHES memory recovery, a mechanism similar
> to that of MCE is implemented. memory_failure_queue() is the new
> entry point that can be called in IRQ context. The next step is to
> make MCE handler uses this interface too.
>
> Signed-off-by: Huang Ying <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ---
> include/linux/mm.h | 1
> mm/memory-failure.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 93 insertions(+)

I have to say i disagree with how this is designed and how this is exposed to
user-space - and i pointed this out before.

It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/
again ...

I just had a quick look into the current affairs of mm/memory-inject.c and it
has become an *even* nastier collection of hacks since the last time i
commented on its uglies.

Special hack upon special hack, totally disorganized code, special-purpose,
partly ioctl driven opaque information extraction to user-space using the
erst-dbg device interface. We have all the maintenance overhead and little of
the gains from hw error event features...

In this patch you add:

+struct memory_failure_entry {
+ unsigned long pfn;
+ int trapno;
+ int flags;
+};

Instead of exposing this event to other users who might be interested in these
events - such as the RAS daemon under development by Boris.

We have a proper framework (ring-buffer, NMI execution, etc.) for reporting
events, why are you not using (and extending) it instead of creating this nasty
looking, isolated, ACPI specific low level feature?

Thanks,

Ingo

2011-05-17 08:52:37

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()

On 05/17/2011 04:46 PM, Ingo Molnar wrote:
>
> * Huang Ying <[email protected]> wrote:
>
>> memory_failure() is the entry point for HWPoison memory error
>> recovery. It must be called in process context. But commonly
>> hardware memory errors are notified via MCE or NMI, so some delayed
>> execution mechanism must be used. In MCE handler, a work queue + ring
>> buffer mechanism is used.
>>
>> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
>> (Generic Hardware Error Source) can be used to report memory errors
>> too. To add support to APEI GHES memory recovery, a mechanism similar
>> to that of MCE is implemented. memory_failure_queue() is the new
>> entry point that can be called in IRQ context. The next step is to
>> make MCE handler uses this interface too.
>>
>> Signed-off-by: Huang Ying <[email protected]>
>> Cc: Andi Kleen <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> ---
>> include/linux/mm.h | 1
>> mm/memory-failure.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 93 insertions(+)
>
> I have to say i disagree with how this is designed and how this is exposed to
> user-space - and i pointed this out before.
>
> It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/
> again ...
>
> I just had a quick look into the current affairs of mm/memory-inject.c and it
> has become an *even* nastier collection of hacks since the last time i
> commented on its uglies.
>
> Special hack upon special hack, totally disorganized code, special-purpose,
> partly ioctl driven opaque information extraction to user-space using the
> erst-dbg device interface. We have all the maintenance overhead and little of
> the gains from hw error event features...

Like the name suggested, erst-dbg is only for debugging. It is not a
user space interface. The user space interface used by APEI now is printk.

> In this patch you add:
>
> +struct memory_failure_entry {
> + unsigned long pfn;
> + int trapno;
> + int flags;
> +};
>
> Instead of exposing this event to other users who might be interested in these
> events - such as the RAS daemon under development by Boris.
>
> We have a proper framework (ring-buffer, NMI execution, etc.) for reporting
> events, why are you not using (and extending) it instead of creating this nasty
> looking, isolated, ACPI specific low level feature?

This patch has nothing to do with hardware error event reporting. It is
just about hardware error recovering.

Best Regards,
Huang Ying

2011-05-17 09:26:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()


* Huang Ying <[email protected]> wrote:

> On 05/17/2011 04:46 PM, Ingo Molnar wrote:
> >
> > * Huang Ying <[email protected]> wrote:
> >
> >> memory_failure() is the entry point for HWPoison memory error
> >> recovery. It must be called in process context. But commonly
> >> hardware memory errors are notified via MCE or NMI, so some delayed
> >> execution mechanism must be used. In MCE handler, a work queue + ring
> >> buffer mechanism is used.
> >>
> >> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
> >> (Generic Hardware Error Source) can be used to report memory errors
> >> too. To add support to APEI GHES memory recovery, a mechanism similar
> >> to that of MCE is implemented. memory_failure_queue() is the new
> >> entry point that can be called in IRQ context. The next step is to
> >> make MCE handler uses this interface too.
> >>
> >> Signed-off-by: Huang Ying <[email protected]>
> >> Cc: Andi Kleen <[email protected]>
> >> Cc: Wu Fengguang <[email protected]>
> >> Cc: Andrew Morton <[email protected]>
> >> ---
> >> include/linux/mm.h | 1
> >> mm/memory-failure.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> 2 files changed, 93 insertions(+)
> >
> > I have to say i disagree with how this is designed and how this is exposed to
> > user-space - and i pointed this out before.
> >
> > It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/
> > again ...
> >
> > I just had a quick look into the current affairs of mm/memory-inject.c and it
> > has become an *even* nastier collection of hacks since the last time i
> > commented on its uglies.
> >
> > Special hack upon special hack, totally disorganized code, special-purpose,
> > partly ioctl driven opaque information extraction to user-space using the
> > erst-dbg device interface. We have all the maintenance overhead and little of
> > the gains from hw error event features...
>
> Like the name suggested, erst-dbg is only for debugging. [...]

Great, if printk does everything then can the debugging code be removed so that
tooling does not accidentally make non-debugging use of it? I can write a patch
for that.

> [...] It is not a user space interface. The user space interface used by
> APEI now is printk.

We definitely want printks obviously and primarily - often that is the only
thing the admin sees, and most of the time there's no automatable 'policy
action' anyway: human intervention is still the most common 'action' that is
performed on exceptional system events.

Does all the (unspecified) tooling you are enabling here work based off on
printk only, or does it perhaps make use of the erst-dbg hack? :-)

[ Wrt. printks we definitely would like to have a printk free-form-ASCII event
gateway for tooling wants to use printk events in the regular flow of events
that are not available via the syslog - Steve sent a print-string-event patch
for that some time ago and that works well. ]

> > In this patch you add:
> >
> > +struct memory_failure_entry {
> > + unsigned long pfn;
> > + int trapno;
> > + int flags;
> > +};
> >
> > Instead of exposing this event to other users who might be interested in these
> > events - such as the RAS daemon under development by Boris.
> >
> > We have a proper framework (ring-buffer, NMI execution, etc.) for reporting
> > events, why are you not using (and extending) it instead of creating this nasty
> > looking, isolated, ACPI specific low level feature?
>
> This patch has nothing to do with hardware error event reporting. It is just
> about hardware error recovering.

Hardware error event reporting and recovery go hand in hand. First is the
event, the second is the action.

Your structure demonstrates this already: it's called memory_failure_entry. It
does:

+ * This function is called by the low level hardware error handler
+ * when it detects hardware memory corruption of a page. It schedules
+ * the recovering of error page, including dropping pages, killing
+ * processes etc.

So based off an error event it does one from a short list of in-kernel policy
actions.

If put into a proper framework this would be a lot more widely useful: we could
for example trigger the killing of tasks (and other policy action) if other
(bad) events are triggered - not just the ones that fit into the narrow ACPI
scheme you have here.

Certain fatal IO errors would be an example, or SLAB memory corruptions or OOM
errors - or any other event we are able to report today.

So why are we not working towards integrating this into our event
reporting/handling framework, as i suggested it from day one on when you
started posting these patches?

Thanks,

Ingo

2011-05-18 01:10:20

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()

On 05/17/2011 05:26 PM, Ingo Molnar wrote:
>
> * Huang Ying <[email protected]> wrote:
>
>> On 05/17/2011 04:46 PM, Ingo Molnar wrote:
>>>
>>> * Huang Ying <[email protected]> wrote:
>>>
>>>> memory_failure() is the entry point for HWPoison memory error
>>>> recovery. It must be called in process context. But commonly
>>>> hardware memory errors are notified via MCE or NMI, so some delayed
>>>> execution mechanism must be used. In MCE handler, a work queue + ring
>>>> buffer mechanism is used.
>>>>
>>>> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
>>>> (Generic Hardware Error Source) can be used to report memory errors
>>>> too. To add support to APEI GHES memory recovery, a mechanism similar
>>>> to that of MCE is implemented. memory_failure_queue() is the new
>>>> entry point that can be called in IRQ context. The next step is to
>>>> make MCE handler uses this interface too.
>>>>
>>>> Signed-off-by: Huang Ying <[email protected]>
>>>> Cc: Andi Kleen <[email protected]>
>>>> Cc: Wu Fengguang <[email protected]>
>>>> Cc: Andrew Morton <[email protected]>
>>>> ---
>>>> include/linux/mm.h | 1
>>>> mm/memory-failure.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> 2 files changed, 93 insertions(+)
>>>
>>> I have to say i disagree with how this is designed and how this is exposed to
>>> user-space - and i pointed this out before.
>>>
>>> It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/
>>> again ...
>>>
>>> I just had a quick look into the current affairs of mm/memory-inject.c and it
>>> has become an *even* nastier collection of hacks since the last time i
>>> commented on its uglies.
>>>
>>> Special hack upon special hack, totally disorganized code, special-purpose,
>>> partly ioctl driven opaque information extraction to user-space using the
>>> erst-dbg device interface. We have all the maintenance overhead and little of
>>> the gains from hw error event features...
>>
>> Like the name suggested, erst-dbg is only for debugging. [...]
>
> Great, if printk does everything then can the debugging code be removed so that
> tooling does not accidentally make non-debugging use of it? I can write a patch
> for that.

The erst-dbg is only used for us to test whether the BIOS ERST
implementation works. If you have concerns about its mis-usage, how
about moving it to debugfs to make it clear that it is not a API, just
for debugging?

>> [...] It is not a user space interface. The user space interface used by
>> APEI now is printk.
>
> We definitely want printks obviously and primarily - often that is the only
> thing the admin sees, and most of the time there's no automatable 'policy
> action' anyway: human intervention is still the most common 'action' that is
> performed on exceptional system events.
>
> Does all the (unspecified) tooling you are enabling here work based off on
> printk only, or does it perhaps make use of the erst-dbg hack? :-)

The only tool makes use of erst-dbg is the debugging tool to test BIOS
ERST implementation. There is absolutely NO other tool I am enabling
uses erst-dbg.

> [ Wrt. printks we definitely would like to have a printk free-form-ASCII event
> gateway for tooling wants to use printk events in the regular flow of events
> that are not available via the syslog - Steve sent a print-string-event patch
> for that some time ago and that works well. ]

Thanks for your reminding, I will take a look at it.

>>> In this patch you add:
>>>
>>> +struct memory_failure_entry {
>>> + unsigned long pfn;
>>> + int trapno;
>>> + int flags;
>>> +};
>>>
>>> Instead of exposing this event to other users who might be interested in these
>>> events - such as the RAS daemon under development by Boris.
>>>
>>> We have a proper framework (ring-buffer, NMI execution, etc.) for reporting
>>> events, why are you not using (and extending) it instead of creating this nasty
>>> looking, isolated, ACPI specific low level feature?
>>
>> This patch has nothing to do with hardware error event reporting. It is just
>> about hardware error recovering.
>
> Hardware error event reporting and recovery go hand in hand. First is the
> event, the second is the action.
>
> Your structure demonstrates this already: it's called memory_failure_entry. It
> does:
>
> + * This function is called by the low level hardware error handler
> + * when it detects hardware memory corruption of a page. It schedules
> + * the recovering of error page, including dropping pages, killing
> + * processes etc.
>
> So based off an error event it does one from a short list of in-kernel policy
> actions.
>
> If put into a proper framework this would be a lot more widely useful: we could
> for example trigger the killing of tasks (and other policy action) if other
> (bad) events are triggered - not just the ones that fit into the narrow ACPI
> scheme you have here.
>
> Certain fatal IO errors would be an example, or SLAB memory corruptions or OOM
> errors - or any other event we are able to report today.
>
> So why are we not working towards integrating this into our event
> reporting/handling framework, as i suggested it from day one on when you
> started posting these patches?

The memory_failure_queue() introduced in this patch is general, that is,
it can be used not only by ACPI/APEI, but also any other hardware error
handlers, including your event reporting/handling framework.

Best Regards,
Huang Ying

2011-05-20 11:56:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()


* Huang Ying <[email protected]> wrote:

> > So why are we not working towards integrating this into our event
> > reporting/handling framework, as i suggested it from day one on when you
> > started posting these patches?
>
> The memory_failure_queue() introduced in this patch is general, that is, it
> can be used not only by ACPI/APEI, but also any other hardware error
> handlers, including your event reporting/handling framework.

Well, the bit you are steadfastly ignoring is what i have made clear well
before you started adding these facilities: THEY ALREADY EXISTS to a large
degree :-)

So you were and are duplicating code instead of using and extending existing
event processing facilities. It does not matter one little bit that the code
you added is partly 'generic', it's still overlapping and duplicated.

Thanks,

Ingo

2011-05-22 08:14:36

by huang ying

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()

On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <[email protected]> wrote:
>
> * Huang Ying <[email protected]> wrote:
>
>> > So why are we not working towards integrating this into our event
>> > reporting/handling framework, as i suggested it from day one on when you
>> > started posting these patches?
>>
>> The memory_failure_queue() introduced in this patch is general, that is, it
>> can be used not only by ACPI/APEI, but also any other hardware error
>> handlers, including your event reporting/handling framework.
>
> Well, the bit you are steadfastly ignoring is what i have made clear well
> before you started adding these facilities: THEY ALREADY EXISTS to a large
> degree :-)
>
> So you were and are duplicating code instead of using and extending existing
> event processing facilities. It does not matter one little bit that the code
> you added is partly 'generic', it's still overlapping and duplicated.

How to do hardware error recovering in your perf framework? IMHO, it
can be something as follow:

- NMI handler run for the hardware error, where hardware error
information is collected and put into a ring buffer, an irq_work is
triggered for further work
- In irq_work handler, memory_failure_queue() is called to do the real
recovering work for recoverable memory error in ring buffer.

What's your idea about hardware error recovering in perf?

Best Regards,
Huang Ying

2011-05-22 10:00:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()


* huang ying <[email protected]> wrote:

> On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <[email protected]> wrote:
> >
> > * Huang Ying <[email protected]> wrote:
> >
> >> > So why are we not working towards integrating this into our event
> >> > reporting/handling framework, as i suggested it from day one on when you
> >> > started posting these patches?
> >>
> >> The memory_failure_queue() introduced in this patch is general, that is, it
> >> can be used not only by ACPI/APEI, but also any other hardware error
> >> handlers, including your event reporting/handling framework.
> >
> > Well, the bit you are steadfastly ignoring is what i have made clear well
> > before you started adding these facilities: THEY ALREADY EXISTS to a large
> > degree :-)
> >
> > So you were and are duplicating code instead of using and extending existing
> > event processing facilities. It does not matter one little bit that the code
> > you added is partly 'generic', it's still overlapping and duplicated.
>
> How to do hardware error recovering in your perf framework? IMHO, it can be
> something as follow:
>
> - NMI handler run for the hardware error, where hardware error
> information is collected and put into a ring buffer, an irq_work is
> triggered for further work
> - In irq_work handler, memory_failure_queue() is called to do the real
> recovering work for recoverable memory error in ring buffer.
>
> What's your idea about hardware error recovering in perf?

The first step, the whole irq_work and ring buffer already looks largely
duplicated: you can collect into a perf event ring-buffer from NMI context like
the regular perf events do.

The generalization that *would* make sense is not at the irq_work level really,
instead we could generalize a 'struct event' for kernel internal producers and
consumers of events that have no explicit PMU connection.

This new 'struct event' would be slimmer and would only contain the fields and
features that generic event consumers and producers need. Tracing events could
be updated to use these kinds of slimmer events.

It would still plug nicely into existing event ABIs, would work with event
filters, etc. so the tooling side would remain focused and unified.

Something like that. It is rather clear by now that splitting out irq_work was
a mistake. But mistakes can be fixed and some really nice code could come out
of it! Would you be interested in looking into this?

Thanks,

Ingo

2011-05-22 12:32:26

by huang ying

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()

On Sun, May 22, 2011 at 6:00 PM, Ingo Molnar <[email protected]> wrote:
>
> * huang ying <[email protected]> wrote:
>
>> On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <[email protected]> wrote:
>> >
>> > * Huang Ying <[email protected]> wrote:
>> >
>> >> > So why are we not working towards integrating this into our event
>> >> > reporting/handling framework, as i suggested it from day one on when you
>> >> > started posting these patches?
>> >>
>> >> The memory_failure_queue() introduced in this patch is general, that is, it
>> >> can be used not only by ACPI/APEI, but also any other hardware error
>> >> handlers, including your event reporting/handling framework.
>> >
>> > Well, the bit you are steadfastly ignoring is what i have made clear well
>> > before you started adding these facilities: THEY ALREADY EXISTS to a large
>> > degree :-)
>> >
>> > So you were and are duplicating code instead of using and extending existing
>> > event processing facilities. It does not matter one little bit that the code
>> > you added is partly 'generic', it's still overlapping and duplicated.
>>
>> How to do hardware error recovering in your perf framework?  IMHO, it can be
>> something as follow:
>>
>> - NMI handler run for the hardware error, where hardware error
>> information is collected and put into a ring buffer, an irq_work is
>> triggered for further work
>> - In irq_work handler, memory_failure_queue() is called to do the real
>> recovering work for recoverable memory error in ring buffer.
>>
>> What's your idea about hardware error recovering in perf?
>
> The first step, the whole irq_work and ring buffer already looks largely
> duplicated: you can collect into a perf event ring-buffer from NMI context like
> the regular perf events do.

Why duplicated? perf uses the general irq_work too.

> The generalization that *would* make sense is not at the irq_work level really,
> instead we could generalize a 'struct event' for kernel internal producers and
> consumers of events that have no explicit PMU connection.
>
> This new 'struct event' would be slimmer and would only contain the fields and
> features that generic event consumers and producers need. Tracing events could
> be updated to use these kinds of slimmer events.
>
> It would still plug nicely into existing event ABIs, would work with event
> filters, etc. so the tooling side would remain focused and unified.
>
> Something like that. It is rather clear by now that splitting out irq_work was
> a mistake. But mistakes can be fixed and some really nice code could come out
> of it! Would you be interested in looking into this?

Yes. This can transfer hardware error data from kernel to user space.
Then, how to do hardware error recovering in this big picture? IMHO,
we will need to call something like memory_failure_queue() in IRQ
context for memory error.

Best Regards,
Huang Ying

2011-05-22 13:25:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()


* huang ying <[email protected]> wrote:

> On Sun, May 22, 2011 at 6:00 PM, Ingo Molnar <[email protected]> wrote:
> >
> > * huang ying <[email protected]> wrote:
> >
> >> On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <[email protected]> wrote:
> >> >
> >> > * Huang Ying <[email protected]> wrote:
> >> >
> >> >> > So why are we not working towards integrating this into our event
> >> >> > reporting/handling framework, as i suggested it from day one on when you
> >> >> > started posting these patches?
> >> >>
> >> >> The memory_failure_queue() introduced in this patch is general, that is, it
> >> >> can be used not only by ACPI/APEI, but also any other hardware error
> >> >> handlers, including your event reporting/handling framework.
> >> >
> >> > Well, the bit you are steadfastly ignoring is what i have made clear well
> >> > before you started adding these facilities: THEY ALREADY EXISTS to a large
> >> > degree :-)
> >> >
> >> > So you were and are duplicating code instead of using and extending existing
> >> > event processing facilities. It does not matter one little bit that the code
> >> > you added is partly 'generic', it's still overlapping and duplicated.
> >>
> >> How to do hardware error recovering in your perf framework? ?IMHO, it can be
> >> something as follow:
> >>
> >> - NMI handler run for the hardware error, where hardware error
> >> information is collected and put into a ring buffer, an irq_work is
> >> triggered for further work
> >> - In irq_work handler, memory_failure_queue() is called to do the real
> >> recovering work for recoverable memory error in ring buffer.
> >>
> >> What's your idea about hardware error recovering in perf?
> >
> > The first step, the whole irq_work and ring buffer already looks largely
> > duplicated: you can collect into a perf event ring-buffer from NMI context like
> > the regular perf events do.
>
> Why duplicated? perf uses the general irq_work too.

Yes, of course, because - if you still remember - Peter split irq_work out of
perf events:

e360adbe2924: irq_work: Add generic hardirq context callbacks

|
| Perf currently has such a mechanism, so extract that and provide it as a
| generic feature, independent of perf so that others may also benefit.
|

:-)

But in hindsight the level of abstraction (for this usecase) was set too low,
because we lose wider access to the actual events themselves:

> > The generalization that *would* make sense is not at the irq_work level
> > really, instead we could generalize a 'struct event' for kernel internal
> > producers and consumers of events that have no explicit PMU connection.
> >
> > This new 'struct event' would be slimmer and would only contain the fields
> > and features that generic event consumers and producers need. Tracing
> > events could be updated to use these kinds of slimmer events.
> >
> > It would still plug nicely into existing event ABIs, would work with event
> > filters, etc. so the tooling side would remain focused and unified.
> >
> > Something like that. It is rather clear by now that splitting out irq_work
> > was a mistake. But mistakes can be fixed and some really nice code could
> > come out of it! Would you be interested in looking into this?
>
> Yes. This can transfer hardware error data from kernel to user space. Then,
> how to do hardware error recovering in this big picture? IMHO, we will need
> to call something like memory_failure_queue() in IRQ context for memory
> error.

That's where 'active filters' come into the picture - see my other mail (that
was in the context of unidentified NMI errors/events) where i outlined how they
would work in this case and elsewhere. Via active filters we could share most
of the code, gain access to the events and still have kernel driven policy
action.

Thanks,

Ingo

2011-05-23 02:38:55

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()

On 05/22/2011 09:25 PM, Ingo Molnar wrote:
>>> The generalization that *would* make sense is not at the irq_work level
>>> really, instead we could generalize a 'struct event' for kernel internal
>>> producers and consumers of events that have no explicit PMU connection.
>>>
>>> This new 'struct event' would be slimmer and would only contain the fields
>>> and features that generic event consumers and producers need. Tracing
>>> events could be updated to use these kinds of slimmer events.
>>>
>>> It would still plug nicely into existing event ABIs, would work with event
>>> filters, etc. so the tooling side would remain focused and unified.
>>>
>>> Something like that. It is rather clear by now that splitting out irq_work
>>> was a mistake. But mistakes can be fixed and some really nice code could
>>> come out of it! Would you be interested in looking into this?
>>
>> Yes. This can transfer hardware error data from kernel to user space. Then,
>> how to do hardware error recovering in this big picture? IMHO, we will need
>> to call something like memory_failure_queue() in IRQ context for memory
>> error.
>
> That's where 'active filters' come into the picture - see my other mail (that
> was in the context of unidentified NMI errors/events) where i outlined how they
> would work in this case and elsewhere. Via active filters we could share most
> of the code, gain access to the events and still have kernel driven policy
> action.

Is that something as follow?

- NMI handler run for the hardware error, where hardware error
information is collected and put into perf ring buffer as 'event'.

- Some 'active filters' are run for each 'event' in NMI context.

- Some operations can not be done in NMI handler, so they are delayed to
an IRQ handler (can be done with something like irq_work).

- Some other 'active filters' are run for each 'event' in IRQ context.
(For memory error, we can call memory_failure_queue() here).

Where some 'active filters' are kernel built-in, some 'active filters'
can be customized via kernel command line or by user space.


If my understanding as above is correct, I think this is a general and
complex solution. It is a little hard for user to understand which
'active filters' are in effect. He may need some runtime assistant to
understand the code (maybe /sys/events/active_filters, which list all
filters in effect now), because that is hard only by reading the source
code. Anyway, this is a design style choice.

There are still some issues, I don't know how to solve in above framework.

- If there are two processes request the same type of hardware error
events. One hardware error event will be copied to two ring buffers
(each for one process), but the 'active filters' should be run only
once for each hardware error event.

- How to deal with ring-buffer overflow? For example, there is full of
corrected memory error in ring-buffer, and now a recoverable memory
error occurs but it can not be put into perf ring buffer because of
ring-buffer overflow, how to deal with the recoverable memory error?

Best Regards,
Huang Ying

2011-05-23 11:02:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()


* Huang Ying <[email protected]> wrote:

> > That's where 'active filters' come into the picture - see my other mail
> > (that was in the context of unidentified NMI errors/events) where i
> > outlined how they would work in this case and elsewhere. Via active filters
> > we could share most of the code, gain access to the events and still have
> > kernel driven policy action.
>
> Is that something as follow?
>
> - NMI handler run for the hardware error, where hardware error
> information is collected and put into perf ring buffer as 'event'.

Correct.

Note that for MCE errors we want the 'persistent event' framework Boris has
posted: we want these events to be buffered up to a point even if there is no
tool listening in on them:

- this gives us boot-time MCE error coverage

- this protects us against a logging daemon being restarted and events
getting lost

> - Some 'active filters' are run for each 'event' in NMI context.

Yeah. Whether it's a human-ASCII space 'filter' or really just a callback you
register with that event is secondary - both would work.

> - Some operations can not be done in NMI handler, so they are delayed to
> an IRQ handler (can be done with something like irq_work).

Yes.

> - Some other 'active filters' are run for each 'event' in IRQ context.
> (For memory error, we can call memory_failure_queue() here).

Correct.

> Where some 'active filters' are kernel built-in, some 'active filters' can be
> customized via kernel command line or by user space.

Yes.

> If my understanding as above is correct, I think this is a general and
> complex solution. It is a little hard for user to understand which 'active
> filters' are in effect. He may need some runtime assistant to understand the
> code (maybe /sys/events/active_filters, which list all filters in effect
> now), because that is hard only by reading the source code. Anyway, this is
> a design style choice.

I don't think it's complex: the built-in rules are in plain sight (can be in
the source code or can even be explicitly registered callbacks), the
configuration/tooling installed rules will be as complex as the admin or tool
wants them to be.

> There are still some issues, I don't know how to solve in above framework.
>
> - If there are two processes request the same type of hardware error
> events. One hardware error event will be copied to two ring buffers (each
> for one process), but the 'active filters' should be run only once for each
> hardware error event.

With persistent events 'active filters' should only be attached to the central
persistent event.

> - How to deal with ring-buffer overflow? For example, there is full of
> corrected memory error in ring-buffer, and now a recoverable memory error
> occurs but it can not be put into perf ring buffer because of ring-buffer
> overflow, how to deal with the recoverable memory error?

The solution is to make it large enough. With *every* queueing solution there
will be some sort of queue size limit.

Thanks,

Ingo

2011-05-23 16:45:54

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 5/9] HWPoison: add memory_failure_queue()

>> - NMI handler run for the hardware error, where hardware error
>> information is collected and put into perf ring buffer as 'event'.
>
> Correct.
>
> Note that for MCE errors we want the 'persistent event' framework Boris has
> posted: we want these events to be buffered up to a point even if there is no
> tool listening in on them:

This is a very opportune time to have this discussion. I've been working
on getting "in context" recoverable errors working. Sandybridge Server
platforms will allow for recovery for both instruction and data fetches
in the current execution context. These are flagged in the machine check
bank with the "AR" (Action Required) set to 1 (along with several other
bits making up a recognizable signature).

The critical feature here is that we must not return from the machine
check handler to the context that tripped over the error. In the case
of the data fault, we'll just re-execute the same access and take
another machine check. In the case of the instruction fault there is
no valid context to return to (MCGSTATUS.RIPV is zero).

There are a couple of cases where recovery is possible:

1) The memory error was found while executing user mode code.

The code I have now for recovery makes use of TIF_MCE_NOTIFY to
make sure that we don't return to the user, but instead end up
in arch/x86/kernel/signal.c:do_notify_resume() where we arrange
to have the process handle its own recovery (using mm/memory-failure.c
to figure out the type of page, and probably resulting in the mapping
out of the page and sending SIGBUS to the process).

In your proposed solution, we'd generate an event that would be handled
by some process/daemon ... but how would we ensure that the affected
process does not run in the mean time? Could we create some analogous
method to the ptrace stopped state, and hand control of the affected
process to the daemon that gets the event?

2) The memory error was found in certain special sections of the
kernel for which recovery is possible (e.g. while copying to/from
user memory, perhaps also page copy and page clear).

Here I don't have a solution. TIF_MCE_NOTIFY isn't checked when returning
from do_machine_check() to kernel code.

In a CONFIG_PREEMPT=y kernel, all of the recoverable cases ought to be
in places where pre-emption is allowed ... so perhaps we can also use
the stop-and-switch option here?

-Tony

2011-05-24 02:10:35

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()

On 05/23/2011 07:01 PM, Ingo Molnar wrote:
>> If my understanding as above is correct, I think this is a general and
>> complex solution. It is a little hard for user to understand which 'active
>> filters' are in effect. He may need some runtime assistant to understand the
>> code (maybe /sys/events/active_filters, which list all filters in effect
>> now), because that is hard only by reading the source code. Anyway, this is
>> a design style choice.
>
> I don't think it's complex: the built-in rules are in plain sight (can be in
> the source code or can even be explicitly registered callbacks), the
> configuration/tooling installed rules will be as complex as the admin or tool
> wants them to be.
>
>> There are still some issues, I don't know how to solve in above framework.
>>
>> - If there are two processes request the same type of hardware error
>> events. One hardware error event will be copied to two ring buffers (each
>> for one process), but the 'active filters' should be run only once for each
>> hardware error event.
>
> With persistent events 'active filters' should only be attached to the central
> persistent event.

OK. I see.

>> - How to deal with ring-buffer overflow? For example, there is full of
>> corrected memory error in ring-buffer, and now a recoverable memory error
>> occurs but it can not be put into perf ring buffer because of ring-buffer
>> overflow, how to deal with the recoverable memory error?
>
> The solution is to make it large enough. With *every* queueing solution there
> will be some sort of queue size limit.

Another solution could be:

Create two ring-buffer. One is for logging and will be read by RAS
daemon; the other is for recovering, the event record will be removed
from the ring-buffer after all 'active filters' have been run on it.
Even RAS daemon being restarted or hang, recoverable error can be taken
cared of.

Best Regards,
Huang Ying

2011-05-24 02:52:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()


* Huang Ying <[email protected]> wrote:

> >> - How to deal with ring-buffer overflow? For example, there is full of
> >> corrected memory error in ring-buffer, and now a recoverable memory error
> >> occurs but it can not be put into perf ring buffer because of ring-buffer
> >> overflow, how to deal with the recoverable memory error?
> >
> > The solution is to make it large enough. With *every* queueing solution there
> > will be some sort of queue size limit.
>
> Another solution could be:
>
> Create two ring-buffer. One is for logging and will be read by RAS
> daemon; the other is for recovering, the event record will be removed
> from the ring-buffer after all 'active filters' have been run on it.
> Even RAS daemon being restarted or hang, recoverable error can be taken
> cared of.

Well, filters will always be executed since they execute when the event is
inserted - not when it's extracted.

So if you worry about losing *filter* executions (and dependent policy action)
- there should be no loss there, ever.

But yes, the scheme you outline would work as well: a counting-only event with
a filter specified - this will do no buffering at all.

So ... to get the ball rolling in this area one of you guys active in RAS
should really try a first approximation for the active filter approach: add a
test-TRACE_EVENT() for the errors you are interested in and define a convenient
way to register policy action with post-filter events. This should work even
without having the 'active' portion defined at the ABI and filter-string level.

Thanks,

Ingo

2011-05-24 03:08:03

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()

On 05/24/2011 10:48 AM, Ingo Molnar wrote:
>
> * Huang Ying <[email protected]> wrote:
>
>>>> - How to deal with ring-buffer overflow? For example, there is full of
>>>> corrected memory error in ring-buffer, and now a recoverable memory error
>>>> occurs but it can not be put into perf ring buffer because of ring-buffer
>>>> overflow, how to deal with the recoverable memory error?
>>>
>>> The solution is to make it large enough. With *every* queueing solution there
>>> will be some sort of queue size limit.
>>
>> Another solution could be:
>>
>> Create two ring-buffer. One is for logging and will be read by RAS
>> daemon; the other is for recovering, the event record will be removed
>> from the ring-buffer after all 'active filters' have been run on it.
>> Even RAS daemon being restarted or hang, recoverable error can be taken
>> cared of.
>
> Well, filters will always be executed since they execute when the event is
> inserted - not when it's extracted.

For filters executed in NMI context, they can be executed when the event
is inserted, no need for buffering. But for filters executed in
deferred IRQ context, they need to be executed when event's extracted.

> So if you worry about losing *filter* executions (and dependent policy action)
> - there should be no loss there, ever.
>
> But yes, the scheme you outline would work as well: a counting-only event with
> a filter specified - this will do no buffering at all.
>
> So ... to get the ball rolling in this area one of you guys active in RAS
> should really try a first approximation for the active filter approach: add a
> test-TRACE_EVENT() for the errors you are interested in and define a convenient
> way to register policy action with post-filter events. This should work even
> without having the 'active' portion defined at the ABI and filter-string level.

Best Regards,
Huang Ying

2011-05-24 04:24:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()


* Huang Ying <[email protected]> wrote:

> On 05/24/2011 10:48 AM, Ingo Molnar wrote:
> >
> > * Huang Ying <[email protected]> wrote:
> >
> >>>> - How to deal with ring-buffer overflow? For example, there is full of
> >>>> corrected memory error in ring-buffer, and now a recoverable memory error
> >>>> occurs but it can not be put into perf ring buffer because of ring-buffer
> >>>> overflow, how to deal with the recoverable memory error?
> >>>
> >>> The solution is to make it large enough. With *every* queueing solution there
> >>> will be some sort of queue size limit.
> >>
> >> Another solution could be:
> >>
> >> Create two ring-buffer. One is for logging and will be read by RAS
> >> daemon; the other is for recovering, the event record will be removed
> >> from the ring-buffer after all 'active filters' have been run on it.
> >> Even RAS daemon being restarted or hang, recoverable error can be taken
> >> cared of.
> >
> > Well, filters will always be executed since they execute when the event is
> > inserted - not when it's extracted.
>
> For filters executed in NMI context, they can be executed when the event
> is inserted, no need for buffering. But for filters executed in
> deferred IRQ context, they need to be executed when event's extracted.

Correct.

Thanks,

Ingo

2011-05-25 07:41:32

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()

(2011/05/22 19:00), Ingo Molnar wrote:
> * huang ying <[email protected]> wrote:
>> How to do hardware error recovering in your perf framework? IMHO, it can be
>> something as follow:
>>
>> - NMI handler run for the hardware error, where hardware error
>> information is collected and put into a ring buffer, an irq_work is
>> triggered for further work
>> - In irq_work handler, memory_failure_queue() is called to do the real
>> recovering work for recoverable memory error in ring buffer.
>>
>> What's your idea about hardware error recovering in perf?
>
> The first step, the whole irq_work and ring buffer already looks largely
> duplicated: you can collect into a perf event ring-buffer from NMI context like
> the regular perf events do.
>
> The generalization that *would* make sense is not at the irq_work level really,
> instead we could generalize a 'struct event' for kernel internal producers and
> consumers of events that have no explicit PMU connection.
>
> This new 'struct event' would be slimmer and would only contain the fields and
> features that generic event consumers and producers need. Tracing events could
> be updated to use these kinds of slimmer events.
>
> It would still plug nicely into existing event ABIs, would work with event
> filters, etc. so the tooling side would remain focused and unified.
>
> Something like that. It is rather clear by now that splitting out irq_work was
> a mistake. But mistakes can be fixed and some really nice code could come out
> of it! Would you be interested in looking into this?

Err...?

Then is it better to write some nice code and throw away the following patch?


Thanks,
H.Seto

=====

[PATCH] x86, mce: replace MCE_SELF_VECTOR by irq_work

Use provided generic mechanism.

Signed-off-by: Hidetoshi Seto <[email protected]>
---
arch/x86/include/asm/entry_arch.h | 4 ---
arch/x86/include/asm/hw_irq.h | 1 -
arch/x86/include/asm/irq_vectors.h | 5 ----
arch/x86/kernel/cpu/mcheck/mce.c | 47 ++++-------------------------------
arch/x86/kernel/entry_64.S | 5 ----
arch/x86/kernel/irqinit.c | 3 --
6 files changed, 6 insertions(+), 59 deletions(-)

diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 1cd6d26..0baa628 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -53,8 +53,4 @@ BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR)
BUILD_INTERRUPT(threshold_interrupt,THRESHOLD_APIC_VECTOR)
#endif

-#ifdef CONFIG_X86_MCE
-BUILD_INTERRUPT(mce_self_interrupt,MCE_SELF_VECTOR)
-#endif
-
#endif
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index bb9efe8..13f5504 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -34,7 +34,6 @@ extern void irq_work_interrupt(void);
extern void spurious_interrupt(void);
extern void thermal_interrupt(void);
extern void reschedule_interrupt(void);
-extern void mce_self_interrupt(void);

extern void invalidate_interrupt(void);
extern void invalidate_interrupt0(void);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 6e976ee..6665026 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -109,11 +109,6 @@

#define UV_BAU_MESSAGE 0xf5

-/*
- * Self IPI vector for machine checks
- */
-#define MCE_SELF_VECTOR 0xf4
-
/* Xen vector callback to receive events in a HVM domain */
#define XEN_HVM_EVTCHN_CALLBACK 0xf3

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ff1ae9b..e81d48b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -10,7 +10,6 @@
#include <linux/thread_info.h>
#include <linux/capability.h>
#include <linux/miscdevice.h>
-#include <linux/interrupt.h>
#include <linux/ratelimit.h>
#include <linux/kallsyms.h>
#include <linux/rcupdate.h>
@@ -38,12 +37,9 @@
#include <linux/mm.h>
#include <linux/debugfs.h>
#include <linux/edac_mce.h>
+#include <linux/irq_work.h>

#include <asm/processor.h>
-#include <asm/hw_irq.h>
-#include <asm/apic.h>
-#include <asm/idle.h>
-#include <asm/ipi.h>
#include <asm/mce.h>
#include <asm/msr.h>

@@ -461,22 +457,13 @@ static inline void mce_get_rip(struct mce *m, struct pt_regs *regs)
m->ip = mce_rdmsrl(rip_msr);
}

-#ifdef CONFIG_X86_LOCAL_APIC
-/*
- * Called after interrupts have been reenabled again
- * when a MCE happened during an interrupts off region
- * in the kernel.
- */
-asmlinkage void smp_mce_self_interrupt(struct pt_regs *regs)
+DEFINE_PER_CPU(struct irq_work, mce_irq_work);
+
+static void mce_irq_work_cb(struct irq_work *entry)
{
- ack_APIC_irq();
- exit_idle();
- irq_enter();
mce_notify_irq();
mce_schedule_work();
- irq_exit();
}
-#endif

static void mce_report_event(struct pt_regs *regs)
{
@@ -492,29 +479,7 @@ static void mce_report_event(struct pt_regs *regs)
return;
}

-#ifdef CONFIG_X86_LOCAL_APIC
- /*
- * Without APIC do not notify. The event will be picked
- * up eventually.
- */
- if (!cpu_has_apic)
- return;
-
- /*
- * When interrupts are disabled we cannot use
- * kernel services safely. Trigger an self interrupt
- * through the APIC to instead do the notification
- * after interrupts are reenabled again.
- */
- apic->send_IPI_self(MCE_SELF_VECTOR);
-
- /*
- * Wait for idle afterwards again so that we don't leave the
- * APIC in a non idle state because the normal APIC writes
- * cannot exclude us.
- */
- apic_wait_icr_idle();
-#endif
+ irq_work_queue(&__get_cpu_var(mce_irq_work));
}

DEFINE_PER_CPU(unsigned, mce_poll_count);
@@ -1444,7 +1409,7 @@ void __cpuinit mcheck_cpu_init(struct cpuinfo_x86 *c)
__mcheck_cpu_init_vendor(c);
__mcheck_cpu_init_timer();
INIT_WORK(&__get_cpu_var(mce_work), mce_process_work);
-
+ init_irq_work(&__get_cpu_var(mce_irq_work), &mce_irq_work_cb);
}

/*
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 8a445a0..9fa6546 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -991,11 +991,6 @@ apicinterrupt THRESHOLD_APIC_VECTOR \
apicinterrupt THERMAL_APIC_VECTOR \
thermal_interrupt smp_thermal_interrupt

-#ifdef CONFIG_X86_MCE
-apicinterrupt MCE_SELF_VECTOR \
- mce_self_interrupt smp_mce_self_interrupt
-#endif
-
#ifdef CONFIG_SMP
apicinterrupt CALL_FUNCTION_SINGLE_VECTOR \
call_function_single_interrupt smp_call_function_single_interrupt
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index f470e4e..f09d4bb 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -272,9 +272,6 @@ static void __init apic_intr_init(void)
#ifdef CONFIG_X86_MCE_THRESHOLD
alloc_intr_gate(THRESHOLD_APIC_VECTOR, threshold_interrupt);
#endif
-#if defined(CONFIG_X86_MCE) && defined(CONFIG_X86_LOCAL_APIC)
- alloc_intr_gate(MCE_SELF_VECTOR, mce_self_interrupt);
-#endif

#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
/* self generated IPI for local APIC timer */
--
1.7.4.4

2011-05-25 14:09:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()


* Luck, Tony <[email protected]> wrote:

> In your proposed solution, we'd generate an event that would be
> handled by some process/daemon ... but how would we ensure that the
> affected process does not run in the mean time? Could we create
> some analogous method to the ptrace stopped state, and hand control
> of the affected process to the daemon that gets the event?

Ok, i think there is a bit of a misunderstanding here - which is not
a surprise really: we made generic arguments all along with very few
specifics.

The RAS daemon would deal with 'slow' policy action: fully recovered
events. It would also log various events so that people can do post
mortem etc.

The main point of defining events here is so that there's a single
method of transport and a single flexible method of defining and
extracting events.

Some of the event processing would occur in the kernel: in code that
knows about memory_failure() and calls it while making sure we do not
execute any user-space instruction.

Some of the code would execute *very* early and in a very atomic way,
still in NMI context: panicing the box if the error is so severe.

Neither of these are steps that the RAS daemon can or wants to
handle.

The RAS tools would interact with the regular perf facilities setting
and configuring the various RAS related events. They'd handle the
'severity' config bits, they'd initiate testing (injection), etc.

Ideally the RAS daemon and tools would do what syslog does (and
more), with more structured events. In the end of the day most of the
'policy action' is taken by humans anyway, who want to take a look at
some ASCII output. So printk() integration and obvious ASCII output
for everything is important along the way.

> 2) The memory error was found in certain special sections of the
> kernel for which recovery is possible (e.g. while copying to/from
> user memory, perhaps also page copy and page clear).
>
> Here I don't have a solution. TIF_MCE_NOTIFY isn't checked when
> returning from do_machine_check() to kernel code.

Well, since we are already in interrupt context (albeit in a very
atomic NMI context), sending a self-IPI is not strictly necessary. We
could fix up the return address and jump to the right handler
straight away during the IRET.

A self-IPI might also not execute *immediately* - there's always the
chance of APIC related delays.

> In a CONFIG_PREEMPT=y kernel, all of the recoverable cases ought to
> be in places where pre-emption is allowed ... so perhaps we can
> also use the stop-and-switch option here?

Yes, these are generally preemptible cases - and if they are not we
can make the error fatal (we do not have to handle *every* complex
case, giving up is a fair answer as well - we do not want rare code
to be complex really).

But you don't need to stop-and-switch: just stack-nesting on top of
whatever preemptible code was running there would be enough, wouldnt
it? That stops a task from executing until the decision has been made
whether it can continue or not.

Thanks,

Ingo

2011-05-25 14:11:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()


* Hidetoshi Seto <[email protected]> wrote:

> (2011/05/22 19:00), Ingo Molnar wrote:
> > * huang ying <[email protected]> wrote:
> >> How to do hardware error recovering in your perf framework? IMHO, it can be
> >> something as follow:
> >>
> >> - NMI handler run for the hardware error, where hardware error
> >> information is collected and put into a ring buffer, an irq_work is
> >> triggered for further work
> >> - In irq_work handler, memory_failure_queue() is called to do the real
> >> recovering work for recoverable memory error in ring buffer.
> >>
> >> What's your idea about hardware error recovering in perf?
> >
> > The first step, the whole irq_work and ring buffer already looks largely
> > duplicated: you can collect into a perf event ring-buffer from NMI context like
> > the regular perf events do.
> >
> > The generalization that *would* make sense is not at the irq_work level really,
> > instead we could generalize a 'struct event' for kernel internal producers and
> > consumers of events that have no explicit PMU connection.
> >
> > This new 'struct event' would be slimmer and would only contain the fields and
> > features that generic event consumers and producers need. Tracing events could
> > be updated to use these kinds of slimmer events.
> >
> > It would still plug nicely into existing event ABIs, would work with event
> > filters, etc. so the tooling side would remain focused and unified.
> >
> > Something like that. It is rather clear by now that splitting out irq_work was
> > a mistake. But mistakes can be fixed and some really nice code could come out
> > of it! Would you be interested in looking into this?
>
> Err...?
>
> Then is it better to write some nice code and throw away the following patch?

No, i think your patch is already a pretty nice simplification of the
MCE code - using irq_work is obviously better than the open-coded MCE
vector approach!

These are exactly the kind of small steps towards generalizations
that i wanted to see: each step without being intrusive and breaking
stuff and working towards improving the status quo.

Thanks,

Ingo

2011-05-26 01:34:36

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [PATCH 5/9] HWPoison: add memory_failure_queue()

(2011/05/25 23:11), Ingo Molnar wrote:
> No, i think your patch is already a pretty nice simplification of the
> MCE code - using irq_work is obviously better than the open-coded MCE
> vector approach!
>
> These are exactly the kind of small steps towards generalizations
> that i wanted to see: each step without being intrusive and breaking
> stuff and working towards improving the status quo.

Thank you, Ingo!

I have some minor patches for mce codes too so I'll post this and those
in new thread soon.

Thanks,
H.Seto

2011-05-29 06:55:19

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH 0/9] ACPI, APEI patches for 2.6.40

> [PATCH 1/9] Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG
> [PATCH 2/9] lib, Add lock-less NULL terminated single list
> [PATCH 3/9] lib, Make gen_pool memory allocator lockless
> [PATCH 4/9] ACPI, APEI, GHES, printk support for recoverable error via NMI

I think the 4 above should go upstream to enable the
printk for recoverable errors. If queuing messages
from NMI context gets re-implemented in the future,
then so be it.

I'll skip the rest, however, except #9:-)

thanks,
-Len

> [PATCH 5/9] HWPoison: add memory_failure_queue()
> [PATCH 6/9] ACPI, APEI, GHES: Add hardware memory error recovery support
> [PATCH 7/9] PCIe, AER, add aer_recover_queue
> [PATCH 8/9] ACPI, APEI, GHES: Add PCIe AER recovery support
> [PATCH 9/9] ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled
>

2011-05-29 11:31:16

by huang ying

[permalink] [raw]
Subject: Re: [PATCH 0/9] ACPI, APEI patches for 2.6.40

On Sun, May 29, 2011 at 2:55 PM, Len Brown <[email protected]> wrote:
>> [PATCH 1/9] Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG
>> [PATCH 2/9] lib, Add lock-less NULL terminated single list
>> [PATCH 3/9] lib, Make gen_pool memory allocator lockless
>> [PATCH 4/9] ACPI, APEI, GHES, printk support for recoverable error via NMI
>
> I think the 4 above should go upstream to enable the
> printk for recoverable errors.  If queuing messages
> from NMI context gets re-implemented in the future,
> then so be it.

The GHES records are queued just for printk. It has no contradiction
with perf based logging.

> I'll skip the rest, however, except #9:-)
>
> thanks,
> -Len
>
>> [PATCH 5/9] HWPoison: add memory_failure_queue()
>> [PATCH 6/9] ACPI, APEI, GHES: Add hardware memory error recovery support
>> [PATCH 7/9] PCIe, AER, add aer_recover_queue
>> [PATCH 8/9] ACPI, APEI, GHES: Add PCIe AER recovery support
>> [PATCH 9/9] ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled

I think #5 and #7 just some general code that can be used by perf
based recovering too. Because we are still waiting for perf based
hardware error logging/recovering infrastructure. I think we can take
#6 and #8 for now and maybe replace them with perf based solution when
that is ready.

BTW: Can you take a look at another patch with following title?

[PATCH] ACPI, APEI, Add APEI _OSC support

https://lkml.org/lkml/2011/5/25/28

This is needed to enable GHES reporting in firmware on some machines.

Best Regards,
Huang Ying

2011-05-30 06:49:10

by Chen, Gong

[permalink] [raw]
Subject: Re: [PATCH 0/9] ACPI, APEI patches for 2.6.40

于 5/29/2011 2:55 PM, Len Brown 写道:
>> [PATCH 1/9] Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG
>> [PATCH 2/9] lib, Add lock-less NULL terminated single list
>> [PATCH 3/9] lib, Make gen_pool memory allocator lockless
>> [PATCH 4/9] ACPI, APEI, GHES, printk support for recoverable error via NMI
>
> I think the 4 above should go upstream to enable the
> printk for recoverable errors. If queuing messages
> from NMI context gets re-implemented in the future,
> then so be it.
>
> I'll skip the rest, however, except #9:-)
>
> thanks,
> -Len
>
>> [PATCH 5/9] HWPoison: add memory_failure_queue()
>> [PATCH 6/9] ACPI, APEI, GHES: Add hardware memory error recovery support
>> [PATCH 7/9] PCIe, AER, add aer_recover_queue
>> [PATCH 8/9] ACPI, APEI, GHES: Add PCIe AER recovery support
>> [PATCH 9/9] ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled
>>
> --

Hi, Len

Would you please help to review and merge my patch to fix an issue for
ERST debug module ? Here is the link:
https://patchwork.kernel.org/patch/802662/

2011-06-01 18:50:00

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 7/9] PCIe, AER, add aer_recover_queue

On Tue, 17 May 2011 16:08:37 +0800
Huang Ying <[email protected]> wrote:

> In addition to native PCIe AER, now APEI (ACPI Platform Error
> Interface) GHES (Generic Hardware Error Source) can be used to report
> PCIe AER errors too. To add support to APEI GHES PCIe AER recovery,
> aer_recover_queue is added to export the recovery function in native
> PCIe AER driver.
>
> Recoverable PCIe AER errors are reported via NMI in APEI GHES. Then
> APEI GHES uses irq_work to delay the error processing into an IRQ
> handler. But PCIe AER recovery can be very time-consuming, so
> aer_recover_queue, which can be used in IRQ handler, delays the real
> recovery action into the process context, that is, work queue.

Have the dependencies for this already landed? Or does this one need
to go first?

Thanks,
--
Jesse Barnes, Intel Open Source Technology Center

2011-06-02 05:09:33

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 7/9] PCIe, AER, add aer_recover_queue

Hi, Jesse,

On 06/02/2011 02:49 AM, Jesse Barnes wrote:
> On Tue, 17 May 2011 16:08:37 +0800
> Huang Ying <[email protected]> wrote:
>
>> In addition to native PCIe AER, now APEI (ACPI Platform Error
>> Interface) GHES (Generic Hardware Error Source) can be used to report
>> PCIe AER errors too. To add support to APEI GHES PCIe AER recovery,
>> aer_recover_queue is added to export the recovery function in native
>> PCIe AER driver.
>>
>> Recoverable PCIe AER errors are reported via NMI in APEI GHES. Then
>> APEI GHES uses irq_work to delay the error processing into an IRQ
>> handler. But PCIe AER recovery can be very time-consuming, so
>> aer_recover_queue, which can be used in IRQ handler, delays the real
>> recovery action into the process context, that is, work queue.
>
> Have the dependencies for this already landed? Or does this one need
> to go first?

This patch has no external dependencies and can go first.

Best Regards,
Huang Ying

2011-06-02 15:12:43

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 7/9] PCIe, AER, add aer_recover_queue

On Tue, 17 May 2011 16:08:37 +0800
Huang Ying <[email protected]> wrote:

> In addition to native PCIe AER, now APEI (ACPI Platform Error
> Interface) GHES (Generic Hardware Error Source) can be used to report
> PCIe AER errors too. To add support to APEI GHES PCIe AER recovery,
> aer_recover_queue is added to export the recovery function in native
> PCIe AER driver.
>
> Recoverable PCIe AER errors are reported via NMI in APEI GHES. Then
> APEI GHES uses irq_work to delay the error processing into an IRQ
> handler. But PCIe AER recovery can be very time-consuming, so
> aer_recover_queue, which can be used in IRQ handler, delays the real
> recovery action into the process context, that is, work queue.

Applied to linux-next, thanks.

--
Jesse Barnes, Intel Open Source Technology Center