2014-07-01 19:23:53

by Borislav Petkov

Subject: [PATCH -v3 0/4] RAS: Correctable Errors Collector thing

From: Borislav Petkov <[email protected]>

Ok,

the next version.

The main change from the last version is that we now have a ce_ring to
which MCEs get logged in atomic context first and from which they are then,
in process context, put into the CEC, just like it is done with the
mce_ring.

Also, the decay of the elements in the CEC happens not only after
CLEAN_ELEMS insertions of new elements anymore: incrementing the count of
an already inserted element counts towards that threshold too. We want to
do that because otherwise we wouldn't be fair in aging the elements.

Constructive feedback is, as always, appreciated.

Thanks.

Changelog:

so here's v2 with the feedback from last time addressed (... hopefully).
This is on top of Gong's extlog stuff which is currently a moving target,
but I've based this work on it as we're slowly starting to relocate
generic RAS stuff into drivers/ras/.

A couple of points I was thinking about which we should talk about:

* This version automatically removes the oldest element from the array
when it gets full. With 512 PFNs max size, I think we should be ok.

* If the CEC (let's call this thing that) can perform all needed/required
RAS actions, we should not forward correctable errors to userspace because
userspace simply doesn't need them. Unless there is something more we
want to do in userspace... we could make it configurable, dunno.
This version simply collects the errors and does the soft offlining,
thus issuing to dmesg something like this:

[ 520.872376] RAS: Soft-offlining pfn: 0xdead
[ 520.874384] soft offline: 0xdead page already poisoned

I'm not sure what we want to do with this info - we need to think about
it more but we're flexible there so... :-)

My main reasoning behind not forwarding every single correctable error
is that we don't want to upset the user unnecessarily and cause those
expensive support calls.

* Concerning policy: at which error count should we soft-offline a page,
should we make that configurable or not, and what would the interface be?
We still don't know, and we probably need to talk about that too. Right
now, using 10 bits for that count feels right. The count gets decayed
anyway.

But, do we need to run it on lotsa live systems and hear feedback?
Definitely.

* As to why we're putting this in the kernel and enabling it by default:
a userspace daemon is much more fragile than doing this in the kernel.
And regardless of distro, everyone gets this.

Borislav Petkov (4):
x86, MCE: Make the mce_ring explicit
RAS: Add a Corrected Errors Collector
MCE, CE: Wire in the CE collector
MCE, CE: Add debugging glue

arch/x86/kernel/cpu/mcheck/mce.c | 132 +++++++++++++---
drivers/ras/Kconfig | 11 ++
drivers/ras/Makefile | 3 +-
drivers/ras/ce.c | 322 +++++++++++++++++++++++++++++++++++++++
include/linux/ras.h | 3 +
5 files changed, 450 insertions(+), 21 deletions(-)
create mode 100644 drivers/ras/ce.c

--
2.0.0


2014-07-01 19:23:55

by Borislav Petkov

Subject: [PATCH -v3 3/4] MCE, CE: Wire in the CE collector

From: Borislav Petkov <[email protected]>

Add the CE collector to the polling path which collects the correctable
errors. Collect only DRAM ECC errors for now.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 84 ++++++++++++++++++++++++++++++++++++----
1 file changed, 76 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 4c0167070e2e..a15a09b29ed0 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -36,6 +36,7 @@
#include <linux/nmi.h>
#include <linux/cpu.h>
#include <linux/smp.h>
+#include <linux/ras.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/debugfs.h>
@@ -475,6 +476,9 @@ struct mce_ring {
};
static DEFINE_PER_CPU(struct mce_ring, mce_ring);

+/* This gets all correctable errors. */
+static DEFINE_PER_CPU(struct mce_ring, ce_ring);
+
/* Runs with CPU affinity in workqueue */
static inline int mce_ring_empty(struct mce_ring *r)
{
@@ -522,7 +526,8 @@ int mce_available(struct cpuinfo_x86 *c)

static void mce_schedule_work(void)
{
- if (!mce_ring_empty(&__get_cpu_var(mce_ring)))
+ if (!mce_ring_empty(&__get_cpu_var(mce_ring)) ||
+ !mce_ring_empty(&__get_cpu_var( ce_ring)))
schedule_work(&__get_cpu_var(mce_work));
}

@@ -574,6 +579,57 @@ static void mce_read_aux(struct mce *m, int i)

DEFINE_PER_CPU(unsigned, mce_poll_count);

+static bool dram_ce_error(struct mce *m)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+
+ if (c->x86_vendor == X86_VENDOR_AMD) {
+ /* ErrCodeExt[20:16] */
+ u8 xec = (m->status >> 16) & 0x1f;
+
+ return (xec == 0x0 || xec == 0x8);
+ } else if (c->x86_vendor == X86_VENDOR_INTEL)
+ /*
+ * Tony: "You need to look at the low 16 bits of "status"
+ * (the MCACOD) field and see which is the most significant bit
+ * set (ignoring bit 12, the "filter" bit). If the answer is
+ * bit 7 - then this is a memory error. But you can't just
+ * blindly check bit 7 because if bit 8 is set, then this is a
+ * cache error, and if bit 11 is set, then it is a bus/ inter-
+ * connect error - and either way bit 7 just gives more detail
+ * on what cache/bus/interconnect error happened."
+ */
+ return (m->status & 0xef80) == BIT(7);
+ else
+ return false;
+}
+
+static void __log_ce(struct mce *m, enum mcp_flags flags)
+{
+ /*
+ * Don't get the IP here because it's unlikely to have anything to do
+ * with the actual error location.
+ */
+ if ((flags & MCP_DONTLOG) || mca_cfg.dont_log_ce)
+ return;
+
+ if (dram_ce_error(m)) {
+ /*
+ * In the cases where we don't have a valid address after all,
+ * do not collect but log.
+ */
+ if (!(m->status & MCI_STATUS_ADDRV))
+ goto log;
+
+ mce_ring_add(&__get_cpu_var(ce_ring), m->addr >> PAGE_SHIFT);
+ return;
+ }
+
+log:
+ mce_log(m);
+}
+
/*
* Poll for corrected events or events that happened before reset.
* Those are just logged through /dev/mcelog.
@@ -627,12 +683,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)

if (!(flags & MCP_TIMESTAMP))
m.tsc = 0;
- /*
- * Don't get the IP here because it's unlikely to
- * have anything to do with the actual error location.
- */
- if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce)
- mce_log(&m);
+
+ __log_ce(&m, flags);

/*
* Clear state for this bank.
@@ -1229,6 +1281,10 @@ static void mce_process_work(struct work_struct *dummy)

while (mce_ring_get(&__get_cpu_var(mce_ring), &pfn))
memory_failure(pfn, MCE_VECTOR, 0);
+
+ /* Now process CEs too. */
+ while (mce_ring_get(&__get_cpu_var(ce_ring), &pfn))
+ ce_add_elem(pfn);
}

#ifdef CONFIG_X86_MCE_INTEL
@@ -2554,5 +2610,17 @@ static int __init mcheck_debugfs_init(void)

return 0;
}
-late_initcall(mcheck_debugfs_init);
+#else
+static int __init mcheck_debugfs_init(void) { return 0; }
#endif
+
+static int __init mcheck_late_init(void)
+{
+ if (mcheck_debugfs_init())
+ pr_err("Error creating debugfs nodes!\n");
+
+ ce_init();
+
+ return 0;
+}
+late_initcall(mcheck_late_init);
--
2.0.0

2014-07-01 19:23:54

by Borislav Petkov

Subject: [PATCH -v3 1/4] x86, MCE: Make the mce_ring explicit

From: Borislav Petkov <[email protected]>

We want to hand down a ring on which to operate, so make it an
argument.

Use the obvious mce_ring_empty() in mce_ring_get() for better
readability.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 22 ++++++++++------------
1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 4fc57975acc1..4c0167070e2e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -476,23 +476,21 @@ struct mce_ring {
static DEFINE_PER_CPU(struct mce_ring, mce_ring);

/* Runs with CPU affinity in workqueue */
-static int mce_ring_empty(void)
+static inline int mce_ring_empty(struct mce_ring *r)
{
- struct mce_ring *r = &__get_cpu_var(mce_ring);
-
return r->start == r->end;
}

-static int mce_ring_get(unsigned long *pfn)
+static int mce_ring_get(struct mce_ring *r, unsigned long *pfn)
{
- struct mce_ring *r;
int ret = 0;

*pfn = 0;
get_cpu();
- r = &__get_cpu_var(mce_ring);
- if (r->start == r->end)
+
+ if (mce_ring_empty(r))
goto out;
+
*pfn = r->ring[r->start];
r->start = (r->start + 1) % MCE_RING_SIZE;
ret = 1;
@@ -502,9 +500,8 @@ out:
}

/* Always runs in MCE context with preempt off */
-static int mce_ring_add(unsigned long pfn)
+static int mce_ring_add(struct mce_ring *r, unsigned long pfn)
{
- struct mce_ring *r = &__get_cpu_var(mce_ring);
unsigned next;

next = (r->end + 1) % MCE_RING_SIZE;
@@ -525,7 +522,7 @@ int mce_available(struct cpuinfo_x86 *c)

static void mce_schedule_work(void)
{
- if (!mce_ring_empty())
+ if (!mce_ring_empty(&__get_cpu_var(mce_ring)))
schedule_work(&__get_cpu_var(mce_work));
}

@@ -1122,7 +1119,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
* RED-PEN don't ignore overflow for mca_cfg.tolerant == 0
*/
if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
- mce_ring_add(m.addr >> PAGE_SHIFT);
+ mce_ring_add(&__get_cpu_var(mce_ring),
+ m.addr >> PAGE_SHIFT);

mce_log(&m);

@@ -1229,7 +1227,7 @@ static void mce_process_work(struct work_struct *dummy)
{
unsigned long pfn;

- while (mce_ring_get(&pfn))
+ while (mce_ring_get(&__get_cpu_var(mce_ring), &pfn))
memory_failure(pfn, MCE_VECTOR, 0);
}

--
2.0.0

2014-07-01 19:24:40

by Borislav Petkov

Subject: [PATCH -v3 4/4] MCE, CE: Add debugging glue

From: Borislav Petkov <[email protected]>

For testing purposes only.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 28 +++++++++++++++++++++++++++-
drivers/ras/ce.c | 28 +++++++++++++++++++++++++++-
include/linux/ras.h | 1 +
3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a15a09b29ed0..4a17e1d86ae7 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1418,6 +1418,8 @@ int mce_notify_irq(void)
/* Not more than two messages every minute */
static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);

+ mce_schedule_work();
+
if (test_and_clear_bit(0, &mce_need_notify)) {
/* wake processes polling /dev/mcelog */
wake_up_interruptible(&mce_chrdev_wait);
@@ -2596,9 +2598,29 @@ static int fake_panic_set(void *data, u64 val)
DEFINE_SIMPLE_ATTRIBUTE(fake_panic_fops, fake_panic_get,
fake_panic_set, "%llu\n");

+static u64 cec_pfn;
+
+static int cec_pfn_get(void *data, u64 *val)
+{
+ *val = cec_pfn;
+
+ ce_dump_array();
+
+ return 0;
+}
+
+static int cec_pfn_set(void *data, u64 val)
+{
+ cec_pfn = val;
+
+ return ce_add_elem(val);
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(cec_pfn_ops, cec_pfn_get, cec_pfn_set, "0x%llx\n");
+
static int __init mcheck_debugfs_init(void)
{
- struct dentry *dmce, *ffake_panic;
+ struct dentry *dmce, *ffake_panic, *fcec_pfn;

dmce = mce_get_debugfs_dir();
if (!dmce)
@@ -2608,6 +2630,10 @@ static int __init mcheck_debugfs_init(void)
if (!ffake_panic)
return -ENOMEM;

+ fcec_pfn = debugfs_create_file("cec_pfn", 0400, dmce, NULL, &cec_pfn_ops);
+ if (!fcec_pfn)
+ return -ENOMEM;
+
return 0;
}
#else
diff --git a/drivers/ras/ce.c b/drivers/ras/ce.c
index 4cab8e8a4ef0..c3c7936527f1 100644
--- a/drivers/ras/ce.c
+++ b/drivers/ras/ce.c
@@ -94,6 +94,26 @@ static struct ce_array {

static DEFINE_MUTEX(ce_mutex);

+void ce_dump_array(void)
+{
+ struct ce_array *ca = &ce_arr;
+ u64 prev = 0;
+ int i;
+
+ pr_info("{ n: %d\n", ca->n);
+ for (i = 0; i < ca->n; i++) {
+ u64 this = PFN(ca->array[i]);
+
+ pr_info(" %03d: [%016llu|%03llx]\n", i, this, FULL_COUNT(ca->array[i]));
+
+ WARN_ON(prev > this);
+
+ prev = this;
+ }
+
+ pr_info("}\n");
+}
+
/*
* Decrement decay value. We're using DECAY_BITS bits to denote decay of an
* element in the array. On insertion and any access, it gets maxed out.
@@ -180,6 +200,7 @@ static u64 del_lru_elem_unlocked(struct ce_array *ca)
{
unsigned int min = FULL_COUNT_MASK;
int i, min_idx = 0;
+ u64 pfn;

for (i = 0; i < ca->n; i++) {
unsigned int this = FULL_COUNT(ca->array[i]);
@@ -189,9 +210,14 @@ static u64 del_lru_elem_unlocked(struct ce_array *ca)
}
}

+ pfn = PFN(ca->array[min_idx]);
+
+ pr_err("%s: Deleting ca[%d]: %016llu|%03x\n",
+ __func__, min_idx, pfn, min);
+
__del_elem(ca, min_idx);

- return PFN(ca->array[min_idx]);
+ return pfn;
}

/*
diff --git a/include/linux/ras.h b/include/linux/ras.h
index 24e82fb0fe99..a9aed3900453 100644
--- a/include/linux/ras.h
+++ b/include/linux/ras.h
@@ -13,4 +13,5 @@ static inline int ras_add_daemon_trace(void) { return 0; }

void __init ce_init(void);
int ce_add_elem(u64 pfn);
+void ce_dump_array(void);
#endif
--
2.0.0

2014-07-01 19:24:39

by Borislav Petkov

Subject: [PATCH -v3 2/4] RAS: Add a Corrected Errors Collector

From: Borislav Petkov <[email protected]>

A simple data structure for collecting correctable errors along with
accessors. Larger description in the code itself.

Signed-off-by: Borislav Petkov <[email protected]>
---
drivers/ras/Kconfig | 11 ++
drivers/ras/Makefile | 3 +-
drivers/ras/ce.c | 296 +++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ras.h | 2 +
4 files changed, 311 insertions(+), 1 deletion(-)
create mode 100644 drivers/ras/ce.c

diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index f9da613052c2..c6977b943506 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -1,2 +1,13 @@
config RAS
bool
+
+config RAS_CE
+ bool "Correctable Errors Collector"
+ default y if X86_MCE
+ select MEMORY_FAILURE
+ ---help---
+ This is a small cache which collects correctable memory errors per 4K page
+ PFN and counts their repeated occurrence. Once the counter for a PFN overflows,
+ we try to soft-offline that page, as we take it to mean that it has reached a
+ relatively high error count and it would probably be best if we don't use it
+ anymore.
diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
index d7f73341ced3..7e7f3f9aa948 100644
--- a/drivers/ras/Makefile
+++ b/drivers/ras/Makefile
@@ -1 +1,2 @@
-obj-$(CONFIG_RAS) += ras.o debugfs.o
+obj-$(CONFIG_RAS) += ras.o debugfs.o
+obj-$(CONFIG_RAS_CE) += ce.o
diff --git a/drivers/ras/ce.c b/drivers/ras/ce.c
new file mode 100644
index 000000000000..4cab8e8a4ef0
--- /dev/null
+++ b/drivers/ras/ce.c
@@ -0,0 +1,296 @@
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+
+#include <asm/bug.h>
+
+/*
+ * RAS Correctable Errors Collector
+ *
+ * This is a simple gadget which collects correctable errors and counts their
+ * occurrence per physical page address.
+ *
+ * We've opted for possibly the simplest data structure to collect those - an
+ * array of the size of a memory page. It stores 512 u64's with the following
+ * structure:
+ *
+ * [63 ... PFN ... 12 | 11 ... generation ... 10 | 9 ... count ... 0]
+ *
+ * The generation is the two bits above the count (bits 11:10); they are set
+ * to 11b on every insertion. During the course of this entry's existence,
+ * the generation gets decremented during spring cleaning to 10b, then 01b
+ * and then 00b.
+ *
+ * This way we're employing the numeric ordering to make sure that newly
+ * inserted/touched elements have higher 12-bit counts (which we've
+ * manufactured) and thus iterating over the array initially won't kick out
+ * those last inserted elements.
+ *
+ * Spring cleaning is what we do after every CLEAN_ELEMS insertion/incrementation
+ * operations on the page; during it, we decay all elements.
+ * If, after decay, an element gets inserted again, its generation is set to 11b
+ * to make sure it has a higher numerical count than other, older elements and
+ * thus emulates an LRU-like behavior when deleting elements to free up space
+ * in the page.
+ *
+ * When an element reaches its max count of COUNT_MASK, we try to poison it by
+ * assuming that errors triggered COUNT_MASK times in a single page are
+ * excessive and that page shouldn't be used anymore.
+ *
+ * As to why we've chosen a page and why we move elements around with memmove:
+ * it is because it is a very simple structure to handle and the max data
+ * movement is 4K, which on highly optimized modern CPUs is almost unnoticeable.
+ * We wanted to avoid the pointer traversal of more complex structures like a
+ * linked list or some sort of a balancing search tree.
+ *
+ * Deleting an element takes O(n), but since it is only a single page, it should
+ * be fast enough and it shouldn't happen all too often, depending on error
+ * patterns.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) "RAS: " fmt
+
+/*
+ * We use DECAY_BITS of the lower PAGE_SHIFT bits for counting decay, i.e., how
+ * long elements have stayed in the array without being accessed again.
+ */
+#define DECAY_BITS 2
+#define DECAY_MASK ((1ULL << DECAY_BITS) - 1)
+#define MAX_ELEMS (PAGE_SIZE / sizeof(u64))
+
+/*
+ * Threshold amount of inserted elements after which we start spring
+ * cleaning.
+ */
+#define CLEAN_ELEMS (MAX_ELEMS >> DECAY_BITS)
+
+/* Bits which count the number of errors that happened in this 4K page. */
+#define COUNT_BITS (PAGE_SHIFT - DECAY_BITS)
+#define COUNT_MASK ((1ULL << COUNT_BITS) - 1)
+#define FULL_COUNT_MASK (PAGE_SIZE - 1)
+
+/*
+ * u64: [ 63 ... 12 | DECAY_BITS | COUNT_BITS ]
+ */
+
+#define PFN(e) ((e) >> PAGE_SHIFT)
+#define DECAY(e) (((e) >> COUNT_BITS) & DECAY_MASK)
+#define COUNT(e) ((unsigned int)(e) & COUNT_MASK)
+#define FULL_COUNT(e) ((e) & (PAGE_SIZE - 1))
+
+static struct ce_array {
+ u64 *array; /* container page */
+ unsigned n; /* number of elements in the array */
+
+ unsigned decay_count; /*
+ * number of element insertions/incrementations
+ * since the last spring cleaning.
+ */
+} ce_arr;
+/* ^^^^^
+ * |
+ * | This variable is passed in internally from the API functions.
+ */
+
+static DEFINE_MUTEX(ce_mutex);
+
+/*
+ * Decrement decay value. We're using DECAY_BITS bits to denote decay of an
+ * element in the array. On insertion and any access, it gets maxed out.
+ */
+static void do_spring_cleaning(struct ce_array *ca)
+{
+ int i;
+
+ for (i = 0; i < ca->n; i++) {
+ u8 decay = DECAY(ca->array[i]);
+
+ if (!decay)
+ continue;
+
+ decay--;
+
+ ca->array[i] &= ~(DECAY_MASK << COUNT_BITS);
+ ca->array[i] |= (decay << COUNT_BITS);
+ }
+ ca->decay_count = 0;
+}
+
+/*
+ * @to: index of the smallest element which is >= @pfn.
+ *
+ * Return the index of the pfn if found, otherwise a negative value.
+ */
+static int __find_elem(struct ce_array *ca, u64 pfn, unsigned *to)
+{
+ u64 this_pfn;
+ int min = 0, max = ca->n;
+
+ while (min < max) {
+ int tmp = (max + min) >> 1;
+
+ this_pfn = PFN(ca->array[tmp]);
+
+ if (this_pfn < pfn)
+ min = tmp + 1;
+ else if (this_pfn > pfn)
+ max = tmp;
+ else {
+ min = tmp;
+ break;
+ }
+ }
+
+ if (to)
+ *to = min;
+
+ /* Don't read past the last element when @pfn sorts after all entries. */
+ if (min == ca->n)
+ return -ENOKEY;
+
+ this_pfn = PFN(ca->array[min]);
+
+ if (this_pfn == pfn)
+ return min;
+
+ return -ENOKEY;
+}
+
+static int find_elem(struct ce_array *ca, u64 pfn, unsigned *to)
+{
+ WARN_ON(!to);
+
+ if (!ca->n) {
+ *to = 0;
+ return -ENOKEY;
+ }
+ return __find_elem(ca, pfn, to);
+}
+
+static void __del_elem(struct ce_array *ca, int idx)
+{
+ /*
+ * Save us a function call when deleting the last element.
+ */
+ if (ca->n - (idx + 1))
+ memmove((void *)&ca->array[idx],
+ (void *)&ca->array[idx + 1],
+ (ca->n - (idx + 1)) * sizeof(u64));
+
+ ca->n--;
+}
+
+static u64 del_lru_elem_unlocked(struct ce_array *ca)
+{
+ unsigned int min = FULL_COUNT_MASK;
+ int i, min_idx = 0;
+
+ for (i = 0; i < ca->n; i++) {
+ unsigned int this = FULL_COUNT(ca->array[i]);
+ if (min > this) {
+ min = this;
+ min_idx = i;
+ }
+ }
+
+ __del_elem(ca, min_idx);
+
+ return PFN(ca->array[min_idx]);
+}
+
+/*
+ * We return the 0th pfn in the error case under the assumption that it cannot
+ * be poisoned and excessive CEs in there are a serious deal anyway.
+ */
+static u64 __maybe_unused del_lru_elem(void)
+{
+ struct ce_array *ca = &ce_arr;
+ u64 pfn;
+
+ if (!ca->n)
+ return 0;
+
+ mutex_lock(&ce_mutex);
+ pfn = del_lru_elem_unlocked(ca);
+ mutex_unlock(&ce_mutex);
+
+ return pfn;
+}
+
+
+int ce_add_elem(u64 pfn)
+{
+ struct ce_array *ca = &ce_arr;
+ unsigned to;
+ int count, ret = 0;
+
+ /*
+ * We can be called very early on the identify_cpu path where we are not
+ * initialized yet. We ignore the error for simplicity.
+ */
+ if (!ce_arr.array)
+ return 0;
+
+ mutex_lock(&ce_mutex);
+
+ if (ca->n == MAX_ELEMS)
+ WARN_ON(!del_lru_elem_unlocked(ca));
+
+ ret = find_elem(ca, pfn, &to);
+ if (ret < 0) {
+ /*
+ * Shift range [to-end] to make room for one more element.
+ */
+ memmove((void *)&ca->array[to + 1],
+ (void *)&ca->array[to],
+ (ca->n - to) * sizeof(u64));
+
+ ca->array[to] = (pfn << PAGE_SHIFT) |
+ (DECAY_MASK << COUNT_BITS) | 1;
+
+ ca->n++;
+ ret = 0;
+
+ goto decay;
+ }
+
+ count = COUNT(ca->array[to]);
+
+ if (count < COUNT_MASK) {
+ ca->array[to] |= (DECAY_MASK << COUNT_BITS);
+ ca->array[to]++;
+
+ goto decay;
+ } else {
+ u64 pfn = ca->array[to] >> PAGE_SHIFT;
+
+ /*
+ * We have reached max count for this page, soft-offline it.
+ */
+ pr_err("Soft-offlining pfn: 0x%llx\n", pfn);
+ memory_failure_queue(pfn, 0, MF_SOFT_OFFLINE);
+ __del_elem(ca, to);
+
+ ret = 0;
+
+ goto unlock;
+ }
+
+decay:
+ ca->decay_count++;
+
+ if (ca->decay_count >= CLEAN_ELEMS)
+ do_spring_cleaning(ca);
+
+unlock:
+ mutex_unlock(&ce_mutex);
+
+ return ret;
+}
+
+void __init ce_init(void)
+{
+ ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
+ if (!ce_arr.array) {
+ pr_err("Error allocating CE array page!\n");
+ return;
+ }
+
+ pr_info("Correctable Errors collector initialized.\n");
+}
diff --git a/include/linux/ras.h b/include/linux/ras.h
index 2aceeafd6fe5..24e82fb0fe99 100644
--- a/include/linux/ras.h
+++ b/include/linux/ras.h
@@ -11,4 +11,6 @@ static inline void ras_debugfs_init(void) { return; }
static inline int ras_add_daemon_trace(void) { return 0; }
#endif

+void __init ce_init(void);
+int ce_add_elem(u64 pfn);
#endif
--
2.0.0

2014-07-05 10:55:13

by Borislav Petkov

Subject: Re: [PATCH -v3 2/4] RAS: Add a Corrected Errors Collector

On Tue, Jul 01, 2014 at 09:23:41PM +0200, Borislav Petkov wrote:
> From: Borislav Petkov <[email protected]>
>
> A simple data structure for collecting correctable errors along with
> accessors. Larger description in the code itself.
>
> Signed-off-by: Borislav Petkov <[email protected]>
> ---
> drivers/ras/Kconfig | 11 ++
> drivers/ras/Makefile | 3 +-
> drivers/ras/ce.c | 296 +++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ras.h | 2 +
> 4 files changed, 311 insertions(+), 1 deletion(-)
> create mode 100644 drivers/ras/ce.c
>
> diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
> index f9da613052c2..c6977b943506 100644
> --- a/drivers/ras/Kconfig
> +++ b/drivers/ras/Kconfig
> @@ -1,2 +1,13 @@
> config RAS
> bool
> +
> +config RAS_CE
> + bool "Correctable Errors Collector"
> + default y if X86_MCE
> + select MEMORY_FAILURE

This has an unmet Kconfig dependency on 32-bit due to MEMORY_FAILURE not
being enabled there. I re-added the RAS menuconfig stuff along with Gong's
nice RAS description. It will be in the next version.
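One possible shape of that fix (a sketch only; the actual v4 Kconfig may
well look different) is to turn the select into a dependency, so the option
simply can't be enabled where MEMORY_FAILURE isn't available:

```kconfig
config RAS_CE
	bool "Correctable Errors Collector"
	depends on X86_MCE && MEMORY_FAILURE
	default y
	---help---
	  A small cache which collects correctable memory errors per 4K page
	  PFN and counts their repeated occurrence.
```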

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-07-07 17:00:20

by Max Asbock

Subject: Re: [PATCH -v3 3/4] MCE, CE: Wire in the CE collector

On Tue, 2014-07-01 at 21:23 +0200, Borislav Petkov wrote:
> From: Borislav Petkov <[email protected]>
>
> Add the CE collector to the polling path which collects the correctable
> errors. Collect only DRAM ECC errors for now.
>
> Signed-off-by: Borislav Petkov <[email protected]>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 84 ++++++++++++++++++++++++++++++++++++----
> 1 file changed, 76 insertions(+), 8 deletions(-)
>

> +
> +static void __log_ce(struct mce *m, enum mcp_flags flags)
> +{
> + /*
> + * Don't get the IP here because it's unlikely to have anything to do
> + * with the actual error location.
> + */

The above comment doesn't belong here. This function is about how to
dispose of corrected errors, not what data should be collected.
(Besides this comment is obsolete as the IP is always collected in
mce_gather_info()).

> + if ((flags & MCP_DONTLOG) || mca_cfg.dont_log_ce)
> + return;
> +

> + if (dram_ce_error(m)) {
> + /*
> + * In the cases where we don't have a valid address after all,
> + * do not collect but log.
> + */
> + if (!(m->status & MCI_STATUS_ADDRV))
> + goto log;
> +
> + mce_ring_add(&__get_cpu_var(ce_ring), m->addr >> PAGE_SHIFT);
> + return;
> + }
> +
> +log:
> + mce_log(m);
> +}
The above code is a bit convoluted; it amounts to:

if (we have a corrected dram error && we have an address for it)
mce_ring_add()
else
mcelog()

Is that the intention? This might be problematic for downstream
consumers of the errors, such as the EDAC drivers, which keep counts of
errors. If errors are silently removed from the stream, these counts will
be bogus. Somebody might wonder why a page was off-lined while the EDAC
driver reports zero corrected DRAM error counts.

- Max

2014-07-07 22:09:23

by Borislav Petkov

Subject: Re: [PATCH -v3 3/4] MCE, CE: Wire in the CE collector

On Mon, Jul 07, 2014 at 10:00:11AM -0700, Max Asbock wrote:
> The above code is a bit convoluted, it amounts to:
>
> if (we have a corrected dram error && we have an address for it)
> mce_ring_add()
> else
> mcelog()
>
> Is that the intention?

Yes.

> This might be problematic for downstream consumers of the errors
> such as the EDAC drivers which keep counts of errors. If errors are
> silently removed from the stream these counts will be bogus. Somebody
> might wonder why a page was off-lined while the EDAC driver reports
> zero corrected DRAM error counts.

This is one of the things that needs discussion. First of all, this
solution aims to be generic enough, even for boxes which don't have an
EDAC driver.

Because if you don't have an EDAC driver, you'd still want to have some
sort of error mitigation.

Granted, most boxes *do* have an EDAC driver but still...

Then the question becomes very interesting: how much do we want
EDAC to get involved in the whole deal? Because if we're doing the error
collecting, decaying and page offlining in the kernel, I don't see why
we would need EDAC at all. Or any other agent for that matter.

Maybe for the counters only, if we have to be fair. If, say, people
would want to monitor error levels or so.

I guess we can add some interface which only increments the error counts
without the other error handling EDAC normally would do...

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-12-17 02:26:36

by Calvin Owens

Subject: Re: [PATCH -v3 0/4] RAS: Correctable Errors Collector thing

On Tue, Jul 1, 2014 at 12:23 PM, Borislav Petkov <[email protected]> wrote:
> Ok,
>
> the next version.
>
> Main changes from the last one are that we have a ce_ring now to which
> MCEs get logged in atomic context first and then, in process context,
> put into the CEC, just like this is done with the mce_ring.
>
> Also, the decay of the elements in the CEC happens not after CLEAN_ELEMS
> insertions of new elements only but the incrementation of an already
> inserted element counts too. We want to do that because otherwise we're
> not fair in aging the elements.
>
> Constructive feedback is, as always, appreciated.
>
> Thanks.
>
> Changelog:
>
> so here's v2 with the feedback from last time addressed (... hopefully).
> This is ontop of Gong's extlog stuff which is currently a moving target
> but I've based this stuff on it as we're starting slowly to relocate
> generic RAS stuff into drivers/ras/.
>
> A couple of points I was thinking about which we should talk about:
>
> * This version automatically removes the oldest element from the array
> when it gets full. With 512 PFNs max size, I think we should be ok.
>
> * If CEC (let's call this thing that) can perform all RAS actions
> needed/required, we should not forward correctable errors to userspace
> because it simply doesn't need to. Unless there is something more we
> want to do in userspace... we could make it configurable, dunno.

Hmm. I can definitely imagine that in a scenario where you're testing
hardware you would want to know about all the corrected errors in
userspace. You could just tail dmesg watching for the message below, but
that's somewhat crappy.

Also, figuring out what physical DIMM on the motherboard a given
physical address is on is rather messy as I understand it, since it
varies between manufacturers. I'm not sure supporting that inside the
kernel is a good idea, so people who care about this would still need
some way to get the errors in userspace too.

> This version simply collects the errors and does the soft offlining,
> thus issuing to dmesg something like this:
>
> [ 520.872376] RAS: Soft-offlining pfn: 0xdead
> [ 520.874384] soft offline: 0xdead page already poisoned
>
> I'm not sure what we want to do with this info - we need to think about
> it more but we're flexible there so... :-)

Somehow exposing the array tracking the errors could be interesting,
although I'm not sure how useful that would actually be in practice.
That would also get more complicated as this starts to handle things
like corrected cache and bus errors.

> My main reasoning behind not forwarding each single correctable error
> is that we don't want to upset the user unnecessarily and cause those
> expensive support calls.
>
> * Concerning policy and at which error count we should soft-offline a
> page and whether we should make it configurable or not and what the
> interface would be: we still don't know and we probably need to talk
> about it too. Right now, using 10 bits for that count feels right. The
> count gets decayed anyway.

This should definitely be configurable IMO: different people will want
to manage this in different ways. We're very aggressive about offlining
pages with corrected errors, for example.

> But, do we need to run it on lotsa live systems and hear feedback?
> Definitely.

I'll keep an eye out for buggy machines to test on ;)

> * As to why we're putting this in the kernel and enabling it by default:
> a userspace daemon is much more fragile than doing this in the kernel.
> And regardless of distro, everyone gets this.

I very much agree.

Thanks,
Calvin

> Borislav Petkov (4):
> x86, MCE: Make the mce_ring explicit
> RAS: Add a Corrected Errors Collector
> MCE, CE: Wire in the CE collector
> MCE, CE: Add debugging glue
>
> arch/x86/kernel/cpu/mcheck/mce.c | 132 +++++++++++++---
> drivers/ras/Kconfig | 11 ++
> drivers/ras/Makefile | 3 +-
> drivers/ras/ce.c | 322 +++++++++++++++++++++++++++++++++++++++
> include/linux/ras.h | 3 +
> 5 files changed, 450 insertions(+), 21 deletions(-)
> create mode 100644 drivers/ras/ce.c
>
> --
> 2.0.0

2014-12-17 21:17:39

by Borislav Petkov

Subject: Re: [PATCH -v3 0/4] RAS: Correctable Errors Collector thing

On Tue, Dec 16, 2014 at 06:26:03PM -0800, Calvin Owens wrote:
> Hmm. I can definitely imagine that in a scenario where you're testing
> hardware you would want to know about all the corrected errors in
> userspace. You could just tail dmesg watching for the message below,
> but that's somewhat crappy.

Oh yeah, we have the tracepoints for that - structured error logs. And
yes, we will definitely have a cec=disable or similar cmdline switch to
turn it off.

> Also, figuring out what physical DIMM on the motherboard a given
> physical address is on is rather messy as I understand it, since it
> varies between manufacturers. I'm not sure supporting that inside the
> kernel is a good idea, so people who care about this would still need
> some way to get the errors in userspace too.

Right, the edac drivers are all an attempt to do that pinpointing. It
doesn't always work optimally though.

There's also drivers/acpi/acpi_extlog.c which happens with fw support
and should be much more useful; it is a new thing from Intel though and
needs to spread out first.

> Somehow exposing the array tracking the errors could be interesting,
> although I'm not sure how useful that would actually be in practice.

Yeah, that's in debugfs, see the 4th patch: "[PATCH -v3 4/4] MCE, CE:
Add debugging glue".

> That would also get more complicated as this starts to handle things
> like corrected cache and bus errors.

Right, I'm not sure what we even want to do with those, if anything. And
what rates those come in at, and whether we can even do proper recovery -
and what kind of recovery - using them. This thing wants to deal with memory errors
only, for now at least.

> This should definitely be configurable IMO: different people will
> want to manage this in different ways. We're very aggressive about
> offlining pages with corrected errors, for example.

Ok, that sounds interesting. So you're saying, you would want to
configure the overflow count at which to offline the page. What else?
Decay time too?

Currently, we run do_spring_cleaning() when the fill levels reach
CLEAN_ELEMS, i.e., every time we do CLEAN_ELEMS insertion/incrementation
operations, we decay the currently present elements.

I can imagine cases where we want to control those aspects, like waiting
until the array fills up with PFNs and running the decaying only then. Or
not running the decaying at all and offlining pages the moment they reach
saturation. And so on and so on...

> I'll keep an eye out for buggy machines to test on ;)

Cool, thanks.

> > * As to why we're putting this in the kernel and enabling it by default:
> > a userspace daemon is much more fragile than doing this in the kernel.
> > And regardless of distro, everyone gets this.
>
> I very much agree.

Cool, thanks for the insights.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--