LinuxLists.cc - [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

2009-09-18 10:20:42

Subject: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

Current MCE log ring buffer has following bugs and issues:

- On larger systems the 32 size buffer easily overflow, losing events.

- We had some reports of events getting corrupted which were also
blamed on the ring buffer.

- There's a known livelock, now hit by more people, under high error
rate.

We fix these bugs and issues via making MCE log ring buffer as
lock-less per-CPU ring buffer.

- One ring buffer for each CPU

+ MCEs are added to corresponding local per-CPU buffer, instead of
one big global buffer. Contention/unfairness between CPUs is
eleminated.

+ MCE records are read out and removed from per-CPU buffers by mutex
protected global reader function. Because there are no many
readers in system to contend in most cases.

- Per-CPU ring buffer data structure

+ An array is used to hold MCE records. integer "head" indicates
next writing position and integer "tail" indicates next reading
position.

+ To distinguish buffer empty and full, head and tail wrap to 0 at
MCE_LOG_LIMIT instead of MCE_LOG_LEN. Then the real next writing
position is head % MCE_LOG_LEN, and real next reading position is
tail % MCE_LOG_LEN. If buffer is empty, head == tail, if buffer is
full, head % MCE_LOG_LEN == tail % MCE_LOG_LEN and head != tail.

- Lock-less for writer side

+ MCE log writer may come from NMI, so the writer side must be
lock-less. For per-CPU buffer of one CPU, writers may come from
process, IRQ or NMI context, so "head" is increased with
cmpxchg_local() to allocate buffer space.

+ Reader side is protected with a mutex to guarantee only one reader
is active in the whole system.

Performance test show that the throughput of per-CPU mcelog buffer can
reach 430k records/s compared with 5.3k records/s for original
implementation on a 2-core 2.1GHz Core2 machine.

ChangeLog:

v7:

- Rebased on latest x86-tip.git/x86/mce

- Revise mce print code.

v6:

- Rebased on latest x86-tip.git/x86/mce + x86/tip

v5:

- Rebased on latest x86-tip.git/mce4.

- Fix parsing mce_log data structure in dumping kernel.

v4:

- Rebased on x86-tip.git x86/mce3 branch

- Fix a synchronization issue about mce.finished

- Fix comment style

v3:

- Use DEFINE_PER_CPU to allocate per_cpu mcelog buffer.

- Use cond_resched() to prevent possible system not reponsing issue
for large user buffer.

v2:

- Use alloc_percpu() to allocate per_cpu mcelog buffer.

- Use ndelay to implement witer timeout.

Signed-off-by: Huang Ying <[email protected]>
CC: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mce.h | 18 +-
arch/x86/kernel/cpu/mcheck/mce.c | 295 +++++++++++++++++++++++----------------
2 files changed, 190 insertions(+), 123 deletions(-)

--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -76,20 +76,28 @@ struct mce {
* is set.
*/

-#define MCE_LOG_LEN 32
+#define MCE_LOG_LEN 32
+#define MCE_LOG_LIMIT (MCE_LOG_LEN * 2 - 1)
+
+static inline int mce_log_index(int n)
+{
+ return n >= MCE_LOG_LEN ? n - MCE_LOG_LEN : n;
+}
+
+struct mce_log_cpu;

struct mce_log {
- char signature[12]; /* "MACHINECHECK" */
+ char signature[12]; /* "MACHINECHEC2" */
unsigned len; /* = MCE_LOG_LEN */
- unsigned next;
unsigned flags;
unsigned recordlen; /* length of struct mce */
- struct mce entry[MCE_LOG_LEN];
+ unsigned nr_mcelog_cpus;
+ struct mce_log_cpu **mcelog_cpus;
};

#define MCE_OVERFLOW 0 /* bit 0 in flags means overflow */

-#define MCE_LOG_SIGNATURE "MACHINECHECK"
+#define MCE_LOG_SIGNATURE "MACHINECHEC2"

#define MCE_GET_RECORD_LEN _IOR('M', 1, int)
#define MCE_GET_LOG_LEN _IOR('M', 2, int)
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -52,6 +52,9 @@ int mce_disabled __read_mostly;

#define SPINUNIT 100 /* 100ns */

+/* Timeout log reader to wait writer to finish */
+#define WRITER_TIMEOUT_NS NSEC_PER_MSEC
+
atomic_t mce_entry;

DEFINE_PER_CPU(unsigned, mce_exception_count);
@@ -125,42 +128,50 @@ static struct mce_log mcelog = {
.recordlen = sizeof(struct mce),
};

+struct mce_log_cpu {
+ int head;
+ int tail;
+ unsigned long flags;
+ struct mce entry[MCE_LOG_LEN];
+};
+
+DEFINE_PER_CPU(struct mce_log_cpu, mce_log_cpus);
+
void mce_log(struct mce *mce)
{
- unsigned next, entry;
+ struct mce_log_cpu *mcelog_cpu = &__get_cpu_var(mce_log_cpus);
+ int head, ihead, tail, next;

mce->finished = 0;
- wmb();
- for (;;) {
- entry = rcu_dereference(mcelog.next);
- for (;;) {
- /*
- * When the buffer fills up discard new entries.
- * Assume that the earlier errors are the more
- * interesting ones:
- */
- if (entry >= MCE_LOG_LEN) {
- set_bit(MCE_OVERFLOW,
- (unsigned long *)&mcelog.flags);
- return;
- }
- /* Old left over entry. Skip: */
- if (mcelog.entry[entry].finished) {
- entry++;
- continue;
- }
- break;
+ /*
+ * mce->finished must be set to 0 before written to ring
+ * buffer
+ */
+ smp_wmb();
+ do {
+ head = mcelog_cpu->head;
+ tail = mcelog_cpu->tail;
+ ihead = mce_log_index(head);
+ /*
+ * When the buffer fills up discard new entries.
+ * Assume that the earlier errors are the more
+ * interesting.
+ */
+ if (ihead == mce_log_index(tail) && head != tail) {
+ set_bit(MCE_OVERFLOW, &mcelog_cpu->flags);
+ return;
}
- smp_rmb();
- next = entry + 1;
- if (cmpxchg(&mcelog.next, entry, next) == entry)
- break;
- }
- memcpy(mcelog.entry + entry, mce, sizeof(struct mce));
- wmb();
- mcelog.entry[entry].finished = 1;
- wmb();
-
+ next = head == MCE_LOG_LIMIT ? 0 : head + 1;
+ } while (cmpxchg_local(&mcelog_cpu->head, head, next) != head);
+ memcpy(mcelog_cpu->entry + ihead, mce, sizeof(struct mce));
+ /*
+ * ".finished" of MCE record in ring buffer must be set after
+ * copy
+ */
+ smp_wmb();
+ mcelog_cpu->entry[ihead].finished = 1;
+ /* bit 0 of notify_user should be set after finished be set */
+ smp_wmb();
mce->finished = 1;
set_bit(0, &mce_need_notify);
}
@@ -200,6 +211,26 @@ static void print_mce_tail(void)
KERN_EMERG "Run through mcelog --ascii to decode and contact your hardware vendor\n");
}

+static void print_mce_cpu(int cpu, struct mce *final, u64 mask, u64 res)
+{
+ int i;
+ struct mce_log_cpu *mcelog_cpu;
+
+ mcelog_cpu = &per_cpu(mce_log_cpus, cpu);
+ for (i = 0; i < MCE_LOG_LEN; i++) {
+ struct mce *m = &mcelog_cpu->entry[i];
+ if (!m->finished)
+ continue;
+ if (!(m->status & MCI_STATUS_VAL))
+ continue;
+ if ((m->status & mask) != res)
+ continue;
+ if (final && !memcmp(m, final, sizeof(struct mce)))
+ continue;
+ print_mce(m);
+ }
+}
+
#define PANIC_TIMEOUT 5 /* 5 seconds */

static atomic_t mce_paniced;
@@ -222,7 +253,7 @@ static void wait_for_panic(void)

static void mce_panic(char *msg, struct mce *final, char *exp)
{
- int i;
+ int cpu;

if (!fake_panic) {
/*
@@ -241,23 +272,12 @@ static void mce_panic(char *msg, struct
}
print_mce_head();
/* First print corrected ones that are still unlogged */
- for (i = 0; i < MCE_LOG_LEN; i++) {
- struct mce *m = &mcelog.entry[i];
- if (!(m->status & MCI_STATUS_VAL))
- continue;
- if (!(m->status & MCI_STATUS_UC))
- print_mce(m);
- }
- /* Now print uncorrected but with the final one last */
- for (i = 0; i < MCE_LOG_LEN; i++) {
- struct mce *m = &mcelog.entry[i];
- if (!(m->status & MCI_STATUS_VAL))
- continue;
- if (!(m->status & MCI_STATUS_UC))
- continue;
- if (!final || memcmp(m, final, sizeof(struct mce)))
- print_mce(m);
- }
+ for_each_online_cpu(cpu)
+ print_mce_cpu(cpu, final, MCI_STATUS_UC, 0);
+ /* Print uncorrected but without the final one */
+ for_each_online_cpu(cpu)
+ print_mce_cpu(cpu, final, MCI_STATUS_UC, MCI_STATUS_UC);
+ /* Finally print the final mce */
if (final)
print_mce(final);
if (cpu_missing)
@@ -1207,6 +1227,22 @@ static int __cpuinit mce_cap_init(void)
return 0;
}

+/*
+ * Initialize MCE per-CPU log buffer
+ */
+static __cpuinit void mce_log_init(void)
+{
+ int cpu;
+
+ if (mcelog.mcelog_cpus)
+ return;
+ mcelog.nr_mcelog_cpus = num_possible_cpus();
+ mcelog.mcelog_cpus = kzalloc(sizeof(void *) * num_possible_cpus(),
+ GFP_KERNEL);
+ for_each_possible_cpu(cpu)
+ mcelog.mcelog_cpus[cpu] = &per_cpu(mce_log_cpus, cpu);
+}
+
static void mce_init(void)
{
mce_banks_t all_banks;
@@ -1367,6 +1403,7 @@ void __cpuinit mcheck_init(struct cpuinf
return;
}
mce_cpu_quirks(c);
+ mce_log_init();

machine_check_vector = do_machine_check;

@@ -1415,94 +1452,116 @@ static int mce_release(struct inode *ino
return 0;
}

-static void collect_tscs(void *data)
+static ssize_t mce_read_cpu(struct mce_log_cpu *mcelog_cpu,
+ char __user *inubuf, size_t usize)
{
- unsigned long *cpu_tsc = (unsigned long *)data;
+ char __user *ubuf = inubuf;
+ int head, tail, pos, i, err = 0;

- rdtscll(cpu_tsc[smp_processor_id()]);
-}
+ head = mcelog_cpu->head;
+ tail = mcelog_cpu->tail;

-static DEFINE_MUTEX(mce_read_mutex);
+ if (head == tail)
+ return 0;

-static ssize_t mce_read(struct file *filp, char __user *ubuf, size_t usize,
- loff_t *off)
-{
- char __user *buf = ubuf;
- unsigned long *cpu_tsc;
- unsigned prev, next;
- int i, err;
+ for (pos = tail; pos != head && usize >= sizeof(struct mce);
+ pos = pos == MCE_LOG_LIMIT ? 0 : pos+1) {
+ i = mce_log_index(pos);
+ if (!mcelog_cpu->entry[i].finished) {
+ int timeout = WRITER_TIMEOUT_NS;
+ while (!mcelog_cpu->entry[i].finished) {
+ if (timeout-- <= 0) {
+ memset(mcelog_cpu->entry + i, 0,
+ sizeof(struct mce));
+ head = mcelog_cpu->head;
+ printk(KERN_WARNING "mcelog: timeout "
+ "waiting for writer to finish!\n");
+ goto timeout;
+ }
+ ndelay(1);
+ }
+ }
+ /*
+ * finished field should be checked before
+ * copy_to_user()
+ */
+ smp_rmb();
+ err |= copy_to_user(ubuf, mcelog_cpu->entry + i,
+ sizeof(struct mce));
+ ubuf += sizeof(struct mce);
+ usize -= sizeof(struct mce);
+ mcelog_cpu->entry[i].finished = 0;
+timeout:
+ ;
+ }
+ /*
+ * mcelog_cpu->tail must be updated after ".finished" of
+ * corresponding MCE records are clear.
+ */
+ smp_wmb();
+ mcelog_cpu->tail = pos;

- cpu_tsc = kmalloc(nr_cpu_ids * sizeof(long), GFP_KERNEL);
- if (!cpu_tsc)
- return -ENOMEM;
+ return err ? -EFAULT : ubuf - inubuf;
+}

- mutex_lock(&mce_read_mutex);
- next = rcu_dereference(mcelog.next);
+static int mce_empty_cpu(struct mce_log_cpu *mcelog_cpu)
+{
+ return mcelog_cpu->head == mcelog_cpu->tail;
+}

- /* Only supports full reads right now */
- if (*off != 0 || usize < MCE_LOG_LEN*sizeof(struct mce)) {
- mutex_unlock(&mce_read_mutex);
- kfree(cpu_tsc);
+static int mce_empty(void)
+{
+ int cpu;
+ struct mce_log_cpu *mcelog_cpu;

- return -EINVAL;
+ for_each_possible_cpu(cpu) {
+ mcelog_cpu = &per_cpu(mce_log_cpus, cpu);
+ if (!mce_empty_cpu(mcelog_cpu))
+ return 0;
}
+ return 1;
+}

- err = 0;
- prev = 0;
- do {
- for (i = prev; i < next; i++) {
- unsigned long start = jiffies;
+static ssize_t mce_read(struct file *filp, char __user *inubuf, size_t usize,
+ loff_t *off)
+{
+ char __user *ubuf = inubuf;
+ struct mce_log_cpu *mcelog_cpu;
+ int cpu, new_mce, err = 0;
+ static DEFINE_MUTEX(mce_read_mutex);

- while (!mcelog.entry[i].finished) {
- if (time_after_eq(jiffies, start + 2)) {
- memset(mcelog.entry + i, 0,
- sizeof(struct mce));
- goto timeout;
- }
- cpu_relax();
- }
- smp_rmb();
- err |= copy_to_user(buf, mcelog.entry + i,
- sizeof(struct mce));
- buf += sizeof(struct mce);
-timeout:
- ;
+ mutex_lock(&mce_read_mutex);
+ do {
+ new_mce = 0;
+ for_each_possible_cpu(cpu) {
+ if (usize < sizeof(struct mce))
+ goto out;
+ mcelog_cpu = &per_cpu(mce_log_cpus, cpu);
+ err = mce_read_cpu(mcelog_cpu, ubuf,
+ sizeof(struct mce));
+ if (err > 0) {
+ ubuf += sizeof(struct mce);
+ usize -= sizeof(struct mce);
+ new_mce = 1;
+ err = 0;
+ } else if (err < 0)
+ goto out;
}
-
- memset(mcelog.entry + prev, 0,
- (next - prev) * sizeof(struct mce));
- prev = next;
- next = cmpxchg(&mcelog.next, prev, 0);
- } while (next != prev);
-
- synchronize_sched();
-
- /*
- * Collect entries that were still getting written before the
- * synchronize.
- */
- on_each_cpu(collect_tscs, cpu_tsc, 1);
-
- for (i = next; i < MCE_LOG_LEN; i++) {
- if (mcelog.entry[i].finished &&
- mcelog.entry[i].tsc < cpu_tsc[mcelog.entry[i].cpu]) {
- err |= copy_to_user(buf, mcelog.entry+i,
- sizeof(struct mce));
- smp_rmb();
- buf += sizeof(struct mce);
- memset(&mcelog.entry[i], 0, sizeof(struct mce));
+ if (need_resched()) {
+ mutex_unlock(&mce_read_mutex);
+ cond_resched();
+ mutex_lock(&mce_read_mutex);
}
- }
+ } while (new_mce || !mce_empty());
+out:
mutex_unlock(&mce_read_mutex);
- kfree(cpu_tsc);
-
- return err ? -EFAULT : buf - ubuf;
+ return err ? err : ubuf - inubuf;
}

static unsigned int mce_poll(struct file *file, poll_table *wait)
{
poll_wait(file, &mce_wait, wait);
- if (rcu_dereference(mcelog.next))
+ if (!mce_empty())
return POLLIN | POLLRDNORM;
return 0;
}

2009-09-18 11:10:17

by Ingo Molnar

[permalink] [raw]

Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

* Huang Ying <[email protected]> wrote:

> Current MCE log ring buffer has following bugs and issues:
>
> - On larger systems the 32 size buffer easily overflow, losing events.
>
> - We had some reports of events getting corrupted which were also
> blamed on the ring buffer.
>
> - There's a known livelock, now hit by more people, under high error
> rate.
>
> We fix these bugs and issues via making MCE log ring buffer as
> lock-less per-CPU ring buffer.

I like the direction of this (the current MCE ring-buffer code is a bad
local hack that should never have been merged upstream in that form) -
but i'd like to see a MUCH more ambitious (and much more useful!)
approach insted of using an explicit ring-buffer.

Please define MCE generic tracepoints using TRACE_EVENT() and use
perfcounters to access them.

This approach solves all the problems you listed and it also adds a
large number of new features to MCE events:

- Multiple user-space agents can access MCE events. You can have an
mcelog daemon running but also a system-wide tracer capturing
important events in flight-recorder mode.

- Sampling support: the kernel and the user-space call-chain of MCE
events can be stored and analyzed as well. This way actual patterns
of bad behavior can be matched to precisely what kind of activity
happened in the kernel (and/or in the app) around that moment in
time.

- Coupling with other hardware and software events: the PMU can track a
number of other anomalies - monitoring software might chose to
monitor those plus the MCE events as well - in one coherent stream of
events.

- Discovery of MCE sources - tracepoints are enumerated and tools can
act upon the existence (or non-existence) of various channels of MCE
information.

- Filtering support: you just subscribe to and act upon the events you
are interested in. Then even on a per event source basis there's
in-kernel filter expressions available that can restrict the amount
of data that hits the event channel.

- Arbitrary deep per cpu buffering of events - you can buffer 32
entries or you can buffer as much as you want, as long as you have
the RAM.

- An NMI-safe ring-buffer implementation - mappable to user-space.

- Built-in support for timestamping of events, PID markers, CPU
markers, etc.

- A rich ABI accessible over system call interface. Per cpu, per task
and per workload monitoring of MCE events can be done this way. The
ABI itself has a nice, meaningful structure.

- Extensible ABI: new fields can be added without breaking tooling.
New tracepoints can be added as the hardware side evolves. There's
various parsers that can be used.

- Lots of scheduling/buffering/batching modes of operandi for MCE
events. poll() support. mmap() support. read() support. You name it.

- Rich tooling support: even without any MCE specific extensions added
the 'perf' tool today offers various views of MCE data: perf report,
perf stat, perf trace can all be used to view logged MCE events and
perhaps correlate them to certain user-space usage patterns. But it
can be used directly as well, for user-space agents and policy action
in mcelog, etc.

- Significant code reduction and cleanup in the MCE code: the whole
mcelog facility can be dropped in essence.

- (these are the top of the list - there more advantages as well.)

Such a design would basically propel the MCE code into the twenty first
century. Once we have these facilities we can phase out /dev/mcelog for
good. It would turn Linux MCE events from a quirky hack that doesnt even
work after years of hacking into a modern, extensible event logging
facility that uses event sources and flexible transports to user-space.

It would actually be code that is not a problem child like today but one
that we can take pride in and which is fun to work on :-)

Now, an approach like this shouldnt just be a blind export of mce_log()
into a single artificial generic event [which is a pretty poor API to
begin with] - it should be the definition of meaningful
tracepoints/events that describe the hardware's structure.

I'd rather have a good enumeration of various sources of MCEs as
separate tracepoints than some badly jumbled mess of all MCE sources in
one inflexible ABI as /dev/mcelog does it today.

Note, if you need any perfcounter infrastructure extensions/help for
this then we'll be glad to provide that. I'm sure there's a few things
to enhance and a few things to fix - there always are with any
non-trivial new user :-) But heck would i take _those_ forward looking
problems over any of the current MCE design mess, any day of the week.

Thanks,

Ingo

2009-09-21 05:37:20

by Huang, Ying

[permalink] [raw]

Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

Hi, Ingo,

I think it is a interesting idea to replace the original MCE logging
part implementation (MCE log ring buffer + /dev/mcelog misc device) with
tracing infrastructure. The tracing infrastructure has definitely more
features than original method, but it is more complex than original
method too.

The tracing infrastructure is good at tracing. But I think it may be too
complex to be used as a (hardware) error processing mechanism. I think
the error processing should be simple and relatively independent with
other part of the kernel to avoid triggering more (hardware) error
during error processing. What do you think about this?

Even if we will reach consensus at some point to re-write MCE logging
part totally. We still need to fix the bugs in original implementation
before the new implementation being ready. Do you agree?

Best Regards,
Huang Ying

2009-09-22 13:40:15

by Ingo Molnar

[permalink] [raw]

Subject: [PATCH] x86: mce: New MCE logging design

* Huang Ying <[email protected]> wrote:

> Hi, Ingo,
>
> I think it is a interesting idea to replace the original MCE logging
> part implementation (MCE log ring buffer + /dev/mcelog misc device)
> with tracing infrastructure. The tracing infrastructure has definitely
> more features than original method, but it is more complex than
> original method too.

TRACE_EVENT() can be used to define the tracepoint itself - but most of
the benefit comes from using the performance events subsystem for event
reporting. It's a perfect match for MCE events - see below.

> The tracing infrastructure is good at tracing. But I think it may be
> too complex to be used as a (hardware) error processing mechanism. I
> think the error processing should be simple and relatively independent
> with other part of the kernel to avoid triggering more (hardware)
> error during error processing. What do you think about this?

The fact that event reporting is used for other things as well might
even make it more fault-resilient than open-coded MCE code: just due to
it possibly being hot in the cache already (and thus not needing extra
DRAM cycles to a possibly faulty piece of hardware) while the MCE code,
not used by anything else, has to be brought into the cache for error
processing - triggering more faults.

But ... that argument is even largely irrelevant: the overriding core
kernel principle was always that of reuse of facilities. We want to keep
performance event reporting simple and robust too, so re-using
facilities is a good thing and the needs of MCE logging will improve
perf events - and vice versa.

Furthermore, to me as x86 maintainer it matters a lot that a piece of
code i maintain is clean, well-designed and maintainable in the long
run. The current /dev/mcelog code is not clean and not maintainable,
and that needs to be fixed before it can be extended.

That MCE logging also becomes more robust and more feature-rich as a
result is a happy side-effect.

> Even if we will reach consensus at some point to re-write MCE logging
> part totally. We still need to fix the bugs in original implementation
> before the new implementation being ready. Do you agree?

Well, this is all 2.6.33 stuff and that's plenty of time to do the
generic events + perf events approach.

Anyway, i think we can work on these MCE enhancements much better if we
move it to the level of actual patches. Let me give you a (very small)
demonstration of my requests.

The patch attached below adds two new events to the MCE code, to the
thermal throttling code:

mce_thermal_throttle_hot
mce_thermal_throttle_cold

If you apply the patch below to latest -tip and boot that kernel and do:

cd tools/perf/
make -j install

And type 'perf list', those two new MCE events will show up in the list
of available events:

# perf list
[...]
mce:mce_thermal_throttle_hot [Tracepoint event]
mce:mce_thermal_throttle_cold [Tracepoint event]
[...]

Initially you can check whether those events are occuring:

# perf stat -a -e mce:mce_thermal_throttle_hot sleep 1

Performance counter stats for 'sleep 1':

0 mce:mce_thermal_throttle_hot # 0.000 M/sec

1.000672214 seconds time elapsed

None on a normal system. Now, i tried it on a system that gets frequent
thermal events, and there it gives:

# perf stat -a -e mce:mce_thermal_throttle_hot sleep 1

Performance counter stats for 'sleep 1':

73 mce:mce_thermal_throttle_hot # 0.000 M/sec

1.000765260 seconds time elapsed

Now i can observe the frequency of those thermal throttling events:

# perf stat --repeat 10 -a -e mce:mce_thermal_throttle_hot sleep 1

Performance counter stats for 'sleep 1' (10 runs):

184 mce:mce_thermal_throttle_hot # 0.000 M/sec ( +- 1.971% )

1.006488821 seconds time elapsed ( +- 0.203% )

184/sec is pretty hefty, with a relatively stable rate.

Now i try to see exactly which code is affected by those thermal
throttling events. For that i use 'perf record', and profile the whole
system for 10 seconds:

# perf record -f -e mce:mce_thermal_throttle_hot -c 1 -a sleep 10
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.802 MB perf.data (~35026 samples) ]

And i used 'perf report' to analyze where the MCE events occured:

# Samples: 2495
#
# Overhead Command Shared Object Symbol
# ........ .............. ........................ ......
#
55.39% hackbench [vdso] [.] 0x000000f77e1430
35.27% hackbench f77c2430 [.] 0x000000f77c2430
3.05% hackbench [kernel] [k] ia32_setup_sigcontext
0.88% bash [vdso] [.] 0x000000f776f430
0.64% bash f7731430 [.] 0x000000f7731430
0.52% hackbench ./hackbench [.] sender
0.48% nice f7753430 [.] 0x000000f7753430
0.36% hackbench ./hackbench [.] receiver
0.24% perf /lib64/libc-2.5.so [.] __GI_read
0.24% hackbench /lib/libc-2.5.so [.] __libc_read

Most of the events concentrate themselves to around syscall entry
points. That's not because the system is doing that most of the time -
it's probably because the fast syscall entry point is a critical path of
the CPU and overheating events likely concentrate themselves to around
that place.

Now lets how we can do very detailed MCE logs, with per CPU data,
timestamps and other events mixed. For that i use perf record again:

# perf record -R -f -e mce:mce_thermal_throttle_hot:r -e mce:mce_thermal_throttle_cold:r -e irq:irq_handler_entry:r -M -c 1 -a sleep 10
[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 1.167 MB perf.data (~50990 samples) ]

And i use 'perf trace' to look at the result:

hackbench-5070 [000] 1542.044184301: mce_thermal_throttle_hot: throttle_count=150852
hackbench-5070 [000] 1542.044672817: mce_thermal_throttle_cold: throttle_count=150852
hackbench-5056 [000] 1542.047029212: mce_thermal_throttle_hot: throttle_count=150853
hackbench-5057 [000] 1542.047518733: mce_thermal_throttle_cold: throttle_count=150853
hackbench-5058 [000] 1542.049822672: mce_thermal_throttle_hot: throttle_count=150854
hackbench-5046 [000] 1542.050312612: mce_thermal_throttle_cold: throttle_count=150854
hackbench-5072 [000] 1542.052720357: mce_thermal_throttle_hot: throttle_count=150855
hackbench-5072 [000] 1542.053210612: mce_thermal_throttle_cold: throttle_count=150855
:5074-5074 [000] 1542.055689468: mce_thermal_throttle_hot: throttle_count=150856
hackbench-5011 [000] 1542.056179849: mce_thermal_throttle_cold: throttle_count=150856
:5061-5061 [000] 1542.056937152: mce_thermal_throttle_hot: throttle_count=150857
hackbench-5048 [000] 1542.057428970: mce_thermal_throttle_cold: throttle_count=150857
hackbench-5053 [000] 1542.060372255: mce_thermal_throttle_hot: throttle_count=150858
hackbench-5054 [000] 1542.060863006: mce_thermal_throttle_cold: throttle_count=150858

The first thing we can see it from the trace, only the first core
reports thermal events (the box is a dual-core box). The other thing is
the timing: the CPU is in 'hot' state for about ~490 microseconds, then
in 'cold' state for ~250 microseconds.

So the CPU spends about 65% of its time in hot-throttled state, to cool
down enough to switch back to cold mode again.

Also, most of the overheating events hit the 'hackbench' app. (which is
no surprise on this box as it is only running that workload - but might
be interesting on other systems with more mixed workloads.)

So even in its completely MCE-unaware form, and with this small patch
that took 5 minutes to write, the performance events subsystem and
tooling can already give far greater insight into characteristics of
these events than the MCE subsystem did before.

Obviously the reported events can be used for different, MCE specific
things as well:

- Report an alert on the desktop (without having to scan the syslog all
the time)

- Mail the sysadmin

- Do some specific policy action such as soft-throttle the workload
until the situation and the log data can be examined.

As i outlined it in my previous mail, the perf events subsystem provides
a wide range of facilities that can be used for that purpose.

If there are aspects of MCE reporting usecases that necessiate the
extension of perf events, then we can do that - both sides will benefit
from it. What we dont want to do is to prolongue the pain caused by the
inferior, buggy and poorly designed /dev/mcelog ABI and code. It's not
serving our users well and people are spending time on fixing something
that has been misdesigned from the get go instead of concentrating on
improving the right facilities.

So this is really how i see the future of the MCE logging subsystem: use
meaningful generic events that empowers users, admins and tools to do
something intelligent with the available stream of MCE data. The
hardware tells us a _lot_ of information - lets use and expose that in a
structured form.

Ingo

---
arch/x86/kernel/cpu/mcheck/therm_throt.c | 9 +++++
include/trace/events/mce.h | 47 +++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+), 1 deletion(-)

Index: linux/arch/x86/kernel/cpu/mcheck/therm_throt.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ linux/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -31,6 +31,9 @@
#include <asm/mce.h>
#include <asm/msr.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/mce.h>
+
/* How long to wait between reporting thermal events */
#define CHECK_INTERVAL (300 * HZ)

@@ -118,8 +121,12 @@ static int therm_throt_process(bool is_t
was_throttled = state->is_throttled;
state->is_throttled = is_throttled;

- if (is_throttled)
+ if (is_throttled) {
state->throttle_count++;
+ trace_mce_thermal_throttle_hot(state->throttle_count);
+ } else {
+ trace_mce_thermal_throttle_cold(state->throttle_count);
+ }

if (time_before64(now, state->next_check) &&
state->throttle_count != state->last_throttle_count)
Index: linux/include/trace/events/mce.h
===================================================================
--- /dev/null
+++ linux/include/trace/events/mce.h
@@ -0,0 +1,47 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mce
+
+#if !defined(_TRACE_MCE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MCE_H
+
+#include <linux/ktime.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(mce_thermal_throttle_hot,
+
+ TP_PROTO(unsigned int throttle_count),
+
+ TP_ARGS(throttle_count),
+
+ TP_STRUCT__entry(
+ __field( u64, throttle_count )
+ ),
+
+ TP_fast_assign(
+ __entry->throttle_count = throttle_count;
+ ),
+
+ TP_printk("throttle_count=%Lu", __entry->throttle_count)
+);
+
+TRACE_EVENT(mce_thermal_throttle_cold,
+
+ TP_PROTO(unsigned int throttle_count),
+
+ TP_ARGS(throttle_count),
+
+ TP_STRUCT__entry(
+ __field( u64, throttle_count )
+ ),
+
+ TP_fast_assign(
+ __entry->throttle_count = throttle_count;
+ ),
+
+ TP_printk("throttle_count=%Lu", __entry->throttle_count)
+);
+
+#endif /* _TRACE_MCE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

2009-10-05 06:24:39

Subject: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

Subject: [PATCH] x86: mce: New MCE logging design

Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

Subject: [PATCH 01/10] x86, mce: remove tsc handling from mce_read

Subject: [PATCH 02/10] x86, mce: mce_read can check args without mutex

Subject: [PATCH 03/10] x86, mce: change writer timeout in mce_read

Subject: [PATCH 04/10] x86, mce: use do-while in mce_log

Subject: [PATCH 05/10] x86, mce: make mce_log buffer to per-CPU, prep

Subject: [PATCH 06/10] x86, mce: make mce_log buffer to per-CPU

Subject: [PATCH 07/10] x86, mce: remove for-loop in mce_log

Subject: [PATCH 08/10] x86, mce: change barriers in mce_log

Subject: [PATCH 09/10] x86, mce: make mce_log buffer to ring buffer

Subject: [PATCH 10/10] x86, mce: move mce_log_init() into mce_cap_init()

Subject: Re: [PATCH 06/10] x86, mce: make mce_log buffer to per-CPU

Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

Subject: Re: [PATCH 06/10] x86, mce: make mce_log buffer to per-CPU

Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer

Subject: Re: [PATCH 06/10] x86, mce: make mce_log buffer to per-CPU

Subject: Re: [PATCH 06/10] x86, mce: make mce_log buffer to per-CPU

Subject: Re: [PATCH 06/10] x86, mce: make mce_log buffer to per-CPU