2010-02-06 02:47:51

by Don Zickus

Subject: [PATCH 0/3 v2] new nmi_watchdog using perf events

This patch series implements a new nmi_watchdog using the perf
events subsystem. I am posting the series early for feedback on the
approach. It is not yet feature compatible with the old nmi_watchdog, nor
does it include all of the old workarounds.

The basic design is to create an in-kernel performance counter that fires
every few seconds and checks for cpu lockups. It is fairly
straightforward. Most of the quirks are in making sure the cpu lockup
detection works correctly.

It has been lightly tested for now. Once people are ok with the approach,
I'll expand testing to more machines in our lab.

I tried to take a generic approach so that any architecture can use it if
it wants, with a few arch-specific hooks underneath. I believe this is what
Ingo was suggesting. The initial work is based on kernel/softlockup.c.

Any feedback would be great.

v2:
- moved a notify_die call into a #ifdef block
- used better default values for configuring the nmi_watchdog based on
the old nmi_watchdog values

Cheers,
Don

--
damn it forgot to cc lkml

Don Zickus (3):
[RFC][x86] move notify_die from nmi.c to traps.c
[RFC] nmi_watchdog: new implementation using perf events
[RFC] nmi_watchdog: config option to enable new nmi_watchdog

arch/x86/kernel/apic/Makefile | 7 ++-
arch/x86/kernel/apic/hw_nmi.c | 114 ++++++++++++++++++++++++
arch/x86/kernel/apic/nmi.c | 7 --
arch/x86/kernel/traps.c | 7 ++
include/linux/nmi.h | 4 +
kernel/Makefile | 1 +
kernel/nmi_watchdog.c | 196 +++++++++++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 13 +++
8 files changed, 341 insertions(+), 8 deletions(-)
create mode 100644 arch/x86/kernel/apic/hw_nmi.c
create mode 100644 kernel/nmi_watchdog.c


2010-02-06 02:47:28

by Don Zickus

Subject: [PATCH 1/3 v2] [x86] move notify_die from nmi.c to traps.c

In order to handle a new nmi_watchdog approach, I need to move the notify_die()
routine out of nmi_watchdog_tick() and into default_do_nmi(). This lets me easily
swap out the old nmi_watchdog with the new one with just a config change.

The change probably makes sense from a high level perspective because the
nmi_watchdog shouldn't be handling notify_die routines anyway. However, this
move does change the semantics slightly: instead of checking whether the cpus
are stuck on every nmi interrupt, they are now checked only on the
nmi_watchdog interrupts.

v2: move notify_die call into #ifdef block

Signed-off-by: Don Zickus <[email protected]>
---
arch/x86/kernel/apic/nmi.c | 7 -------
arch/x86/kernel/traps.c | 5 +++++
2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/apic/nmi.c b/arch/x86/kernel/apic/nmi.c
index 0159a69..5d47682 100644
--- a/arch/x86/kernel/apic/nmi.c
+++ b/arch/x86/kernel/apic/nmi.c
@@ -400,13 +400,6 @@ nmi_watchdog_tick(struct pt_regs *regs, unsigned reason)
int cpu = smp_processor_id();
int rc = 0;

- /* check for other users first */
- if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
- == NOTIFY_STOP) {
- rc = 1;
- touched = 1;
- }
-
sum = get_timer_irqs(cpu);

if (__get_cpu_var(nmi_touch)) {
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 3339917..3be4687 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -400,7 +400,12 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
if (notify_die(DIE_NMI_IPI, "nmi_ipi", regs, reason, 2, SIGINT)
== NOTIFY_STOP)
return;
+
#ifdef CONFIG_X86_LOCAL_APIC
+ if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
+ == NOTIFY_STOP)
+ return;
+
/*
* Ok, so this is none of the documented NMI sources,
* so it must be the NMI watchdog.
--
1.6.6.83.gc9a2

2010-02-06 02:47:45

by Don Zickus

Subject: [PATCH 3/3 v2] nmi_watchdog: config option to enable new nmi_watchdog

These are the bits that enable the new nmi_watchdog and safely isolate the
old nmi_watchdog. Only one or the other can run, not both at the same
time.

Signed-off-by: Don Zickus <[email protected]>
---
arch/x86/kernel/apic/Makefile | 7 ++++++-
arch/x86/kernel/traps.c | 2 ++
include/linux/nmi.h | 4 ++++
kernel/Makefile | 1 +
lib/Kconfig.debug | 13 +++++++++++++
5 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/apic/Makefile b/arch/x86/kernel/apic/Makefile
index 565c1bf..1a4512e 100644
--- a/arch/x86/kernel/apic/Makefile
+++ b/arch/x86/kernel/apic/Makefile
@@ -2,7 +2,12 @@
# Makefile for local APIC drivers and for the IO-APIC code
#

-obj-$(CONFIG_X86_LOCAL_APIC) += apic.o apic_noop.o probe_$(BITS).o ipi.o nmi.o
+obj-$(CONFIG_X86_LOCAL_APIC) += apic.o apic_noop.o probe_$(BITS).o ipi.o
+ifneq ($(CONFIG_NMI_WATCHDOG),y)
+obj-$(CONFIG_X86_LOCAL_APIC) += nmi.o
+endif
+obj-$(CONFIG_NMI_WATCHDOG) += hw_nmi.o
+
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
obj-$(CONFIG_SMP) += ipi.o

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 3be4687..5b89638 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -406,6 +406,7 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
== NOTIFY_STOP)
return;

+#ifndef CONFIG_NMI_WATCHDOG
/*
* Ok, so this is none of the documented NMI sources,
* so it must be the NMI watchdog.
@@ -413,6 +414,7 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
if (nmi_watchdog_tick(regs, reason))
return;
if (!do_nmi_callback(regs, cpu))
+#endif /* !CONFIG_NMI_WATCHDOG */
unknown_nmi_error(reason, regs);
#else
unknown_nmi_error(reason, regs);
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index b752e80..a42ff0b 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -47,4 +47,8 @@ static inline bool trigger_all_cpu_backtrace(void)
}
#endif

+#ifdef CONFIG_NMI_WATCHDOG
+int hw_nmi_is_cpu_stuck(struct pt_regs *);
+#endif
+
#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 864ff75..8a5abe5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -76,6 +76,7 @@ obj-$(CONFIG_AUDIT_TREE) += audit_tree.o
obj-$(CONFIG_KPROBES) += kprobes.o
obj-$(CONFIG_KGDB) += kgdb.o
obj-$(CONFIG_DETECT_SOFTLOCKUP) += softlockup.o
+obj-$(CONFIG_NMI_WATCHDOG) += nmi_watchdog.o
obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 25c3ed5..04a43a2 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -170,6 +170,19 @@ config DETECT_SOFTLOCKUP
can be detected via the NMI-watchdog, on platforms that
support it.)

+config NMI_WATCHDOG
+ bool "Detect Hard Lockups with an NMI Watchdog"
+ depends on DEBUG_KERNEL && PERF_EVENTS
+ default y
+ help
+ Say Y here to enable the kernel to use the NMI as a watchdog
+ to detect hard lockups. This is useful when a cpu hangs for no
+ reason but can still respond to NMIs. A backtrace is displayed
+ for reviewing and reporting.
+
+ The overhead should be minimal, just an extra NMI every few
+ seconds.
+
config BOOTPARAM_SOFTLOCKUP_PANIC
bool "Panic (Reboot) On Soft Lockups"
depends on DETECT_SOFTLOCKUP
--
1.6.6.83.gc9a2

2010-02-06 02:47:37

by Don Zickus

Subject: [PATCH 2/3 v2] nmi_watchdog: new implementation using perf events

This is a new generic nmi_watchdog implementation using the perf events
infrastructure, as suggested by Ingo.

The implementation is simple: just create an in-kernel perf event and register
an overflow handler that checks for cpu lockups. I created a generic piece that
lives in kernel/ and a hardware-specific piece that for now lives in arch/x86.

I have done light testing to make sure the framework works correctly and it does.

v2: sets the correct timeout values based on the old nmi watchdog

Signed-off-by: Don Zickus <[email protected]>
---
arch/x86/kernel/apic/hw_nmi.c | 114 ++++++++++++++++++++++++
kernel/nmi_watchdog.c | 191 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 305 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kernel/apic/hw_nmi.c
create mode 100644 kernel/nmi_watchdog.c

diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
new file mode 100644
index 0000000..8c0e6a4
--- /dev/null
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -0,0 +1,114 @@
+/*
+ * HW NMI watchdog support
+ *
+ * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
+ *
+ * Arch specific calls to support NMI watchdog
+ *
+ * Bits copied from original nmi.c file
+ *
+ */
+
+#include <asm/apic.h>
+#include <linux/smp.h>
+#include <linux/cpumask.h>
+#include <linux/sched.h>
+#include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/kernel_stat.h>
+#include <asm/mce.h>
+
+#include <linux/nmi.h>
+#include <linux/module.h>
+
+/* For reliability, we're prepared to waste bits here. */
+static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
+
+static DEFINE_PER_CPU(unsigned, last_irq_sum);
+
+/*
+ * Take the local apic timer and PIT/HPET into account. We don't
+ * know which one is active, when we have highres/dyntick on
+ */
+static inline unsigned int get_timer_irqs(int cpu)
+{
+ return per_cpu(irq_stat, cpu).apic_timer_irqs +
+ per_cpu(irq_stat, cpu).irq0_irqs;
+}
+
+static inline int mce_in_progress(void)
+{
+#if defined(CONFIG_X86_MCE)
+ return atomic_read(&mce_entry) > 0;
+#endif
+ return 0;
+}
+
+int hw_nmi_is_cpu_stuck(struct pt_regs *regs)
+{
+ unsigned int sum;
+ int cpu = smp_processor_id();
+
+ /* FIXME: cheap hack for this check, probably should get its own
+ * die_notifier handler
+ */
+ if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
+ static DEFINE_SPINLOCK(lock); /* Serialise the printks */
+
+ spin_lock(&lock);
+ printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
+ show_regs(regs);
+ dump_stack();
+ spin_unlock(&lock);
+ cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
+ }
+
+ /* if we are doing an mce, just assume the cpu is not stuck */
+ /* Could check oops_in_progress here too, but it's safer not to */
+ if (mce_in_progress())
+ return 0;
+
+ /* We determine if the cpu is stuck by checking whether any
+ * interrupts have happened since we last checked. Of course
+ * an nmi storm could create false positives, but the higher
+ * level logic should account for that
+ */
+ sum = get_timer_irqs(cpu);
+ if (__get_cpu_var(last_irq_sum) == sum) {
+ return 1;
+ } else {
+ __get_cpu_var(last_irq_sum) = sum;
+ return 0;
+ }
+}
+
+void arch_trigger_all_cpu_backtrace(void)
+{
+ int i;
+
+ cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
+
+ printk(KERN_INFO "sending NMI to all CPUs:\n");
+ apic->send_IPI_all(NMI_VECTOR);
+
+ /* Wait for up to 10 seconds for all CPUs to do the backtrace */
+ for (i = 0; i < 10 * 1000; i++) {
+ if (cpumask_empty(to_cpumask(backtrace_mask)))
+ break;
+ mdelay(1);
+ }
+}
+
+/* STUB calls to mimic old nmi_watchdog behaviour */
+unsigned int nmi_watchdog = NMI_NONE;
+EXPORT_SYMBOL(nmi_watchdog);
+atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */
+EXPORT_SYMBOL(nmi_active);
+int nmi_watchdog_enabled;
+int unknown_nmi_panic;
+void cpu_nmi_set_wd_enabled(void) { return; }
+void acpi_nmi_enable(void) { return; }
+void acpi_nmi_disable(void) { return; }
+void stop_apic_nmi_watchdog(void *unused) { return; }
+void setup_apic_nmi_watchdog(void *unused) { return; }
+int __init check_nmi_watchdog(void) { return 0; }
diff --git a/kernel/nmi_watchdog.c b/kernel/nmi_watchdog.c
new file mode 100644
index 0000000..36817b2
--- /dev/null
+++ b/kernel/nmi_watchdog.c
@@ -0,0 +1,191 @@
+/*
+ * Detect Hard Lockups using the NMI
+ *
+ * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
+ *
+ * this code detects hard lockups: incidents where, on a CPU,
+ * the kernel does not respond to anything except NMI.
+ *
+ * Note: Most of this code is borrowed heavily from softlockup.c,
+ * so thanks to Ingo for the initial implementation.
+ * Some chunks also taken from arch/x86/kernel/apic/nmi.c, thanks
+ * to those contributors as well.
+ */
+
+#include <linux/mm.h>
+#include <linux/cpu.h>
+#include <linux/nmi.h>
+#include <linux/init.h>
+#include <linux/delay.h>
+#include <linux/freezer.h>
+#include <linux/lockdep.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/sysctl.h>
+
+#include <asm/irq_regs.h>
+#include <linux/perf_event.h>
+
+static DEFINE_PER_CPU(struct perf_event *, nmi_watchdog_ev);
+static DEFINE_PER_CPU(int, nmi_watchdog_touch);
+static DEFINE_PER_CPU(long, alert_counter);
+
+void touch_nmi_watchdog(void)
+{
+ __raw_get_cpu_var(nmi_watchdog_touch) = 1;
+ touch_softlockup_watchdog();
+}
+EXPORT_SYMBOL(touch_nmi_watchdog);
+
+void touch_all_nmi_watchdog(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ per_cpu(nmi_watchdog_touch, cpu) = 1;
+ touch_softlockup_watchdog();
+}
+
+#ifdef CONFIG_SYSCTL
+/*
+ * proc handler for /proc/sys/kernel/nmi_watchdog
+ */
+int proc_nmi_enabled(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int cpu;
+
+ if (per_cpu(nmi_watchdog_ev, smp_processor_id()) == NULL)
+ nmi_watchdog_enabled = 0;
+ else
+ nmi_watchdog_enabled = 1;
+
+ touch_all_nmi_watchdog();
+ proc_dointvec(table, write, buffer, length, ppos);
+ if (nmi_watchdog_enabled)
+ for_each_online_cpu(cpu)
+ perf_event_enable(per_cpu(nmi_watchdog_ev, cpu));
+ else
+ for_each_online_cpu(cpu)
+ perf_event_disable(per_cpu(nmi_watchdog_ev, cpu));
+ return 0;
+}
+
+#endif /* CONFIG_SYSCTL */
+
+struct perf_event_attr wd_attr = {
+ .type = PERF_TYPE_HARDWARE,
+ .config = PERF_COUNT_HW_CPU_CYCLES,
+ .size = sizeof(struct perf_event_attr),
+ .pinned = 1,
+ .disabled = 1,
+};
+
+static int panic_on_timeout;
+
+void wd_overflow(struct perf_event *event, int nmi,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ int cpu = smp_processor_id();
+ int touched = 0;
+
+ if (__get_cpu_var(nmi_watchdog_touch)) {
+ per_cpu(nmi_watchdog_touch, cpu) = 0;
+ touched = 1;
+ }
+
+ /* check to see if the cpu is doing anything */
+ if (!touched && hw_nmi_is_cpu_stuck(regs)) {
+ /*
+ * Ayiee, looks like this CPU is stuck ...
+ * wait a few IRQs (5 seconds) before doing the oops ...
+ */
+ per_cpu(alert_counter,cpu) += 1;
+ if (per_cpu(alert_counter,cpu) == 5) {
+ /*
+ * die_nmi will return ONLY if NOTIFY_STOP happens..
+ */
+ die_nmi("BUG: NMI Watchdog detected LOCKUP",
+ regs, panic_on_timeout);
+ }
+ } else {
+ per_cpu(alert_counter,cpu) = 0;
+ }
+
+ return;
+}
+
+/*
+ * Create/destroy watchdog threads as CPUs come and go:
+ */
+static int __cpuinit
+cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+ int hotcpu = (unsigned long)hcpu;
+ struct perf_event *event;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ per_cpu(nmi_watchdog_touch, hotcpu) = 0;
+ break;
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ /* originally wanted the below chunk to be in CPU_UP_PREPARE, but caps is unpriv for non-CPU0 */
+ wd_attr.sample_period = cpu_khz * 1000;
+ event = perf_event_create_kernel_counter(&wd_attr, hotcpu, -1, wd_overflow);
+ if (IS_ERR(event)) {
+ printk(KERN_ERR "nmi watchdog failed to create perf event on %i: %p\n", hotcpu, event);
+ return NOTIFY_BAD;
+ }
+ per_cpu(nmi_watchdog_ev, hotcpu) = event;
+ perf_event_enable(per_cpu(nmi_watchdog_ev, hotcpu));
+ break;
+#ifdef CONFIG_HOTPLUG_CPU
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ perf_event_disable(per_cpu(nmi_watchdog_ev, hotcpu));
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ event = per_cpu(nmi_watchdog_ev, hotcpu);
+ per_cpu(nmi_watchdog_ev, hotcpu) = NULL;
+ perf_event_release_kernel(event);
+ break;
+#endif /* CONFIG_HOTPLUG_CPU */
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata cpu_nfb = {
+ .notifier_call = cpu_callback
+};
+
+static int __initdata nonmi_watchdog;
+
+static int __init nonmi_watchdog_setup(char *str)
+{
+ nonmi_watchdog = 1;
+ return 1;
+}
+__setup("nonmi_watchdog", nonmi_watchdog_setup);
+
+static int __init spawn_nmi_watchdog_task(void)
+{
+ void *cpu = (void *)(long)smp_processor_id();
+ int err;
+
+ if (nonmi_watchdog)
+ return 0;
+
+ err = cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
+ if (err == NOTIFY_BAD) {
+ BUG();
+ return 1;
+ }
+ cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
+ register_cpu_notifier(&cpu_nfb);
+
+ return 0;
+}
+early_initcall(spawn_nmi_watchdog_task);
--
1.6.6.83.gc9a2

2010-02-08 07:20:14

by Ingo Molnar

Subject: Re: [PATCH 3/3 v2] nmi_watchdog: config option to enable new nmi_watchdog


* Don Zickus <[email protected]> wrote:

> +config NMI_WATCHDOG
> + bool "Detect Hard Lockups with an NMI Watchdog"
> + depends on DEBUG_KERNEL && PERF_EVENTS
> + default y
> + help
> + Say Y here to enable the kernel to use the NMI as a watchdog
> + to detect hard lockups. This is useful when a cpu hangs for no
> + reason but can still respond to NMIs. A backtrace is displayed
> + for reviewing and reporting.
> +
> + The overhead should be minimal, just an extra NMI every few
> + seconds.

Thought for later patches: I think an architecture should be able to express
via a Kconfig switch that it actually _has_ NMI events. There are architectures
which don't have a PMU driver and only have software events. There are also
architectures that have a PMU driver but no NMIs.

Something like ARCH_HAS_NMI_PERF_EVENTS?

Also, i haven't checked, but what is the practical effect of the new generic
watchdog on x86 CPUs that do not have a native PMU driver yet - such as
P4s?

Anyway, i'll create a tip:perf/nmi topic branch for these patches, it
certainly looks like a useful generalization and a new architecture that has
perf could easily enable it, without having to write its own NMI watchdog
implementation. It's also useful for any new watchdog features that people
might want to add. Plus it makes the x86 PMU code cleaner in the long run as
well.

Thanks,

Ingo

2010-02-08 08:53:19

by Don Zickus

Subject: [tip:perf/nmi] x86: Move notify_die from nmi.c to traps.c

Commit-ID: e40b17208b6805be50ffe891878662b6076206b9
Gitweb: http://git.kernel.org/tip/e40b17208b6805be50ffe891878662b6076206b9
Author: Don Zickus <[email protected]>
AuthorDate: Fri, 5 Feb 2010 21:47:03 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 8 Feb 2010 08:29:02 +0100

x86: Move notify_die from nmi.c to traps.c

In order to handle a new nmi_watchdog approach, I need to move
the notify_die() routine out of nmi_watchdog_tick() and into
default_do_nmi(). This lets me easily swap out the old
nmi_watchdog with the new one with just a config change.

The change probably makes sense from a high level perspective
because the nmi_watchdog shouldn't be handling notify_die
routines anyway. However, this move does change the semantics a
little bit: instead of checking whether the cpus are stuck on
every nmi interrupt, they are now checked only on the
nmi_watchdog interrupts.

v2: Move notify_die call into #ifdef block

Signed-off-by: Don Zickus <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/apic/nmi.c | 7 -------
arch/x86/kernel/traps.c | 5 +++++
2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/apic/nmi.c b/arch/x86/kernel/apic/nmi.c
index 0159a69..5d47682 100644
--- a/arch/x86/kernel/apic/nmi.c
+++ b/arch/x86/kernel/apic/nmi.c
@@ -400,13 +400,6 @@ nmi_watchdog_tick(struct pt_regs *regs, unsigned reason)
int cpu = smp_processor_id();
int rc = 0;

- /* check for other users first */
- if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
- == NOTIFY_STOP) {
- rc = 1;
- touched = 1;
- }
-
sum = get_timer_irqs(cpu);

if (__get_cpu_var(nmi_touch)) {
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1168e44..51ef893 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -400,7 +400,12 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
if (notify_die(DIE_NMI_IPI, "nmi_ipi", regs, reason, 2, SIGINT)
== NOTIFY_STOP)
return;
+
#ifdef CONFIG_X86_LOCAL_APIC
+ if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
+ == NOTIFY_STOP)
+ return;
+
/*
* Ok, so this is none of the documented NMI sources,
* so it must be the NMI watchdog.

2010-02-08 08:53:31

by Don Zickus

Subject: [tip:perf/nmi] nmi_watchdog: Config option to enable new nmi_watchdog

Commit-ID: 84e478c6f1eb9c4bfa1fff2f8108e9a061b46428
Gitweb: http://git.kernel.org/tip/84e478c6f1eb9c4bfa1fff2f8108e9a061b46428
Author: Don Zickus <[email protected]>
AuthorDate: Fri, 5 Feb 2010 21:47:05 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 8 Feb 2010 08:29:03 +0100

nmi_watchdog: Config option to enable new nmi_watchdog

These are the bits that enable the new nmi_watchdog and safely
isolate the old nmi_watchdog. Only one or the other can run,
not both at the same time.

Signed-off-by: Don Zickus <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/apic/Makefile | 7 ++++++-
arch/x86/kernel/traps.c | 2 ++
include/linux/nmi.h | 4 ++++
kernel/Makefile | 1 +
lib/Kconfig.debug | 13 +++++++++++++
5 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/apic/Makefile b/arch/x86/kernel/apic/Makefile
index 565c1bf..1a4512e 100644
--- a/arch/x86/kernel/apic/Makefile
+++ b/arch/x86/kernel/apic/Makefile
@@ -2,7 +2,12 @@
# Makefile for local APIC drivers and for the IO-APIC code
#

-obj-$(CONFIG_X86_LOCAL_APIC) += apic.o apic_noop.o probe_$(BITS).o ipi.o nmi.o
+obj-$(CONFIG_X86_LOCAL_APIC) += apic.o apic_noop.o probe_$(BITS).o ipi.o
+ifneq ($(CONFIG_NMI_WATCHDOG),y)
+obj-$(CONFIG_X86_LOCAL_APIC) += nmi.o
+endif
+obj-$(CONFIG_NMI_WATCHDOG) += hw_nmi.o
+
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
obj-$(CONFIG_SMP) += ipi.o

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 51ef893..973cbc4 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -406,6 +406,7 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
== NOTIFY_STOP)
return;

+#ifndef CONFIG_NMI_WATCHDOG
/*
* Ok, so this is none of the documented NMI sources,
* so it must be the NMI watchdog.
@@ -413,6 +414,7 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
if (nmi_watchdog_tick(regs, reason))
return;
if (!do_nmi_callback(regs, cpu))
+#endif /* !CONFIG_NMI_WATCHDOG */
unknown_nmi_error(reason, regs);
#else
unknown_nmi_error(reason, regs);
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index b752e80..a42ff0b 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -47,4 +47,8 @@ static inline bool trigger_all_cpu_backtrace(void)
}
#endif

+#ifdef CONFIG_NMI_WATCHDOG
+int hw_nmi_is_cpu_stuck(struct pt_regs *);
+#endif
+
#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 864ff75..8a5abe5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -76,6 +76,7 @@ obj-$(CONFIG_AUDIT_TREE) += audit_tree.o
obj-$(CONFIG_KPROBES) += kprobes.o
obj-$(CONFIG_KGDB) += kgdb.o
obj-$(CONFIG_DETECT_SOFTLOCKUP) += softlockup.o
+obj-$(CONFIG_NMI_WATCHDOG) += nmi_watchdog.o
obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 25c3ed5..f80b67e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -170,6 +170,19 @@ config DETECT_SOFTLOCKUP
can be detected via the NMI-watchdog, on platforms that
support it.)

+config NMI_WATCHDOG
+ bool "Detect Hard Lockups with an NMI Watchdog"
+ depends on DEBUG_KERNEL && PERF_EVENTS
+ default y
+ help
+ Say Y here to enable the kernel to use the NMI as a watchdog
+ to detect hard lockups. This is useful when a cpu hangs for no
+ reason but can still respond to NMIs. A backtrace is displayed
+ for reviewing and reporting.
+
+ The overhead should be minimal, just an extra NMI every few
+ seconds.
+
config BOOTPARAM_SOFTLOCKUP_PANIC
bool "Panic (Reboot) On Soft Lockups"
depends on DETECT_SOFTLOCKUP

2010-02-08 08:53:33

by Don Zickus

Subject: [tip:perf/nmi] nmi_watchdog: Add new, generic implementation, using perf events

Commit-ID: 1fb9d6ad2766a1dd70d167552988375049a97f21
Gitweb: http://git.kernel.org/tip/1fb9d6ad2766a1dd70d167552988375049a97f21
Author: Don Zickus <[email protected]>
AuthorDate: Fri, 5 Feb 2010 21:47:04 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 8 Feb 2010 08:29:02 +0100

nmi_watchdog: Add new, generic implementation, using perf events

This is a new generic nmi_watchdog implementation using the perf
events infrastructure as suggested by Ingo.

The implementation is simple, just create an in-kernel perf
event and register an overflow handler to check for cpu lockups.

I created a generic implementation that lives in kernel/ and
the hardware specific part that for now lives in arch/x86.

This approach has a number of advantages:

- It simplifies the x86 PMU implementation in the long run,
in that it removes the hardcoded low-level PMU implementation
that was the NMI watchdog before.

- It allows new NMI watchdog features to be added in a central
place.

- It allows other architectures to enable the NMI watchdog,
as long as they have perf events (that provide NMIs)
implemented.

- It also allows for more graceful co-existence of existing
perf events apps and the NMI watchdog - before these changes
the relationship was exclusive. (The NMI watchdog will 'spend'
a perf event when enabled. In later iterations we might be
able to piggyback from an existing NMI event without having
to allocate a hardware event for the NMI watchdog - turning
this into a no-hardware-cost feature.)

As for compatibility, we'll keep the old NMI watchdog code as
well until the new one can 100% replace it on all CPUs, old and
new alike. That might take some time as the NMI watchdog has
been ported to many CPU models.

I have done light testing to make sure the framework works
correctly and it does.

v2: Set the correct timeout values based on the old nmi
watchdog

Signed-off-by: Don Zickus <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/apic/hw_nmi.c | 114 ++++++++++++++++++++++++
kernel/nmi_watchdog.c | 191 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 305 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
new file mode 100644
index 0000000..8c0e6a4
--- /dev/null
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -0,0 +1,114 @@
+/*
+ * HW NMI watchdog support
+ *
+ * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
+ *
+ * Arch specific calls to support NMI watchdog
+ *
+ * Bits copied from original nmi.c file
+ *
+ */
+
+#include <asm/apic.h>
+#include <linux/smp.h>
+#include <linux/cpumask.h>
+#include <linux/sched.h>
+#include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/kernel_stat.h>
+#include <asm/mce.h>
+
+#include <linux/nmi.h>
+#include <linux/module.h>
+
+/* For reliability, we're prepared to waste bits here. */
+static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
+
+static DEFINE_PER_CPU(unsigned, last_irq_sum);
+
+/*
+ * Take the local apic timer and PIT/HPET into account. We don't
+ * know which one is active, when we have highres/dyntick on
+ */
+static inline unsigned int get_timer_irqs(int cpu)
+{
+ return per_cpu(irq_stat, cpu).apic_timer_irqs +
+ per_cpu(irq_stat, cpu).irq0_irqs;
+}
+
+static inline int mce_in_progress(void)
+{
+#if defined(CONFIG_X86_MCE)
+ return atomic_read(&mce_entry) > 0;
+#endif
+ return 0;
+}
+
+int hw_nmi_is_cpu_stuck(struct pt_regs *regs)
+{
+ unsigned int sum;
+ int cpu = smp_processor_id();
+
+ /* FIXME: cheap hack for this check, probably should get its own
+ * die_notifier handler
+ */
+ if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
+ static DEFINE_SPINLOCK(lock); /* Serialise the printks */
+
+ spin_lock(&lock);
+ printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
+ show_regs(regs);
+ dump_stack();
+ spin_unlock(&lock);
+ cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
+ }
+
+ /* if we are doing an mce, just assume the cpu is not stuck */
+ /* Could check oops_in_progress here too, but it's safer not to */
+ if (mce_in_progress())
+ return 0;
+
+ /* We determine if the cpu is stuck by checking whether any
+ * interrupts have happened since we last checked. Of course
+ * an nmi storm could create false positives, but the higher
+ * level logic should account for that
+ */
+ sum = get_timer_irqs(cpu);
+ if (__get_cpu_var(last_irq_sum) == sum) {
+ return 1;
+ } else {
+ __get_cpu_var(last_irq_sum) = sum;
+ return 0;
+ }
+}
+
+void arch_trigger_all_cpu_backtrace(void)
+{
+ int i;
+
+ cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
+
+ printk(KERN_INFO "sending NMI to all CPUs:\n");
+ apic->send_IPI_all(NMI_VECTOR);
+
+ /* Wait for up to 10 seconds for all CPUs to do the backtrace */
+ for (i = 0; i < 10 * 1000; i++) {
+ if (cpumask_empty(to_cpumask(backtrace_mask)))
+ break;
+ mdelay(1);
+ }
+}
+
+/* STUB calls to mimic old nmi_watchdog behaviour */
+unsigned int nmi_watchdog = NMI_NONE;
+EXPORT_SYMBOL(nmi_watchdog);
+atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */
+EXPORT_SYMBOL(nmi_active);
+int nmi_watchdog_enabled;
+int unknown_nmi_panic;
+void cpu_nmi_set_wd_enabled(void) { return; }
+void acpi_nmi_enable(void) { return; }
+void acpi_nmi_disable(void) { return; }
+void stop_apic_nmi_watchdog(void *unused) { return; }
+void setup_apic_nmi_watchdog(void *unused) { return; }
+int __init check_nmi_watchdog(void) { return 0; }
diff --git a/kernel/nmi_watchdog.c b/kernel/nmi_watchdog.c
new file mode 100644
index 0000000..36817b2
--- /dev/null
+++ b/kernel/nmi_watchdog.c
@@ -0,0 +1,191 @@
+/*
+ * Detect Hard Lockups using the NMI
+ *
+ * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
+ *
+ * this code detects hard lockups: incidents where, on a CPU,
+ * the kernel does not respond to anything except NMI.
+ *
+ * Note: Most of this code is borrowed heavily from softlockup.c,
+ * so thanks to Ingo for the initial implementation.
+ * Some chunks also taken from arch/x86/kernel/apic/nmi.c, thanks
+ * to those contributors as well.
+ */
+
+#include <linux/mm.h>
+#include <linux/cpu.h>
+#include <linux/nmi.h>
+#include <linux/init.h>
+#include <linux/delay.h>
+#include <linux/freezer.h>
+#include <linux/lockdep.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/sysctl.h>
+
+#include <asm/irq_regs.h>
+#include <linux/perf_event.h>
+
+static DEFINE_PER_CPU(struct perf_event *, nmi_watchdog_ev);
+static DEFINE_PER_CPU(int, nmi_watchdog_touch);
+static DEFINE_PER_CPU(long, alert_counter);
+
+void touch_nmi_watchdog(void)
+{
+ __raw_get_cpu_var(nmi_watchdog_touch) = 1;
+ touch_softlockup_watchdog();
+}
+EXPORT_SYMBOL(touch_nmi_watchdog);
+
+void touch_all_nmi_watchdog(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ per_cpu(nmi_watchdog_touch, cpu) = 1;
+ touch_softlockup_watchdog();
+}
+
+#ifdef CONFIG_SYSCTL
+/*
+ * proc handler for /proc/sys/kernel/nmi_watchdog
+ */
+int proc_nmi_enabled(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int cpu;
+
+ if (per_cpu(nmi_watchdog_ev, smp_processor_id()) == NULL)
+ nmi_watchdog_enabled = 0;
+ else
+ nmi_watchdog_enabled = 1;
+
+ touch_all_nmi_watchdog();
+ proc_dointvec(table, write, buffer, length, ppos);
+ if (nmi_watchdog_enabled)
+ for_each_online_cpu(cpu)
+ perf_event_enable(per_cpu(nmi_watchdog_ev, cpu));
+ else
+ for_each_online_cpu(cpu)
+ perf_event_disable(per_cpu(nmi_watchdog_ev, cpu));
+ return 0;
+}
+
+#endif /* CONFIG_SYSCTL */
+
+struct perf_event_attr wd_attr = {
+ .type = PERF_TYPE_HARDWARE,
+ .config = PERF_COUNT_HW_CPU_CYCLES,
+ .size = sizeof(struct perf_event_attr),
+ .pinned = 1,
+ .disabled = 1,
+};
+
+static int panic_on_timeout;
+
+void wd_overflow(struct perf_event *event, int nmi,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ int cpu = smp_processor_id();
+ int touched = 0;
+
+ if (__get_cpu_var(nmi_watchdog_touch)) {
+ per_cpu(nmi_watchdog_touch, cpu) = 0;
+ touched = 1;
+ }
+
+ /* check to see if the cpu is doing anything */
+ if (!touched && hw_nmi_is_cpu_stuck(regs)) {
+ /*
+ * Ayiee, looks like this CPU is stuck ...
+ * wait a few IRQs (5 seconds) before doing the oops ...
+ */
+ per_cpu(alert_counter, cpu) += 1;
+ if (per_cpu(alert_counter, cpu) == 5) {
+ /*
+ * die_nmi will return ONLY if NOTIFY_STOP happens..
+ */
+ die_nmi("BUG: NMI Watchdog detected LOCKUP",
+ regs, panic_on_timeout);
+ }
+ } else {
+ per_cpu(alert_counter, cpu) = 0;
+ }
+
+ return;
+}
+
+/*
+ * Create/destroy watchdog threads as CPUs come and go:
+ */
+static int __cpuinit
+cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+ int hotcpu = (unsigned long)hcpu;
+ struct perf_event *event;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ per_cpu(nmi_watchdog_touch, hotcpu) = 0;
+ break;
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ /* originally wanted the below chunk to be in CPU_UP_PREPARE, but caps is unpriv for non-CPU0 */
+ wd_attr.sample_period = cpu_khz * 1000;
+ event = perf_event_create_kernel_counter(&wd_attr, hotcpu, -1, wd_overflow);
+ if (IS_ERR(event)) {
+ printk(KERN_ERR "nmi watchdog failed to create perf event on cpu %i: error %ld\n", hotcpu, PTR_ERR(event));
+ return NOTIFY_BAD;
+ }
+ per_cpu(nmi_watchdog_ev, hotcpu) = event;
+ perf_event_enable(per_cpu(nmi_watchdog_ev, hotcpu));
+ break;
+#ifdef CONFIG_HOTPLUG_CPU
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ perf_event_disable(per_cpu(nmi_watchdog_ev, hotcpu));
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ event = per_cpu(nmi_watchdog_ev, hotcpu);
+ per_cpu(nmi_watchdog_ev, hotcpu) = NULL;
+ perf_event_release_kernel(event);
+ break;
+#endif /* CONFIG_HOTPLUG_CPU */
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata cpu_nfb = {
+ .notifier_call = cpu_callback
+};
+
+static int __initdata nonmi_watchdog;
+
+static int __init nonmi_watchdog_setup(char *str)
+{
+ nonmi_watchdog = 1;
+ return 1;
+}
+__setup("nonmi_watchdog", nonmi_watchdog_setup);
+
+static int __init spawn_nmi_watchdog_task(void)
+{
+ void *cpu = (void *)(long)smp_processor_id();
+ int err;
+
+ if (nonmi_watchdog)
+ return 0;
+
+ err = cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
+ if (err == NOTIFY_BAD) {
+ BUG();
+ return 1;
+ }
+ cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
+ register_cpu_notifier(&cpu_nfb);
+
+ return 0;
+}
+early_initcall(spawn_nmi_watchdog_task);

2010-02-08 09:47:13

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH 3/3 v2] nmi_watchdog: config option to enable new nmi_watchdog

On 2/8/10, Ingo Molnar <[email protected]> wrote:
>
> * Don Zickus <[email protected]> wrote:
>
>> +config NMI_WATCHDOG
>> + bool "Detect Hard Lockups with an NMI Watchdog"
>> + depends on DEBUG_KERNEL && PERF_EVENTS
>> + default y
>> + help
>> + Say Y here to enable the kernel to use the NMI as a watchdog
>> + to detect hard lockups. This is useful when a cpu hangs for no
>> + reason but can still respond to NMIs. A backtrace is displayed
>> + for reviewing and reporting.
>> +
>> + The overhead should be minimal, just an extra NMI every few
>> + seconds.
>
> Thought for later patches: I think an architecture should be able to express
> via a Kconfig switch that it actually _has_ NMI events. There's
> architectures
> which dont have a PMU driver and only have software events. There's also
> architectures that have a PMU driver but no NMIs.
>
> Something like ARCH_HAS_NMI_PERF_EVENTS?
>
> Also, i havent checked, but what is the practical effect of the new generic
> watchdog on x86 CPUs that does not have a native PMU driver yet - such as
> P4s?
>

The P4 PMU is not yet implemented. I'll try to post on LKML the things
I've done for it this evening, though it's pretty ugly, I would say.

> Anyway, i'll create a tip:perf/nmi topic branch for these patches, it
> certainly looks like a useful generalization and a new architecture that has
> perf could easily enable it, without having to write its own NMI watchdog
> implementation. It's also useful for any new watchdog features that people
> might want to add. Plus it makes the x86 PMU code cleaner in the long run as
> well.
>
> Thanks,
>
> Ingo
>

2010-02-08 14:58:45

by Don Zickus

[permalink] [raw]
Subject: Re: [PATCH 3/3 v2] nmi_watchdog: config option to enable new nmi_watchdog

On Mon, Feb 08, 2010 at 08:19:54AM +0100, Ingo Molnar wrote:
>
> * Don Zickus <[email protected]> wrote:
>
> > +config NMI_WATCHDOG
> > + bool "Detect Hard Lockups with an NMI Watchdog"
> > + depends on DEBUG_KERNEL && PERF_EVENTS
> > + default y
> > + help
> > + Say Y here to enable the kernel to use the NMI as a watchdog
> > + to detect hard lockups. This is useful when a cpu hangs for no
> > + reason but can still respond to NMIs. A backtrace is displayed
> > + for reviewing and reporting.
> > +
> > + The overhead should be minimal, just an extra NMI every few
> > + seconds.
>
> Thought for later patches: I think an architecture should be able to express
> via a Kconfig switch that it actually _has_ NMI events. There's architectures
> which dont have a PMU driver and only have software events. There's also
> architectures that have a PMU driver but no NMIs.
>
> Something like ARCH_HAS_NMI_PERF_EVENTS?

I guess I assumed the perf event subsystem would take care of that, which
is why I made the config option depend on PERF_EVENTS. I am open to
suggestions on how to enhance it.

>
> Also, i havent checked, but what is the practical effect of the new generic
> watchdog on x86 CPUs that does not have a native PMU driver yet - such as
> P4s?

I believe the call to perf_event_create_kernel_counter would fail, which
then prevents the cpu from coming online. Probably not the smartest thing
to do. I was looking at adding code to fall back to trying PERF_TYPE_SOFTWARE.
Let me dig up a P4 box and see what happens.
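[Editor's note: a minimal sketch of the fallback Don describes — hypothetical, not part of the posted patch. It reuses the patch's wd_attr and hotplug callback context, and assumes that when no PMU driver exists (e.g. P4 at the time), perf_event_create_kernel_counter() returns an ERR_PTR for the hardware event but can still create a software event:]

/* Hypothetical fallback sketch, not from the posted patch. */
wd_attr.type = PERF_TYPE_HARDWARE;
wd_attr.config = PERF_COUNT_HW_CPU_CYCLES;
event = perf_event_create_kernel_counter(&wd_attr, hotcpu, -1, wd_overflow);
if (IS_ERR(event)) {
	/* No hardware counter available: fall back to a software event. */
	wd_attr.type = PERF_TYPE_SOFTWARE;
	wd_attr.config = PERF_COUNT_SW_CPU_CLOCK;
	event = perf_event_create_kernel_counter(&wd_attr, hotcpu, -1, wd_overflow);
}
if (IS_ERR(event))
	return NOTIFY_BAD;	/* still no event; CPU bring-up fails as before */

The caveat is that a software event fires from timer interrupts, which stop firing during a true hard lockup (interrupts disabled), so the fallback keeps the CPU booting but cannot detect the lockups the NMI path is designed to catch.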

>
> Anyway, i'll create a tip:perf/nmi topic branch for these patches, it
> certainly looks like a useful generalization and a new architecture that has
> perf could easily enable it, without having to write its own NMI watchdog
> implementation. It's also useful for any new watchdog features that people
> might want to add. Plus it makes the x86 PMU code cleaner in the long run as
> well.

Agreed.

Cheers,
Don

2010-02-09 10:52:18

by Ingo Molnar

[permalink] [raw]
Subject: [tip:perf/nmi] nmi_watchdog: Only enable on x86 for now

Commit-ID: 8e7672cdb413af859086ffceaed68f7e1e8ea4c2
Gitweb: http://git.kernel.org/tip/8e7672cdb413af859086ffceaed68f7e1e8ea4c2
Author: Ingo Molnar <[email protected]>
AuthorDate: Tue, 9 Feb 2010 06:11:00 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 9 Feb 2010 06:11:00 +0100

nmi_watchdog: Only enable on x86 for now

It won't even build on other platforms just yet - so restrict it
to x86 for now.

Cc: Don Zickus <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
lib/Kconfig.debug | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index f80b67e..acef882 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -173,6 +173,7 @@ config DETECT_SOFTLOCKUP
config NMI_WATCHDOG
bool "Detect Hard Lockups with an NMI Watchdog"
depends on DEBUG_KERNEL && PERF_EVENTS
+ depends on X86
default y
help
Say Y here to enable the kernel to use the NMI as a watchdog

2010-02-12 16:12:46

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 0/3 v2] new nmi_watchdog using perf events

Don,

How is this new NMI watchdog code going to work when you also have OProfile
enabled in your kernel?

Today, perf_event disables the NMI watchdog while there is at least one event.
By releasing the PMU registers, it also allows for Oprofile to work.

But now with this new NMI watchdog code, perf_event never releases the PMU.
Thus, I suspect Oprofile will not work anymore, unless the NMI watchdog is
explicitly disabled. Up until now OProfile could co-exist with the NMI watchdog.

2010-02-12 17:00:29

by Don Zickus

[permalink] [raw]
Subject: Re: [PATCH 0/3 v2] new nmi_watchdog using perf events

On Fri, Feb 12, 2010 at 05:12:38PM +0100, Stephane Eranian wrote:
> Don,
>
> How is this new NMI watchdog code going to work when you also have OProfile
> enabled in your kernel?
>
> Today, perf_event disables the NMI watchdog while there is at least one event.
> By releasing the PMU registers, it also allows for Oprofile to work.
>
> But now with this new NMI watchdog code, perf_event never releases the PMU.
> Thus, I suspect Oprofile will not work anymore, unless the NMI watchdog is
> explicitly disabled. Up until now OProfile could co-exist with the NMI watchdog.

You are right. Originally when I read the code I thought perf_event just
grabbed all the PMUs in reserve_pmc_init(). But I see that only happens
when someone actually creates a PERF_TYPE_HARDWARE event, which the new
nmi watchdog does. Those PMUs only get released when the event is
destroyed which my new code only does when the cpu disappears.

So yeah, I have effectively blocked oprofile from working. I can change
my code such that when you disable the nmi_watchdog, you can release the
PMUs and let oprofile work.

But then I am curious, considering that perf and oprofile do the same
thing, how much longer do we let competing subsystems control the same
hardware? I thought the point of the perf_event subsystem was to have a
proper framework on top of the PMUs such that anyone who wants to use it
just registers themselves, which is what the new nmi_watchdog is doing.

I can add code that allows oprofile and the new nmi watchdog to coexist,
but things get a little ugly to maintain. Just wondering what the
gameplan is here?

Cheers,
Don

2010-02-12 17:12:55

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 0/3 v2] new nmi_watchdog using perf events

Don,

On Fri, Feb 12, 2010 at 5:59 PM, Don Zickus <[email protected]> wrote:
> On Fri, Feb 12, 2010 at 05:12:38PM +0100, Stephane Eranian wrote:
>> Don,
>>
>> How is this new NMI watchdog code going to work when you also have OProfile
>> enabled in your kernel?
>>
>> Today, perf_event disables the NMI watchdog while there is at least one event.
>> By releasing the PMU registers, it also allows for Oprofile to work.
>>
>> But now with this new NMI watchdog code, perf_event never releases the PMU.
>> Thus, I suspect Oprofile will not work anymore, unless the NMI watchdog is
>> explicitly disabled. Up until now OProfile could co-exist with the NMI watchdog.
>
> You are right.  Originally when I read the code I thought perf_event just
> grabbed all the PMUs in reserve_pmc_init().  But I see that only happens
> when someone actually creates a PERF_TYPE_HARDWARE event, which the new
> nmi watchdog does.  Those PMUs only get released when the event is
> destroyed which my new code only does when the cpu disappears.
>
> So yeah, I have effectively blocked oprofile from working.  I can change
> my code such that when you disable the nmi_watchdog, you can release the
> PMUs and let oprofile work.
>
> But then I am curious, considering that perf and oprofile do the same
> thing, how much longer do we let competing subsystems control the same
> hardware?  I thought the point of the perf_event subsystem was to have a
> proper framework on top of the PMUs such that anyone who wants to use it
> just registers themselves, which is what the new nmi_watchdog is doing.
>
> I can add code that allows oprofile and the new nmi watchdog to coexist,
> but things get a little ugly to maintain.  Just wondering what the
> gameplan is here?
>
I believe OProfile should eventually be removed from the kernel. I suspect
much of the functionality it needs is already provided by perf_events.
But that does not mean the OProfile user level tool must disappear. There is
a very large user community. I think it could and should be ported to use
perf_events instead. Given that the Oprofile users only interact through
opcontrol, opreport, opannotate and such, they never "see" the actual kernel
API. Thus by re-targeting the scripts, this should be mostly transparent to
end-users.

But for now, I believe the most practical solution is to release the perf_event
event when you disable the NMI watchdog. That would at least provide a
way to run OProfile.

2010-02-14 08:31:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/3 v2] new nmi_watchdog using perf events


* Stephane Eranian <[email protected]> wrote:

> Don,
>
> On Fri, Feb 12, 2010 at 5:59 PM, Don Zickus <[email protected]> wrote:
> > On Fri, Feb 12, 2010 at 05:12:38PM +0100, Stephane Eranian wrote:
> >> Don,
> >>
> >> How is this new NMI watchdog code going to work when you also have OProfile
> >> enabled in your kernel?
> >>
> >> Today, perf_event disables the NMI watchdog while there is at least one event.
> >> By releasing the PMU registers, it also allows for Oprofile to work.
> >>
> >> But now with this new NMI watchdog code, perf_event never releases the PMU.
> >> Thus, I suspect Oprofile will not work anymore, unless the NMI watchdog is
> >> explicitly disabled. Up until now OProfile could co-exist with the NMI watchdog.
> >
> > You are right.  Originally when I read the code I thought perf_event just
> > grabbed all the PMUs in reserve_pmc_init().  But I see that only happens
> > when someone actually creates a PERF_TYPE_HARDWARE event, which the new
> > nmi watchdog does.  Those PMUs only get released when the event is
> > destroyed which my new code only does when the cpu disappears.
> >
> > So yeah, I have effectively blocked oprofile from working.  I can change
> > my code such that when you disable the nmi_watchdog, you can release the
> > PMUs and let oprofile work.
> >
> > But then I am curious, considering that perf and oprofile do the same
> > thing, how much longer do we let competing subsystems control the same
> > hardware?  I thought the point of the perf_event subsystem was to have a
> > proper framework on top of the PMUs such that anyone who wants to use it
> > just registers themselves, which is what the new nmi_watchdog is doing.
> >
> > I can add code that allows oprofile and the new nmi watchdog to coexist,
> > but things get a little ugly to maintain.  Just wondering what the
> > gameplan is here?
>
> I believe OProfile should eventually be removed from the kernel. I suspect
> much of the functionalities it needs are already provided by perf_events.
> But that does not mean the OProfile user level tool must disappear. There
> is a very large user community. I think it could and should be ported to
> use perf_events instead. Given that the Oprofile users only interact
> through opcontrol, opreport, opannotate and such, they never "see" the
> actual kernel API. Thus by re-targeting the scripts, this should be mostly
> transparent to end-users.

Agreed, i suggested this a few months ago. (In theory the oprofile user-space
tooling might even use something like libpapi or the perfmon library, instead
of directly binding to the perf syscall.)

Until that is done we can keep the parallel oprofile infrastructure i guess,
as a legacy option.

Then again, when it comes to profiling, i cannot see anyone not wanting to
use perf itself, it's so much easier to use (and more capable) than the
oprofile tooling in pretty much any area. Basically all instrumentation and
profiling development happens in that space these days as well.

Those few places where you cannot use perf yet (on weird or old CPUs) are
being filled in quickly as well - so it might be easier to just widen PMU
support to all relevant places instead of spending effort on oprofile
compatibility.

> But for now, I believe the most practical solution is to release the
> perf_event event when you disable the NMI watchdog. That would at least
> provide a way to run OProfile.

Agreed.

Ingo

Subject: Re: [PATCH 0/3 v2] new nmi_watchdog using perf events

On 12.02.10 18:12:47, Stephane Eranian wrote:
> Don,
>
> On Fri, Feb 12, 2010 at 5:59 PM, Don Zickus <[email protected]> wrote:
> > On Fri, Feb 12, 2010 at 05:12:38PM +0100, Stephane Eranian wrote:
> >> Don,
> >>
> >> How is this new NMI watchdog code going to work when you also have OProfile
> >> enabled in your kernel?
> >>
> >> Today, perf_event disables the NMI watchdog while there is at least one event.
> >> By releasing the PMU registers, it also allows for Oprofile to work.
> >>
> >> But now with this new NMI watchdog code, perf_event never releases the PMU.
> >> Thus, I suspect Oprofile will not work anymore, unless the NMI watchdog is
> >> explicitly disabled. Up until now OProfile could co-exist with the NMI watchdog.
> >
> > You are right.  Originally when I read the code I thought perf_event just
> > grabbed all the PMUs in reserve_pmc_init().  But I see that only happens
> > when someone actually creates a PERF_TYPE_HARDWARE event, which the new
> > nmi watchdog does.  Those PMUs only get released when the event is
> > destroyed which my new code only does when the cpu disappears.
> >
> > So yeah, I have effectively blocked oprofile from working.  I can change
> > my code such that when you disable the nmi_watchdog, you can release the
> > PMUs and let oprofile work.
> >
> > But then I am curious, considering that perf and oprofile do the same
> > thing, how much longer do we let competing subsystems control the same
> > hardware?  I thought the point of the perf_event subsystem was to have a
> > proper framework on top of the PMUs such that anyone who wants to use it
> > just registers themselves, which is what the new nmi_watchdog is doing.

There is the perfctr reservation framework, which is used by all
subsystems. Perf reserves all counters as soon as there is one event
actively running. This is fine as long as you use perf from userspace
for profiling; nobody uses two different profilers at the same time.
But if the counters are also used to implement in-kernel features such
as a watchdog that is enabled all the time, perf must be modified to
allocate only those counters that are actually needed, and events may
not be scheduled on counters that are already reserved.

> > I can add code that allows oprofile and the new nmi watchdog to coexist,
> > but things get a little ugly to maintain. ?Just wondering what the
> > gameplan is here?

There is no longer any new kernel feature development for oprofile,
but it will still be in the kernel for a while until we can completely
switch to perf. Perf is improving very fast; compared to that ongoing
development, the implementation effort for coexistence is small. So I
think we can all spend some time to also improve the counter
reservation code.

> I believe OProfile should eventually be removed from the kernel. I suspect
> much of the functionalities it needs are already provided by perf_events.
> But that does not mean the OProfile user level tool must disappear. There is
> a very large user community. I think it could and should be ported to use
> perf_events instead. Given that the Oprofile users only interact through
> opcontrol, opreport, opannotate and such, they never "see" the actual kernel
> API. Thus by re-targeting the scripts, this should be mostly transparent to
> end-users.

I think porting the oprofile userland to work on top of a performance
library (libpapi or libpfm) would be the cleanest solution. Alternatively,
we could also port the kernel part to use the in-kernel perf API.

>
> But for now, I believe the most practical solution is to release the perf_event
> event when you disable the NMI watchdog. That would at least provide a
> way to run OProfile.

This solution is fine with me. The current implementation also has some
limitations for oprofile if the watchdog is enabled.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]