2005-10-25 14:23:51

by Steven Rostedt

Subject: 2.6.14-rc5-rt6 -- False NMI lockup detects

Hi Ingo and Thomas,

On some of my machines, I've been seeing false NMI lockup reports. This
usually happens on slower machines. Looking into it, the cause seems to
be a short window where no processes are using timers, so no ktimer
interrupts are needed. The APIC timer, which is now used only for the
ktimers, then goes quiet for five seconds and the NMI watchdog fires,
because the watchdog uses the APIC timer count to detect lockups.

So, I added a more generic method. This only works for x86 for now, but
it has a #ifdef to keep other archs working until they implement this as
well. I added nmi_irq_incr(), which is called by __do_IRQ in the generic
code. Its count is what the NMI code now uses to determine if the CPU
has locked up. This way we don't have to worry about which resource we
are using for timers.
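
To make the idea concrete, here is a minimal user-space sketch of the detection logic, not the kernel code itself. It assumes a fixed NR_CPUS, one watchdog NMI per second and plain arrays in place of per-CPU variables; the sim_* names are made up for illustration. The point is only that the watchdog compares an interrupt count between two NMIs and complains when the count has not moved for five seconds' worth of NMIs.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS        2   /* assumption: small fixed CPU count for the sketch */
#define NMI_HZ         1   /* assumption: one watchdog NMI per second */
#define LOCKUP_SECONDS 5   /* the five second threshold mentioned above */

/* Plain arrays standing in for the kernel's per-CPU data. */
static unsigned int irq_count[NR_CPUS];     /* bumped on every interrupt */
static unsigned int last_irq_sum[NR_CPUS];  /* count seen at the previous NMI */
static unsigned int alert_counter[NR_CPUS]; /* consecutive NMIs without progress */

/* Called from the generic IRQ path, in the role of nmi_irq_incr(). */
static void sim_irq_incr(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    irq_count[cpu]++;
}

/* Called on every watchdog NMI; returns true when a lockup would be reported. */
static bool sim_nmi_watchdog_tick(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);

    unsigned int sum = irq_count[cpu];

    if (sum == last_irq_sum[cpu]) {
        /* No interrupt of any kind since the last NMI. */
        alert_counter[cpu]++;
        if (alert_counter[cpu] == LOCKUP_SECONDS * NMI_HZ)
            return true;
    } else {
        /* The CPU took at least one interrupt, so it is alive. */
        last_irq_sum[cpu] = sum;
        alert_counter[cpu] = 0;
    }
    return false;
}
```

Calling sim_nmi_watchdog_tick() five times in a row with no sim_irq_incr() in between reports a lockup; a single sim_irq_incr() from any interrupt source resets the count, whichever timer happens to be in use.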

-- Steve

Index: rt_linux_ernie/arch/i386/kernel/nmi.c
===================================================================
--- rt_linux_ernie.orig/arch/i386/kernel/nmi.c 2005-10-25 10:06:23.000000000 -0400
+++ rt_linux_ernie/arch/i386/kernel/nmi.c 2005-10-25 10:06:06.000000000 -0400
@@ -484,6 +484,12 @@
touch_softlockup_watchdog();
}

+static DEFINE_PER_CPU(int, nmi_irq_incr);
+void nmi_irq_incr(int cpu)
+{
+ per_cpu(nmi_irq_incr, cpu)++;
+}
+
extern void die_nmi(struct pt_regs *, const char *msg);

int nmi_show_regs[NR_CPUS];
@@ -521,7 +527,7 @@
*/
int sum, cpu = smp_processor_id();

- sum = per_cpu(irq_stat, cpu).apic_timer_irqs;
+ sum = per_cpu(nmi_irq_incr, cpu);

profile_tick(CPU_PROFILING, regs);
if (nmi_show_regs[cpu]) {
Index: rt_linux_ernie/arch/i386/kernel/apic.c
===================================================================
--- rt_linux_ernie.orig/arch/i386/kernel/apic.c 2005-10-25 08:49:37.000000000 -0400
+++ rt_linux_ernie/arch/i386/kernel/apic.c 2005-10-25 10:14:37.000000000 -0400
@@ -1161,9 +1161,6 @@
{
int cpu = smp_processor_id();

- /*
- * the NMI deadlock-detector uses this.
- */
per_cpu(irq_stat, cpu).apic_timer_irqs++;

trace_special(regs->eip, 0, 0);
Index: rt_linux_ernie/include/asm-i386/irq.h
===================================================================
--- rt_linux_ernie.orig/include/asm-i386/irq.h 2005-08-28 19:41:01.000000000 -0400
+++ rt_linux_ernie/include/asm-i386/irq.h 2005-10-25 09:59:44.000000000 -0400
@@ -25,6 +25,7 @@

#ifdef CONFIG_X86_LOCAL_APIC
# define ARCH_HAS_NMI_WATCHDOG /* See include/linux/nmi.h */
+# define ARCH_HAS_NMI_IRQ_INCR
#endif

#ifdef CONFIG_4KSTACKS
Index: rt_linux_ernie/include/linux/nmi.h
===================================================================
--- rt_linux_ernie.orig/include/linux/nmi.h 2005-10-25 09:21:06.000000000 -0400
+++ rt_linux_ernie/include/linux/nmi.h 2005-10-25 09:26:28.000000000 -0400
@@ -19,4 +19,10 @@
# define touch_nmi_watchdog() do { } while(0)
#endif

+#ifdef ARCH_HAS_NMI_IRQ_INCR
+extern void nmi_irq_incr(int cpu);
+#else
+# define nmi_irq_incr(cpu) do { } while(0)
+#endif
+
#endif
Index: rt_linux_ernie/kernel/irq/handle.c
===================================================================
--- rt_linux_ernie.orig/kernel/irq/handle.c 2005-10-25 08:49:42.000000000 -0400
+++ rt_linux_ernie/kernel/irq/handle.c 2005-10-25 09:45:13.000000000 -0400
@@ -16,6 +16,7 @@
#include <linux/kallsyms.h>
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
+#include <linux/nmi.h>

#if defined(CONFIG_NO_IDLE_HZ)
#include <asm/dyntick.h>
@@ -533,6 +534,12 @@
if (user_mode(regs))
touch_light_softlockup_watchdog();

+ /*
+ * Moved the NMI lockup detect here, since this way we really
+ * know a CPU is locked up when it stops receiving interrupts.
+ */
+ nmi_irq_incr(smp_processor_id());
+
kstat_this_cpu.irqs[irq]++;
if (CHECK_IRQ_PER_CPU(desc->status)) {
irqreturn_t action_ret;



2005-10-25 14:28:41

by Ingo Molnar

Subject: Re: 2.6.14-rc5-rt6 -- False NMI lockup detects


* Steven Rostedt <[email protected]> wrote:

> Hi Ingo and Thomas,
>
> On some of my machines, I've been experiencing false NMI lockups.
> This usually happens on slower machines, and taking a look into this,
> it seems to be due to a short time where no processes are using
> timers, and the ktimer interrupts aren't needed. So the APIC timer,
> which now is used only for the ktimers, has a five second pause, and
> causes the NMI to go off. The NMI uses the apic timer to determine
> lockups.

this would be a bug - the jiffy tick should be processed every 1 msec,
regardless of whether there are any ktimers pending. (in the future we
want to use a special ktimer for the jiffy tick, but that's not
implemented yet.)

> So, I added a more generic method. This only works for x86 for now,
> but it has a #ifdef to keep other archs working until it implements
> this as well. I added a nmi_irq_incr which is called by __do_IRQ in
> the generic code. This is what is used in the NMI code to determine
> if the CPU has locked up. This way we don't have to worry about what
> resource we are using for timers.

this will be useful for tickless stuff - but right now 'no APIC timer
irq for 5 seconds' is a 'must not happen'.

Ingo

2005-10-25 19:25:16

by Steven Rostedt

Subject: Re: 2.6.14-rc5-rt6 -- False NMI lockup detects

On Tue, 2005-10-25 at 16:28 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > Hi Ingo and Thomas,
> >
> > On some of my machines, I've been experiencing false NMI lockups.
> > This usually happens on slower machines, and taking a look into this,
> > it seems to be due to a short time where no processes are using
> > timers, and the ktimer interrupts aren't needed. So the APIC timer,
> > which now is used only for the ktimers, has a five second pause, and
> > causes the NMI to go off. The NMI uses the apic timer to determine
> > lockups.
>
> this would be a bug - the jiffy tick should be processed every 1 msec,
> regardless of whether there are any ktimers pending. (in the future we
> want to use a special ktimer for the jiffy tick, but that's not
> implemented yet.)
>

Isn't the jiffy tick implemented with the PIT when possible? So the apic
is only used when a timer is needed. Also note that this "lockup"
happens on boot up while things are being initialized, so not many
things may be using the timer.

Also, the machine doesn't lock up. It continues happily along at normal
speed. It's only a 366 MHz machine.

At my customer's site (which I'm no longer at), my test machine (2GHz)
never showed this, but an otherwise identical machine that my customer
had showed it on every boot up. The only difference between the two
machines was that the other one had a 1GHz processor. Both were running
modified -rt kernels.


> > So, I added a more generic method. This only works for x86 for now,
> > but it has a #ifdef to keep other archs working until it implements
> > this as well. I added a nmi_irq_incr which is called by __do_IRQ in
> > the generic code. This is what is used in the NMI code to determine
> > if the CPU has locked up. This way we don't have to worry about what
> > resource we are using for timers.
>
> this will be useful for tickless stuff - but right now 'no APIC timer
> irq for 5 seconds' is a 'must not happen'.
>

I added the following patch:

Index: rt_linux_ernie/arch/i386/kernel/nmi.c
===================================================================
--- rt_linux_ernie.orig/arch/i386/kernel/nmi.c 2005-10-25 12:58:18.000000000 -0400
+++ rt_linux_ernie/arch/i386/kernel/nmi.c 2005-10-25 14:52:38.000000000 -0400
@@ -538,6 +538,8 @@
* wait a few IRQs (5 seconds) before doing the oops ...
*/
alert_counter[cpu]++;
+ if (alert_counter[cpu] && !(alert_counter[cpu] % (nmi_hz)))
+ early_printk("nmi: jiffies=%ld\n",jiffies);
if (alert_counter[cpu] && !(alert_counter[cpu] % (5*nmi_hz))) {
int i;


And this is my output:

Adding 554200k swap on /dev/hda5. Priority:-1 extents:1 across:554200k
EXT3 FS on hda1, internal journal
nmi: jiffies=-289143
nmi: jiffies=-272171
nmi: jiffies=-270269
nmi: jiffies=-268754
nmi: jiffies=-267630
NMI watchdog detected lockup on CPU#0 (50000/50000)

>>>> my comments

The above shows that the jiffies are incrementing. So what reason would
the apic timer be going off? Also, this is just a UP machine.

<<<< end my comments

Pid: 1378, comm: uname
EIP: 0060:[<c01f5592>] CPU: 0
EIP is at clear_user+0x32/0x50
EFLAGS: 00010202 Not tainted (2.6.14-rc5-rt6)
EAX: 00000000 EBX: 00000000 ECX: 0000009c EDX: 00000000
ESI: fffffff2 EDI: b7f10d90 EBP: cf055e28 DS: 007b ES: 007b
CR0: 8005003b CR2: b7f10750 CR3: 0f7d8000 CR4: 00000690
[<c01010ba>] show_regs+0x14a/0x174 (36)
[<c010ff13>] nmi_watchdog_tick+0x1a3/0x230 (56)
[<c0104b2b>] default_do_nmi+0x6b/0x160 (52)
[<c0104c5a>] do_nmi+0x2a/0x30 (20)
[<c0103a46>] nmi_stack_correct+0x1d/0x22 (68)
[<c018e813>] padzero+0x33/0x40 (16)
[<c018eecd>] load_elf_interp+0x22d/0x2e0 (72)
[<c018fca5>] load_elf_binary+0xb85/0xd30 (188)
[<c016d70a>] search_binary_handler+0xaa/0x2b0 (44)
[<c016da99>] do_execve+0x189/0x230 (36)
[<c0101a92>] sys_execve+0x42/0xa0 (40)
[<c0102f01>] syscall_call+0x7/0xb (-8116)
NMI Watchdog detected LOCKUP on CPU0, eip c01f5592, registers:
Modules linked in:
CPU: 0
EIP: 0060:[<c01f5592>] Not tainted VLI
EFLAGS: 00010202 (2.6.14-rc5-rt6)
EIP is at clear_user+0x32/0x50
eax: 00000000 ebx: 00000000 ecx: 0000009c edx: 00000000
esi: fffffff2 edi: b7f10d90 ebp: cf055e28 esp: cf055e20
ds: 007b es: 007b ss: 0068 preempt: 00000001
Process uname (pid: 1378, threadinfo=cf054000 task=cfed9150 stack_left=7660 wor)
Stack: cfc25120 cfc25594 cf055e38 c018e813 b7f10750 000008b0 cf055e80 c018eecd
b7f10750 b7f0fcc0 cfc25080 00000003 00000812 00015cc0 cfc25060 b7efa000
00000001 b7f107f8 00000006 cfc25088 b7f10750 cfc25560 0804b2ec cec5d720
Call Trace:
[<c0103d3b>] show_stack+0xab/0xf0 (28)
[<c0103f1a>] show_registers+0x17a/0x230 (56)
[<c0104a6e>] die_nmi+0x9e/0xf0 (52)
[<c010ff37>] nmi_watchdog_tick+0x1c7/0x230 (56)
[<c0104b2b>] default_do_nmi+0x6b/0x160 (52)
[<c0104c5a>] do_nmi+0x2a/0x30 (20)
[<c0103a46>] nmi_stack_correct+0x1d/0x22 (68)
[<c018e813>] padzero+0x33/0x40 (16)
[<c018eecd>] load_elf_interp+0x22d/0x2e0 (72)
[<c018fca5>] load_elf_binary+0xb85/0xd30 (188)
[<c016d70a>] search_binary_handler+0xaa/0x2b0 (44)
[<c016da99>] do_execve+0x189/0x230 (36)
[<c0101a92>] sys_execve+0x42/0xa0 (40)
[<c0102f01>] syscall_call+0x7/0xb (-8116)
Code: 7c 24 04 8b 7d 08 89 1c 24 8b 4d 0c a1 48 e4 33 c0 89 fb 01 cb 19 d2 39 5
console shuts up ...
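
A note on reading the jiffies values in the output above. They are negative because the debug patch prints an unsigned counter with %ld, and on 32-bit kernels jiffies starts at -300*HZ so that wraparound bugs show up five minutes after boot. The sketch below is a user-space helper, not kernel code; it assumes HZ=1000 (the 1 msec tick Ingo mentions), and the function names are made up for illustration. It turns two printed samples into elapsed ticks.

```c
#include <assert.h>
#include <stdint.h>

#define HZ 1000  /* assumption: 1 ms jiffy tick, as mentioned in the thread */

/*
 * Ticks elapsed between two jiffies samples from a 32-bit kernel.
 * Unsigned subtraction stays correct even if the counter wrapped
 * between the two samples.
 */
static uint32_t jiffies_delta(int32_t earlier, int32_t later)
{
    return (uint32_t)later - (uint32_t)earlier;
}

/* Ticks since boot, given that the counter starts at -300*HZ. */
static uint32_t jiffies_since_boot(int32_t sample)
{
    const uint32_t initial = (uint32_t)(-300 * HZ);

    assert(HZ > 0);
    return (uint32_t)sample - initial;
}
```

Applied to the five samples in the log, jiffies_delta() gives gaps of 16972, 1902, 1515 and 1124 ticks, so the tick interrupt kept running for roughly 21.5 seconds of wall time (at the assumed HZ) while the watchdog was counting toward its lockup report.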

-- Steve


2005-10-25 19:37:32

by Thomas Gleixner

Subject: Re: 2.6.14-rc5-rt6 -- False NMI lockup detects

On Tue, 2005-10-25 at 15:24 -0400, Steven Rostedt wrote:

> Isn't the jiffy tick implemented with the PIT when possible?

Yes, PIT is the tick event source.

> So the apic is only used when a timer is needed.

And it might be used for profiling, which is also tick bound in the hrt
case at the moment.

> Also note that this "lockup"
> happens on boot up while things are being initialized, so not many
> things may be using the timer.

Can you send me a log of the boot messages please ?

tglx


2005-10-25 20:00:57

by George Anzinger

Subject: Re: 2.6.14-rc5-rt6 -- False NMI lockup detects

Steven Rostedt wrote:
> On Tue, 2005-10-25 at 16:28 +0200, Ingo Molnar wrote:
>
>>* Steven Rostedt <[email protected]> wrote:
>>
>>
>>>Hi Ingo and Thomas,
>>>
>>>On some of my machines, I've been experiencing false NMI lockups.
>>>This usually happens on slower machines, and taking a look into this,
>>>it seems to be due to a short time where no processes are using
>>>timers, and the ktimer interrupts aren't needed. So the APIC timer,
>>>which now is used only for the ktimers, has a five second pause, and
>>>causes the NMI to go off. The NMI uses the apic timer to determine
>>>lockups.
>>
>>this would be a bug - the jiffy tick should be processed every 1 msec,
>>regardless of whether there are any ktimers pending. (in the future we
>>want to use a special ktimer for the jiffy tick, but that's not
>>implemented yet.)
>>
>
>
> Isn't the jiffy tick implemented with the PIT when possible? So the apic
> is only used when a timer is needed. Also note that this "lockup"
> happens on boot up while things are being initialized, so not many
> things may be using the timer.

Somewhere in the not too distant past the NMI watchdog was moved from the PIT tick to the APIC
timer. Might want to move it back, at least for now...
>
> Also, the machine doesn't lock up. It continues happily along at normal
> speed. It's only a 366 MHz machine.
>
> At my customer's site (which I'm no longer at), my test machine (2GHz)
> never showed this, but an equal machine that my customer had showed this
> on every boot up. The only difference between the two machines was the
> other one had a 1GHz processor. These were both running on modified -rt
> kernels.
>
>
>
>>>So, I added a more generic method. This only works for x86 for now,
>>>but it has a #ifdef to keep other archs working until it implements
>>>this as well. I added a nmi_irq_incr which is called by __do_IRQ in
>>>the generic code. This is what is used in the NMI code to determine
>>>if the CPU has locked up. This way we don't have to worry about what
>>>resource we are using for timers.
>>
>>this will be useful for tickless stuff - but right now 'no APIC timer
>>irq for 5 seconds' is a 'must not happen'.
>>
>
>
> I added the following patch:
>
> Index: rt_linux_ernie/arch/i386/kernel/nmi.c
> ===================================================================
> --- rt_linux_ernie.orig/arch/i386/kernel/nmi.c 2005-10-25 12:58:18.000000000 -0400
> +++ rt_linux_ernie/arch/i386/kernel/nmi.c 2005-10-25 14:52:38.000000000 -0400
> @@ -538,6 +538,8 @@
> * wait a few IRQs (5 seconds) before doing the oops ...
> */
> alert_counter[cpu]++;
> + if (alert_counter[cpu] && !(alert_counter[cpu] % (nmi_hz)))
> + early_printk("nmi: jiffies=%ld\n",jiffies);
> if (alert_counter[cpu] && !(alert_counter[cpu] % (5*nmi_hz))) {
> int i;
>
>
> And this is my output:
>
> Adding 554200k swap on /dev/hda5. Priority:-1 extents:1 across:554200k
> EXT3 FS on hda1, internal journal
> nmi: jiffies=-289143
> nmi: jiffies=-272171
> nmi: jiffies=-270269
> nmi: jiffies=-268754
> nmi: jiffies=-267630
> NMI watchdog detected lockup on CPU#0 (50000/50000)
>
>
>>>>>my comments
>
>
> The above shows that the jiffies are incrementing. So what reason would
> the apic timer be going off? Also, this is just a UP machine.
>
> <<<< end my comments
>
> Pid: 1378, comm: uname
> EIP: 0060:[<c01f5592>] CPU: 0
> EIP is at clear_user+0x32/0x50
> EFLAGS: 00010202 Not tainted (2.6.14-rc5-rt6)
> EAX: 00000000 EBX: 00000000 ECX: 0000009c EDX: 00000000
> ESI: fffffff2 EDI: b7f10d90 EBP: cf055e28 DS: 007b ES: 007b
> CR0: 8005003b CR2: b7f10750 CR3: 0f7d8000 CR4: 00000690
> [<c01010ba>] show_regs+0x14a/0x174 (36)
> [<c010ff13>] nmi_watchdog_tick+0x1a3/0x230 (56)
> [<c0104b2b>] default_do_nmi+0x6b/0x160 (52)
> [<c0104c5a>] do_nmi+0x2a/0x30 (20)
> [<c0103a46>] nmi_stack_correct+0x1d/0x22 (68)
> [<c018e813>] padzero+0x33/0x40 (16)
> [<c018eecd>] load_elf_interp+0x22d/0x2e0 (72)
> [<c018fca5>] load_elf_binary+0xb85/0xd30 (188)
> [<c016d70a>] search_binary_handler+0xaa/0x2b0 (44)
> [<c016da99>] do_execve+0x189/0x230 (36)
> [<c0101a92>] sys_execve+0x42/0xa0 (40)
> [<c0102f01>] syscall_call+0x7/0xb (-8116)
> NMI Watchdog detected LOCKUP on CPU0, eip c01f5592, registers:
> Modules linked in:
> CPU: 0
> EIP: 0060:[<c01f5592>] Not tainted VLI
> EFLAGS: 00010202 (2.6.14-rc5-rt6)
> EIP is at clear_user+0x32/0x50
> eax: 00000000 ebx: 00000000 ecx: 0000009c edx: 00000000
> esi: fffffff2 edi: b7f10d90 ebp: cf055e28 esp: cf055e20
> ds: 007b es: 007b ss: 0068 preempt: 00000001
> Process uname (pid: 1378, threadinfo=cf054000 task=cfed9150 stack_left=7660 wor)Stack: cfc25120 cfc25594 cf055e38 c018e813 b7f10750 000008b0 cf055e80 c018eecd
> b7f10750 b7f0fcc0 cfc25080 00000003 00000812 00015cc0 cfc25060 b7efa000
> 00000001 b7f107f8 00000006 cfc25088 b7f10750 cfc25560 0804b2ec cec5d720
> Call Trace:
> [<c0103d3b>] show_stack+0xab/0xf0 (28)
> [<c0103f1a>] show_registers+0x17a/0x230 (56)
> [<c0104a6e>] die_nmi+0x9e/0xf0 (52)
> [<c010ff37>] nmi_watchdog_tick+0x1c7/0x230 (56)
> [<c0104b2b>] default_do_nmi+0x6b/0x160 (52)
> [<c0104c5a>] do_nmi+0x2a/0x30 (20)
> [<c0103a46>] nmi_stack_correct+0x1d/0x22 (68)
> [<c018e813>] padzero+0x33/0x40 (16)
> [<c018eecd>] load_elf_interp+0x22d/0x2e0 (72)
> [<c018fca5>] load_elf_binary+0xb85/0xd30 (188)
> [<c016d70a>] search_binary_handler+0xaa/0x2b0 (44)
> [<c016da99>] do_execve+0x189/0x230 (36)
> [<c0101a92>] sys_execve+0x42/0xa0 (40)
> [<c0102f01>] syscall_call+0x7/0xb (-8116)
> Code: 7c 24 04 8b 7d 08 89 1c 24 8b 4d 0c a1 48 e4 33 c0 89 fb 01 cb 19 d2 39 5
> console shuts up ...
>
> -- Steve
>
>
>

--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-10-25 20:11:23

by Steven Rostedt

Subject: Re: 2.6.14-rc5-rt6 -- False NMI lockup detects

On Tue, 2005-10-25 at 13:00 -0700, George Anzinger wrote:
> Steven Rostedt wrote:
> >
> >
> > Isn't the jiffy tick implemented with the PIT when possible? So the apic
> > is only used when a timer is needed. Also note that this "lockup"
> > happens on boot up while things are being initialized, so not many
> > things may be using the timer.
>
> Somewhere in the not too distant past the NMI watchdog was moved from the PIT tick to the APIC
> timer. Might want to move it back, at least for now...
> >

Actually, I submitted a patch to Ingo (and I guess it would also work
with Thomas' kthrt) that takes the nmi tick out of the timer code
completely (at least for x86) and moves it to __do_IRQ. This way, it
would still detect when something is locked up with interrupts
disabled, but you don't have to worry about having the right timer
configured for it.

If you want to detect the timer being screwed up, that can be done on a
timer by timer basis, and most likely the soft lockup would find that
out too.

-- Steve


2005-10-26 11:27:57

by Steven Rostedt

Subject: Re: 2.6.14-rc5-rt6 -- False NMI lockup detects

On Tue, 2005-10-25 at 13:00 -0700, George Anzinger wrote:
> Steven Rostedt wrote:
> > On Tue, 2005-10-25 at 16:28 +0200, Ingo Molnar wrote:
> >
> >>* Steven Rostedt <[email protected]> wrote:
> >>
> >>
> >>>Hi Ingo and Thomas,
> >>>
> >>>On some of my machines, I've been experiencing false NMI lockups.
> >>>This usually happens on slower machines, and taking a look into this,
> >>>it seems to be due to a short time where no processes are using
> >>>timers, and the ktimer interrupts aren't needed. So the APIC timer,
> >>>which now is used only for the ktimers, has a five second pause, and
> >>>causes the NMI to go off. The NMI uses the apic timer to determine
> >>>lockups.
> >>
> >>this would be a bug - the jiffy tick should be processed every 1 msec,
> >>regardless of whether there are any ktimers pending. (in the future we
> >>want to use a special ktimer for the jiffy tick, but that's not
> >>implemented yet.)
> >>
> >
> >
> > Isn't the jiffy tick implemented with the PIT when possible? So the apic
> > is only used when a timer is needed. Also note that this "lockup"
> > happens on boot up while things are being initialized, so not many
> > things may be using the timer.
>
> Somewhere in the not too distant past the NMI watchdog was moved from the PIT tick to the APIC
> timer. Might want to move it back, at least for now...

Ingo,

So what's the fix here? The PIT is used for jiffies, and the NMI lockup
detection uses the apic timer to test against. So if no ktimer goes off
for 5 seconds (usually on slower machines), you get a false positive
for a lockup. Is it really a bug that the apic doesn't go off for 5
seconds?

So, do we move the NMI detect back to the PIT, or use the more generic
solution that I supplied? I actually prefer my generic solution because:
1. We don't need to worry about which timers we are using at the moment.
2. Like you said, it can be used later on with things like dynamic
ticks, or tickless solutions.

-- Steve


2005-11-01 11:32:58

by Ingo Molnar

Subject: Re: 2.6.14-rc5-rt6 -- False NMI lockup detects


* Steven Rostedt <[email protected]> wrote:

> Hi Ingo and Thomas,
>
> On some of my machines, I've been experiencing false NMI lockups.
> This usually happens on slower machines, and taking a look into this,
> it seems to be due to a short time where no processes are using
> timers, and the ktimer interrupts aren't needed. So the APIC timer,
> which now is used only for the ktimers, has a five second pause, and
> causes the NMI to go off. The NMI uses the apic timer to determine
> lockups.
>
> So, I added a more generic method. This only works for x86 for now,
> but it has a #ifdef to keep other archs working until it implements
> this as well. I added a nmi_irq_incr which is called by __do_IRQ in
> the generic code. This is what is used in the NMI code to determine
> if the CPU has locked up. This way we don't have to worry about what
> resource we are using for timers.

but e.g. the APIC timer doesn't go through do_IRQ(), it has its own
special IRQ entry code. The simple solution would be to also include the
IRQ#0 count in the NMI watchdog detection condition - i.e. something
like the patch below. Hm?

Ingo

Index: linux/arch/i386/kernel/nmi.c
===================================================================
--- linux.orig/arch/i386/kernel/nmi.c
+++ linux/arch/i386/kernel/nmi.c
@@ -521,7 +521,7 @@ void notrace nmi_watchdog_tick (struct p
*/
int sum, cpu = smp_processor_id();

- sum = per_cpu(irq_stat, cpu).apic_timer_irqs;
+ sum = per_cpu(irq_stat, cpu).apic_timer_irqs + kstat_irqs(0);

profile_tick(CPU_PROFILING, regs);
if (nmi_show_regs[cpu]) {
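
As a rough illustration of what the one-line change does, here is a small user-space sketch with made-up names (sim_cpu_stats, watchdog_sum_*, looks_stalled); it is not the kernel code. It models the case from this thread, where the PIT tick (IRQ#0) keeps arriving while the APIC timer stays silent between two watchdog NMIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two counters the patch adds together. */
struct sim_cpu_stats {
    unsigned int apic_timer_irqs; /* local APIC timer interrupts */
    unsigned int irq0_count;      /* IRQ#0, the PIT tick */
};

/* Liveness value before the patch: APIC timer interrupts only. */
static unsigned int watchdog_sum_apic_only(const struct sim_cpu_stats *s)
{
    assert(s);
    return s->apic_timer_irqs;
}

/* Liveness value after the patch: APIC timer plus IRQ#0. */
static unsigned int watchdog_sum_combined(const struct sim_cpu_stats *s)
{
    assert(s);
    return s->apic_timer_irqs + s->irq0_count;
}

/* The watchdog treats an unchanged sum between two NMIs as "no progress". */
static bool looks_stalled(unsigned int before, unsigned int after)
{
    return before == after;
}
```

If irq0_count advances between two samples while apic_timer_irqs does not, the APIC-only sum looks stalled and the combined sum does not, which is the false positive going away. Since both counters only ever increase, the combined sum moves whenever either source fires (ignoring counter wraparound).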

2005-11-01 17:42:11

by Steven Rostedt

Subject: Re: 2.6.14-rc5-rt6 -- False NMI lockup detects

On Tue, 2005-11-01 at 12:33 +0100, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > Hi Ingo and Thomas,
> >
> > On some of my machines, I've been experiencing false NMI lockups.
> > This usually happens on slower machines, and taking a look into this,
> > it seems to be due to a short time where no processes are using
> > timers, and the ktimer interrupts aren't needed. So the APIC timer,
> > which now is used only for the ktimers, has a five second pause, and
> > causes the NMI to go off. The NMI uses the apic timer to determine
> > lockups.
> >
> > So, I added a more generic method. This only works for x86 for now,
> > but it has a #ifdef to keep other archs working until it implements
> > this as well. I added a nmi_irq_incr which is called by __do_IRQ in
> > the generic code. This is what is used in the NMI code to determine
> > if the CPU has locked up. This way we don't have to worry about what
> > resource we are using for timers.
>
> but e.g. the APIC timer doesn't go through do_IRQ(), it has its own
> special IRQ entry code. The simple solution would be to also include the
> IRQ#0 count in the NMI watchdog detection condition - i.e. something
> like the patch below. Hm?
>
> Ingo
>
> Index: linux/arch/i386/kernel/nmi.c
> ===================================================================
> --- linux.orig/arch/i386/kernel/nmi.c
> +++ linux/arch/i386/kernel/nmi.c
> @@ -521,7 +521,7 @@ void notrace nmi_watchdog_tick (struct p
> */
> int sum, cpu = smp_processor_id();
>
> - sum = per_cpu(irq_stat, cpu).apic_timer_irqs;
> + sum = per_cpu(irq_stat, cpu).apic_timer_irqs + kstat_irqs(0);
>
> profile_tick(CPU_PROFILING, regs);
> if (nmi_show_regs[cpu]) {

:) I thought about doing that too, but I wanted a more generic solution.
I think I would have just put the nmi_irq_incr call in the apic
interrupt handler as well. That way we might some day be able to pull
the nmi_watchdog detect code out of the arch-specific code altogether.

-- Steve