2005-03-29 02:30:32

by Jack F Vogel

Subject: RFC: [PATCH] check nmi watchdog is broken

A bug against an xSeries system showed up recently
noting that the check_nmi_watchdog() test was failing.

I have been investigating it and discovered that, on both
i386 and x86_64, the recent change to make the routine
use cpu_callin_map has uncovered a problem.
Prior to that change, on an SMP box, the test was
trivially passing because all CPUs were found to
not yet be online. Now that the callin map is used, the
CPUs are found, so the test goes on to check their NMI
counters. Those have not yet begun to increment, so it
announces that a CPU is stuck and bails out.
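
To make the difference concrete, here is a rough userspace model of
the two behaviours. It is only a sketch, not the kernel code: the
struct, the function name, NR_CPUS_MODEL and the MIN_TICKS threshold
are all made up for illustration.

```c
#include <stdbool.h>

#define NR_CPUS_MODEL 4   /* hypothetical CPU count for the model */
#define MIN_TICKS     5   /* assumed threshold, chosen for illustration */

/* Hypothetical stand-ins for kernel state; not the real data structures. */
struct boot_state {
    bool online[NR_CPUS_MODEL];            /* models the "is online" test   */
    bool called_in[NR_CPUS_MODEL];         /* models cpu_callin_map         */
    unsigned int nmi_before[NR_CPUS_MODEL];/* NMI count sampled first       */
    unsigned int nmi_after[NR_CPUS_MODEL]; /* NMI count sampled after a wait */
};

/*
 * Count the CPUs the check would report as stuck.
 * use_callin selects which map decides whether a CPU is examined.
 */
static int count_stuck(const struct boot_state *s, bool use_callin)
{
    int stuck = 0;

    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
        bool examined = use_callin ? s->called_in[cpu] : s->online[cpu];

        if (!examined)
            continue;   /* a skipped CPU can never fail the test */

        if (s->nmi_after[cpu] - s->nmi_before[cpu] <= MIN_TICKS)
            stuck++;    /* counter did not advance enough */
    }
    return stuck;
}
```

With every CPU marked as called in but not online, and counters that
have not moved, the online-based variant reports nothing (it examines
no CPU at all) while the callin-based variant reports every CPU.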

On all the systems I have access to test, the announcement
of failure is also bogus: by the time you can log in
and check /proc/interrupts, the NMI count is happily
incrementing on all CPUs. It's just that the test is
being done too early.

I have tried moving the call to the test around a bit,
and it was always too early. I finally hit on this
proposed solution: delay the routine via a
late_initcall(). That seems like the right fix to
me.
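
The ordering argument can be modelled in userspace like this. Again,
this is only a sketch under my own assumptions: run_initcalls(),
start_nmi_source(), check_watchdog() and the counter are hypothetical
names, and the real initcall machinery is not reproduced here. The
only point is that the same check gives a different answer depending
on whether it runs before or after the NMI source has started ticking.

```c
#include <stddef.h>

typedef int (*initcall_t)(void);

static unsigned int model_nmi_count;  /* hypothetical NMI counter           */
static int watchdog_ok = -1;          /* -1: not run, 0: failed, 1: passed  */

/* Models the NMI source coming up and delivering some ticks. */
static int start_nmi_source(void)
{
    model_nmi_count += 10;
    return 0;
}

/* Models the watchdog test: passes only if the counter has moved. */
static int check_watchdog(void)
{
    watchdog_ok = (model_nmi_count > 0);
    return 0;
}

/* Run a list of init functions strictly in the order given. */
static void run_initcalls(const initcall_t *calls, size_t n)
{
    for (size_t i = 0; i < n; i++)
        calls[i]();
}

/* Reset the model so both orderings can be tried. */
static void reset_model(void)
{
    model_nmi_count = 0;
    watchdog_ok = -1;
}
```

Running { check_watchdog, start_nmi_source } leaves watchdog_ok at 0,
while running { start_nmi_source, check_watchdog } leaves it at 1,
which is the effect the deferral is meant to achieve.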

Please copy me on responses.

Regards,

Jack



Attachments:
(No filename) (1.02 kB)
watchdog.patch (4.19 kB)