On Mon, 2007-08-06 at 15:07 -0400, Gregory Haskins wrote:
> Hi Ingo,
> I think there is a latent race condition somewhere in the code. We
> find that -rt works on our 4-way (and under) systems, but have problems
> on our 8-ways.
>
> If you run without nmi_watchdog, the system will sometimes boot (but
> very, very slowly), and sometimes it will hit a softlockup. If you turn on
> nmi_watchdog, the system detects a hang (probably at the point where the
> system gets really slow without it). We notice no problems on the
> 4-ways.
>
> The system where we can reproduce this is a Dell 690 with dual 2GHz
> quad-core Xeon 5335s. I've attached some relevant info. Let me know if
> you need more.
Could you drop the following config options and test again?
#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
(btw, added LKML to the CC list ..)
Daniel
On Tue, 2007-08-07 at 10:03 -0700, Daniel Walker wrote:
> Could you drop the following config options and test again?
>
> #
> # Processor type and features
> #
> CONFIG_TICK_ONESHOT=y
> CONFIG_NO_HZ=y
> CONFIG_HIGH_RES_TIMERS=y
>
Will do.
I also have a patch that works around the issue, which I will forward
momentarily. There appears to be a deadlock either between two
task_rq_locks, or between a task_rq_lock and something else. The patch
changes the "double_lock_balance()" function to use a full DP algorithm
under contention. Technically, I think the original implementation was
correct, which is why my patch is really a workaround: I think it's just
plastering over the real issue. But in any case, with the patch the
8-way system is no longer slow and no longer hits nmi_watchdog hangs or
softlockups for me(*). Perhaps it will at least help in finding the
root cause.
Regards,
-Greg
(*) There are still issues on the 8-way once we get past this
scheduler problem, however.
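
For readers following the thread, here is a rough userspace sketch of what
an ordered double-lock under contention might look like. It is not the
patch mentioned above. It assumes "DP" refers to the dining-philosophers
approach of always taking locks in a fixed global order, and it uses
pthread mutexes in place of kernel spinlocks. The struct layout and the
name double_lock_ordered() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-CPU runqueue; only the lock matters. */
struct rq {
    pthread_mutex_t lock;
};

/*
 * Caller must already hold this_rq->lock. On return both locks are held.
 * Returns 1 if this_rq->lock was dropped and retaken (the caller must then
 * revalidate anything it read under that lock), 0 otherwise.
 */
static int double_lock_ordered(struct rq *this_rq, struct rq *busiest)
{
    assert(this_rq != busiest);

    /* Fast path: trylock never blocks, so it cannot deadlock. */
    if (pthread_mutex_trylock(&busiest->lock) == 0)
        return 0;

    /*
     * Contended: release what we hold, then take both locks in address
     * order so that every thread agrees on the same global ordering.
     */
    pthread_mutex_unlock(&this_rq->lock);
    if ((uintptr_t)busiest < (uintptr_t)this_rq) {
        pthread_mutex_lock(&busiest->lock);
        pthread_mutex_lock(&this_rq->lock);
    } else {
        pthread_mutex_lock(&this_rq->lock);
        pthread_mutex_lock(&busiest->lock);
    }
    return 1;
}
```

The difference from a conditional scheme is that the contended path here
always drops the held lock before blocking, so a thread never sleeps on
one runqueue lock while holding another out of order. The cost is that
callers have to recheck their state whenever the function returns 1.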