2007-08-07 17:08:33

by Daniel Walker

Subject: Re: Hang on 8-way with 2.6.23-rc2-rt2

On Mon, 2007-08-06 at 15:07 -0400, Gregory Haskins wrote:
> Hi Ingo,
> I think there is a latent race condition somewhere in the code. We
> find that -rt works on our 4-way (and under) systems, but have problems
> on our 8-ways.
>
> If you run without nmi_watchdog, the system will sometimes boot (but
> very, very slowly), and sometimes it will softlockup. If you turn on
> nmi_watchdog, the system detects a hang (probably at the point where the
> system gets really slow without it). We notice no problems on the
> 4-ways.
>
> The system where we can reproduce this is a Dell 690 with dual 2GHz
> Quad-core Xeon 5335s. I've attached some relevant info. Let me know if
> you need more.

Could you drop the following config options and test again?

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

(btw, added LKML to the CC list ..)

Daniel


2007-08-07 17:30:00

by Gregory Haskins

Subject: Re: Hang on 8-way with 2.6.23-rc2-rt2

On Tue, 2007-08-07 at 10:03 -0700, Daniel Walker wrote:

> Could you drop the following config options and test again?
>
> #
> # Processor type and features
> #
> CONFIG_TICK_ONESHOT=y
> CONFIG_NO_HZ=y
> CONFIG_HIGH_RES_TIMERS=y
>

Will do.

I have a patch that works around the issue too, which I will forward
momentarily. It appears as though there is a deadlock either between
two task_rq_locks, or between a task_rq_lock and something else. The
patch I wrote changes the "double_lock_balance()" function to a full DP
algorithm under contention. Technically, I think the original
implementation was correct, which is why my patch is really a workaround:
I think it's just plastering over the real issue. But in any case, the
8-way system is no longer slow, and I no longer see nmi_watchdog hangs or
softlockups(*). Perhaps it will at least help in finding the root cause.

Regards,
-Greg

(*) There are still issues on the 8-way once we get past this
scheduler problem, however.
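
For readers following the locking discussion above, here is a minimal
userspace sketch of one common way to take a second lock while already
holding a first one without risking an ABBA deadlock: try the second
lock, and if that fails and the two are in the wrong order, back off and
retake both in address order. This is not the kernel's
double_lock_balance() and not the patch mentioned above; "struct
runqueue" and "double_lock()" are hypothetical names, and pthread
mutexes stand in for the kernel's runqueue spinlocks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-CPU runqueue; not the kernel's struct rq. */
struct runqueue {
    pthread_mutex_t lock;
    int nr_running;
};

/*
 * The caller already holds held->lock and also wants wanted->lock.
 *
 * Returns 1 if held->lock had to be dropped and retaken (the caller must
 * then revalidate anything it read under that lock), 0 otherwise.
 */
static int double_lock(struct runqueue *held, struct runqueue *wanted)
{
    assert(held != wanted);

    /* Fast path: the second lock is free, so take it without blocking. */
    if (pthread_mutex_trylock(&wanted->lock) == 0)
        return 0;

    if ((uintptr_t)wanted < (uintptr_t)held) {
        /*
         * Blocking now would mean waiting on a lower-address lock while
         * holding a higher one. Back off and take both lowest-first.
         */
        pthread_mutex_unlock(&held->lock);
        pthread_mutex_lock(&wanted->lock);
        pthread_mutex_lock(&held->lock);
        return 1;
    }

    /* Already in address order, so blocking here cannot close a cycle. */
    pthread_mutex_lock(&wanted->lock);
    return 0;
}
```

If every path that blocks on two such locks follows the same address
order, no two threads can each hold one lock while waiting for the
other. A hang that persists despite this ordering would point to a
third lock or to a caller that takes the locks some other way.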