2002-02-28 12:51:30

by Martin Wilck

Subject: PROBLEM: Timer interrupt lockup on HT machine


Hello,

we have found another problem with our HT ("Jackson") prototype
machines. In our reboot tests we found that sometimes the timer
interrupts stop. This happens typically after ~30-50 seconds.

From that time on, no more timer interrupts are encountered.
This happens only when the logical Jackson CPUs are enabled,
i.e. with the "acpismp=force" command line parameter, and approximately
at every 10th boot.

Another observation is that on the HT machines (whether or not the
above problem occurs) almost all interrupts seem to be handled by CPU 0.

CPU 1 gets a few, but in a ratio of about 1:10000 wrt CPU 0.
CPU 2 and 3 do not see any interrupts at all.
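One way to put numbers on this imbalance, assuming the counts are taken from /proc/interrupts (the message does not say how they were obtained), is to parse the per-CPU columns of an IRQ line. This is only a rough sketch: the function names are made up for illustration, and the parser simply stops at the first token that does not start with a digit.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the per-CPU counters from one /proc/interrupts line of the form
 *   "<irq>: <cpu0 count> <cpu1 count> ... <controller> <device>"
 * Stores at most max_cpus counters in counts[] and returns how many
 * were stored. Returns 0 for lines without a ':' (e.g. the CPU header).
 */
size_t parse_irq_counts(const char *line, unsigned long *counts,
                        size_t max_cpus)
{
    assert(line != NULL && counts != NULL);

    const char *p = strchr(line, ':');
    size_t n = 0;

    if (p == NULL)
        return 0;
    p++;                                /* skip the ':' */

    while (n < max_cpus) {
        while (isspace((unsigned char)*p))
            p++;
        if (!isdigit((unsigned char)*p))
            break;                      /* reached the controller name */
        char *end;
        counts[n++] = strtoul(p, &end, 10);
        p = end;
    }
    return n;
}

/* Fraction of the interrupts on this line that were handled by one CPU. */
double cpu_share(const unsigned long *counts, size_t n, size_t cpu)
{
    unsigned long total = 0;

    assert(counts != NULL && cpu < n);
    for (size_t i = 0; i < n; i++)
        total += counts[i];
    return total ? (double)counts[cpu] / (double)total : 0.0;
}
```

Calling parse_irq_counts() on the timer line and then cpu_share() for each CPU gives the per-CPU fraction directly, which is easier to compare between boots than the raw counters.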

Has anybody heard of these problems yet, and are workarounds available?
I am currently investigating the problem and will be happy to supply
more information if requested.

Martin

--
Martin Wilck Phone: +49 5251 8 15113
Fujitsu Siemens Computers Fax: +49 5251 8 20409
Heinz-Nixdorf-Ring 1 mailto:[email protected]
D-33106 Paderborn http://www.fujitsu-siemens.com/primergy






2002-02-28 17:27:30

by Mikael Pettersson

Subject: Re: PROBLEM: Timer interrupt lockup on HT machine

Martin Wilck writes:
> Another observation is that on the HT machines (whether or not the
> above problem occurs) almost all interrupts seem to be handled by CPU 0.
>
> CPU 1 gets a few, but in a ratio of about 1:10000 wrt CPU 0.
> CPU 2 and 3 do not see any interrupts at all.
>
> Has anybody heard of these problems yet, and are workarounds available?

I've heard about that IRQ imbalance on MP P4 Xeons from a person at Dell.
My understanding of it is that it's a consequence of the local APIC
prioritization changes Intel made in the P4 xAPIC design. (Study Intel's IA32
Vol 3 manual in detail and you'll see them.) Supposedly Ingo Molnar is
working on a solution.

/Mikael

2002-02-28 20:26:54

by Martin Wilck

Subject: Re: PROBLEM: Timer interrupt lockup on HT machine


I have looked into this some more.

It appears that the mask register of the 8259 PIC is corrupted.
On this machine, the timer is set up through setup_ExtINT_IRQ0_pin()
(8259A master -> IRQ broadcast), and that normally works well.

The correct value of the mask register (IO port 0x21) should be 0xfa
(Timer & cascade enabled).

When the timer lockup happens, the register value is 0x07, i.e.
completely wrong, and IRQ 0 is masked. I can use a user space
program to re-enable the timer interrupt, and all is fine again.
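For reference, the two register values can be decoded bit by bit: in the 8259A mask register a set bit masks the corresponding IRQ line, so 0xfa leaves only IRQ 0 (timer) and IRQ 2 (cascade) enabled, while 0x07 masks IRQ 0, 1 and 2. The user space program itself is not shown in this message; the following is only a sketch of how such a tool might look, assuming Linux on x86 with root privileges. The function names are hypothetical, and the port write is meant as a diagnostic workaround, not a fix.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PIC1_IMR_PORT 0x21      /* master 8259A interrupt mask register */

/* A set bit in the mask register masks (disables) that IRQ line. */
int pic_irq_masked(uint8_t imr, unsigned irq)
{
    assert(irq < 8);
    return (imr >> irq) & 1u;
}

/* Return the mask value with one IRQ line (0..7) unmasked. */
uint8_t pic_unmask_irq(uint8_t imr, unsigned irq)
{
    assert(irq < 8);
    return (uint8_t)(imr & ~(1u << irq));
}

/* Print which of IRQ 0..7 are enabled for a given mask value. */
void pic_describe(uint8_t imr)
{
    printf("IMR 0x%02x: enabled IRQs:", imr);
    for (unsigned irq = 0; irq < 8; irq++)
        if (!pic_irq_masked(imr, irq))
            printf(" %u", irq);
    printf("\n");
}

#if defined(__linux__) && (defined(__i386__) || defined(__x86_64__))
#include <sys/io.h>

/*
 * Hypothetical helper: read the master PIC mask register and clear
 * only bit 0 (timer). Needs root for ioperm(). Returns 0 on success,
 * -1 if port access could not be obtained.
 */
int reenable_timer_irq(void)
{
    if (ioperm(PIC1_IMR_PORT, 1, 1) != 0)
        return -1;

    uint8_t imr = inb(PIC1_IMR_PORT);
    outb(pic_unmask_irq(imr, 0), PIC1_IMR_PORT);
    return 0;
}
#endif
```

This sketch clears only the timer bit and leaves the other lines as found; writing the known-good value 0xfa back instead would also restore the rest of the mask.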

I am quite puzzled by this behaviour, because in the timer
operating mode on this system the mask register is never accessed
after the initial setup.

It is very obvious that this situation arises around the time of X
server startup.

Any clues?

Martin

--
Martin Wilck Phone: +49 5251 8 15113
Fujitsu Siemens Computers Fax: +49 5251 8 20409
Heinz-Nixdorf-Ring 1 mailto:[email protected]
D-33106 Paderborn http://www.fujitsu-siemens.com/primergy