Subject: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

After some nondeterministic amount of time (less than 2 hours),
every application that relies on time gets locked up.
If I do a 'cat /proc/interrupts' I can see that the timer and NMI
interrupt counters no longer increase.
My base configuration is: Asus CUR-DLS, dual PIII 933MHz, 2GB RDRAM
~ > uname -a
Linux pc8-118 2.4.0-test10 #1 SMP Thu Nov 16 13:18:57 CET 2000 i686 unknown
~ > cat /proc/interrupts
           CPU0       CPU1
  0:     143295     143298    IO-APIC-edge  timer      <<< This one is stuck
  1:       3867       3275    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          0          1    IO-APIC-edge  rtc
 12:      46469      51592    IO-APIC-edge  PS/2 Mouse
 13:          0          0          XT-PIC  fpu
 17:         11         16   IO-APIC-level  aic7xxx
 20:      14764      14294   IO-APIC-level  eth0
 24:       1917       1855   IO-APIC-level  sym53c8xx
 25:         15         15   IO-APIC-level  sym53c8xx
NMI:     286482     286482                             <<< stuck
LOC:     286422     286420                             <<< stuck
ERR:          0
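
One way to confirm that a counter is really stuck, rather than just slow, is to compare two snapshots of /proc/interrupts taken some time apart. The Python sketch below does that. The function names (parse_interrupts, stuck_sources) and the default set of watched labels ("0" for the timer, plus NMI and LOC) are illustrative assumptions, not part of any kernel tool.

```python
def parse_interrupts(text):
    """Return {label: [per-CPU counts]} parsed from /proc/interrupts text."""
    counts = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # header line ("CPU0 CPU1 ...") has no label
        label, rest = line.split(":", 1)
        values = []
        for token in rest.split():
            if not token.isdigit():
                break  # reached the controller / device name columns
            values.append(int(token))
        if values:
            counts[label.strip()] = values
    return counts


def stuck_sources(before, after, watch=("0", "NMI", "LOC")):
    """List watched labels whose per-CPU counts are identical in both snapshots."""
    first = parse_interrupts(before)
    second = parse_interrupts(after)
    return [k for k in watch if k in first and first[k] == second.get(k)]
```

To use it, read /proc/interrupts into a string, wait, read it again, and pass both strings to stuck_sources. On a machine in the state described here the wait should be done by hand (for example by waiting for a key press), since anything like sleep never returns once the timer has stopped.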

Nothing in syslogs indicates a panic nor an oops. The nmi watchdog is
activated and reports nothing.

The problem does not occur if nosmp is passed to the kernel.
The problem also occurs with a 2.2.17 kernel.

What does not work:
- X becomes non-functional
- reboot will not reboot and emits an infinite beep on the internal
speaker
- sleep 1 never returns, but can be interrupted with C-c
- Magic SysRq will neither sync nor unmount filesystems
- date does not advance
What works:
- network and all related services
- shell (becomes sluggish from time to time)
- console login
- hwclock advances correctly.

Here are the first boot messages from dmesg:
0 0 0 0 1 1 59
08 003 03 0 0 0 0 0 1 1 61
09 003 03 1 1 0 1 0 1 1 69
0a 000 00 1 0 0 0 0 0 0 00
0b 000 00 1 0 0 0 0 0 0 00
0c 003 03 0 0 0 0 0 1 1 71
0d 000 00 1 0 0 0 0 0 0 00
0e 000 00 1 0 0 0 0 0 0 00
0f 000 00 1 0 0 0 0 0 0 00

IO APIC #1......
.... register #00: 01000000
....... : physical APIC id: 01
.... register #01: 000F0011
....... : max redirection entries: 000F
....... : IO APIC version: 0011
.... register #02: 03000000
....... : arbitration: 03
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 003 03 1 1 0 1 0 1 1 79
01 003 03 1 1 0 1 0 1 1 81
02 000 00 1 0 0 0 0 0 0 00
03 000 00 1 0 0 0 0 0 0 00
04 003 03 1 1 0 1 0 1 1 89
05 000 00 1 0 0 0 0 0 0 00
06 000 00 1 0 0 0 0 0 0 00
07 000 00 1 0 0 0 0 0 0 00
08 003 03 1 1 0 1 0 1 1 91
09 003 03 1 1 0 1 0 1 1 99
0a 000 00 1 0 0 0 0 0 0 00
0b 000 00 1 0 0 0 0 0 0 00
0c 000 00 1 0 0 0 0 0 0 00
0d 000 00 1 0 0 0 0 0 0 00
0e 000 00 1 0 0 0 0 0 0 00
0f 000 00 1 0 0 0 0 0 0 00
IRQ to pin mappings:
IRQ0 -> 2
IRQ1 -> 1
IRQ3 -> 3
IRQ4 -> 4
IRQ6 -> 6
IRQ7 -> 7
IRQ8 -> 8
IRQ9 -> 9
IRQ12 -> 12
IRQ16 -> 0
IRQ17 -> 1
IRQ20 -> 4
IRQ24 -> 8
IRQ25 -> 9
.................................... done.
calibrating APIC timer ...
..... CPU clock speed is 933.5232 MHz.
..... host bus clock speed is 133.3603 MHz.
cpu: 0, clocks: 1333603, slice: 444534
CPU0<T0:1333600,T1:889056,D:10,S:444534,C:1333603>
cpu: 1, clocks: 1333603, slice: 444534
CPU1<T0:1333600,T1:444528,D:4,S:444534,C:1333603>
checking TSC synchronization across CPUs: passed.
Setting commenced=1, go go go
PCI: PCI BIOS revision 2.10 entry at 0xf0aa0, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: ServerWorks host bridge: secondary bus 00
PCI: ServerWorks host bridge: secondary bus 01
PCI: Using IRQ router default [1166/0200] at 00:0f.0
PCI->APIC IRQ transform: (B1,I5,P0) -> 24
PCI->APIC IRQ transform: (B1,I5,P1) -> 25
PCI->APIC IRQ transform: (B0,I2,P0) -> 20
PCI->APIC IRQ transform: (B0,I3,P0) -> 16
PCI->APIC IRQ transform: (B0,I4,P0) -> 17
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
P6 Microcode Update Driver v1.07
Starting kswapd v1.8
pty: 256 Unix98 ptys configured
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
NTFS version 000607
Serial driver version 5.02 (2000-08-09) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
Real Time Clock Driver v1.10d
Software Watchdog Timer: 0.05, timer margin: 60 sec
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.33 $ 2000/05/24 Modified by Andrey V. Savochkin <[email protected]> and others
eth0: OEM i82557/i82558 10/100 Ethernet, 00:E0:18:02:5E:D5, IRQ 20.
Board assembly 668081-002, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
Receiver lock-up workaround activated.
SCSI subsystem driver Revision: 1.00
(scsi0) <Adaptec AIC-7850 SCSI host adapter> found at PCI 0/4/0
(scsi0) Narrow Channel, SCSI ID=7, 3/255 SCBs
(scsi0) Downloading sequencer code... 415 instructions downloaded
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
<Adaptec AIC-7850 SCSI host adapter>
(scsi0:0:4:0) Synchronous at 10.0 Mbyte/sec, offset 15.
Vendor: PIONEER Model: DVD-ROM DVD-304 Rev: 1.03
Type: CD-ROM ANSI SCSI revision: 02
sym53c8xx: at PCI bus 1, device 5, function 0
sym53c8xx: 53c896 detected with Symbios NVRAM
sym53c8xx: at PCI bus 1, device 5, function 1
sym53c8xx: 53c896 detected with Symbios NVRAM
sym53c896-0: rev 0x7 on pci bus 1 device 5 function 0 irq 24
sym53c896-0: Symbios format NVRAM, ID 7, Fast-40, Parity Checking
sym53c896-0: initial SCNTL3/DMODE/DCNTL/CTEST3/4/5 = (hex) 07/4e/a0/01/00/24
sym53c896-0: final SCNTL3/DMODE/DCNTL/CTEST3/4/5 = (hex) 07/4e/80/01/08/24
sym53c896-0: on-chip RAM at 0xf7000000
sym53c896-0: resetting, command processing suspended for 2 seconds
sym53c896-0: restart (scsi reset).
sym53c896-0: enabling clock multiplier
sym53c896-0: handling phase mismatch from SCRIPTS.
sym53c896-0: Downloading SCSI SCRIPTS.
sym53c896-1: rev 0x7 on pci bus 1 device 5 function 1 irq 25
sym53c896-1: Symbios format NVRAM, ID 7, Fast-40, Parity Checking
sym53c896-1: initial SCNTL3/DMODE/DCNTL/CTEST3/4/5 = (hex) 07/4e/a0/01/00/24
sym53c896-1: final SCNTL3/DMODE/DCNTL/CTEST3/4/5 = (hex) 07/4e/80/01/08/24
sym53c896-1: on-chip RAM at 0xf6000000
sym53c896-1: resetting, command processing suspended for 2 seconds
sym53c896-1: restart (scsi reset).
sym53c896-1: enabling clock multiplier
sym53c896-1: handling phase mismatch from SCRIPTS.
sym53c896-1: Downloading SCSI SCRIPTS.
scsi1 : sym53c8xx - version 1.6b
scsi2 : sym53c8xx - version 1.6b
sym53c896-0: command processing resumed
Vendor: IBM Model: DDYS-T18350N Rev: S93E
Type: Direct-Access ANSI SCSI revision: 03
sym53c896-1: command processing resumed
sym53c896-0-<0,0>: tagged command queue depth set to 4
Detected scsi disk sda at scsi1, channel 0, id 0, lun 0
sym53c896-0-<0,0>: wide msgout: 1-2-3-1.
sym53c896-0-<0,0>: wide msgin: 1-2-3-1.
sym53c896-0-<0,0>: wide: wide=1 chg=0.
sym53c896-0-<0,*>: WIDE SCSI (16 bit) enabled.
sym53c896-0-<0,0>: wide msgout: 1-2-3-1.
sym53c896-0-<0,0>: wide msgin: 1-2-3-1.
sym53c896-0-<0,0>: wide: wide=1 chg=0.
sym53c896-0-<0,0>: sync msgout: 1-3-1-a-1f.
sym53c896-0-<0,0>: sync msg in: 1-3-1-a-1f.
sym53c896-0-<0,0>: sync: per=10 scntl3=0x90 scntl4=0x0 ofs=31 fak=0 chg=0.
sym53c896-0-<0,*>: FAST-40 WIDE SCSI 80.0 MB/s (25 ns, offset 31)
SCSI device sda: 35843670 512-byte hdwr sectors (18352 MB)
Partition check:
sda: sda1 sda2 sda3 sda4
Detected scsi CD-ROM sr0 at scsi0, channel 0, id 4, lun 0
sr0: scsi3-mmc drive: 0x/0x cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.11
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP
IP: routing cache hash table of 16384 buckets, 128Kbytes
TCP: Hash tables configured (established 131072 bind 65536)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
kmem_create: Forcing size word alignment - nfs_fh
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 200k freed

Thank you in advance for your help.

--
| Benjamin Monate | mailto:[email protected] |
| LRI - Bât. 490
| Université de Paris-Sud
| F-91405 ORSAY Cedex


Subject: Re: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

In his message of Sat 18 November, Andrew Morton writes :
> Try booting with the `noapic' option. Looks like your APIC
> is getting itself unprogrammed. Check that you're not
> overclocked and not over temperature.

Booting with noapic did not improve anything.
The processor is not supposed to be overclocked. How can I be sure of
that?

Further investigation showed that the problem occurs only when
XFree86 4.0.1 is running with an SMP kernel. XFree86 3.3.6 is OK. Could this
be a bug in X? I thought that the kernel should prevent such a bug
from locking up the computer.

Thank you again for your help.
--
| Benjamin Monate | mailto:[email protected] |
| LRI - Bât. 490
| Université de Paris-Sud | phoneto: +33 1 69 15 42 32 |
| F-91405 ORSAY Cedex | faxto: +33 1 69 15 65 86 |

2000-11-20 17:59:30

by FORT David

Subject: Re: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

Benjamin Monate wrote:

> In his message of Sat 18 November, Andrew Morton writes :
> > Try booting with the `noapic' option. Looks like your APIC
> > is getting itself unprogrammed. Check that you're not
> > overclocked and not over temperature.
>
> Booting with noapic did not improve anything.
> The processor is not supposed to be overclocked. How can I be sure of
> that?
>
> Further investigation showed that the problem occurs only when
> XFree86 4.0.1 is running with an SMP kernel. XFree86 3.3.6 is OK. Could this
> be a bug in X? I thought that the kernel should prevent such a bug
> from locking up the computer.
>
> Thank you again for your help.
> --
>

What's your video card? Not something running with closed-source drivers
(namely GeForce)?
The kernel cannot prevent drivers from locking the PCI/AGP bus.

--
%-------------------------------------------------------------------------%
% FORT David, %
% 7 avenue de la morvandière 0240726275 %
% 44470 Thouare, France [email protected] %
% ICU:78064991 AIM: enlighted popo [email protected] %
%--LINUX-HTTPD-PIOGENE----------------------------------------------------%
% -datamining <-/ | .~. %
% -networking/flashed PHP3 coming soon | /V\ L I N U X %
% -opensource | // \\ >Fear the Penguin< %
% -GNOME/enlightenment/GIMP | /( )\ %
% feel enlighted.... | ^^-^^ %
% http://ibonneace.dnsalias.org/ when connected %
%-------------------------------------------------------------------------%



Subject: Re: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

In his message of Mon 20 November, Fort David writes:
> > Further investigation showed that the problem occurs only when
> > XFree86 4.0.1 is running with an SMP kernel. XFree86 3.3.6 is OK. Could this
> > be a bug in X? I thought that the kernel should prevent such a bug
> > from locking up the computer.
> What's your video card? Not something running with closed-source drivers
> (namely GeForce)?
> The kernel cannot prevent drivers from locking the PCI/AGP bus.

My video card is a Matrox G200SD; the source is open. I will try another
video card tomorrow. I do understand that the kernel cannot prevent a
driver from locking the PCI bus, but the timer lockup does not look like
a lock of the PCI bus (as SCSI and NIC are still working). Only the
timer interrupts and NMI seem to be stuck: can a driver cause
something so "low-level"?

By the way, the processors are not overclocked.
I don't know how to check whether something is overheating. Both fans seem okay
to me.

Do you know of any way to somehow "restart" (or is it unmask?) the timer IRQ
and the NMI? This would at least avoid some long fsck on
reboot...

Many thanks

--
| Benjamin Monate | mailto:[email protected] |
| LRI - Bât. 490
| Université de Paris-Sud
| F-91405 ORSAY Cedex

2000-11-20 18:44:01

by Alan Cox

Subject: Re: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

> a lock of the PCI bus (as SCSI and NIC are still working). Only the
> timer interrupts and NMI seem to be stuck : can a driver cause
> something so "lowlevel" ?

Something stopping the timers on the APIC I guess. But quite what or how I
don't know

2000-11-21 15:06:53

by Maciej W. Rozycki

Subject: Re: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

On Mon, 20 Nov 2000, Alan Cox wrote:

> > a lock of the PCI bus (as SCSI and NIC are still working). Only the
> > timer interrupts and NMI seem to be stuck : can a driver cause
> > something so "lowlevel" ?
>
> Something stopping the timers on the APIC I guess. But quite what or how I
> don't know

I guess not -- the timer interrupt and the NMI use different I/O APIC
inputs. If both are stuck, it's probably 8254 that gets reprogrammed. I
suppose XFree86 might be at fault -- does it happen with the NMI watchdog
disabled, either?

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +


Subject: Re: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

> I guess not -- the timer interrupt and the NMI use different I/O APIC
> inputs. If both are stuck, it's probably 8254 that gets reprogrammed. I
> suppose XFree86 might be at fault -- does it happen with the NMI watchdog
> disabled, either?
Yes. I just enabled the nmi watchdog to try to debug the problem. It did
not change anything.

About the 8254, the kernel log contains :

Nov 20 17:15:15 pc8-118 kernel: MP-BIOS bug: 8254 timer not connected to IO-APIC
Nov 20 17:15:15 pc8-118 kernel: ...trying to set up timer (IRQ0) through the 8259A ...
Nov 20 17:15:15 pc8-118 kernel: ..... (found pin 0) ...works.

But this does not seem to annoy the kernel.

Is there any way to restore the 8254 to a valid state without rebooting?


--
| Benjamin Monate | mailto:[email protected] |
| LRI - Bât. 490
| Université de Paris-Sud
| F-91405 ORSAY Cedex

2000-11-21 18:06:50

by Maciej W. Rozycki

Subject: Re: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

On Tue, 21 Nov 2000, Benjamin Monate wrote:

> About the 8254, the kernel log contains :
>
> Nov 20 17:15:15 pc8-118 kernel: MP-BIOS bug: 8254 timer not connected to IO-APIC
> Nov 20 17:15:15 pc8-118 kernel: ...trying to set up timer (IRQ0) through the 8259A ...
> Nov 20 17:15:15 pc8-118 kernel: ..... (found pin 0) ...works.
>
> But this does not seem to annoy the kernel.

But this message is printed when a workaround for certain early SMP EISA
boards gets activated. You shouldn't normally get it for anything newer
than P5/66 unless your MP-table is broken. Can you send me a dump of your
MP-table (just issue `dmesg -s 32768' after a bootstrap -- the table is at
the top).

> Is there any way to restore the 8254 to a valid state without rebooting?

Well, in this specific configuration, it may be either the 8254 timer or
the 8259 legacy PIC as when the workaround gets activated both timer IRQs
and NMIs go through the latter. It is certainly possible to reprogram
the chips but maybe we can find a way to avoid the lockup.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

Subject: Re: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

In his message of Tue 21 November, Maciej W. Rozycki writes:
> But this message is printed when a workaround for certain early SMP EISA
> boards gets activated. You shouldn't normally get it for anything newer
> than P5/66 unless your MP-table is broken. Can you send me a dump of your
> MP-table (just issue `dmesg -s 32768' after a bootstrap -- the table is at
> the top).
OK, here is the beginning of dmesg.

Linux version 2.4.0-test10 ([email protected]) (gcc version 2.95.2 20000220 (Debian GNU/Linux)) #1 SMP Thu Nov 16 13:18:57 CET 2000
BIOS-provided physical RAM map:
BIOS-e820: 000000000009f000 @ 0000000000000000 (usable)
BIOS-e820: 0000000000001000 @ 000000000009f000 (reserved)
BIOS-e820: 0000000000010000 @ 00000000000f0000 (reserved)
BIOS-e820: 000000007fefb000 @ 0000000000100000 (usable)
BIOS-e820: 0000000000004000 @ 000000007fffb000 (ACPI data)
BIOS-e820: 0000000000001000 @ 000000007ffff000 (ACPI NVS)
BIOS-e820: 0000000000010000 @ 00000000fec00000 (reserved)
BIOS-e820: 0000000000001000 @ 00000000fee00000 (reserved)
BIOS-e820: 0000000000080000 @ 00000000fff80000 (reserved)
1151MB HIGHMEM available.
Scan SMP from c0000000 for 1024 bytes.
Scan SMP from c009fc00 for 1024 bytes.
Scan SMP from c00f0000 for 65536 bytes.
found SMP MP-table at 000f5270
hm, page 000f5000 reserved twice.
hm, page 000f6000 reserved twice.
hm, page 000f4000 reserved twice.
hm, page 000f5000 reserved twice.
On node 0 totalpages: 524283
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 294907 pages.
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: OEM00000 Product ID: PROD00000000 APIC at: 0xFEE00000
Processor #3 Pentium(tm) Pro APIC version 17
Floating point unit present.
Machine Exception supported.
64 bit compare & exchange supported.
Internal APIC present.
Bootup CPU
Processor #0 Pentium(tm) Pro APIC version 17
Floating point unit present.
Machine Exception supported.
64 bit compare & exchange supported.
Internal APIC present.
Bus #0 is PCI
Bus #1 is PCI
Bus #2 is ISA
I/O APIC #2 Version 17 at 0xFEC00000.
I/O APIC #3 Version 17 at 0xFEC01000.
Int: type 3, pol 0, trig 0, bus 2, IRQ 00, APIC ID 2, APIC INT 00
Int: type 0, pol 0, trig 0, bus 2, IRQ 01, APIC ID 2, APIC INT 01
Int: type 0, pol 0, trig 0, bus 2, IRQ 00, APIC ID 2, APIC INT 02
Int: type 0, pol 0, trig 0, bus 2, IRQ 03, APIC ID 2, APIC INT 03
Int: type 0, pol 0, trig 0, bus 2, IRQ 04, APIC ID 2, APIC INT 04
Int: type 0, pol 0, trig 0, bus 2, IRQ 06, APIC ID 2, APIC INT 06
Int: type 0, pol 0, trig 0, bus 2, IRQ 07, APIC ID 2, APIC INT 07
Int: type 0, pol 0, trig 0, bus 2, IRQ 08, APIC ID 2, APIC INT 08
Int: type 0, pol 0, trig 0, bus 2, IRQ 0c, APIC ID 2, APIC INT 0c
Int: type 0, pol 3, trig 3, bus 0, IRQ 08, APIC ID 3, APIC INT 04
Int: type 0, pol 3, trig 3, bus 0, IRQ 0c, APIC ID 3, APIC INT 00
Int: type 0, pol 3, trig 3, bus 0, IRQ 10, APIC ID 3, APIC INT 01
Int: type 0, pol 3, trig 3, bus 2, IRQ 09, APIC ID 2, APIC INT 09
Int: type 0, pol 3, trig 3, bus 1, IRQ 14, APIC ID 3, APIC INT 08
Int: type 0, pol 3, trig 3, bus 1, IRQ 15, APIC ID 3, APIC INT 09
Lint: type 3, pol 1, trig 1, bus 2, IRQ 00, APIC ID ff, APIC LINT 00
Lint: type 1, pol 1, trig 1, bus 2, IRQ 00, APIC ID ff, APIC LINT 01
Processors: 2
mapped APIC to ffffe000 (fee00000)
mapped IOAPIC to ffffd000 (fec00000)
mapped IOAPIC to ffffc000 (fec01000)
Kernel command line: BOOT_IMAGE=new ro root=803 BOOT_FILE=/boot/vmlinuz-2.4.0-test2 nmi_watchdog=1
Initializing CPU#0
Detected 933.446 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1861.22 BogoMIPS
Memory: 2059660k/2097132k available (1537k kernel code, 37084k reserved, 89k data, 200k init, 1179628k highmem)
Dentry-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Buffer-cache hash table entries: 131072 (order: 7, 524288 bytes)
Page-cache hash table entries: 524288 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.36 (20000221) Richard Gooch ([email protected])
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU0: Intel Pentium III (Coppermine) stepping 03
per-CPU timeslice cutoff: 731.32 usecs.
Getting VERSION: 40011
Getting VERSION: 40011
Getting ID: 3000000
Getting ID: c000000
Getting LVT0: 8700
Getting LVT1: 400
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
CPU present map: 9
Booting processor 1/0 eip 2000
Setting warm reset code and vector.
1.
2.
3.
Asserting INIT.
Waiting for send to finish...
+Deasserting INIT.
Waiting for send to finish...
+#startup loops: 2.
Sending STARTUP #1.
After apic_write.
Startup point 1.
Waiting for send to finish...
+Initializing CPU#1
CPU#1 (phys ID: 0) waiting for CALLOUT
Sending STARTUP #2.
After apic_write.
Startup point 1.
Waiting for send to finish...
+After Startup.
Before Callout 1.
After Callout 1.
CALLIN, before setup_local_APIC().
masked ExtINT on CPU#1
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 1861.22 BogoMIPS
Stack at about c322bfbc
Intel machine check reporting enabled on CPU#1.
OK.
CPU1: Intel Pentium III (Coppermine) stepping 03
CPU has booted.
Before bogomips.
Total of 2 processors activated (3722.44 BogoMIPS).
Before bogocount - setting activated=1.
Boot done.
ENABLING IO-APIC IRQs
...changing IO-APIC physical APIC ID to 2 ... ok.
BIOS bug, IO-APIC#1 ID 3 is already used!...
... fixing up to 1. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 1 ... ok.
Synchronizing Arb IDs.
init IO_APIC IRQs
IO-APIC (apicid-pin) 2-0, 2-5, 2-10, 2-11, 2-13, 2-14, 2-15, 1-2, 1-3, 1-5, 1-6, 1-7, 1-10, 1-11, 1-12, 1-13, 1-14, 1-15 not connected.
..TIMER: vector=49 pin1=2 pin2=0
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...
..... (found pin 0) ...works.
activating NMI Watchdog ... done.
number of MP IRQ sources: 15.
number of IO-APIC #2 registers: 16.
number of IO-APIC #1 registers: 16.
testing the IO APIC.......................

IO APIC #2......
.... register #00: 02000000
....... : physical APIC id: 02
.... register #01: 000F0011
....... : max redirection entries: 000F
....... : IO APIC version: 0011
.... register #02: 00000000
....... : arbitration: 00
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 003 03 0 0 0 0 0 1 1 31
01 003 03 0 0 0 0 0 1 1 39
02 000 00 1 0 0 0 0 0 0 00
03 003 03 0 0 0 0 0 1 1 41
04 003 03 0 0 0 0 0 1 1 49
05 000 00 1 0 0 0 0 0 0 00
06 003 03 0 0 0 0 0 1 1 51
07 003 03 0 0 0 0 0 1 1 59
08 003 03 0 0 0 0 0 1 1 61
09 003 03 1 1 0 1 0 1 1 69
0a 000 00 1 0 0 0 0 0 0 00
0b 000 00 1 0 0 0 0 0 0 00
0c 003 03 0 0 0 0 0 1 1 71
0d 000 00 1 0 0 0 0 0 0 00
0e 000 00 1 0 0 0 0 0 0 00
0f 000 00 1 0 0 0 0 0 0 00

IO APIC #1......
.... register #00: 01000000
....... : physical APIC id: 01
.... register #01: 000F0011
....... : max redirection entries: 000F
....... : IO APIC version: 0011
.... register #02: 03000000
....... : arbitration: 03
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 003 03 1 1 0 1 0 1 1 79
01 003 03 1 1 0 1 0 1 1 81
02 000 00 1 0 0 0 0 0 0 00
03 000 00 1 0 0 0 0 0 0 00
04 003 03 1 1 0 1 0 1 1 89
05 000 00 1 0 0 0 0 0 0 00
06 000 00 1 0 0 0 0 0 0 00
07 000 00 1 0 0 0 0 0 0 00
08 003 03 1 1 0 1 0 1 1 91
09 003 03 1 1 0 1 0 1 1 99
0a 000 00 1 0 0 0 0 0 0 00
0b 000 00 1 0 0 0 0 0 0 00
0c 000 00 1 0 0 0 0 0 0 00
0d 000 00 1 0 0 0 0 0 0 00
0e 000 00 1 0 0 0 0 0 0 00
0f 000 00 1 0 0 0 0 0 0 00
IRQ to pin mappings:
IRQ0 -> 2
IRQ1 -> 1
IRQ3 -> 3
IRQ4 -> 4
IRQ6 -> 6
IRQ7 -> 7
IRQ8 -> 8
IRQ9 -> 9
IRQ12 -> 12
IRQ16 -> 0
IRQ17 -> 1
IRQ20 -> 4
IRQ24 -> 8
IRQ25 -> 9
.................................... done.
calibrating APIC timer ...
..... CPU clock speed is 933.4306 MHz.
..... host bus clock speed is 133.3472 MHz.
cpu: 0, clocks: 1333472, slice: 444490
CPU0<T0:1333472,T1:888976,D:6,S:444490,C:1333472>
cpu: 1, clocks: 1333472, slice: 444490
CPU1<T0:1333472,T1:444480,D:12,S:444490,C:1333472>
checking TSC synchronization across CPUs: passed.
Setting commenced=1, go go go
PCI: PCI BIOS revision 2.10 entry at 0xf0aa0, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: ServerWorks host bridge: secondary bus 00
PCI: ServerWorks host bridge: secondary bus 01
PCI: Using IRQ router default [1166/0200] at 00:0f.0
PCI->APIC IRQ transform: (B1,I5,P0) -> 24
PCI->APIC IRQ transform: (B1,I5,P1) -> 25
PCI->APIC IRQ transform: (B0,I2,P0) -> 20
PCI->APIC IRQ transform: (B0,I3,P0) -> 16
PCI->APIC IRQ transform: (B0,I4,P0) -> 17
Linux NET4.0 for Linux 2.4

> > Is there any way to restore the 8254 to a valid state without rebooting?
>
> Well, in this specific configuration, it may be either the 8254 timer or
> the 8259 legacy PIC as when the workaround gets activated both timer IRQs
> and NMIs go through the latter. It is certainly possible to reprogram
> the chips but maybe we can find a way to avoid the lockup.

Yes, this would clearly be better.

Thank you again.

--
| Benjamin Monate | mailto:[email protected] |


Subject: Re: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

In his message of Thu 23 November, Maciej W. Rozycki writes:
> Hmm, your BIOS reports the timer IRQ is directly connected...
> > Int: type 0, pol 3, trig 3, bus 2, IRQ 09, APIC ID 2, APIC INT 09
> This is weird for an ISA IRQ.

Remember that I have TWO PCI buses and one ISA Bus.
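
The Int lines of the MP-table dump quoted earlier in this thread can be tabulated with a short script, to see which entries on a given bus override the bus-default polarity and trigger mode. This is only a sketch in Python, assuming the line format printed by this kernel; the names INT_RE, parse_int_entries, non_default_on_bus and describe are made up for illustration. The code meanings used (0 = conforms to the bus, 1 = active high or edge, 3 = active low or level) are those of the Intel MultiProcessor Specification.

```python
import re

# Matches "Int:" entries only; "Lint:" lines use a lowercase "int" and "APIC LINT".
INT_RE = re.compile(
    r"Int: type (\d+), pol (\d+), trig (\d+), bus (\d+), "
    r"IRQ ([0-9a-f]+), APIC ID ([0-9a-f]+), APIC INT ([0-9a-f]+)"
)

POLARITY = {0: "bus default", 1: "active high", 3: "active low"}
TRIGGER = {0: "bus default", 1: "edge", 3: "level"}


def parse_int_entries(text):
    """Parse MP-table I/O interrupt entries from dmesg text into dicts."""
    entries = []
    for m in INT_RE.finditer(text):
        typ, pol, trig, bus = (int(m.group(i)) for i in range(1, 5))
        # IRQ, APIC ID and APIC INT are printed in hexadecimal.
        irq, apic, pin = (int(m.group(i), 16) for i in range(5, 8))
        entries.append(
            {"type": typ, "pol": pol, "trig": trig, "bus": bus,
             "irq": irq, "apic": apic, "pin": pin}
        )
    return entries


def non_default_on_bus(entries, bus):
    """Entries on the given bus whose polarity or trigger is not the bus default."""
    return [e for e in entries if e["bus"] == bus and (e["pol"] or e["trig"])]


def describe(entry):
    """One-line human-readable summary of an entry."""
    return "bus {} IRQ {}: {}, {} -> APIC {} pin {}".format(
        entry["bus"], entry["irq"],
        POLARITY.get(entry["pol"], "reserved"),
        TRIGGER.get(entry["trig"], "reserved"),
        entry["apic"], entry["pin"],
    )
```

Calling non_default_on_bus(parse_int_entries(log), 2) on the dmesg text, where bus 2 is the ISA bus, lists the ISA entries that do not simply conform to the bus.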

>
> > ENABLING IO-APIC IRQs
> > ...changing IO-APIC physical APIC ID to 2 ... ok.
> > BIOS bug, IO-APIC#1 ID 3 is already used!...
> > ... fixing up to 1. (tell your hw vendor)
> > ...changing IO-APIC physical APIC ID to 1 ... ok.
> This is annoying but Linux recovers from it...

Yes. I had to patch 2.4.0pre2 to be able to boot, but now it boots
unpatched.
What do you mean by "annoying"? I thought this was just an
initialization problem.
By the way, my BIOS has an option to choose between MP 1.4 and older
specifications. Changing it made no difference to the problem.


> > ..TIMER: vector=49 pin1=2 pin2=0
> > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > ...trying to set up timer (IRQ0) through the 8259A ...
> > ..... (found pin 0) ...works.
>
> At the moment I can't see any reason of this failure apart from pin1
> being unconnected. But why would it be? I'll prepare a debugging patch
> which might help finding the real cause and I'll send it to you soon.

Okay. Thank you very much.

> BTW, have you checked if there is a BIOS update for your system?

I will check that with Asus. At this time I have BIOS 1002.

--
| Benjamin Monate | mailto:[email protected] |
| LRI - Bât. 490
| Université de Paris-Sud
| F-91405 ORSAY Cedex

Subject: Re: Strange lockup of the timer with 2.4.0-test10 SMP (and older)

Hi,
I just patched my kernel and here are my results:

> systems. Please apply and give me the "IRQ to pin mappings:" part of the
> bootstrap log. I'm pretty confident we get everything fine there, but
> just in case...
0 0 00
IRQ to pin mappings:
IRQ0 -> 0:2
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:9
IRQ12 -> 0:12
IRQ16 -> 1:0
IRQ17 -> 1:1
IRQ20 -> 1:4
IRQ24 -> 1:8
IRQ25 -> 1:9
Just curious: what do the numbers on the right refer to? Are they
real physical pins of the processor?

> Patch-2.4.0-test11-pic_debug-0 fetches as much as possible from chips
> involved in IRQ handling. Most of the code is already in place -- the
> patch only adds another magic SysRq command (credit goes for Andrew Morton
> for the idea). Please apply the patch and provide me the output of
> SysRq+A (you need to have MAGIC_SYSRQ enabled) as caught just after a
> bootstrap and then after you notice the timer IRQ counter no longer
> advances. I strongly suspect some software (be it XFree86 or not) enables
> one of the 8259A inputs which should remain disabled. This would lead to
> symptoms as you describe due to the special 8259A setup we use when the
> workaround is in use.

Here are both outputs:
Just after bootstrap:
Nov 27 16:51:35 pc8-118 kernel: SysRq:
Nov 27 16:51:35 pc8-118 kernel: print_PIC()
Nov 27 16:51:35 pc8-118 kernel: printing PIC contents
Nov 27 16:51:35 pc8-118 kernel: print_IO_APIC()
Nov 27 16:51:35 pc8-118 kernel: testing the IO APIC.......................
Nov 27 16:51:35 pc8-118 kernel:
Nov 27 16:51:35 pc8-118 kernel:
Nov 27 16:51:35 pc8-118 kernel: .................................... done.
Nov 27 16:51:35 pc8-118 kernel: print_all_local_APICs()
Nov 27 16:51:35 pc8-118 kernel:
Nov 27 16:51:35 pc8-118 kernel: ... APIC ID: 03000000 (3)
Nov 27 16:51:35 pc8-118 kernel: ... APIC VERSION: 00040011
Nov 27 16:51:35 pc8-118 kernel: 00000000000000000000000000000000
Nov 27 16:51:35 pc8-118 last message repeated 9 times
Nov 27 16:51:35 pc8-118 kernel: 00000000010000000100000000000000
Nov 27 16:51:35 pc8-118 kernel: 00000000000000000000000000000000
Nov 27 16:51:35 pc8-118 last message repeated 2 times
Nov 27 16:51:35 pc8-118 kernel: 00000000000000000100000000000000
Nov 27 16:51:35 pc8-118 kernel: 00000000000000000000000000000000
Nov 27 16:51:35 pc8-118 last message repeated 4 times
Nov 27 16:51:35 pc8-118 kernel: 00000000000000010000000000000000
Nov 27 16:51:35 pc8-118 kernel:
Nov 27 16:51:35 pc8-118 kernel:
Nov 27 16:51:35 pc8-118 kernel: ... APIC ID: 00000000 (0)
Nov 27 16:51:35 pc8-118 kernel: ... APIC VERSION: 00040011
Nov 27 16:51:35 pc8-118 kernel: 00000000000000000000000000000000
Nov 27 16:51:35 pc8-118 last message repeated 9 times
Nov 27 16:51:35 pc8-118 kernel: 00000000010000000100000000000000
Nov 27 16:51:35 pc8-118 kernel: 00000000000000000000000000000000
Nov 27 16:51:35 pc8-118 last message repeated 8 times
Nov 27 16:51:35 pc8-118 kernel: 00000000000000010000000000000000
Nov 27 16:51:35 pc8-118 kernel:

And then a quite different one after the lockup:
Nov 27 17:10:57 pc8-118 kernel: SysRq:
Nov 27 17:10:57 pc8-118 kernel: print_PIC()
Nov 27 17:10:57 pc8-118 kernel: printing PIC contents
Nov 27 17:10:57 pc8-118 kernel: print_IO_APIC()
Nov 27 17:10:57 pc8-118 kernel: testing the IO APIC.......................
Nov 27 17:10:57 pc8-118 kernel:
Nov 27 17:10:57 pc8-118 kernel:
Nov 27 17:10:57 pc8-118 kernel: .................................... done.
Nov 27 17:10:57 pc8-118 kernel: print_all_local_APICs()
Nov 27 17:10:57 pc8-118 kernel:
Nov 27 17:10:57 pc8-118 kernel: ... APIC ID: 00000000 (0)
Nov 27 17:10:57 pc8-118 kernel: ... APIC VERSION: 00040011
Nov 27 17:10:57 pc8-118 kernel: 00000000000000000000000000000000
Nov 27 17:10:57 pc8-118 last message repeated 9 times
Nov 27 17:10:57 pc8-118 kernel: 00000000010000000100000000000000
Nov 27 17:10:57 pc8-118 kernel: 00000000000000000000000000000000
Nov 27 17:10:57 pc8-118 last message repeated 9 times
Nov 27 17:10:57 pc8-118 kernel:
Nov 27 17:10:57 pc8-118 kernel:
Nov 27 17:10:57 pc8-118 kernel: ... APIC ID: 03000000 (3)
Nov 27 17:10:57 pc8-118 kernel: ... APIC VERSION: 00040011
Nov 27 17:10:57 pc8-118 kernel: 00000000000000000000000000000000
Nov 27 17:10:57 pc8-118 last message repeated 9 times
Nov 27 17:10:57 pc8-118 kernel: 00000000010000000100000000000000
Nov 27 17:10:57 pc8-118 kernel: 00000000000000000000000000000000
Nov 27 17:10:57 pc8-118 last message repeated 8 times
Nov 27 17:10:57 pc8-118 kernel: 00000000000000010000000000000000
Nov 27 17:10:57 pc8-118 kernel:

Now this looks really cryptic to me.

The two outputs differ:

- They are displayed in a different order:
APIC 3 then 0 before the lock, and APIC 0 then 3 after the lock.

- The last line for APIC 0 differs by one bit:
Before lock 00000000000000010000000000000000
After lock 00000000000000000000000000000000

- Line 15 for APIC 3 differs:
Before lock 00000000000000000100000000000000
After lock 00000000000000000000000000000000
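
Comparing such bit strings by eye is error-prone, so a tiny helper can list the positions where two lines differ. This is a minimal Python sketch; the function name differing_bits is hypothetical, and positions are counted from the left of the printed string starting at 0, which is not necessarily the hardware bit number.

```python
def differing_bits(before, after):
    """Return 0-based offsets (from the left) where two bit strings differ."""
    if len(before) != len(after):
        raise ValueError("bit strings must have the same length")
    return [i for i, (x, y) in enumerate(zip(before, after)) if x != y]
```

Applied to each before/after pair of lines above, it reports exactly one differing position per pair.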

> I am trying to pull documentation from ServerWorks -- they appear not to
> want to share them with everyone. From marketing brochures available on
> their web site I found out their chipsets feature integrated I/O APICs
> which makes me quite suspicious the original problem with the timer IRQ
> routing (i.e. not the secondary one you observe when the workaround gets
> activated) lies in the chipset setup.

The only information I can find in the manual is:

ServerWorks LE 3.0 Chipset: features the ServerWorks LE 3.0 North
Bridge and RCC Open South Bridge.
Integrated IOAPIC: Supports full 32-APIC entries and removes the need
for a separate IOAPIC chip. <<<< the source of the original problem ?

I will try to find some technical specification on the web.

Thank you.
--
| Benjamin Monate
| LRI - Bât. 490
| Université de Paris-Sud