2002-01-11 14:54:02

by Zwane Mwaikambo

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

00:00.0 Host bridge: ServerWorks CNB20LE (rev 05)
00:00.1 Host bridge: ServerWorks CNB20LE (rev 05)
00:01.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
00:02.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
00:09.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
00:0a.0 Network controller: Penta Media Co Ltd: Unknown device 9050 (rev 02)
00:0f.0 ISA bridge: ServerWorks OSB4 (rev 4f)
00:0f.1 IDE interface: ServerWorks: Unknown device 0211
00:0f.2 USB Controller: ServerWorks: Unknown device 0220 (rev 04)
01:03.0 SCSI storage controller: Adaptec 7892P (rev 02)
01:05.0 Ethernet controller: 3Com Corporation 3c905 100BaseTX [Boomerang]
01:06.0 Network controller: Penta Media Co Ltd: Unknown device 9050 (rev 02)
01:07.0 PCI bridge: Distributed Processing Technology PCI Bridge (rev 02)
01:07.1 I2O: Distributed Processing Technology SmartRAID V Controller (rev 02)

           CPU0       CPU1
  0:   70930979          0    XT-PIC  timer
  1:      19222          0    XT-PIC  keyboard
  2:          0          0    XT-PIC  cascade
  8:          1          0    XT-PIC  rtc
 10:   24926906          0    XT-PIC  eth0
 11:   86480614          0    XT-PIC  dpti0, pentanet0
 14:        118          0    XT-PIC  ide0
 15:    2041588          0    XT-PIC  eth1
NMI:          0          0
LOC:   70931368   70931502
ERR:          0

What does your /proc/interrupts look like?

Regards,
Zwane Mwaikambo
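
For readers comparing their own listing against the one above, here is a minimal Python sketch that parses /proc/interrupts-style text and reports IRQ lines claimed by more than one device (IRQ 11 carries both dpti0 and pentanet0 above). The function names and the shortened sample are illustrative assumptions, not part of any kernel tool.

```python
SAMPLE = """\
           CPU0       CPU1
  0:   70930979          0    XT-PIC  timer
 10:   24926906          0    XT-PIC  eth0
 11:   86480614          0    XT-PIC  dpti0, pentanet0
 15:    2041588          0    XT-PIC  eth1
NMI:          0          0
LOC:   70931368   70931502
ERR:          0
"""


def parse_interrupts(text):
    """Return {irq: (per_cpu_counts, controller, [devices])} for numbered IRQ rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue  # NMI/LOC/ERR/MIS are summary rows, not IRQ lines
        fields = rest.split()
        counts = [int(f) for f in fields[:ncpus]]
        controller = fields[ncpus]
        devices = [d.strip() for d in " ".join(fields[ncpus + 1:]).split(",")]
        table[int(label)] = (counts, controller, devices)
    return table


def shared_irqs(table):
    """Return {irq: devices} for every IRQ line claimed by more than one device."""
    return {irq: devs for irq, (_, _, devs) in table.items() if len(devs) > 1}


if __name__ == "__main__":
    print(shared_irqs(parse_interrupts(SAMPLE)))
```

On a live system the text would come from reading /proc/interrupts instead of the embedded sample.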


2002-01-11 14:56:12

by Zwane Mwaikambo

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

What does your IRQ routing look like? (paste a dmesg) Are you running with
IO-APIC enabled?

Regards,
Zwane Mwaikambo
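
One way to answer the IO-APIC question from a saved boot log is sketched below in Python. It assumes the 2.4-era message strings that appear in the logs posted in this thread ("I/O APIC #n Version ... at ..." and "ENABLING IO-APIC IRQs"); other kernel versions may word these differently, and the function name is a made-up helper.

```python
import re

# Matches MP-table lines such as "I/O APIC #0 Version 17 at 0xFEC00000."
APIC_LINE = re.compile(r"I/O APIC #(\d+) Version (\d+) at (0x[0-9A-Fa-f]+)")

# Short excerpts in the style of the logs posted in this thread.
WITH_APIC = (
    "I/O APIC #0 Version 17 at 0xFEC00000.\n"
    "I/O APIC #2 Version 17 at 0xFEC01000.\n"
    "ENABLING IO-APIC IRQs\n"
)
NOAPIC = (
    "I/O APIC #0 Version 17 at 0xFEC00000.\n"
    "I/O APIC #2 Version 17 at 0xFEC01000.\n"
    "Using local APIC timer interrupts.\n"
)


def io_apic_summary(dmesg_text):
    """Summarise the I/O APIC evidence found in a boot log."""
    apics = [(int(m.group(1)), m.group(3)) for m in APIC_LINE.finditer(dmesg_text)]
    return {
        # I/O APICs the firmware's MP table announces (listed even with noapic)
        "apics_in_mp_table": apics,
        # True only when the kernel actually routes IRQs through them
        "irqs_routed_via_io_apic": "ENABLING IO-APIC IRQs" in dmesg_text,
    }


if __name__ == "__main__":
    print(io_apic_summary(WITH_APIC))
    print(io_apic_summary(NOAPIC))
```

The two fields are kept separate because the MP table can list I/O APICs even when the kernel ends up not using them.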

2002-01-11 17:33:45

by Jim Studt

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

Zwane Mwaikambo asks for more info...

I have IO-APIC configured in the kernel...
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y



# cat /proc/interrupts (with the card ifconfig-ed up, eth2 is on the 2nd PCI)
           CPU0
  0:    3183548    IO-APIC-edge  timer
  1:          6    IO-APIC-edge  keyboard
  2:          0          XT-PIC  cascade
 15:          4    IO-APIC-edge  ide1
 20:     143691   IO-APIC-level  eth0
 24:     108732   IO-APIC-level  aic7xxx
 26:          9   IO-APIC-level  eth2
NMI:          0
LOC:    3183669
ERR:          0
MIS:          0
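
A counter that stays put while the device is supposed to be passing traffic is the usual sign of lost interrupts. The Python sketch below compares two snapshots and reports which of the devices expected to be busy did not advance. The first snapshot uses the eth0 and eth2 counts from the listing above; the second snapshot is invented purely for illustration, and the function names are hypothetical.

```python
BEFORE = """\
           CPU0
 20:     143691   IO-APIC-level  eth0
 26:          9   IO-APIC-level  eth2
"""

# Hypothetical later snapshot (not from the thread): eth0 advanced, eth2 did not.
AFTER = """\
           CPU0
 20:     143950   IO-APIC-level  eth0
 26:          9   IO-APIC-level  eth2
"""


def irq_totals(text):
    """Map each device name in /proc/interrupts-style text to its summed count."""
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # header and NMI/LOC/ERR/MIS rows
        fields = rest.split()
        counts = []
        for field in fields:
            if not field.isdigit():
                break
            counts.append(int(field))
        # fields[len(counts)] is the controller name; device names follow it
        for name in " ".join(fields[len(counts) + 1:]).split(","):
            if name.strip():
                totals[name.strip()] = sum(counts)
    return totals


def stalled(before, after, busy_devices):
    """Return the busy devices whose counters did not advance between snapshots."""
    return [d for d in busy_devices if after.get(d, 0) - before.get(d, 0) <= 0]


if __name__ == "__main__":
    print(stalled(irq_totals(BEFORE), irq_totals(AFTER), ["eth0", "eth2"]))
```

The check is only meaningful for devices known to be active between the two snapshots; an idle keyboard or CD-ROM line will also show no change.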



And the dmesg of the most recent boot. IO-APIC is enabled.

Linux version 2.4.17 (root@warehouse) (gcc version 2.95.4 20011006 (Debian prerelease)) #1 SMP Fri Jan 11 02:22:20 CST 2002
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e7400 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000000fff0000 (usable)
BIOS-e820: 000000000fff0000 - 000000000ffffc00 (ACPI data)
BIOS-e820: 000000000ffffc00 - 0000000010000000 (ACPI NVS)
BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
found SMP MP-table at 000f7ac0
hm, page 000f7000 reserved twice.
hm, page 000f8000 reserved twice.
hm, page 0009f000 reserved twice.
hm, page 000a0000 reserved twice.
On node 0 totalpages: 65520
zone(0): 4096 pages.
zone(1): 61424 pages.
zone(2): 0 pages.
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: Gateway Product ID: 7450R APIC at: 0xFEE00000
Processor #1 Pentium(tm) Pro APIC version 17
I/O APIC #0 Version 17 at 0xFEC00000.
I/O APIC #2 Version 17 at 0xFEC01000.
Processors: 1
Kernel command line: auto BOOT_IMAGE=Linux ro root=801
Initializing CPU#0
Detected 930.434 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1854.66 BogoMIPS
Memory: 255668k/262080k available (1013k kernel code, 6024k reserved, 307k data, 204k init, 0k highmem)
Dentry-cache hash table entries: 32768 (order: 6, 262144 bytes)
Inode-cache hash table entries: 16384 (order: 5, 131072 bytes)
Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
Buffer-cache hash table entries: 16384 (order: 4, 65536 bytes)
Page-cache hash table entries: 65536 (order: 6, 262144 bytes)
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
CPU0: Intel Pentium III (Coppermine) stepping 0a
per-CPU timeslice cutoff: 731.26 usecs.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Error: only one processor found.
ENABLING IO-APIC IRQs
Setting 0 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 0 ... ok.
Setting 2 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 2 ... ok.
init IO_APIC IRQs
IO-APIC (apicid-pin) 2-0, 2-1, 2-2, 2-3, 2-6, 2-7, 2-11, 2-12, 2-13, 2-14, 2-15 not connected.
..TIMER: vector=0x31 pin1=0 pin2=-1
number of MP IRQ sources: 21.
number of IO-APIC #0 registers: 16.
number of IO-APIC #2 registers: 16.
testing the IO APIC.......................

IO APIC #0......
.... register #00: 00000000
....... : physical APIC id: 00
.... register #01: 000F0011
....... : max redirection entries: 000F
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 00000000
....... : arbitration: 00
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 001 01 0 0 0 0 0 1 1 31
01 001 01 0 0 0 0 0 1 1 39
02 000 00 1 0 0 0 0 0 0 00
03 001 01 0 0 0 0 0 1 1 41
04 001 01 0 0 0 0 0 1 1 49
05 001 01 1 1 0 1 0 1 1 51
06 001 01 0 0 0 0 0 1 1 59
07 001 01 1 1 0 1 0 1 1 61
08 001 01 0 0 0 0 0 1 1 69
09 001 01 1 1 0 1 0 1 1 71
0a 001 01 1 1 0 1 0 1 1 79
0b 001 01 1 1 0 1 0 1 1 81
0c 001 01 0 0 0 0 0 1 1 89
0d 001 01 0 0 0 0 0 1 1 91
0e 001 01 0 0 0 0 0 1 1 99
0f 001 01 0 0 0 0 0 1 1 A1

IO APIC #2......
.... register #00: 02000000
....... : physical APIC id: 02
.... register #01: 000F0011
....... : max redirection entries: 000F
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 0C000000
....... : arbitration: 0C
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 000 00 1 0 0 0 0 0 0 00
01 000 00 1 0 0 0 0 0 0 00
02 000 00 1 0 0 0 0 0 0 00
03 000 00 1 0 0 0 0 0 0 00
04 001 01 1 1 0 1 0 1 1 A9
05 001 01 1 1 0 1 0 1 1 B1
06 000 00 1 0 0 0 0 0 0 00
07 000 00 1 0 0 0 0 0 0 00
08 001 01 1 1 0 1 0 1 1 B9
09 001 01 1 1 0 1 0 1 1 C1
0a 001 01 1 1 0 1 0 1 1 C9
0b 000 00 1 0 0 0 0 0 0 00
0c 000 00 1 0 0 0 0 0 0 00
0d 000 00 1 0 0 0 0 0 0 00
0e 000 00 1 0 0 0 0 0 0 00
0f 000 00 1 0 0 0 0 0 0 00
IRQ to pin mappings:
IRQ0 -> 0:0
IRQ1 -> 0:1
IRQ2 -> 0:2
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ5 -> 0:5
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:9
IRQ10 -> 0:10
IRQ11 -> 0:11
IRQ12 -> 0:12
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
IRQ20 -> 1:4
IRQ21 -> 1:5
IRQ24 -> 1:8
IRQ25 -> 1:9
IRQ26 -> 1:10
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 930.3575 MHz.
..... host bus clock speed is 132.9081 MHz.
cpu: 0, clocks: 1329081, slice: 664540
CPU0<T0:1329072,T1:664528,D:4,S:664540,C:1329081>
Waiting on wait_init_idle (map = 0x0)
All processors have done init_idle
PCI: PCI BIOS revision 2.10 entry at 0xfda1f, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Discovered primary peer bus 01 [IRQ]
PCI->APIC IRQ transform: (B0,I7,P0) -> 20
PCI->APIC IRQ transform: (B0,I9,P0) -> 21
PCI->APIC IRQ transform: (B0,I11,P0) -> 24
PCI->APIC IRQ transform: (B0,I12,P0) -> 25
PCI->APIC IRQ transform: (B1,I13,P0) -> 26
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
Starting kswapd
pty: 256 Unix98 ptys configured
block: 128 slots per queue, batch=32
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
ServerWorks OSB4: chipset revision 0
ServerWorks OSB4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x1880-0x1887, BIOS settings: hda:pio, hdb:pio
ide1: BM-DMA at 0x1888-0x188f, BIOS settings: hdc:DMA, hdd:pio
hdc: MATSHITA CR-177, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
hdc: ATAPI 24X CD-ROM drive, 128kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.12
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
loop: loaded (max 8 devices)
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <[email protected]> and others
eth0: Intel Corp. 82557 [Ethernet Pro 100], 00:C0:9F:04:5E:15, IRQ 20.
Board assembly 000000-000, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
eth1: Intel Corp. 82557 [Ethernet Pro 100] (#2), 00:C0:9F:04:5E:14, IRQ 21.
Board assembly 000000-000, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
<Adaptec aic7892 Ultra160 SCSI adapter>
aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

Vendor: IBM Model: DDYS-T18350M Rev: S96H
Type: Direct-Access ANSI SCSI revision: 03
(scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
Vendor: IBM Model: DDYS-T18350M Rev: S96H
Type: Direct-Access ANSI SCSI revision: 03
(scsi0:A:1): 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
Vendor: SDR Model: GEM318 Rev: 0
Type: Processor ANSI SCSI revision: 02
scsi0:A:0:0: Tagged Queuing enabled. Depth 253
scsi0:A:1:0: Tagged Queuing enabled. Depth 253
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
SCSI device sda: 35843670 512-byte hdwr sectors (18352 MB)
Partition check:
sda: sda1 sda2 sda3 sda4
SCSI device sdb: 35843670 512-byte hdwr sectors (18352 MB)
sdb: unknown partition table
Attached scsi generic sg2 at scsi0, channel 0, id 8, lun 0, type 3
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 2048 buckets, 16Kbytes
TCP: Hash tables configured (established 16384 bind 16384)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 204k freed


--
Jim Studt, President
The Federated Software Group, Inc.
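
The routing information in a log like this can be pulled out mechanically. The Python sketch below extracts the "PCI->APIC IRQ transform" lines and the "IRQ to pin mappings" entries, then checks the relation visible in this particular log: with 16 pins per I/O APIC, IRQ n sits on APIC index n // 16, pin n % 16 (for example IRQ26 -> 1:10). Reading the B/I/P fields as bus, slot and pin is an interpretation (eth2 later shows up at 01:0d.0, i.e. bus 1, slot 13), and the helper names are made up.

```python
import re

TRANSFORM = re.compile(r"PCI->APIC IRQ transform: \(B(\d+),I(\d+),P(\d+)\) -> (\d+)")
PIN_MAP = re.compile(r"^IRQ(\d+) -> (\d+):(\d+)$", re.MULTILINE)
PINS_PER_APIC = 16  # taken from "number of IO-APIC #0 registers: 16." in the log

DMESG = """\
IRQ20 -> 1:4
IRQ21 -> 1:5
IRQ24 -> 1:8
IRQ25 -> 1:9
IRQ26 -> 1:10
PCI->APIC IRQ transform: (B0,I7,P0) -> 20
PCI->APIC IRQ transform: (B0,I9,P0) -> 21
PCI->APIC IRQ transform: (B0,I11,P0) -> 24
PCI->APIC IRQ transform: (B0,I12,P0) -> 25
PCI->APIC IRQ transform: (B1,I13,P0) -> 26
"""


def parse_routing(dmesg_text):
    """Return ({(bus, slot, pin): irq}, {irq: (apic_index, apic_pin)})."""
    routes = {
        (int(m.group(1)), int(m.group(2)), int(m.group(3))): int(m.group(4))
        for m in TRANSFORM.finditer(dmesg_text)
    }
    pins = {
        int(m.group(1)): (int(m.group(2)), int(m.group(3)))
        for m in PIN_MAP.finditer(dmesg_text)
    }
    return routes, pins


def inconsistent_pins(pins):
    """List IRQs whose pin mapping breaks irq == apic_index * 16 + pin."""
    return [irq for irq, (apic, pin) in pins.items()
            if irq != apic * PINS_PER_APIC + pin]


if __name__ == "__main__":
    routes, pins = parse_routing(DMESG)
    for (bus, slot, pin), irq in sorted(routes.items()):
        print(f"bus {bus} slot {slot} pin {pin} -> IRQ {irq} -> APIC pin {pins.get(irq)}")
    print("inconsistent:", inconsistent_pins(pins))
```

A clean result here only shows that the kernel followed the MP table consistently; it cannot show whether the table matches the board's actual wiring, which is the suspected fault in this thread.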

2002-01-13 09:39:59

by Zwane Mwaikambo

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

Could you please try booting with the kernel option "noapic"? You don't have
to recompile. I'd just like to know whether the problem persists. You might
find the box sharing a lot of IRQs though.

Regards,
Zwane Mwaikambo
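
Because "noapic" is just a word on the kernel command line, adding it is a boot loader change and not a rebuild. As a small illustration, the Python sketch below checks for an option and appends it idempotently to a command-line string. The helper names are hypothetical, and how the resulting string reaches the boot loader (for example its configuration file or boot prompt) depends on the loader in use.

```python
def has_boot_option(cmdline, option):
    """True if `option` appears as a whole word on the kernel command line."""
    return option in cmdline.split()


def with_boot_option(cmdline, option):
    """Return the command line with `option` appended exactly once."""
    words = cmdline.split()
    if option not in words:
        words.append(option)
    return " ".join(words)


if __name__ == "__main__":
    current = "auto BOOT_IMAGE=Linux ro root=801"
    print(with_boot_option(current, "noapic"))
```

After rebooting, the "Kernel command line:" entry in dmesg (or /proc/cmdline) shows whether the option actually took effect.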


2002-01-14 00:30:29

by Jim Studt

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

> Could you please try booting with the kernel option "noapic"? You don't have
> to recompile. I'd just like to know whether the problem persists. You might
> find the box sharing a lot of IRQs though.

Many blessings to you!
Adding 'noapic' to the boot command line makes the machines work correctly.
As far as my problem goes this is a fine solution. The afflicted machine
just holds a couple terabytes of nearline storage and has no performance
demands to speak of.

I will replace the ieee1394 hardware and verify that it works. I expect
success now.

I have six of these machines and am holding one out as a spare. I will
be happy to continue testing and prodding on that spare unit.

For reference I now have...

# cat /proc/interrupts (eth2 is the afflicted card on the second PCI bus)
           CPU0
  0:      32594          XT-PIC  timer
  1:          2          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  5:       1316          XT-PIC  eth2
  7:        892          XT-PIC  aic7xxx
 11:       2132          XT-PIC  eth0
 15:          4          XT-PIC  ide1
NMI:          0
LOC:      32554
ERR:          0
MIS:          0
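
To see what "noapic" changed, the IRQ assignments from the earlier IO-APIC boot can be compared with this listing. The Python sketch below, using hypothetical helper names and the device lines from the two listings in this thread, reports every device whose IRQ number moved between the two boots.

```python
APIC_BOOT = """\
           CPU0
  0:    3183548    IO-APIC-edge  timer
 15:          4    IO-APIC-edge  ide1
 20:     143691   IO-APIC-level  eth0
 24:     108732   IO-APIC-level  aic7xxx
 26:          9   IO-APIC-level  eth2
"""

NOAPIC_BOOT = """\
           CPU0
  0:      32594          XT-PIC  timer
  5:       1316          XT-PIC  eth2
  7:        892          XT-PIC  aic7xxx
 11:       2132          XT-PIC  eth0
 15:          4          XT-PIC  ide1
"""


def device_irqs(text):
    """Map each device name to the IRQ number it is listed under."""
    result = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # header and summary rows
        # Drop the numeric per-CPU counters, then the controller name.
        names = [f for f in rest.split() if not f.isdigit()][1:]
        for name in " ".join(names).split(","):
            if name.strip():
                result[name.strip()] = int(label)
    return result


def moved(before, after):
    """Return {device: (old_irq, new_irq)} for devices whose IRQ changed."""
    return {dev: (before[dev], after[dev])
            for dev in before if dev in after and before[dev] != after[dev]}


if __name__ == "__main__":
    print(moved(device_irqs(APIC_BOOT), device_irqs(NOAPIC_BOOT)))
```

In this case the legacy devices (timer, ide1) keep their numbers, while the three PCI devices drop from the 20+ range into IRQs below 16.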

And from dmesg...

Linux version 2.4.17 (root@warehouse) (gcc version 2.95.4 20011006 (Debian prerelease)) #1 SMP Fri Jan 11 02:22:20 CST 2002
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e7400 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000000fff0000 (usable)
BIOS-e820: 000000000fff0000 - 000000000ffffc00 (ACPI data)
BIOS-e820: 000000000ffffc00 - 0000000010000000 (ACPI NVS)
BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
found SMP MP-table at 000f7ac0
hm, page 000f7000 reserved twice.
hm, page 000f8000 reserved twice.
hm, page 0009f000 reserved twice.
hm, page 000a0000 reserved twice.
On node 0 totalpages: 65520
zone(0): 4096 pages.
zone(1): 61424 pages.
zone(2): 0 pages.
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: Gateway Product ID: 7450R APIC at: 0xFEE00000
Processor #1 Pentium(tm) Pro APIC version 17
I/O APIC #0 Version 17 at 0xFEC00000.
I/O APIC #2 Version 17 at 0xFEC01000.
Processors: 1
Kernel command line: auto BOOT_IMAGE=Linux ro root=801 noapic
Initializing CPU#0
Detected 930.434 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1854.66 BogoMIPS
Memory: 255668k/262080k available (1013k kernel code, 6024k reserved, 307k data, 204k init, 0k highmem)
Dentry-cache hash table entries: 32768 (order: 6, 262144 bytes)
Inode-cache hash table entries: 16384 (order: 5, 131072 bytes)
Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
Buffer-cache hash table entries: 16384 (order: 4, 65536 bytes)
Page-cache hash table entries: 65536 (order: 6, 262144 bytes)
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
CPU0: Intel Pentium III (Coppermine) stepping 0a
per-CPU timeslice cutoff: 731.26 usecs.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Error: only one processor found.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 930.4517 MHz.
..... host bus clock speed is 132.9216 MHz.
cpu: 0, clocks: 1329216, slice: 664608
CPU0<T0:1329216,T1:664608,D:0,S:664608,C:1329216>
Waiting on wait_init_idle (map = 0x0)
All processors have done init_idle
PCI: PCI BIOS revision 2.10 entry at 0xfda1f, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Discovered primary peer bus 01 [IRQ]
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
Starting kswapd
pty: 256 Unix98 ptys configured
block: 128 slots per queue, batch=32
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
ServerWorks OSB4: chipset revision 0
ServerWorks OSB4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x1880-0x1887, BIOS settings: hda:pio, hdb:pio
ide1: BM-DMA at 0x1888-0x188f, BIOS settings: hdc:DMA, hdd:pio
hdc: MATSHITA CR-177, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
hdc: ATAPI 24X CD-ROM drive, 128kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.12
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
loop: loaded (max 8 devices)
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <[email protected]> and others
eth0: Intel Corp. 82557 [Ethernet Pro 100], 00:C0:9F:04:5E:15, IRQ 11.
Board assembly 000000-000, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
eth1: Intel Corp. 82557 [Ethernet Pro 100] (#2), 00:C0:9F:04:5E:14, IRQ 10.
Board assembly 000000-000, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
<Adaptec aic7892 Ultra160 SCSI adapter>
aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

Vendor: IBM Model: DDYS-T18350M Rev: S96H
Type: Direct-Access ANSI SCSI revision: 03
(scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
Vendor: SDR Model: GEM318 Rev: 0
Type: Processor ANSI SCSI revision: 02
scsi0:A:0:0: Tagged Queuing enabled. Depth 253
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sda: 35843670 512-byte hdwr sectors (18352 MB)
Partition check:
sda: sda1 sda2 sda3 sda4
Attached scsi generic sg1 at scsi0, channel 0, id 8, lun 0, type 3
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 2048 buckets, 16Kbytes
TCP: Hash tables configured (established 16384 bind 16384)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 204k freed
Adding Swap: 498004k swap-space (priority -1)
3c59x: Donald Becker and others. http://www.scyld.com/network/vortex.html
01:0d.0: 3Com PCI 3c905 Boomerang 100baseTx at 0x18c0. Vers LK1.1.16
eth2: Setting promiscuous mode.
device eth2 entered promiscuous mode
device eth2 left promiscuous mode

--
Jim Studt, President
The Federated Software Group, Inc.

2002-01-14 06:43:50

by Zwane Mwaikambo

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

I have a similar problem with my ServerWorks-based box, also with a network
device. However, I have other devices on the second PCI bus which _do_
work, notably 2 SCSI controllers and a 3Com card.

Cheers,
Zwane Mwaikambo

2002-01-14 09:11:29

by Zwane Mwaikambo

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

On Sun, 13 Jan 2002, Jim Studt wrote:

> I have six of these machines and am holding one out as a spare. I will
> be happy to continue testing and prodding on that spare unit.
>
> For reference I now have...
>
> # cat /proc/interrupts (eth2 is the afflicted card on the second PCI bus)
>            CPU0
>   0:      32594          XT-PIC  timer
>   1:          2          XT-PIC  keyboard
>   2:          0          XT-PIC  cascade
>   5:       1316          XT-PIC  eth2
>   7:        892          XT-PIC  aic7xxx
>  11:       2132          XT-PIC  eth0
>  15:          4          XT-PIC  ide1
> NMI:          0
> LOC:      32554
> ERR:          0
> MIS:          0

Alan Cox pointed out this problem to me and hinted that it was an IRQ
routing problem. I'm not sure whether it is possible to code workarounds
which don't break normal systems though. Anyone want to use Jim as a
guinea ping? ;)


2002-01-14 12:16:11

by Maciej W. Rozycki

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

On Mon, 14 Jan 2002, Zwane Mwaikambo wrote:

> Alan Cox pointed out this problem to me and hinted that it was an IRQ
> routing problem. I'm not sure whether it is possible to code workarounds
> which don't break normal systems though. Anyone want to use Jim as a
> guinea ping? ;)

Why code complicated workarounds for broken firmware? It's so easy to
fix, so either bother the vendor for a fix or replace the system with a
sane one. Reading and understanding Intel's MP spec is a day's work, or
two at most. I wouldn't trust a vendor that refuses to invest even that
little in a product.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-01-14 13:46:01

by Zwane Mwaikambo

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

On Mon, 14 Jan 2002, Maciej W. Rozycki wrote:
> Why code complicated workarounds for broken firmware? It's so easy to
> fix, so either bother the vendor for a fix or replace the system with a
> sane one. Reading and understanding Intel's MP spec is a day's work, or
> two at most. I wouldn't trust a vendor that refuses to invest even that
> little in a product.

You may have noticed the great many "hacks" which people have put into
the kernel over the years to get it to work with the imperfect world of
hardware. It makes you wonder whether we should waste our time supporting
broken hardware... Then again, if we didn't, we'd only run on 0.1% of the
boxes out there ;) But... I'm in no way advocating adding more
kludges.

Regards,
Zwane Mwaikambo


2002-01-14 14:22:35

by Maciej W. Rozycki

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

On Mon, 14 Jan 2002, Zwane Mwaikambo wrote:

> You may have noticed the great many "hacks" which people have put into
> the kernel over the years to get it to work with the imperfect world of
> hardware. It makes you wonder whether we should waste our time supporting

Well, I added a few trivial firmware workarounds myself, but the question
is how much we can obfuscate the kernel before it gets to the point of
insanity and how much effort one should put in them before deciding it's
not worth the trouble. One-liners are usually fine, but is anything more
complex? And I/O APIC routing table bugs are quite hopeless -- you need
to know the physical layout of traces on a given PCB before even trying to
do anything useful.

> broken hardware... Then again, if we didn't, we'd only run on 0.1% of the
> boxes out there ;) But... I'm in no way advocating adding more
> kludges.

SMP hardware is mostly targeted at the server market, which seems to care
about Linux because many customers run it. If customers ask vendors for fixes
or choose another brand (surprisingly enough, there are vendors that can
get their MP tables right), the vendors will start fixing bugs. If we
work around them, they will not, as there won't be a reason to.

The "noapic" option should probably get removed -- it was meant as a
debugging aid (as many of the "no*" options) at the early days of I/O APIC
support, I believe... Now the support is pretty stable.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-01-14 14:25:17

by Zwane Mwaikambo

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

On Mon, 14 Jan 2002, Maciej W. Rozycki wrote:

> The "noapic" option should probably get removed -- it was meant as a
> debugging aid (as many of the "no*" options) at the early days of I/O APIC
> support, I believe... Now the support is pretty stable.

Oooooh that will break a _lot_ of boxes! Otherwise I agree wholeheartedly.

Cheers,
Zwane Mwaikambo


2002-01-14 14:33:25

by Alan

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

> On Mon, 14 Jan 2002, Maciej W. Rozycki wrote:
>
> > The "noapic" option should probably get removed -- it was meant as a
> > debugging aid (as many of the "no*" options) at the early days of I/O APIC
> > support, I believe... Now the support is pretty stable.
>
> Oooooh that will break a _lot_ of boxes! Otherwise I agree wholeheartedly.

noapic seems to be needed by a measurable number of boxes, many of which the
BIOS vendor will never fix or has refused to fix or assist in correcting.

2002-01-14 15:31:05

by Maciej W. Rozycki

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

On Mon, 14 Jan 2002, Alan Cox wrote:

> noapic seems to be needed by a measurable number of boxes, many of which the
> BIOS vendor will never fix or has refused to fix or assist in correcting.

That's exactly why I consider the removal a Good Thing. ;-) The only
drawback I see is it would require an actively-maintained SMP hw
compatibility list.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-01-14 15:33:15

by Zwane Mwaikambo

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

On Mon, 14 Jan 2002, Maciej W. Rozycki wrote:

> On Mon, 14 Jan 2002, Alan Cox wrote:
>
> > noapic seems to be needed by a measurable number of boxes, many of which the
> > BIOS vendor will never fix or has refused to fix or assist in correcting.
>
> That's exactly why I consider the removal a Good Thing. ;-) The only
> drawback I see is it would require an actively-maintained SMP hw
> compatibility list.

And an even more feverishly maintained procmailrc!!! ;)

/me watches Maciej get broiled!!

2002-01-14 15:54:15

by Maciej W. Rozycki

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

On Mon, 14 Jan 2002, Zwane Mwaikambo wrote:

> > That's exactly why I consider the removal a Good Thing. ;-) The only
> > drawback I see is it would require an actively-maintained SMP hw
> > compatibility list.
>
> And an even more feverishly maintained procmailrc!!! ;)

Why? Since Linux doesn't work on these boards without the "noapic"
workaround anyway, I don't expect the number of mails asking for help
to grow. Only the answer would be different.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-01-14 16:17:27

by Jim Studt

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

> > > That's exactly why I consider the removal a Good Thing. ;-) The only
> > > drawback I see is it would require an actively-maintained SMP hw
> > > compatibility list.
> >
> > And an even more feverishly maintained procmailrc!!! ;)
>
Maciej W. Rozycki wrote...
> Why? Since Linux doesn't work on these boards without the "noapic"
> workaround anyway, I don't expect the number of mails asking for help
> to grow. Only the answer would be different.

The machine that started this thread is a single processor Gateway 7450
1U rackmount server with a two processor option.

Using the slightly better apic routing isn't going to make any
difference to its performance. I think being able to avoid unstable
hardware/software interactions when they aren't needed is a great
idea and `noapic' should stay in. This thread should be google-able
shortly and lead people with interrupt problems to the noapic workaround.

I also think that if this is a problem with Gateway's BIOS the best
solution is for them to fix it. I will contact Gateway and see about
that today.

--
Jim Studt, President
The Federated Software Group, Inc.

2002-01-14 21:05:22

by Keith Owens

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

On Mon, 14 Jan 2002 15:19:00 +0100 (MET),
"Maciej W. Rozycki" <[email protected]> wrote:
> The "noapic" option should probably get removed -- it was meant as a
>debugging aid (as many of the "no*" options) at the early days of I/O APIC
>support, I believe... Now the support is pretty stable.

Intel 440GX chipsets hang during SCSI probe with UP kernels unless you
use noapic. It works with SMP but many installers use UP kernels.
Removing noapic will break install on all 440GX machines, and there are a
lot of them out there.

2002-01-15 11:44:58

by Maciej W. Rozycki

Subject: Re: Problem with ServerWorks CNB20LE and lost interrupts

On Tue, 15 Jan 2002, Keith Owens wrote:

> > The "noapic" option should probably get removed -- it was meant as a
> >debugging aid (as many of the "no*" options) at the early days of I/O APIC
> >support, I believe... Now the support is pretty stable.
>
> Intel 440GX chipsets hang during SCSI probe with UP kernels unless you
> use noapic. It works with SMP but many installers use UP kernels.
> Removing noapic will break install on all 440GX machines, and there are a
> lot of them out there.

Now, is that a chipset problem or a firmware (MP table) one? If the
former, we should code a workaround triggered by the chipset's PCI ID, so
the I/O APIC path works, otherwise vendors should fix their firmware. For
UP systems a simple possibility is to remove the MP table altogether if
it's too hard to fix -- it is not needed at all.
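
To make the "workaround triggered by the chipset's PCI ID" idea concrete, here is a rough Python sketch of a quirk table keyed by (vendor, device) ID. It only illustrates the lookup pattern; a real kernel fix would be written in C inside the kernel's own PCI code, and the IDs and workaround name below are placeholders, not the 440GX's real identifiers.

```python
from typing import NamedTuple


class PciId(NamedTuple):
    vendor: int
    device: int


# Placeholder entry: both the ID pair and the workaround name are invented.
QUIRKS = {
    PciId(0x1234, 0x5678): "example_io_apic_fixup",
}


def workarounds_for(present_devices):
    """Return the sorted names of workarounds triggered by the devices present."""
    return sorted({QUIRKS[dev] for dev in present_devices if dev in QUIRKS})


if __name__ == "__main__":
    found = [PciId(0x1234, 0x5678), PciId(0x1234, 0x0001)]
    print(workarounds_for(found))
```

Keying on the chipset ID keeps the workaround away from boards that do not need it, which addresses the earlier worry about breaking normal systems. It does nothing for a wrong MP table, since that is a per-board firmware fault and not a property of the chip.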

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +