2003-11-09 21:05:41

by Shane Wegner

Subject: 2.4.23 crash on Intel SDS2

Hi,

I posted some weeks ago regarding a crash I was
experiencing with 2.4.23-pre4. I am just writing to
confirm that 2.4.23-pre9 is still unable to run reliably on
this machine. In my earlier post, I thought ACPI might be
the culprit, as I had it enabled due to a BIOS bug. Intel
has since fixed that, so I was able to boot 2.4.23-pre9 with
ACPI totally disabled in make config.

The problem is that after some time, usually between 30
seconds and 15 minutes in, the system locks up. Nothing
gets printed into the kernel logs or onto the console.
After 60 seconds, the IPMI watchdog kicks in and reboots
the system. I run Linux 2.4.22 over here with no problems
with and without acpi.

It's an Intel server board model SDS2 with dual Pentium
III Tualatin 1.13GHz processors. I am attaching the dmesg
output from the kernel in case it is helpful, but as there
are no panics or oopses being printed, I am not sure how
best I can help track this down. If there is anything
further I can do or any other information needed, let me
know.

Shane

Linux version 2.4.23-pre9 (shane@continuum) (gcc version 2.95.4 20011002 (Debian prerelease)) #1 SMP Sun Nov 9 12:09:56 PST 2003
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009e000 (usable)
BIOS-e820: 000000000009e000 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 0000000007fc0000 (usable)
BIOS-e820: 0000000007fc0000 - 0000000007fffc00 (ACPI data)
BIOS-e820: 0000000007fffc00 - 0000000008000000 (ACPI NVS)
BIOS-e820: 0000000008000000 - 0000000040000000 (usable)
BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
128MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000f65d0
hm, page 000f6000 reserved twice.
hm, page 000f7000 reserved twice.
hm, page 0009f000 reserved twice.
hm, page 000a0000 reserved twice.
On node 0 totalpages: 262144
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 32768 pages.
ACPI: RSDP (v000 INTEL ) @ 0x000f65b0
ACPI: RSDT (v001 INTEL RSDT 0x06040001 MSFT 0x00000000) @ 0x07ffa6e2
ACPI: FADT (v001 INTEL 0268 0x06040001 MSFT 0x01000000) @ 0x07fffb05
ACPI: MADT (v001 INTEL APIC 0x06040001 MSFT 0x00000000) @ 0x07fffb79
ACPI: BOOT (v001 INTEL $SBFTBL$ 0x06040001 MSFT 0x00000001) @ 0x07fffbd9
ACPI: DSDT (v001 INTEL 0268 0x06040001 MSFT 0x0100000b) @ 0x00000000
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x01] enabled)
Processor #1 Pentium(tm) Pro APIC version 17
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 Pentium(tm) Pro APIC version 17
ACPI: LAPIC_NMI (acpi_id[0x00] polarity[0x1] trigger[0x1] lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x1])
Using ACPI for processor (LAPIC) configuration information
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: INTEL Product ID: SDS2 APIC at: 0xFEE00000
I/O APIC #2 Version 17 at 0xFEC00000.
I/O APIC #3 Version 17 at 0xFEC01000.
Enabling APIC mode: Flat. Using 2 I/O APICs
Processors: 2
Kernel command line: root=/dev/md0 ro
Initializing CPU#0
Detected 1127.945 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 2247.88 BogoMIPS
Memory: 1033128k/1048576k available (1431k kernel code, 14800k reserved, 594k data, 112k init, 131072k highmem)
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode cache hash table entries: 65536 (order: 7, 524288 bytes)
Mount cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer cache hash table entries: 65536 (order: 6, 262144 bytes)
Page-cache hash table entries: 262144 (order: 8, 1048576 bytes)
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch ([email protected])
mtrr: detected mtrr type: Intel
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
CPU0: Intel(R) Pentium(R) III CPU family 1133MHz stepping 01
per-CPU timeslice cutoff: 1463.40 usecs.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Booting processor 1/0 eip 2000
Initializing CPU#1
masked ExtINT on CPU#1
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 2254.43 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
Intel machine check reporting enabled on CPU#1.
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
CPU1: Intel(R) Pentium(R) III CPU family 1133MHz stepping 01
Total of 2 processors activated (4502.32 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 2 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 2 ... ok.
Setting 3 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 3 ... ok.
init IO_APIC IRQs
IO-APIC (apicid-pin) 2-0, 2-2, 3-4, 3-9, 3-10, 3-11, 3-12, 3-13, 3-14, 3-15 not connected.
..TIMER: vector=0x31 pin1=-1 pin2=0
...trying to set up timer (IRQ0) through the 8259A ...
..... (found pin 0) ...works.
number of MP IRQ sources: 24.
number of IO-APIC #2 registers: 16.
number of IO-APIC #3 registers: 16.
testing the IO APIC.......................

IO APIC #2......
.... register #00: 02000000
....... : physical APIC id: 02
....... : Delivery Type: 0
....... : LTS : 0
.... register #01: 000F0011
....... : max redirection entries: 000F
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 00000000
....... : arbitration: 00
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 003 03 0 0 0 0 0 1 1 31
01 003 03 0 0 0 0 0 1 1 39
02 000 00 1 0 0 0 0 0 0 00
03 003 03 0 0 0 0 0 1 1 41
04 003 03 0 0 0 0 0 1 1 49
05 003 03 0 0 0 1 0 1 1 51
06 003 03 0 0 0 0 0 1 1 59
07 003 03 0 0 0 0 0 1 1 61
08 003 03 0 0 0 0 0 1 1 69
09 003 03 0 0 0 1 0 1 1 71
0a 003 03 0 0 0 1 0 1 1 79
0b 003 03 0 0 0 0 0 1 1 81
0c 003 03 0 0 0 0 0 1 1 89
0d 003 03 0 0 0 0 0 1 1 91
0e 003 03 0 0 0 0 0 1 1 99
0f 003 03 0 0 0 0 0 1 1 A1

IO APIC #3......
.... register #00: 03000000
....... : physical APIC id: 03
....... : Delivery Type: 0
....... : LTS : 0
.... register #01: 000F0011
....... : max redirection entries: 000F
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 0D000000
....... : arbitration: 0D
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 003 03 1 1 0 1 0 1 1 A9
01 003 03 1 1 0 1 0 1 1 B1
02 003 03 1 1 0 1 0 1 1 B9
03 003 03 1 1 0 1 0 1 1 C1
04 000 00 1 0 0 0 0 0 0 00
05 003 03 1 1 0 1 0 1 1 C9
06 003 03 1 1 0 1 0 1 1 D1
07 003 03 1 1 0 1 0 1 1 D9
08 003 03 1 1 0 1 0 1 1 E1
09 000 00 1 0 0 0 0 0 0 00
0a 000 00 1 0 0 0 0 0 0 00
0b 000 00 1 0 0 0 0 0 0 00
0c 000 00 1 0 0 0 0 0 0 00
0d 000 00 1 0 0 0 0 0 0 00
0e 000 00 1 0 0 0 0 0 0 00
0f 000 00 1 0 0 0 0 0 0 00
IRQ to pin mappings:
IRQ0 -> 0:0
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ5 -> 0:5
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:9
IRQ10 -> 0:10
IRQ11 -> 0:11
IRQ12 -> 0:12
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
IRQ16 -> 1:0
IRQ17 -> 1:1
IRQ18 -> 1:2
IRQ19 -> 1:3
IRQ21 -> 1:5
IRQ22 -> 1:6
IRQ23 -> 1:7
IRQ24 -> 1:8
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1127.9277 MHz.
..... host bus clock speed is 132.6969 MHz.
cpu: 0, clocks: 1326969, slice: 442323
CPU0<T0:1326960,T1:884624,D:13,S:442323,C:1326969>
cpu: 1, clocks: 1326969, slice: 442323
CPU1<T0:1326960,T1:442304,D:10,S:442323,C:1326969>
checking TSC synchronization across CPUs: passed.
Waiting on wait_init_idle (map = 0x2)
All processors have done init_idle
PCI: PCI BIOS revision 2.10 entry at 0xfd9d1, last bus=2
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
PCI: Ignoring BAR0-3 of IDE controller 00:0f.1
PCI: Discovered primary peer bus 01 [IRQ]
PCI: Discovered primary peer bus 02 [IRQ]
PCI->APIC IRQ transform: (B0,I3,P0) -> 18
PCI->APIC IRQ transform: (B0,I4,P0) -> 19
PCI->APIC IRQ transform: (B0,I8,P0) -> 23
PCI->APIC IRQ transform: (B0,I9,P0) -> 24
PCI->APIC IRQ transform: (B0,I15,P0) -> 10
PCI->APIC IRQ transform: (B0,I15,P0) -> 10
PCI->APIC IRQ transform: (B1,I8,P0) -> 21
PCI->APIC IRQ transform: (B1,I9,P0) -> 22
PCI->APIC IRQ transform: (B2,I4,P0) -> 16
PCI->APIC IRQ transform: (B2,I4,P1) -> 17
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
Starting kswapd
allocated 32 pages and 32 bhs reserved for the highmem bounces
VFS: Disk quotas vdquot_6.5.1
Journalled Block Device driver loaded
Detected PS/2 Mouse Port.
pty: 256 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
keyboard: Timeout - AT keyboard not present?(ed)
keyboard: Timeout - AT keyboard not present?(f4)
RAMDISK driver initialized: 16 RAM disks of 8192K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
SvrWks CSB5: IDE controller at PCI slot 00:0f.1
SvrWks CSB5: chipset revision 146
SvrWks CSB5: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x2470-0x2477, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0x2478-0x247f, BIOS settings: hdc:pio, hdd:pio
PDC20269: IDE controller at PCI slot 01:09.0
PDC20269: chipset revision 2
PDC20269: not 100% native mode: will probe irqs later
ide2: BM-DMA at 0x2490-0x2497, BIOS settings: hde:pio, hdf:pio
ide3: BM-DMA at 0x2498-0x249f, BIOS settings: hdg:pio, hdh:pio
hda: HL-DT-ST RW/DVD GCC-4480B, ATAPI CD/DVD-ROM drive
hde: Maxtor 4A250J0, ATA DISK drive
blk: queue c034cf78, I/O limit 4095Mb (mask 0xffffffff)
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide2 at 0x24b0-0x24b7,0x24aa on irq 22
hde: attached ide-disk driver.
hde: host protected area => 1
hde: 490234752 sectors (251000 MB) w/2048KiB Cache, CHS=30515/255/63, UDMA(133)
Partition check:
hde: hde1
SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
<Adaptec aic7899 Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
<Adaptec aic7899 Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs

blk: queue c1c99018, I/O limit 4095Mb (mask 0xffffffff)
(scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 127, 16bit)
(scsi0:A:1): 160.000MB/s transfers (80.000MHz DT, offset 127, 16bit)
(scsi0:A:2): 160.000MB/s transfers (80.000MHz DT, offset 127, 16bit)
Vendor: COMPAQPC Model: ATLAS10K2-TY092L Rev: DDC2
Type: Direct-Access ANSI SCSI revision: 03
blk: queue f7ff8e18, I/O limit 4095Mb (mask 0xffffffff)
Vendor: COMPAQPC Model: ATLAS10K2-TY092L Rev: DDC2
Type: Direct-Access ANSI SCSI revision: 03
blk: queue f7ff8c18, I/O limit 4095Mb (mask 0xffffffff)
Vendor: COMPAQPC Model: ATLAS10K2-TY092L Rev: DDC2
Type: Direct-Access ANSI SCSI revision: 03
blk: queue c1c9f218, I/O limit 4095Mb (mask 0xffffffff)
scsi0:A:0:0: Tagged Queuing enabled. Depth 64
scsi0:A:1:0: Tagged Queuing enabled. Depth 64
scsi0:A:2:0: Tagged Queuing enabled. Depth 64
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
Attached scsi disk sdc at scsi0, channel 0, id 2, lun 0
SCSI device sda: 17773500 512-byte hdwr sectors (9100 MB)
sda: sda1 sda2 sda3
SCSI device sdb: 17773500 512-byte hdwr sectors (9100 MB)
sdb: sdb1 sdb2 sdb3
SCSI device sdc: 17773500 512-byte hdwr sectors (9100 MB)
sdc: sdc1 sdc2 sdc3
md: linear personality registered as nr 1
md: raid0 personality registered as nr 2
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
[events: 00000177]
[events: 00000177]
[events: 00000177]
md: autorun ...
md: considering sdc2 ...
md: adding sdc2 ...
md: adding sdb2 ...
md: adding sda2 ...
md: created md0
md: bind<sda2,1>
md: bind<sdb2,2>
md: bind<sdc2,3>
md: running: <sdc2><sdb2><sda2>
md: sdc2's event counter: 00000177
md: sdb2's event counter: 00000177
md: sda2's event counter: 00000177
md0: max total readahead window set to 1536k
md0: 3 data-disks, max readahead per data-disk: 512k
raid0: looking at sda2
raid0: comparing sda2(8779392) with sda2(8779392)
raid0: END
raid0: ==> UNIQUE
raid0: 1 zones
raid0: looking at sdb2
raid0: comparing sdb2(8779392) with sda2(8779392)
raid0: EQUAL
raid0: looking at sdc2
raid0: comparing sdc2(8779392) with sda2(8779392)
raid0: EQUAL
raid0: FINAL 1 zones
raid0: zone 0
raid0: checking sda2 ... contained as device 0
(8779392) is smallest!.
raid0: checking sdb2 ... contained as device 1
raid0: checking sdc2 ... contained as device 2
raid0: zone->nb_dev: 3, size: 26338176
raid0: current zone offset: 8779392
raid0: done.
raid0 : md_size is 26338176 blocks.
raid0 : conf->smallest->size is 26338176 blocks.
raid0 : nb_zone is 1.
raid0 : Allocating 8 bytes for hash.
md: updating md0 RAID superblock on device
md: sdc2 [events: 00000178]<6>(write) sdc2's sb offset: 8779456
md: sdb2 [events: 00000178]<6>(write) sdb2's sb offset: 8779456
md: sda2 [events: 00000178]<6>(write) sda2's sb offset: 8779456
md: ... autorun DONE.
LVM version 1.0.7(28/03/2003)
Initializing Cryptographic API
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP
IP: routing cache hash table of 8192 buckets, 64Kbytes
TCP: Hash tables configured (established 262144 bind 65536)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 112k freed
Adding Swap: 64252k swap-space (priority 1)
Adding Swap: 64252k swap-space (priority 1)
Adding Swap: 64252k swap-space (priority 1)
EXT3 FS 2.4-0.9.19, 19 August 2002 on md(9,0), internal journal
Real Time Clock Driver v1.10e
Intel(R) PRO/100 Network Driver - version 2.3.18-k1
Copyright (c) 2003 Intel Corporation

e100: selftest OK.
e100: eth0: Intel(R) PRO/100 Network Connection
Hardware receive checksums enabled
cpu cycle saver enabled

e100: selftest OK.
e100: eth1: Intel(R) PRO/100 Network Connection
Hardware receive checksums enabled
cpu cycle saver enabled

Creative EMU10K1 PCI Audio Driver, version 0.20, 12:14:33 Nov 9 2003
emu10k1: EMU10K1 rev 7 model 0x8061 found, IO at 0x2440-0x245f, IRQ 24
ac97_codec: AC97 Audio codec, id: 0x8384:0x7608 (SigmaTel STAC9708)
emu10k1: SBLive! 5.1 card detected
Uniform CD-ROM driver unloaded
hda: attached ide-scsi driver.
scsi2 : SCSI host adapter emulation for IDE ATAPI devices
Vendor: HL-DT-ST Model: RW/DVD GCC-4480B Rev: 1.00
Type: CD-ROM ANSI SCSI revision: 02
Attached scsi CD-ROM sr0 at scsi2, channel 0, id 0, lun 0
sr0: scsi3-mmc drive: 48x/48x writer cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.12
usb.c: registered new driver usbdevfs
usb.c: registered new driver hub
usb-ohci.c: USB OHCI at membase 0xf8894000, IRQ 10
usb-ohci.c: usb-00:0f.2, ServerWorks OSB4/CSB5 OHCI USB Controller
usb.c: new USB bus registered, assigned bus number 1
hub.c: USB hub found
hub.c: 4 ports detected
loop: loaded (max 8 devices)
hub.c: new USB device 00:0f.2-2, assigned address 2
usb.c: USB device 2 (vend/prod 0xbc7/0x4) is not claimed by any active driver.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on lvm(58,1), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
ip_tables: (C) 2000-2002 Netfilter core team
ip_conntrack version 2.1 (8192 buckets, 65536 max) - 292 bytes per conntrack
e100: eth0 NIC Link is Up 100 Mbps Full duplex
e100: eth1 NIC Link is Up 100 Mbps Half duplex
ipmi message handler version v27
IPMI watchdog driver version v27
ipmi_kcs: Acquiring BMC @ port=0xca2
MOXA Smartio family driver version 1.2.1
Tty devices major number = 174, callout devices major number = 175
Found MOXA C168H/PCI series board(BusNo=0,DevNo=8)


2003-11-10 10:49:18

by Marcelo Tosatti

Subject: Re: 2.4.23 crash on Intel SDS2



On Sun, 9 Nov 2003, Shane Wegner wrote:

> Hi,
>
> I posted some weeks ago regarding a crash I was
> experiencing with 2.4.23-pre4. I am just writing to
> confirm that 2.4.23-pre9 is still unable to run reliably on
> this machine. In my earlier post, I thought acpi might be
> the culprit as I had it enabled due to a bios bug. Intel
> since fixed that so I was able to boot 2.4.23-pre9 with
> acpi totally disabled in make config.

Shane,

Tracking down which -pre this started to happen would help a lot.

2003-11-10 11:20:36

by Mikael Pettersson

Subject: Re: 2.4.23 crash on Intel SDS2

On Sun, 9 Nov 2003 13:05:27 -0800, Shane Wegner wrote:
>I posted some weeks ago regarding a crash I was
>experiencing with 2.4.23-pre4. I am just writing to
>confirm that 2.4.23-pre9 is still unable to run reliably on
>this machine. In my earlier post, I thought acpi might be
>the culprit as I had it enabled due to a bios bug. Intel
>since fixed that so I was able to boot 2.4.23-pre9 with
>acpi totally disabled in make config.
>
>The problem is that after some time, usually between 30
>seconds and 15 minutes in, the system locks up. Nothing
>gets printed into the kernel logs or onto the console.
>After 60 seconds, the IPMI watchdog kicks in and reboots
>the system. I run Linux 2.4.22 over here with no problems
>with and without acpi.
>
>It's an Intel server board model SDS2 with a dual Pentium
>III tualatin 1.13ghz. I am attaching the dmesg output from
>the kernel in case it is helpful but as there is no panics
>or oops being printed, I am not sure how best I can help
>track this down. If there is anything further I can do or
>any other information needed, let me know.

Pass "nmi_watchdog=1" to the kernel to enable the
I/O-APIC based NMI watchdog. This should detect any
software (kernel) lockups and produce an oops for them.

Enabling the I/O-APIC NMI watchdog eliminated mysterious
lockups on our Dell PE2650 (dual Xeons, Serverworks chipset).

/Mikael
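
One way to confirm that the watchdog is actually armed after booting
with nmi_watchdog=1 is to watch the NMI line in /proc/interrupts: the
per-CPU counts should keep climbing. A minimal sketch, assuming the
usual "NMI: n n" layout of that file; the function name nmi_total and
its optional file argument are made up for illustration:

```shell
#!/bin/sh
# nmi_total [FILE]: sum the per-CPU NMI counts from /proc/interrupts.
# FILE defaults to /proc/interrupts; the argument exists only so the
# function can be tried against a saved copy.
nmi_total() {
    awk '$1 == "NMI:" {
        sum = 0
        for (i = 2; i <= NF; i++)
            if ($i ~ /^[0-9]+$/) sum += $i   # skip any trailing description text
        print sum
    }' "${1:-/proc/interrupts}"
}
```

Run it twice a few seconds apart (nmi_total; sleep 5; nmi_total). If
the number does not grow, the watchdog is probably not ticking and
cannot be expected to report a lockup.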

2003-11-11 10:57:44

by Marcelo Tosatti

Subject: Re: 2.4.23 crash on Intel SDS2



On Sun, 9 Nov 2003, Shane Wegner wrote:

> Hi,
>
> I posted some weeks ago regarding a crash I was
> experiencing with 2.4.23-pre4. I am just writing to
> confirm that 2.4.23-pre9 is still unable to run reliably on
> this machine. In my earlier post, I thought acpi might be
> the culprit as I had it enabled due to a bios bug. Intel
> since fixed that so I was able to boot 2.4.23-pre9 with
> acpi totally disabled in make config.
>
> The problem is that after some time, usually between 30
> seconds and 15 minutes in, the system locks up. Nothing
> gets printed into the kernel logs or onto the console.
> After 60 seconds, the IPMI watchdog kicks in and reboots
> the system. I run Linux 2.4.22 over here with no problems
> with and without acpi.
>
> It's an Intel server board model SDS2 with a dual Pentium
> III tualatin 1.13ghz. I am attaching the dmesg output from
> the kernel in case it is helpful but as there is no panics
> or oops being printed, I am not sure how best I can help
> track this down. If there is anything further I can do or
> any other information needed, let me know.

Shane,

Can you please find out in which -pre these strange lockups started happening?

Thank you

2003-11-12 07:07:17

by Shane Wegner

Subject: Re: 2.4.23 crash on Intel SDS2

On Mon, Nov 10, 2003 at 08:46:38AM -0200, Marcelo Tosatti wrote:
>
>
> On Sun, 9 Nov 2003, Shane Wegner wrote:
>
> > Hi,
> >
> > I posted some weeks ago regarding a crash I was
> > experiencing with 2.4.23-pre4. I am just writing to
> > confirm that 2.4.23-pre9 is still unable to run reliably on
> > this machine. In my earlier post, I thought acpi might be
> > the culprit as I had it enabled due to a bios bug. Intel
> > since fixed that so I was able to boot 2.4.23-pre9 with
> > acpi totally disabled in make config.
>
> Shane,
>
> Tracking down which -pre this started to happen would help a lot.

Ok, it starts happening in pre4. I am running pre3 now and
all is stable. The kernels I am using for testing are
compiled without acpi but that doesn't make a difference.

Shane

2003-11-12 11:09:53

by Marcelo Tosatti

Subject: Re: 2.4.23 crash on Intel SDS2



On Sun, 9 Nov 2003, Shane Wegner wrote:

> Hi,
>
> I posted some weeks ago regarding a crash I was
> experiencing with 2.4.23-pre4. I am just writing to
> confirm that 2.4.23-pre9 is still unable to run reliably on
> this machine. In my earlier post, I thought acpi might be
> the culprit as I had it enabled due to a bios bug. Intel
> since fixed that so I was able to boot 2.4.23-pre9 with
> acpi totally disabled in make config.
>
> The problem is that after some time, usually between 30
> seconds and 15 minutes in, the system locks up. Nothing
> gets printed into the kernel logs or onto the console.
> After 60 seconds, the IPMI watchdog kicks in and reboots
> the system. I run Linux 2.4.22 over here with no problems
> with and without acpi.
>
> It's an Intel server board model SDS2 with a dual Pentium
> III tualatin 1.13ghz. I am attaching the dmesg output from
> the kernel in case it is helpful but as there is no panics
> or oops being printed, I am not sure how best I can help
> track this down. If there is anything further I can do or
> any other information needed, let me know.

> On node 0 totalpages: 262144
> zone(0): 4096 pages.
> zone(1): 225280 pages.
> zone(2): 32768 pages.

What do you do (what is your workload) during the few minutes before
the crash?

There are no significant driver changes in -pre4 that could affect you.

Can you please try with mem=900M? I suspect something in the VM changes
might be causing this.
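
For reference, the zone sizes quoted above can be turned into
megabytes with a little arithmetic. The sketch below assumes 4 KB
pages (standard on i386) and the 896 MB lowmem boundary that the boot
log reports; pages_to_mb and highmem_mb are illustrative helper names,
not kernel interfaces.

```shell
#!/bin/sh
# Convert a page count to megabytes, assuming 4 KB pages.
pages_to_mb() {
    echo $(( $1 * 4 / 1024 ))
}

# Highmem left over for a given mem= limit (in MB), assuming the
# 896 MB lowmem boundary reported in the boot log.
highmem_mb() {
    if [ "$1" -gt 896 ]; then
        echo $(( $1 - 896 ))
    else
        echo 0
    fi
}

pages_to_mb 4096      # zone(0), DMA
pages_to_mb 225280    # zone(1), normal
pages_to_mb 32768     # zone(2), highmem
highmem_mb 900
highmem_mb 850
```

This gives 16, 880 and 128 MB for the three zones. Under these
assumptions a limit of mem=900M would still leave about 4 MB above the
lowmem boundary, whereas any limit at or below 896 MB leaves none.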



2003-11-12 11:29:28

by Marcelo Tosatti

Subject: Re: 2.4.23 crash on Intel SDS2



On Wed, 12 Nov 2003, Marcelo Tosatti wrote:

>
>
> On Sun, 9 Nov 2003, Shane Wegner wrote:
>
> > Hi,
> >
> > I posted some weeks ago regarding a crash I was
> > experiencing with 2.4.23-pre4. I am just writing to
> > confirm that 2.4.23-pre9 is still unable to run reliably on
> > this machine. In my earlier post, I thought acpi might be
> > the culprit as I had it enabled due to a bios bug. Intel
> > since fixed that so I was able to boot 2.4.23-pre9 with
> > acpi totally disabled in make config.
> >
> > The problem is that after some time, usually between 30
> > seconds and 15 minutes in, the system locks up. Nothing
> > gets printed into the kernel logs or onto the console.
> > After 60 seconds, the IPMI watchdog kicks in and reboots
> > the system. I run Linux 2.4.22 over here with no problems
> > with and without acpi.
> >
> > It's an Intel server board model SDS2 with a dual Pentium
> > III tualatin 1.13ghz. I am attaching the dmesg output from
> > the kernel in case it is helpful but as there is no panics
> > or oops being printed, I am not sure how best I can help
> > track this down. If there is anything further I can do or
> > any other information needed, let me know.
>
> > On node 0 totalpages: 262144
> > zone(0): 4096 pages.
> > zone(1): 225280 pages.
> > zone(2): 32768 pages.
>
> What do you (what is your workload) during the few minutes before the
> crash?
>
> There are no significant driver changes in -pre4 that could affect you.
>
> Can you please try with mem=900M? I suspect something in the VM changes
> might be causing this.

Ah, have you tried to boot with "nmi_watchdog=1" as Mikael suggested?



2003-11-12 18:22:23

by Shane Wegner

Subject: Re: 2.4.23 crash on Intel SDS2

On Wed, Nov 12, 2003 at 09:21:59AM -0200, Marcelo Tosatti wrote:
> > It's an Intel server board model SDS2 with a dual Pentium
> > III tualatin 1.13ghz. I am attaching the dmesg output from
> > the kernel in case it is helpful but as there is no panics
> > or oops being printed, I am not sure how best I can help
> > track this down. If there is anything further I can do or
> > any other information needed, let me know.
>
> > On node 0 totalpages: 262144
> > > zone(0): 4096 pages.
> > zone(1): 225280 pages.
> > zone(2): 32768 pages.
>
> > What do you (what is your workload) during the few minutes before the
> > crash?

It's a database machine running MySQL and Postgres. The
MySQL server runs about 4 queries/sec and Postgres only as
needed. It also does some minor mail service, say 2
messages per minute and runs apache at about 10 requests
per minute.

> > There are no significant driver changes in -pre4 that could affect you.
> >
> > Can you please try with mem=900M? I suspect something in the VM changes
> > might be causing this.

Just tried with mem=900m and subsequently mem=850m, so that
no highmem pages were available, with no effect. The machine
still crashed.

> Ah, have you tried to boot with "nmi_watchdog=1" as Mikael suggested?

Will try that next, thanks.

Shane

2003-11-15 12:33:34

by Marcelo Tosatti

Subject: Re: 2.4.23 crash on Intel SDS2



On Wed, 12 Nov 2003, Shane Wegner wrote:

> On Wed, Nov 12, 2003 at 09:21:59AM -0200, Marcelo Tosatti wrote:
> > > It's an Intel server board model SDS2 with a dual Pentium
> > > III tualatin 1.13ghz. I am attaching the dmesg output from
> > > the kernel in case it is helpful but as there is no panics
> > > or oops being printed, I am not sure how best I can help
> > > track this down. If there is anything further I can do or
> > > any other information needed, let me know.
> >
> > > On node 0 totalpages: 262144
> > > > zone(0): 4096 pages.
> > > zone(1): 225280 pages.
> > > zone(2): 32768 pages.
> >
> > > What do you (what is your workload) during the few minutes before the
> > > crash?
>
> It's a database machine running MySQL and Postgres. The
> MySQL server runs about 4 queries/sec and PostGres only as
> needed. It also does some minor mail service, say 2
> messages per minute and runs apache at about 10 requests
> per minute.
>
> > > There are no significant driver changes in -pre4 that could affect you.
> > >
> > > Can you please try with mem=900M? I suspect something in the VM changes
> > > might be causing this.
>
> Just tried with mem=900m and subsequently mem=850m so as no
> himem pages were available with no effect. Machine still
> crashed.
>
> > Ah, have you tried to boot with "nmi_watchdog=1" as Mikael suggested?
>
> Will try that next, thanks.

Shane,

Have you tried the NMI watchdog?



2003-11-15 20:58:32

by Shane Wegner

Subject: Re: 2.4.23 crash on Intel SDS2

On Sat, Nov 15, 2003 at 10:31:07AM -0200, Marcelo Tosatti wrote:
>
>
> On Wed, 12 Nov 2003, Shane Wegner wrote:
>
> > On Wed, Nov 12, 2003 at 09:21:59AM -0200, Marcelo Tosatti wrote:
> > > > It's an Intel server board model SDS2 with a dual Pentium
> > > > III tualatin 1.13ghz. I am attaching the dmesg output from
> > > > the kernel in case it is helpful but as there is no panics
> > > > or oops being printed, I am not sure how best I can help
> > > > track this down. If there is anything further I can do or
> > > > any other information needed, let me know.
> > >
> > > > On node 0 totalpages: 262144
> > > > > zone(0): 4096 pages.
> > > > zone(1): 225280 pages.
> > > > zone(2): 32768 pages.
> > >
> > > > What do you (what is your workload) during the few minutes before the
> > > > crash?
> >
> > It's a database machine running MySQL and Postgres. The
> > MySQL server runs about 4 queries/sec and PostGres only as
> > needed. It also does some minor mail service, say 2
> > messages per minute and runs apache at about 10 requests
> > per minute.
> >
> > > > There are no significant driver changes in -pre4 that could affect you.
> > > >
> > > Ah, have you tried to boot with "nmi_watchdog=1" as Mikael suggested?
> >
> > Will try that next, thanks.
>
> Shane,
>
> Have you tried the NMI watchdog?

Hi,

I did and unfortunately, it was of little help. If
anything though, it made the lockup more consistent. The
three times I tried to boot with nmi_watchdog=1, it locked
up when starting SpamAssassin. Nothing special about that
process but just above that it started the hotplug
subsystem which I use to automatically insert various usb
drivers as needed. Could that have anything to do with it?

Shane

Btw, to clarify, when the lockup occurs with nmi_watchdog,
no oops gets printed.

2003-11-15 21:12:18

by Willy Tarreau

Subject: Re: 2.4.23 crash on Intel SDS2

Hi,

On Sat, Nov 15, 2003 at 12:58:28PM -0800, Shane Wegner wrote:

> I did and unfortunately, it was of little help. If
> anything though, it made the lockup more consistent. The
> three times I tried to boot with nmi_watchdog=1, it locked
> up when starting SpamAssassin. Nothing special about that
> process but just above that it started the hotplug
> subsystem which I use to automatically insert various usb
> drivers as needed. Could that have anything to do with it?

Would it be possible to print a "starting XXX" and
"XXX started" before and after every service? And please
also try to disable automatic modprobe, or change it to
something which logs what is loaded. E.g.:

echo /root/mymodprobe >/proc/sys/kernel/modprobe

with mymodprobe basically looking like:

#!/bin/bash
# Log each request, flush it to disk, then hand over to the real modprobe.
{ date; echo "starting $*"; } >> /tmp/modprobe.log
sync
exec modprobe "$@"

> Btw, to clarify, when the lockup occurs with nmi_watchdog,
> no oops gets printed.

You may try nmi_watchdog=2. I was once advised to try =1,
but it never worked for me, while =2 worked as expected.
Don't ask me why, all I know is that there are a few other
people out there happily using it this way too.

Regards,
Willy
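
The "starting XXX" / "XXX started" idea can be sketched as a small
wrapper for the init scripts. This is only an illustration: run_logged
and the BOOT_TRACE_LOG variable are hypothetical names, and the sync
calls are there so that the last line has a chance to reach the disk
before a hard lockup.

```shell
#!/bin/sh
# run_logged NAME COMMAND [ARGS...]
# Log a line before and after running COMMAND, syncing each time so
# the log survives a lockup. BOOT_TRACE_LOG is a made-up variable name.
run_logged() {
    log=${BOOT_TRACE_LOG:-/tmp/boot-trace.log}
    name=$1
    shift
    echo "$(date) starting $name" >> "$log"
    sync
    "$@"
    status=$?
    echo "$(date) $name started (exit $status)" >> "$log"
    sync
    return $status
}
```

It would be called as, for example,
"run_logged hotplug /etc/init.d/hotplug start" (the init script path is
an assumption about the local setup). After a lockup, a "starting"
line with no matching "started" line points at the suspect service.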

2003-11-16 00:14:11

by Keith Owens

Subject: Re: 2.4.23 crash on Intel SDS2

On Sat, 15 Nov 2003 22:12:01 +0100,
Willy Tarreau <[email protected]> wrote:
>And please
>also try to disable automatic modprobe, or change it to
>something which logs what is loaded.

modprobe can log what is loaded. mkdir /var/log/ksymoops and insmod
will automatically and safely log all module loads and unloads in that
directory.

2003-11-16 22:02:04

by Shane Wegner

Subject: Re: 2.4.23 crash on Intel SDS2

> > It's a database machine running MySQL and Postgres. The
> > MySQL server runs about 4 queries/sec and PostGres only as
> > needed. It also does some minor mail service, say 2
> > messages per minute and runs apache at about 10 requests
> > per minute.
> >
> > > > There are no significant driver changes in -pre4 that could affect you.
> > > >
> > > > Can you please try with mem=900M? I suspect something in the VM changes
> > > > might be causing this.
> >
> > Just tried with mem=900m and subsequently mem=850m so as no
> > himem pages were available with no effect. Machine still
> > crashed.

Hi,

Well, I tried backing out the vm changes from pre4, no
luck so started disabling things. So far, it seems my
firewall script is at fault. I looked through the
pre3-pre4 diff and the only change to the nat code is a
one-liner.

# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/09/04 [email protected] 1.1063.41.4
# [NETFILTER]: NAT range calculation fix.
#
# This patch fixes a logic bug in NAT range calculations, which also
# causes a large slowdown when ICMP floods go through NAT.
#
# Author: Karlis Piesenieks
# --------------------------------------------
diff -Nru a/net/ipv4/netfilter/ip_nat_core.c b/net/ipv4/netfilter/ip_nat_core.c
--- a/net/ipv4/netfilter/ip_nat_core.c Sun Nov 16 13:41:25 2003
+++ b/net/ipv4/netfilter/ip_nat_core.c Sun Nov 16 13:41:25 2003
@@ -157,8 +157,8 @@
continue;
}

- if ((mr->range[i].flags & IP_NAT_RANGE_PROTO_SPECIFIED)
- && proto->in_range(&newtuple, IP_NAT_MANIP_SRC,
+ if (!(mr->range[i].flags & IP_NAT_RANGE_PROTO_SPECIFIED)
+ || proto->in_range(&newtuple, IP_NAT_MANIP_SRC,
&mr->range[i].min, &mr->range[i].max))
return 1;
}

Reversing that change has thus far fixed things over here
but time will tell. Is there any possible way that that
particular change is somehow not smp safe?

Shane
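
To see what the one-liner changes logically, the two conditions can be
tabulated. In the sketch below "spec" stands for the
IP_NAT_RANGE_PROTO_SPECIFIED flag being set and "inr" for the result
of proto->in_range(); both are reduced to 0/1, and the function names
are invented for illustration. It models only the boolean expression,
not the surrounding kernel code.

```shell
#!/bin/sh
# old_cond SPEC INR: spec && inr (the form removed by the patch)
old_cond() { [ "$1" -eq 1 ] && [ "$2" -eq 1 ]; }
# new_cond SPEC INR: !spec || inr (the form added by the patch)
new_cond() { [ "$1" -ne 1 ] || [ "$2" -eq 1 ]; }

for spec in 0 1; do
    for inr in 0 1; do
        old_cond "$spec" "$inr" && old=1 || old=0
        new_cond "$spec" "$inr" && new=1 || new=0
        echo "spec=$spec inr=$inr old=$old new=$new"
    done
done
```

The two forms differ only when the flag is not set: the old test never
matched such a range, while the new one returns 1 for it without
consulting in_range() at all.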

2003-11-17 11:44:05

by Marcelo Tosatti

Subject: Re: 2.4.23 crash on Intel SDS2



On Sun, 16 Nov 2003, Shane Wegner wrote:

> > > It's a database machine running MySQL and Postgres. The
> > > MySQL server runs about 4 queries/sec and PostGres only as
> > > needed. It also does some minor mail service, say 2
> > > messages per minute and runs apache at about 10 requests
> > > per minute.
> > >
> > > > > There are no significant driver changes in -pre4 that could affect you.
> > > > >
> > > > > Can you please try with mem=900M? I suspect something in the VM changes
> > > > > might be causing this.
> > >
> > > Just tried with mem=900m and subsequently mem=850m so as no
> > > himem pages were available with no effect. Machine still
> > > crashed.
>
> Hi,
>
> Well, I tried backing out the vm changes from pre4, no
> luck so started disabling things. So far, it seems my
> firewall script is at fault. I looked through the
> pre3-pre4 diff and the only change to the nat code is a
> one-liner.
>
> # The following is the BitKeeper ChangeSet Log
> # --------------------------------------------
> # 03/09/04 [email protected] 1.1063.41.4
> # [NETFILTER]: NAT range calculation fix.
> #
> # This patch fixes a logic bug in NAT range calculations, which also
> # causes a large slowdown when ICMP floods go through NAT.
> #
> # Author: Karlis Piesenieks
> # --------------------------------------------
> diff -Nru a/net/ipv4/netfilter/ip_nat_core.c b/net/ipv4/netfilter/ip_nat_core.c
> --- a/net/ipv4/netfilter/ip_nat_core.c Sun Nov 16 13:41:25 2003
> +++ b/net/ipv4/netfilter/ip_nat_core.c Sun Nov 16 13:41:25 2003
> @@ -157,8 +157,8 @@
> continue;
> }
>
> - if ((mr->range[i].flags & IP_NAT_RANGE_PROTO_SPECIFIED)
> - && proto->in_range(&newtuple, IP_NAT_MANIP_SRC,
> + if (!(mr->range[i].flags & IP_NAT_RANGE_PROTO_SPECIFIED)
> + || proto->in_range(&newtuple, IP_NAT_MANIP_SRC,
> &mr->range[i].min, &mr->range[i].max))
> return 1;
> }
>
> Reversing that change has thus far fixed things over here
> but time will tell. Is there any possible way that that
> particular change is somehow not smp safe?

That change is broken; it's known to break other setups.

It has been reverted in the BK tree.


2003-11-17 17:29:20

by Shane Wegner

Subject: Re: 2.4.23 crash on Intel SDS2

On Mon, Nov 17, 2003 at 09:29:10AM -0200, Marcelo Tosatti wrote:
> > # 03/09/04 [email protected] 1.1063.41.4
> > # [NETFILTER]: NAT range calculation fix.
> > #
> > # This patch fixes a logic bug in NAT range calculations, which also
> > # causes a large slowdown when ICMP floods go through NAT.
> > #
> > # Author: Karlis Piesenieks
> > Reversing that change has thus far fixed things over here
> > but time will tell. Is there any possible way that that
> > particular change is somehow not smp safe?
>
> That change is broken; it's known to break other setups.
>
> It has been reverted in the BK tree.

I can now pretty much confirm that was the cause. I have a
19 hour uptime with pre4. Thanks for all your assistance.

Shane