Hi,
I've been playing around with 8-way IBM8500R (8x700MHz Xeon) with 4.5GB
memory & AIC7xxx SCSI-controller. It's perfectly stable with 2.2-kernel
(from Red Hat 7) but very erratic on all 2.4-kernels I've tried it with
(2.4.[012], compiled both with egcs and RH7's gcc-2.96, both share the
same symptoms). It did have a ServeRAID controller too but IBM suggested
we take it out since 4500R also had problems with it on 2.4 but it didn't
make any difference at all. Also tried to turn off highmem support but
didn't make difference either.
Symptoms: it sometimes boots and stays up for a while (anything between 10
seconds to maximum of about half an hour) but most of the time it locks up
early in the boot while its enabling the CPUs:
--
Booting processor 4/3 eip 2000
Setting warm reset code and vector.
1.
2.
3.
Asserting INIT.
Waiting for send to finish...
+Deasserting INIT.
Waiting for send to finish...
+#startup loops: 2.
Sending STARTUP #1.
After apic_write.
Startup point 1.
Waiting for send to finish...
+Sending STARTUP #2.
After apic_write.
Startup point 1.
Waiting for send to finish...
+After Startup.
Before Callout 4.
After Callout 4.
--
..this is where it *usually* locks up, but the processor number where it
hangs varies randomly. Also it has locked up in other places too a couple
of times. If it boots and crashes then there's nothing in the logs, it's
just a sudden hard lockup.
If it is booted with "nosmp noapic" it seems perfectly stable but I'd sure
like to use those other 7 CPU's too :)
Any ideas/suggestions/patches etc would be greatly appreciated...
- Panu -
Here's a bootlog of a rare succesfull boot (hopefully got the copy-paste
right...)
---
Inspecting /boot/System.map-2.4.2
Loaded 14798 symbols from /boot/System.map-2.4.2.
Symbols match kernel version 2.4.2.
No module symbols loaded.
00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 1399.19 BogoMIPS
Stack at about c5cd7fb8
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
Intel machine check reporting enabled on CPU#2.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
OK.
CPU2: Intel Pentium III (Cascades) stepping 01
CPU has booted.
Booting processor 3/2 eip 2000
Setting warm reset code and vector.
1.
2.
3.
Asserting INIT.
Waiting for send to finish...
+Deasserting INIT.
Waiting for send to finish...
+#startup loops: 2.
Sending STARTUP #1.
After apic_write.
Startup point 1.
Waiting for send to finish...
+Sending STARTUP #2.
After apic_write.
Startup point 1.
Waiting for send to finish...
+After Startup.
Before Callout 3.
After Callout 3.
Initializing CPU#3
CPU#3 (phys ID: 2) waiting for CALLOUT
CALLIN, before setup_local_APIC().
masked ExtINT on CPU#3
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 1399.19 BogoMIPS
Stack at about c5cd5fb8
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
Intel machine check reporting enabled on CPU#3.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
OK.
CPU3: Intel Pentium III (Cascades) stepping 01
CPU has booted.
Booting processor 4/3 eip 2000
Setting warm reset code and vector.
1.
2.
3.
Asserting INIT.
Waiting for send to finish...
+Deasserting INIT.
Waiting for send to finish...
+#startup loops: 2.
Sending STARTUP #1.
After apic_write.
Startup point 1.
Waiting for send to finish...
+Sending STARTUP #2.
After apic_write.
Startup point 1.
Waiting for send to finish...
+After Startup.
Before Callout 4.
After Callout 4.
Initializing CPU#4
CPU#4 (phys ID: 3) waiting for CALLOUT
CALLIN, before setup_local_APIC().
masked ExtINT on CPU#4
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 1399.19 BogoMIPS
Stack at about c5cd3fb8
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
Intel machine check reporting enabled on CPU#4.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
OK.
CPU4: Intel Pentium III (Cascades) stepping 01
CPU has booted.
Booting processor 5/4 eip 2000
Setting warm reset code and vector.
1.
2.
3.
Asserting INIT.
Waiting for send to finish...
+Deasserting INIT.
Waiting for send to finish...
+#startup loops: 2.
Sending STARTUP #1.
After apic_write.
Startup point 1.
Waiting for send to finish...
+Sending STARTUP #2.
After apic_write.
Startup point 1.
Waiting for send to finish...
+After Startup.
Before Callout 5.
After Callout 5.
Initializing CPU#5
CPU#5 (phys ID: 4) waiting for CALLOUT
CALLIN, before setup_local_APIC().
masked ExtINT on CPU#5
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 1399.19 BogoMIPS
Stack at about c5cfffb8
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
Intel machine check reporting enabled on CPU#5.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
OK.
CPU5: Intel Pentium III (Cascades) stepping 01
CPU has booted.
Booting processor 6/5 eip 2000
Setting warm reset code and vector.
1.
2.
3.
Asserting INIT.
Waiting for send to finish...
+Deasserting INIT.
Waiting for send to finish...
+#startup loops: 2.
Sending STARTUP #1.
After apic_write.
Startup point 1.
Waiting for send to finish...
+Sending STARTUP #2.
After apic_write.
Startup point 1.
Waiting for send to finish...
+After Startup.
Before Callout 6.
After Callout 6.
Initializing CPU#6
CPU#6 (phys ID: 5) waiting for CALLOUT
CALLIN, before setup_local_APIC().
masked ExtINT on CPU#6
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 1399.19 BogoMIPS
Stack at about c5cfdfb8
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
Intel machine check reporting enabled on CPU#6.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
OK.
CPU6: Intel Pentium III (Cascades) stepping 01
CPU has booted.
Booting processor 7/6 eip 2000
Setting warm reset code and vector.
1.
2.
3.
Asserting INIT.
Waiting for send to finish...
+Deasserting INIT.
Waiting for send to finish...
+#startup loops: 2.
Sending STARTUP #1.
After apic_write.
Startup point 1.
Waiting for send to finish...
+Sending STARTUP #2.
After apic_write.
Startup point 1.
Waiting for send to finish...
+After Startup.
Before Callout 7.
After Callout 7.
Initializing CPU#7
CPU#7 (phys ID: 6) waiting for CALLOUT
CALLIN, before setup_local_APIC().
masked ExtINT on CPU#7
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 1399.19 BogoMIPS
Stack at about c5cfbfb8
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
Intel machine check reporting enabled on CPU#7.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
OK.
CPU7: Intel Pentium III (Cascades) stepping 01
CPU has booted.
Before bogomips.
Total of 8 processors activated (11190.27 BogoMIPS).
Before bogocount - setting activated=1.
Boot done.
ENABLING IO-APIC IRQs
...changing IO-APIC physical APIC ID to 14 ... ok.
Synchronizing Arb IDs.
..TIMER: vector=49 pin1=2 pin2=-1
activating NMI Watchdog ... done.
testing the IO APIC.......................
.................................... done.
calibrating APIC timer ...
..... CPU clock speed is 699.9388 MHz.
..... host bus clock speed is 99.9912 MHz.
cpu: 0, clocks: 999912, slice: 111101
CPU0<T0:999904,T1:888800,D:3,S:111101,C:999912>
cpu: 2, clocks: 999912, slice: 111101
cpu: 3, clocks: 999912, slice: 111101
cpu: 7, clocks: 999912, slice: 111101
cpu: 5, clocks: 999912, slice: 111101
cpu: 6, clocks: 999912, slice: 111101
cpu: 4, clocks: 999912, slice: 111101
cpu: 1, clocks: 999912, slice: 111101
CPU1<T0:999904,T1:777696,D:6,S:111101,C:999912>
CPU2<T0:999904,T1:666592,D:9,S:111101,C:999912>
CPU3<T0:999904,T1:555488,D:12,S:111101,C:999912>
CPU4<T0:999904,T1:444384,D:15,S:111101,C:999912>
CPU5<T0:999904,T1:333296,D:2,S:111101,C:999912>
CPU6<T0:999904,T1:222192,D:5,S:111101,C:999912>
CPU7<T0:999904,T1:111088,D:8,S:111101,C:999912>
checking TSC synchronization across CPUs:
BIOS BUG: CPU#7 improperly initialized, has 3 usecs TSC skew! FIXED.
Setting commenced=1, go go go
PCI: PCI BIOS revision 2.10 entry at 0xfd26c, last bus=15
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Compaq host bridge: last bus ff
PCI->APIC IRQ transform: (B0,I2,P0) -> 53
PCI->APIC IRQ transform: (B0,I14,P0) -> 51
PCI->APIC IRQ transform: (B0,I14,P0) -> 51
PCI->APIC IRQ transform: (B0,I15,P3) -> 49
PCI: Cannot allocate resource region 4 of device 00:0f.1
got res[1000:100f] for resource 4 of Intel Corporation 82371AB PIIX4 IDE
isapnp: Scanning for Pnp cards...
isapnp: No Plug & Play device found
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
IA-32 Microcode Update Driver: v1.08 <[email protected]>
Starting kswapd v1.8
pty: 256 Unix98 ptys configured
block: queued sectors max/low 2910933kB/2779861kB, 8192 slots per queue
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
loop: enabling 8 loop devices
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with
idebus=xx
PIIX4: IDE controller on PCI bus 00 dev 79
PIIX4: chipset revision 1
PIIX4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x1000-0x1007, BIOS settings: hda:pio, hdb:pio
hda: LTN485S, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: ATAPI 48X CD-ROM drive, 120kB Cache
Uniform CD-ROM driver Revision: 3.12
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
Serial driver version 5.02 (2000-08-09) with MANY_PORTS SHARE_IRQ
SERIAL_PCI ISAPNP enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
Real Time Clock Driver v1.10d
Non-volatile memory driver v1.1
SCSI subsystem driver Revision: 1.00
(scsi0) <Adaptec AIC-7896/7 Ultra2 SCSI host adapter> found at PCI 0/14/1
(scsi0) Wide Channel B, SCSI ID=7, 32/255 SCBs
(scsi0) Downloading sequencer code... 392 instructions downloaded
(scsi1) <Adaptec AIC-7896/7 Ultra2 SCSI host adapter> found at PCI 0/14/0
(scsi1) Wide Channel A, SCSI ID=7, 32/255 SCBs
(scsi1) Downloading sequencer code... 392 instructions downloaded
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
<Adaptec AIC-7896/7 Ultra2 SCSI host adapter>
scsi1 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
<Adaptec AIC-7896/7 Ultra2 SCSI host adapter>
(scsi1:0:0:0) Synchronous at 80.0 Mbyte/sec, offset 63.
Vendor: IBM-PSG Model: ST318404LC !# Rev: 3283
Type: Direct-Access ANSI SCSI revision: 03
(scsi1:0:1:0) Synchronous at 80.0 Mbyte/sec, offset 63.
Vendor: IBM-PSG Model: ST318404LC !# Rev: 3283
Type: Direct-Access ANSI SCSI revision: 03
Vendor: IBM Model: TP4.6 V41b3 Rev: 4.1b
Type: Processor ANSI SCSI revision: 02
Detected scsi disk sda at scsi1, channel 0, id 0, lun 0
Detected scsi disk sdb at scsi1, channel 0, id 1, lun 0
SCSI device sda: 35548320 512-byte hdwr sectors (18201 MB)
Partition check:
sda: sda1 sda2 sda3 < sda5 >
SCSI device sdb: 35548320 512-byte hdwr sectors (18201 MB)
sdb: sdb1 sdb2 < sdb5 >
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 65536 buckets, 512Kbytes
TCP: Hash tables configured (established 262144 bind 65536)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 208k freed
Adding Swap: 2096220k swap-space (priority -1)
On Thu, 1 Mar 2001, ext Andrew Morton wrote:
> "Matilainen Panu (NRC/Helsinki)" wrote:
> > On Thu, 1 Mar 2001, ext Andrew Morton wrote:
> > >
> > > Is it stable with `nmi_watchdog=0'?
> >
> > If the default value for nmi_watchdog is 0 then no - I added the
> > nmi_watchdog=1 just to see if that makes any difference. If it's on by
> > default then I'll need to test it that way.
>
> Default for nmi_watchdog is `enabled'.
>
> Several people have reported that turning it off with
> the `nmi_watchdog=0' LILO option makes systems stable.
> Nobody knows why.
>
> (If nmi_watchdog _does_ make the achine stable, please
> tell linux-kernel.).
It's too early to say for sure but that seems to have fixed it. Uptime now
nearly an hour under loads of 20-30 which is way more than it has been
able to stay up before. I'll let you know whether its still up tomorrow.
Million thanks for the tip!
- Panu -
>
--
On Thu, 1 Mar 2001, Matilainen Panu (NRC/Helsinki) wrote:
> I've been playing around with 8-way IBM8500R (8x700MHz Xeon) with 4.5GB
> memory & AIC7xxx SCSI-controller. It's perfectly stable with 2.2-kernel
> (from Red Hat 7) but very erratic on all 2.4-kernels I've tried it with
> (2.4.[012], compiled both with egcs and RH7's gcc-2.96, both share the
Under redhat 7 you should use kgcc to compile the kernel, since gcc2.96 is
inherently broken(*).
> same symptoms). It did have a ServeRAID controller too but IBM suggested
> we take it out since 4500R also had problems with it on 2.4 but it didn't
> make any difference at all. Also tried to turn off highmem support but
> didn't make difference either.
(*) redhat chose to ship an experimental compiler with this release of
the distribution that has a great many bugs. to ensure proper kernel
compillation another proven version of gcc was included, but called
kgcc instead. You should always use this to compile your kernels
under redhat 7 until the newer version of gcc is released.
talk to you later,
Kelsey Hudson [email protected]
Software Engineer
Compendium Technologies, Inc (619) 725-0771
---------------------------------------------------------------------------
> > (from Red Hat 7) but very erratic on all 2.4-kernels I've tried it with
> > (2.4.[012], compiled both with egcs and RH7's gcc-2.96, both share the
^^^^
> Under redhat 7 you should use kgcc to compile the kernel, since gcc2.96 is
So he was using egcs, and whether he had the pre-errata gcc 2.96 wouldnt matter
"Dr. Kelsey Hudson" wrote:
> Under redhat 7 you should use kgcc to compile the kernel, since gcc2.96 is
> inherently broken(*).
Or upgrade to the current Red Hat 7 gcc, which works quite well.
jjs
Just FYI,
I am chasing this problem. There appears to be an unpleasant interaction between
the Advanced Systems Management card and the NMI watchdog code. Ripping the card
out of the machine also eradicates the problem, but is less desirable.
I'll let people know when there's a better solution.
Tim
On Thu, Mar 01, 2001 at 03:30:56PM +0200, Matilainen Panu (NRC/Helsinki) wrote:
> On Thu, 1 Mar 2001, ext Andrew Morton wrote:
> > "Matilainen Panu (NRC/Helsinki)" wrote:
> > > On Thu, 1 Mar 2001, ext Andrew Morton wrote:
> > > >
> > > > Is it stable with `nmi_watchdog=0'?
> > >
> > > If the default value for nmi_watchdog is 0 then no - I added the
> > > nmi_watchdog=1 just to see if that makes any difference. If it's on by
> > > default then I'll need to test it that way.
> >
> > Default for nmi_watchdog is `enabled'.
> >
> > Several people have reported that turning it off with
> > the `nmi_watchdog=0' LILO option makes systems stable.
> > Nobody knows why.
> >
> > (If nmi_watchdog _does_ make the achine stable, please
> > tell linux-kernel.).
>
> It's too early to say for sure but that seems to have fixed it. Uptime now
> nearly an hour under loads of 20-30 which is way more than it has been
> able to stay up before. I'll let you know whether its still up tomorrow.
>
> Million thanks for the tip!
>
> - Panu -
>
--
Tim Wright - [email protected] or [email protected] or [email protected]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
On Thu, Mar 01, 2001 at 05:04:09PM -0800, Dr. Kelsey Hudson wrote:
> On Thu, 1 Mar 2001, Matilainen Panu (NRC/Helsinki) wrote:
>
> > I've been playing around with 8-way IBM8500R (8x700MHz Xeon) with 4.5GB
> > memory & AIC7xxx SCSI-controller. It's perfectly stable with 2.2-kernel
> > (from Red Hat 7) but very erratic on all 2.4-kernels I've tried it with
> > (2.4.[012], compiled both with egcs and RH7's gcc-2.96, both share the
>
> Under redhat 7 you should use kgcc to compile the kernel, since gcc2.96 is
> inherently broken(*).
>
For the umpteenth time, no it isn't. There are serious bugs in the shipped
version of gcc in RedHat 7.0, but they are fixed by applying the update.
The reason for supplying kgcc is to allow building a 2.2 kernel, because of
bugs in the kernel, NOT the compiler.
> > same symptoms). It did have a ServeRAID controller too but IBM suggested
> > we take it out since 4500R also had problems with it on 2.4 but it didn't
> > make any difference at all. Also tried to turn off highmem support but
> > didn't make difference either.
>
> (*) redhat chose to ship an experimental compiler with this release of
> the distribution that has a great many bugs. to ensure proper kernel
> compillation another proven version of gcc was included, but called
> kgcc instead. You should always use this to compile your kernels
> under redhat 7 until the newer version of gcc is released.
>
No. Provided you grab the update, you can build the 2.4 kernel perfectly
happily using the RedHat gcc snapshot. I'm running it successfully on a number
of machines. The issue with 2.4 on certain Netfinities is a bad interaction
between the NMI watchdog code and the systems management card. Changing
compilers makes no difference.
Tim
--
Tim Wright - [email protected] or [email protected] or [email protected]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
On Thu, 1 Mar 2001, Dr. Kelsey Hudson wrote:
>> I've been playing around with 8-way IBM8500R (8x700MHz Xeon) with 4.5GB
>> memory & AIC7xxx SCSI-controller. It's perfectly stable with 2.2-kernel
>> (from Red Hat 7) but very erratic on all 2.4-kernels I've tried it with
>> (2.4.[012], compiled both with egcs and RH7's gcc-2.96, both share the
>
>Under redhat 7 you should use kgcc to compile the kernel, since gcc2.96 is
>inherently broken(*).
http://www.bero.org/gcc296.html
>> same symptoms). It did have a ServeRAID controller too but IBM suggested
>> we take it out since 4500R also had problems with it on 2.4 but it didn't
>> make any difference at all. Also tried to turn off highmem support but
>> didn't make difference either.
>
>(*) redhat chose to ship an experimental compiler with this release of
> the distribution that has a great many bugs. to ensure proper kernel
> compillation another proven version of gcc was included, but called
> kgcc instead. You should always use this to compile your kernels
> under redhat 7 until the newer version of gcc is released.
http://www.bero.org/gcc296.html
----------------------------------------------------------------------
Mike A. Harris - Linux advocate - Free Software advocate
This message is copyright 2001, all rights reserved.
Views expressed are my own, not necessarily shared by my employer.
----------------------------------------------------------------------
Red Hat Linux: http://www.redhat.com
Download for free: ftp://ftp.redhat.com/pub/redhat/redhat-6.2/
On Fri, 2 Mar 2001, ext Alan Cox wrote:
> > > (from Red Hat 7) but very erratic on all 2.4-kernels I've tried it with
> > > (2.4.[012], compiled both with egcs and RH7's gcc-2.96, both share the
> ^^^^
>
> > Under redhat 7 you should use kgcc to compile the kernel, since gcc2.96 is
>
> So he was using egcs, and whether he had the pre-errata gcc 2.96
> wouldnt matter
Since this (once again) came up... I've been running 2.4.[012] on my home
box compiled with 2.96-errata without a single problem so far.
And yes I know it's not supported, consider this just a datapoint :)
- Panu -
>