**** SUMMARY
Enabling hyperthreading on Asus PCDL Deluxe motherboard w/ 2xP4Xeon causes
system freeze in short order.
**** DESCRIPTION
I have an Asus PCDL Deluxe P4Xeon motherboard which has a north/southbridge
combination allowing system memory to run at 533MHz. The motherboard has the
usual Asus onboard gigabit ethernet, AD1985 audio, Intel and Promise SATA
controllers, firewire. All are enabled and operate. I am running dual 2.8GHz
P4 Xeons.
When hyperthreading is disabled the system is perfectly stable and usable. No
operating artefacts seem to occur and SMP appears to work correctly.
However when hyperthreading is enabled, the system operates for a brief period
(enough for KDE to boot, for example) before halting. When operating from the
command line it is usual to see a Machine Check Exception error immediately
prior to system failure.
**** KEYWORDS
SMP Hyperthreading Asus Xeon Freeze Crash
**** KERNELS
2.4.2x 2.6.x
**** OOPS
Not available
**** CATALYST
Enable hyperthreading
**** ENVIRONMENT
Asus PCDL Deluxe motherboard
(http://usa.asus.com/products/server/srv-mb/pc-dl/overview.HTM)
**** SOFTWARE
Gnu C 3.3.2
Gnu make 3.80
util-linux 2.12
mount 2.12
module-init-tools 0.9.15-pre4
e2fsprogs 1.34
Linux C Library 2.3.2
Dynamic linker (ldd) 2.3.2
Procps 3.1.15
Net-tools 1.60
Kbd 1.08
Sh-utils 5.0.91
Modules Loaded fglrx eeprom i2c_isa lm75 i2c_i801 i2c_algo_bit i2c_dev
i2c_sensor i2c_core
**** PROCESSOR
NOTE Hyperthreading disabled
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.80GHz
stepping : 7
cpu MHz : 2807.537
cache size : 512 KB
physical id : 0
siblings : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 5537.79
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.80GHz
stepping : 7
cpu MHz : 2807.537
cache size : 512 KB
physical id : 0
siblings : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 5603.32
**** MODULES
fglrx 208932 7 - Live 0xe09d5000
eeprom 11656 0 - Live 0xe08c9000
i2c_isa 6144 0 - Live 0xe08b3000
lm75 11908 0 - Live 0xe08c0000
i2c_i801 12176 0 - Live 0xe089f000
i2c_algo_bit 13832 0 - Live 0xe08bb000
i2c_dev 14976 0 - Live 0xe08b6000
i2c_sensor 7168 2 eeprom,lm75, Live 0xe08a3000
i2c_core 27524 7 eeprom,i2c_isa,lm75,i2c_i801,i2c_algo_bit,i2c_dev,i2c_sensor, Live 0xe08a8000
**** DRIVER AND HARDWARE
(IOPORTS)
0000-001f : dma1
0020-0021 : pic1
0040-005f : timer
0060-006f : keyboard
0070-0077 : rtc
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : libata
01f0-01f7 : ide0
0378-037a : parport0
03c0-03df : vga+
03f6-03f6 : ide0
0500-051f : 0000:00:1f.3
0500-050f : i801-smbus
0cf8-0cff : PCI conf1
9000-9fff : PCI Bus #01
9000-90ff : 0000:01:00.0
a000-afff : PCI Bus #02
a000-a01f : 0000:02:01.0
a000-a01f : e1000
b000-b01f : 0000:00:1d.0
b000-b01f : uhci_hcd
b400-b41f : 0000:00:1d.1
b400-b41f : uhci_hcd
b800-b81f : 0000:00:1d.2
b800-b81f : uhci_hcd
bc00-bc1f : 0000:00:1d.3
bc00-bc1f : uhci_hcd
d800-d8ff : 0000:00:1f.5
dc00-dc3f : 0000:00:1f.5
f000-f00f : 0000:00:1f.2
f000-f00f : libata
(IOMEM)
00000000-0009f7ff : System RAM
0009f800-0009ffff : reserved
000a0000-000bffff : Video RAM area
000d0000-000d0fff : Extension ROM
000f0000-000fffff : System ROM
00100000-1ffeffff : System RAM
00100000-0036a323 : Kernel code
0036a324-00492e7f : Kernel data
1fff0000-1fff2fff : ACPI Non-volatile Storage
1fff3000-1fffffff : ACPI Tables
e0000000-e7ffffff : 0000:00:00.0
e8000000-f7ffffff : PCI Bus #01
e8000000-efffffff : 0000:01:00.0
f0000000-f7ffffff : 0000:01:00.1
f8000000-f9ffffff : PCI Bus #01
f9000000-f900ffff : 0000:01:00.0
f9010000-f901ffff : 0000:01:00.1
fa000000-fa0fffff : PCI Bus #02
fa000000-fa01ffff : 0000:02:01.0
fa000000-fa01ffff : e1000
fa100000-fa103fff : 0000:03:03.0
fa104000-fa1047ff : 0000:03:03.0
fa104000-fa1047ff : ohci1394
fa200000-fa2003ff : 0000:00:1d.7
fa200000-fa2003ff : ehci_hcd
fa201000-fa2011ff : 0000:00:1f.5
fa201000-fa2011ff : Intel ICH5 - AC'97
fa202000-fa2020ff : 0000:00:1f.5
fa202000-fa2020ff : Intel ICH5 - Controller
fec00000-ffffffff : reserved
**** PCI
(lspci not available; here output of cat /proc/bus/pci/devices)
0000 80862578 0 e0000008 00000000 00000000
00000000 00000000 00000000 00000000 08000000
00000000 00000000 00000000 00000000 00000000
00000000 agpgart-intel
0008 80862579 0 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000
00000000
0018 8086257b 0 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000
00000000
00e8 808624d2 10 00000000 00000000 00000000
00000000 0000b001 00000000 00000000 00000000
00000000 00000000 00000000 00000020 00000000
00000000 uhci_hcd
00e9 808624d4 13 00000000 00000000 00000000
00000000 0000b401 00000000 00000000 00000000
00000000 00000000 00000000 00000020 00000000
00000000 uhci_hcd
00ea 808624d7 12 00000000 00000000 00000000
00000000 0000b801 00000000 00000000 00000000
00000000 00000000 00000000 00000020 00000000
00000000 uhci_hcd
00eb 808624de 10 00000000 00000000 00000000
00000000 0000bc01 00000000 00000000 00000000
00000000 00000000 00000000 00000020 00000000
00000000 uhci_hcd
00ef 808624dd 17 fa200000 00000000 00000000
00000000 00000000 00000000 00000000 00000400
00000000 00000000 00000000 00000000 00000000
00000000 ehci_hcd
00f0 8086244e 0 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000
00000000
00f8 808624d0 0 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000
00000000
00fa 808624d1 12 00000000 00000000 00000000
00000000 0000f001 00000000 00000000 00000000
00000000 00000000 00000000 00000010 00000000
00000000 ata_piix
00fb 808624d3 11 00000000 00000000 00000000
00000000 00000501 00000000 00000000 00000000
00000000 00000000 00000000 00000020 00000000
00000000 i801 smbus
00fd 808624d5 11 0000d801 0000dc01 fa201000
fa202000 00000000 00000000 00000000 00000100
00000040 00000200 00000100 00000000 00000000
00000000 Intel ICH
0100 10024e48 10 e8000008 00009001 f9000000
00000000 00000000 00000000 00000000 08000000
00000100 00010000 00000000 00000000 00000000
00020000
0101 10024e68 0 f0000008 f9010000 00000000
00000000 00000000 00000000 00000000 08000000
00010000 00000000 00000000 00000000 00000000
00000000
0208 80861019 12 fa000000 00000000 0000a001
00000000 00000000 00000000 00000000 00020000
00000000 00000020 00000000 00000000 00000000
00000000 e1000
0318 104c8023 14 fa104000 fa100000 00000000
00000000 00000000 00000000 00000000 00000800
00004000 00000000 00000000 00000000 00000000
00000000 ohci1394
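Since lspci was not available, the bus address, vendor/device IDs and IRQ can still be read out of the first three columns of each record. A minimal sketch, assuming the usual column layout of /proc/bus/pci/devices (bus and devfn, then vendor and device, then IRQ, all in hex); the helper name parse_pci_record is made up for illustration:

```python
# Sketch: split the leading columns of a /proc/bus/pci/devices record.
# Assumed layout: column 1 = bus (high byte) and devfn (low byte),
# column 2 = vendor (high 16 bits) and device (low 16 bits), column 3 = IRQ.
# parse_pci_record is a hypothetical helper, not part of any existing tool.

def parse_pci_record(line):
    fields = line.split()
    busdevfn = int(fields[0], 16)
    ids = int(fields[1], 16)
    return {
        # devfn packs the device number in the top 5 bits, function in the low 3
        "address": "%02x:%02x.%x" % (busdevfn >> 8, (busdevfn >> 3) & 0x1F, busdevfn & 0x7),
        "vendor": "%04x" % (ids >> 16),
        "device": "%04x" % (ids & 0xFFFF),
        "irq": int(fields[2], 16),
    }


if __name__ == "__main__":
    # Leading columns copied from three of the records listed above.
    for record in ("00fd 808624d5 11", "0208 80861019 12", "0318 104c8023 14"):
        print(parse_pci_record(record))
```

For the 00fd record this gives address 00:1f.5, which matches the address shown against the ICH5 AC'97 entries in the IOMEM listing.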
**** SCSI
/proc/scsi/scsi not available
On Wed, 10 Mar 2004, Richard Browning wrote:
> **** SUMMARY
> Enabling hyperthreading on Asus PCDL Deluxe motherboard w/ 2xP4Xeon causes
> system freeze in short order.
>
> **** DESCRIPTION
> I have an Asus PCDL Deluxe P4Xeon motherboard which has a north/southbridge
> combination allowing system memory to run at 533MHz. The motherboard has the
> usual Asus onboard gigabit ethernet, AD1985 audio, Intel and Promise SATA
> controllers, firewire. All are enabled and operate. I am running dual 2.8GHz
> P4 Xeons.
>
> When hyperthreading is disabled the system is perfectly stable and usable. No
> operating artefacts seem to occur and SMP appears to work correctly.
>
> However when hyperthreading is enabled, the system operates for a brief period
> (enough for KDE to boot, for example) before halting. When operating from the
> command line it is usual to see a Machine Check Exception error immediately
> prior to system failure.
>
> **** ENVIRONMENT
> Asus PCDL Deluxe motherboard
> (http://usa.asus.com/products/server/srv-mb/pc-dl/overview.HTM)
Just to make sure, have you tried a bios upgrade?
On Wed, 2004-03-10 at 07:27, Richard Browning wrote:
> When operating from the
> command line it is usual to see a Machine Check Exception error
> immediately prior to system failure.
details?
On Wednesday 10 March 2004 16:10, Zwane Mwaikambo wrote:
> Just to make sure, have you tried a bios upgrade?
No, because the two BIOS upgrades available are minor boot-up UI
modifications. I can do it though (<fx>purchases floppy drive</fx>) and I'll
get back to you.
R
On Thursday 11 March 2004 06:50, Len Brown wrote:
> On Wed, 2004-03-10 at 07:27, Richard Browning wrote:
> > When operating from the
> > command line it is usual to see a Machine Check Exception error
> > immediately prior to system failure.
>
> details?
CPU 2: Machine Check Exception: 000...0004
CPU 3: Machine Check Exception: 000...0004
Is what I get now. Previously I also got "Kernel Context Corrupt" in addition
to the above.
The MCE error can be made to appear when, for example, I'm running a
configure/compile cycle. I thought it might be a SATA issue so installed
Mandrake on an IDE drive but I got the same error.
As an aside, I'm running Gentoo with CFLAGS=-O2 -march=pentium4 -pipe ... I
was running CFLAGS=-O3 -march=pentium3 -mcpu=pentium4 -mmmx -msse2 (and so
on) with no difference whatsoever (probably shouldn't be a surprise bearing
in mind the kernel is compiled with its own flags).
I've run MEMTEST with no errors reported. It's a very interesting problem,
since with SMP only everything works okay. But SMP+Hyperthreading and, sooner
rather than later, the thing will bomb.
R
On Thursday 11 March 2004 06:50, Len Brown wrote:
> On Wed, 2004-03-10 at 07:27, Richard Browning wrote:
> > When operating from the
> > command line it is usual to see a Machine Check Exception error
> > immediately prior to system failure.
>
> details?
I've now updated the BIOS to the latest version available on Asus' site. The
crash occurs even earlier - during bootup this time. Exactly:
CPU2: Machine Check Exception: 000.0004
CPU3: Machine Check Exception: 000.0004
Bank 0: a20000008c010400Bank0: a20000008c010400
Kernel Panic: CPU context corrupt
In idle task - not syncing
Again, disabling hyperthreading allows the system to operate normally.
R
On Thu, 11 Mar 2004, Richard Browning wrote:
> I've now updated the BIOS to the latest version available on Asus' site. The
> crash occurs even earlier - during bootup this time. Exactly:
>
> CPU2: Machine Check Exception: 000.0004
> CPU3: Machine Check Exception: 000.0004
> Bank 0: a20000008c010400Bank0: a20000008c010400
> Kernel Panic: CPU context corrupt
> In idle task - not syncing
>
> Again, disabling hyperthreading allows the system to operate normally.
For my own curiosity, does switching the processors around do anything?
Those MCEs look confined to the non bootstrap processor package.
On Thursday 11 March 2004 22:17, Zwane Mwaikambo wrote:
> On Thu, 11 Mar 2004, Richard Browning wrote:
> > I've now updated the BIOS to the latest version available on Asus' site.
> > The crash occurs even earlier - during bootup this time. Exactly:
> >
> > CPU2: Machine Check Exception: 000.0004
> > CPU3: Machine Check Exception: 000.0004
> > Bank 0: a20000008c010400Bank0: a20000008c010400
> > Kernel Panic: CPU context corrupt
> > In idle task - not syncing
> >
> > Again, disabling hyperthreading allows the system to operate normally.
>
> For my own curiosity, does switching the processors around do anything?
> Those MCEs look confined to the non bootstrap processor package.
Switched CPUs. This time I get the following:
CPU3: Machine Check Exception: 000.0004
CPU2: Machine Check Exception: 000.0004
Bank 0: a20000008c010400
Kernel Panic: CPU context corrupt
In idle task - not syncing
Note that the CPU# designations are swapped and that there's only one Bank 0:
message. Is this significant?
R
On Fri, 12 Mar 2004, Richard Browning wrote:
> > For my own curiosity, does switching the processors around do anything?
> > Those MCEs look confined to the non bootstrap processor package.
>
> Switched CPUs. This time I get the following:
>
> CPU3: Machine Check Exception: 000.0004
> CPU2: Machine Check Exception: 000.0004
> Bank 0: a20000008c010400
> Kernel Panic: CPU context corrupt
> In idle task - not syncing
>
> Note that the CPU# designations are swapped and that there's only one Bank 0:
> message. Is this significant?
Ok, but that's still on the same package so it's not moving with the
processor, thanks. Could you also supply processor info from
/proc/cpuinfo.
On Friday 12 March 2004 00:36, Zwane Mwaikambo wrote:
> On Fri, 12 Mar 2004, Richard Browning wrote:
> > > For my own curiosity, does switching the processors around do anything?
> > > Those MCEs look confined to the non bootstrap processor package.
> >
> > Switched CPUs. This time I get the following:
> >
> > CPU3: Machine Check Exception: 000.0004
> > CPU2: Machine Check Exception: 000.0004
> > Bank 0: a20000008c010400
> > Kernel Panic: CPU context corrupt
> > In idle task - not syncing
> >
> > Note that the CPU# designations are swapped and that there's only one
> > Bank 0: message. Is this significant?
>
> Ok, but that's still on the same package so it's not moving with the
> processor, thanks. Could you also supply processor info from
> /proc/cpuinfo.
I suppose that's good (for me); indicates no hardware error?
/proc/cpuinfo of course:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.80GHz
stepping : 7
cpu MHz : 2807.537
cache size : 512 KB
physical id : 0
siblings : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 5537.79
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.80GHz
stepping : 7
cpu MHz : 2807.537
cache size : 512 KB
physical id : 0
siblings : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 5603.32
R
Maybe it's worthwhile pointing out that this Asus mobo achieves the unusual
Xeon 533MHz front side bus speed by combining the Intel 875P (Canterwood)
northbridge with the ICH5 southbridge. Dunno if that's significant.
R
On Thu, 2004-03-11 at 19:42, Richard Browning wrote:
> On Friday 12 March 2004 00:36, Zwane Mwaikambo wrote:
> > On Fri, 12 Mar 2004, Richard Browning wrote:
> > > > For my own curiosity, does switching the processors around do anything?
> > > > Those MCEs look confined to the non bootstrap processor package.
> > >
> > > Switched CPUs. This time I get the following:
> > >
> > > CPU3: Machine Check Exception: 000.0004
> > > CPU2: Machine Check Exception: 000.0004
> > > Bank 0: a20000008c010400
> > > Kernel Panic: CPU context corrupt
> > > In idle task - not syncing
> > >
> > > Note that the CPU# designations are swapped and that there's only one
> > > Bank 0: message. Is this significant?
> >
> > Ok, but that's still on the same package so it's not moving with the
> > processor, thanks. Could you also supply processor info from
> > /proc/cpuinfo.
>
> I suppose that's good (for me); indicates no hardware error?
MCE == hardware error.
In this case un-recoverable.
I'll take a swing at decoding this, call the Coast Guard if I don't
return in 30 minutes;-)
http://developer.intel.com/design/pentium4/manuals/25366813.pdf
> Machine Check Exception: 000.0004
fig 14-4 says this means that indeed, you have a valid MCE.
> Bank 0: a20000008c010400
fig 14-6 says:
63: valid register contents
61: UC -- processor did not correct the error
57: PCC -- Processor context corrupt (you're dead)
0400 is the MCA error code
fig E2 says
10 - internal watchdog timeout.
26,27 -- TT -- Thread timeout indicator -- both threads timed out
> /proc/cpuinfo of course:
>
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 15
> model : 2
I have no idea what causes this error, but it sure sounds specific to
the processor, and specific to HT -- which matches your experiments.
I'd imagine that after you verify that you've got the latest BIOS for
the board and the error persists that you should look into getting that
specific processor replaced.
cheers,
-Len
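The decode above can be reproduced mechanically. A rough sketch that pulls out only the fields named in the message (bits 63, 61 and 57, the low 16-bit MCA error code, and bits 26-27); the function and key names are illustrative and not taken from any kernel source:

```python
# Sketch: extract the status fields discussed above from a Bank 0 value.
# Bit positions are the ones quoted in the message; names are hypothetical.

def decode_bank_status(status):
    mca_code = status & 0xFFFF
    return {
        "valid": bool((status >> 63) & 1),            # bit 63: register contents valid
        "uncorrected": bool((status >> 61) & 1),      # bit 61: UC
        "context_corrupt": bool((status >> 57) & 1),  # bit 57: PCC
        "mca_error_code": mca_code,                   # bits 15:0
        "watchdog_timeout": bool((mca_code >> 10) & 1),  # bit 10 of the MCA code
        "thread_timeout_bits": (status >> 26) & 0x3,  # bits 27:26, 3 = both set
    }


if __name__ == "__main__":
    fields = decode_bank_status(0xA20000008C010400)
    for name, value in fields.items():
        print(name, "=", hex(value) if name == "mca_error_code" else value)
```

Fed the value from the panic message, every flag comes back set and the MCA error code is 0x400, in line with the manual reading given above.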
Hmm, read that note too fast...
Since the failure did not follow the package to the BSP socket
(CPU0/CPU1), but instead stayed with the AP (CPU2/CPU3) socket, that
suggests an issue with the MB rather than the processor itself.
-Len
On Friday 12 March 2004 07:07, Len Brown wrote:
> Hmm, read that note too fast...
> Since the failure did not follow the package to the BSP socket
> (CPU0/CPU1), but instead stayed with the AP (CPU2/CPU3) socket, that
> suggests an issue with the MB rather than the processor itself.
But how is it possible that simply activating HT could cause this error? Is it
not the BIOS/processor that determines HT support ... it operates correctly
in SMP mode. Hm.
R
On Friday 12 March 2004 07:07, Len Brown wrote:
> Hmm, read that note too fast...
> Since the failure did not follow the package to the BSP socket
> (CPU0/CPU1), but instead stayed with the AP (CPU2/CPU3) socket, that
> suggests an issue with the MB rather than the processor itself.
Right then. I've just 'borrowed' the same motherboard (Asus PC-DL Deluxe) and
CPUs from the local friendly computer place. I simply connected the hard
drive to their combo and booted. SAME RESULT.
To confirm these (SEVERE) issues, I installed (deep breath, apologies, etc,)
Windows XP Pro. Regrettably, the Micro$oft beast worked perfectly. Four
processors in task manager, no hangs after several hours of 'doing stuff'.
I don't know what's going on with Linux and this motherboard. Is it the
strangeness of Asus putting a Canterwood and Intel ICH5 chipset together to
get the 533 FSB out of the Xeon?
The symptoms only show up if I'm doing 'dev'. In other words, I can boot into
KDE, play music, watch DVDs and even play Enemy Territory - all with HT
enabled. However AS SOON as I begin the configure/make/install cycle the
system will hang. It's almost like an interrupt is gumming up the works
somehow, but I lack the expertise to pinpoint it. For the record, I'm running
a Radeon9800Pro graphics card.
I dunno why there haven't been more issues like this. Then again, most of the
folk I hear on the Asus forums are using these mobos with Windoze.
What can I do to help you chaps get to the bottom of this? (Interestingly I
note that turning off ACPI - not APIC - with HT enabled causes the kernel to
not realise that HT is in fact enabled. I thought the kernel used CPUIDs to
work out whether HT was enabled?)
R
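One way to see what the kernel has actually enumerated is to count the processor entries in /proc/cpuinfo and compare them with the siblings field. A small sketch, assuming the field names shown in the listings in this thread; the sample text is a shortened copy of that output and the function names are hypothetical:

```python
# Sketch: summarise logical CPUs, the "ht" flag and the siblings count
# from /proc/cpuinfo-style text. Field names are taken from the listing
# earlier in the thread; parse_cpuinfo and summarize are made-up helpers.

SAMPLE = """\
processor : 0
physical id : 0
siblings : 1
flags : fpu vme de pse tsc msr pae mce cx8 apic sse sse2 ss ht tm pbe cid
processor : 1
physical id : 0
siblings : 1
flags : fpu vme de pse tsc msr pae mce cx8 apic sse sse2 ss ht tm pbe cid
"""


def parse_cpuinfo(text):
    cpus = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank lines and wrapped continuation lines
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "processor":
            cpus.append({})  # each "processor" line starts a new record
        if cpus:
            cpus[-1][key] = value
    return cpus


def summarize(cpus):
    return {
        "logical_cpus": len(cpus),
        "ht_flag": bool(cpus) and all("ht" in c.get("flags", "").split() for c in cpus),
        "siblings": sorted({int(c.get("siblings", "1")) for c in cpus}),
    }


if __name__ == "__main__":
    print(summarize(parse_cpuinfo(SAMPLE)))
    # On a live system: summarize(parse_cpuinfo(open("/proc/cpuinfo").read()))
```

The ht flag only says the CPU advertises the capability; in the listing above it is present even with hyperthreading disabled, while siblings stays at 1.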
Gents
Is there anyone in kernelland who is tackling this? I'm currently in the
throes of recompiling everything with -march=pentium3 -O2 to see if these
simple flags make a difference (as I reiterate, Windoze XP works without
problem). I refuse to believe that I will have to use XP in order to get my
money's worth. I've always thought anything Doze can do, GNU/Linux does
better!
R
On Saturday 20 March 2004 14:33, Richard Browning wrote:
> Is there anyone in kernelland who is tackling this? I'm currently in the
> throes of recompiling everything with -march=pentium3 -O2 to see if these
> simple flags make a difference (as I reiterate, Windoze XP works without
> problem). I refuse to believe that I will have to use XP in order to get my
> money's worth. I've always thought anything Doze can do, GNU/Linux does
> better!
It's been pointed out to me that this Doze-vs-Linux comparison won't help. If
you're offended by it, please don't be. I've been a supporter and contributor
to the GNU cause for the last six years and I'm not going to stop. However
this is the first serious issue that I've come across re: the kernel and
incompatible hardware and I would like to see it fixed. Like I've said, I
want to help pinpoint the problem - but no-one's asking.
R
On Sat, 20 Mar 2004, Richard Browning wrote:
> It's been pointed out to me that this Doze-vs-Linux comparison won't help. If
> you're offended by it, please don't be. I've been a supporter and contributor
> to the GNU cause for the last six years and I'm not going to stop. However
> this is the first serious issue that I've come across re: the kernel and
> incompatible hardware and I would like to see it fixed. Like I've said, I
> want to help pinpoint the problem - but no-one's asking.
It's good that you've verified that the hardware should work fine, could
you please try the attached configuration. I'll also check errata,
but i'm pretty sure there isn't anything flagged for this problem. But
i'll double check.
Thanks
On Saturday 20 March 2004 21:30, Zwane Mwaikambo wrote:
> On Sat, 20 Mar 2004, Richard Browning wrote:
> > It's been pointed out to me that this Doze-vs-Linux comparison won't
> > help. If you're offended by it, please don't be. I've been a supporter
> > and contributor to the GNU cause for the last six years and I'm not going
> > to stop. However this is the first serious issue that I've come across
> > re: the kernel and incompatible hardware and I would like to see it
> > fixed. Like I've said, I want to help pinpoint the problem - but no-one's
> > asking.
>
> It's good that you've verified that the hardware should work fine, could
> you please try the attached configuration. I'll also check errata,
> but i'm pretty sure there isn't anything flagged for this problem. But
> i'll double check.
I'll try this config in the morning - off to bed now (4am).
Thanks!
R
On Sat, 20 Mar 2004, Richard Browning wrote:
> Gents
>
> Is there anyone in kernelland who is tackling this? I'm currently in the
> throes of recompiling everything with -march=pentium3 -O2 to see if these
> simple flags make a difference (as I reiterate, Windoze XP works without
> problem). I refuse to believe that I will have to use XP in order to get my
> money's worth. I've always thought anything Doze can do, GNU/Linux does
> better!
I have been using 2.4.24 with SMP and hyperthreading with no
problems. FYI, the reference to Windows is useless, because
M$ was unable to make any SMP stuff function without crashes
since windows 2000/professional, later Windows versions don't
use your additional CPUs at all, they just report that they
exist. FYI, see if you can find your CPU resources at all
in XP!!! They just don't want you to know!
Cheers,
Dick Johnson
Penguin : Linux version 2.4.24 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.
On Sunday 21 March 2004 14:46, Richard B. Johnson wrote:
> I have been using 2.4.24 with SMP and hyperthreading with no
> problems. FYI, the reference to Windows is useless, because
> M$ was unable to make any SMP stuff function without crashes
> since windows 2000/professional, later Windows versions don't
> use your additional CPUs at all, they just report that they
> exist. FYI, see if you can find your CPU resources at all
> in XP!!! They just don't want you to know!
Er, sorry old man, but WindozeXPPro certainly does use the extra processors
with HT. I'm not talking about Task Manager (which of course shows four
processors) but the multi-threaded secure gateway application I'm developing
confirms the multiple (and virtual) processors.
In fact, Windoze2kPro has a different threading kernel to WindozeXPPro, which
is why you get four procs in XPPro and only two in 2kPro. Anyway, this isn't
the thread or forum for this topic. I don't use Doze for anything other than
compatibility testing so it's a (fairly) moot point. I'm only interested in
improving Linux (since I develop on and for it) and if this investigation
helps, then marvellous.
Cheers
R
Richard Browning said:
> On Saturday 20 March 2004 14:33, Richard Browning wrote:
> > Is there anyone in kernelland who is tackling this? I'm currently in the
> > throes of recompiling everything with -march=pentium3 -O2 to see if these
> > simple flags make a difference (as I reiterate, Windoze XP works without
> > problem). I refuse to believe that I will have to use XP in order to get my
> > money's worth. I've always thought anything Doze can do, GNU/Linux does
> > better!
> It's been pointed out to me that this Doze-vs-Linux comparison won't
> help. If you're offended by it, please don't be. I've been a supporter
> and contributor to the GNU cause for the last six years and I'm not going
> to stop. However this is the first serious issue that I've come across
> re: the kernel and incompatible hardware and I would like to see it
> fixed. Like I've said, I want to help pinpoint the problem - but no-one's
> asking.
In my experience, when Linux crashes and Windows works fine, it is flaky
hardware. It has variously been overclocking, bad fans (CPU overheating), bad
RAM. You have to rule all that out first. Might need a BIOS update...
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
On Sunday 21 March 2004 03:33, Horst von Brand wrote:
> In my experience, when Linux crashes and Windows works fine, it is flaky
> hardware. It has variously been overclocking, bad fans (CPU overheating), bad
> RAM. You have to rule all that out first. Might need a BIOS update...
Done all that. Changed motherboards, changed CPUs, changed RAM. Haven't
changed graphics card, though. I don't know where else to get a Radeon9800Pro
from on loan! I don't overclock either. I'm confident it's not the hardware.
R
On Sunday 21 March 2004 22:32, Richard Browning wrote:
> On Sunday 21 March 2004 03:33, Horst von Brand wrote:
> > In my experience, when Linux crashes and Windows works fine, it is flaky
> > hardware. It has variously been overclocking, bad fans (CPU overheating),
> > bad RAM. You have to rule all that out first. Might need a BIOS update...
>
> Done all that. Changed motherboards, changed CPUs, changed RAM. Haven't
> changed graphics card, though. I don't know where else to get a
> Radeon9800Pro from on loan! I don't overclock either. I'm confident it's
> not the hardware.
But it must be SOMETHING, right?
Kernel compile is too broad. Can you try to narrow it down?
Does burnCPU trigger it? Several burnCPUs?
dd if=/dev/hda of=/dev/null?
dd if=/dev/zero of=file bs=1M count=128?
network flood? (/me uses netcat and UDP)
Combination of above?
Then, you will be able to call for testing.
You can post your kernel version and .config
and ask folks who have identical hardware to try
to duplicate.
--
vda
On Sunday 21 March 2004 23:33, Denis Vlasenko wrote:
> But it must be SOMETHING, right?
I haven't been completely idle :)
> Kernel compile is too broad. Can you try to narrow it down?
> Does burnCPU trigger it? Several burnCPUs?
> dd if=/dev/hda of=/dev/null?
> dd if=/dev/zero of=file bs=1M count=128?
> network flood? (/me uses netcat and UDP)
> Combination of above?
As I suggested in the original post, the problem can be triggered simply by
executing ./configure - the kernel corrupts when gcc does its thing. I can
boot into KDE, run Enemy Territory, execute a Java compile, and so on. But
the thing absolutely and most definitely able to upset the cart is to execute
gcc.
Oh, I've even recompiled libgcc etc using a variety of optimisation flags
(Gentoo is my distro), from the sooper-over-the-top P4 flags down to simple
-O2 -march=pentium3. With no effect.
> Then, you will be able to call for testing.
Because I have exhausted my own meagre talents in the search for the cause (eg
swapping hardware, altering config parameters, using different hard drives,
etc) I felt the time was right to 'call in the experts'.
> You can post your kernel version and .config
I've done that, too, oddly enough. The config, cpuinfo, pci, the works. It's
all there in the original thread.
> and ask folks who have identical hardware to try
> to duplicate.
The nearest I've got is someone who has the same mobo but with different CPUs,
no AGP graphics card and no SATA.
Like I said, as a software engineer of some 20 years (heavens! I had my first
game published when I was 14, lovingly handcrafted in 65c102 assembler), I am
aware of the steps required to pinpoint an issue. The penultimate one - the
last, of course, is to give up - is to enlist the help of others who know
more. That is what I've done.
R
On Mon, 22 Mar 2004, Richard Browning wrote:
> As I suggested in the original post, the problem can be triggered simply by
> executing ./configure - the kernel corrupts when gcc does its thing. I can
> boot into KDE, run Enemy Territory, execute a Java compile, and so on. But
> the thing absolutely and most definitely able to upset the cart is to execute
> gcc.
Aha! A single thread? A specific asm insn? Is gcc --version enough to kill
the machine? Or on smth like
int main(void)
{return 0;}
gcc -E; gcc -S; gcc -c; ld? Step-by-step with gdb (hopefully, gdb doesn't
have this insn...). NMI watchdog?
Guennadi
---
Guennadi Liakhovetski
On Monday 22 March 2004 07:25, Guennadi Liakhovetski wrote:
> Aha! A single thread? A specific asm insn? Is gcc --version enough to kill
> the machine? Or on smth like
> int main(void)
> {return 0;}
> gcc -E; gcc -S; gcc -c; ld? Step-by-step with gdb (hopefully, gdb doesn't
> have this insn...). NMI watchdog?
gcc --version is fine. A compile will cause the problem.
It should be noted that whilst a ./configure cycle is *guaranteed* to initiate
the MCE, the MCE can occur on other (seemingly random) occasions.
This of course smacks of hardware failure, but not only are the components new
I have also swapped them all (excluding graphics card.) And, don't forget,
simple dual 'SMP' mode works fine too.
R
On Wed, 24 Mar 2004, Richard Browning wrote:
> On Monday 22 March 2004 07:25, Guennadi Liakhovetski wrote:
> > Aha! A single thread? A specific asm insn? Is gcc --version enough to kill
> > the machine? Or on smth like
> > int main(void)
> > {return 0;}
> > gcc -E; gcc -S; gcc -c; ld? Step-by-step with gdb (hopefully, gdb doesn't
> > have this insn...). NMI watchdog?
>
> gcc --version is fine. A compile will cause the problem.
>
> It should be noted that whilst a ./configure cycle is *guaranteed* to initiate
> the MCE, the MCE can occur on other (seemingly random) occasions.
Sorry, I should have been more explicit. I had 0 time (writing before
going to work, like now). So, what I really wanted to ask is the
following:
1) does gcc (compile) always cause an MCE?
2) If yes, can you try to narrow it down by
a) running separately different gcc stages on a trivial C-program,
like the one above (preprocessor with gcc -E, produce assembly
output with gcc -S, compile without linking with gcc -c, link with
ld)
b) once you've figured out which stage causes it, try to further
narrow it down by running this stage under gdb. Since you
re-compiled gcc yourself, you should have the sources at hand, and
should be able set up a couple of break-points to narrow it down
gradually, eventually coming to a single assembly instruction - if
my suspicion is right.
> This of course smacks of hardware failure, but not only are the components new
> I have also swapped them all (excluding graphics card.) And, don't forget,
> simple dual 'SMP' mode works fine too.
I know. You can compile with the Java compiler, but not with gcc. Can you
explain, why? I cannot.
Thanks
Guennadi
---
Guennadi Liakhovetski
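The stage-by-stage test in 2a) can be scripted so that the last line written to disk shows which stage was running when the machine went down. A sketch under a few assumptions: gcc is on the PATH, the file names and the journal file stages.log are arbitrary, and the final link goes through the gcc driver rather than a bare ld invocation (which would need the C runtime objects spelled out by hand):

```python
# Sketch: run each gcc stage on the trivial program separately and force a
# journal line to disk before and after each one. All names are hypothetical.
import os
import subprocess

SOURCE = "int main(void)\n{return 0;}\n"


def stage_commands(stem="trivial"):
    # One command per stage: preprocess, compile to assembly, assemble, link.
    return [
        ("preprocess", ["gcc", "-E", stem + ".c", "-o", stem + ".i"]),
        ("compile", ["gcc", "-S", stem + ".i", "-o", stem + ".s"]),
        ("assemble", ["gcc", "-c", stem + ".s", "-o", stem + ".o"]),
        ("link", ["gcc", stem + ".o", "-o", stem]),
    ]


def mark(journal, text):
    # flush + fsync so the line has the best chance of surviving a hard freeze
    journal.write(text + "\n")
    journal.flush()
    os.fsync(journal.fileno())


def run_stages(stem="trivial", journal_path="stages.log"):
    with open(stem + ".c", "w") as src:
        src.write(SOURCE)
    with open(journal_path, "a") as journal:
        for name, cmd in stage_commands(stem):
            mark(journal, "starting " + name + ": " + " ".join(cmd))
            result = subprocess.run(cmd)
            mark(journal, "finished " + name + " rc=" + str(result.returncode))
```

Calling run_stages() and, after a freeze and reboot, reading stages.log would show a "starting" line with no matching "finished" line for the stage that was in flight.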
same as you - dual 2.8G Xeon - though my crash seems to be happening
both with and without HT :(
sven
On Fri, 2004-03-26 at 00:04, Richard Browning wrote:
> > just to mess you up a bit :)
> >
> > I've just gotten one of these boxen for work
> >
> > and running win2003 its nice and stable,
> >
> > EXCEPT
> >
> > when i play quake, when it locks up hard after some random amount of
> > time. Also, if I use the ATI driver setup to rotate the screen by
> > 90degrees, it resets instantly.
> >
> > so i'm wondering if there's not an issue with the radeon 9600pro and
> > this mobo..
> >
> > i'm going to whack a matrox card into it in the next few days just to see
>
> Which processors - is HT enabled?
>
> R
On Thursday 25 March 2004 22:47, Sven Dowideit wrote:
> same as you - dual 2.8G Xeon - though my crash seems to be happening
> both with and without HT :(
I'm busy with a deadline at the mo' but I'll stick in a GeForce over t'weekend
and see 'ow it goes.
R