I get the following problem with 2.6.1 consistently after apm resuming:
"ksyrium kernel: MCE: The hardware reports a non fatal, correctable
incident occurred on CPU 0.
Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
ksyrium kernel: Bank 1: f2000000000001c5"
It does not happen on any other kernels I use (vanilla 2.4.24, SuSE 9
2.4.21-166) - even though CONFIG_X86_MCE=y for both. The equipment is
brand-new - an IBM Thinkpad R50P - and it passes all of IBM's software
diagnostics.
I'd appreciate help with the parameters for parsemce to interpret the
problem; I'm not sure my usage is correct. ;)
# ./parsemce -b 1 -a 0 -e f2000000000001c5
Status: (f2000000000001c5) Machine Check in progress.
Restart IP valid.
Is this really a hardware problem (or maybe a bug in the BIOS?), or are
false positives possible with the 2.6 MCE code?
-Niel
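For reference, the bank value can also be split into fields by hand. This is a minimal sketch, assuming the value printed after "Bank 1:" is that bank's status register and that it follows the architectural MCi_STATUS layout from Intel's manuals (VAL in bit 63 down to PCC in bit 57, model-specific code in bits 31:16, MCA error code in bits 15:0). The struct and function names are made up for illustration and are not taken from parsemce or the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical decoded view of a machine-check bank status word. */
struct mci_status {
    int val;    /* bit 63: register contains a valid error */
    int over;   /* bit 62: an earlier error was overwritten */
    int uc;     /* bit 61: error was not corrected */
    int en;     /* bit 60: reporting was enabled for this error */
    int miscv;  /* bit 59: MISC register holds extra information */
    int addrv;  /* bit 58: ADDR register holds an address */
    int pcc;    /* bit 57: processor context may be corrupt */
    unsigned model_code; /* bits 31:16: model-specific error code */
    unsigned mca_code;   /* bits 15:0: MCA error code */
};

/* Extract a single bit from a 64-bit value. */
int status_bit(uint64_t v, unsigned n)
{
    return (int)((v >> n) & 1u);
}

/* Split a raw status word into the fields above. */
void decode_mci_status(uint64_t s, struct mci_status *out)
{
    assert(out != NULL);
    out->val   = status_bit(s, 63);
    out->over  = status_bit(s, 62);
    out->uc    = status_bit(s, 61);
    out->en    = status_bit(s, 60);
    out->miscv = status_bit(s, 59);
    out->addrv = status_bit(s, 58);
    out->pcc   = status_bit(s, 57);
    out->model_code = (unsigned)((s >> 16) & 0xffffu);
    out->mca_code   = (unsigned)(s & 0xffffu);
}

/* Print the decoded fields on two lines. */
void print_mci_status(uint64_t s)
{
    struct mci_status st;

    decode_mci_status(s, &st);
    printf("VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n",
           st.val, st.over, st.uc, st.en, st.miscv, st.addrv, st.pcc);
    printf("model-specific code=0x%04x MCA error code=0x%04x\n",
           st.model_code, st.mca_code);
}
```

Under that assumption, calling print_mci_status(0xf2000000000001c5ULL) shows VAL, OVER, UC, EN and PCC set, MISCV and ADDRV clear, and an MCA error code of 0x01c5.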
On Sun, Jan 18, 2004 at 03:44:16AM +0200, Niel Lambrechts wrote:
> I get the following problem with 2.6.1 consistently after apm resuming:
> "ksyrium kernel: MCE: The hardware reports a non fatal, correctable
> incident occurred on CPU 0.
>
> Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
> ksyrium kernel: Bank 1: f2000000000001c5"
As it only happens when you resume from APM, I'm inclined to believe
it's a BIOS bug. With the output of dmidecode, we could blacklist this
box so that it skips the nonfatal checking.
> It does not happen on any other kernels I use (vanilla 2.4.24, SuSE 9
> 2.4.21-166) - even though CONFIG_X86_MCE=y for both. The equipment is
> brand-new - an IBM Thinkpad R50P - and it passes all of IBM's software
> diagnostics.
None of the other kernels you mention have this; it's a new feature of 2.6.
Dave
I have also been getting apparently false MCEs since 2.5.xx.
I even had kernel panics in early 2.5 with MCE enabled. Now, in 2.6.0-xx
and in 2.6.1, I just get them from time to time, but none are fatal.
Most of the time they are on CPU 0.
request_module: failed /sbin/modprobe -- char-major-6-0. error = 256
MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: e606200000000833
request_module: failed /sbin/modprobe --
The box is a dual Athlon MP with the AMD 760MP chipset.
nebula:/home/piotr# ./parsemce -b 1 -a 0 -e e606200000000833
Status: (e606200000000833) Error IP valid
Restart IP valid.
nebula:/home/piotr#
--
Pedro Larroy Tovar | piotr%member.fsf.org
Software patents are a threat to innovation in Europe please check:
http://www.eurolinux.org/
On Sun, Jan 18, 2004 at 02:30:48PM +0100, Pedro Larroy wrote:
> I have also been getting apparently false MCEs since 2.5.xx.
> I even had kernel panics in early 2.5 with MCE enabled. Now, in 2.6.0-xx
> and in 2.6.1, I just get them from time to time, but none are fatal.
> Most of the time they are on CPU 0.
>
I get them too, so I applied this patch.
- g.
On Sun, Jan 18, 2004 at 10:23:38PM +0800, [email protected] wrote:
>
> I get them too, so I applied this patch.
gah, that still didn't get applied?
Dave
--
Dave Jones http://www.codemonkey.org.uk
I tried the mentioned patch, with a modification for my CPU type, but
still get the problem:
"Jan 20 21:30:23 ksyrium kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Jan 20 21:30:23 ksyrium kernel: MCE: startbank = 1, vendor : 0, x86 = 6, model = 9, mask = 5.
Jan 20 21:30:23 ksyrium kernel: Bank 1: f200000000000185"
As you can see, I added a little extra debugging info. Here is the
relevant portion of the code:
"	if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
	     boot_cpu_data.x86 == 6) ||
	    (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
	     boot_cpu_data.x86 == 6 &&
	     boot_cpu_data.x86_model == 9 &&
	     boot_cpu_data.x86_mask == 5))
		startbank = 1;"
Comments would be appreciated.
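The condition quoted above can be restated as a small standalone function, which makes it easier to see what it does and does not skip. This is only a sketch: the struct, the constants and the function names are hypothetical stand-ins for boot_cpu_data and the kernel's X86_VENDOR_* values (vendor 0 is assumed to mean Intel, as the debug line suggests), and it assumes startbank is simply the first bank that gets polled.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's vendor constants. */
#define VENDOR_INTEL 0
#define VENDOR_AMD   2

/* Hypothetical subset of the CPU identification fields used above. */
struct cpu_id {
    int vendor;
    int family;   /* boot_cpu_data.x86 */
    int model;    /* boot_cpu_data.x86_model */
    int stepping; /* boot_cpu_data.x86_mask */
};

/* Return the first bank to poll: 1 for the listed CPUs, otherwise 0. */
int first_bank(const struct cpu_id *c)
{
    if (c->vendor == VENDOR_AMD && c->family == 6)
        return 1;
    if (c->vendor == VENDOR_INTEL && c->family == 6 &&
        c->model == 9 && c->stepping == 5)
        return 1;
    return 0;
}

/* Return 1 if the given bank is polled when nr_banks banks exist. */
int bank_is_polled(const struct cpu_id *c, int bank, int nr_banks)
{
    return bank >= first_bank(c) && bank < nr_banks;
}

/* Print which banks would be polled for a CPU with nr_banks banks. */
void show_polled_banks(const struct cpu_id *c, int nr_banks)
{
    int bank;

    for (bank = 0; bank < nr_banks; bank++)
        printf("bank %d: %s\n", bank,
               bank_is_polled(c, bank, nr_banks) ? "polled" : "skipped");
}
```

With the values from the log (vendor 0, x86 6, model 9, mask 5), first_bank returns 1, matching the printed startbank. If that reading of startbank is right, only bank 0 is skipped and bank 1 is still polled, which would be consistent with the Bank 1 report persisting.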
-Niel
On Sun, 18 Jan 2004 02:03:01 +0000 Dave Jones wrote:
>
> On Sun, Jan 18, 2004 at 03:44:16AM +0200, Niel Lambrechts wrote:
>
> > I get the following problem with 2.6.1 consistently after apm resuming:
> > "ksyrium kernel: MCE: The hardware reports a non fatal, correctable
> > incident occurred on CPU 0.
> >
> > Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
> > ksyrium kernel: Bank 1: f2000000000001c5"
>
> As it only happens when you resume from APM, I'm inclined to believe
> it's a BIOS bug. With the output of dmidecode, we could blacklist this
> box so that it skips the nonfatal checking.
My Thinkpad T22 produces a similar warning on resume using APM:
kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
kernel: Bank 1: f200000000000104
dmidecode output starts with:
# dmidecode 2.3
SMBIOS 2.3 present.
46 structures occupying 1585 bytes.
Table at 0x1FFF0000.
Handle 0x0000
DMI type 0, 20 bytes.
BIOS Information
Vendor: IBM
Version: 16ET31WW (1.11 )
Release Date: 03/20/2003
.
.
Handle 0x0001
DMI type 1, 25 bytes.
System Information
Manufacturer: IBM
Product Name: 26475EA
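
As an illustration of how a blacklist keyed on these DMI strings might work, here is a minimal user-space sketch. The table, struct and function names are hypothetical, substring matching is an assumption, and this is not the kernel's actual DMI code; the single entry uses the T22 strings shown above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical blacklist entry: substrings to look for in the DMI data. */
struct dmi_entry {
    const char *manufacturer;
    const char *product;
};

/* Machines on which the nonfatal MCE polling would be skipped. */
static const struct dmi_entry no_nonfatal_mce[] = {
    { "IBM", "26475EA" }, /* Thinkpad T22, from the dmidecode output above */
};

/* Return 1 if both DMI strings match an entry in the table, else 0. */
int dmi_blacklisted(const char *manufacturer, const char *product)
{
    size_t i;
    size_t n = sizeof(no_nonfatal_mce) / sizeof(no_nonfatal_mce[0]);

    for (i = 0; i < n; i++) {
        if (strstr(manufacturer, no_nonfatal_mce[i].manufacturer) &&
            strstr(product, no_nonfatal_mce[i].product))
            return 1;
    }
    return 0;
}
```

A caller would check dmi_blacklisted("IBM", "26475EA") once at start-up and skip setting up the nonfatal check when it returns 1.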
--
Cheers,
Stephen Rothwell
http://www.canb.auug.org.au/~sfr/