2004-01-18 01:45:40

by Niel Lambrechts

Subject: [2.6.1 MCE falseness?] Hardware reports non-fatal error


I get the following problem with 2.6.1 consistently after apm resuming:

"ksyrium kernel: MCE: The hardware reports a non fatal, correctable
incident occurred on CPU 0.

Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
ksyrium kernel: Bank 1: f2000000000001c5"

It does not happen on any other kernels I use (vanilla 2.4.24, SuSE 9
2.4.21-166) - even though CONFIG_X86_MCE=y for both. The equipment is
brand-new - an IBM Thinkpad R50P - and it passes all of IBM's software
diagnostics.

I'd appreciate help with the parameters for parsemce to interpret the
problem...not sure if my usage is correct? ;)

# ./parsemce -b 1 -a 0 -e f2000000000001c5
Status: (f2000000000001c5) Machine Check in progress.
Restart IP valid.

Is this really hardware (maybe a bug in the BIOS?) or are false
positives possible with 2.6 MCE code?

-Niel
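
The 64-bit value printed after "Bank 1:" appears to be the raw machine-check status register for that bank, so its flag bits can also be read off by hand. Below is a minimal sketch in C, assuming the architectural IA-32 MCi_STATUS layout (VAL in bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, model-specific code in bits 31:16, MCA error code in bits 15:0). The struct and function names are made up for illustration, and this is not how parsemce itself is implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the architectural fields of an MCi_STATUS value.
 * The bit positions are an assumption based on the IA-32 MCA layout. */
struct mci_status_fields {
    bool valid;            /* bit 63: VAL, register holds a logged error */
    bool overflow;         /* bit 62: OVER, an earlier error was overwritten */
    bool uncorrected;      /* bit 61: UC */
    bool enabled;          /* bit 60: EN, reporting was enabled for this error */
    bool misc_valid;       /* bit 59: MISCV */
    bool addr_valid;       /* bit 58: ADDRV */
    bool context_corrupt;  /* bit 57: PCC */
    uint16_t model_code;   /* bits 31:16: model-specific error code */
    uint16_t mca_code;     /* bits 15:0: MCA error code */
};

/* Split a raw 64-bit status value into the fields above. */
static struct mci_status_fields decode_mci_status(uint64_t status)
{
    struct mci_status_fields f;

    f.valid           = (status >> 63) & 1;
    f.overflow        = (status >> 62) & 1;
    f.uncorrected     = (status >> 61) & 1;
    f.enabled         = (status >> 60) & 1;
    f.misc_valid      = (status >> 59) & 1;
    f.addr_valid      = (status >> 58) & 1;
    f.context_corrupt = (status >> 57) & 1;
    f.model_code      = (uint16_t)((status >> 16) & 0xffff);
    f.mca_code        = (uint16_t)(status & 0xffff);
    return f;
}
```

Under that assumed layout, decode_mci_status(0xf2000000000001c5ULL) reports VAL, OVER, UC, EN and PCC set, MISCV and ADDRV clear, and an MCA error code of 0x01c5.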





2004-01-18 02:04:35

by Dave Jones

Subject: Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error

On Sun, Jan 18, 2004 at 03:44:16AM +0200, Niel Lambrechts wrote:

> I get the following problem with 2.6.1 consistently after apm resuming:
> "ksyrium kernel: MCE: The hardware reports a non fatal, correctable
> incident occurred on CPU 0.
>
> Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
> ksyrium kernel: Bank 1: f2000000000001c5"

As it only happens when you resume from APM, I'm inclined to believe
it's a BIOS bug. With the output of dmidecode, we could blacklist this
box to not do the nonfatal checking.

> It does not happen on any other kernels I use (vanilla 2.4.24, SuSE 9
> 2.4.21-166) - even though CONFIG_X86_MCE=y for both. The equipment is
> brand-new - an IBM Thinkpad R50P - and it passes all of IBM's software
> diagnostics.

None of the other kernels you mention have this; it's a new feature of 2.6.

Dave

2004-01-18 13:28:31

by Pedro Larroy

Subject: Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error

I have also been getting apparently false MCEs since 2.5.xx.
I even had kernel panics in early 2.5 with MCE enabled. Now, in 2.6.0-xx
and 2.6.1, I just get them from time to time, none of them fatal,
and most of the time on CPU 0.

request_module: failed /sbin/modprobe -- char-major-6-0. error = 256
MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: e606200000000833
request_module: failed /sbin/modprobe --

The box is a dual Athlon MP with the AMD 760MP chipset.


nebula:/home/piotr# ./parsemce b 1 -a 0 -e e606200000000833
Status: (e606200000000833) Error IP valid
Restart IP valid.
nebula:/home/piotr#




--
Pedro Larroy Tovar | piotr%member.fsf.org

Software patents are a threat to innovation in Europe please check:
http://www.eurolinux.org/

2004-01-18 14:24:24

by Geoffrey Lee

Subject: Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error

On Sun, Jan 18, 2004 at 02:30:48PM +0100, Pedro Larroy wrote:
> I have also been getting apparently false MCEs since 2.5.xx.
> I even had kernel panics in early 2.5 with MCE enabled. Now, in 2.6.0-xx
> and 2.6.1, I just get them from time to time, none of them fatal,
> and most of the time on CPU 0.
>


I get them too, so I applied this patch.


- g.


Attachments:
(No filename) (347.00 B)
mce-fix.patch (1.04 kB)

2004-01-18 20:06:18

by Dave Jones

Subject: Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error

On Sun, Jan 18, 2004 at 10:23:38PM +0800, Geoffrey Lee wrote:
>
> I get them too, so I applied this patch.

gah, that still didn't get applied?

Dave

--
Dave Jones http://www.codemonkey.org.uk

2004-01-20 19:43:56

by Niel Lambrechts

Subject: RE: [2.6.1 MCE falseness?] Hardware reports non-fatal error

I tried the mentioned patch, with a modification for my CPU type, but
still get the problem:

"Jan 20 21:30:23 ksyrium kernel: MCE: The hardware reports a non fatal,
correctable incident occurred on CPU 0.
Jan 20 21:30:23 ksyrium kernel: MCE: startbank = 1, vendor : 0, x86 = 6,
model = 9, mask = 5.
Jan 20 21:30:23 ksyrium kernel: Bank 1: f200000000000185"

As you can see, I added a little extra debugging info. Here is the
relevant portion of the code:

	if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
	     boot_cpu_data.x86 == 6) ||
	    (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
	     boot_cpu_data.x86 == 6 &&
	     boot_cpu_data.x86_model == 9 &&
	     boot_cpu_data.x86_mask == 5))
		startbank = 1;

Comments would be appreciated.

-Niel
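
The condition quoted above can be tried outside the kernel against the values from the debug line. This is a rough userspace sketch in C, not the patch itself (which is not shown in this thread). The enum values and both helper functions are stand-ins invented for the example, and the assumption that startbank is the first bank the polling loop reads is a guess from the variable name.

```c
/* Stand-ins for the kernel's X86_VENDOR_* constants. The value 0 for Intel
 * matches the "vendor : 0" debug line; the AMD value is an assumption. */
enum sketch_vendor {
    SKETCH_VENDOR_INTEL = 0,
    SKETCH_VENDOR_AMD   = 2
};

/* Mirror of the quoted condition: start at bank 1 for any family 6 AMD CPU,
 * or for the one Intel family/model/stepping added by hand; otherwise 0. */
static int pick_start_bank(int vendor, int family, int model, int stepping)
{
    if ((vendor == SKETCH_VENDOR_AMD && family == 6) ||
        (vendor == SKETCH_VENDOR_INTEL && family == 6 &&
         model == 9 && stepping == 5))
        return 1;
    return 0;
}

/* Assumed polling rule: banks startbank .. nbanks-1 are checked. */
static int bank_is_polled(int bank, int startbank, int nbanks)
{
    return bank >= startbank && bank < nbanks;
}
```

With the logged values (vendor 0, family 6, model 9, stepping 5), pick_start_bank() returns 1, which agrees with "startbank = 1" in the debug output. If the assumed polling rule holds, that skips only Bank 0, so a report for Bank 1 would still be printed.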



2004-01-21 12:17:07

by Stephen Rothwell

Subject: Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error

On Sun, 18 Jan 2004 02:03:01 +0000 Dave Jones wrote:
>
> On Sun, Jan 18, 2004 at 03:44:16AM +0200, Niel Lambrechts wrote:
>
> > I get the following problem with 2.6.1 consistently after apm resuming:
> > "ksyrium kernel: MCE: The hardware reports a non fatal, correctable
> > incident occurred on CPU 0.
> >
> > Message from syslogd@ksyrium at Wed Jan 14 13:33:06 2004 ...
> > ksyrium kernel: Bank 1: f2000000000001c5"
>
> As it only happens when you resume from APM, I'm inclined to believe
> it's a BIOS bug. With the output of dmidecode, we could blacklist this
> box to not do the nonfatal checking.

My Thinkpad T22 produces a similar warning on resume using APM:

kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
kernel: Bank 1: f200000000000104

dmidecode output starts with:

# dmidecode 2.3
SMBIOS 2.3 present.
46 structures occupying 1585 bytes.
Table at 0x1FFF0000.
Handle 0x0000
DMI type 0, 20 bytes.
BIOS Information
Vendor: IBM
Version: 16ET31WW (1.11 )
Release Date: 03/20/2003
.
.
Handle 0x0001
DMI type 1, 25 bytes.
System Information
Manufacturer: IBM
Product Name: 26475EA

--
Cheers,
Stephen Rothwell
http://www.canb.auug.org.au/~sfr/
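
The blacklisting idea mentioned earlier in the thread amounts to matching the DMI vendor and product strings against a table and skipping the nonfatal check on a hit. The sketch below is a userspace illustration in C only. It does not use the kernel's DMI scanning interface, the struct, table and function names are hypothetical, and the single entry is taken from the dmidecode output above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical identification record: manufacturer and product name
 * as reported in the DMI "System Information" structure. */
struct dmi_ident {
    const char *sys_vendor;
    const char *product_name;
};

/* Machines on which the nonfatal check would be skipped (illustrative). */
static const struct dmi_ident nonfatal_mce_skip_list[] = {
    { "IBM", "26475EA" },  /* the ThinkPad T22 reported in this thread */
};

/* Return true if both strings contain the corresponding table entry.
 * Substring matching is a design choice here so that trailing padding
 * in firmware strings does not defeat the comparison. */
static bool skip_nonfatal_mce(const char *sys_vendor, const char *product_name)
{
    size_t i;
    size_t n = sizeof(nonfatal_mce_skip_list) / sizeof(nonfatal_mce_skip_list[0]);

    if (sys_vendor == NULL || product_name == NULL)
        return false;

    for (i = 0; i < n; i++) {
        const struct dmi_ident *e = &nonfatal_mce_skip_list[i];

        if (strstr(sys_vendor, e->sys_vendor) != NULL &&
            strstr(product_name, e->product_name) != NULL)
            return true;
    }
    return false;
}
```

A caller would run skip_nonfatal_mce() once at startup with the strings read from DMI and leave the periodic check unregistered when it returns true. Matching on the product name alone may be too narrow, since one laptop model can ship under several product codes.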


Attachments:
(No filename) (1.34 kB)
(No filename) (189.00 B)