2003-08-28 13:48:43

by Tomasz Czaus

Subject: 2.6.0-test4 and hardware reports a non fatal incident

Hello,

when my system is booting I can see such a message:

kernel: MCE: The hardware reports a non fatal, correctable incident occurred
on CPU 0.
kernel: Bank 0: e664000000000185

What does it mean? My 2.6.0-test4 kernel has the "Nick's scheduler
policy v8" patch applied.

When I boot a 2.4.x kernel I don't see this message.


Thanks,
Tomasz Czaus


2003-08-28 15:51:48

by Randy.Dunlap

Subject: Re: 2.6.0-test4 and hardware reports a non fatal incident

On Thu, 28 Aug 2003 15:48:44 +0200 Tomasz Czaus <[email protected]> wrote:

| Hello,
|
| when my system is booting I can see such a message:
|
| kernel: MCE: The hardware reports a non fatal, correctable incident occurred
| on CPU 0.
| kernel: Bank 0: e664000000000185
|
| What does it mean? My 2.6.0-test4 kernel has the "Nick's scheduler
| policy v8" patch applied.

Use "parsemce" from here:
http://www.codemonkey.org.uk/projects/parsemce/
to decode it.

| When I boot a 2.4.x kernel I don't see this message.

So 2.6 has more/better/different processor error checking.

--
~Randy

2003-08-28 18:50:12

by Matt Gibson

Subject: Re: 2.6.0-test4 and hardware reports a non fatal incident

On Thursday 28 Aug 2003 16:46, Randy.Dunlap wrote:
> On Thu, 28 Aug 2003 15:48:44 +0200 Tomasz Czaus <[email protected]> wrote:
> | Hello,
> |
> | when my system is booting I can see such a message:
> |
> | kernel: MCE: The hardware reports a non fatal, correctable incident
> | occurred on CPU 0.
> | kernel: Bank 0: e664000000000185

Yeah, I get one of those on boot, too. Or at least I did. I was going to
turn the processor checking stuff back on to see if it happened
consistently. What processor is it, Tomasz? Mine's an Athlon. Output of
"cat /proc/cpuinfo" at the end, if anyone's remotely interested...

> Use "parsemce" from here:
> http://www.codemonkey.org.uk/projects/parsemce/
> to decode it.
>
> So 2.6 has more/better/different processor error checking.

Thanks for the link, Randy, I'll give it a go tonight. Although with my
knowledge of current processor architecture, I'm guessing it'll parse it
from one format I don't have a clue about into a more verbose format I don't
have a clue about ;-)

Cheers,

M

processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 4
cpu MHz : 1195.130
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov
pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 2367.48


--
"It's the small gaps between the rain that count,
and learning how to live amongst them."
-- Jeff Noon

2003-08-28 20:10:59

by Matt Gibson

Subject: Re: 2.6.0-test4 and hardware reports a non fatal incident

On Thursday 28 Aug 2003 16:46, Randy.Dunlap wrote:
> Use "parsemce" from here:
> http://www.codemonkey.org.uk/projects/parsemce/
> to decode it.

Hi Randy,

The format seems to have changed rather a lot since that was written. All I
get is:

Aug 17 11:25:13 codewave kernel: MCE: The hardware reports a non fatal,
correctable incident occurred on CPU 0.
Aug 17 11:25:13 codewave kernel: Bank 0: dc0000000000050b

...but what parsemce seems to be expecting is:

Sample kernel output..
Sep 4 21:43:41 hamlet kernel: CPU 0: Machine Check Exception:
0000000000000004
Sep 4 21:43:41 hamlet kernel: Bank 1: f600200000000152 at 7600200000000152
Sep 4 21:43:41 hamlet kernel: Bank 2: d40040000000017a at 540040000000017a
Sep 4 21:43:41 hamlet kernel: Kernel panic: CPU context corrupt

As a result, I'm still no more enlightened. I can't quite figure out from
reading the parser what values to put where, as it seems to expect a few
more than I have. Any tips?

Ta,

Matt

--
"It's the small gaps between the rain that count,
and learning how to live amongst them."
-- Jeff Noon

2003-08-28 22:22:24

by Randy.Dunlap

Subject: Re: 2.6.0-test4 and hardware reports a non fatal incident

On Thu, 28 Aug 2003 20:02:00 +0100 Matt Gibson <[email protected]> wrote:

| On Thursday 28 Aug 2003 16:46, Randy.Dunlap wrote:
| > Use "parsemce" from here:
| > http://www.codemonkey.org.uk/projects/parsemce/
| > to decode it.
|
| Hi Randy,

| I'm guessing it'll parse it
| from one format I don't have a clue about into a more verbose format I don't
| have a clue about ;-)

That was insightful. :(

| The format seems to have changed rather a lot since that was written. All I
| get is:
|
| Aug 17 11:25:13 codewave kernel: MCE: The hardware reports a non fatal,
| correctable incident occurred on CPU 0.
| Aug 17 11:25:13 codewave kernel: Bank 0: dc0000000000050b
|
| ...but what parsemce seems to be expecting is:
|
| Sample kernel output..
| Sep 4 21:43:41 hamlet kernel: CPU 0: Machine Check Exception:
| 0000000000000004
| Sep 4 21:43:41 hamlet kernel: Bank 1: f600200000000152 at 7600200000000152
| Sep 4 21:43:41 hamlet kernel: Bank 2: d40040000000017a at 540040000000017a
| Sep 4 21:43:41 hamlet kernel: Kernel panic: CPU context corrupt
|
| As a result, I'm still no more enlightened. I can't quite figure out from
| reading the parser what values to put where, as it seems to expect a few
| more than I have. Any tips?

Yes, the kernel has decided that your processor only has 1 Bank of
MCE register data to report. I don't know how/why. Sorry.

--
~Randy

2003-08-30 11:21:26

by Matt Gibson

Subject: Re: 2.6.0-test4 and hardware reports a non fatal incident

On Thursday 28 Aug 2003 23:17, Randy.Dunlap wrote:
> Yes, the kernel has decided that your processor only has 1 Bank of
> MCE register data to report. I don't know how/why. Sorry.

Could it be something to do with this (in arch/i386/kernel/cpu/mcheck/k7.c)?

    if (l & (1<<8)) /* Control register present ? */
        wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
    nr_mce_banks = l & 0xff;

    for (i=1; i<nr_mce_banks; i++) {

Check out the "for". Or am I reading this wrong?

M

--
"It's the small gaps between the rain that count,
and learning how to live amongst them."
-- Jeff Noon

2003-08-30 13:11:59

by Dave Jones

Subject: Re: 2.6.0-test4 and hardware reports a non fatal incident

On Thu, Aug 28, 2003 at 03:17:08PM -0700, Randy.Dunlap wrote:
> | As a result, I'm still no more enlightened. I can't quite figure out from
> | reading the parser what values to put where, as it seems to expect a few
> | more than I have. Any tips?
>
> Yes, the kernel has decided that your processor only has 1 Bank of
> MCE register data to report. I don't know how/why. Sorry.

The non-fatal checker dumps only the single bank that is reporting failures.
parsemce should still have enough info there to decode it into something
useful, however (just use 0 for the address).
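
For readers who want to see what such a "Bank 0:" value contains, here is a minimal sketch that splits a 64-bit machine-check status word into its flag bits and error codes. It is not parsemce's code. It assumes the architectural status-register layout documented in Intel's system architecture manuals (bit 63 valid, 62 overflow, 61 uncorrected, 60 enabled, 59 misc valid, 58 address valid, 57 processor context corrupt, bits 31:16 model-specific code, bits 15:0 MCA error code); AMD parts may differ in detail. The struct and function names are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Decoded view of one status word. Field names are this sketch's own. */
struct mce_status {
    bool valid;           /* bit 63: register holds a logged error */
    bool overflow;        /* bit 62: an earlier error was overwritten */
    bool uncorrected;     /* bit 61: hardware did not correct the error */
    bool enabled;         /* bit 60: reporting was enabled for this error */
    bool misc_valid;      /* bit 59: the MISC register holds extra data */
    bool addr_valid;      /* bit 58: the ADDR register holds an address */
    bool context_corrupt; /* bit 57: processor context may be corrupt */
    uint16_t model_code;  /* bits 31:16: model-specific error code */
    uint16_t mca_code;    /* bits 15:0: MCA error code */
};

/* Test a single bit of the 64-bit status word. */
static bool status_bit(uint64_t status, unsigned int bit)
{
    return ((status >> bit) & 1u) != 0;
}

/* Split the raw value printed after "Bank N:" into its fields. */
struct mce_status decode_mce_status(uint64_t status)
{
    struct mce_status s;

    s.valid = status_bit(status, 63);
    s.overflow = status_bit(status, 62);
    s.uncorrected = status_bit(status, 61);
    s.enabled = status_bit(status, 60);
    s.misc_valid = status_bit(status, 59);
    s.addr_valid = status_bit(status, 58);
    s.context_corrupt = status_bit(status, 57);
    s.model_code = (uint16_t)((status >> 16) & 0xffff);
    s.mca_code = (uint16_t)(status & 0xffff);
    return s;
}

/* Print the decoded fields, one per line. */
void print_mce_status(const struct mce_status *s)
{
    assert(s != NULL);
    printf("valid=%d overflow=%d uncorrected=%d enabled=%d\n",
           s->valid, s->overflow, s->uncorrected, s->enabled);
    printf("misc_valid=%d addr_valid=%d context_corrupt=%d\n",
           s->misc_valid, s->addr_valid, s->context_corrupt);
    printf("model_code=0x%04x mca_code=0x%04x\n",
           (unsigned int)s->model_code, (unsigned int)s->mca_code);
}
```

Passing 0xe664000000000185ULL or 0xdc0000000000050bULL (the two values reported in this thread) to decode_mce_status() and then to print_mce_status() shows the individual fields. What the error codes mean still has to be looked up in the processor documentation.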

Dave

--
Dave Jones http://www.codemonkey.org.uk

2003-08-30 13:20:34

by Matt Gibson

Subject: Re: 2.6.0-test4 and hardware reports a non fatal incident

On Saturday 30 Aug 2003 11:49, Matt Gibson wrote:
> On Thursday 28 Aug 2003 23:17, Randy.Dunlap wrote:
> > Yes, the kernel has decided that your processor only has 1 Bank of
> > MCE register data to report. I don't know how/why. Sorry.
>
> Could it be something to do with this (in
> arch/i386/kernel/cpu/mcheck/k7.c)?
>
>     if (l & (1<<8)) /* Control register present ? */
>         wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
>     nr_mce_banks = l & 0xff;
>
>     for (i=1; i<nr_mce_banks; i++) {
>
> Check out the "for". Or am I reading this wrong?

Having checked back, this was changed between test-2 and test-3. The
checking code in k7_machine_check() still loops from 0 rather than 1. I
think this may be leading to false reporting of problems, which may be why I
and Tomasz are seeing these MCE messages on our Athlons.

Anyone who knows more about this stuff care to comment? Is someone looking
after MCE at the moment? I couldn't find out much info on it.

Thanks,

Matt

--
"It's the small gaps between the rain that count,
and learning how to live amongst them."
-- Jeff Noon

2003-08-30 13:36:13

by Dave Jones

Subject: Re: 2.6.0-test4 and hardware reports a non fatal incident

On Sat, Aug 30, 2003 at 01:44:56PM +0100, Matt Gibson wrote:
> > for (i=1; i<nr_mce_banks; i++) {
> >
> > Check out the "for". Or am I reading this wrong?
>
> Having checked back, this was changed between test-2 and test-3. The
> checking code in k7_machine_check() still loops from 0 rather than 1. I
> think this may be leading to false reporting of problems, which may be why I
> and Tomasz are seeing these MCE messages on our Athlons.

When it was i=0 people were seeing false positives. Starting from 1
reduces that.
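
To make the effect of the loop's start index concrete, here is a small user-space sketch. It is not the kernel code. It only mirrors the two masks from the k7.c excerpt quoted in this thread: the low byte of the capability value is the bank count, and bit 8 says whether the control register is present. The function names and the example value 0x104 are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of banks: low 8 bits, as in "nr_mce_banks = l & 0xff". */
unsigned int mcg_bank_count(uint32_t mcg_cap_lo)
{
    return mcg_cap_lo & 0xff;
}

/* Control register present: bit 8, as in "l & (1<<8)". */
bool mcg_ctl_present(uint32_t mcg_cap_lo)
{
    return (mcg_cap_lo & (1u << 8)) != 0;
}

/*
 * Record which bank indices a loop of the form
 *     for (i = first_bank; i < nr_mce_banks; i++)
 * would visit. Writes at most `capacity` indices into `banks`
 * and returns how many were written.
 */
size_t banks_visited(uint32_t mcg_cap_lo, unsigned int first_bank,
                     unsigned int *banks, size_t capacity)
{
    unsigned int nr_banks = mcg_bank_count(mcg_cap_lo);
    size_t n = 0;

    assert(banks != NULL);
    for (unsigned int i = first_bank; i < nr_banks && n < capacity; i++)
        banks[n++] = i;
    return n;
}
```

With the hypothetical value 0x104 (four banks, control register present), starting at 1 visits banks 1, 2 and 3, while starting at 0 also visits bank 0, the bank named in the messages reported earlier in the thread.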

> Anyone who knows more about this stuff care to comment? Is someone looking
> after MCE at the moment? I couldn't find out much info on it.

In the past, Alan and I took care of i386; Andi Kleen did AMD64.

Dave

--
Dave Jones http://www.codemonkey.org.uk

2003-08-30 13:52:43

by Dave Jones

Subject: Re: 2.6.0-test4 and hardware reports a non fatal incident

On Sat, Aug 30, 2003 at 02:48:30PM +0100, Matt Gibson wrote:
> On Saturday 30 Aug 2003 14:35, you wrote:
> > When it was i=0 people were seeing false positives. Starting from 1
> > reduces that.
> Cool. Can you point me towards any background reading on MCE? This has got me
> interested.

Not sure if any of the public AMD docs have info on the MCE registers,
but the stuff in the Intel system architecture manuals on
developer.intel.com is largely relevant.

> Rather ironically, since I changed my kernel back to starting from 0, I
> haven't seen any errors.

Coincidence. Seeing fewer errors after enabling more error checking doesn't
really make sense.

Dave

--
Dave Jones http://www.codemonkey.org.uk

2003-08-30 13:49:20

by Matt Gibson

Subject: Re: 2.6.0-test4 and hardware reports a non fatal incident

On Saturday 30 Aug 2003 14:35, you wrote:
> When it was i=0 people were seeing false positives. Starting from 1
> reduces that.

Cool. Can you point me towards any background reading on MCE? This has got me
interested.

Rather ironically, since I changed my kernel back to starting from 0, I
haven't seen any errors. Having said that, I was only getting a couple each
day anyway, so I'll leave it a few days and see what develops. I think it's
happening only once on boot, every now and again, but I've not had time to
analyse the logs properly yet. Maybe there's a problem when my machine's
cold...

> In the past, Alan and I took care of i386; Andi Kleen did AMD64.

Thanks for responding; it was fairly clear to me that I was out of my depth,
and it's nice to know that there's someone out there that isn't *grin*

Cheers,

Matt

--
"It's the small gaps between the rain that count,
and learning how to live amongst them."
-- Jeff Noon