2003-09-21 14:40:10

by Luca

Subject: [PATCH] Fix Athlon MCA

Hi,
on boot I'm seeing a lot of messages like this:

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: d47fa0000000bfee

These messages go away if I revert cset 1.1119.9.1. AFAIK you were trying
to decrease the logging level. After reading IA32 Architecture Software
Developers Manual, vol3 - chapter 14.5 "Machine-Check Initialization" I
think that the right way to do it is this:

--- linux-2.6/arch/i386/kernel/cpu/mcheck/k7.c~ Sat Aug 9 06:37:27 2003
+++ linux-2.6/arch/i386/kernel/cpu/mcheck/k7.c Sun Sep 21 00:36:39 2003
@@ -81,8 +81,9 @@
 	wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
 	nr_mce_banks = l & 0xff;
 
-	for (i=1; i<nr_mce_banks; i++) {
-		wrmsr (MSR_IA32_MC0_CTL+4*i, 0xffffffff, 0xffffffff);
+	for (i=0; i<nr_mce_banks; i++) {
+		if (i)
+			wrmsr (MSR_IA32_MC0_CTL+4*i, 0xffffffff, 0xffffffff);
 		wrmsr (MSR_IA32_MC0_STATUS+4*i, 0x0, 0x0);
 	}


We really want to clean all MC*_STATUS. I'm currently running linux 2.6.0-t5
+ this patch and I don't see the MCE messages on boot anymore.

Luca
--
Reply-To: [email protected]
Home: http://kronoz.cjb.net
"Political ability is the ability to foresee what will
happen tomorrow, next week, next month and
next year. And to be able, afterwards,
to explain why it didn't happen."


2003-09-21 14:48:04

by Dave Jones

Subject: Re: [PATCH] Fix Athlon MCA

On Sun, Sep 21, 2003 at 04:39:34PM +0200, Kronos wrote:

> We really want to clean all MC*_STATUS. I'm currently running linux 2.6.0-t5
> + this patch and I don't see the MCE messages on boot anymore.

Patch looks good to me.

Dave

--
Dave Jones http://www.codemonkey.org.uk

2003-09-21 17:47:55

by Dave Jones

Subject: Re: [PATCH] Fix Athlon MCA

On Sun, Sep 21, 2003 at 10:39:26AM -0700, Linus Torvalds wrote:

> > These messages go away if I revert cset 1.1119.9.1. AFAIK you were trying
> > to decrease the logging level. After reading IA32 Architecture Software
> > Developers Manual, vol3 - chapter 14.5 "Machine-Check Initialization" I
> > think that the right way to do it is this:
>
> Why not just handle the (different) 0-based case in front of the loop:
>
> /* Clear status for MC index 0 separately, we don't touch CTL. */
> wrmsr (MSR_IA32_MC0_STATUS, 0x0, 0x0);
>
> and leave the loop 1-based.
>
> Dave, up to you..

Yeah, I prefer that way, just for the added comment outside the loop.
Expanding it to mention "some Athlons don't work with bank 0 enabled"
would be a nice finishing touch.
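
For illustration only, the combined suggestion (clear bank 0's status in
front of the loop, keep the loop 1-based, expand the comment) might look
roughly like the sketch below. It is a user-space mock, not the actual k7.c
change: fake_msr, fake_wrmsr(), fake_rdmsr(), MAX_BANKS and k7_init_banks()
are hypothetical stand-ins so the logic can run without touching real MSRs.
Only the MSR numbers (0x400 for MC0_CTL, 0x401 for MC0_STATUS, four
registers per bank) follow the architectural layout.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_MC0_CTL    0x400u
#define MSR_IA32_MC0_STATUS 0x401u
#define MAX_BANKS           8u      /* assumption: size of the mock only */

/* Hypothetical stand-in for the per-bank MSRs (CTL, STATUS, ADDR, MISC). */
uint64_t fake_msr[4 * MAX_BANKS];

/* Mock of wrmsr(msr, lo, hi): stores the 64-bit value in the array. */
void fake_wrmsr(uint32_t msr, uint32_t lo, uint32_t hi)
{
	assert(msr >= MSR_IA32_MC0_CTL);
	assert(msr < MSR_IA32_MC0_CTL + 4 * MAX_BANKS);
	fake_msr[msr - MSR_IA32_MC0_CTL] = ((uint64_t)hi << 32) | lo;
}

/* Mock of rdmsr(): returns the stored 64-bit value. */
uint64_t fake_rdmsr(uint32_t msr)
{
	assert(msr >= MSR_IA32_MC0_CTL);
	assert(msr < MSR_IA32_MC0_CTL + 4 * MAX_BANKS);
	return fake_msr[msr - MSR_IA32_MC0_CTL];
}

void k7_init_banks(unsigned int nr_mce_banks)
{
	unsigned int i;

	assert(nr_mce_banks <= MAX_BANKS);

	/*
	 * Clear status for MC index 0 separately, we don't touch CTL:
	 * some Athlons don't work with bank 0 enabled.
	 */
	fake_wrmsr(MSR_IA32_MC0_STATUS, 0x0, 0x0);

	/* Banks 1..n-1: enable all error reporting, then clear stale status. */
	for (i = 1; i < nr_mce_banks; i++) {
		fake_wrmsr(MSR_IA32_MC0_CTL + 4 * i, 0xffffffff, 0xffffffff);
		fake_wrmsr(MSR_IA32_MC0_STATUS + 4 * i, 0x0, 0x0);
	}
}
```

Seeding fake_msr[1] with the stale value from the report
(0xd47fa0000000bfee) and calling k7_init_banks(4) leaves bank 0's STATUS at
zero and its CTL untouched, while banks 1 to 3 get CTL set to all ones.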

Dave

--
Dave Jones http://www.codemonkey.org.uk

2003-09-21 17:39:48

by Linus Torvalds

Subject: Re: [PATCH] Fix Athlon MCA


On Sun, 21 Sep 2003, Kronos wrote:
>
> These messages go away if I revert cset 1.1119.9.1. AFAIK you were trying
> to decrease the logging level. After reading IA32 Architecture Software
> Developers Manual, vol3 - chapter 14.5 "Machine-Check Initialization" I
> think that the right way to do it is this:

Why not just handle the (different) 0-based case in front of the loop:

/* Clear status for MC index 0 separately, we don't touch CTL. */
wrmsr (MSR_IA32_MC0_STATUS, 0x0, 0x0);

and leave the loop 1-based.

Dave, up to you..

Linus

2003-09-22 13:33:17

by Luigi Genoni

Subject: Re: [PATCH] Fix Athlon MCA


I'm seeing this message too on all my Athlons, although it seems harmless.
I'll try this patch tomorrow.

Best,

Luigi


2003-09-22 14:21:46

by CaT

Subject: Re: [PATCH] Fix Athlon MCA

On Sun, Sep 21, 2003 at 06:47:31PM +0100, Dave Jones wrote:
> yeah, I prefer that way just for the added comment outside the loop.
> expanding it to mention "some athlons don't work with bank 0 enabled"

Would this MCE message be of the same flavour as the one this
thread is about?

Message from syslogd@lexx at Mon Sep 22 21:38:01 2003 ...
lexx kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Message from syslogd@lexx at Mon Sep 22 21:38:01 2003 ...
lexx kernel: Bank 2: 940040000000017a

I don't have my stick of RAM plugged into the first RAM slot but rather
the 3rd of 4. I guess this corresponds to bank 2 above. I've been ignoring
them up until now, but is this a Linux hassle or a h/w one?

--
And so the stripper looks down and asks 'Can you breathe?'
- from a friend's bucks night

2003-09-22 14:44:19

by Dave Jones

Subject: Re: [PATCH] Fix Athlon MCA

On Tue, Sep 23, 2003 at 12:20:23AM +1000, CaT wrote:

> I don't have my stick of RAM plugged into the first RAM slot but rather
> the 3rd of 4. I guess this corresponds to bank 2 above. I've been ignoring
> them up until now, but is this a Linux hassle or a h/w one?

The bank is referring to an MCE bank rather than a memory slot.
Each MCE bank checks different things.
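
As a rough sketch of what such a logged value holds: the bank number only
selects a group of MSRs, and the 64-bit number printed after it is that
bank's status register. The flag positions below follow the MCi_STATUS
layout in the IA-32 manual (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV,
58 ADDRV, 57 PCC, bits 15:0 MCA error code, bits 31:16 model-specific
code). The struct and function names are made up for this example, and the
remaining model-specific bits are deliberately left undecoded.

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural fields of an MCi_STATUS value (names are illustrative). */
struct mci_status {
	int val;                 /* bit 63: register holds a valid error   */
	int over;                /* bit 62: an earlier error was overwritten */
	int uc;                  /* bit 61: error was not corrected        */
	int en;                  /* bit 60: reporting was enabled in CTL   */
	int miscv;               /* bit 59: MCi_MISC is valid              */
	int addrv;               /* bit 58: MCi_ADDR is valid              */
	int pcc;                 /* bit 57: processor context corrupt      */
	unsigned int mca_code;   /* bits 15:0                              */
	unsigned int model_code; /* bits 31:16                             */
};

static int bit(uint64_t v, unsigned int n)
{
	return (int)((v >> n) & 1);
}

struct mci_status mci_decode(uint64_t status)
{
	struct mci_status s;

	s.val   = bit(status, 63);
	s.over  = bit(status, 62);
	s.uc    = bit(status, 61);
	s.en    = bit(status, 60);
	s.miscv = bit(status, 59);
	s.addrv = bit(status, 58);
	s.pcc   = bit(status, 57);
	s.mca_code   = (unsigned int)(status & 0xffff);
	s.model_code = (unsigned int)((status >> 16) & 0xffff);
	return s;
}

void mci_print(unsigned int bank, uint64_t status)
{
	struct mci_status s = mci_decode(status);

	printf("Bank %u: VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d "
	       "mca=0x%04x model=0x%04x\n",
	       bank, s.val, s.over, s.uc, s.en, s.miscv, s.addrv, s.pcc,
	       s.mca_code, s.model_code);
}
```

For the value 940040000000017a reported above, mci_decode() yields VAL, EN
and ADDRV set, with OVER, UC, MISCV and PCC clear and an MCA error code of
0x017a. Whether those bits mean anything depends on whether the register
was cleared at boot.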

Dave

--
Dave Jones http://www.codemonkey.org.uk

2003-09-22 15:09:24

by CaT

Subject: Re: [PATCH] Fix Athlon MCA

On Mon, Sep 22, 2003 at 03:43:45PM +0100, Dave Jones wrote:
> The bank is referring to an MCE bank rather than a memory slot.
> Each MCE bank checks different things.

Ahhh, OK. Well... I found your parsemce.c source, got it compiled, and ran:

./parsemce -b 2 -e 940040000000017a

and got:

Status: (940040000000017a) Error IP valid
Restart IP invalid.

What the snot does that mean? 8)

(if you can help it'd be appreciated :)

--
And so the stripper looks down and asks 'Can you breathe?'
- from a friend's bucks night

2003-09-22 16:02:59

by Dave Jones

Subject: Re: [PATCH] Fix Athlon MCA

On Tue, Sep 23, 2003 at 01:06:01AM +1000, CaT wrote:
> On Mon, Sep 22, 2003 at 03:43:45PM +0100, Dave Jones wrote:
> > The bank is referring to an MCE bank rather than a memory slot.
> > Each MCE bank checks different things.
>
> Ahhh, OK. Well... I found your parsemce.c source, got it compiled, and ran:
>
> ./parsemce -b 2 -e 940040000000017a
>
> and got:
>
> Status: (940040000000017a) Error IP valid
> Restart IP invalid.
>
> What the snot does that mean? 8)

If this was from a kernel that didn't clear that bank on boot,
it's bogus, and you can ignore it.

Dave

--
Dave Jones http://www.codemonkey.org.uk

2003-09-22 16:08:57

by CaT

Subject: Re: [PATCH] Fix Athlon MCA

On Mon, Sep 22, 2003 at 05:02:22PM +0100, Dave Jones wrote:
> On Tue, Sep 23, 2003 at 01:06:01AM +1000, CaT wrote:
> > On Mon, Sep 22, 2003 at 03:43:45PM +0100, Dave Jones wrote:
> > Status: (940040000000017a) Error IP valid
> > Restart IP invalid.
> >
> > What the snot does that mean? 8)
>
> If this was from a kernel that didn't clear that bank on boot,
> it's bogus, and you can ignore it.

Ahhh. Phew. Thanks. I've been wondering. I take it this can show up on
a long-running system too? (hopefully someone will find this bit of
the thread useful because I saw 1 or 2 msgs in the past but I didn't
quite understand the answers)

--
And so the stripper looks down and asks 'Can you breathe?'
- from a friend's bucks night

2003-09-22 16:21:33

by Dave Jones

Subject: Re: [PATCH] Fix Athlon MCA

On Tue, Sep 23, 2003 at 02:07:01AM +1000, CaT wrote:

> Ahhh. Phew. Thanks. I've been wondering. I take it this can show up on
> a long-running system too? (hopefully someone will find this bit of
> the thread useful because I saw 1 or 2 msgs in the past but I didn't
> quite understand the answers)

Ones that turn up after a while should be something different.
This bug was crap left in that register that gets reported and then
zeroed away. As we don't enable checking in that bank, once it's zero
it stays zero. Anything that triggers afterwards should be coming
from a different bank.
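
A toy model of that behaviour, assuming a poller that reports any bank whose
valid bit (bit 63) is set and then zeroes it. poll_banks() and the plain
array standing in for the status registers are hypothetical, not the
kernel's non-fatal checker. Since nothing in the model ever writes to a
disabled bank again, a stale value shows up exactly once.

```c
#include <stdint.h>
#include <stdio.h>

#define MCI_STATUS_VAL (1ULL << 63)

/*
 * Hypothetical poller: report and clear every bank whose status is
 * marked valid. Returns the number of banks that were reported.
 */
unsigned int poll_banks(uint64_t *status, unsigned int nr_banks)
{
	unsigned int i, reported = 0;

	for (i = 0; i < nr_banks; i++) {
		if (!(status[i] & MCI_STATUS_VAL))
			continue;	/* nothing logged in this bank */

		printf("Bank %u: %016llx\n", i,
		       (unsigned long long)status[i]);
		status[i] = 0;		/* zero it away after reporting */
		reported++;
	}
	return reported;
}
```

With the array seeded as { 0xd47fa0000000bfee, 0, 0, 0 }, the first call
reports bank 0 and clears it, and a second call reports nothing.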

Dave

--
Dave Jones http://www.codemonkey.org.uk