2003-03-24 21:17:05

by Steven Pritchard

Subject: 3ware driver errors

(Apparently 3w-xxxx in the Subject gets caught as spam. Somebody
might want to adjust that regular expression. :-)

I have a server that is locking up every day or two with a console
full of this error:

3w-xxxx: scsi0: Command failed: status = 0xcb, flags = 0x37, unit #0.

This is on a Dell PowerEdge 1400SC (dual PIII/1.13GHz, 1.1GB RAM),
with a 3ware Escalade 7000-2 and two WD1600JB drives, running Red Hat
8.0 with kernel-smp 2.4.18-27.8.0.

I plan to report this to Red Hat's bugzilla, but I'm hoping for some
ideas or big red flags to jump out at somebody here... I use this box
for a UML hosting server, so all this downtime is affecting *way* too
many people.

This box has been having other stability problems, so I'm guessing
this might not be directly related to the 3ware card/driver. It did
survive a memtest86 pass.

Steve
--
[email protected] | Southern Illinois Linux Users Group
(618)398-7360 | See web site for meeting details.
Steven Pritchard | http://www.silug.org/


2003-03-24 21:35:50

by Adam Radford

Subject: RE: 3ware driver errors

This is fixed in driver 1.02.00.032. The SCSI layer is looping on sense key
'aborted command' after you lost power to a JBOD drive, and not checking the
ASC, which is 0x04 (Logical Unit Not Ready). If you want to fix it by hand in
your driver before .032 comes out, in 3w-xxxx.h, change:

{0x37, 0x0b, 0x04, 0x00},

to:

{0x37, 0x02, 0x04, 0x00},

This will return sense key 'Not Ready', and you will not loop infinitely.
If I were you, I would jiggle the power cables on that box and replace any
flaky ones.
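
As a rough illustration of what the one-byte change does, here is a minimal
sketch of a flags-to-sense lookup. The row layout (controller flags, SCSI
sense key, ASC, ASCQ) is inferred from the two rows quoted above; the names
`sense_table` and `lookup_sense`, and the linear search, are assumptions made
for this example and are not taken from the actual 3w-xxxx source.

```c
#include <stddef.h>

/* Standard SCSI sense keys referred to in this message. */
#define SENSE_NOT_READY        0x02
#define SENSE_ABORTED_COMMAND  0x0b

/* Hypothetical row layout: {controller flags, sense key, ASC, ASCQ}. */
static const unsigned char sense_table[][4] = {
    /* Patched row; the original row was {0x37, 0x0b, 0x04, 0x00}. */
    {0x37, SENSE_NOT_READY, 0x04, 0x00},
};

/* Return the row whose first byte matches `flags`, or NULL if none does. */
static const unsigned char *lookup_sense(unsigned char flags)
{
    size_t i;

    for (i = 0; i < sizeof(sense_table) / sizeof(sense_table[0]); i++) {
        if (sense_table[i][0] == flags)
            return sense_table[i];
    }
    return NULL;
}
```

With the patched row, `lookup_sense(0x37)` yields sense key 0x02 (Not Ready)
with ASC 0x04, instead of 0x0b (Aborted Command), which is the sense key the
SCSI layer was retrying on.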

-Adam

-----Original Message-----
From: Steven Pritchard [mailto:[email protected]]
Sent: Monday, March 24, 2003 1:28 PM
To: [email protected]
Subject: 3ware driver errors


(Apparently 3w-xxxx in the Subject gets caught as spam. Somebody
might want to adjust that regular expression. :-)

I have a server that is locking up every day or two with a console
full of this error:

3w-xxxx: scsi0: Command failed: status = 0xcb, flags = 0x37, unit #0.

This is on a Dell PowerEdge 1400SC (dual PIII/1.13GHz, 1.1GB RAM),
with a 3ware Escalade 7000-2 and two WD1600JB drives, running Red Hat
8.0 with kernel-smp 2.4.18-27.8.0.

I plan to report this to Red Hat's bugzilla, but I'm hoping for some
ideas or big red flags to jump out at somebody here... I use this box
for a UML hosting server, so all this downtime is affecting *way* too
many people.

This box has been having other stability problems, so I'm guessing
this might not be directly related to the 3ware card/driver. It did
survive a memtest86 pass.

Steve
--
[email protected] | Southern Illinois Linux Users Group
(618)398-7360 | See web site for meeting details.
Steven Pritchard | http://www.silug.org/

2003-03-24 23:24:32

by Jeff V. Merkey

Subject: Re: 3ware driver errors



There is a firmware upgrade you need to obtain from WD if you are using their
drives with a 3Ware controller. The WD drives were optimized for desktop use
and they go into a "powersave" mode of sorts which will cause them to disappear
and reappear mysteriously with all sorts of strange errors. WD is aware of
this problem and so is 3Ware.

Jeff

On Mon, Mar 24, 2003 at 03:28:13PM -0600, Steven Pritchard wrote:
> (Apparently 3w-xxxx in the Subject gets caught as spam. Somebody
> might want to adjust that regular expression. :-)
>
> I have a server that is locking up every day or two with a console
> full of this error:
>
> 3w-xxxx: scsi0: Command failed: status = 0xcb, flags = 0x37, unit #0.
>
> This is on a Dell PowerEdge 1400SC (dual PIII/1.13GHz, 1.1GB RAM),
> with a 3ware Escalade 7000-2 and two WD1600JB drives, running Red Hat
> 8.0 with kernel-smp 2.4.18-27.8.0.
>
> I plan to report this to Red Hat's bugzilla, but I'm hoping for some
> ideas or big red flags to jump out at somebody here... I use this box
> for a UML hosting server, so all this downtime is affecting *way* too
> many people.
>
> This box has been having other stability problems, so I'm guessing
> this might not be directly related to the 3ware card/driver. It did
> survive a memtest86 pass.
>
> Steve
> --
> [email protected] | Southern Illinois Linux Users Group
> (618)398-7360 | See web site for meeting details.
> Steven Pritchard | http://www.silug.org/
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2003-03-24 23:26:56

by Jeff V. Merkey

Subject: Re: 3ware driver errors


Adam,

They also need to upgrade the firmware on the WD drives. There's a known problem with WD drives and 3Ware.

Jeff

On Mon, Mar 24, 2003 at 01:44:23PM -0800, Adam Radford wrote:
> This is fixed in driver 1.02.00.032, the scsi layer is looping on sense key
> 'aborted command' after you lost power to a jbod drive, and not checking the
> ASC
> which is 0x04 (Logical Unit Not Ready). If you want to fix it by hand in
> your
> driver before .032 comes out, in 3w-xxxx.h, change:
>
> {0x37, 0x0b, 0x04, 0x00},
>
> to:
>
> {0x37, 0x02, 0x04, 0x00}
>
> This will return sense key 'Not Ready', and you will will not infinitely
> loop.
> If I were you, I would jiggle the power cables on that box and replace flaky
> ones.
>
> -Adam
>
> -----Original Message-----
> From: Steven Pritchard [mailto:[email protected]]
> Sent: Monday, March 24, 2003 1:28 PM
> To: [email protected]
> Subject: 3ware driver errors
>
>
> (Apparently 3w-xxxx in the Subject gets caught as spam. Somebody
> might want to adjust that regular expression. :-)
>
> I have a server that is locking up every day or two with a console
> full of this error:
>
> 3w-xxxx: scsi0: Command failed: status = 0xcb, flags = 0x37, unit #0.
>
> This is on a Dell PowerEdge 1400SC (dual PIII/1.13GHz, 1.1GB RAM),
> with a 3ware Escalade 7000-2 and two WD1600JB drives, running Red Hat
> 8.0 with kernel-smp 2.4.18-27.8.0.
>
> I plan to report this to Red Hat's bugzilla, but I'm hoping for some
> ideas or big red flags to jump out at somebody here... I use this box
> for a UML hosting server, so all this downtime is affecting *way* too
> many people.
>
> This box has been having other stability problems, so I'm guessing
> this might not be directly related to the 3ware card/driver. It did
> survive a memtest86 pass.
>
> Steve
> --
> [email protected] | Southern Illinois Linux Users Group
> (618)398-7360 | See web site for meeting details.
> Steven Pritchard | http://www.silug.org/

2003-03-24 23:33:05

by Larry McVoy

Subject: Re: 3ware driver errors

On Mon, Mar 24, 2003 at 06:01:07PM -0700, Jeff V. Merkey wrote:
> There is a firmware upgrade you need to obtain from WD if you are using their
> drives with a 3Ware controller. The WD drives were optimized for desktop use
> and they go into a "powersave" mode of sorts which will cause them to disappear
> and reappear mysteriously with all sorts of strange errors. WD is aware of
> this problem and so is 3Ware.

Is this for all WD drives or just some? I've got some wd400 drives that
I've been using for a long time behind a 3ware in jbod mode. I have seen
some errors but they seem to have settled down. Is there any way to know?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-03-24 23:48:29

by Jeff V. Merkey

Subject: Re: 3ware driver errors


The person at WD to contact with specifics is listed below. We have seen it
on the 180GB drives, but the 200GB are also affected.

[email protected]

Jeff

On Mon, Mar 24, 2003 at 03:44:10PM -0800, Larry McVoy wrote:
> On Mon, Mar 24, 2003 at 06:01:07PM -0700, Jeff V. Merkey wrote:
> > There is a firmware upgrade you need to obtain from WD if you are using their
> > drives with a 3Ware controller. The WD drives were optimized for desktop use
> > and they go into a "powersave" mode of sorts which will cause them to disappear
> > and reappear mysteriously with all sorts of strange errors. WD is aware of
> > this problem and so is 3Ware.
>
> Is this for all WD drives or just some? I've got some wd400 drives that
> I've been using for a long time behind a 3ware in jbod mode. I have seen
> some errors but they seem to have settled down. Is there any way to know?
> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-03-24 23:56:42

by Mark Hahn

Subject: Re: 3ware driver errors

> > and reappear mysteriously with all sorts of strange errors. WD is aware of
> > this problem and so is 3Ware.
>
> Is this for all WD drives or just some? I've got some wd400 drives that
> I've been using for a long time behind a 3ware in jbod mode. I have seen
> some errors but they seem to have settled down. Is there any way to know?

I haven't seen the problem myself, but:
http://support.wdc.com/download/index.asp#raid3ware

2003-03-25 03:01:16

by Steven Pritchard

Subject: Re: 3ware driver errors

On Mon, Mar 24, 2003 at 06:25:08PM -0700, Jeff V. Merkey wrote:
> The person at WD to contact with specifics is listed below.

Thanks for the pointer. I have a lot of these WD drives...

> We have seen it on the 180GB drives, but the 200GB are also affected.

I don't suppose you've heard if the 160GB drives are affected, have
you? The page on support.wdc.com that someone else referred to
specifically mentions the 200s and the 180s, but I see no mention of
the 160s.

Steve
--
[email protected] | Southern Illinois Linux Users Group
(618)398-7360 | See web site for meeting details.
Steven Pritchard | http://www.silug.org/

2003-03-25 03:07:14

by Kevin P. Fleming

Subject: Re: 3ware driver errors

Steven Pritchard wrote:
> I don't suppose you've heard if the 160GB drives are affected, have
> you? The page on support.wdc.com that someone else referred to
> specifically mentions the 200s and the 180s, but I see no mention of
> the 160s.
>

I'd like to know this too; I've got a pair of WD1600JB drives (less than
two weeks old) attached to a 3Ware 7000-2. They've been working fine so
far, but I'm not keen on finding a problem later...

2003-03-25 15:02:44

by Ezra Nugroho

Subject: Re: 3ware driver errors

I have eight 120GB drives in a RAID 5.
Although the site doesn't say that the 120s are affected, my array has gone
degraded because one drive disappeared.
I got the same error message.

I am not sure I want to upgrade the firmware; then again, I am not sure
my array is stable either...

On Mon, 2003-03-24 at 22:12, Steven Pritchard wrote:
> On Mon, Mar 24, 2003 at 06:25:08PM -0700, Jeff V. Merkey wrote:
> > The person at WD to contact with specifics is listed below.
>
> Thanks for the pointer. I have a lot of these WD drives...
>
> > We have seen it on the 180GB drives, but the 200GB are also affected.
>
> I don't suppose you've heard if the 160GB drives are affected, have
> you? The page on support.wdc.com that someone else referred to
> specifically mentions the 200s and the 180s, but I see no mention of
> the 160s.
>
> Steve
> --
> [email protected] | Southern Illinois Linux Users Group
> (618)398-7360 | See web site for meeting details.
> Steven Pritchard | http://www.silug.org/


2003-03-25 15:15:48

by Roy Sigurd Karlsbakk

Subject: Re: 3ware driver errors

I'm running 2x8-port 3ware with 8 IBM 120gig disks on each controller in RAID
5. This has been running stably for half a year (that is, since I installed
it).

On Tuesday 25 March 2003 16:25, Ezra Nugroho wrote:
> I have 8 120GB in a raid 5.
> Although the site doesn't say that the 120s are affected, I have gotten
> my raid to be degraded because one drive disappeared.
> I got the same error message.
>
> I am not sure if I want to upgrade the firmware, however, I am not sure
> my array is stable either...
>
> On Mon, 2003-03-24 at 22:12, Steven Pritchard wrote:
> > On Mon, Mar 24, 2003 at 06:25:08PM -0700, Jeff V. Merkey wrote:
> > > The person at WD to contact with specifics is listed below.
> >
> > Thanks for the pointer. I have a lot of these WD drives...
> >
> > > We have seen it on the 180GB drives, but the 200GB are also affected.
> >
> > I don't suppose you've heard if the 160GB drives are affected, have
> > you? The page on support.wdc.com that someone else referred to
> > specifically mentions the 200s and the 180s, but I see no mention of
> > the 160s.
> >
> > Steve
> > --
> > [email protected] | Southern Illinois Linux Users Group
> > (618)398-7360 | See web site for meeting details.
> > Steven Pritchard | http://www.silug.org/

--
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.

2003-03-25 16:03:36

by Ezra Nugroho

Subject: Re: 3ware driver errors

Yeah, but those are IBM drives. We were talking about the WDC -JB/BB.

I would be interested to hear from other WDJ sub-180G users who have the
same problem.
Anyone?



On Tue, 2003-03-25 at 10:26, Roy Sigurd Karlsbakk wrote:
> I'm running 2x8-port 3ware with 8 IBM 120gig disks on each controller in raid
> 5. this has been running stably for half a year (that is - since I installed
> it).
>
> On Tuesday 25 March 2003 16:25, Ezra Nugroho wrote:
> > I have 8 120GB in a raid 5.
> > Although the site doesn't say that the 120s are affected, I have gotten
> > my raid to be degraded because one drive disappeared.
> > I got the same error message.
> >
> > I am not sure if I want to upgrade the firmware, however, I am not sure
> > my array is stable either...
> >
> > On Mon, 2003-03-24 at 22:12, Steven Pritchard wrote:
> > > On Mon, Mar 24, 2003 at 06:25:08PM -0700, Jeff V. Merkey wrote:
> > > > The person at WD to contact with specifics is listed below.
> > >
> > > Thanks for the pointer. I have a lot of these WD drives...
> > >
> > > > We have seen it on the 180GB drives, but the 200GB are also affected.
> > >
> > > I don't suppose you've heard if the 160GB drives are affected, have
> > > you? The page on support.wdc.com that someone else referred to
> > > specifically mentions the 200s and the 180s, but I see no mention of
> > > the 160s.
> > >
> > > Steve
> > > --
> > > [email protected] | Southern Illinois Linux Users Group
> > > (618)398-7360 | See web site for meeting details.
> > > Steven Pritchard | http://www.silug.org/
>
> --
> Roy Sigurd Karlsbakk, Datavaktmester
> ProntoTV AS - http://www.pronto.tv/
> Tel: +47 9801 3356
>
> Computers are like air conditioners.
> They stop working when you open Windows.
>


2003-03-27 15:51:11

by Larry McVoy

Subject: ECC error in 2.5.64 + some patches

I'm getting these on the machine we use to do the BK->CVS conversions.
My guess is that this means there was a memory error and ECC fixed it.
The only problem is that I'm reasonably sure that there isn't ECC on
these DIMMs. Does anyone have the table of error codes to explanations?
Google didn't find anything for this one.

Thanks.

Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
slovax kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
slovax kernel: Bank 1: 9000000000000151

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-03-27 16:06:15

by Tim Schmielau

Subject: Re: ECC error in 2.5.64 + some patches

On Thu, 27 Mar 2003, Larry McVoy wrote:

> I'm getting these on the machine we use to do the BK->CVS conversions.
> My guess is that this means there was a memory error and ECC fixed it.
> The only problem is that I'm reasonably sure that there isn't ECC on
> these DIMMs. Does anyone have the table of error codes to explanations?
> Google didn't find anything for this one.

No, I don't have a table of error codes either, but it's probably the
on-die Cache which has ECC for all recent (>=350 MHz iirc) Pentii.

Tim

2003-03-27 16:15:14

by Randy.Dunlap

Subject: Re: ECC error in 2.5.64 + some patches

On Thu, 27 Mar 2003 08:02:20 -0800 Larry McVoy <[email protected]> wrote:

| I'm getting these on the machine we use to do the BK->CVS conversions.
| My guess is that this means there was a memory error and ECC fixed it.
| The only problem is that I'm reasonably sure that there isn't ECC on
| these DIMMs. Does anyone have the table of error codes to explanations?
| Google didn't find anything for this one.
|
| Thanks.
|
| Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
| slovax kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
|
| Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
| slovax kernel: Bank 1: 9000000000000151

You can try the Dave Jones "parsemce" tool on it, from
http://www.codemonkey.org.uk/cruft/parsemce.c/

--
~Randy

2003-03-27 16:20:15

by Larry McVoy

Subject: Re: ECC error in 2.5.64 + some patches

> | Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
> | slovax kernel: Bank 1: 9000000000000151
>
> You can try the Dave Jones "parsemce" tool on it, from
> http://www.codemonkey.org.uk/cruft/parsemce.c/

slovax /tmp a.out -b 1 -e 9000000000000151
Status: (-8070450532247928495) Restart IP valid.

What does that mean?
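
One likely reading of the odd negative number, offered as a guess rather than
a statement about parsemce's source: it is the same 64-bit status value
printed as a signed decimal. The sketch below reinterprets the bits both ways;
the function names are made up for this example.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reinterpret a 64-bit status word as a signed value, bit for bit. */
static int64_t status_as_signed(uint64_t status)
{
    int64_t s;

    memcpy(&s, &status, sizeof s);
    return s;
}

/* Print the status as a signed decimal and as zero-padded hexadecimal. */
static void show_status(uint64_t status)
{
    printf("signed decimal: %lld\n", (long long)status_as_signed(status));
    printf("hexadecimal:    %016llx\n", (unsigned long long)status);
}
```

Passing 0x9000000000000151 to `status_as_signed` gives -8070450532247928495,
which is the number in the Status line above, so the value was read correctly
and only displayed oddly.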
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-03-27 16:20:21

by Dave Jones

Subject: Re: ECC error in 2.5.64 + some patches

On Thu, Mar 27, 2003 at 08:02:20AM -0800, Larry McVoy wrote:

> Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
> slovax kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
>
> Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
> slovax kernel: Bank 1: 9000000000000151

An MCE (Machine Check Exception) could be triggered by any number of
things from bad cooling, underrated power supply, to flaky RAM.
Give things a going over with memtest86 for the latter.
The former just means you pull everything apart and double check
it looks ok.

Dave

2003-03-27 16:16:52

by Larry McVoy

Subject: Re: ECC error in 2.5.64 + some patches

On Thu, Mar 27, 2003 at 05:17:25PM +0100, Tim Schmielau wrote:
> On Thu, 27 Mar 2003, Larry McVoy wrote:
>
> > I'm getting these on the machine we use to do the BK->CVS conversions.
> > My guess is that this means there was a memory error and ECC fixed it.
> > The only problem is that I'm reasonably sure that there isn't ECC on
> > these DIMMs. Does anyone have the table of error codes to explanations?
> > Google didn't find anything for this one.
>
> No, I don't have a table of error codes either, but it's probably the
> on-die Cache which has ECC for all recent (>=350 MHz iirc) Pentii.

This is a 2.16GHz Athlon, not a Pentium, if that makes a difference.

slovax ~ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : AMD Athlon(tm) XP 2700+
stepping : 1
cpu MHz : 2162.466
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 4276.22

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-03-27 16:27:49

by Tim Schmielau

Subject: Re: ECC error in 2.5.64 + some patches

On Thu, 27 Mar 2003, Larry McVoy wrote:

> On Thu, Mar 27, 2003 at 05:17:25PM +0100, Tim Schmielau wrote:
> > On Thu, 27 Mar 2003, Larry McVoy wrote:
> >
> > > I'm getting these on the machine we use to do the BK->CVS conversions.
> > > My guess is that this means there was a memory error and ECC fixed it.
> > > The only problem is that I'm reasonably sure that there isn't ECC on
> > > these DIMMs. Does anyone have the table of error codes to explanations?
> > > Google didn't find anything for this one.
> >
> > No, I don't have a table of error codes either, but it's probably the
> > on-die Cache which has ECC for all recent (>=350 MHz iirc) Pentii.
>
> This is a 2.16Ghz Athlon not a Pentium if that makes a difference.

The on-die second-level cache of all Athlons also has ECC. But I can't
find a document of the error codes on AMD's website either.

2003-03-27 16:49:28

by Chris Wedgwood

Subject: Re: ECC error in 2.5.64 + some patches

On Thu, Mar 27, 2003 at 08:02:20AM -0800, Larry McVoy wrote:

> My guess is that this means there was a memory error and ECC fixed
> it.

Nope.

There is an ecc driver for RAM and you'll be able to detect these
using that. RAM ECC errors in my experience don't cause MCEs, usually
the CPU never notices.

> The only problem is that I'm reasonably sure that there isn't ECC on
> these DIMMs.

Dump the SPD and you can check... usually the BIOS will tell you too.
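
If you do have a raw SPD dump in hand, a minimal sketch of the check follows.
It assumes the SDR/DDR SDRAM SPD layout, where byte 11 is the module
configuration type (0 = none, 1 = parity, 2 = ECC); later memory generations
encode this differently, and how you obtain the dump is left out. The enum and
function names are hypothetical.

```c
#include <stddef.h>

/* Module configuration type as stored in SPD byte 11 (SDR/DDR layout). */
enum spd_config {
    SPD_UNKNOWN = -1,
    SPD_NONE    = 0,
    SPD_PARITY  = 1,
    SPD_ECC     = 2
};

/* Classify a raw SPD dump; returns SPD_UNKNOWN if it is too short or odd. */
static enum spd_config spd_config_type(const unsigned char *spd, size_t len)
{
    if (spd == NULL || len <= 11)
        return SPD_UNKNOWN;

    switch (spd[11]) {
    case 0:  return SPD_NONE;
    case 1:  return SPD_PARITY;
    case 2:  return SPD_ECC;
    default: return SPD_UNKNOWN;
    }
}
```

A result of `SPD_ECC` only says the module advertises ECC; whether the
chipset and BIOS actually enable it is a separate question.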

> Does anyone have the table of error codes to explanations? Google
> didn't find anything for this one.

as someone else pointed out, parsemce is what you want

> Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
> slovax kernel: Bank 1: 9000000000000151

Status: (9000000000000151) Restart IP valid.

*Exactly* what this means I don't know --- but I'm guessing the CPU is
overheating. Check fans, air-flow, etc. and see if that helps. So
far whenever I've seen the above problem it's *ALWAYS* been related to
the CPU getting too hot.


--cw

2003-03-27 17:10:14

by Dominik Kubla

Subject: Re: ECC error in 2.5.64 + some patches

On Thursday, 27 March 2003 18:00, Chris Wedgwood wrote:

> > Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
> > slovax kernel: Bank 1: 9000000000000151
>
> Status: (9000000000000151) Restart IP valid.
>
> *Exactly* what this means I don't know --- but I'm guessing the CPU is
> overheating. Check fans, air-flow, etc. and see if that helps. So
> far whenever I've seen the above problem it's *ALWAYS* been related to
> the CPU getting too hot.

Well, the internal buses and buffers of modern CPUs, and in many cases also
the on-die caches, have ECC logic. And if I should hazard a guess: "Restart
IP valid" => Restarted Instruction Pre-Fetch resulted in a valid state of the
pre-fetch queue.

In Larry's case I'd remove the CPU cooler, clean everything and reassemble,
since I would assume that there is a hot-spot on the die.

Regards,
Dominik
--
Be at war with your voices, at peace with your neighbors, and let every new
year find you a better man. (Benjamin Franklin, 1706-1790)

2003-03-27 17:15:49

by Chris Wedgwood

Subject: Re: ECC error in 2.5.64 + some patches

On Thu, Mar 27, 2003 at 06:19:41PM +0100, Dominik Kubla wrote:

> Well the internal busses and buffers of modern CPU's and in many
> cases also the on-die caches have ECC logic.

his email said "DIMMs"

> And if i should hazard a guess: "Restart IP valid" => Restarted
> Instruction Pre-Fetch resulted in a valid state of the pre-fetch
> queue.

could be ... i've not checked the AMD docs

> In Larry's case i'd remove the cpu cooler, clean everything and
> reassemble, since i would assume that there is a hot-spot on the
> die.

or simply remove the side of the case or increase air-conditioning and
see if that goes away or becomes less apparent. IME, if you get these
sporadically rather than often, it's 'just' overheating...


--cw

2003-03-27 23:59:51

by Dave Jones

Subject: Re: ECC error in 2.5.64 + some patches

On Thu, Mar 27, 2003 at 08:31:20AM -0800, Larry McVoy wrote:
> > | Message from syslogd@slovax at Thu Mar 27 05:53:49 2003 ...
> > | slovax kernel: Bank 1: 9000000000000151
> > You can try the Dave Jones "parsemce" tool on it, from
> > http://www.codemonkey.org.uk/cruft/parsemce.c/
>
> slovax /tmp a.out -b 1 -e 9000000000000151
> Status: (-8070450532247928495) Restart IP valid.
>
> What does that mean?

It means Dave sucks and hasn't done a good enough job on the parser.
parsemce is really really unintuitive to use.

There's some bits missing from your dump. Usually, MCEs look like..

Sep 4 21:43:41 hamlet kernel: CPU 0: Machine Check Exception: 0000000000000004
Sep 4 21:43:41 hamlet kernel: Bank 1: f600200000000152 at 7600200000000152

All we have to go on in your example is the bank status code.
(which is -s, not -e. -e would be the 0000000000000004 in the example above. [*])

So, without the missing bits, we have to fake it..

(davej@deviant:davej)$ ./a.out -b 1 -e 1 -s 9000000000000151 -a 0
Status: (1) Restart IP valid.
parsebank(1): 9000000000000151 @ 0
External tag parity error
Error enabled in control register
Memory heirarchy error
Request: Generic error
Transaction type : Instruction
Memory/IO : Reserved

Ignore the Status: line, that's decoded from the (faked) -e 1.

Any the wiser ? 8-) [*]
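
For anyone who wants to check the top-level flags by hand, the sketch below
pulls out the architectural bits of a machine-check bank status word (valid,
overflow, uncorrected, enabled and so on, in bits 63 down to 57) plus the low
16-bit error code. The struct and function names are invented for this
example, and it makes no attempt to reproduce parsemce's vendor-specific
decoding of the error code itself.

```c
#include <stdint.h>

/* Top-level fields of a machine-check bank status register. */
struct mc_status {
    int valid;           /* bit 63: register holds a logged error   */
    int overflow;        /* bit 62: an earlier error was overwritten */
    int uncorrected;     /* bit 61: hardware did not correct it      */
    int enabled;         /* bit 60: reporting enabled in control reg */
    int miscv;           /* bit 59: MISC register is valid           */
    int addrv;           /* bit 58: ADDR register is valid           */
    int pcc;             /* bit 57: processor context corrupt        */
    uint16_t model_code; /* bits 31..16: model-specific error code   */
    uint16_t mca_code;   /* bits 15..0: machine-check error code     */
};

/* Split a raw 64-bit status word into the fields above. */
static struct mc_status decode_mc_status(uint64_t s)
{
    struct mc_status st;

    st.valid       = (int)((s >> 63) & 1);
    st.overflow    = (int)((s >> 62) & 1);
    st.uncorrected = (int)((s >> 61) & 1);
    st.enabled     = (int)((s >> 60) & 1);
    st.miscv       = (int)((s >> 59) & 1);
    st.addrv       = (int)((s >> 58) & 1);
    st.pcc         = (int)((s >> 57) & 1);
    st.model_code  = (uint16_t)((s >> 16) & 0xffff);
    st.mca_code    = (uint16_t)(s & 0xffff);
    return st;
}
```

For 9000000000000151 only the valid and enabled bits are set and the error
code is 0x0151. That fits the kernel calling it a correctable incident and the
"Error enabled in control register" line in the output above.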

Dave

[*] See, unintuitive, evil and nasty.
Given the time, I'd start over from scratch.