2005-12-24 17:32:35

by Andy Stewart

Subject: Machine check 2.6.13.3 dual opteron

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Hi everybody,

My machine locked up on me and I found this message on my serial
console. I have no idea how to decode its meaning - can you help?

CPU 0: Machine Check Exception: 4
Bank 4: b200000000070f0f
TSC 39619ee1e2187
Kernel panic - not syncing: Machine check

My machine is a dual Opteron running the 2.6.13.3 kernel. I'm not
positive, but I think I can reproduce it. Assuming that I can, what
information would be helpful to debug the problem?

Please cc: me on the response as I am not subscribed to this mailing list.

Thanks!

Andy
- --
Andy Stewart, Founder
Worcester Linux Users' Group
Worcester, MA, USA
http://www.wlug.org

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDrYYxHl0iXDssISsRAjONAJ9zoU0vSmikAkMqmQI2po0Jp9E83QCghO/M
Zxq/FKaldR1hzyrJqiJ+sMg=
=gdcL
-----END PGP SIGNATURE-----


2005-12-24 19:46:25

by Doug Thompson

Subject: Re: Machine check 2.6.13.3 dual opteron

With the Opteron Bank 4 value you gave, this decodes to:


Decoding MCE value as MCi_STATUS: 'b200000000070f0f'
Bit 63: Valid error
Bit 61: UNCORRECTED error
Bit 60: MCA Error Reporting Enabled
Bit 57: Process Context Corrupt
HyperTransport Link Number= 0
Extended Error Code = 0x7 - WatchDog error

BUS Error:
Processor(generic)
TimeOut(timed out)
Memory Transaction Type(generic)
Mem or IO(generic)
Cache Level(generic)
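
The decode above can be checked by hand. Below is a minimal Python sketch, not the tool that produced the output above. The flag bits 63 to 57 and the bus-error pattern in the low 16 bits follow the standard x86 machine-check layout; reading bits 19:16 as the extended error code is an assumption taken from the decode itself, and the function and table names are made up for illustration.

```python
# Minimal sketch: decode an x86 machine-check status word (MCi_STATUS).
# Bits 63..57 are the architectural MCA flags. Reading bits 19:16 as the
# extended error code is an assumption taken from the decode quoted above.

FLAG_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
             59: "MISCV", 58: "ADDRV", 57: "PCC"}

PARTICIPATION = {0: "source", 1: "responder", 2: "observer", 3: "generic"}
MEM_OR_IO = {0: "memory", 1: "reserved", 2: "io", 3: "generic"}
CACHE_LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    result = {
        "flags": [name for bit, name in FLAG_BITS.items()
                  if (status >> bit) & 1],
        "extended_code": (status >> 16) & 0xF,
        "bus": None,
    }
    code = status & 0xFFFF
    # Bus errors use the pattern 0000 1PPT RRRR IILL in the low 16 bits.
    if (code & 0xF800) == 0x0800:
        result["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timed_out": bool((code >> 8) & 0x1),
            "request_type": (code >> 4) & 0xF,  # 0 means generic
            "mem_or_io": MEM_OR_IO[(code >> 2) & 0x3],
            "cache_level": CACHE_LEVEL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    print(decode_status(0xB200000000070F0F))
```

For the value in this thread the sketch reports VAL, UC, EN and PCC set, ADDRV clear (so no error address was captured), extended code 0x7, and a generic bus error that timed out, which agrees with the decode above.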

You had an Uncorrectable Error.
Since you did not post an error address, I assume that none was reported. Therefore, because
of the WatchDog error, there might be an error between the CPU and memory. There is definitely
a hardware problem.

Suspects:
CPU-Mem Controller
Possibly even a bad memory DIMM

doug thompson





"If you think Education is expensive, just try Ignorance"

"Don't tell people HOW to do things, tell them WHAT you
want and they will surprise you with their ingenuity."
Gen George Patton

2005-12-25 04:06:59

by Andy Stewart

Subject: Re: Machine check 2.6.13.3 dual opteron

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Doug Thompson wrote:
> With the Opteron Bank 4 value you gave, this decodes to:
>
>
> Decoding MCE value as MCi_STATUS: 'b200000000070f0f'
> Bit 63: Valid error
> Bit 61: UNCORRECTED error
> Bit 60: MCA Error Reporting Enabled
> Bit 57: Process Context Corrupt
> HyperTransport Link Number= 0
> Extended Error Code = 0x7 - WatchDog error
>
> BUS Error:
> Processor(generic)
> TimeOut(timed out)
> Memory Transaction Type(generic)
> Mem or IO(generic)
> Cache Level(generic)
>
> You had an Uncorrectable Error.
> Since you did not post an error address, I assume that none was reported. Therefore, because
> of the WatchDog error, there might be an error between the CPU and memory. There is definitely
> a hardware problem.
>
> Suspects:
> CPU-Mem Controller
> Possibly even a bad memory DIMM
>
> doug thompson

Hi Doug,

Thank you for your quick reply. There was no "address error" - the only
thing on the serial console is what I posted.

In the past, this board, the CPUs, and these memory DIMMs have passed 24
continuous hours of memtest, but perhaps something has deteriorated in
the meantime. I'll run memtest again to see if I can find a problem.

FYI: This MB (MSI K8T Master2 FAR) has been problematic since day one.
A BIOS upgrade improved stability, but I've never gotten more than 21
straight days of uptime on this machine (compared to 200-300 days on
several other machines in my house). I've had *many* random hangs but
seldom does something get printed on the serial console. When it does
hang, the serial console is unresponsive.

I'll be swapping out this MB in favor of a Tyan. I've put up with it
long enough. Assuming that the Tyan MB solves the problem, I won't be
purchasing any more MSI MBs (and neither will my friends if I have
anything to say about it).

Thanks again for the help!

Andy

- --
Andy Stewart, Founder
Worcester Linux Users' Group
Worcester, MA, USA
http://www.wlug.org

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDrhrVHl0iXDssISsRAhzLAJoCb5yTL2meARSIVhnQjP54AVVPOwCeJ2aS
NWkOTzLQ57U6tuU8h+YM9bM=
=NH89
-----END PGP SIGNATURE-----

2005-12-25 04:13:30

by Andy Stewart

Subject: Re: Machine check 2.6.13.3 dual opteron

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Avuton Olrich wrote:
> On 12/24/05, Andy Stewart <[email protected]> wrote:
>
>>My machine locked up on me and I found this message on my serial
>>console. I have no idea how to decode its meaning - can you help?
>>
>>CPU 0: Machine Check Exception: 4
>>Bank 4: b200000000070f0f
>>TSC 39619ee1e2187
>>Kernel panic - not syncing: Machine check
>>
>>My machine is a dual Opteron running the 2.6.13.3 kernel. I'm not
>>positive, but I think I can reproduce it. Assuming that I can, what
>>information would be helpful to debug the problem?
>>
>>Please cc: me on the response as I am not subscribed to this mailing list.
>
>
> Welcome to my life. Finding out what exactly's causing the problem is
> paramount to me, also. I'm getting a random crash sometimes after an
> hour, sometimes it'll go 24 hours. But the results are consistent.
> After talking to AMD Tech support, they basically said they could do
> nothing for me. Are your opterons dual core? I've gotten an email from
> someone who was having the same problem with a similar setup but was
> running a dual core chip.
>
> I have had one person personally email me about this subject and he
> stated that taking one of his hard drives off 'slave' helped. (I'm
> only running one hard drive at the moment, which isn't on slave and
> doesn't help).
>
> They recommended that I run memtest86 (I did for 24 hours, and
> everything seems fine there).
>
> And they said they could help me no further.
>
> Here's my thread on the subject.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113239372109342&w=2
> --
> avuton
> --
> Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Hello Avuton,

Thank you for your reply to my inquiry. My Opterons are 244s and they
are not dual core. Like you, my setup has passed 24 hours of memtest on
at least 2 or 3 separate occasions.

I'm strongly suspicious of the MB since a BIOS upgrade improved my
situation but there is still instability. Nothing else I've done has
had as marked an effect on stability as that BIOS upgrade.

I've even turned off every spigot (onboard device) on the MB and replaced
each with its own add-in card, all to no avail. This thing still hangs
randomly - sometimes after a couple of days, sometimes after 3 weeks,
sometimes multiple times in an evening.

I'll take a look at the thread which you referenced. I also plan to
replace the MB due to my aforementioned suspicions.

Thanks!

Andy

- --
Andy Stewart, Founder
Worcester Linux Users' Group
Worcester, MA, USA
http://www.wlug.org

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDrhxbHl0iXDssISsRAraeAJ99WLYZNBPvAiltl21oBS1RtwIt+gCcDUIK
GJw+0tj7AOJooKb6E0gnka4=
=N+p9
-----END PGP SIGNATURE-----

2005-12-25 04:19:16

by Doug Thompson

Subject: Re: Machine check 2.6.13.3 dual opteron

I have an ASUS A8V-E with a 3500+ CPU and 2 GB of RAM. I have had no problems with it.

I run Bluesmoke on it (bluesmoke.sourceforge.net), which has been renamed EDAC in a submission for
inclusion in the kernel (hopefully 2.6.16). Bluesmoke scans the K8's memory controller
looking for correctable errors as they occur. The idea is that detecting correctable errors and
replacing the affected memory can reduce sudden uncorrectable errors.
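
As a rough illustration of that kind of monitoring, the Python sketch below reads correctable and uncorrectable error counters per memory controller. It assumes a kernel with an EDAC driver loaded that exposes `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mcN/`; the out-of-tree Bluesmoke module may lay things out differently. `read_edac_counts` is a hypothetical helper name, not part of any tool.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} for each mcN dir.

    The default root is an assumption about the EDAC sysfs layout; pass a
    different directory if your kernel exposes the counters elsewhere.
    """
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        if entry:
            counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

A rising `ce_count` on one controller is the early warning described above: the errors are still being corrected, but the memory behind that controller is worth replacing before an uncorrectable error occurs.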

But your problem seems to be generated by the motherboard, not memory. SOMETHING is not quite right.

doug t



2005-12-25 04:36:40

by Andy Stewart

Subject: Re: Machine check 2.6.13.3 dual opteron

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Doug Thompson wrote:
> I have an ASUS A8V-E with a 3500+ CPU and 2 GB of RAM. I have had no problems with it.
>
> I run Bluesmoke on it (bluesmoke.sourceforge.net), which has been renamed EDAC in a submission for
> inclusion in the kernel (hopefully 2.6.16). Bluesmoke scans the K8's memory controller
> looking for correctable errors as they occur. The idea is that detecting correctable errors and
> replacing the affected memory can reduce sudden uncorrectable errors.
>
> But your problem seems to be generated by the motherboard, not memory. SOMETHING is not quite right.
>
> doug t

Hi Doug,

I'll be sure to take a look at Bluesmoke / EDAC. Thanks!

Now, I'd best go to bed or Santa won't come to my house. :-)

Andy


- --
Andy Stewart, Founder
Worcester Linux Users' Group
Worcester, MA, USA
http://www.wlug.org

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDriHJHl0iXDssISsRAnqGAJ4rUEd7rF4uesucMhQaxTlW4byx3ACfRYV3
EJeZ3kXez6cRF1hxQJVh+NE=
=h52c
-----END PGP SIGNATURE-----