LinuxLists.cc - Help with machine check exception

2006-01-12 16:33:59

Subject: Help with machine check exception

Can someone help determine the problem here? Does it definitely point
to a bad CPU, or possibly a bad motherboard?

Thanks!

CPU 0: Machine Check Exception: 4 Bank 4: b200000000070f0f
TSC 184fcd0553e4
Kernel panic - not syncing: Machine check

Call Trace: <#MC> <ffffffff80134831>{panic+133}
<ffffffff8034329c>{_spin_trylock+9}
<ffffffff8010f4d3>{oops_begin+90} <ffffffff801149f8>{print_mce+136}
<ffffffff80114abf>{mcheck_timer+0}
<ffffffff801150cd>{do_machine_check+752}
<ffffffff8010ee6f>{machine_check+127} <EOE>
NMI Watchdog detected LOCKUP on CPU 0
CPU 0
Modules linked in: nfs lockd nfs_acl ipv6 parport_pc lp parport autofs4
sunrpc xfs exportfs dm_mod video button battery ac ohci_hcd i2c_amd8111
i2c_amd756 i2c_core shpchp eepro100 e100 mii tg3 floppy ext3 jbd
Pid: 14041, comm: srt Tainted: G M 2.6.14-1.1656_FC4smp #1
RIP: 0010:[<ffffffff80118242>] <ffffffff80118242>{__smp_call_function+107}
RSP: 0000:ffffffff804a8358 EFLAGS: 00000002
RAX: 0000000000000002 RBX: 0000000000000003 RCX: 0000ffff0000ffff
RDX: 0000000000000004 RSI: 0000000000000020 RDI: ffffffff80523be0
RBP: 0000000000000000 R08: ffff8100826b71e0 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff8011abcb R12: ffffffff80117f0b
R13: 0000000000000000 R14: ffffffff80360d29 R15: 0000000000000001
FS: 00002aaaaae8ad00(0000) GS:ffffffff80518000(0000) knlGS:00000000f7fab6c0
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffffc07000 CR3: 00000000c2f42000 CR4: 00000000000006e0
Process srt (pid: 14041, threadinfo ffff8100bd0ba000, task ffff8100822000c0)
Stack: ffffffff80117f0b 0000000000000000 0000000000000002 0000000000000000
0000000000014f00 0000000000000001 0000000000000000 0000000000000000
0000184fcd054eab ffffffff801182a0
Call Trace: <#MC> <ffffffff80117f0b>{smp_really_stop_cpu+0}
<ffffffff801182a0>{smp_send_stop+43}
<ffffffff8013483d>{panic+145} <ffffffff8034329c>{_spin_trylock+9}
<ffffffff8010f4d3>{oops_begin+90} <ffffffff801149f8>{print_mce+136}
<ffffffff80114abf>{mcheck_timer+0}
<ffffffff801150cd>{do_machine_check+752}
<ffffffff8010ee6f>{machine_check+127} <EOE>

Code: 8b 44 24 10 39 c3 75 f6 85 ed 75 14 66 90 eb 18 f3 90 8b 44
console shuts up ...
<3>Debug: sleeping function called from invalid context at
include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():1

Call Trace: <NMI> <ffffffff8013603a>{profile_task_exit+21}
<ffffffff801371f8>{do_exit+34}
<ffffffff8025456d>{do_unblank_screen+40}
<ffffffff8010f593>{bad_intr+0}
<ffffffff80118ed0>{nmi_watchdog_tick+242}
<ffffffff8010f835>{default_do_nmi+137}
<ffffffff80117f0b>{smp_really_stop_cpu+0}
<ffffffff80118ff9>{do_nmi+69}
<ffffffff8010eb97>{nmi+127}
<ffffffff80117f0b>{smp_really_stop_cpu+0}
<ffffffff8011abcb>{flat_send_IPI_mask+0}
<ffffffff80118242>{__smp_call_function+107}
<EOE> <#MC> <ffffffff80117f0b>{smp_really_stop_cpu+0}
<ffffffff801182a0>{smp_send_stop+43} <ffffffff8013483d>{panic+145}
<ffffffff8034329c>{_spin_trylock+9}
<ffffffff8010f4d3>{oops_begin+90}
<ffffffff801149f8>{print_mce+136} <ffffffff80114abf>{mcheck_timer+0}
<ffffffff801150cd>{do_machine_check+752}
<ffffffff8010ee6f>{machine_check+127}
<EOE>
APIC error on CPU0: 00(08)
Kernel panic - not syncing: Aiee, killing interrupt handler!
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

Call Trace:<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<NMI> <7>APIC error on CPU0: 08(08)
<ffffffff80134831>{panic+133}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff8034334a>{_spin_unlock_irq+14}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
<ffffffff80342cd1>{__down_read+50}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<7>APIC error on CPU0: 08(08)
<ffffffff803432e8>{_spin_lock_irqsave+9}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
<ffffffff801fde91>{__up_read+19}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff80137255>{do_exit+127}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<ffffffff8025456d>{do_unblank_screen+40}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff8010f593>{bad_intr+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<ffffffff80118ed0>{nmi_watchdog_tick+242}<7>APIC error on CPU0: 08(08)
<ffffffff8010f835>{default_do_nmi+137}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
<ffffffff80117f0b>{smp_really_stop_cpu+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff80118ff9>{do_nmi+69}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
<ffffffff8010eb97>{nmi+127}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff80117f0b>{smp_really_stop_cpu+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff8011abcb>{flat_send_IPI_mask+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff80118242>{__smp_call_function+107}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
<EOE> <7>APIC error on CPU0: 08(08)
<#MC> <7>APIC error on CPU0: 08(08)
<ffffffff80117f0b>{smp_really_stop_cpu+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
<ffffffff801182a0>{smp_send_stop+43}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff8013483d>{panic+145}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
<ffffffff8034329c>{_spin_trylock+9}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff8010f4d3>{oops_begin+90}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
<ffffffff801149f8>{print_mce+136}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff80114abf>{mcheck_timer+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
<ffffffff801150cd>{do_machine_check+752}<7>APIC error on CPU0: 08(08)
<ffffffff8010ee6f>{machine_check+127}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<EOE> <7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
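
For readers following along, the first line of the panic can be split into its fields before any decoding is attempted. This is a minimal sketch in Python, assuming the line format shown in this report and that both numbers are printed in hexadecimal; `parse_mce_line` and `MCG_STATUS_BITS` are names made up for this illustration, not part of the kernel or mcelog.

```python
import re

# Low bits of the MCG_STATUS register in the x86 machine-check architecture.
MCG_STATUS_BITS = {
    0: "RIPV (restart IP valid)",
    1: "EIPV (error IP valid)",
    2: "MCIP (machine check in progress)",
}

# Assumed line format, taken from the panic message in this report.
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception:\s*(?P<mcg>[0-9a-fA-F]+)"
    r"\s+Bank (?P<bank>\d+):\s*(?P<status>[0-9a-fA-F]{16})"
)


def parse_mce_line(line):
    """Return the fields of an MCE panic line, or None if it does not match."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    mcg = int(m.group("mcg"), 16)
    return {
        "cpu": int(m.group("cpu")),
        "bank": int(m.group("bank")),
        "mcg_status": mcg,
        "mcg_flags": [name for bit, name in MCG_STATUS_BITS.items()
                      if (mcg >> bit) & 1],
        "status": int(m.group("status"), 16),
    }


record = parse_mce_line(
    "CPU 0: Machine Check Exception: 4 Bank 4: b200000000070f0f"
)
print(record["mcg_flags"])
print(hex(record["status"]))
```

With MCG_STATUS equal to 4, only MCIP is set. RIPV is clear, which means the processor did not report a valid restart point, consistent with the kernel panicking rather than trying to continue.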

2006-01-12 17:12:32

by Orion Poplawski

Subject: Re: Help with machine check exception

Orion Poplawski wrote:
> Can someone help determine the problem here? Does it definitely point
> to a bad CPU, or possibly a bad motherboard?
>
> Thanks!
>

mcelog decode states:

CPU 0 4 northbridge TSC 184fcd0553e4
Northbridge Watchdog error
bit57 = processor context corrupt
bit61 = error uncorrected
bus error 'generic participation, request timed out
generic error mem transaction
generic access, level generic'
STATUS b200000000070f0f MCGSTATUS 4
Kernel panic - not syncing: Machine check
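
The mcelog text above can be cross-checked by hand from the STATUS value. The following is a rough Python sketch, not mcelog's own code. The flag bits are the architectural MCi_STATUS bits. The `K8_NB_EXT` table (bits 19:16) and the name tables for the bus error fields are written from memory of the K8 northbridge documentation and should be treated as assumptions. `decode_status` is a hypothetical helper name.

```python
STATUS = 0xB200000000070F0F  # STATUS value quoted in the mcelog output above

# Architectural flag bits in the top byte of MCi_STATUS.
FLAG_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
             59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Assumption: K8 northbridge extended error codes held in bits 19:16.
K8_NB_EXT = {0: "ECC error", 1: "CRC error", 2: "Sync error",
             3: "Master abort", 4: "Target abort", 5: "GART error",
             6: "RMW error", 7: "Watchdog error", 8: "Chipkill ECC error"}

# Assumption: field encodings of the bus error code, 0000 1PPT RRRR IILL.
PARTICIPATION = ["originated", "responded", "observed", "generic"]
MEM_IO = ["memory", "reserved", "I/O", "generic"]
LEVEL = ["L0", "L1", "L2", "generic"]


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    code = status & 0xFFFF
    decoded = {
        "flags": [name for bit, name in FLAG_BITS.items()
                  if (status >> bit) & 1],
        "extended": K8_NB_EXT.get((status >> 16) & 0xF, "unknown"),
        "error_code": code,
    }
    if (code & 0xF800) == 0x0800:  # bit 11 set, bits 15:12 clear: bus error
        decoded["bus_error"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timed_out": bool((code >> 8) & 0x1),
            "request": (code >> 4) & 0xF,  # 0 is the generic request type
            "mem_or_io": MEM_IO[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return decoded


decoded = decode_status(STATUS)
for key, value in decoded.items():
    print(key, value)
```

The UC and PCC flags correspond to the "error uncorrected" (bit 61) and "processor context corrupt" (bit 57) lines in the mcelog output, and the low 16 bits decode to a generic bus error whose request timed out.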

2006-01-12 17:22:37

by Roger Heflin

Subject: RE: Help with machine check exception

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Orion Poplawski
> Sent: Thursday, January 12, 2006 10:30 AM
> To: [email protected]
> Subject: Help with machine check exception
>
> Can someone help determine the problem here? Does it
> definitely point to a bad CPU, or possibly a bad motherboard?
>
> Thanks!
>
> CPU 0: Machine Check Exception: 4 Bank 4:
> b200000000070f0f
> TSC 184fcd0553e4
> Kernel panic - not syncing: Machine check
>

If this is an Opteron, suspect the CPU or the memory. A DIMM
failing in just the right way will cause it, and I have seen a
CPU cause it. I don't know that I have seen a motherboard cause
it, and we have fixed a fair number of these errors. If it is
memory, it can be any of the DIMMs attached to that CPU.

I have also seen this error kill a machine at boot. That case
looks more like some state not being cleared properly, and it may
only affect much older 2.6 kernels. In that case the hardware is
not broken, and the error will not reproduce after a reboot.

Roger

2006-01-12 17:31:05

by Alan

Subject: Re: Help with machine check exception

On Thu, 2006-01-12 at 10:07 -0700, Orion Poplawski wrote:
> mcelog decode states:
>
> CPU 0 4 northbridge TSC 184fcd0553e4
> Northbridge Watchdog error
> bit57 = processor context corrupt
> bit61 = error uncorrected
> bus error 'generic participation, request timed out
> generic error mem transaction
> generic access, level generic'
> STATUS b200000000070f0f MCGSTATUS 4
> Kernel panic - not syncing: Machine check

Could be RAM, CPU or motherboard, or even a power glitch of course.

Before you panic I'd suggest that you check the machine is being
adequately cooled (especially the CPU) and that the RAM and CPU are all
well seated in their sockets.

memtest86+ will help test for memory problems and may be worth an
overnight run.

2006-01-12 17:52:27

by Orion Poplawski

Subject: Re: Help with machine check exception

Alan Cox wrote:
> On Thu, 2006-01-12 at 10:07 -0700, Orion Poplawski wrote:
>> mcelog decode states:
>>
>> CPU 0 4 northbridge TSC 184fcd0553e4
>> Northbridge Watchdog error
>> bit57 = processor context corrupt
>> bit61 = error uncorrected
>> bus error 'generic participation, request timed out
>> generic error mem transaction
>> generic access, level generic'
>> STATUS b200000000070f0f MCGSTATUS 4
>> Kernel panic - not syncing: Machine check
>
> Could be RAM, CPU or motherboard, or even a power glitch of course.
>
> Before you panic I'd suggest that you check the machine is being
> adequately cooled (especially the CPU) and that the RAM and CPU are all
> well seated in their sockets.
>
> memtest86+ will help test for memory problems and may be worth an
> overnight run.
>

Well, I've swapped memory with an identical machine and the problem
stayed with the original machine. The crash is fairly frequent (about
once every 1-2 days of operation).

adm1027-i2c-0-2e
Adapter: SMBus AMD8111 adapter at 10e0
V1.5: +2.601 V (min = +1.42 V, max = +1.58 V) ALARM
VCore: +1.304 V (min = +1.48 V, max = +1.63 V) ALARM
V3.3: +3.326 V (min = +3.13 V, max = +3.47 V)
V5: +5.117 V (min = +4.74 V, max = +5.26 V)
V12: +12.094 V (min = +11.38 V, max = +12.62 V)
CPU_Fan: 0 RPM (min = 4000 RPM) ALARM
fan2: 0 RPM (min = 0 RPM)
fan3: 0 RPM (min = 0 RPM)
fan4: 4981 RPM (min = 0 RPM)
CPU: +50.00°C (low = +10°C, high = +50°C)
Board: +34.00°C (low = +10°C, high = +35°C)
Remote: +52.50°C (low = +10°C, high = +35°C) ALARM
CPU_PWM: 255
Fan2_PWM: 255
Fan3_PWM: 255
vid: +1.550 V (VRM Version 9.1)

I would have expected 2 CPU temps (being dual-processor). Maybe Remote
is the second.
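
The voltage and fan lines in a dump like the one above can be checked against their own limits mechanically. This is a minimal Python sketch that assumes the exact line layout pasted in this message. The regular expression and the `out_of_range` helper are made up for this illustration and are not part of lm_sensors. A flagged reading still needs interpreting, since an alarm can come from a mislabelled or unconnected sensor as well as from a real fault.

```python
import re

# Voltage and fan lines copied from the sensors output in this message.
SAMPLE = """\
V1.5: +2.601 V (min = +1.42 V, max = +1.58 V) ALARM
VCore: +1.304 V (min = +1.48 V, max = +1.63 V) ALARM
V3.3: +3.326 V (min = +3.13 V, max = +3.47 V)
V5: +5.117 V (min = +4.74 V, max = +5.26 V)
V12: +12.094 V (min = +11.38 V, max = +12.62 V)
CPU_Fan: 0 RPM (min = 4000 RPM) ALARM
fan4: 4981 RPM (min = 0 RPM)
"""

# name: value unit (min = x unit[, max = y unit])
LINE = re.compile(
    r"^(?P<name>[\w.]+):\s*\+?(?P<value>[\d.]+)\s+(?:V|RPM)\s*"
    r"\(min = \+?(?P<min>[\d.]+)\s+(?:V|RPM)"
    r"(?:, max = \+?(?P<max>[\d.]+)\s+(?:V|RPM))?\)"
)


def out_of_range(text):
    """Return (sensor, reason) pairs for readings outside their limits."""
    problems = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m is None:
            continue  # not a voltage or fan line in the assumed layout
        value = float(m.group("value"))
        low = float(m.group("min"))
        high = float(m.group("max")) if m.group("max") else None
        if value < low:
            problems.append((m.group("name"), "below min"))
        elif high is not None and value > high:
            problems.append((m.group("name"), "above max"))
    return problems


for name, reason in out_of_range(SAMPLE):
    print(name, reason)
```

On the sample this flags V1.5, VCore and CPU_Fan, the same three non-temperature lines that carry ALARM in the original output.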

With 4 copies of burnK7:

machine with problems:

CPU: +71.25°C (low = +10°C, high = +50°C) ALARM
Board: +47.25°C (low = +10°C, high = +35°C) ALARM
Remote: +68.50°C (low = +10°C, high = +35°C) ALARM

machine without:

CPU: +61.25°C (low = +10°C, high = +50°C) ALARM
Board: +47.25°C (low = +10°C, high = +35°C) ALARM
Remote: +74.25°C (low = +10°C, high = +35°C) ALARM

So *maybe* cooling?

2006-01-12 19:08:43

by Roger Heflin

Subject: RE: Help with machine check exception

>
> I would have expected 2 CPU temps (being dual-processor).
> Maybe Remote is the second.
>
> With 4 copies of burnK7:
>
> machine with problems:
>
> CPU: +71.25°C (low = +10°C, high = +50°C) ALARM
> Board: +47.25°C (low = +10°C, high = +35°C) ALARM
> Remote: +68.50°C (low = +10°C, high = +35°C) ALARM
>
> machine without:
>
> CPU: +61.25°C (low = +10°C, high = +50°C) ALARM
> Board: +47.25°C (low = +10°C, high = +35°C) ALARM
> Remote: +74.25°C (low = +10°C, high = +35°C) ALARM
>
>
> So *maybe* cooling?
>

That is a little on the warm side. I believe AMD's posted limit
is 70C for most of their chips, assuming the temperature is
measured at the point the 70C limit applies to.
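
To put numbers on "a little on the warm side", the load readings quoted above can be compared with the 70C figure. This is only a sketch in Python: the limit is the value recalled in this reply and has not been checked against a datasheet, treating "Remote" as the second CPU is the guess made earlier in the thread, and `over_limit` is a hypothetical helper.

```python
LIMIT_C = 70.0  # figure recalled in this reply; not verified against a datasheet

# Readings under 4 copies of burnK7, copied from the quoted message.
READINGS = {
    "machine with problems": {"CPU": 71.25, "Board": 47.25, "Remote": 68.50},
    "machine without": {"CPU": 61.25, "Board": 47.25, "Remote": 74.25},
}


def over_limit(readings, limit=LIMIT_C, sensors=("CPU", "Remote")):
    """Map each machine to the sensors above the limit and by how much (C)."""
    return {
        machine: {s: round(temps[s] - limit, 2)
                  for s in sensors if temps[s] > limit}
        for machine, temps in readings.items()
    }


for machine, excess in over_limit(READINGS).items():
    print(machine, excess)
```

Under those assumptions each machine has one reading above 70C (CPU on the failing machine, Remote on the other), so these temperatures alone do not separate the two machines.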

Certain CPUs also seem to have more issues than others, so one CPU
out of a batch can be OK with a certain setup, while another from the
same batch will MCE under similar conditions.

Did you build the machines yourself or did you buy them this way?

Machines getting MCEs that often will fail the burn-in testing that
we use here.

And machines that produce those kinds of temps will also fail our
burn-in process, just because that seems a bit too warm.

Roger