Subject: Re: SMP + Hyperthreading / Asus PCDL Deluxe / Kernel 2.4.x 2.6.x / Crash/Freeze
From: Len Brown
To: Richard Browning
Cc: Zwane Mwaikambo, linux-kernel@vger.kernel.org, Venkatesh Pallipadi
In-Reply-To: <1079072878.3885.33.camel@dhcppc4>
References: <200403120022.13534.richard@redline.org.uk> <200403120042.32166.richard@redline.org.uk> <1079072878.3885.33.camel@dhcppc4>
Message-Id: <1079075236.3885.52.camel@dhcppc4>
Date: 12 Mar 2004 02:07:17 -0500
X-Mailing-List: linux-kernel@vger.kernel.org

Hmm, read that note too fast...

Since the failure did not follow the package to the BSP socket
(CPU0/CPU1), but instead stayed with the AP (CPU2/CPU3) socket, that
suggests an issue with the MB rather than the processor itself.

-Len

On Fri, 2004-03-12 at 01:27, Len Brown wrote:
> On Thu, 2004-03-11 at 19:42, Richard Browning wrote:
> > On Friday 12 March 2004 00:36, Zwane Mwaikambo wrote:
> > > On Fri, 12 Mar 2004, Richard Browning wrote:
> > > > > For my own curiosity, does switching the processors around do anything?
> > > > > Those MCEs look confined to the non bootstrap processor package.
> > > >
> > > > Switched CPUs. This time I get the following:
> > > >
> > > > CPU3: Machine Check Exception: 000.0004
> > > > CPU2: Machine Check Exception: 000.0004
> > > > Bank 0: a20000008c010400
> > > > Kernel Panic: CPU context corrupt
> > > > In idle task - not syncing
> > > >
> > > > Note that the CPU# designations are swapped and that there's only one
> > > > Bank 0: message. Is this significant?
> > >
> > > Ok, but that's still on the same package so it's not moving with the
> > > processor, thanks. Could you also supply processor info from
> > > /proc/cpuinfo.
> >
> > I suppose that's good (for me); indicates no hardware error?
>
> MCE == hardware error.
> In this case un-recoverable.
>
> I'll take a swing at decoding this, call the Coast Guard if I don't
> return in 30 minutes;-)
>
> http://developer.intel.com/design/pentium4/manuals/25366813.pdf
>
> > Machine Check Exception: 000.0004
>
> fig 14-4 says this means that indeed, you have a valid MCE.
>
> > Bank 0: a20000008c010400
>
> fig 14-6 says:
> 63: valid register contents
> 61: UC -- processor did not correct the error
> 57: PCC -- Processor context corrupt (you're dead)
>
> 0400 is the MCA error code
>
> fig E2 says
> 10 - internal watchdog timeout.
> 26,27 -- TT -- Thread timeout indicator -- both threads timed out
>
> > /proc/cpuinfo of course:
> >
> > processor : 0
> > vendor_id : GenuineIntel
> > cpu family : 15
> > model : 2
>
> I have no idea what causes this error, but it sure sounds specific to
> the processor, and specific to HT -- which matches your experiments.
> I'd imagine that after you verify that you've got the latest BIOS for
> the board and the error persists that you should look into getting that
> specific processor replaced.
>
> cheers,
> -Len
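
For anyone who wants to re-check the bit reading of the Bank 0 value in the quoted decode, here is a minimal Python sketch. It is not from the original thread: the function name `decode_mci_status` and the label strings are made up for illustration, and the split into a low 16-bit MCA error code and a 16-bit model-specific field above it is my assumption about the status register layout. Only the bits named in the message (63, 61, 57) are given labels; any other set bits are just listed.

```python
# Flag bits named in the message above (its "fig 14-6" reading).
# The label strings are illustrative, not kernel or manual output.
STATUS_FLAGS = {
    63: "VAL: valid register contents",
    61: "UC: processor did not correct the error",
    57: "PCC: processor context corrupt",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit bank status word into the fields discussed above."""
    return {
        # Named flags that are set, highest bit first.
        "flags": [label
                  for bit, label in sorted(STATUS_FLAGS.items(), reverse=True)
                  if (status >> bit) & 1],
        # Assumed layout: bits 15:0 hold the MCA error code.
        "mca_error_code": status & 0xFFFF,
        # Assumed layout: bits 31:16 hold the model-specific code.
        "model_specific_code": (status >> 16) & 0xFFFF,
        # Every set bit, so the reading can be checked by eye.
        "set_bits": [b for b in range(64) if (status >> b) & 1],
    }


decoded = decode_mci_status(0xA20000008C010400)
print(decoded["flags"])
print(hex(decoded["mca_error_code"]))
print(decoded["set_bits"])
```

Under those assumptions the set bits include 10, 26, 27, 57, 61 and 63, which lines up with the bits called out in the message. Bits 16 and 31 are also set; the message does not interpret them, so the sketch leaves them unlabeled.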