Date: Tue, 5 Aug 2003 15:42:41 +0200
From: Andi Kleen
To: Simon Garner
Cc: Andi Kleen , linux-kernel@vger.kernel.org
Subject: Re: MSI K8D-Master - GART error 3
Message-ID: <20030805134241.GA63394@colin2.muc.de>
In-Reply-To: <028101c35aea$d2753690$0401a8c0@SIMON>

On Tue, Aug 05, 2003 at 12:45:01PM +1200, Simon Garner wrote:
> Andi Kleen wrote:
> >
> > There is nothing in any of my trees that generates such a message.
> > If it was GART related it would be either "GART TLB error ..." or
> > "extended error gart error". But even that should not happen anymore,
> > see below.
> >
> > I don't know what the RedHat kernel does, they may have changed the
> > MCE handler over the reference port.
> >
>
> A quick google brings up this reference:
> http://www.iglu.org.il/lxr/source/arch/x86_64/kernel/bluesmoke.c

OK, that's the very old MCE code that incorrectly enabled the
northbridge machine check. Don't use that, or use mce=off.

However, I still think it's a driver bug in your case. If it was the
shaky GART MCE itself you would get a panic, because it's an
unrecoverable MCE. More likely the driver is accessing PCI DMA mappings
after they got unmapped, which is a serious bug, but somehow not serious
enough that the northbridge triggers the MCE.
I was confused by your statement that the SuSE 8.2 beta9 kernel
generated that. It didn't, because it doesn't contain that old code.
What does a modern kernel like the SuSE one or an x86-64.org kernel
generate exactly?

>
> The error appears to be generated by the code starting around line 152
> in that file.
>
> Btw, what is 'bluesmoke'?

Alan Cox's sense of humour. Look it up in the jargon file.

> > You can always disable it with mce=off or better mce=0
> > as the message seems to be caused by the periodic non fatal MCE check
> > timer.
> >
>
> What will I lose by disabling this?

mce=0 turns off periodic MCE checking for non-fatal errors. That's not
a big issue; the worst you lose is reporting of single-bit corrected ECC
memory failures.

mce=off turns off MCE reporting for fatal MCE exceptions (however, your
box may still crash when something really bad happens).

mce=0 should have turned off the periodic check, and your message very
much looks like a periodic one, as actual MCE exceptions report more
data. I'm a bit puzzled why it doesn't kill the message here. You can
try mce=off, but I'm not sure it will help either.

Using a newer kernel is probably a good idea anyway, as there have been
many bugfixes since then.

-Andi
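
For reference, the two boot options discussed above can be restated as a small lookup. This is only a sketch: the descriptions paraphrase this message and are not taken from the kernel source, and the function name and the sample command line are made up for illustration.

```python
# Sketch only: descriptions paraphrase the message above, not kernel source.
# describe_mce_option and the sample command line are hypothetical.
MCE_OPTIONS = {
    "0": "periodic check for non-fatal errors disabled",
    "off": "reporting of fatal MCE exceptions disabled",
}


def describe_mce_option(cmdline: str) -> str:
    """Say what the mce= option on a kernel command line turns off."""
    for token in cmdline.split():
        if token.startswith("mce="):
            value = token[len("mce="):]
            return MCE_OPTIONS.get(value, "unrecognised value: " + value)
    return "no mce= option given"


# Illustrative command line; a real one can be read from /proc/cmdline.
print(describe_mce_option("root=/dev/sda1 ro mce=0"))
```

On a running Linux system the contents of /proc/cmdline can be passed to the function instead of the sample string, which is a quick way to confirm that the option actually reached the kernel.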