Message-ID: <4DA8535B.4080901@gmail.com>
Date: Fri, 15 Apr 2011 10:16:59 -0400
From: Alexandre Demers
Reply-To: alexandre.f.demers@gmail.com
To: Joerg Roedel
CC: Linus Torvalds, "H. Peter Anvin", Yinghai Lu, Ingo Molnar, Alex Deucher, Linux Kernel Mailing List, "dri-devel@lists.freedesktop.org", Thomas Gleixner, Tejun Heo
Subject: Re: Linux 2.6.39-rc3
In-Reply-To: <20110415131152.GJ18463@8bytes.org>

On 11-04-15 09:11 AM, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
>> we definitely want to also understand the reason for things not
>> working, even if we do revert..
> Okay, here it is.
>
> After experimenting with different configurations for the north-bridge
> it turned out that a GART related MCE fires at the time the machine
> reboots. BIOSes configure the machine to sync-flood in that case which
> causes a reboot.
>
> After decoding the MCE it turned out to be a GART TLB Walk Error. Such
> errors can happen if devices (speculatively) access GART ranges mapped
> as invalid. The AMD BKDG for Fam10h CPUs recommends disabling these
> errors entirely. But unfortunately some BIOSes (including the one on my
> laptop) forget to do this.
>
> Below is a patch which disables these errors if the BIOS didn't do it.
> It fixes the problem on my side.
>
> Alexandre, can you try this patch on your machine too, please?
>
> Regards,
>
>	Joerg
>
> From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001
> From: Joerg Roedel
> Date: Fri, 15 Apr 2011 14:47:40 +0200
> Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it
>
> This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
> the BIOS forgets to do is (or is just too old). Letting
> these errors enabled can cause a sync-flood on the CPU
> causing a reboot.
>
> This patch is the fix for
>
> https://bugzilla.kernel.org/show_bug.cgi?id=33012
>
> on my machine.
>
> Signed-off-by: Joerg Roedel
> ---
>  arch/x86/include/asm/msr-index.h |    4 ++++
>  arch/x86/kernel/cpu/amd.c        |   19 +++++++++++++++++++
>  2 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index fd5a1f3..3cce714 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -96,11 +96,15 @@
>  #define MSR_IA32_MC0_ADDR		0x00000402
>  #define MSR_IA32_MC0_MISC		0x00000403
>
> +#define MSR_AMD64_MC0_MASK		0xc0010044
> +
>  #define MSR_IA32_MCx_CTL(x)		(MSR_IA32_MC0_CTL + 4*(x))
>  #define MSR_IA32_MCx_STATUS(x)		(MSR_IA32_MC0_STATUS + 4*(x))
>  #define MSR_IA32_MCx_ADDR(x)		(MSR_IA32_MC0_ADDR + 4*(x))
>  #define MSR_IA32_MCx_MISC(x)		(MSR_IA32_MC0_MISC + 4*(x))
>
> +#define MSR_AMD64_MCx_MASK(x)		(MSR_AMD64_MC0_MASK + (x))
> +
>  /* These are consecutive and not in the normal 4er MCE bank block */
>  #define MSR_IA32_MC0_CTL2		0x00000280
>  #define MSR_IA32_MCx_CTL2(x)		(MSR_IA32_MC0_CTL2 + (x))
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 3ecece0..3532d3b 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -615,6 +615,25 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
>  	/* As a rule processors have APIC timer running in deep C states */
>  	if (c->x86 >= 0xf && !cpu_has_amd_erratum(amd_erratum_400))
>  		set_cpu_cap(c, X86_FEATURE_ARAT);
> +
> +	/*
> +	 * Disable GART TLB Walk Errors on Fam10h. We do this here
> +	 * because this is always needed when GART is enabled, even in a
> +	 * kernel which has no MCE support built in.
> +	 */
> +	if (c->x86 == 0x10) {
> +		/*
> +		 * BIOS should disable GartTlbWlk Errors themself. If
> +		 * it doesn't do it here as suggested by the BKDG.
> +		 *
> +		 * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012
> +		 */
> +		u64 mask;
> +
> +		rdmsrl(MSR_AMD64_MCx_MASK(4), mask);
> +		mask |= (1 << 10);
> +		wrmsrl(MSR_AMD64_MCx_MASK(4), mask);
> +	}
>  }
>
>  #ifdef CONFIG_X86_32

Ok, I'll test it today. Should I apply it on a clean rc3 without any of
the other patches?

BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
This could be useful in the future.

I'll keep you up to date.

-- 
Alexandre Demers