Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1757402AbdIIINq (ORCPT ); Sat, 9 Sep 2017 04:13:46 -0400
Received: from ud10.udmedia.de ([194.117.254.50]:34324 "EHLO
        mail.ud10.udmedia.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1756989AbdIIINm (ORCPT );
        Sat, 9 Sep 2017 04:13:42 -0400
Date: Sat, 9 Sep 2017 10:13:38 +0200
From: Markus Trippelsdorf
To: Andy Lutomirski
Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, LKML,
        Ingo Molnar, Tom Lendacky
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf
Message-ID: <20170909081338.GB277@x4>
References: <20170908080536.ninspvplibd37fj2@pd.tnic>
        <20170908091614.nmdxjnukxowlsjja@pd.tnic>
        <20170908094815.GA278@x4>
        <20170908103513.npjmb2kcjt2zljb2@gmail.com>
        <20170908103906.GB278@x4>
        <20170908113039.GA285@x4>
        <20170908171633.GA279@x4>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4251
Lines: 105

On 2017.09.08 at 14:47 -0700, Andy Lutomirski wrote:
> On Fri, Sep 8, 2017 at 10:16 AM, Markus Trippelsdorf
> wrote:
> > On 2017.09.08 at 09:12 -0700, Andy Lutomirski wrote:
> >> On Fri, Sep 8, 2017 at 4:30 AM, Markus Trippelsdorf
> >> wrote:
> >> > On 2017.09.08 at 12:39 +0200, Markus Trippelsdorf wrote:
> >> >> On 2017.09.08 at 12:35 +0200, Ingo Molnar wrote:
> >> >> >
> >> >> > * Markus Trippelsdorf wrote:
> >> >> >
> >> >> > > On 2017.09.08 at 11:16 +0200, Borislav Petkov wrote:
> >> >> > > > On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
> >> >> > > > > On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
> >> >> > > > > > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
> >> >> > > > > >
> >> >> > > > > > CC+ Borislav.
> >> >> > > > > > He might have access to such a beast
> >> >> > > > >
> >> >> > > > > Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
> >> >> > > > > something similar?
> >> >> > > > >
> >> >> > > > > Private mail's fine too.
> >> >> > > >
> >> >> > > > So I don't have exactly your model - mine is model 2, stepping 3 but I see
> >> >> > > > something strange too, in dmesg:
> >> >> > >
> >> >> > > I'm pretty sure the bug is in the merged 'x86-mm-for-linus' branch:
> >> >> > > Either Andy's "PCID optimized TLB flushing" (would be my guess) or
> >> >> > > 'encrypted memory' support by Tom Lendacky.
> >> >> > >
> >> >> > > (Bisecting is hard, because sometimes I can compile stuff for over 15
> >> >> > > minutes without hitting the bug. At other times the machine locks up
> >> >> > > hard when starting X11 already.)
> >> >> >
> >> >> > Do you have the 72c0098d92ce fix?
> >> >>
> >> >> Yes. The bug still happens on the current git tree (which has the fix
> >> >> already):
> >> >
> >> > The bug is definitely caused by Andy Lutomirski's "PCID optimized TLB
> >> > flushing" patches. Tom is off the hook.
> >>
> >> I'm pretty sure it can't be PCID per se, since these CPUs are way too
> >> old and are very unlikely to have PCID.
> >
> > Yes, the CPU doesn't support PCID (but it does support PGE).
> >
> >> It could plausibly be the lazy TLB flushing changes.
> >
> > Yes, I've narrowed it down to:
> >
> > commit 94b1b03b519b81c494900cb112aa00ed205cc2d9
> > Author: Andy Lutomirski
> > Date:   Thu Jun 29 08:53:17 2017 -0700
> >
> >     x86/mm: Rework lazy TLB mode and TLB freshness tracking
> >
> >
> > Theoretically you guys should be able to reproduce the issue by using
> > the "nopcid" boot option.
> >
>
> Any chance you could test with CONFIG_DEBUG_VM=y? There are lots of
> potentially useful assertions in that code.

CONFIG_DEBUG_VM=y doesn't change anything. I still get the hard hang
without anything in the logs.

> Can you also post your /proc/cpuinfo?
> And can you re-confirm that a
> problematic guest kernel is causing problems in the *host*?

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : AMD Phenom(tm) II X4 955 Processor
stepping        : 2
microcode       : 0x10000db
cpu MHz         : 3210.960
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs            : tlb_mmatch apic_c1e fxsave_leak sysret_ss_attrs null_seg amd_e400
bogomips        : 6424.50
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Unfortunately I cannot reproduce the qemu (kvm) problem anymore.
(Perhaps I have not tried long enough).
Anyway, kvm has code that should handle erratum_383.

--
Markus