Date: Sat, 9 Sep 2017 21:09:48 +0200
From: Borislav Petkov
To: Linus Torvalds
Cc: Markus Trippelsdorf, Andy Lutomirski, Ingo Molnar, Thomas Gleixner,
    Peter Zijlstra, LKML, Ingo Molnar, Tom Lendacky
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf
Message-ID: <20170909190948.xydyega7i2rjnlqt@pd.tnic>
References: <20170909143335.ja2iwjsbeyfxz4ez@pd.tnic> <20170909144350.GA290@x4>
    <20170909163225.GA290@x4> <20170909170537.6xmxtzwripplhhwi@pd.tnic>
    <20170909172352.GA290@x4> <20170909173633.4ttfk7maooxkcwum@pd.tnic>
    <20170909181445.GA281@x4> <20170909182952.itqad4ryngjwrgqf@pd.tnic>

On Sat, Sep 09, 2017 at 11:47:33AM -0700, Linus Torvalds wrote:
> The thing is, even with the delayed TLB flushing, I don't think it
> should be *so* delayed that we should be seeing a TLB fill from
> garbage page tables.

Yeah, but we can't know what kind of speculative accesses happen between
the removal from the mask and the actual flushing.

> But the part in Andy's patch that worries me the most is that
>
> +       cpumask_clear_cpu(cpu, mm_cpumask(mm));
>
> in enter_lazy_tlb(). It means that we won't be notified by people
> invalidating the page tables, and while we then do re-validate the TLB
> when we switch back from lazy mode, I still worry. I'm not entirely
> convinced by that tlb_gen logic.
>
> I can't actually see anything *wrong* in the tlb_gen logic, but it worries me.

Yeah, sounds like we're uncovering a situation of possibly stale
mappings which we haven't had before. Or at least widening that window.

And I still need to analyze what that MCE on Markus' machine is saying
exactly. The TlbCacheDis thing is an optimization which does away with
memory type checks. But we probably will have to disable it on those
boxes as we can't guarantee pagetable elements are all in WB mem...

Or we can guarantee them in WB but the lazy flushing delays the actual
clearing of the TLB entries so much so that they end up pointing to
garbage, as you say, which is not in WB mem and thus causes the protocol
error.

Hmm. All still wet.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
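
To make the window discussed above a bit more concrete, here is a toy
userspace model of generation-based revalidation. It is only a sketch under
assumptions: struct toy_mm, struct toy_cpu and all toy_*() helpers are
hypothetical names invented for illustration, not the kernel's structures or
Andy's patch, and it models nothing about speculation or memory types. It
only shows that a CPU dropped from the mask misses invalidations and stays
stale until it revalidates on leaving lazy mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 4

/* Hypothetical stand-in for an mm: a generation counter plus a CPU mask. */
struct toy_mm {
	uint64_t tlb_gen;            /* bumped on every page-table invalidation */
	bool cpumask[TOY_NR_CPUS];   /* CPUs that get notified about flushes */
};

/* Hypothetical per-CPU state. */
struct toy_cpu {
	uint64_t seen_gen;           /* generation this CPU's TLB was last synced to */
	unsigned int flushes;        /* number of flushes this CPU performed */
};

/* Models the cpumask_clear_cpu() in the quoted hunk: stop notifications. */
static void toy_enter_lazy(struct toy_mm *mm, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	mm->cpumask[cpu] = false;
}

/* Page tables changed: bump the generation, flush only CPUs in the mask. */
static void toy_invalidate(struct toy_mm *mm, struct toy_cpu *cpus)
{
	mm->tlb_gen++;
	for (int i = 0; i < TOY_NR_CPUS; i++) {
		if (mm->cpumask[i]) {
			cpus[i].seen_gen = mm->tlb_gen;
			cpus[i].flushes++;
		}
	}
}

/* True while the CPU may still hold translations from an older generation. */
static bool toy_tlb_is_stale(const struct toy_mm *mm,
			     const struct toy_cpu *cpus, int cpu)
{
	return cpus[cpu].seen_gen != mm->tlb_gen;
}

/* Leaving lazy mode: rejoin the mask and catch up if a generation was missed. */
static void toy_leave_lazy(struct toy_mm *mm, struct toy_cpu *cpus, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	mm->cpumask[cpu] = true;
	if (cpus[cpu].seen_gen != mm->tlb_gen) {
		cpus[cpu].seen_gen = mm->tlb_gen;
		cpus[cpu].flushes++;
	}
}
```

In this model, everything between toy_enter_lazy() and toy_leave_lazy() is
the interval in which toy_tlb_is_stale() can return true for the lazy CPU.
That is the window the thread is worried about; whether real hardware can
actually do a fill from garbage in that interval is exactly what the model
does not answer.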