From: Andy Lutomirski
Date: Sat, 9 Sep 2017 12:28:30 -0700
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf
To: Borislav Petkov
Cc: Linus Torvalds, Markus Trippelsdorf, Andy Lutomirski, Ingo Molnar,
    Thomas Gleixner, Peter Zijlstra, LKML, Ingo Molnar, Tom Lendacky,
    Rik van Riel

On Sat, Sep 9, 2017 at 12:09 PM, Borislav Petkov wrote:
> On Sat, Sep 09, 2017 at 11:47:33AM -0700, Linus Torvalds wrote:
>> The thing is, even with the delayed TLB flushing, I don't think it
>> should be *so* delayed that we should be seeing a TLB fill from
>> garbage page tables.
>
> Yeah, but we can't know what kind of speculative accesses happen between
> the removal from the mask and the actual flushing.
>
>> But the part in Andy's patch that worries me the most is that
>>
>> +       cpumask_clear_cpu(cpu, mm_cpumask(mm));
>>
>> in enter_lazy_tlb(). It means that we won't be notified by people
>> invalidating the page tables, and while we then do re-validate the TLB
>> when we switch back from lazy mode, I still worry. I'm not entirely
>> convinced by that tlb_gen logic.
>>
>> I can't actually see anything *wrong* in the tlb_gen logic, but it
>> worries me.
>
> Yeah, sounds like we're uncovering a situation of possibly stale
> mappings which we haven't had before. Or at least widening that window.
>
> And I still need to analyze what that MCE on Markus' machine is saying
> exactly. The TlbCacheDis thing is an optimization which does away with
> memory type checks. But we probably will have to disable it on those
> boxes as we can't guarantee pagetable elements are all in WB mem...
>
> Or we can guarantee they are in WB mem, but the lazy flushing delays
> the actual clearing of the TLB entries so much that they end up
> pointing to garbage, as you say, which is not in WB mem and thus
> causes the protocol error.
>
> Hmm. All still wet.

I think it's my theory #3. The CPU has a "paging-structure cache"
(Intel lingo) that points to a freed page. The CPU speculatively
follows it and gets complete garbage, triggering this MCE and who
knows what else.

I propose the following fix. If PCID is on, then, in enter_lazy_tlb(),
we switch to init_mm with the no-flush flag set. (And we give init_mm
its own dedicated ASID to keep it simple and fast -- no need to use the
LRU ASID mapping to assign one dynamically.) We clear the bit in
mm_cpumask. That is, we more or less just skip the whole lazy TLB
optimization and rely on PCID CPUs having reasonably fast CR3 writes.
No extra IPIs.
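Something like this untested sketch (INIT_ASID and CR3_NOFLUSH are
placeholder names here for init_mm's dedicated ASID and for CR3 bit 63;
the real code would spell these differently):

	void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
	{
		int cpu = smp_processor_id();

		if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
			return;

		if (static_cpu_has(X86_FEATURE_PCID)) {
			/* Stop getting flush notifications for this mm... */
			cpumask_clear_cpu(cpu, mm_cpumask(mm));

			/*
			 * ...and point CR3 at init_mm under its dedicated
			 * ASID.  Bit 63 (no-flush) set means init_mm's
			 * cached translations survive the write, and the
			 * old mm's paging structures are no longer
			 * reachable from CR3, so nothing can speculatively
			 * follow them.
			 */
			write_cr3(__pa(init_mm.pgd) | INIT_ASID | CR3_NOFLUSH);
			this_cpu_write(cpu_tlbstate.loaded_mm, &init_mm);
		}
		/*
		 * !PCID: leave CR3 and the mm_cpumask bit alone; we keep
		 * taking flush IPIs and only switch to init_mm (with a
		 * full flush) when one arrives.
		 */
	}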
I suppose I need to benchmark this. It will certainly slow down
workloads that rapidly toggle between a user thread and a kernel
thread, because it forces serialization on each mm switch, but maybe
that's not so bad.

If PCID is off, then we leave the old CR3 value in place when we go
lazy, and we also leave the bit in mm_cpumask set. When a flush is
requested, we send out the IPI and switch to init_mm (and flush,
because we have no choice). IOW, the no-PCID behavior goes back to
what it used to be.

For the PCID case, I'm relying on this language in the SDM (vol 3,
4.10):

  "When a logical processor creates entries in the TLBs (Section
  4.10.2) and paging-structure caches (Section 4.10.3), it associates
  those entries with the current PCID. When using entries in the TLBs
  and paging-structure caches to translate a linear address, a logical
  processor uses only those entries associated with the current PCID
  (see Section 4.10.2.4 for an exception)."

This is also just common sense -- a CPU that makes any assumptions
about a paging-structure cache for an inactive ASID is just nuts,
especially if it assumes that the result of following it is at all
sane. IOW, we really should be able to switch to ASID 1 and back to
ASID 0 without any flushes, without worrying that the old page tables
for ASID 1 might get freed afterwards. Obviously we need to flush if
we switch back to PCID 1, but the code already does this.

Also, sorry Rik, this means your old increased laziness optimization
is dead in the water. It will have exactly the same speculative load
problem.
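P.S. To make the PCID part concrete, the CR3 dance would look roughly
like this (untested; INIT_ASID, CR3_NOFLUSH, and prev_asid are again
placeholder names, and prev is the mm we come back to):

	/* Going lazy: init_mm under its own ASID, flushing nothing. */
	write_cr3(__pa(init_mm.pgd) | INIT_ASID | CR3_NOFLUSH);

	/*
	 * Coming back: prev's cached translations are only usable if we
	 * didn't miss an invalidation while the mm_cpumask bit was
	 * clear.  The generation counters tell us.
	 */
	if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
	    atomic64_read(&prev->context.tlb_gen))
		write_cr3(__pa(prev->pgd) | prev_asid | CR3_NOFLUSH);
	else
		/* Bit 63 clear: flushes everything tagged prev_asid. */
		write_cr3(__pa(prev->pgd) | prev_asid);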