From: Andy Lutomirski
Date: Thu, 25 Jan 2018 09:04:21 -0800
Subject: Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
To: Peter Zijlstra
Cc: Arjan van de Ven, Tim Chen, LKML, KarimAllah Ahmed, Andi Kleen,
    Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
    Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
    Greg Kroah-Hartman, "H . Peter Anvin", Ingo Molnar,
    Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
    Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Radim Krcmar,
    Thomas Gleixner, Tom Lendacky, X86 ML

On Thu, Jan 25, 2018 at 8:41 AM, Peter Zijlstra wrote:
> On Thu, Jan 25, 2018 at 06:07:07AM -0800, Arjan van de Ven wrote:
>> On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
>> > On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
>> > > >
>> > > > This means that 'A -> idle -> A' should never pass through switch_mm to
>> > > > begin with.
>> > > >
>> > > > Please clarify how you think it does.
>> > > >
>> > >
>> > > the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
>> > > for a tlb flush.
>> >
>> > The intel_idle code does, not the idle code. This is squirreled away in
>> > some driver :/
>>
>> afaik (but haven't looked in a while) acpi drivers did too
>
> Only makes it worse.. drivers shouldn't be frobbing with things like
> this.
>
>> > > (trust me, that you really want, sequentially IPI's a pile of cores in a deep sleep
>> > > state to just flush a tlb that's empty, the performance of that is horrific)
>> >
>> > Hurmph.
>> > I'd rather fix that some other way than leave_mm(), this is
>> > piling special on special.
>> >
>> the problem was tricky. but of course if something better is possible lets figure this out
>
> How about something like the below? It boots with "nopcid" appended to
> the cmdline.
>
> Andy, could you pretty please have a look at this? This is fickle code
> at best and I'm sure I messed _something_ up.
>
> The idea is simple, do what we do for virt. Don't send IPI's to CPUs
> that don't need them (in virt's case because the vCPU isn't running, in
> our case because we're not in fact running a user process), but mark the
> CPU as having needed a TLB flush.

I haven't tried to fully decipher the patch, but I think the idea is
wrong.  (I think it's the same wrong idea that Rik and I both had and
that I got into Linus' tree for a while...)  The problem is that it's
not actually correct to run indefinitely in kernel mode using stale
cached page table data.  The stale PTEs themselves are fine, but the
stale intermediate translations can cause the CPU to speculatively
load complete garbage into the TLB, and that's bad (and causes MCEs on
AMD CPUs).

I think we only really have two choices: tlb_defer_switch_to_init_mm()
== true and tlb_defer_switch_to_init_mm() == false.  The current
heuristic is to not defer if we have PCID, because loading CR3 is
reasonably fast.
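The heuristic in the paragraph above can be sketched as a small predicate. This is a userspace illustration only, not the kernel's code: `defer_switch_to_init_mm()`, `idle_policy()` and the `idle_mm_policy` enum are hypothetical names, and the PCID check is passed in as a plain boolean instead of being read from the CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two choices named in the mail. */
enum idle_mm_policy {
	IDLE_SWITCH_TO_INIT_MM,	/* do not defer: load init_mm when going idle */
	IDLE_STAY_LAZY,		/* defer: keep the previous mm loaded while idle */
};

/*
 * Assumed model of the heuristic: with PCID a CR3 load is reasonably
 * fast, so there is little to gain from deferring the switch.
 */
bool defer_switch_to_init_mm(bool cpu_has_pcid)
{
	return !cpu_has_pcid;
}

enum idle_mm_policy idle_policy(bool cpu_has_pcid)
{
	enum idle_mm_policy p = defer_switch_to_init_mm(cpu_has_pcid)
		? IDLE_STAY_LAZY : IDLE_SWITCH_TO_INIT_MM;

	/* Only the two outcomes above exist in this model. */
	assert(p == IDLE_STAY_LAZY || p == IDLE_SWITCH_TO_INIT_MM);
	return p;
}
```

Under this model, booting with "nopcid" (as in the quoted test) selects the deferring branch, which is the only branch the proposed patch changes.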

>  void native_flush_tlb_others(const struct cpumask *cpumask,
>  			     const struct flush_tlb_info *info)
>  {
> +	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__tlb_mask);
> +
>  	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
>  	if (info->end == TLB_FLUSH_ALL)
>  		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
> @@ -531,6 +543,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  					       (void *)info, 1);
>  		return;
>  	}
> +
> +	if (tlb_defer_switch_to_init_mm() && flushmask) {
> +		int cpu;
> +
> +		cpumask_copy(flushmask, cpumask);
> +		for_each_cpu(cpu, flushmask) {
> +			if (cmpxchg(per_cpu_ptr(&cpu_tlbstate.is_lazy, cpu), 1, 2) >= 1)
> +				__cpumask_clear_cpu(cpu, flushmask);

If this code path here executes and we're flushing because we just
removed a reference to a page table and we're about to free the page
table, then the CPU that we didn't notify by IPI can start using
whatever gets written to the pagetable after it's freed, and that's
bad :(

> +		}
> +
> +		cpumask = flushmask;
> +	}
> +
>  	smp_call_function_many(cpumask, flush_tlb_func_remote,
>  			       (void *)info, 1);
>  }
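
The marking loop in the quoted hunk, and the objection to it, can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not kernel code: the per-CPU `is_lazy` field is replaced by a plain array, the cpumask by a 32-bit bitmap, and the meaning of the values (0 = not lazy, 1 = lazy, 2 = lazy with a flush pending) is inferred from the `cmpxchg(..., 1, 2) >= 1` test rather than stated in the patch. `cmpxchg_int()`, `filter_ipi_targets()` and `cpus_skipped()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NCPUS 8

/* Assumed encoding of the per-CPU state, inferred from the patch. */
enum { NOT_LAZY = 0, LAZY = 1, LAZY_NEEDS_FLUSH = 2 };

_Atomic int is_lazy[NCPUS];

/* Like the kernel's cmpxchg(): always returns the previous value. */
int cmpxchg_int(_Atomic int *p, int old, int new_val)
{
	int expected = old;

	/* On failure, expected is overwritten with the current value. */
	atomic_compare_exchange_strong(p, &expected, new_val);
	return expected;
}

/* Drop lazy CPUs from the IPI mask, marking them as needing a flush. */
uint32_t filter_ipi_targets(uint32_t cpumask)
{
	uint32_t flushmask = cpumask;

	assert((cpumask >> NCPUS) == 0);
	for (int cpu = 0; cpu < NCPUS; cpu++) {
		if (!(flushmask & (1u << cpu)))
			continue;
		if (cmpxchg_int(&is_lazy[cpu], LAZY, LAZY_NEEDS_FLUSH) >= LAZY)
			flushmask &= ~(1u << cpu);
	}
	return flushmask;
}

/*
 * CPUs that were asked to flush but received no IPI. If the caller now
 * frees a page table, these CPUs may still hold cached intermediate
 * translations that point into the freed page.
 */
uint32_t cpus_skipped(uint32_t requested, uint32_t sent)
{
	return requested & ~sent;
}
```

With CPU 0 running a user process and CPUs 1 and 2 lazy, a flush request for all three sends an IPI only to CPU 0. A nonzero result from `cpus_skipped()` is exactly the set of CPUs the inline comment is worried about: nothing in the model stops them from walking the page table after it has been freed and reused.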