Date: Thu, 25 Jan 2018 17:41:39 +0100
From: Peter Zijlstra
To: Arjan van de Ven
Cc: Tim Chen, linux-kernel@vger.kernel.org, KarimAllah Ahmed, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, "H. Peter Anvin", Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, rkrcmar@redhat.com,
	Thomas Gleixner, Tom Lendacky, x86@kernel.org
Subject: Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
Message-ID: <20180125164139.GM2269@hirez.programming.kicks-ass.net>
References: <20180125085820.GV2228@hirez.programming.kicks-ass.net>
	<20180125092233.GE2295@hirez.programming.kicks-ass.net>
	<86541aca-8de7-163d-b620-083dddf29184@linux.intel.com>
	<20180125135055.GK2249@hirez.programming.kicks-ass.net>
User-Agent: Mutt/1.9.2 (2017-12-15)

On Thu, Jan 25, 2018 at 06:07:07AM -0800, Arjan van de Ven wrote:
> On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
> > On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
> > >
> > > > This means that 'A -> idle -> A' should never pass through switch_mm to
> > > > begin with.
> > > >
> > > > Please clarify how you think it does.
> > >
> > > the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep
> > > states for a tlb flush.
> >
> > The intel_idle code does, not the idle code. This is squirreled away in
> > some driver :/
>
> afaik (but haven't looked in a while) acpi drivers did too

Only makes it worse.. drivers shouldn't be frobbing with things like this.

> > > (trust me, you really don't want to sequentially IPI a pile of cores in
> > > a deep sleep state just to flush a TLB that's empty; the performance of
> > > that is horrific)
> >
> > Hurmph. I'd rather fix that some other way than leave_mm(), this is
> > piling special on special.
>
> the problem was tricky. but of course if something better is possible
> lets figure this out

How about something like the below? It boots with "nopcid" appended to
the cmdline.

Andy, could you pretty please have a look at this? This is fickle code
at best and I'm sure I messed _something_ up.

The idea is simple: do what we do for virt. Don't send IPIs to CPUs that
don't need them (in virt's case because the vCPU isn't running, in our
case because we're not in fact running a user process), but mark the CPU
as needing a TLB flush. Then, when we leave the special mode (switching
back to a user task), check the flag and invalidate TLBs if required.

All of this is conditional on tlb_defer_switch_to_init_mm() (aka !PCID),
because otherwise we already switch to init_mm under the hood (we retain
active_mm) unconditionally.

This way active_mm is preserved and we can use something like:

	if (prev != next)
		ibpb();

in switch_mm_irqs_off().
---
 arch/x86/include/asm/acpi.h        |  2 --
 arch/x86/include/asm/mmu_context.h |  1 +
 arch/x86/include/asm/tlbflush.h    |  4 +++-
 arch/x86/mm/tlb.c                  | 38 +++++++++++++++++++++++++++++++++++---
 drivers/acpi/processor_idle.c      |  2 --
 drivers/idle/intel_idle.c          |  8 --------
 kernel/sched/core.c                | 28 +++++++++++++++-------------
 7 files changed, 54 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 8d0ec9df1cbe..72d867f6b518 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -150,8 +150,6 @@ static inline void disable_acpi(void) { }
 extern int x86_acpi_numa_init(void);
 #endif /* CONFIG_ACPI_NUMA */
 
-#define acpi_unlazy_tlb(x)	leave_mm(x)
-
 #ifdef CONFIG_ACPI_APEI
 static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
 {
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c931b88982a0..009c6a450e70 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -180,6 +180,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 }
 
 void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
+void leave_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3effd3c994af..948c0997e6ab 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -189,8 +189,10 @@ struct tlb_state {
 	 * We're heuristically guessing that the CR3 load we
 	 * skipped more than makes up for the overhead added by
 	 * lazy mode.
+	 *
+	 * XXX
 	 */
-	bool is_lazy;
+	u8 is_lazy;
 
 	/*
 	 * If set we changed the page tables in such a way that we
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a1561957dccb..f94767d16b24 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -139,7 +139,6 @@ void leave_mm(int cpu)
 
 	switch_mm(NULL, &init_mm, NULL);
 }
-EXPORT_SYMBOL_GPL(leave_mm);
 
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
@@ -304,12 +303,21 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 		 * old mm loaded and only switch to init_mm when
 		 * tlb_remove_page() happens.
 		 */
-		this_cpu_write(cpu_tlbstate.is_lazy, true);
+		this_cpu_write(cpu_tlbstate.is_lazy, 1);
 	} else {
-		switch_mm(NULL, &init_mm, NULL);
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
 	}
 }
 
+void leave_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (!tlb_defer_switch_to_init_mm())
+		return;
+
+	if (xchg(this_cpu_ptr(&cpu_tlbstate.is_lazy), 0) == 2)
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
+}
+
 /*
  * Call this when reinitializing a CPU.  It fixes the following potential
  * problems:
@@ -496,9 +504,13 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
+static DEFINE_PER_CPU(cpumask_var_t, __tlb_mask);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
+	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__tlb_mask);
+
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -531,6 +543,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 					       (void *)info, 1);
 		return;
 	}
+
+	if (tlb_defer_switch_to_init_mm() && flushmask) {
+		int cpu;
+
+		cpumask_copy(flushmask, cpumask);
+		for_each_cpu(cpu, flushmask) {
+			if (cmpxchg(per_cpu_ptr(&cpu_tlbstate.is_lazy, cpu), 1, 2) >= 1)
+				__cpumask_clear_cpu(cpu, flushmask);
+		}
+
+		cpumask = flushmask;
+	}
+
 	smp_call_function_many(cpumask, flush_tlb_func_remote,
 			       (void *)info, 1);
 }
@@ -688,6 +713,13 @@ static const struct file_operations fops_tlbflush = {
 
 static int __init create_tlb_single_page_flush_ceiling(void)
 {
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		zalloc_cpumask_var_node(per_cpu_ptr(&__tlb_mask, cpu),
+					GFP_KERNEL, cpu_to_node(cpu));
+	}
+
 	debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
 			    arch_debugfs_dir, NULL, &fops_tlbflush);
 	return 0;
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index d50a7b6ccddd..2736e25e9dc6 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -710,8 +710,6 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
 static void acpi_idle_enter_bm(struct acpi_processor *pr,
 			       struct acpi_processor_cx *cx, bool timer_bc)
 {
-	acpi_unlazy_tlb(smp_processor_id());
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index f0b06b14e782..920e719156db 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -913,17 +913,9 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 	struct cpuidle_state *state = &drv->states[index];
 	unsigned long eax = flg2MWAIT(state->flags);
 	unsigned int cstate;
-	int cpu = smp_processor_id();
 
 	cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 
-	/*
-	 * leave_mm() to avoid costly and often unnecessary wakeups
-	 * for flushing the user TLB's associated with the active mm.
-	 */
-	if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
-		leave_mm(cpu);
-
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		tick_broadcast_enter();
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..74c51da7301a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2750,12 +2750,8 @@ static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
-	struct mm_struct *mm, *oldmm;
-
 	prepare_task_switch(rq, prev, next);
 
-	mm = next->mm;
-	oldmm = prev->active_mm;
 	/*
 	 * For paravirt, this is coupled with an exit in switch_to to
 	 * combine the page table reload and the switch backend into
@@ -2763,16 +2759,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 */
 	arch_start_context_switch(prev);
 
-	if (!mm) {
-		next->active_mm = oldmm;
-		mmgrab(oldmm);
-		enter_lazy_tlb(oldmm, next);
-	} else
-		switch_mm_irqs_off(oldmm, mm, next);
+	if (!next->mm) {				/* to kernel */
+		next->active_mm = prev->active_mm;
+
+		if (prev->mm) {				/* from user */
+			enter_lazy_tlb(prev->active_mm, next);
+			mmgrab(prev->active_mm);
+		}
+	} else {					/* to user */
+		if (!prev->mm) {			/* from kernel */
+			/* will mmdrop() in finish_task_switch(). */
+			leave_lazy_tlb(prev->active_mm, NULL);
+			rq->prev_mm = prev->active_mm;
+			prev->active_mm = NULL;
+		}
 
-	if (!prev->mm) {
-		prev->active_mm = NULL;
-		rq->prev_mm = oldmm;
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 	}
 
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);