Date: Thu, 25 Jan 2018 17:41:39 +0100
From: Peter Zijlstra
To: Arjan van de Ven
Cc: Tim Chen, linux-kernel@vger.kernel.org, KarimAllah Ahmed, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, "H. Peter Anvin", Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, rkrcmar@redhat.com,
	Thomas Gleixner, Tom Lendacky, x86@kernel.org
Subject: Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
Message-ID: <20180125164139.GM2269@hirez.programming.kicks-ass.net>
References: <20180125085820.GV2228@hirez.programming.kicks-ass.net>
	<20180125092233.GE2295@hirez.programming.kicks-ass.net>
	<86541aca-8de7-163d-b620-083dddf29184@linux.intel.com>
	<20180125135055.GK2249@hirez.programming.kicks-ass.net>
User-Agent: Mutt/1.9.2 (2017-12-15)

On Thu, Jan 25, 2018 at 06:07:07AM -0800, Arjan van de Ven wrote:
> On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
> > On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
> > >
> > > > This means that 'A -> idle -> A' should never pass through switch_mm to
> > > > begin with.
> > > >
> > > > Please clarify how you think it does.
> > >
> > > the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep
> > > states for a tlb flush.
> >
> > The intel_idle code does, not the idle code. This is squirreled away in
> > some driver :/
>
> afaik (but haven't looked in a while) acpi drivers did too

Only makes it worse.. drivers shouldn't be frobbing with things like this.

> > > (trust me, you really don't want to sequentially IPI a pile of cores in
> > > a deep sleep state just to flush a TLB that's empty; the performance of
> > > that is horrific)
> >
> > Hurmph. I'd rather fix that some other way than leave_mm(), this is
> > piling special on special.
>
> the problem was tricky. but of course if something better is possible
> lets figure this out

How about something like the below? It boots with "nopcid" appended to
the cmdline.

Andy, could you pretty please have a look at this? This is fickle code
at best and I'm sure I messed _something_ up.

The idea is simple: do what we do for virt. Don't send IPIs to CPUs that
don't need them (in virt's case because the vCPU isn't running, in our
case because we're not in fact running a user process), but mark the CPU
as needing a TLB flush. Then, when we leave the special mode (switching
back to a user task), check the flag and invalidate TLBs if required.

All of this is conditional on tlb_defer_switch_to_init_mm() (aka !PCID),
because otherwise we already switch to init_mm under the hood (we retain
active_mm) unconditionally.

This way active_mm is preserved and we can use something like:

	if (prev != next)
		ibpb();

in switch_mm_irqs_off().
---
 arch/x86/include/asm/acpi.h        |  2 --
 arch/x86/include/asm/mmu_context.h |  1 +
 arch/x86/include/asm/tlbflush.h    |  4 +++-
 arch/x86/mm/tlb.c                  | 38 +++++++++++++++++++++++++++++++++++---
 drivers/acpi/processor_idle.c      |  2 --
 drivers/idle/intel_idle.c          |  8 --------
 kernel/sched/core.c                | 28 +++++++++++++++-------------
 7 files changed, 54 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 8d0ec9df1cbe..72d867f6b518 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -150,8 +150,6 @@ static inline void disable_acpi(void) { }
 extern int x86_acpi_numa_init(void);
 #endif /* CONFIG_ACPI_NUMA */
 
-#define acpi_unlazy_tlb(x)	leave_mm(x)
-
 #ifdef CONFIG_ACPI_APEI
 static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
 {
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c931b88982a0..009c6a450e70 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -180,6 +180,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 }
 
 void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
+void leave_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3effd3c994af..948c0997e6ab 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -189,8 +189,10 @@ struct tlb_state {
 	 * We're heuristically guessing that the CR3 load we
 	 * skipped more than makes up for the overhead added by
 	 * lazy mode.
+	 *
+	 * XXX
 	 */
-	bool is_lazy;
+	u8 is_lazy;
 
 	/*
 	 * If set we changed the page tables in such a way that we
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a1561957dccb..f94767d16b24 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -139,7 +139,6 @@ void leave_mm(int cpu)
 
 	switch_mm(NULL, &init_mm, NULL);
 }
-EXPORT_SYMBOL_GPL(leave_mm);
 
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
@@ -304,12 +303,21 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 		 * old mm loaded and only switch to init_mm when
 		 * tlb_remove_page() happens.
 		 */
-		this_cpu_write(cpu_tlbstate.is_lazy, true);
+		this_cpu_write(cpu_tlbstate.is_lazy, 1);
 	} else {
-		switch_mm(NULL, &init_mm, NULL);
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
 	}
 }
 
+void leave_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (!tlb_defer_switch_to_init_mm())
+		return;
+
+	if (xchg(this_cpu_ptr(&cpu_tlbstate.is_lazy), 0) == 2)
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
+}
+
 /*
  * Call this when reinitializing a CPU.  It fixes the following potential
  * problems:
@@ -496,9 +504,13 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
+static DEFINE_PER_CPU(cpumask_var_t, __tlb_mask);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
+	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__tlb_mask);
+
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -531,6 +543,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 					       (void *)info, 1);
 		return;
 	}
+
+	if (tlb_defer_switch_to_init_mm() && flushmask) {
+		int cpu;
+
+		cpumask_copy(flushmask, cpumask);
+		for_each_cpu(cpu, flushmask) {
+			if (cmpxchg(per_cpu_ptr(&cpu_tlbstate.is_lazy, cpu), 1, 2) >= 1)
+				__cpumask_clear_cpu(cpu, flushmask);
+		}
+
+		cpumask = flushmask;
+	}
+
 	smp_call_function_many(cpumask, flush_tlb_func_remote,
 			       (void *)info, 1);
 }
@@ -688,6 +713,13 @@ static const struct file_operations fops_tlbflush = {
 
 static int __init create_tlb_single_page_flush_ceiling(void)
 {
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		zalloc_cpumask_var_node(per_cpu_ptr(&__tlb_mask, cpu),
+					GFP_KERNEL, cpu_to_node(cpu));
+	}
+
 	debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
 			    arch_debugfs_dir, NULL, &fops_tlbflush);
 	return 0;
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index d50a7b6ccddd..2736e25e9dc6 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -710,8 +710,6 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
 static void acpi_idle_enter_bm(struct acpi_processor *pr,
 			       struct acpi_processor_cx *cx, bool timer_bc)
 {
-	acpi_unlazy_tlb(smp_processor_id());
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index f0b06b14e782..920e719156db 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -913,17 +913,9 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 	struct cpuidle_state *state = &drv->states[index];
 	unsigned long eax = flg2MWAIT(state->flags);
 	unsigned int cstate;
-	int cpu = smp_processor_id();
 
 	cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 
-	/*
-	 * leave_mm() to avoid costly and often unnecessary wakeups
-	 * for flushing the user TLB's associated with the active mm.
-	 */
-	if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
-		leave_mm(cpu);
-
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		tick_broadcast_enter();
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..74c51da7301a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2750,12 +2750,8 @@ static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
-	struct mm_struct *mm, *oldmm;
-
 	prepare_task_switch(rq, prev, next);
 
-	mm = next->mm;
-	oldmm = prev->active_mm;
 	/*
 	 * For paravirt, this is coupled with an exit in switch_to to
 	 * combine the page table reload and the switch backend into
@@ -2763,16 +2759,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 */
 	arch_start_context_switch(prev);
 
-	if (!mm) {
-		next->active_mm = oldmm;
-		mmgrab(oldmm);
-		enter_lazy_tlb(oldmm, next);
-	} else
-		switch_mm_irqs_off(oldmm, mm, next);
+	if (!next->mm) {				/* to kernel */
+		next->active_mm = prev->active_mm;
+
+		if (prev->mm) {				/* from user */
+			enter_lazy_tlb(prev->active_mm, next);
+			mmgrab(prev->active_mm);
+		}
+	} else {					/* to user */
+		if (!prev->mm) {			/* from kernel */
+			/* will mmdrop() in finish_task_switch(). */
+			leave_lazy_tlb(prev->active_mm, NULL);
+			rq->prev_mm = prev->active_mm;
+			prev->active_mm = NULL;
+		}
 
-	if (!prev->mm) {
-		prev->active_mm = NULL;
-		rq->prev_mm = oldmm;
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 	}
 
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);