Linus asked me to come up with a smaller patch set to get the benefits
of lazy TLB mode, so I spent some time trying out various permutations
of the code, with a few workloads that do lots of context switches, and
also happen to have a fair number of TLB flushes a second.
Both of the workloads tested are memcache-style workloads, running
on two-socket systems. One of the workloads has around 300,000
context switches a second, and around 19,000 TLB flushes a second.
The first patch in the series, which makes lazy TLB mode
unconditional, reduces CPU use by around 1% on both Haswell and
Broadwell systems.
The rest of the series reduces the number of TLB flush IPIs by
about 1,500 a second, resulting in a 0.2% reduction in CPU use,
on top of the 1% seen by just enabling lazy TLB mode.
This is the low-hanging fruit in the context switch code.
The big thing remaining is the reference count overhead of
the lazy TLB mm_struct, but getting rid of that is rather a
lot of code for a small performance gain. Not quite what
Linus asked for :)
Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.
Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.
The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.
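Roughly, the pairing looks like this (a simplified sketch of the code in
the patch below, not the literal diff):

    /* TLB invalidation side (inc_mm_tlb_gen): */
    atomic64_inc_return(&mm->context.tlb_gen);  /* implies a full barrier */
    /* ...then check cpu_tlbstate.is_lazy to decide whether to send an IPI */

    /* context switch side (switch_mm_irqs_off): */
    was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
    /* ... */
    smp_mb();   /* pairs with the barrier implied by the increment above */
    next_tlb_gen = atomic64_read(&next->context.tlb_gen);
    if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == next_tlb_gen)
        return; /* TLB still up to date, no flush needed */

The intent is that either the flusher sees the CPU as non-lazy and sends it
the IPI, or the CPU sees the incremented tlb_gen on its way out of lazy
mode and flushes its own TLB.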
CPUs in lazy TLB mode remain part of the mm_cpumask(mm), both because
that allows TLB flush IPIs to be sent at page table freeing time, and
because the cache line bouncing on the mm_cpumask(mm) was responsible
for about half the CPU use in switch_mm_irqs_off().
Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 67 ++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 58 insertions(+), 9 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5d73bde0e4c8..280b4cc4c28c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -156,6 +156,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
{
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
bool need_flush;
@@ -207,17 +208,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
next->context.ctx_id);
/*
- * We don't currently support having a real mm loaded without
- * our cpu set in mm_cpumask(). We have all the bookkeeping
- * in place to figure out whether we would need to flush
- * if our cpu were cleared in mm_cpumask(), but we don't
- * currently use it.
+ * Even in lazy TLB mode, the CPU should stay set in the
+ * mm_cpumask. The TLB shootdown code can figure out
+ * from cpu_tlbstate.is_lazy whether or not to send an IPI.
*/
if (WARN_ON_ONCE(real_prev != &init_mm &&
!cpumask_test_cpu(cpu, mm_cpumask(next))))
cpumask_set_cpu(cpu, mm_cpumask(next));
- return;
+ /*
+ * If the CPU is not in lazy TLB mode, we are just switching
+ * from one thread in a process to another thread in the same
+ * process. No TLB flush required.
+ */
+ if (!was_lazy)
+ return;
+
+ /*
+ * Read the tlb_gen to check whether a flush is needed.
+ * If the TLB is up to date, just use it.
+ * The barrier synchronizes with the tlb_gen increment in
+ * the TLB shootdown code.
+ */
+ smp_mb();
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+ if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
+ next_tlb_gen)
+ return;
+
+ /*
+ * TLB contents went out of date while we were in lazy
+ * mode. Fall through to the TLB switching code below.
+ */
+ new_asid = prev_asid;
+ need_flush = true;
} else {
u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);
@@ -308,8 +332,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, next);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
- load_mm_cr4(next);
- switch_ldt(real_prev, next);
+ if (next != real_prev) {
+ load_mm_cr4(next);
+ switch_ldt(real_prev, next);
+ }
}
/*
@@ -416,6 +442,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* paging-structure cache to avoid speculatively reading
* garbage into our TLB. Since switching to init_mm is barely
* slower than a minimal flush, just switch to init_mm.
+ *
+ * This should be rare, with native_flush_tlb_others skipping
+ * IPIs to lazy TLB mode CPUs.
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
return;
@@ -519,6 +548,11 @@ static void flush_tlb_func_remote(void *info)
flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
}
+static bool tlb_is_not_lazy(int cpu, void *data)
+{
+ return !per_cpu(cpu_tlbstate.is_lazy, cpu);
+}
+
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
@@ -554,8 +588,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
(void *)info, 1);
return;
}
- smp_call_function_many(cpumask, flush_tlb_func_remote,
+
+ /*
+ * If no page tables were freed, we can skip sending IPIs to
+ * CPUs in lazy TLB mode. They will flush the TLB themselves
+ * at the next context switch.
+ *
+ * However, if page tables are getting freed, we need to send the
+ * IPI everywhere, to prevent CPUs in lazy TLB mode from tripping
+ * up on the new contents of what used to be page tables, while
+ * doing a speculative memory access.
+ */
+ if (info->freed_tables)
+ smp_call_function_many(cpumask, flush_tlb_func_remote,
(void *)info, 1);
+ else
+ on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
+ (void *)info, 1, GFP_ATOMIC, cpumask);
}
/*
--
2.17.1
Add an argument to flush_tlb_mm_range to indicate whether page tables
are about to be freed after this TLB flush. This allows for an
optimization of flush_tlb_mm_range to skip CPUs in lazy TLB mode.
No functional changes.
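The rule callers follow (condensed from the header changes below) is to
pass true when the operation may also have freed page tables, and false
otherwise:

    /* unmap-style operations: page tables may have been freed as well */
    flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true);

    /* mprotect-style range flushes: page tables stay in place */
    flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags, false);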
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/tlb.h | 6 ++++--
arch/x86/include/asm/tlbflush.h | 12 ++++++++----
arch/x86/kernel/ldt.c | 2 +-
arch/x86/kernel/vm86_32.c | 2 +-
arch/x86/mm/tlb.c | 3 ++-
5 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index cb0a1f470980..b9237a1f567f 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -9,9 +9,11 @@
#define tlb_flush(tlb) \
{ \
if (!tlb->fullmm && !tlb->need_flush_all) \
- flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end, 0UL); \
+ flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end, 0UL, \
+ tlb->freed_tables); \
else \
- flush_tlb_mm_range(tlb->mm, 0UL, TLB_FLUSH_ALL, 0UL); \
+ flush_tlb_mm_range(tlb->mm, 0UL, TLB_FLUSH_ALL, 0UL, \
+ tlb->freed_tables); \
}
#include <asm-generic/tlb.h>
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 82898cd3d933..ae1ef755cb94 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -515,19 +515,23 @@ struct flush_tlb_info {
#define local_flush_tlb() __flush_tlb()
-#define flush_tlb_mm(mm) flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL)
+/* Simultaneous unmap operations could have freed page tables. */
+#define flush_tlb_mm(mm) \
+ flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)
+/* The operations calling flush_tlb_range do not free page tables. */
#define flush_tlb_range(vma, start, end) \
- flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
+ flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags, false)
extern void flush_tlb_all(void);
extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
- unsigned long end, unsigned long vmflag);
+ unsigned long end, unsigned long vmflag,
+ bool freed_tables);
extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
{
- flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, VM_NONE);
+ flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, VM_NONE, false);
}
void native_flush_tlb_others(const struct cpumask *cpumask,
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 26d713ecad34..2c2291fa14b3 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -190,7 +190,7 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
}
va = (unsigned long)ldt_slot_va(slot);
- flush_tlb_mm_range(mm, va, va + LDT_SLOT_STRIDE, 0);
+ flush_tlb_mm_range(mm, va, va + LDT_SLOT_STRIDE, 0, false);
ldt->slot = slot;
#endif
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 5edb27f1a2c4..51024424dfbe 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -199,7 +199,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
pte_unmap_unlock(pte, ptl);
out:
up_write(&mm->mmap_sem);
- flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, 0UL);
+ flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, 0UL, false);
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ac05d61cc90e..62f147b0f6b7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -571,7 +571,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
- unsigned long end, unsigned long vmflag)
+ unsigned long end, unsigned long vmflag,
+ bool freed_tables)
{
int cpu;
--
2.17.1
The code in on_each_cpu_cond sets CPUs in a locally allocated bitmask,
which should never be used by other CPUs simultaneously. There is no
need to use locked memory accesses to set the bits in this bitmap.
Switch to __cpumask_set_cpu.
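For reference, the two helpers differ only in whether the underlying bit
operation uses a locked access (as defined in include/linux/cpumask.h,
slightly condensed):

    static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
    {
        set_bit(cpumask_check(cpu), cpumask_bits(dstp));    /* atomic RMW */
    }

    static inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
    {
        __set_bit(cpumask_check(cpu), cpumask_bits(dstp));  /* plain store */
    }

Since the bitmap here is private to the caller until it is handed to
on_each_cpu_mask, the non-atomic variant is safe.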
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
---
kernel/smp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3a2681..2dbc842dd385 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -680,7 +680,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
preempt_disable();
for_each_online_cpu(cpu)
if (cond_func(cpu, info))
- cpumask_set_cpu(cpu, cpus);
+ __cpumask_set_cpu(cpu, cpus);
on_each_cpu_mask(cpus, func, info, wait);
preempt_enable();
free_cpumask_var(cpus);
--
2.17.1
Move some code that will be needed for the lazy -> !lazy state
transition when a lazy TLB CPU has gotten out of date.
No functional changes, since the if (real_prev == next) branch
always returns.
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
(cherry picked from commit 61d0beb5796ab11f7f3bf38cb2eccc6579aaa70b)
---
arch/x86/mm/tlb.c | 87 +++++++++++++++++++++++++++++------------------
1 file changed, 54 insertions(+), 33 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d19f424073d9..ac05d61cc90e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -158,6 +158,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
+ bool need_flush;
+ u16 new_asid;
/*
* NB: The scheduler will call us with prev == next when switching
@@ -217,8 +219,27 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
return;
} else {
- u16 new_asid;
- bool need_flush;
+ u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);
+
+ /*
+ * Avoid user/user BTB poisoning by flushing the branch
+ * predictor when switching between processes. This stops
+ * one process from doing Spectre-v2 attacks on another.
+ *
+ * As an optimization, flush indirect branches only when
+ * switching into processes that disable dumping. This
+ * protects high value processes like gpg, without having
+ * too high performance overhead. IBPB is *expensive*!
+ *
+ * This will not flush branches when switching into kernel
+ * threads. It will also not flush if we switch to idle
+ * thread and back to the same process. It will flush if we
+ * switch to a different non-dumpable process.
+ */
+ if (tsk && tsk->mm &&
+ tsk->mm->context.ctx_id != last_ctx_id &&
+ get_dumpable(tsk->mm) != SUID_DUMP_USER)
+ indirect_branch_prediction_barrier();
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
@@ -249,44 +270,44 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
/* Let nmi_uaccess_okay() know that we're changing CR3. */
this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
barrier();
+ }
- if (need_flush) {
- this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
- load_new_mm_cr3(next->pgd, new_asid, true);
-
- /*
- * NB: This gets called via leave_mm() in the idle path
- * where RCU functions differently. Tracing normally
- * uses RCU, so we need to use the _rcuidle variant.
- *
- * (There is no good reason for this. The idle code should
- * be rearranged to call this before rcu_idle_enter().)
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
- } else {
- /* The new ASID is already up to date. */
- load_new_mm_cr3(next->pgd, new_asid, false);
-
- /* See above wrt _rcuidle. */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
- }
+ if (need_flush) {
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
+ load_new_mm_cr3(next->pgd, new_asid, true);
/*
- * Record last user mm's context id, so we can avoid
- * flushing branch buffer with IBPB if we switch back
- * to the same user.
+ * NB: This gets called via leave_mm() in the idle path
+ * where RCU functions differently. Tracing normally
+ * uses RCU, so we need to use the _rcuidle variant.
+ *
+ * (There is no good reason for this. The idle code should
+ * be rearranged to call this before rcu_idle_enter().)
*/
- if (next != &init_mm)
- this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
-
- /* Make sure we write CR3 before loaded_mm. */
- barrier();
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ } else {
+ /* The new ASID is already up to date. */
+ load_new_mm_cr3(next->pgd, new_asid, false);
- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+ /* See above wrt _rcuidle. */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
}
+ /*
+ * Record last user mm's context id, so we can avoid
+ * flushing branch buffer with IBPB if we switch back
+ * to the same user.
+ */
+ if (next != &init_mm)
+ this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
+
+ /* Make sure we write CR3 before loaded_mm. */
+ barrier();
+
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+
load_mm_cr4(next);
switch_ldt(real_prev, next);
}
--
2.17.1
Introduce a variant of on_each_cpu_cond that iterates only over the
CPUs in a cpumask, in order to avoid making callbacks for every single
CPU in the system when we only need to test a subset.
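With the mask variant, the TLB flush patch earlier in the series can
restrict both the condition callbacks and the IPIs to the CPUs it already
cares about:

    /* only call tlb_is_not_lazy() for CPUs in cpumask, IPI those that match */
    on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
                          (void *)info, 1, GFP_ATOMIC, cpumask);

The existing on_each_cpu_cond keeps its behaviour by calling the new
function with cpu_online_mask (NULL on the UP build).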
Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/smp.h | 4 ++++
kernel/smp.c | 17 +++++++++++++----
kernel/up.c | 14 +++++++++++---
3 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 9fb239e12b82..a56f08ff3097 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -53,6 +53,10 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags);
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags, const struct cpumask *mask);
+
int smp_call_function_single_async(int cpu, call_single_data_t *csd);
#ifdef CONFIG_SMP
diff --git a/kernel/smp.c b/kernel/smp.c
index 2dbc842dd385..f4cf1b0bb3b8 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -667,9 +667,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
* You must not call this function with disabled interrupts or
* from a hardware interrupt handler or from a bottom half handler.
*/
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
- gfp_t gfp_flags)
+ gfp_t gfp_flags, const struct cpumask *mask)
{
cpumask_var_t cpus;
int cpu, ret;
@@ -678,7 +678,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
preempt_disable();
- for_each_online_cpu(cpu)
+ for_each_cpu(cpu, mask)
if (cond_func(cpu, info))
__cpumask_set_cpu(cpu, cpus);
on_each_cpu_mask(cpus, func, info, wait);
@@ -690,7 +690,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
* just have to IPI them one by one.
*/
preempt_disable();
- for_each_online_cpu(cpu)
+ for_each_cpu(cpu, mask)
if (cond_func(cpu, info)) {
ret = smp_call_function_single(cpu, func,
info, wait);
@@ -699,6 +699,15 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
preempt_enable();
}
}
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags)
+{
+ on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags,
+ cpu_online_mask);
+}
EXPORT_SYMBOL(on_each_cpu_cond);
static void do_nothing(void *unused)
diff --git a/kernel/up.c b/kernel/up.c
index 42c46bf3e0a5..ff536f9cc8a2 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -68,9 +68,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
* Preemption is disabled here to make sure the cond_func is called under the
* same condtions in UP and SMP.
*/
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
- smp_call_func_t func, void *info, bool wait,
- gfp_t gfp_flags)
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags, const struct cpumask *mask)
{
unsigned long flags;
@@ -82,6 +82,14 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
}
preempt_enable();
}
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags)
+{
+ on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags, NULL);
+}
EXPORT_SYMBOL(on_each_cpu_cond);
int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
--
2.17.1
Pass the freed_tables information on to native_flush_tlb_others.
No functional changes.
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/mm/tlb.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index ae1ef755cb94..dc4d99579984 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -511,6 +511,7 @@ struct flush_tlb_info {
unsigned long start;
unsigned long end;
u64 new_tlb_gen;
+ bool freed_tables;
};
#define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 62f147b0f6b7..5d73bde0e4c8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -578,6 +578,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
struct flush_tlb_info info = {
.mm = mm,
+ .freed_tables = freed_tables,
};
cpu = get_cpu();
--
2.17.1
On Mon, 2018-09-24 at 14:37 -0400, Rik van Riel wrote:
> Linus asked me to come up with a smaller patch set to get the
> benefits
> of lazy TLB mode, so I spent some time trying out various
> permutations
> of the code, with a few workloads that do lots of context switches,
> and
> also happen to have a fair number of TLB flushes a second.
I made a nice list of which patches this code
is based on, but I forgot to copy it into my
intro email.
The patches are based on current -tip, plus:
- tip x86/core: 012e77a903d ("x86/nmi: Fix NMI uaccess race against CR3
switching")
- arm64 tlb/asm-generic branch, including
- faaadaf315b4 ("asm-generic/tlb: Guard with #ifdef CONFIG_MMU")
- 22a61c3c4f13 ("asm-generic/tlb: Track freeing of page-table
directories in struct mmu_gather")
- a6d60245d6d9 ("asm-generic/tlb: Track which levels of the page
tables have been cleared")
--
All Rights Reversed.
Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.
Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
(cherry picked from commit 95b0e6357d3e4e05349668940d7ff8f3b7e7e11e)
---
arch/x86/include/asm/tlbflush.h | 16 ----------------
arch/x86/mm/tlb.c | 15 +--------------
2 files changed, 1 insertion(+), 30 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index ad6629537af5..82898cd3d933 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -143,22 +143,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
#define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
#endif
-static inline bool tlb_defer_switch_to_init_mm(void)
-{
- /*
- * If we have PCID, then switching to init_mm is reasonably
- * fast. If we don't have PCID, then switching to init_mm is
- * quite slow, so we try to defer it in the hopes that we can
- * avoid it entirely. The latter approach runs the risk of
- * receiving otherwise unnecessary IPIs.
- *
- * This choice is just a heuristic. The tlb code can handle this
- * function returning true or false regardless of whether we have
- * PCID.
- */
- return !static_cpu_has(X86_FEATURE_PCID);
-}
-
struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 063433ff67bf..d19f424073d9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -309,20 +309,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
return;
- if (tlb_defer_switch_to_init_mm()) {
- /*
- * There's a significant optimization that may be possible
- * here. We have accurate enough TLB flush tracking that we
- * don't need to maintain coherence of TLB per se when we're
- * lazy. We do, however, need to maintain coherence of
- * paging-structure caches. We could, in principle, leave our
- * old mm loaded and only switch to init_mm when
- * tlb_remove_page() happens.
- */
- this_cpu_write(cpu_tlbstate.is_lazy, true);
- } else {
- switch_mm(NULL, &init_mm, NULL);
- }
+ this_cpu_write(cpu_tlbstate.is_lazy, true);
}
/*
--
2.17.1
* Rik van Riel <[email protected]> wrote:
> The big thing remaining is the reference count overhead of
> the lazy TLB mm_struct, but getting rid of that is rather a
> lot of code for a small performance gain. Not quite what
> Linus asked for :)
BTW, what would be the plan to improve scalability there?
Is it even possible?
Also, it would be nice to integrate some of those workloads
into a simple 'perf bench mm' or 'perf bench tlb' subcommand;
see tools/perf/bench/ on how to add benchmarking modules.
Thanks,
Ingo
On Wed, 2018-10-24 at 07:53 +0200, Ingo Molnar wrote:
> * Rik van Riel <[email protected]> wrote:
>
> > The big thing remaining is the reference count overhead of
> > the lazy TLB mm_struct, but getting rid of that is rather a
> > lot of code for a small performance gain. Not quite what
> > Linus asked for :)
>
> BTW., what would be the plan to improve scalability there,
> is it even possible?
One thing I looked at was shooting down the lazy TLB
MM from the exit code path; that way the MM would no
longer be referenced from any CPU by the time the
memory gets freed.
Once we do that, we no longer need to do any refcounting
for the lazy TLB MM.
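Very roughly, the idea would be something like this (a sketch only;
leave_lazy_mm is a hypothetical callback, not existing code):

    /* On the exit path, once the last real user of the mm is gone: */
    on_each_cpu_mask(mm_cpumask(mm), leave_lazy_mm, mm, 1);

    /*
     * After this, no CPU holds the mm as its lazy TLB mm any more, so
     * lazy TLB users would no longer need the mmgrab()/mmdrop() pair.
     */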
--
All Rights Reversed.