2019-05-25 08:23:29

by Nadav Amit

Subject: [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.
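
Roughly, this turns the serial pattern used today into a concurrent one.
A minimal sketch (flush_fn stands for whatever flush handler is used; the
actual interface, __smp_call_function_many(), is introduced in patch 3):

	/* Serial (current on_each_cpu()-style flow): the remote CPUs
	 * flush and are waited for before the local flush runs. */
	smp_call_function_many(mask, flush_fn, info, true);
	local_irq_disable();
	flush_fn(info);
	local_irq_enable();

	/* Concurrent (this series, with preemption disabled): queue the
	 * remote flushes, run the local flush while the IPIs are being
	 * handled, and only then wait for the remote CPUs to finish. */
	__smp_call_function_many(mask, flush_fn, flush_fn, info, true);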

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on the
behavior of on_each_cpu() that functions were first executed remotely
and only then locally.

On my Haswell machine (bare-metal), running a TLB flush microbenchmark
(MADV_DONTNEED/touch for a single page on one thread) takes the
following time (ns):

n_threads   before   after
---------   ------   -----
1           661      663
2           1436     1225 (-14%)
4           1571     1421 (-10%)

Note that since the benchmark also causes page-faults, the actual
speedup of the TLB shootdowns themselves is greater. Also note the
larger improvement with 2 threads (a single remote TLB flush target).
This seems to be a side-effect of keeping the synchronization data
structures (csd) off the stack, unlike the way it is currently done
(in smp_call_function_single()).

Patches 1-2 do small cleanups. Patches 3-5 implement the concurrent
execution of TLB flushes. Patch 6 restores the performance of local TLB
flushes, which was hurt by the optimization, to what it was before these
changes by introducing a fast path for this specific case.

Nadav Amit (6):
smp: Remove smp_call_function() and on_each_cpu() return values
cpumask: Purify cpumask_next()
smp: Run functions concurrently in smp_call_function_many()
x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()
x86/mm/tlb: Flush remote and local TLBs concurrently
x86/mm/tlb: Optimize local TLB flushes

arch/alpha/kernel/smp.c | 19 +---
arch/alpha/oprofile/common.c | 6 +-
arch/ia64/kernel/perfmon.c | 12 +--
arch/ia64/kernel/uncached.c | 8 +-
arch/x86/hyperv/mmu.c | 2 +
arch/x86/include/asm/paravirt.h | 8 ++
arch/x86/include/asm/paravirt_types.h | 6 ++
arch/x86/include/asm/tlbflush.h | 6 ++
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/paravirt.c | 3 +
arch/x86/lib/cache-smp.c | 3 +-
arch/x86/mm/tlb.c | 137 +++++++++++++++++--------
arch/x86/xen/mmu_pv.c | 2 +
drivers/char/agp/generic.c | 3 +-
include/linux/cpumask.h | 2 +-
include/linux/smp.h | 32 ++++--
kernel/smp.c | 139 ++++++++++++--------------
kernel/up.c | 3 +-
18 files changed, 230 insertions(+), 162 deletions(-)

--
2.20.1


2019-05-25 08:23:36

by Nadav Amit

Subject: [RFC PATCH 3/6] smp: Run functions concurrently in smp_call_function_many()

Currently, on_each_cpu() and similar functions do not exploit the
potential of concurrency: the function is first executed remotely and
only then locally. Functions such as a TLB flush can take considerable
time, so this provides an opportunity for performance optimization.

To do so, introduce __smp_call_function_many(), which allows callers to
provide separate remote and local functions to be executed, and runs
them concurrently. Keep the semantics of smp_call_function_many() as
they are today for backward compatibility: in this case the called
function is not executed locally.

__smp_call_function_many() does not use the optimized version for a
single remote target that smp_call_function_single() implements. For a
synchronous function call, smp_call_function_single() keeps a
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides
greater performance improvements (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overhead due to TLB misses and false sharing when the stack is used for
other purposes.

Adding support for running the functions concurrently required removing
a micro-optimization in on_each_cpu() that disabled/enabled IRQs instead
of saving/restoring them. The benefit of running the local and remote
code concurrently is expected to outweigh this.
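
For illustration, a caller that needs different remote and local handlers
would use the new interface roughly as follows (a sketch; the handler
names are made up for the example):

	static void my_remote_func(void *info)
	{
		/* Runs on each remote CPU, in IPI context. */
	}

	static void my_local_func(void *info)
	{
		/* Runs on the calling CPU, with IRQs disabled. */
	}

	preempt_disable();
	__smp_call_function_many(mask, my_remote_func, my_local_func,
				 info, true);
	preempt_enable();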

CC: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
include/linux/smp.h | 27 ++++++---
kernel/smp.c | 133 +++++++++++++++++++++-----------------------
2 files changed, 83 insertions(+), 77 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index bb8b451ab01f..b69abc88259d 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -32,11 +32,6 @@ extern unsigned int total_cpus;
int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
int wait);

-/*
- * Call a function on all processors
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
-
/*
* Call a function on processors specified by mask, which might include
* the local one.
@@ -44,6 +39,15 @@ void on_each_cpu(smp_call_func_t func, void *info, int wait);
void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
void *info, bool wait);

+/*
+ * Call a function on all processors. May be used during early boot while
+ * early_boot_irqs_disabled is set.
+ */
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+ on_each_cpu_mask(cpu_online_mask, func, info, wait);
+}
+
/*
* Call a function on each processor for which the supplied function
* cond_func returns a positive value. This may include the local
@@ -102,8 +106,17 @@ extern void smp_cpus_done(unsigned int max_cpus);
* Call a function on all other processors
*/
void smp_call_function(smp_call_func_t func, void *info, int wait);
-void smp_call_function_many(const struct cpumask *mask,
- smp_call_func_t func, void *info, bool wait);
+
+void __smp_call_function_many(const struct cpumask *mask,
+ smp_call_func_t remote_func,
+ smp_call_func_t local_func,
+ void *info, bool wait);
+
+static inline void smp_call_function_many(const struct cpumask *mask,
+ smp_call_func_t func, void *info, bool wait)
+{
+ __smp_call_function_many(mask, func, NULL, info, wait);
+}

int smp_call_function_any(const struct cpumask *mask,
smp_call_func_t func, void *info, int wait);
diff --git a/kernel/smp.c b/kernel/smp.c
index e0b05ac53108..17750608c52c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -387,9 +387,13 @@ int smp_call_function_any(const struct cpumask *mask,
EXPORT_SYMBOL_GPL(smp_call_function_any);

/**
- * smp_call_function_many(): Run a function on a set of other CPUs.
+ * __smp_call_function_many(): Run a function on a set of CPUs.
* @mask: The set of cpus to run on (only runs on online subset).
- * @func: The function to run. This must be fast and non-blocking.
+ * @remote_func: The function to run on remote cores. This must be fast and
+ * non-blocking.
+ * @local_func: The function that should be run on this CPU. This must be
+ * fast and non-blocking. If NULL is provided, no function will
+ * be executed on this CPU.
* @info: An arbitrary pointer to pass to the function.
* @wait: If true, wait (atomically) until function has completed
* on other CPUs.
@@ -400,11 +404,16 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
* hardware interrupt handler or from a bottom half handler. Preemption
* must be disabled when calling this function.
*/
-void smp_call_function_many(const struct cpumask *mask,
- smp_call_func_t func, void *info, bool wait)
+void __smp_call_function_many(const struct cpumask *mask,
+ smp_call_func_t remote_func,
+ smp_call_func_t local_func,
+ void *info, bool wait)
{
+ int cpu, last_cpu, this_cpu = smp_processor_id();
struct call_function_data *cfd;
- int cpu, next_cpu, this_cpu = smp_processor_id();
+ bool run_remote = false;
+ bool run_local = false;
+ int nr_cpus = 0;

/*
* Can deadlock when called with interrupts disabled.
@@ -412,55 +421,62 @@ void smp_call_function_many(const struct cpumask *mask,
* send smp call function interrupt to this cpu and as such deadlocks
* can't happen.
*/
- WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
- && !oops_in_progress && !early_boot_irqs_disabled);
+ if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled)
+ lockdep_assert_irqs_enabled();
+
+ /* Check if we need local execution. */
+ if (local_func && cpumask_test_cpu(this_cpu, mask))
+ run_local = true;

- /* Try to fastpath. So, what's a CPU they want? Ignoring this one. */
+ /* Check if we need remote execution, i.e., any CPU excluding this one. */
cpu = cpumask_first_and(mask, cpu_online_mask);
if (cpu == this_cpu)
cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+ if (cpu < nr_cpu_ids)
+ run_remote = true;

- /* No online cpus? We're done. */
- if (cpu >= nr_cpu_ids)
- return;
+ if (run_remote) {
+ cfd = this_cpu_ptr(&cfd_data);

- /* Do we have another CPU which isn't us? */
- next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
- if (next_cpu == this_cpu)
- next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
-
- /* Fastpath: do that cpu by itself. */
- if (next_cpu >= nr_cpu_ids) {
- smp_call_function_single(cpu, func, info, wait);
- return;
- }
+ cpumask_and(cfd->cpumask, mask, cpu_online_mask);
+ __cpumask_clear_cpu(this_cpu, cfd->cpumask);

- cfd = this_cpu_ptr(&cfd_data);
-
- cpumask_and(cfd->cpumask, mask, cpu_online_mask);
- __cpumask_clear_cpu(this_cpu, cfd->cpumask);
+ cpumask_clear(cfd->cpumask_ipi);
+ for_each_cpu(cpu, cfd->cpumask) {
+ call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+
+ nr_cpus++;
+ last_cpu = cpu;
+
+ csd_lock(csd);
+ if (wait)
+ csd->flags |= CSD_FLAG_SYNCHRONOUS;
+ csd->func = remote_func;
+ csd->info = info;
+ if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+ __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+ }

- /* Some callers race with other cpus changing the passed mask */
- if (unlikely(!cpumask_weight(cfd->cpumask)))
- return;
+ /*
+ * Choose the most efficient way to send an IPI. Note that the
+ * number of CPUs might be zero due to concurrent changes to the
+ * provided mask or cpu_online_mask.
+ */
+ if (nr_cpus == 1)
+ arch_send_call_function_single_ipi(last_cpu);
+ else if (likely(nr_cpus > 1))
+ arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
+ }

- cpumask_clear(cfd->cpumask_ipi);
- for_each_cpu(cpu, cfd->cpumask) {
- call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+ if (run_local) {
+ unsigned long flags;

- csd_lock(csd);
- if (wait)
- csd->flags |= CSD_FLAG_SYNCHRONOUS;
- csd->func = func;
- csd->info = info;
- if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
- __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+ local_irq_save(flags);
+ local_func(info);
+ local_irq_restore(flags);
}

- /* Send a message to all CPUs in the map */
- arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
-
- if (wait) {
+ if (run_remote && wait) {
for_each_cpu(cpu, cfd->cpumask) {
call_single_data_t *csd;

@@ -469,7 +485,7 @@ void smp_call_function_many(const struct cpumask *mask,
}
}
}
-EXPORT_SYMBOL(smp_call_function_many);
+EXPORT_SYMBOL(__smp_call_function_many);

/**
* smp_call_function(): Run a function on all other CPUs.
@@ -586,24 +602,6 @@ void __init smp_init(void)
smp_cpus_done(setup_max_cpus);
}

-/*
- * Call a function on all processors. May be used during early boot while
- * early_boot_irqs_disabled is set. Use local_irq_save/restore() instead
- * of local_irq_disable/enable().
- */
-void on_each_cpu(void (*func) (void *info), void *info, int wait)
-{
- unsigned long flags;
-
- preempt_disable();
- smp_call_function(func, info, wait);
- local_irq_save(flags);
- func(info);
- local_irq_restore(flags);
- preempt_enable();
-}
-EXPORT_SYMBOL(on_each_cpu);
-
/**
* on_each_cpu_mask(): Run a function on processors specified by
* cpumask, which may include the local processor.
@@ -623,16 +621,11 @@ EXPORT_SYMBOL(on_each_cpu);
void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
void *info, bool wait)
{
- int cpu = get_cpu();
+ preempt_disable();

- smp_call_function_many(mask, func, info, wait);
- if (cpumask_test_cpu(cpu, mask)) {
- unsigned long flags;
- local_irq_save(flags);
- func(info);
- local_irq_restore(flags);
- }
- put_cpu();
+ __smp_call_function_many(mask, func, func, info, wait);
+
+ preempt_enable();
}
EXPORT_SYMBOL(on_each_cpu_mask);

--
2.20.1

2019-05-25 08:23:37

by Nadav Amit

Subject: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi(), which does so. The current
flush_tlb_others() interface is kept, since the paravirtual interfaces
need to be adapted before it can be removed; this is left for future
work. In such PV environments, TLB flushes are currently not performed
concurrently.

Add a static key to tell whether this new interface is supported.
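
A paravirtual backend that still provides only flush_tlb_others() is
expected to opt out of the new interface at setup time, roughly as in
the sketch below (my_pv_flush_tlb_others is a placeholder name; the
actual hunks for Hyper-V, KVM and Xen follow):

	pv_ops.mmu.flush_tlb_others = my_pv_flush_tlb_others;
	static_key_disable(&flush_tlb_multi_enabled.key);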

Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: Juergen Gross <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/hyperv/mmu.c | 2 +
arch/x86/include/asm/paravirt.h | 8 +++
arch/x86/include/asm/paravirt_types.h | 6 ++
arch/x86/include/asm/tlbflush.h | 6 ++
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/paravirt.c | 3 +
arch/x86/mm/tlb.c | 80 +++++++++++++++++++++++----
arch/x86/xen/mmu_pv.c | 2 +
8 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+ static_key_disable(&flush_tlb_multi_enabled.key);
}
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
#endif
}

+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
static inline void __flush_tlb(void)
{
PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
}

+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+ const struct flush_tlb_info *info)
+{
+ PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
static inline void flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3a156e63c57d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
+ /*
+ * flush_tlb_multi() is the preferred interface. When it is used,
+ * flush_tlb_others() should return false.
+ */
+ void (*flush_tlb_multi)(const struct cpumask *cpus,
+ const struct flush_tlb_info *info);
void (*flush_tlb_others)(const struct cpumask *cpus,
const struct flush_tlb_info *info);

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
}

+void native_flush_tlb_multi(const struct cpumask *cpumask,
+ const struct flush_tlb_info *info);
+
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info);

@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);

#ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info) \
+ native_flush_tlb_multi(mask, info)
+
#define flush_tlb_others(mask, info) \
native_flush_tlb_others(mask, info)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3f0cc828cc36..c1c2b88ea3f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -643,6 +643,7 @@ static void __init kvm_guest_init(void)
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+ static_key_disable(&flush_tlb_multi_enabled.key);
}

if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5492a669f658..1314f89304a8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -171,6 +171,8 @@ unsigned paravirt_patch_insns(void *insn_buff, unsigned len,
return insn_len;
}

+DEFINE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
static void native_flush_tlb(void)
{
__native_flush_tlb();
@@ -375,6 +377,7 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_user = native_flush_tlb,
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
+ .mmu.flush_tlb_multi = native_flush_tlb_multi,
.mmu.flush_tlb_others = native_flush_tlb_others,
.mmu.tlb_remove_table =
(void (*)(struct mmu_gather *, void *))tlb_remove_page,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index afd2859a6966..0ec2bfca7581 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -550,7 +550,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* garbage into our TLB. Since switching to init_mm is barely
* slower than a minimal flush, just switch to init_mm.
*
- * This should be rare, with native_flush_tlb_others skipping
+ * This should be rare, with native_flush_tlb_multi skipping
* IPIs to lazy TLB mode CPUs.
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -634,9 +634,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
}

-static void flush_tlb_func_local(const void *info, enum tlb_flush_reason reason)
+static void flush_tlb_func_local(void *info)
{
const struct flush_tlb_info *f = info;
+ enum tlb_flush_reason reason;
+
+ reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;

flush_tlb_func_common(f, true, reason);
}
@@ -654,14 +657,30 @@ static void flush_tlb_func_remote(void *info)
flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
}

-static bool tlb_is_not_lazy(int cpu, void *data)
+static inline bool tlb_is_not_lazy(int cpu)
{
return !per_cpu(cpu_tlbstate.is_lazy, cpu);
}

-void native_flush_tlb_others(const struct cpumask *cpumask,
- const struct flush_tlb_info *info)
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+ const struct flush_tlb_info *info)
{
+ /*
+ * native_flush_tlb_multi() can handle a single CPU, but it is
+ * suboptimal if the local TLB should be flushed, and therefore should
+ * not be used in such case. Check that it is not used in such case,
+ * and use this assumption for tracing and accounting of remote TLB
+ * flushes.
+ */
+ VM_WARN_ON(!cpumask_any_but(cpumask, smp_processor_id()));
+
+ /*
+ * Do accounting and tracing. Note that there are (and have always been)
+ * cases in which a remote TLB flush will be traced, but eventually
+ * would not happen.
+ */
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
if (info->end == TLB_FLUSH_ALL)
trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -681,10 +700,14 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
* means that the percpu tlb_gen variables won't be updated
* and we'll do pointless flushes on future context switches.
*
- * Rather than hooking native_flush_tlb_others() here, I think
+ * Rather than hooking native_flush_tlb_multi() here, I think
* that UV should be updated so that smp_call_function_many(),
* etc, are optimal on UV.
*/
+ local_irq_disable();
+ flush_tlb_func_local((__force void *)info);
+ local_irq_enable();
+
cpumask = uv_flush_tlb_others(cpumask, info);
if (cpumask)
smp_call_function_many(cpumask, flush_tlb_func_remote,
@@ -703,11 +726,39 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
* doing a speculative memory access.
*/
if (info->freed_tables)
- smp_call_function_many(cpumask, flush_tlb_func_remote,
- (void *)info, 1);
- else
- on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
- (void *)info, 1, GFP_ATOMIC, cpumask);
+ __smp_call_function_many(cpumask, flush_tlb_func_remote,
+ flush_tlb_func_local, (void *)info, 1);
+ else {
+ /*
+ * Although we could have used on_each_cpu_cond_mask(),
+ * open-coding it has several performance advantages: (1) we can
+ * use specialized functions for remote and local flushes; (2)
+ * no need for indirect branch to test if TLB is lazy; (3) we
+ * can use a designated cpumask for evaluating the condition
+ * instead of allocating a new one.
+ *
+ * This works under the assumption that there are no nested TLB
+ * flushes, an assumption that is already made in
+ * flush_tlb_mm_range().
+ */
+ struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
+ int cpu;
+
+ cpumask_clear(cond_cpumask);
+
+ for_each_cpu(cpu, cpumask) {
+ if (tlb_is_not_lazy(cpu))
+ __cpumask_set_cpu(cpu, cond_cpumask);
+ }
+ __smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+ flush_tlb_func_local, (void *)info, 1);
+ }
+}
+
+void native_flush_tlb_others(const struct cpumask *cpumask,
+ const struct flush_tlb_info *info)
+{
+ native_flush_tlb_multi(cpumask, info);
}

/*
@@ -773,10 +824,15 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
{
int this_cpu = smp_processor_id();

+ if (static_branch_likely(&flush_tlb_multi_enabled)) {
+ flush_tlb_multi(cpumask, info);
+ return;
+ }
+
if (cpumask_test_cpu(this_cpu, cpumask)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
- flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+ flush_tlb_func_local((__force void *)info);
local_irq_enable();
}

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index beb44e22afdf..0cb277848cb4 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2474,6 +2474,8 @@ void __init xen_init_mmu_ops(void)

pv_ops.mmu = xen_mmu_ops;

+ static_key_disable(&flush_tlb_multi_enabled.key);
+
memset(dummy_mapping, 0xff, PAGE_SIZE);
}

--
2.20.1

2019-05-25 08:23:45

by Nadav Amit

Subject: [RFC PATCH 6/6] x86/mm/tlb: Optimize local TLB flushes

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, making
local TLB flushes slower than they were before the recent changes.

To restore the lost performance in this common case, check whether only
a local TLB flush is needed before calling into the SMP infrastructure.
This requires checking mm_cpumask() one more time, but unless this mask
is updated very frequently, it should not impact performance negatively.
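
A sketch of the resulting dispatch in flush_tlb_on_cpus() (the actual
change is in the hunk below):

	/* Is any CPU other than this one set in the mask? */
	bool flush_others = cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids;

	if (static_branch_likely(&flush_tlb_multi_enabled) && flush_others) {
		/* Concurrent remote + local flush. */
		flush_tlb_multi(cpumask, info);
		return;
	}
	/* Otherwise: local flush if needed, then flush_tlb_others(). */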

Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/mm/tlb.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 0ec2bfca7581..3f3f983e224e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -823,8 +823,12 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
const struct flush_tlb_info *info)
{
int this_cpu = smp_processor_id();
+ bool flush_others = false;

- if (static_branch_likely(&flush_tlb_multi_enabled)) {
+ if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+ flush_others = true;
+
+ if (static_branch_likely(&flush_tlb_multi_enabled) && flush_others) {
flush_tlb_multi(cpumask, info);
return;
}
@@ -836,7 +840,7 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
local_irq_enable();
}

- if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+ if (flush_others)
flush_tlb_others(cpumask, info);
}

--
2.20.1

2019-05-25 08:25:35

by Nadav Amit

Subject: [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()

arch_tlbbatch_flush() and flush_tlb_mm_range() have very similar code
that is effectively the same. Extract the common code into a new
function, flush_tlb_on_cpus().

There is one functional change, which should not affect correctness:
flush_tlb_mm_range() compared loaded_mm with the mm to figure out
whether a local flush is needed. Instead, the common code looks at
mm_cpumask(), which should give the same result. Performance should not
be affected, since this cpumask should not change at a frequency that
would introduce cache contention.
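
In other words, the local-flush condition changes roughly as follows
(sketch; flush_locally() is just a stand-in for the local flush
sequence):

	/* Old test in flush_tlb_mm_range(): */
	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
		flush_locally();

	/* New test in flush_tlb_on_cpus(), expected to be equivalent: */
	if (cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)))
		flush_locally();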

Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/mm/tlb.c | 55 ++++++++++++++++++++++-------------------------
1 file changed, 26 insertions(+), 29 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7f61431c75fb..afd2859a6966 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -733,7 +733,11 @@ static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
unsigned int stride_shift, bool freed_tables,
u64 new_tlb_gen)
{
- struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
+ struct flush_tlb_info *info;
+
+ preempt_disable();
+
+ info = this_cpu_ptr(&flush_tlb_info);

#ifdef CONFIG_DEBUG_VM
/*
@@ -761,6 +765,23 @@ static inline void put_flush_tlb_info(void)
barrier();
this_cpu_dec(flush_tlb_info_idx);
#endif
+ preempt_enable();
+}
+
+static void flush_tlb_on_cpus(const cpumask_t *cpumask,
+ const struct flush_tlb_info *info)
+{
+ int this_cpu = smp_processor_id();
+
+ if (cpumask_test_cpu(this_cpu, cpumask)) {
+ lockdep_assert_irqs_enabled();
+ local_irq_disable();
+ flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+ local_irq_enable();
+ }
+
+ if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+ flush_tlb_others(cpumask, info);
}

void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
@@ -769,9 +790,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
{
struct flush_tlb_info *info;
u64 new_tlb_gen;
- int cpu;
-
- cpu = get_cpu();

/* Should we flush just the requested range? */
if ((end == TLB_FLUSH_ALL) ||
@@ -786,18 +804,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
new_tlb_gen);

- if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
- lockdep_assert_irqs_enabled();
- local_irq_disable();
- flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
- local_irq_enable();
- }
-
- if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
- flush_tlb_others(mm_cpumask(mm), info);
+ flush_tlb_on_cpus(mm_cpumask(mm), info);

put_flush_tlb_info();
- put_cpu();
}


@@ -832,13 +841,11 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
} else {
struct flush_tlb_info *info;

- preempt_disable();
info = get_flush_tlb_info(NULL, start, end, 0, false, 0);

on_each_cpu(do_kernel_range_flush, info, 1);

put_flush_tlb_info();
- preempt_enable();
}
}

@@ -856,21 +863,11 @@ static const struct flush_tlb_info full_flush_tlb_info = {

void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
- int cpu = get_cpu();
-
- if (cpumask_test_cpu(cpu, &batch->cpumask)) {
- lockdep_assert_irqs_enabled();
- local_irq_disable();
- flush_tlb_func_local(&full_flush_tlb_info, TLB_LOCAL_SHOOTDOWN);
- local_irq_enable();
- }
-
- if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
- flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
+ preempt_disable();
+ flush_tlb_on_cpus(&batch->cpumask, &full_flush_tlb_info);
+ preempt_enable();

cpumask_clear(&batch->cpumask);
-
- put_cpu();
}

static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
--
2.20.1

2019-05-25 08:25:47

by Nadav Amit

Subject: [RFC PATCH 2/6] cpumask: Purify cpumask_next()

cpumask_next() has no side-effects. Mark it as pure.
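
For example (hypothetical caller), the annotation lets the compiler merge
repeated calls with the same arguments, as long as it can prove the mask
is not modified in between:

	static bool next_is_last(int cpu, const struct cpumask *mask)
	{
		/* With __pure, the two cpumask_next() calls below may be
		 * combined into a single call by the compiler. */
		return cpumask_next(cpu, mask) < nr_cpu_ids &&
		       cpumask_next(cpu, mask) == cpumask_last(mask);
	}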

Cc: "David S. Miller" <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
include/linux/cpumask.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 147bdec42215..20df46705f9c 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -209,7 +209,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp)
return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
}

-unsigned int cpumask_next(int n, const struct cpumask *srcp);
+unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);

/**
* cpumask_next_zero - get the next unset cpu in a cpumask
--
2.20.1

2019-05-25 08:33:43

by Ingo Molnar

Subject: Re: [RFC PATCH 2/6] cpumask: Purify cpumask_next()


* Nadav Amit <[email protected]> wrote:

> cpumask_next() has no side-effects. Mark it as pure.
>
> Cc: "David S. Miller" <[email protected]>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> include/linux/cpumask.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 147bdec42215..20df46705f9c 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -209,7 +209,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp)
> return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
> }
>
> -unsigned int cpumask_next(int n, const struct cpumask *srcp);
> +unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);

I suppose this makes a code generation difference somewhere, right?

I'm wondering, couldn't it also be marked a const function? That's
supposedly an even better category.

Thanks,

Ingo

2019-05-25 08:39:33

by Nadav Amit

Subject: Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

> On May 25, 2019, at 1:22 AM, Nadav Amit <[email protected]> wrote:
>
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. The current
> flush_tlb_others() interface is kept, since paravirtual interfaces need
> to be adapted first before it can be removed. This is left for future
> work. In such PV environments, TLB flushes are not performed, at this
> time, concurrently.
>
> +void native_flush_tlb_multi(const struct cpumask *cpumask,
> + const struct flush_tlb_info *info)
> {
> + /*
> + * native_flush_tlb_multi() can handle a single CPU, but it is
> + * suboptimal if the local TLB should be flushed, and therefore should
> + * not be used in such case. Check that it is not used in such case,
> + * and use this assumption for tracing and accounting of remote TLB
> + * flushes.
> + */
> + VM_WARN_ON(!cpumask_any_but(cpumask, smp_processor_id()));

This warning might fire off incorrectly and will be removed.

2019-05-25 08:56:18

by Jürgen Groß

Subject: Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

On 25/05/2019 10:22, Nadav Amit wrote:
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. The current
> flush_tlb_others() interface is kept, since paravirtual interfaces need
> to be adapted first before it can be removed. This is left for future
> work. In such PV environments, TLB flushes are not performed, at this
> time, concurrently.
>
> Add a static key to tell whether this new interface is supported.
>
> Cc: "K. Y. Srinivasan" <[email protected]>
> Cc: Haiyang Zhang <[email protected]>
> Cc: Stephen Hemminger <[email protected]>
> Cc: Sasha Levin <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: [email protected]
> Cc: Juergen Gross <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> arch/x86/hyperv/mmu.c | 2 +
> arch/x86/include/asm/paravirt.h | 8 +++
> arch/x86/include/asm/paravirt_types.h | 6 ++
> arch/x86/include/asm/tlbflush.h | 6 ++
> arch/x86/kernel/kvm.c | 1 +
> arch/x86/kernel/paravirt.c | 3 +
> arch/x86/mm/tlb.c | 80 +++++++++++++++++++++++----
> arch/x86/xen/mmu_pv.c | 2 +
> 8 files changed, 96 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> index e65d7fe6489f..ca28b400c87c 100644
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
> pr_info("Using hypercall for remote TLB flush\n");
> pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
> pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> +
> + static_key_disable(&flush_tlb_multi_enabled.key);
> }
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index c25c38a05c1c..192be7254457 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -47,6 +47,8 @@ static inline void slow_down_io(void)
> #endif
> }
>
> +DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
> +
> static inline void __flush_tlb(void)
> {
> PVOP_VCALL0(mmu.flush_tlb_user);
> @@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
> PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
> }
>
> +static inline void flush_tlb_multi(const struct cpumask *cpumask,
> + const struct flush_tlb_info *info)
> +{
> + PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
> +}
> +
> static inline void flush_tlb_others(const struct cpumask *cpumask,
> const struct flush_tlb_info *info)
> {
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 946f8f1f1efc..3a156e63c57d 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
> void (*flush_tlb_user)(void);
> void (*flush_tlb_kernel)(void);
> void (*flush_tlb_one_user)(unsigned long addr);
> + /*
> + * flush_tlb_multi() is the preferred interface. When it is used,
> + * flush_tlb_others() should return false.

This comment does not make sense. flush_tlb_others() return type is
void.


Juergen

2019-05-27 08:31:56

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently

On Sat, May 25, 2019 at 01:21:57AM -0700, Nadav Amit wrote:
> The proposed changes should also improve the performance of other
> invocations of on_each_cpu(). Hopefully, no one has relied on the
> behavior of on_each_cpu() that functions were first executed remotely
> and only then locally.

Oh gawd... I actually have vague memories of code doing exactly that,
but I can't for the life of me remember where :/

2019-05-27 08:32:23

by Peter Zijlstra

Subject: Re: [RFC PATCH 2/6] cpumask: Purify cpumask_next()

On Sat, May 25, 2019 at 01:21:59AM -0700, Nadav Amit wrote:
> cpumask_next() has no side-effects. Mark it as pure.

It would be good to have a few words on why... because apparently you
found this makes a difference.

> Cc: "David S. Miller" <[email protected]>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> include/linux/cpumask.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 147bdec42215..20df46705f9c 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -209,7 +209,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp)
> return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
> }
>
> -unsigned int cpumask_next(int n, const struct cpumask *srcp);
> +unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);
>
> /**
> * cpumask_next_zero - get the next unset cpu in a cpumask
> --
> 2.20.1
>

2019-05-27 09:18:12

by Peter Zijlstra

Subject: Re: [RFC PATCH 3/6] smp: Run functions concurrently in smp_call_function_many()

> + /*
> + * Choose the most efficient way to send an IPI. Note that the
> + * number of CPUs might be zero due to concurrent changes to the
> + * provided mask or cpu_online_mask.
> + */

Since we have preemption disabled here, I don't think the online mask
can shrink; cpu-offline uses stop_machine().

> + if (nr_cpus == 1)
> + arch_send_call_function_single_ipi(last_cpu);
> + else if (likely(nr_cpus > 1))
> + arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
> + }

2019-05-27 09:26:16

by Peter Zijlstra

Subject: Re: [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()

On Sat, May 25, 2019 at 01:22:01AM -0700, Nadav Amit wrote:

> There is one functional change, which should not affect correctness:
> flush_tlb_mm_range compared loaded_mm and the mm to figure out if local
> flush is needed. Instead, the common code would look at the mm_cpumask()
> which should give the same result.

> @@ -786,18 +804,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
> new_tlb_gen);
>
> - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> - lockdep_assert_irqs_enabled();
> - local_irq_disable();
> - flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
> - local_irq_enable();
> - }
> -
> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> - flush_tlb_others(mm_cpumask(mm), info);

So if we want to double check that; we'd add:

WARN_ON_ONCE(cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) ==
(mm == this_cpu_read(cpu_tlbstate.loaded_mm)));

right?

> + flush_tlb_on_cpus(mm_cpumask(mm), info);
>
> put_flush_tlb_info();
> - put_cpu();
> }

2019-05-27 09:48:50

by Peter Zijlstra

Subject: Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

On Sat, May 25, 2019 at 10:54:50AM +0200, Juergen Gross wrote:
> On 25/05/2019 10:22, Nadav Amit wrote:

> > diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> > index 946f8f1f1efc..3a156e63c57d 100644
> > --- a/arch/x86/include/asm/paravirt_types.h
> > +++ b/arch/x86/include/asm/paravirt_types.h
> > @@ -211,6 +211,12 @@ struct pv_mmu_ops {
> > void (*flush_tlb_user)(void);
> > void (*flush_tlb_kernel)(void);
> > void (*flush_tlb_one_user)(unsigned long addr);
> > + /*
> > + * flush_tlb_multi() is the preferred interface. When it is used,
> > + * flush_tlb_others() should return false.
>
> This comment does not make sense. flush_tlb_others() return type is
> void.

I suspect that is an artifact from before the static_key; an attempt to
make the pv interface less awkward.

Something like the below would work for KVM, I suspect; the others
(Hyper-V and Xen) are more 'interesting'.

---
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi

static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);

-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
u8 state;
@@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
* queue flush_on_enter for pre-empted vCPUs
*/
for_each_cpu(cpu, flushmask) {
+ if (cpu == smp_processor_id())
+ continue;
+
src = &per_cpu(steal_time, cpu);
state = READ_ONCE(src->preempted);
if ((state & KVM_VCPU_PREEMPTED)) {
@@ -603,7 +606,7 @@ static void kvm_flush_tlb_others(const s
}
}

- native_flush_tlb_others(flushmask, info);
+ native_flush_tlb_multi(flushmask, info);
}

static void __init kvm_guest_init(void)
@@ -628,9 +631,8 @@ static void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
- pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+ pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
- static_key_disable(&flush_tlb_multi_enabled.key);
}

if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))

2019-05-27 10:02:52

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently

On Sat, May 25, 2019 at 01:21:57AM -0700, Nadav Amit wrote:
> Currently, local and remote TLB flushes are not performed concurrently,
> which introduces unnecessary overhead - each INVLPG can take 100s of
> cycles. This patch-set allows TLB flushes to be run concurrently: first
> request the remote CPUs to initiate the flush, then run it locally, and
> finally wait for the remote CPUs to finish their work.
>
> The proposed changes should also improve the performance of other
> invocations of on_each_cpu(). Hopefully, no one has relied on the
> behavior of on_each_cpu() that functions were first executed remotely
> and only then locally.
>
> On my Haswell machine (bare-metal), running a TLB flush microbenchmark
> (MADV_DONTNEED/touch for a single page on one thread), takes the
> following time (ns):
>
> n_threads before after
> --------- ------ -----
> 1 661 663
> 2 1436 1225 (-14%)
> 4 1571 1421 (-10%)
>
> Note that since the benchmark also causes page-faults, the actual
> speedup of TLB shootdowns is actually greater. Also note the higher
> improvement in performance with 2 thread (a single remote TLB flush
> target). This seems to be a side-effect of holding synchronization
> data-structures (csd) off the stack, unlike the way it is currently done
> (in smp_call_function_single()).
>
> Patches 1-2 do small cleanup. Patches 3-5 actually implement the
> concurrent execution of TLB flushes. Patch 6 restores local TLB flushes
> performance, which was hurt by the optimization, to be as good as it was
> before these changes by introducing a fast-pass for this specific case.

I like; ideally we'll get Hyper-V and Xen sorted before the final
version and avoid having to introduce more PV crud and static keys for
that.

The Hyper-V implementation in particular is horrifically ugly, and the
Xen one doesn't win any prizes either; esp. that on-stack CPU mask needs
to go.

Looking at them, I'm not sure they can actually win anything from using
the new interface, but at least we can avoid making our PV crud uglier
than it has to be.

2019-05-27 10:23:32

by Paolo Bonzini

Subject: Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

On 27/05/19 11:47, Peter Zijlstra wrote:
> On Sat, May 25, 2019 at 10:54:50AM +0200, Juergen Gross wrote:
>> On 25/05/2019 10:22, Nadav Amit wrote:
>
>>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>>> index 946f8f1f1efc..3a156e63c57d 100644
>>> --- a/arch/x86/include/asm/paravirt_types.h
>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>> void (*flush_tlb_user)(void);
>>> void (*flush_tlb_kernel)(void);
>>> void (*flush_tlb_one_user)(unsigned long addr);
>>> + /*
>>> + * flush_tlb_multi() is the preferred interface. When it is used,
>>> + * flush_tlb_others() should return false.
>>
>> This comment does not make sense. flush_tlb_others() return type is
>> void.
>
> I suspect that is an artifact from before the static_key; an attempt to
> make the pv interface less awkward.
>
> Something like the below would work for KVM I suspect, the others
> (Hyper-V and Xen are more 'interesting').
>
> ---
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
>
> static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
>
> -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
> const struct flush_tlb_info *info)
> {
> u8 state;
> @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
> * queue flush_on_enter for pre-empted vCPUs
> */
> for_each_cpu(cpu, flushmask) {
> + if (cpu == smp_processor_id())
> + continue;
> +

Even this would be just an optimization; the vCPU you're running on
cannot be preempted. You can just change others to multi.

Paolo

> src = &per_cpu(steal_time, cpu);
> state = READ_ONCE(src->preempted);
> if ((state & KVM_VCPU_PREEMPTED)) {
> @@ -603,7 +606,7 @@ static void kvm_flush_tlb_others(const s
> }
> }
>
> - native_flush_tlb_others(flushmask, info);
> + native_flush_tlb_multi(flushmask, info);
> }
>
> static void __init kvm_guest_init(void)
> @@ -628,9 +631,8 @@ static void __init kvm_guest_init(void)
> if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
> !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
> kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
> - pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
> + pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
> pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> - static_key_disable(&flush_tlb_multi_enabled.key);
> }
>
> if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
>

2019-05-27 12:34:56

by Peter Zijlstra

Subject: Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

On Mon, May 27, 2019 at 12:21:59PM +0200, Paolo Bonzini wrote:
> On 27/05/19 11:47, Peter Zijlstra wrote:

> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
> >
> > static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
> >
> > -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> > +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
> > const struct flush_tlb_info *info)
> > {
> > u8 state;
> > @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
> > * queue flush_on_enter for pre-empted vCPUs
> > */
> > for_each_cpu(cpu, flushmask) {
> > + if (cpu == smp_processor_id())
> > + continue;
> > +
>
> Even this would be just an optimization; the vCPU you're running on
> cannot be preempted. You can just change others to multi.

Yeah, I know, but it felt weird so I added the explicit skip. No strong
feelings though.

2019-05-27 12:46:56

by Paolo Bonzini

Subject: Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

On 27/05/19 14:32, Peter Zijlstra wrote:
> On Mon, May 27, 2019 at 12:21:59PM +0200, Paolo Bonzini wrote:
>> On 27/05/19 11:47, Peter Zijlstra wrote:
>
>>> --- a/arch/x86/kernel/kvm.c
>>> +++ b/arch/x86/kernel/kvm.c
>>> @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
>>>
>>> static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
>>>
>>> -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>>> +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
>>> const struct flush_tlb_info *info)
>>> {
>>> u8 state;
>>> @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
>>> * queue flush_on_enter for pre-empted vCPUs
>>> */
>>> for_each_cpu(cpu, flushmask) {
>>> + if (cpu == smp_processor_id())
>>> + continue;
>>> +
>>
>> Even this would be just an optimization; the vCPU you're running on
>> cannot be preempted. You can just change others to multi.
>
> Yeah, I know, but it felt weird so I added the explicit skip. No strong
> feelings though.

Neither here, and it would indeed deserve a comment if you left the if out.

Paolo

2019-05-27 17:36:00

by Nadav Amit

Subject: Re: [RFC PATCH 2/6] cpumask: Purify cpumask_next()

> On May 27, 2019, at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Sat, May 25, 2019 at 01:21:59AM -0700, Nadav Amit wrote:
>> cpumask_next() has no side-effects. Mark it as pure.
>
> It would be good to have a few word on why... because apparently you
> found this makes a difference.

I see that eventually it did not make any difference. I saw in the past that
some instances of the kernel are affected by it, but I will remove it from
this patch-set in the next iteration.

2019-05-27 17:42:00

by Nadav Amit

Subject: Re: [RFC PATCH 3/6] smp: Run functions concurrently in smp_call_function_many()

> On May 27, 2019, at 2:15 AM, Peter Zijlstra <[email protected]> wrote:
>
>> + /*
>> + * Choose the most efficient way to send an IPI. Note that the
>> + * number of CPUs might be zero due to concurrent changes to the
>> + * provided mask or cpu_online_mask.
>> + */
>
> Since we have preemption disabled here, I don't think online mask can
> shrink, cpu-offline uses stop_machine().

Right. So I’ll update the comment, but IIUC the provided mask might still
change, so I’ll leave the rest of the comment and the code as is.

>> + if (nr_cpus == 1)
>> + arch_send_call_function_single_ipi(last_cpu);
>> + else if (likely(nr_cpus > 1))
>> + arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
>> + }


2019-05-27 17:51:02

by Nadav Amit

Subject: Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

> On May 27, 2019, at 2:47 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Sat, May 25, 2019 at 10:54:50AM +0200, Juergen Gross wrote:
>> On 25/05/2019 10:22, Nadav Amit wrote:
>
>>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>>> index 946f8f1f1efc..3a156e63c57d 100644
>>> --- a/arch/x86/include/asm/paravirt_types.h
>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>> void (*flush_tlb_user)(void);
>>> void (*flush_tlb_kernel)(void);
>>> void (*flush_tlb_one_user)(unsigned long addr);
>>> + /*
>>> + * flush_tlb_multi() is the preferred interface. When it is used,
>>> + * flush_tlb_others() should return false.
>>
>> This comment does not make sense. flush_tlb_others() return type is
>> void.
>
> I suspect that is an artifact from before the static_key; an attempt to
> make the pv interface less awkward.

Yes, these are leftovers that should have been removed - I will remove
them in the next version.

> Something like the below would work for KVM I suspect, the others
> (Hyper-V and Xen are more 'interesting').
>
> ---
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
>
> static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
>
> -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
> const struct flush_tlb_info *info)
> {
> u8 state;
> @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
> * queue flush_on_enter for pre-empted vCPUs
> */
> for_each_cpu(cpu, flushmask) {
> + if (cpu == smp_processor_id())
> + continue;
> +
> src = &per_cpu(steal_time, cpu);
> state = READ_ONCE(src->preempted);
> if ((state & KVM_VCPU_PREEMPTED)) {
> @@ -603,7 +606,7 @@ static void kvm_flush_tlb_others(const s
> }
> }
>
> - native_flush_tlb_others(flushmask, info);
> + native_flush_tlb_multi(flushmask, info);
> }
>
> static void __init kvm_guest_init(void)
> @@ -628,9 +631,8 @@ static void __init kvm_guest_init(void)
> if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
> !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
> kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
> - pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
> + pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
> pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> - static_key_disable(&flush_tlb_multi_enabled.key);
> }
>
> if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))

That’s what I have as well ;-).

As you mentioned (in another email), the hyper-v code specifically seems
convoluted to me. In general, I prefer not to touch KVM/Xen/hyper-v, but
since you twist my arm, I will send a compile-tested version for Xen and
hyper-v.

2019-05-27 19:01:27

by Nadav Amit

Subject: Re: [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()

> On May 27, 2019, at 2:24 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Sat, May 25, 2019 at 01:22:01AM -0700, Nadav Amit wrote:
>
>> There is one functional change, which should not affect correctness:
>> flush_tlb_mm_range compared loaded_mm and the mm to figure out if local
>> flush is needed. Instead, the common code would look at the mm_cpumask()
>> which should give the same result.
>
>> @@ -786,18 +804,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>> info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
>> new_tlb_gen);
>>
>> - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>> - lockdep_assert_irqs_enabled();
>> - local_irq_disable();
>> - flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
>> - local_irq_enable();
>> - }
>> -
>> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>> - flush_tlb_others(mm_cpumask(mm), info);
>
> So if we want to double check that; we'd add:
>
> WARN_ON_ONCE(cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) ==
> (mm == this_cpu_read(cpu_tlbstate.loaded_mm)));
>
> right?

Yes, except the condition should be inverted (“!=“ instead of “==“), and I
would prefer to use VM_WARN_ON_ONCE().

Unfortunately, this condition does fire when copy_init_mm() calls dup_mm().
I don’t think there is a correctness issue, and I am tempted to just
check, before warning, that (mm != init_mm).
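
Concretely, the check being discussed would look roughly like (sketch):

VM_WARN_ON_ONCE(mm != &init_mm &&
		cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) !=
		(mm == this_cpu_read(cpu_tlbstate.loaded_mm)));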

What do you say?

2019-05-27 19:16:34

by Peter Zijlstra

Subject: Re: [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()

On Mon, May 27, 2019 at 06:59:01PM +0000, Nadav Amit wrote:
> > On May 27, 2019, at 2:24 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Sat, May 25, 2019 at 01:22:01AM -0700, Nadav Amit wrote:
> >
> >> There is one functional change, which should not affect correctness:
> >> flush_tlb_mm_range compared loaded_mm and the mm to figure out if local
> >> flush is needed. Instead, the common code would look at the mm_cpumask()
> >> which should give the same result.
> >
> >> @@ -786,18 +804,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >> info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
> >> new_tlb_gen);
> >>
> >> - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> >> - lockdep_assert_irqs_enabled();
> >> - local_irq_disable();
> >> - flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
> >> - local_irq_enable();
> >> - }
> >> -
> >> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> >> - flush_tlb_others(mm_cpumask(mm), info);
> >
> > So if we want to double check that; we'd add:
> >
> > WARN_ON_ONCE(cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) ==
> > (mm == this_cpu_read(cpu_tlbstate.loaded_mm)));
> >
> > right?
>
> Yes, except the condition should be inverted (“!=“ instead of “==“), and I
> would prefer to use VM_WARN_ON_ONCE().
>
> Unfortunately, this condition does fire when copy_init_mm() calls dup_mm().
> I don’t think there is a correctness issue, and I am tempted just check,
> before warning, that (mm != init_mm) .
>
> What do you say?

Works for me.