2018-08-01 10:04:17

by Rik van Riel

Subject: [PATCH v2 0/11] x86,tlb,mm: more lazy TLB cleanups & optimizations

This patch series implements the cleanups suggested by Peter and Andy,
removes lazy TLB mm refcounting on x86, and shows how other architectures
could implement that same optimization.

The previous patch series already seems to have removed most of the
cache line contention I was seeing at context switch time, so CPU use
by the memcache and memcache-like workloads has not changed measurably
with this patch series.

However, the memory bandwidth used by the memcache system has been
reduced by about 1%, to serve the same number of queries per second.

This happens on two-socket Haswell and Broadwell systems. On larger
systems (4 or 8 sockets), one might also see a measurable drop in the
amount of CPU time used, with workloads where the previous patch
series does not remove all cache line contention on the mm.

This is against the latest -tip tree, and seems to be stable (on top
of another tree) with workloads that do over a million context switches
a second.

V2 of the series uses Peter's logic flow in context_switch(), and
is a step in the direction of completely getting rid of ->active_mm.



2018-08-01 10:04:27

by Rik van Riel

Subject: [PATCH 04/11] x86,mm: use on_each_cpu_cond for TLB flushes

Instead of open coding bitmap magic, use on_each_cpu_cond
to determine which CPUs to send TLB flush IPIs to.

This might be a little bit slower than examining the bitmaps,
but it should be a lot easier to maintain in the long run.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 75 +++++++++++--------------------------------------------
1 file changed, 15 insertions(+), 60 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5321e02c4e09..671cc66df801 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -582,12 +582,19 @@ static void flush_tlb_func_remote(void *info)
flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
}

+static bool tlb_is_lazy(int cpu, void *data)
+{
+ return per_cpu(cpu_tlbstate.is_lazy, cpu);
+}
+
+static bool tlb_is_not_lazy(int cpu, void *data)
+{
+ return !per_cpu(cpu_tlbstate.is_lazy, cpu);
+}
+
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
- cpumask_var_t lazymask;
- unsigned int cpu;
-
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
if (info->end == TLB_FLUSH_ALL)
trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -596,6 +603,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
(info->end - info->start) >> PAGE_SHIFT);

if (is_uv_system()) {
+ unsigned int cpu;
/*
* This whole special case is confused. UV has a "Broadcast
* Assist Unit", which seems to be a fancy way to send IPIs.
@@ -619,28 +627,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
return;
}

- /*
- * A temporary cpumask is used in order to skip sending IPIs
- * to CPUs in lazy TLB state, while keeping them in mm_cpumask(mm).
- * If the allocation fails, simply IPI every CPU in mm_cpumask.
- */
- if (!alloc_cpumask_var(&lazymask, GFP_ATOMIC)) {
- smp_call_function_many(cpumask, flush_tlb_func_remote,
- (void *)info, 1);
- return;
- }
-
- cpumask_copy(lazymask, cpumask);
-
- for_each_cpu(cpu, lazymask) {
- if (per_cpu(cpu_tlbstate.is_lazy, cpu))
- cpumask_clear_cpu(cpu, lazymask);
- }
-
- smp_call_function_many(lazymask, flush_tlb_func_remote,
- (void *)info, 1);
-
- free_cpumask_var(lazymask);
+ on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
+ (void *)info, 1, GFP_ATOMIC, cpumask);
}

/*
@@ -709,50 +697,17 @@ void tlb_flush_remove_tables_local(void *arg)
}
}

-static void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm,
- struct cpumask *lazy_cpus)
-{
- int cpu;
-
- for_each_cpu(cpu, mm_cpumask(mm)) {
- if (!per_cpu(cpu_tlbstate.is_lazy, cpu))
- cpumask_set_cpu(cpu, lazy_cpus);
- }
-}
-
void tlb_flush_remove_tables(struct mm_struct *mm)
{
int cpu = get_cpu();
- cpumask_var_t lazy_cpus;

if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids) {
put_cpu();
return;
}

- if (!zalloc_cpumask_var(&lazy_cpus, GFP_ATOMIC)) {
- /*
- * If the cpumask allocation fails, do a brute force flush
- * on all the CPUs that have this mm loaded.
- */
- smp_call_function_many(mm_cpumask(mm),
- tlb_flush_remove_tables_local, (void *)mm, 1);
- put_cpu();
- return;
- }
-
- /*
- * CPUs with !is_lazy either received a TLB flush IPI while the user
- * pages in this address range were unmapped, or have context switched
- * and reloaded %CR3 since then.
- *
- * Shootdown IPIs at page table freeing time only need to be sent to
- * CPUs that may have out of date TLB contents.
- */
- mm_fill_lazy_tlb_cpu_mask(mm, lazy_cpus);
- smp_call_function_many(lazy_cpus,
- tlb_flush_remove_tables_local, (void *)mm, 1);
- free_cpumask_var(lazy_cpus);
+ on_each_cpu_cond_mask(tlb_is_lazy, tlb_flush_remove_tables_local,
+ (void *)mm, 1, GFP_ATOMIC, mm_cpumask(mm));
put_cpu();
}

--
2.14.4


2018-08-01 10:04:33

by Rik van Riel

Subject: [PATCH 08/11] arch,mm: add config variable to skip lazy TLB mm refcounting

Add a config variable indicating that this architecture does not require
lazy TLB mm refcounting, because lazy TLB mms get shot down instantly
at exit_mmap time.
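
As an illustration of how this symbol is meant to be consumed, here is a
minimal sketch of the kind of refcounting helper that patch 11 in this
series builds around it (illustrative only, not the final code):

#include <linux/sched/mm.h>

/*
 * Sketch: when the architecture guarantees that a lazy TLB mm is shot
 * down from exit_mmap, the atomic refcount operations can be skipped.
 */
static inline void grab_lazy_mm(struct mm_struct *mm)
{
        if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
                return;

        mmgrab(mm);
}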

Signed-off-by: Rik van Riel <[email protected]>
---
arch/Kconfig | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 1aa59063f1fd..1baf1a176dbf 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -237,6 +237,10 @@ config ARCH_HAS_FORTIFY_SOURCE
config ARCH_HAS_SET_MEMORY
bool

+# Select if arch shoots down lazy TLB mms at exit time, instead of refcounting
+config ARCH_NO_ACTIVE_MM_REFCOUNTING
+ bool
+
# Select if arch init_task must go in the __init_task_data section
config ARCH_TASK_STRUCT_ON_STACK
bool
--
2.14.4


2018-08-01 10:04:39

by Rik van Riel

Subject: [PATCH 05/11] mm,tlb: turn dummy defines into inline functions

Turn the dummy tlb_flush_remove_tables* defines into inline functions,
in order to get compiler type checking, etc.
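
To make the benefit concrete, here is a small self-contained sketch (the
OLD_STYLE switch and the caller are hypothetical, not part of this patch):
the empty macro silently accepts an argument of the wrong type, while the
inline function makes the compiler warn about the incompatible pointer on
every configuration:

#include <linux/mm_types.h>

#ifdef OLD_STYLE
/* The macro never evaluates its argument, so any type is accepted. */
#define tlb_flush_remove_tables(mm) do {} while (0)
#else
/* The prototype is checked even though the body is empty. */
static inline void tlb_flush_remove_tables(struct mm_struct *mm)
{
}
#endif

static void hypothetical_caller(struct vm_area_struct *vma)
{
        /* Wrong argument type: only the inline version produces a warning. */
        tlb_flush_remove_tables(vma);
}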

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
include/asm-generic/tlb.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e811ef7b8350..eb57062d9dd3 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -309,8 +309,13 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
* pointing to about-to-be-freed page table memory.
*/
#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
-#define tlb_flush_remove_tables(mm) do {} while (0)
-#define tlb_flush_remove_tables_local(mm) do {} while (0)
+static inline void tlb_flush_remove_tables(struct mm_struct *mm)
+{
+}
+
+static inline void tlb_flush_remove_tables_local(struct mm_struct *mm)
+{
+}
#endif

#endif /* _ASM_GENERIC__TLB_H */
--
2.14.4


2018-08-01 10:04:53

by Rik van Riel

Subject: [PATCH 10/11] x86,tlb: really leave mm on shootdown

When getting an mm shot down from under us in lazy TLB mode, don't
just switch the TLB over to the init_mm page tables, but really drop
our references to the lazy TLB mm.

This allows for faster (instant) freeing of a lazy TLB mm, which is
a precondition to getting rid of the refcounting of mms in lazy TLB mode.

Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7b1add904396..425cb9fa2640 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,8 @@ void leave_mm(void *dummy)
WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));

switch_mm(NULL, &init_mm, NULL);
+ current->active_mm = &init_mm;
+ mmdrop(loaded_mm);
}
EXPORT_SYMBOL_GPL(leave_mm);

@@ -483,6 +485,8 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* IPIs to lazy TLB mode CPUs.
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
+ current->active_mm = &init_mm;
+ mmdrop(loaded_mm);
return;
}

--
2.14.4


2018-08-01 10:04:56

by Rik van Riel

Subject: [PATCH 01/11] x86,tlb: clarify memory barrier in switch_mm_irqs_off

Clarify exactly what the memory barrier synchronizes with.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
---
arch/x86/mm/tlb.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 752dbf4e0e50..5321e02c4e09 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -263,8 +263,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
/*
* Read the tlb_gen to check whether a flush is needed.
* If the TLB is up to date, just use it.
- * The barrier synchronizes with the tlb_gen increment in
- * the TLB shootdown code.
+ * The TLB shootdown code first increments tlb_gen, and then
+ * sends IPIs to CPUs that have this mm loaded and are not
+ * in lazy TLB mode. The barrier ensures we handle
+ * cpu_tlbstate.is_lazy before tlb_gen, keeping this code
+ * synchronized with the TLB flush code.
*/
smp_mb();
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
--
2.14.4


2018-08-01 10:05:13

by Rik van Riel

Subject: [PATCH 06/11] mm,x86: skip cr4 and ldt reload when mm stays the same

When switching back from lazy TLB mode to a thread of the same process
that switched into lazy TLB mode, we still have the cr4 (and sometimes
LDT) of that process loaded, and there is no need to reload them.

When there was no TLB flush while the CPU was in lazy TLB mode, the
current code in switch_mm_irqs_off already avoids the reload by
returning early.

However, when the TLB contents on the CPU are out of date and we
flush the TLB for the task, we fall through to the regular context
switching code. This patch teaches that code to skip the cr4 and LDT
reloads when switching back to the same mm after a flush.

Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
---
arch/x86/mm/tlb.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 671cc66df801..149fb64e4bf4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -367,8 +367,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, next);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);

- load_mm_cr4(next);
- switch_ldt(real_prev, next);
+ if (next != real_prev) {
+ load_mm_cr4(next);
+ switch_ldt(real_prev, next);
+ }
}

/*
--
2.14.4


2018-08-01 10:05:21

by Rik van Riel

Subject: [PATCH 11/11] mm,sched: conditionally skip lazy TLB mm refcounting

Conditionally skip lazy TLB mm refcounting. When an architecture has
CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
lazy TLB mode anywhere will get shot down from exit_mmap, and there
is no need to incur the cache line bouncing overhead of refcounting
a lazy TLB mm.

Implement this by moving the refcounting of a lazy TLB mm to helper
functions, which skip the refcounting when it is not necessary.

Deal with use_mm and unuse_mm by fully splitting out the refcounting
of the lazy TLB mm a kernel thread may have when entering use_mm from
the refcounting of the mm that use_mm is about to start using.

Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 5 +++--
fs/exec.c | 2 +-
include/linux/sched/mm.h | 25 +++++++++++++++++++++++++
kernel/sched/core.c | 29 +++++++++++++++++++++--------
mm/mmu_context.c | 21 ++++++++++++++-------
5 files changed, 64 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 425cb9fa2640..d53d9c19b97d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,6 +8,7 @@
#include <linux/cpu.h>
#include <linux/debugfs.h>
#include <linux/gfp.h>
+#include <linux/sched/mm.h>

#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -141,7 +142,7 @@ void leave_mm(void *dummy)

switch_mm(NULL, &init_mm, NULL);
current->active_mm = &init_mm;
- mmdrop(loaded_mm);
+ drop_lazy_mm(loaded_mm);
}
EXPORT_SYMBOL_GPL(leave_mm);

@@ -486,7 +487,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
current->active_mm = &init_mm;
- mmdrop(loaded_mm);
+ drop_lazy_mm(loaded_mm);
return;
}

diff --git a/fs/exec.c b/fs/exec.c
index bdd0eacefdf5..7a6d4811b02b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1043,7 +1043,7 @@ static int exec_mmap(struct mm_struct *mm)
mmput(old_mm);
return 0;
}
- mmdrop(active_mm);
+ drop_lazy_mm(active_mm);
return 0;
}

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 44d356f5e47c..7308bf38012f 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,31 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
}

+/*
+ * In lazy TLB mode, a CPU keeps the mm of the last process mapped while
+ * running a kernel thread or idle; we must make sure the lazy TLB mm and
+ * page tables do not disappear while a lazy TLB mode CPU uses them.
+ * There are two ways to handle the race between lazy TLB CPUs and exit_mmap:
+ * 1) Have a lazy TLB CPU hold a refcount on the lazy TLB mm.
+ * 2) Have the architecture code shoot down the lazy TLB mm from exit_mmap;
+ * in that case, refcounting can be skipped, reducing cache line bouncing.
+ */
+static inline void grab_lazy_mm(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
+ return;
+
+ mmgrab(mm);
+}
+
+static inline void drop_lazy_mm(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
+ return;
+
+ mmdrop(mm);
+}
+
/**
* mmget() - Pin the address space associated with a &struct mm_struct.
* @mm: The address space to pin.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c45de46fdf10..ba87235b8a31 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2691,7 +2691,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
*/
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
- mmdrop(mm);
+ drop_lazy_mm(mm);
}
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
@@ -2803,16 +2803,29 @@ context_switch(struct rq *rq, struct task_struct *prev,
* membarrier after storing to rq->curr, before returning to
* user-space.
*/
- if (!mm) {
+ /*
+ * kernel -> kernel lazy + transfer active
+ * user -> kernel lazy + grab_lazy_mm active
+ *
+ * kernel -> user switch + drop_lazy_mm active
+ * user -> user switch
+ */
+ if (!mm) { // to kernel
next->active_mm = oldmm;
- mmgrab(oldmm);
enter_lazy_tlb(oldmm, next);
- } else
+
+ if (prev->mm) // from user
+ grab_lazy_mm(oldmm);
+ else
+ prev->active_mm = NULL;
+ } else { // to user
switch_mm_irqs_off(oldmm, mm, next);

- if (!prev->mm) {
- prev->active_mm = NULL;
- rq->prev_mm = oldmm;
+ if (!prev->mm) { // from kernel
+ /* will drop_lazy_mm() in finish_task_switch(). */
+ rq->prev_mm = oldmm;
+ prev->active_mm = NULL;
+ }
}

rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
@@ -5532,7 +5545,7 @@ void idle_task_exit(void)
current->active_mm = &init_mm;
finish_arch_post_lock_switch();
}
- mmdrop(mm);
+ drop_lazy_mm(mm);
}

/*
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3e612ae748e9..d5c2524cdd9a 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -24,12 +24,15 @@ void use_mm(struct mm_struct *mm)
struct mm_struct *active_mm;
struct task_struct *tsk = current;

+ /* Kernel threads have a NULL tsk->mm when entering. */
+ WARN_ON(tsk->mm);
+
task_lock(tsk);
+ /* Previous ->active_mm was held in lazy TLB mode. */
active_mm = tsk->active_mm;
- if (active_mm != mm) {
- mmgrab(mm);
- tsk->active_mm = mm;
- }
+ /* Grab mm for reals; tsk->mm needs to stick around until unuse_mm. */
+ mmgrab(mm);
+ tsk->active_mm = mm;
tsk->mm = mm;
switch_mm(active_mm, mm, tsk);
task_unlock(tsk);
@@ -37,8 +40,9 @@ void use_mm(struct mm_struct *mm)
finish_arch_post_lock_switch();
#endif

- if (active_mm != mm)
- mmdrop(active_mm);
+ /* Drop the lazy TLB mode mm. */
+ if (active_mm)
+ drop_lazy_mm(active_mm);
}
EXPORT_SYMBOL_GPL(use_mm);

@@ -57,8 +61,11 @@ void unuse_mm(struct mm_struct *mm)
task_lock(tsk);
sync_mm_rss(mm);
tsk->mm = NULL;
- /* active_mm is still 'mm' */
+ /* active_mm is still 'mm'; grab it as a lazy TLB mm */
+ grab_lazy_mm(mm);
enter_lazy_tlb(mm, tsk);
+ /* drop the tsk->mm refcount */
+ mmdrop(mm);
task_unlock(tsk);
}
EXPORT_SYMBOL_GPL(unuse_mm);
--
2.14.4


2018-08-01 10:05:23

by Rik van Riel

Subject: [PATCH 09/11] mm,x86: shoot down lazy TLB references at exit_mmap time

Shooting down lazy TLB references to an mm at exit_mmap time ensures
that no users of the mm_struct will be left anywhere in the system,
allowing it to be torn down and freed immediately.

Signed-off-by: Rik van Riel <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/mmu_context.h | 1 +
arch/x86/include/asm/tlbflush.h | 2 ++
arch/x86/mm/tlb.c | 15 +++++++++++++++
4 files changed, 19 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6d4774f203d0..ecdfc6933203 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -75,6 +75,7 @@ config X86
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
+ select ARCH_NO_ACTIVE_MM_REFCOUNTING
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_USE_BUILTIN_BSWAP
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index eeeb9289c764..529bf7bc5f75 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -238,6 +238,7 @@ static inline void arch_exit_mmap(struct mm_struct *mm)
{
paravirt_arch_exit_mmap(mm);
ldt_arch_exit_mmap(mm);
+ lazy_tlb_exit_mmap(mm);
}

#ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 511bf5fae8b8..3966a45367cd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -538,6 +538,8 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
native_flush_tlb_others(mask, info)
#endif

+extern void lazy_tlb_exit_mmap(struct mm_struct *mm);
+
extern void tlb_flush_remove_tables(struct mm_struct *mm);
extern void tlb_flush_remove_tables_local(void *arg);

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ea4ef5ceaba2..7b1add904396 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -713,6 +713,21 @@ void tlb_flush_remove_tables(struct mm_struct *mm)
put_cpu();
}

+/*
+ * At exit or execve time, all other threads of a process have disappeared,
+ * but other CPUs could still be referencing this mm in lazy TLB mode.
+ * Get rid of those references before releasing the mm.
+ */
+void lazy_tlb_exit_mmap(struct mm_struct *mm)
+{
+ int cpu = get_cpu();
+
+ if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+ on_each_cpu_mask(mm_cpumask(mm), leave_mm, NULL, 1);
+
+ put_cpu();
+}
+
static void do_flush_tlb_all(void *info)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
--
2.14.4


2018-08-01 10:05:33

by Rik van Riel

Subject: [PATCH 03/11] smp,cpumask: introduce on_each_cpu_cond_mask

Introduce a variant of on_each_cpu_cond that iterates only over the
CPUs in a cpumask, in order to avoid making callbacks for every single
CPU in the system when we only need to test a subset.
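
A usage sketch of the new interface (the per-cpu flag, condition function
and caller below are hypothetical, loosely modeled on the TLB flush
conversion elsewhere in this series):

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/percpu.h>
#include <linux/smp.h>

static DEFINE_PER_CPU(bool, hypothetical_busy_flag);

/* Condition callback: only CPUs for which this returns true get an IPI. */
static bool cpu_is_busy(int cpu, void *info)
{
        return per_cpu(hypothetical_busy_flag, cpu);
}

static void do_work(void *info)
{
        /* Runs on every CPU in the mask that reported being busy. */
}

static void kick_busy_cpus(const struct cpumask *mask)
{
        on_each_cpu_cond_mask(cpu_is_busy, do_work, NULL, true,
                              GFP_ATOMIC, mask);
}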

Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/smp.h | 4 ++++
kernel/smp.c | 17 +++++++++++++----
kernel/up.c | 14 +++++++++++---
3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 9fb239e12b82..a56f08ff3097 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -53,6 +53,10 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags);

+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags, const struct cpumask *mask);
+
int smp_call_function_single_async(int cpu, call_single_data_t *csd);

#ifdef CONFIG_SMP
diff --git a/kernel/smp.c b/kernel/smp.c
index 2dbc842dd385..f4cf1b0bb3b8 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -667,9 +667,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
* You must not call this function with disabled interrupts or
* from a hardware interrupt handler or from a bottom half handler.
*/
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
- gfp_t gfp_flags)
+ gfp_t gfp_flags, const struct cpumask *mask)
{
cpumask_var_t cpus;
int cpu, ret;
@@ -678,7 +678,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),

if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
preempt_disable();
- for_each_online_cpu(cpu)
+ for_each_cpu(cpu, mask)
if (cond_func(cpu, info))
__cpumask_set_cpu(cpu, cpus);
on_each_cpu_mask(cpus, func, info, wait);
@@ -690,7 +690,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
* just have to IPI them one by one.
*/
preempt_disable();
- for_each_online_cpu(cpu)
+ for_each_cpu(cpu, mask)
if (cond_func(cpu, info)) {
ret = smp_call_function_single(cpu, func,
info, wait);
@@ -699,6 +699,15 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
preempt_enable();
}
}
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags)
+{
+ on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags,
+ cpu_online_mask);
+}
EXPORT_SYMBOL(on_each_cpu_cond);

static void do_nothing(void *unused)
diff --git a/kernel/up.c b/kernel/up.c
index 42c46bf3e0a5..ff536f9cc8a2 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -68,9 +68,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
* Preemption is disabled here to make sure the cond_func is called under the
* same condtions in UP and SMP.
*/
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
- smp_call_func_t func, void *info, bool wait,
- gfp_t gfp_flags)
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags, const struct cpumask *mask)
{
unsigned long flags;

@@ -82,6 +82,14 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
}
preempt_enable();
}
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags)
+{
+ on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags, NULL);
+}
EXPORT_SYMBOL(on_each_cpu_cond);

int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
--
2.14.4


2018-08-01 10:05:35

by Rik van Riel

Subject: [PATCH 02/11] smp: use __cpumask_set_cpu in on_each_cpu_cond

The code in on_each_cpu_cond sets CPUs in a locally allocated bitmask,
which should never be used by other CPUs simultaneously. There is no
need to use locked memory accesses to set the bits in this bitmap.

Switch to __cpumask_set_cpu.
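
For reference, a small sketch of the difference (hypothetical function, not
part of this patch): cpumask_set_cpu() ends up in the atomic set_bit(), a
locked read-modify-write on x86, while __cpumask_set_cpu() uses the plain
__set_bit(), which is sufficient for a mask no other CPU can see:

#include <linux/cpumask.h>
#include <linux/gfp.h>

static void cpumask_set_example(void)
{
        cpumask_var_t cpus;     /* local to this function, never shared */

        if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
                return;

        cpumask_set_cpu(0, cpus);       /* atomic set_bit(): locked RMW */
        __cpumask_set_cpu(1, cpus);     /* plain __set_bit(): no lock needed here */

        free_cpumask_var(cpus);
}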

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
---
kernel/smp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3a2681..2dbc842dd385 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -680,7 +680,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
preempt_disable();
for_each_online_cpu(cpu)
if (cond_func(cpu, info))
- cpumask_set_cpu(cpu, cpus);
+ __cpumask_set_cpu(cpu, cpus);
on_each_cpu_mask(cpus, func, info, wait);
preempt_enable();
free_cpumask_var(cpus);
--
2.14.4


2018-08-01 10:06:14

by Rik van Riel

Subject: [PATCH 07/11] x86,mm: remove leave_mm cpu argument

The function leave_mm does not use its cpu argument, but always works on
the CPU where it is called. Change the argument to a void *, so leave_mm
can be called directly from smp_call_function_mask, and stop looking up the
CPU number in current leave_mm callers.
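
With the void * signature, leave_mm can be handed straight to the cross-CPU
call helpers. A sketch of the kind of call site this enables (the wrapper
name is hypothetical; patch 9 of this series does essentially this from
lazy_tlb_exit_mmap):

#include <linux/mm_types.h>
#include <linux/smp.h>
#include <asm/mmu.h>

static void shoot_down_lazy_tlb_users(struct mm_struct *mm)
{
        /* Run leave_mm() on every CPU that still has this mm in its cpumask. */
        on_each_cpu_mask(mm_cpumask(mm), leave_mm, NULL, true);
}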

Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/mmu.h | 2 +-
arch/x86/mm/tlb.c | 2 +-
arch/x86/xen/mmu_pv.c | 2 +-
drivers/acpi/processor_idle.c | 2 +-
4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5ff3e8af2c20..ec27966f338f 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -61,6 +61,6 @@ typedef struct {
.ctx_id = 1, \
}

-void leave_mm(int cpu);
+void leave_mm(void * dummy);

#endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 149fb64e4bf4..ea4ef5ceaba2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -121,7 +121,7 @@ static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
write_cr3(new_mm_cr3);
}

-void leave_mm(int cpu)
+void leave_mm(void *dummy)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 52206ad81e4b..3402d2bf4ae0 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -984,7 +984,7 @@ static void drop_mm_ref_this_cpu(void *info)
struct mm_struct *mm = info;

if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm)
- leave_mm(smp_processor_id());
+ leave_mm(NULL);

/*
* If this cpu still has a stale cr3 reference, then make sure
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index abb559cd28d7..675ffacdd82b 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -714,7 +714,7 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
static void acpi_idle_enter_bm(struct acpi_processor *pr,
struct acpi_processor_cx *cx, bool timer_bc)
{
- acpi_unlazy_tlb(smp_processor_id());
+ acpi_unlazy_tlb(NULL);

/*
* Must be done before busmaster disable as we might need to
--
2.14.4


2018-08-03 15:57:36

by Peter Zijlstra

Subject: Re: [PATCH 11/11] mm,sched: conditionally skip lazy TLB mm refcounting

On Wed, Aug 01, 2018 at 06:02:55AM -0400, Rik van Riel wrote:
> Conditionally skip lazy TLB mm refcounting. When an architecture has
> CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> lazy TLB mode anywhere will get shot down from exit_mmap, and there
> is no need to incur the cache line bouncing overhead of refcounting
> a lazy TLB mm.
>
> Implement this by moving the refcounting of a lazy TLB mm to helper
> functions, which skip the refcounting when it is not necessary.
>
> Deal with use_mm and unuse_mm by fully splitting out the refcounting
> of the lazy TLB mm a kernel thread may have when entering use_mm from
> the refcounting of the mm that use_mm is about to start using.


> @@ -2803,16 +2803,29 @@ context_switch(struct rq *rq, struct task_struct *prev,
> * membarrier after storing to rq->curr, before returning to
> * user-space.
> */
> + /*
> + * kernel -> kernel lazy + transfer active
> + * user -> kernel lazy + grab_lazy_mm active
> + *
> + * kernel -> user switch + drop_lazy_mm active
> + * user -> user switch
> + */
> + if (!mm) { // to kernel
> next->active_mm = oldmm;
> enter_lazy_tlb(oldmm, next);
> +
> + if (prev->mm) // from user
> + grab_lazy_mm(oldmm);
> + else
> + prev->active_mm = NULL;
> + } else { // to user
> switch_mm_irqs_off(oldmm, mm, next);
>
> + if (!prev->mm) { // from kernel
> + /* will drop_lazy_mm() in finish_task_switch(). */
> + rq->prev_mm = oldmm;
> + prev->active_mm = NULL;
> + }
> }

So this still confuses the heck out of me; and the Changelog doesn't
seem to even mention it. You still track and swizzle ->active_mm but no
longer refcount it.

Why can't we skip the ->active_mm swizzle and keep ->active_mm == ->mm.

Doing the swizzle but not the refcount just makes me itch.

2018-08-03 16:42:20

by Rik van Riel

Subject: Re: [PATCH 11/11] mm,sched: conditionally skip lazy TLB mm refcounting

On Fri, 2018-08-03 at 17:56 +0200, Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 06:02:55AM -0400, Rik van Riel wrote:
> > Conditionally skip lazy TLB mm refcounting. When an architecture has
> > CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> > lazy TLB mode anywhere will get shot down from exit_mmap, and there
> > is no need to incur the cache line bouncing overhead of refcounting
> > a lazy TLB mm.
> >
> > Implement this by moving the refcounting of a lazy TLB mm to helper
> > functions, which skip the refcounting when it is not necessary.
> >
> > Deal with use_mm and unuse_mm by fully splitting out the refcounting
> > of the lazy TLB mm a kernel thread may have when entering use_mm from
> > the refcounting of the mm that use_mm is about to start using.
>
>
> > @@ -2803,16 +2803,29 @@ context_switch(struct rq *rq, struct
> > task_struct *prev,
> > * membarrier after storing to rq->curr, before returning
> > to
> > * user-space.
> > */
> > + /*
> > + * kernel -> kernel lazy + transfer active
> > + * user -> kernel lazy + grab_lazy_mm active
> > + *
> > + * kernel -> user switch + drop_lazy_mm active
> > + * user -> user switch
> > + */
> > + if (!mm) { // to kernel
> > next->active_mm = oldmm;
> > enter_lazy_tlb(oldmm, next);
> > +
> > + if (prev->mm) // from user
> > + grab_lazy_mm(oldmm);
> > + else
> > + prev->active_mm = NULL;
> > + } else { // to user
> > switch_mm_irqs_off(oldmm, mm, next);
> >
> > + if (!prev->mm) { // from kernel
> > + /* will drop_lazy_mm() in
> > finish_task_switch(). */
> > + rq->prev_mm = oldmm;
> > + prev->active_mm = NULL;
> > + }
> > }
>
> So this still confuses the heck out of me; and the Changelog doesn't
> seem to even mention it. You still track and swizzle ->active_mm but
> no longer refcount it.
>
> Why can't we skip the ->active_mm swizzle and keep ->active_mm ==
> ->mm.
>
> Doing the swizzle but not the refcount just makes me itch.

I am working on that now, it adds another 7-8
patches on top of this series.

The big question is, do we want this optimization
to wait for further cleanups, or should we run with
code that seems to be stable right now, and put
additional cleanups and enhancements on top of it
later?

The end result should be the same.

--
All Rights Reversed.



2018-08-03 17:27:16

by Peter Zijlstra

Subject: Re: [PATCH 11/11] mm,sched: conditionally skip lazy TLB mm refcounting

On Fri, Aug 03, 2018 at 12:40:48PM -0400, Rik van Riel wrote:
> On Fri, 2018-08-03 at 17:56 +0200, Peter Zijlstra wrote:
> > On Wed, Aug 01, 2018 at 06:02:55AM -0400, Rik van Riel wrote:
> > > Conditionally skip lazy TLB mm refcounting. When an architecture has
> > > CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> > > lazy TLB mode anywhere will get shot down from exit_mmap, and there
> > > is no need to incur the cache line bouncing overhead of refcounting
> > > a lazy TLB mm.
> > >
> > > Implement this by moving the refcounting of a lazy TLB mm to helper
> > > functions, which skip the refcounting when it is not necessary.
> > >
> > > Deal with use_mm and unuse_mm by fully splitting out the refcounting
> > > of the lazy TLB mm a kernel thread may have when entering use_mm from
> > > the refcounting of the mm that use_mm is about to start using.
> >
> >
> > > @@ -2803,16 +2803,29 @@ context_switch(struct rq *rq, struct
> > > task_struct *prev,
> > > * membarrier after storing to rq->curr, before returning
> > > to
> > > * user-space.
> > > */
> > > + /*
> > > + * kernel -> kernel lazy + transfer active
> > > + * user -> kernel lazy + grab_lazy_mm active
> > > + *
> > > + * kernel -> user switch + drop_lazy_mm active
> > > + * user -> user switch
> > > + */
> > > + if (!mm) { // to kernel
> > > next->active_mm = oldmm;
> > > enter_lazy_tlb(oldmm, next);
> > > +
> > > + if (prev->mm) // from user
> > > + grab_lazy_mm(oldmm);
> > > + else
> > > + prev->active_mm = NULL;
> > > + } else { // to user
> > > switch_mm_irqs_off(oldmm, mm, next);
> > >
> > > + if (!prev->mm) { // from kernel
> > > + /* will drop_lazy_mm() in
> > > finish_task_switch(). */
> > > + rq->prev_mm = oldmm;
> > > + prev->active_mm = NULL;
> > > + }
> > > }
> >
> > So this still confuses the heck out of me; and the Changelog doesn't
> > seem to even mention it. You still track and swizzle ->active_mm but
> > no longer refcount it.
> >
> > Why can't we skip the ->active_mm swizzle and keep ->active_mm ==
> > ->mm.
> >
> > Doing the swizzle but not the refcount just makes me itch.
>
> I am working on that now, it adds another 7-8
> patches on top of this series.

I thought those were taking ->active_mm out entirely, not avoiding the
swizzle, but I might have missed something in the middle :-)

> The big question is, do we want this optimization
> to wait for further cleanups, or should we run with
> code that seems to be stable right now, and put
> additional cleanups and enhancements on top of it
> later?

At the very least the Changelog needs to explain why we cannot do away
with the swizzle now and how doing the swizzle without the refcounting
is not completely broken (I think I see, but urgh).


2018-08-03 17:42:21

by Rik van Riel

Subject: Re: [PATCH 11/11] mm,sched: conditionally skip lazy TLB mm refcounting

On Fri, 2018-08-03 at 19:25 +0200, Peter Zijlstra wrote:
> On Fri, Aug 03, 2018 at 12:40:48PM -0400, Rik van Riel wrote:
> > On Fri, 2018-08-03 at 17:56 +0200, Peter Zijlstra wrote:
> > >
> > > Why can't we skip the ->active_mm swizzle and keep ->active_mm ==
> > > ->mm.
> > >
> > > Doing the swizzle but not the refcount just makes me itch.
> >
> > I am working on that now, it adds another 7-8
> > patches on top of this series.
>
> I thought those were taking ->active_mm out entirely, not avoiding the
> swizzle, but I might have missed something in the middle :-)

At this point, only the fact that ->active_mm is still
being used by a few places in the code :)

> > The big question is, do we want this optimization
> > to wait for further cleanups, or should we run with
> > code that seems to be stable right now, and put
> > additional cleanups and enhancements on top of it
> > later?
>
> At the very least the Changelog needs to explain why we cannot do away
> with the swizzle now and how doing the swizzle without the refcounting
> is not completely broken (I think I see, but urgh).

The changelog for patches 9 & 10 explains, I think.

What is missing from my explanation?

How would you like to see it explained?

--
All Rights Reversed.



2018-08-06 18:28:45

by Peter Zijlstra

Subject: Re: [PATCH 11/11] mm,sched: conditionally skip lazy TLB mm refcounting

On Fri, Aug 03, 2018 at 01:39:52PM -0400, Rik van Riel wrote:

> > At the very least the Changelog needs to explain why we cannot do
> > away with the swizzle now and how doing the swizzle without the
> > refcounting is not completely broken (I think I see, but urgh).
>
> The changelog for patches 9 & 10 explains, I think.

They hint at it :-)

> What is missing from my explanation?
> How would you like to see it explained?

Maybe a few words like:

"Since ->active_mm is still used in a few sites, we must keep the
current tracking, such that we will not hit a kthread's NULL mm. Note
that lazy_tlb_exit_mmap() switches ->active_mm to &init_mm before taking
out the lazy mm."

That said, I'm not entirely sure we'll actually touch active_mm if we're
not a user task. The perf thing, for example, will only touch active_mm
when user_mode(regs).

But whatever, this was the only hiccup.

Acked-by: Peter Zijlstra (Intel) <[email protected]>

