This patch series implements the cleanups suggested by Peter and Andy,
removes lazy TLB mm refcounting on x86, and shows how other architectures
could implement that same optimization.
The previous patch series already seems to have removed most of the
cache line contention I was seeing at context switch time, so CPU use
of the memcache and memcache-like workloads has not changed measurably
with this patch series.
However, the memory bandwidth used by the memcache system has been
reduced by about 1%, to serve the same number of queries per second.
This happens on two socket Haswell and Broadwell systems. Maybe on
larger systems (4 or 8 socket) one might also see a measurable drop
in the amount of CPU time used, with workloads where the previous
patch series does not remove all cache line contention on the mm.
This is against the latest -tip tree, and seems to be stable (on top
of another tree) with workloads that do over a million context switches
a second.
The code in on_each_cpu_cond sets CPUs in a locally allocated bitmask,
which should never be used by other CPUs simultaneously. There is no
need to use locked memory accesses to set the bits in this bitmap.
Switch to __cpumask_set_cpu.
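For reference, the two helpers differ only in whether the bit is set with
an atomic (locked) operation; roughly, from include/linux/cpumask.h:

        static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
        {
                set_bit(cpumask_check(cpu), cpumask_bits(dstp));        /* atomic */
        }

        static inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
        {
                __set_bit(cpumask_check(cpu), cpumask_bits(dstp));      /* non-atomic */
        }

Since the bitmask here is local to the function, the non-atomic version
is safe and avoids a locked memory operation for every CPU that is set.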
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
kernel/smp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3a2681..2dbc842dd385 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -680,7 +680,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
preempt_disable();
for_each_online_cpu(cpu)
if (cond_func(cpu, info))
- cpumask_set_cpu(cpu, cpus);
+ __cpumask_set_cpu(cpu, cpus);
on_each_cpu_mask(cpus, func, info, wait);
preempt_enable();
free_cpumask_var(cpus);
--
2.14.4
Turn the dummy tlb_flush_remove_tables* defines into inline functions,
in order to get compiler type checking of the mm argument.
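A small illustration of the difference (just a sketch, not kernel code):
a do-nothing macro silently accepts an argument of any type, while a
static inline stub gets type checked by the compiler.

        struct mm_struct;

        #define stub_macro(mm) do { } while (0)

        static inline void stub_inline(struct mm_struct *mm)
        {
        }

        void example(int *p)
        {
                stub_macro(p);  /* compiles silently, despite the wrong type */
                stub_inline(p); /* warning: incompatible pointer type */
        }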
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
include/asm-generic/tlb.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e811ef7b8350..eb57062d9dd3 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -309,8 +309,13 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
* pointing to about-to-be-freed page table memory.
*/
#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
-#define tlb_flush_remove_tables(mm) do {} while (0)
-#define tlb_flush_remove_tables_local(mm) do {} while (0)
+static inline void tlb_flush_remove_tables(struct mm_struct *mm)
+{
+}
+
+static inline void tlb_flush_remove_tables_local(struct mm_struct *mm)
+{
+}
#endif
#endif /* _ASM_GENERIC__TLB_H */
--
2.14.4
Clarify exactly what the memory barrier synchronizes with.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 752dbf4e0e50..5321e02c4e09 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -263,8 +263,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
/*
* Read the tlb_gen to check whether a flush is needed.
* If the TLB is up to date, just use it.
- * The barrier synchronizes with the tlb_gen increment in
- * the TLB shootdown code.
+ * The TLB shootdown code first increments tlb_gen, and then
+ * sends IPIs to CPUs that have this CPU loaded and are not
+ * in lazy TLB mode. The barrier ensures we handle
+ * cpu_tlbstate.is_lazy before tlb_gen, keeping this code
+ * synchronized with the TLB flush code.
*/
smp_mb();
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
--
2.14.4
Instead of open coding bitmap magic, use on_each_cpu_cond
to determine which CPUs to send TLB flush IPIs to.
This might be a little bit slower than examining the bitmaps,
but it should be a lot easier to maintain in the long run.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 75 +++++++++++--------------------------------------------
1 file changed, 15 insertions(+), 60 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5321e02c4e09..671cc66df801 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -582,12 +582,19 @@ static void flush_tlb_func_remote(void *info)
flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
}
+static bool tlb_is_lazy(int cpu, void *data)
+{
+ return per_cpu(cpu_tlbstate.is_lazy, cpu);
+}
+
+static bool tlb_is_not_lazy(int cpu, void *data)
+{
+ return !per_cpu(cpu_tlbstate.is_lazy, cpu);
+}
+
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
- cpumask_var_t lazymask;
- unsigned int cpu;
-
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
if (info->end == TLB_FLUSH_ALL)
trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -596,6 +603,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
(info->end - info->start) >> PAGE_SHIFT);
if (is_uv_system()) {
+ unsigned int cpu;
/*
* This whole special case is confused. UV has a "Broadcast
* Assist Unit", which seems to be a fancy way to send IPIs.
@@ -619,28 +627,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
return;
}
- /*
- * A temporary cpumask is used in order to skip sending IPIs
- * to CPUs in lazy TLB state, while keeping them in mm_cpumask(mm).
- * If the allocation fails, simply IPI every CPU in mm_cpumask.
- */
- if (!alloc_cpumask_var(&lazymask, GFP_ATOMIC)) {
- smp_call_function_many(cpumask, flush_tlb_func_remote,
- (void *)info, 1);
- return;
- }
-
- cpumask_copy(lazymask, cpumask);
-
- for_each_cpu(cpu, lazymask) {
- if (per_cpu(cpu_tlbstate.is_lazy, cpu))
- cpumask_clear_cpu(cpu, lazymask);
- }
-
- smp_call_function_many(lazymask, flush_tlb_func_remote,
- (void *)info, 1);
-
- free_cpumask_var(lazymask);
+ on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
+ (void *)info, 1, GFP_ATOMIC, cpumask);
}
/*
@@ -709,50 +697,17 @@ void tlb_flush_remove_tables_local(void *arg)
}
}
-static void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm,
- struct cpumask *lazy_cpus)
-{
- int cpu;
-
- for_each_cpu(cpu, mm_cpumask(mm)) {
- if (!per_cpu(cpu_tlbstate.is_lazy, cpu))
- cpumask_set_cpu(cpu, lazy_cpus);
- }
-}
-
void tlb_flush_remove_tables(struct mm_struct *mm)
{
int cpu = get_cpu();
- cpumask_var_t lazy_cpus;
if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids) {
put_cpu();
return;
}
- if (!zalloc_cpumask_var(&lazy_cpus, GFP_ATOMIC)) {
- /*
- * If the cpumask allocation fails, do a brute force flush
- * on all the CPUs that have this mm loaded.
- */
- smp_call_function_many(mm_cpumask(mm),
- tlb_flush_remove_tables_local, (void *)mm, 1);
- put_cpu();
- return;
- }
-
- /*
- * CPUs with !is_lazy either received a TLB flush IPI while the user
- * pages in this address range were unmapped, or have context switched
- * and reloaded %CR3 since then.
- *
- * Shootdown IPIs at page table freeing time only need to be sent to
- * CPUs that may have out of date TLB contents.
- */
- mm_fill_lazy_tlb_cpu_mask(mm, lazy_cpus);
- smp_call_function_many(lazy_cpus,
- tlb_flush_remove_tables_local, (void *)mm, 1);
- free_cpumask_var(lazy_cpus);
+ on_each_cpu_cond_mask(tlb_is_lazy, tlb_flush_remove_tables_local,
+ (void *)mm, 1, GFP_ATOMIC, mm_cpumask(mm));
put_cpu();
}
--
2.14.4
Add a config variable indicating that this architecture does not require
lazy TLB mm refcounting, because lazy TLB mms get shot down instantly
at exit_mmap time.
Signed-off-by: Rik van Riel <[email protected]>
---
arch/Kconfig | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/Kconfig b/arch/Kconfig
index 1aa59063f1fd..1baf1a176dbf 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -237,6 +237,10 @@ config ARCH_HAS_FORTIFY_SOURCE
config ARCH_HAS_SET_MEMORY
bool
+# Select if arch shoots down lazy TLB mms at exit time, instead of refcounting
+config ARCH_NO_ACTIVE_MM_REFCOUNTING
+ bool
+
# Select if arch init_task must go in the __init_task_data section
config ARCH_TASK_STRUCT_ON_STACK
bool
--
2.14.4
Shooting down lazy TLB references to an mm at exit_mmap time ensures
that no users of the mm_struct will be left anywhere in the system,
allowing it to be torn down and freed immediately.
Signed-off-by: Rik van Riel <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/mmu_context.h | 1 +
arch/x86/include/asm/tlbflush.h | 2 ++
arch/x86/mm/tlb.c | 15 +++++++++++++++
4 files changed, 19 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6d4774f203d0..ecdfc6933203 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -75,6 +75,7 @@ config X86
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
+ select ARCH_NO_ACTIVE_MM_REFCOUNTING
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_USE_BUILTIN_BSWAP
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index eeeb9289c764..529bf7bc5f75 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -238,6 +238,7 @@ static inline void arch_exit_mmap(struct mm_struct *mm)
{
paravirt_arch_exit_mmap(mm);
ldt_arch_exit_mmap(mm);
+ lazy_tlb_exit_mmap(mm);
}
#ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 511bf5fae8b8..3966a45367cd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -538,6 +538,8 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
native_flush_tlb_others(mask, info)
#endif
+extern void lazy_tlb_exit_mmap(struct mm_struct *mm);
+
extern void tlb_flush_remove_tables(struct mm_struct *mm);
extern void tlb_flush_remove_tables_local(void *arg);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ea4ef5ceaba2..7b1add904396 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -713,6 +713,21 @@ void tlb_flush_remove_tables(struct mm_struct *mm)
put_cpu();
}
+/*
+ * At exit or execve time, all other threads of a process have disappeared,
+ * but other CPUs could still be referencing this mm in lazy TLB mode.
+ * Get rid of those references before releasing the mm.
+ */
+void lazy_tlb_exit_mmap(struct mm_struct *mm)
+{
+ int cpu = get_cpu();
+
+ if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+ on_each_cpu_mask(mm_cpumask(mm), leave_mm, NULL, 1);
+
+ put_cpu();
+}
+
static void do_flush_tlb_all(void *info)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
--
2.14.4
Conditionally skip lazy TLB mm refcounting. When an architecture has
CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
lazy TLB mode anywhere will get shot down from exit_mmap, and there
is no need to incur the cache line bouncing overhead of refcounting
a lazy TLB mm.
Implement this by moving the refcounting of a lazy TLB mm to helper
functions, which skip the refcounting when it is not necessary.
Deal with use_mm and unuse_mm by fully splitting out the refcounting
of the lazy TLB mm a kernel thread may have when entering use_mm from
the refcounting of the mm that use_mm is about to start using.
Signed-off-by: Rik van Riel <[email protected]>
---
fs/exec.c | 2 +-
include/linux/sched/mm.h | 25 +++++++++++++++++++++++++
kernel/sched/core.c | 6 +++---
mm/mmu_context.c | 21 ++++++++++++++-------
4 files changed, 43 insertions(+), 11 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index bdd0eacefdf5..7a6d4811b02b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1043,7 +1043,7 @@ static int exec_mmap(struct mm_struct *mm)
mmput(old_mm);
return 0;
}
- mmdrop(active_mm);
+ drop_lazy_mm(active_mm);
return 0;
}
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 44d356f5e47c..7308bf38012f 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,31 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
}
+/*
+ * In lazy TLB mode, a CPU keeps the mm of the last process mapped while
+ * running a kernel thread or idle; we must make sure the lazy TLB mm and
+ * page tables do not disappear while a lazy TLB mode CPU uses them.
+ * There are two ways to handle the race between lazy TLB CPUs and exit_mmap:
+ * 1) Have a lazy TLB CPU hold a refcount on the lazy TLB mm.
+ * 2) Have the architecture code shoot down the lazy TLB mm from exit_mmap;
+ * in that case, refcounting can be skipped, reducing cache line bouncing.
+ */
+static inline void grab_lazy_mm(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
+ return;
+
+ mmgrab(mm);
+}
+
+static inline void drop_lazy_mm(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
+ return;
+
+ mmdrop(mm);
+}
+
/**
* mmget() - Pin the address space associated with a &struct mm_struct.
* @mm: The address space to pin.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c45de46fdf10..11724c9e88b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2691,7 +2691,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
*/
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
- mmdrop(mm);
+ drop_lazy_mm(mm);
}
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
@@ -2805,7 +2805,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
*/
if (!mm) {
next->active_mm = oldmm;
- mmgrab(oldmm);
+ grab_lazy_mm(oldmm);
enter_lazy_tlb(oldmm, next);
} else
switch_mm_irqs_off(oldmm, mm, next);
@@ -5532,7 +5532,7 @@ void idle_task_exit(void)
current->active_mm = &init_mm;
finish_arch_post_lock_switch();
}
- mmdrop(mm);
+ drop_lazy_mm(mm);
}
/*
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3e612ae748e9..d5c2524cdd9a 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -24,12 +24,15 @@ void use_mm(struct mm_struct *mm)
struct mm_struct *active_mm;
struct task_struct *tsk = current;
+ /* Kernel threads have a NULL tsk->mm when entering. */
+ WARN_ON(tsk->mm);
+
task_lock(tsk);
+ /* Previous ->active_mm was held in lazy TLB mode. */
active_mm = tsk->active_mm;
- if (active_mm != mm) {
- mmgrab(mm);
- tsk->active_mm = mm;
- }
+ /* Grab mm for reals; tsk->mm needs to stick around until unuse_mm. */
+ mmgrab(mm);
+ tsk->active_mm = mm;
tsk->mm = mm;
switch_mm(active_mm, mm, tsk);
task_unlock(tsk);
@@ -37,8 +40,9 @@ void use_mm(struct mm_struct *mm)
finish_arch_post_lock_switch();
#endif
- if (active_mm != mm)
- mmdrop(active_mm);
+ /* Drop the lazy TLB mode mm. */
+ if (active_mm)
+ drop_lazy_mm(active_mm);
}
EXPORT_SYMBOL_GPL(use_mm);
@@ -57,8 +61,11 @@ void unuse_mm(struct mm_struct *mm)
task_lock(tsk);
sync_mm_rss(mm);
tsk->mm = NULL;
- /* active_mm is still 'mm' */
+ /* active_mm is still 'mm'; grab it as a lazy TLB mm */
+ grab_lazy_mm(mm);
enter_lazy_tlb(mm, tsk);
+ /* drop the tsk->mm refcount */
+ mmdrop(mm);
task_unlock(tsk);
}
EXPORT_SYMBOL_GPL(unuse_mm);
--
2.14.4
When switching back from lazy TLB mode to a thread of the same process
that switched into lazy TLB mode, we still have the cr4 (and sometimes
LDT) of that process loaded, and there is no need to reload it.
When there was no TLB flush while the CPU was in lazy TLB mode, the
current code in switch_mm_irqs_off already avoids the reload, by
returning early.
However, when the TLB contents on the CPU are out of date, and we
flush the TLB for the task, we fall through to the regular context
switching code. This patch teaches that code to skip the cr4 and LDT
reloads when switching back to the same mm after a TLB flush.
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 671cc66df801..149fb64e4bf4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -367,8 +367,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, next);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
- load_mm_cr4(next);
- switch_ldt(real_prev, next);
+ if (next != real_prev) {
+ load_mm_cr4(next);
+ switch_ldt(real_prev, next);
+ }
}
/*
--
2.14.4
Introduce a variant of on_each_cpu_cond that iterates only over the
CPUs in a cpumask, in order to avoid making callbacks for every single
CPU in the system when we only need to test a subset.
Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/smp.h | 4 ++++
kernel/smp.c | 17 +++++++++++++----
kernel/up.c | 14 +++++++++++---
3 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 9fb239e12b82..a56f08ff3097 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -53,6 +53,10 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags);
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags, const struct cpumask *mask);
+
int smp_call_function_single_async(int cpu, call_single_data_t *csd);
#ifdef CONFIG_SMP
diff --git a/kernel/smp.c b/kernel/smp.c
index 2dbc842dd385..f4cf1b0bb3b8 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -667,9 +667,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
* You must not call this function with disabled interrupts or
* from a hardware interrupt handler or from a bottom half handler.
*/
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
- gfp_t gfp_flags)
+ gfp_t gfp_flags, const struct cpumask *mask)
{
cpumask_var_t cpus;
int cpu, ret;
@@ -678,7 +678,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
preempt_disable();
- for_each_online_cpu(cpu)
+ for_each_cpu(cpu, mask)
if (cond_func(cpu, info))
__cpumask_set_cpu(cpu, cpus);
on_each_cpu_mask(cpus, func, info, wait);
@@ -690,7 +690,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
* just have to IPI them one by one.
*/
preempt_disable();
- for_each_online_cpu(cpu)
+ for_each_cpu(cpu, mask)
if (cond_func(cpu, info)) {
ret = smp_call_function_single(cpu, func,
info, wait);
@@ -699,6 +699,15 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
preempt_enable();
}
}
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags)
+{
+ on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags,
+ cpu_online_mask);
+}
EXPORT_SYMBOL(on_each_cpu_cond);
static void do_nothing(void *unused)
diff --git a/kernel/up.c b/kernel/up.c
index 42c46bf3e0a5..ff536f9cc8a2 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -68,9 +68,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
* Preemption is disabled here to make sure the cond_func is called under the
* same condtions in UP and SMP.
*/
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
- smp_call_func_t func, void *info, bool wait,
- gfp_t gfp_flags)
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags, const struct cpumask *mask)
{
unsigned long flags;
@@ -82,6 +82,14 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
}
preempt_enable();
}
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags)
+{
+ on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags, NULL);
+}
EXPORT_SYMBOL(on_each_cpu_cond);
int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
--
2.14.4
The function leave_mm does not use its cpu argument, but always works on
the CPU where it is called. Change the argument to a void *, so leave_mm
can be passed directly to on_each_cpu_mask, and stop looking up the
CPU number in current leave_mm callers.
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/mmu.h | 2 +-
arch/x86/mm/tlb.c | 2 +-
arch/x86/xen/mmu_pv.c | 2 +-
drivers/acpi/processor_idle.c | 2 +-
4 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5ff3e8af2c20..ec27966f338f 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -61,6 +61,6 @@ typedef struct {
.ctx_id = 1, \
}
-void leave_mm(int cpu);
+void leave_mm(void * dummy);
#endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 149fb64e4bf4..ea4ef5ceaba2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -121,7 +121,7 @@ static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
write_cr3(new_mm_cr3);
}
-void leave_mm(int cpu)
+void leave_mm(void *dummy)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 52206ad81e4b..3402d2bf4ae0 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -984,7 +984,7 @@ static void drop_mm_ref_this_cpu(void *info)
struct mm_struct *mm = info;
if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm)
- leave_mm(smp_processor_id());
+ leave_mm(NULL);
/*
* If this cpu still has a stale cr3 reference, then make sure
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index abb559cd28d7..675ffacdd82b 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -714,7 +714,7 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
static void acpi_idle_enter_bm(struct acpi_processor *pr,
struct acpi_processor_cx *cx, bool timer_bc)
{
- acpi_unlazy_tlb(smp_processor_id());
+ acpi_unlazy_tlb(NULL);
/*
* Must be done before busmaster disable as we might need to
--
2.14.4
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> Introduce a variant of on_each_cpu_cond that iterates only over the
> CPUs in a cpumask, in order to avoid making callbacks for every single
> CPU in the system when we only need to test a subset.
Nice.
Although, if you want to be really fancy, you could optimize this (or
add a variant) that does the callback on the local CPU in parallel
with the remote ones. That would give a small boost to TLB flushes.
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> Instead of open coding bitmap magic, use on_each_cpu_cond
> to determine which CPUs to send TLB flush IPIs to.
>
> This might be a little bit slower than examining the bitmaps,
> but it should be a lot easier to maintain in the long run.
Looks good.
I assume it's not easy to get the remove-tables case to do a single
on_each_cpu_cond() instead of two? Currently it's doing the lazy ones
and the non-lazy ones separately.
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> The code in on_each_cpu_cond sets CPUs in a locally allocated bitmask,
> which should never be used by other CPUs simultaneously. There is no
> need to use locked memory accesses to set the bits in this bitmap.
>
> Switch to __cpumask_set_cpu.
Reviewed-by: Andy Lutomirski <[email protected]>
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> Clarify exactly what the memory barrier synchronizes with.
Reviewed-by: Andy Lutomirski <[email protected]>
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> Conditionally skip lazy TLB mm refcounting. When an architecture has
> CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> lazy TLB mode anywhere will get shot down from exit_mmap, and there
> in no need to incur the cache line bouncing overhead of refcounting
> a lazy TLB mm.
Unless I've misunderstood something, this patch results in idle tasks
whose active_mm has been freed still having active_mm pointing at
freed memory. This isn't strictly speaking a bug, but it's extremely
confusing and risks all kinds of nasty errors. That's why I prefer
the approach of actually removing the active_mm field on x86 rather
than merely removing the refcount.
I realize that this will add more ifdeffery and make the patch a bit
bigger, but I think it'll be much more robust. Not to mention that it
will save a pointer and a refcount per mm_struct, but that barely
matters.
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> When switching back from lazy TLB mode to a thread of the same process
> that switched into lazy TLB mode, we still have the cr4 (and sometimes
> LDT) of that process loaded, and there is no need to reload it.
>
> When there was no TLB flush while the CPU was in lazy TLB mode, the
> current code in switch_mm_irqs_off already avoids the reload, by
> returning early.
>
> However, when the TLB contents on the CPU are out of date, and we
> flush the TLB for the task, we fall through to the regular context
> switching code. This patch teaches that code to skip the cr4 and LDT
> flushes when switching back to the same mm after a flush.
Acked-by: Andy Lutomirski <[email protected]>
On Sat, 2018-07-28 at 19:57 -0700, Andy Lutomirski wrote:
> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> wrote:
> > Introduce a variant of on_each_cpu_cond that iterates only over the
> > CPUs in a cpumask, in order to avoid making callbacks for every
> > single
> > CPU in the system when we only need to test a subset.
>
> Nice.
>
> Although, if you want to be really fancy, you could optimize this (or
> add a variant) that does the callback on the local CPU in parallel
> with the remote ones. That would give a small boost to TLB flushes.
The test_func callbacks are not run remotely, but on
the local CPU, before deciding who to send callbacks
to.
The actual IPIs are sent in parallel, if the cpumask
allocation succeeds (it always should in many kernel
configurations, and almost always in the rest).
--
All Rights Reversed.
On Sat, 2018-07-28 at 19:58 -0700, Andy Lutomirski wrote:
> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> wrote:
> > Instead of open coding bitmap magic, use on_each_cpu_cond
> > to determine which CPUs to send TLB flush IPIs to.
> >
> > This might be a little bit slower than examining the bitmaps,
> > but it should be a lot easier to maintain in the long run.
>
> Looks good.
>
> i assume it's not easy to get the remove-tables case to do a single
> on_each_cpu_cond() instead of two? Currently it's doing the lazy
> ones and the non-lazy ones separately.
Indeed. The TLB gather batch size means we need to send
IPIs to the non-lazy CPUs whenever we have gathered so
many pages that our tlb_gather data structure is full.
This could result in many IPIs during a large munmap.
The lazy CPUs get one IPI before page table freeing.
--
All Rights Reversed.
On Sat, 2018-07-28 at 21:21 -0700, Andy Lutomirski wrote:
> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> wrote:
> > Conditionally skip lazy TLB mm refcounting. When an architecture
> > has
> > CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> > lazy TLB mode anywhere will get shot down from exit_mmap, and there
> > in no need to incur the cache line bouncing overhead of refcounting
> > a lazy TLB mm.
>
> Unless I've misunderstood something, this patch results in idle tasks
> whose active_mm has been freed still having active_mm pointing at
> freed memory.
Patch 9/10 is supposed to ensure that the lazy TLB CPUs get
switched to init_mm before an mm is freed. No CPU should ever
have its active_mm pointing at a freed mm.
Your message made me re-read the code, and now I realize that
leave_mm does not actually do that.
Looking at the other callers of leave_mm, I might not be the
only one surprised by that; xen_drop_mm_ref comes to mind.
I guess I should add some code to leave_mm to have it actually
clear active_mm and call the conditional refcount drop helper
function.
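Probably something along these lines at the end of leave_mm (untested
sketch, using the conditional refcount helper from the last patch):

        switch_mm(NULL, &init_mm, NULL);
        /* Really drop the reference to the lazy TLB mm. */
        current->active_mm = &init_mm;
        drop_lazy_mm(loaded_mm);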
Does that clear up the confusion?
--
All Rights Reversed.
> On Jul 29, 2018, at 5:11 AM, Rik van Riel <[email protected]> wrote:
>
>> On Sat, 2018-07-28 at 21:21 -0700, Andy Lutomirski wrote:
>> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
>> wrote:
>>> Conditionally skip lazy TLB mm refcounting. When an architecture
>>> has
>>> CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
>>> lazy TLB mode anywhere will get shot down from exit_mmap, and there
>>> in no need to incur the cache line bouncing overhead of refcounting
>>> a lazy TLB mm.
>>
>> Unless I've misunderstood something, this patch results in idle tasks
>> whose active_mm has been freed still having active_mm pointing at
>> freed memory.
>
> Patch 9/10 is supposed to ensure that the lazy TLB CPUs get
> switched to init_mm before an mm is freed. No CPU should ever
> have its active_mm pointing at a freed mm.
>
> Your message made me re-read the code, and now I realize that
> leave_mm does not actually do that.
>
> Looking at the other callers of leave_mm, I might not be the
> only one surprised by that; xen_drop_mm_ref comes to mind.
>
> I guess I should some code to leave_mm to have it actually
> clear active_mm and call the conditional refcount drop helper
> function.
>
> Does that clear up the confusion?
>
Kind of. But what’s the point of keeping active_mm? On architectures that opt in to the new mode, there won’t be any code that cares about its value. What’s the benefit of keeping it around? If you ifdef it out, then it can’t possibly point to freed memory, and there’s nothing to worry about.
>> On Jul 29, 2018, at 5:00 AM, Rik van Riel <[email protected]> wrote:
>>
>> On Sat, 2018-07-28 at 19:57 -0700, Andy Lutomirski wrote:
>> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
>> wrote:
>>> Introduce a variant of on_each_cpu_cond that iterates only over the
>>> CPUs in a cpumask, in order to avoid making callbacks for every
>>> single
>>> CPU in the system when we only need to test a subset.
>>
>> Nice.
>>
>> Although, if you want to be really fancy, you could optimize this (or
>> add a variant) that does the callback on the local CPU in parallel
>> with the remote ones. That would give a small boost to TLB flushes.
>
> The test_func callbacks are not run remotely, but on
> the local CPU, before deciding who to send callbacks
> to.
>
> The actual IPIs are sent in parallel, if the cpumask
> allocation succeeds (it always should in many kernel
> configurations, and almost always in the rest).
What I meant is that on_each_cpu_mask does:
        smp_call_function_many(mask, func, info, wait);
        if (cpumask_test_cpu(cpu, mask)) {
                unsigned long flags;

                local_irq_save(flags);
                func(info);
                local_irq_restore(flags);
        }
So it IPIs all the remote CPUs in parallel, then waits, then does the local work. In principle, the local flush could be done after triggering the IPIs but before they all finish.
> --
> All Rights Reversed.
On Sun, 2018-07-29 at 08:29 -0700, Andy Lutomirski wrote:
> > On Jul 29, 2018, at 5:11 AM, Rik van Riel <[email protected]> wrote:
> >
> > > On Sat, 2018-07-28 at 21:21 -0700, Andy Lutomirski wrote:
> > > On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> > > wrote:
> > > > Conditionally skip lazy TLB mm refcounting. When an
> > > > architecture
> > > > has
> > > > CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is
> > > > used in
> > > > lazy TLB mode anywhere will get shot down from exit_mmap, and
> > > > there
> > > > in no need to incur the cache line bouncing overhead of
> > > > refcounting
> > > > a lazy TLB mm.
> > >
> > > Unless I've misunderstood something, this patch results in idle
> > > tasks
> > > whose active_mm has been freed still having active_mm pointing at
> > > freed memory.
> >
> > Patch 9/10 is supposed to ensure that the lazy TLB CPUs get
> > switched to init_mm before an mm is freed. No CPU should ever
> > have its active_mm pointing at a freed mm.
> >
> > Your message made me re-read the code, and now I realize that
> > leave_mm does not actually do that.
> >
> > Looking at the other callers of leave_mm, I might not be the
> > only one surprised by that; xen_drop_mm_ref comes to mind.
> >
> > I guess I should some code to leave_mm to have it actually
> > clear active_mm and call the conditional refcount drop helper
> > function.
> >
> > Does that clear up the confusion?
>
> Kind of. But what’s the point of keeping active_mm? On architectures
> that opt in to the new mode, there won’t be any code that cares about
> it’s value. What’s the benefit of keeping it around? If you ifdef
> it out, then it can’t possibly point to freed memory, and there’s
> nothing to worry about.
I would like to get to that point, but in a way that does
not leave the code too difficult to follow.
Getting rid of ->active_mm in context_switch() is straightforward,
but I am not sure at all what to do about idle_task_exit() for
example.
All the subtleties I ran into just with this phase of the
code suggest (to me at least) that we should probably do
this one step at a time.
I agree on the same end goal, though :)
--
All Rights Reversed.
On Sun, 2018-07-29 at 08:36 -0700, Andy Lutomirski wrote:
> On Jul 29, 2018, at 5:00 AM, Rik van Riel <[email protected]> wrote:
>
> > On Sat, 2018-07-28 at 19:57 -0700, Andy Lutomirski wrote:
> > > On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> > > wrote:
> > > > Introduce a variant of on_each_cpu_cond that iterates only over
> > > > the
> > > > CPUs in a cpumask, in order to avoid making callbacks for every
> > > > single
> > > > CPU in the system when we only need to test a subset.
> > > Nice.
> > > Although, if you want to be really fancy, you could optimize this
> > > (or
> > > add a variant) that does the callback on the local CPU in
> > > parallel
> > > with the remote ones. That would give a small boost to TLB
> > > flushes.
> >
> > The test_func callbacks are not run remotely, but on
> > the local CPU, before deciding who to send callbacks
> > to.
> >
> > The actual IPIs are sent in parallel, if the cpumask
> > allocation succeeds (it always should in many kernel
> > configurations, and almost always in the rest).
> >
>
> What I meant is that on_each_cpu_mask does:
>
> smp_call_function_many(mask, func, info, wait);
> if (cpumask_test_cpu(cpu, mask)) {
> unsigned long flags;
> local_irq_save(flags); func(info);
> local_irq_restore(flags);
> }
>
> So it IPIs all the remote CPUs in parallel, then waits, then does the
> local work. In principle, the local flush could be done after
> triggering the IPIs but before they all finish.
Sure, moving the function call for the local CPU
into smp_call_function_many might be a nice optimization.
A quick grep suggests it would touch stuff all over the tree,
so it could be a nice Outreachy intern project :)
--
All Rights Reversed.
On Sun, 2018-07-29 at 08:36 -0700, Andy Lutomirski wrote:
> On Jul 29, 2018, at 5:00 AM, Rik van Riel <[email protected]> wrote:
>
> > On Sat, 2018-07-28 at 19:57 -0700, Andy Lutomirski wrote:
> > > On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> > > wrote:
> > > > Introduce a variant of on_each_cpu_cond that iterates only over
> > > > the
> > > > CPUs in a cpumask, in order to avoid making callbacks for every
> > > > single
> > > > CPU in the system when we only need to test a subset.
> > > Nice.
> > > Although, if you want to be really fancy, you could optimize this
> > > (or
> > > add a variant) that does the callback on the local CPU in
> > > parallel
> > > with the remote ones. That would give a small boost to TLB
> > > flushes.
> >
> > The test_func callbacks are not run remotely, but on
> > the local CPU, before deciding who to send callbacks
> > to.
> >
> > The actual IPIs are sent in parallel, if the cpumask
> > allocation succeeds (it always should in many kernel
> > configurations, and almost always in the rest).
> >
>
> What I meant is that on_each_cpu_mask does:
>
> smp_call_function_many(mask, func, info, wait);
> if (cpumask_test_cpu(cpu, mask)) {
> unsigned long flags;
> local_irq_save(flags); func(info);
> local_irq_restore(flags);
> }
>
> So it IPIs all the remote CPUs in parallel, then waits, then does the
> local work. In principle, the local flush could be done after
> triggering the IPIs but before they all finish.
Grepping around the code, I found a few examples where the
calling code appears to expect that smp_call_function_many
also calls "func" on the local CPU.
For example, kvm_emulate_wbinvd_noskip has this:
        if (kvm_x86_ops->has_wbinvd_exit()) {
                int cpu = get_cpu();

                cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
                smp_call_function_many(vcpu->arch.wbinvd_dirty_mask,
                                wbinvd_ipi, NULL, 1);
                put_cpu();
                cpumask_clear(vcpu->arch.wbinvd_dirty_mask);
        } else
                wbinvd();
This seems to result in systems with ->has_wbinvd_exit
only calling wbinvd_ipi on OTHER CPUs, and not on the
CPU where the guest exited with wbinvd?
This seems unintended.
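One possible fix (untested sketch, not part of this series) would be to
use on_each_cpu_mask, which also runs the function on the local CPU when
that CPU is set in the mask:

        if (kvm_x86_ops->has_wbinvd_exit()) {
                int cpu = get_cpu();

                cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
                on_each_cpu_mask(vcpu->arch.wbinvd_dirty_mask,
                                wbinvd_ipi, NULL, 1);
                put_cpu();
                cpumask_clear(vcpu->arch.wbinvd_dirty_mask);
        } else
                wbinvd();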
I guess looking into on_each_cpu_mask might be a little
higher priority than waiting until the next Outreachy
season :)
--
All Rights Reversed.
> On Jul 29, 2018, at 10:51 AM, Rik van Riel <[email protected]> wrote:
>
>> On Sun, 2018-07-29 at 08:36 -0700, Andy Lutomirski wrote:
>>> On Jul 29, 2018, at 5:00 AM, Rik van Riel <[email protected]> wrote:
>>>
>>>> On Sat, 2018-07-28 at 19:57 -0700, Andy Lutomirski wrote:
>>>> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
>>>> wrote:
>>>>> Introduce a variant of on_each_cpu_cond that iterates only over
>>>>> the
>>>>> CPUs in a cpumask, in order to avoid making callbacks for every
>>>>> single
>>>>> CPU in the system when we only need to test a subset.
>>>> Nice.
>>>> Although, if you want to be really fancy, you could optimize this
>>>> (or
>>>> add a variant) that does the callback on the local CPU in
>>>> parallel
>>>> with the remote ones. That would give a small boost to TLB
>>>> flushes.
>>>
>>> The test_func callbacks are not run remotely, but on
>>> the local CPU, before deciding who to send callbacks
>>> to.
>>>
>>> The actual IPIs are sent in parallel, if the cpumask
>>> allocation succeeds (it always should in many kernel
>>> configurations, and almost always in the rest).
>>>
>>
>> What I meant is that on_each_cpu_mask does:
>>
>> smp_call_function_many(mask, func, info, wait);
>> if (cpumask_test_cpu(cpu, mask)) {
>> unsigned long flags;
>> local_irq_save(flags); func(info);
>> local_irq_restore(flags);
>> }
>>
>> So it IPIs all the remote CPUs in parallel, then waits, then does the
>> local work. In principle, the local flush could be done after
>> triggering the IPIs but before they all finish.
>
> Grepping around the code, I found a few examples where the
> calling code appears to expect that smp_call_function_many
> also calls "func" on the local CPU.
>
> For example, kvm_emulate_wbinvd_noskip has this:
>
> if (kvm_x86_ops->has_wbinvd_exit()) {
> int cpu = get_cpu();
>
> cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
> smp_call_function_many(vcpu->arch.wbinvd_dirty_mask,
> wbinvd_ipi, NULL, 1);
> put_cpu();
> cpumask_clear(vcpu->arch.wbinvd_dirty_mask);
> } else
> wbinvd();
>
> This seems to result in systems with ->has_wbinvd_exit
> only calling wbinvd_ipi on OTHER CPUs, and not on the
> CPU where the guest exited with wbinvd?
>
> This seems unintended.
>
> I guess looking into on_each_cpu_mask might be a little
> higher priority than waiting until the next Outreachy
> season :)
>
The right approach might be a tree-wide rename from smp_call_... to on_other_cpus_mask() or similar. The current naming and semantics are extremely confusing.
Linus, this is the kind of thing you seem to like taking outside the merge window. What do you think about a straight-up search-and-replace to rename the smp_call_... functions to exactly match the corresponding on_each_cpu functions except with “each” replaced with “other”?
On Sat, 28 Jul 2018 21:21:17 -0700
Andy Lutomirski <[email protected]> wrote:
> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> > Conditionally skip lazy TLB mm refcounting. When an architecture has
> > CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> > lazy TLB mode anywhere will get shot down from exit_mmap, and there
> > in no need to incur the cache line bouncing overhead of refcounting
> > a lazy TLB mm.
>
> Unless I've misunderstood something, this patch results in idle tasks
> whose active_mm has been freed still having active_mm pointing at
> freed memory.
Below (plus the next email) should fix the bug you pointed
out, in a somewhat non-invasive way. Patches have survived
a few simple tests on my test system; I have not thrown a
full load at them yet.
I would like to save the full rewrite to remove ->active_mm
for a later series, because this is already as much churn
as I am comfortable with for this code :)
---8<---
Author: Rik van Riel <[email protected]>
Subject: [PATCH 10/11] x86,tlb: really leave mm on shootdown
When getting an mm shot down from under us in lazy TLB mode, don't
just switch the TLB over to the init_mm page tables, but really drop
our references to the lazy TLB mm.
This allows for faster (instant) freeing of a lazy TLB mm, which is
a precondition to getting rid of the refcounting of mms in lazy TLB mode.
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7b1add904396..425cb9fa2640 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,8 @@ void leave_mm(void *dummy)
WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
switch_mm(NULL, &init_mm, NULL);
+ current->active_mm = &init_mm;
+ mmdrop(loaded_mm);
}
EXPORT_SYMBOL_GPL(leave_mm);
@@ -483,6 +485,8 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* IPIs to lazy TLB mode CPUs.
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
+ current->active_mm = &init_mm;
+ mmdrop(loaded_mm);
return;
}
--
2.14.4
On Sat, 28 Jul 2018 21:21:17 -0700
Andy Lutomirski <[email protected]> wrote:
> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> > Conditionally skip lazy TLB mm refcounting. When an architecture has
> > CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> > lazy TLB mode anywhere will get shot down from exit_mmap, and there
> > in no need to incur the cache line bouncing overhead of refcounting
> > a lazy TLB mm.
>
> Unless I've misunderstood something, this patch results in idle tasks
> whose active_mm has been freed still having active_mm pointing at
> freed memory.
---8<---
Author: Rik van Riel <[email protected]>
Subject: [PATCH 11/11] mm,sched: conditionally skip lazy TLB mm refcounting
Conditionally skip lazy TLB mm refcounting. When an architecture has
CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
lazy TLB mode anywhere will get shot down from exit_mmap, and there
is no need to incur the cache line bouncing overhead of refcounting
a lazy TLB mm.
Implement this by moving the refcounting of a lazy TLB mm to helper
functions, which skip the refcounting when it is not necessary.
Deal with use_mm and unuse_mm by fully splitting out the refcounting
of the lazy TLB mm a kernel thread may have when entering use_mm from
the refcounting of the mm that use_mm is about to start using.
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 5 +++--
fs/exec.c | 2 +-
include/linux/sched/mm.h | 25 +++++++++++++++++++++++++
kernel/sched/core.c | 6 +++---
mm/mmu_context.c | 21 ++++++++++++++-------
5 files changed, 46 insertions(+), 13 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 425cb9fa2640..d53d9c19b97d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,6 +8,7 @@
#include <linux/cpu.h>
#include <linux/debugfs.h>
#include <linux/gfp.h>
+#include <linux/sched/mm.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -141,7 +142,7 @@ void leave_mm(void *dummy)
switch_mm(NULL, &init_mm, NULL);
current->active_mm = &init_mm;
- mmdrop(loaded_mm);
+ drop_lazy_mm(loaded_mm);
}
EXPORT_SYMBOL_GPL(leave_mm);
@@ -486,7 +487,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
current->active_mm = &init_mm;
- mmdrop(loaded_mm);
+ drop_lazy_mm(loaded_mm);
return;
}
diff --git a/fs/exec.c b/fs/exec.c
index bdd0eacefdf5..7a6d4811b02b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1043,7 +1043,7 @@ static int exec_mmap(struct mm_struct *mm)
mmput(old_mm);
return 0;
}
- mmdrop(active_mm);
+ drop_lazy_mm(active_mm);
return 0;
}
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 44d356f5e47c..7308bf38012f 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,31 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
}
+/*
+ * In lazy TLB mode, a CPU keeps the mm of the last process mapped while
+ * running a kernel thread or idle; we must make sure the lazy TLB mm and
+ * page tables do not disappear while a lazy TLB mode CPU uses them.
+ * There are two ways to handle the race between lazy TLB CPUs and exit_mmap:
+ * 1) Have a lazy TLB CPU hold a refcount on the lazy TLB mm.
+ * 2) Have the architecture code shoot down the lazy TLB mm from exit_mmap;
+ * in that case, refcounting can be skipped, reducing cache line bouncing.
+ */
+static inline void grab_lazy_mm(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
+ return;
+
+ mmgrab(mm);
+}
+
+static inline void drop_lazy_mm(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
+ return;
+
+ mmdrop(mm);
+}
+
/**
* mmget() - Pin the address space associated with a &struct mm_struct.
* @mm: The address space to pin.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c45de46fdf10..11724c9e88b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2691,7 +2691,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
*/
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
- mmdrop(mm);
+ drop_lazy_mm(mm);
}
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
@@ -2805,7 +2805,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
*/
if (!mm) {
next->active_mm = oldmm;
- mmgrab(oldmm);
+ grab_lazy_mm(oldmm);
enter_lazy_tlb(oldmm, next);
} else
switch_mm_irqs_off(oldmm, mm, next);
@@ -5532,7 +5532,7 @@ void idle_task_exit(void)
current->active_mm = &init_mm;
finish_arch_post_lock_switch();
}
- mmdrop(mm);
+ drop_lazy_mm(mm);
}
/*
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3e612ae748e9..d5c2524cdd9a 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -24,12 +24,15 @@ void use_mm(struct mm_struct *mm)
struct mm_struct *active_mm;
struct task_struct *tsk = current;
+ /* Kernel threads have a NULL tsk->mm when entering. */
+ WARN_ON(tsk->mm);
+
task_lock(tsk);
+ /* Previous ->active_mm was held in lazy TLB mode. */
active_mm = tsk->active_mm;
- if (active_mm != mm) {
- mmgrab(mm);
- tsk->active_mm = mm;
- }
+ /* Grab mm for reals; tsk->mm needs to stick around until unuse_mm. */
+ mmgrab(mm);
+ tsk->active_mm = mm;
tsk->mm = mm;
switch_mm(active_mm, mm, tsk);
task_unlock(tsk);
@@ -37,8 +40,9 @@ void use_mm(struct mm_struct *mm)
finish_arch_post_lock_switch();
#endif
- if (active_mm != mm)
- mmdrop(active_mm);
+ /* Drop the lazy TLB mode mm. */
+ if (active_mm)
+ drop_lazy_mm(active_mm);
}
EXPORT_SYMBOL_GPL(use_mm);
@@ -57,8 +61,11 @@ void unuse_mm(struct mm_struct *mm)
task_lock(tsk);
sync_mm_rss(mm);
tsk->mm = NULL;
- /* active_mm is still 'mm' */
+ /* active_mm is still 'mm'; grab it as a lazy TLB mm */
+ grab_lazy_mm(mm);
enter_lazy_tlb(mm, tsk);
+ /* drop the tsk->mm refcount */
+ mmdrop(mm);
task_unlock(tsk);
}
EXPORT_SYMBOL_GPL(unuse_mm);
--
2.14.4
On Sun, Jul 29, 2018 at 11:55 AM Andy Lutomirski <[email protected]> wrote:
> > On Jul 29, 2018, at 10:51 AM, Rik van Riel <[email protected]> wrote:
> >
> > This seems to result in systems with ->has_wbinvd_exit
> > only calling wbinvd_ipi on OTHER CPUs, and not on the
> > CPU where the guest exited with wbinvd?
> >
> > This seems unintended.
> >
> > I guess looking into on_each_cpu_mask might be a little
> > higher priority than waiting until the next Outreachy
> > season :)
>
> The right approach might be a tree wise rename from smp_call_... to on_other_cpus_mask() it similar. The current naming and semantics are extremely confusing.
Ugh.
Renaming might be worth it, but at least one issue is that we are
simply not very consistent.
For example. smp_call_function_many() does indeed explicitly ignore
the current CPU.
But smp_call_function_any() (one "m" less) does _not_ ignore the
current CPU, and in fact prefers it.
So it's not that smp_call_... should *generally* be renamed. Only some
of the cases might be worth renaming.
And just a "rename and forget" isn't really great. As Rik's example
shows, existing users should be checked too..
Linus
On Sun, Jul 29, 2018 at 03:54:52PM -0400, Rik van Riel wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c45de46fdf10..11724c9e88b0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2691,7 +2691,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> */
> if (mm) {
> membarrier_mm_sync_core_before_usermode(mm);
> - mmdrop(mm);
> + drop_lazy_mm(mm);
> }
> if (unlikely(prev_state == TASK_DEAD)) {
> if (prev->sched_class->task_dead)
> @@ -2805,7 +2805,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> */
> if (!mm) {
> next->active_mm = oldmm;
> - mmgrab(oldmm);
> + grab_lazy_mm(oldmm);
> enter_lazy_tlb(oldmm, next);
> } else
> switch_mm_irqs_off(oldmm, mm, next);
What happened to the rework I did there? That not only avoided fiddling
with active_mm, but also avoids grab/drop cycles for the other
architectures when doing task->kthread->kthread->task things.
I agree with Andy that if you avoid the refcount fiddling, then you
should also not muck with active_mm.
That is, if you keep active_mm for now (which seems a reasonable first
step) then at least ensure you keep ->mm == ->active_mm at all times.
* Rik van Riel <[email protected]> wrote:
> This patch series implements the cleanups suggested by Peter and Andy,
> removes lazy TLB mm refcounting on x86, and shows how other architectures
> could implement that same optimization.
>
> The previous patch series already seems to have removed most of the
> cache line contention I was seeing at context switch time, so CPU use
> of the memcache and memcache-like workloads has not changed measurably
> with this patch series.
>
> However, the memory bandwidth used by the memcache system has been
> reduced by about 1%, to serve the same number of queries per second.
>
> This happens on two socket Haswell and Broadwell systems. Maybe on
> larger systems (4 or 8 socket) one might also see a measurable drop
> in the amount of CPU time used, with workloads where the previous
> patch series does not remove all cache line contention on the mm.
>
> This is against the latest -tip tree, and seems to be stable (on top
> of another tree) with workloads that do over a million context switches
> a second.
Just a quick logistics request: once all the review feedback from Andy and PeterZ
is sorted out, could you please (re-)send this series with the Reviewed-by
and Acked-by tags added?
If any patch is still under discussion then please leave it out from the next
series temporarily, so that I can just apply them all immediately to tip:x86/mm
before the next merge window opens.
( If the series reaches this state later today then don't hesitate to do a resend
with the tags added - I don't want to delay these improvements. )
Thanks!
Ingo
On Mon, 2018-07-30 at 11:55 +0200, Peter Zijlstra wrote:
> On Sun, Jul 29, 2018 at 03:54:52PM -0400, Rik van Riel wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index c45de46fdf10..11724c9e88b0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2691,7 +2691,7 @@ static struct rq *finish_task_switch(struct
> > task_struct *prev)
> > */
> > if (mm) {
> > membarrier_mm_sync_core_before_usermode(mm);
> > - mmdrop(mm);
> > + drop_lazy_mm(mm);
> > }
> > if (unlikely(prev_state == TASK_DEAD)) {
> > if (prev->sched_class->task_dead)
> > @@ -2805,7 +2805,7 @@ context_switch(struct rq *rq, struct
> > task_struct *prev,
> > */
> > if (!mm) {
> > next->active_mm = oldmm;
> > - mmgrab(oldmm);
> > + grab_lazy_mm(oldmm);
> > enter_lazy_tlb(oldmm, next);
> > } else
> > switch_mm_irqs_off(oldmm, mm, next);
>
> What happened to the rework I did there? That not only avoided
> fiddling
> with active_mm, but also avoids grab/drop cycles for the other
> architectures when doing task->kthread->kthread->task things.
I don't think I saw that. I only saw your email from
July 20th with this fragment of code, which does not
appear to avoid the grab/drop cycles, and still fiddles
with active_mm:
Date: Fri, 20 Jul 2018 11:32:39 +0200
From: Peter Zijlstra <[email protected]>
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
Message-ID: <[email protected]>
+ /*
+ * kernel -> kernel lazy + transfer active
+ * user -> kernel lazy + mmgrab() active
+ *
+ * kernel -> user switch + mmdrop() active
+ * user -> user switch
+ */
+ if (!next->mm) { // to kernel
+ enter_lazy_tlb(prev->active_mm, next);
+
+#ifdef ARCH_NO_ACTIVE_MM
+ next->active_mm = prev->active_mm;
+ if (prev->mm) // from user
+ mmgrab(prev->active_mm);
+#endif
+ } else { // to user
+ switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
+#ifdef ARCH_NO_ACTIVE_MM
+ if (!prev->mm) { // from kernel
+ /* will mmdrop() in finish_task_switch(). */
+ rq->prev_mm = prev->active_mm;
+ prev->active_mm = NULL;
+ }
+#endif
What email should I look for to find the thing you
referenced above?
> I agree with Andy that if you avoid the refcount fiddling, then you
> should also not muck with active_mm.
>
> That is, if you keep active_mm for now (which seems a reasonable
> first
> step) then at least ensure you keep ->mm == ->active_mm at all times.
There do not seem to be a lot of places left in
arch/x86/ that reference active_mm. I guess the
next patch series should excise those? :)
--
All Rights Reversed.
On Mon, Jul 30, 2018 at 10:30:11AM -0400, Rik van Riel wrote:
> > What happened to the rework I did there? That not only avoided
> > fiddling
> > with active_mm, but also avoids grab/drop cycles for the other
> > architectures when doing task->kthread->kthread->task things.
>
> I don't think I saw that. I only saw your email from
> July 20th with this fragment of code, which does not
> appear to avoid the grab/drop cycles, and still fiddles
> with active_mm:
Yeah, that's it. Note how it doesn't do a grab+drop for kernel->kernel,
where the current code would have.
And also note that it only fiddles with active_mm if it does the
grab+drop thing (the below should have s/ifdef/ifndef/ to make more
sense maybe).
So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and therefore
->active_mm == ->mm.
> + /*
> + * kernel -> kernel lazy + transfer active
> + * user -> kernel lazy + mmgrab() active
> + *
> + * kernel -> user switch + mmdrop() active
> + * user -> user switch
> + */
> + if (!next->mm) { // to kernel
> + enter_lazy_tlb(prev->active_mm, next);
> +
#ifndef ARCH_NO_ACTIVE_MM
> + next->active_mm = prev->active_mm;
> + if (prev->mm) // from user
> + mmgrab(prev->active_mm);
else
prev->active_mm = NULL;
> +#endif
> + } else { // to user
> + switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
#ifndef ARCH_NO_ACTIVE_MM
> + if (!prev->mm) { // from kernel
> + /* will mmdrop() in finish_task_switch(). */
> + rq->prev_mm = prev->active_mm;
> + prev->active_mm = NULL;
> + }
> +#endif
On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
> On Mon, Jul 30, 2018 at 10:30:11AM -0400, Rik van Riel wrote:
>
> > > What happened to the rework I did there? That not only avoided
> > > fiddling
> > > with active_mm, but also avoids grab/drop cycles for the other
> > > architectures when doing task->kthread->kthread->task things.
> >
> > I don't think I saw that. I only saw your email from
> > July 20th with this fragment of code, which does not
> > appear to avoid the grab/drop cycles, and still fiddles
> > with active_mm:
>
> Yeah, that's it. Note how it doesn't do a grab+drop for kernel-
> >kernel,
> where the current could would have.
>
> And also note that it only fiddles with active_mm if it does the
> grab+drop thing (the below should have s/ifdef/ifndef/ to make more
> sense maybe).
I'll kick off a test with your variant. I don't think we
will see any performance difference on x86 (due to not
using a refcount at all any more), but unless Ingo is in
a hurry I guess there's no issue rewriting this part of
the patch series :)
Do the other patches look ok to you and Andy?
> So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and therefore
> ->active_mm == ->mm.
>
> > + /*
> > + * kernel -> kernel lazy + transfer active
> > + * user -> kernel lazy + mmgrab() active
> > + *
> > + * kernel -> user switch + mmdrop() active
> > + * user -> user switch
> > + */
> > + if (!next->mm) { // to kernel
> > + enter_lazy_tlb(prev->active_mm, next);
> > +
>
> #ifndef ARCH_NO_ACTIVE_MM
> > + next->active_mm = prev->active_mm;
> > + if (prev->mm) // from user
> > + mmgrab(prev->active_mm);
>
> else
> prev->active_mm = NULL;
> > +#endif
> > + } else { // to user
> > + switch_mm_irqs_off(prev->active_mm, next->mm, next);
> > +
>
> #ifndef ARCH_NO_ACTIVE_MM
> > + if (!prev->mm) { // from kernel
> > + /* will mmdrop() in finish_task_switch(). */
> > + rq->prev_mm = prev->active_mm;
> > + prev->active_mm = NULL;
> > + }
> > +#endif
>
>
>
--
All Rights Reversed.
On Mon, Jul 30, 2018 at 12:15 PM, Rik van Riel <[email protected]> wrote:
> On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
>> On Mon, Jul 30, 2018 at 10:30:11AM -0400, Rik van Riel wrote:
>>
>> > > What happened to the rework I did there? That not only avoided
>> > > fiddling
>> > > with active_mm, but also avoids grab/drop cycles for the other
>> > > architectures when doing task->kthread->kthread->task things.
>> >
>> > I don't think I saw that. I only saw your email from
>> > July 20th with this fragment of code, which does not
>> > appear to avoid the grab/drop cycles, and still fiddles
>> > with active_mm:
>>
>> Yeah, that's it. Note how it doesn't do a grab+drop for kernel->kernel,
>> where the current code would have.
>>
>> And also note that it only fiddles with active_mm if it does the
>> grab+drop thing (the below should have s/ifdef/ifndef/ to make more
>> sense maybe).
>
> I'll kick off a test with your variant. I don't think we
> will see any performance difference on x86 (due to not
> using a refcount at all any more), but unless Ingo is in
> a hurry I guess there's no issue rewriting this part of
> the patch series :)
>
> Do the other patches look ok to you and Andy?
>
The whole series other than the active_mm stuff looked okay to me.
On Mon, 2018-07-30 at 12:30 -0700, Andy Lutomirski wrote:
> On Mon, Jul 30, 2018 at 12:15 PM, Rik van Riel <[email protected]>
> wrote:
> > On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
> > > On Mon, Jul 30, 2018 at 10:30:11AM -0400, Rik van Riel wrote:
> > >
> > > > > What happened to the rework I did there? That not only
> > > > > avoided
> > > > > fiddling
> > > > > with active_mm, but also avoids grab/drop cycles for the
> > > > > other
> > > > > architectures when doing task->kthread->kthread->task things.
> > > >
> > > > I don't think I saw that. I only saw your email from
> > > > July 20th with this fragment of code, which does not
> > > > appear to avoid the grab/drop cycles, and still fiddles
> > > > with active_mm:
> > >
> > > Yeah, that's it. Note how it doesn't do a grab+drop for kernel->kernel,
> > > where the current code would have.
> > >
> > > And also note that it only fiddles with active_mm if it does the
> > > grab+drop thing (the below should have s/ifdef/ifndef/ to make
> > > more
> > > sense maybe).
> >
> > I'll kick off a test with your variant. I don't think we
> > will see any performance difference on x86 (due to not
> > using a refcount at all any more), but unless Ingo is in
> > a hurry I guess there's no issue rewriting this part of
> > the patch series :)
> >
> > Do the other patches look ok to you and Andy?
> >
>
> The whole series other than the active_mm stuff looked okay to me.
Does the active_mm stuff look like a step in the right
direction with the bugfix, or would you prefer the code
to go in an entirely different direction?
If this looks like a step in the right direction, it
may make sense to make this step before the merge window
opens, and continue with more patches in this direction
later.
--
All Rights Reversed.
On Mon, Jul 30, 2018 at 12:36 PM, Rik van Riel <[email protected]> wrote:
> On Mon, 2018-07-30 at 12:30 -0700, Andy Lutomirski wrote:
>> On Mon, Jul 30, 2018 at 12:15 PM, Rik van Riel <[email protected]>
>> wrote:
>> > On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
>> > > On Mon, Jul 30, 2018 at 10:30:11AM -0400, Rik van Riel wrote:
>> > >
>> > > > > What happened to the rework I did there? That not only
>> > > > > avoided
>> > > > > fiddling
>> > > > > with active_mm, but also avoids grab/drop cycles for the
>> > > > > other
>> > > > > architectures when doing task->kthread->kthread->task things.
>> > > >
>> > > > I don't think I saw that. I only saw your email from
>> > > > July 20th with this fragment of code, which does not
>> > > > appear to avoid the grab/drop cycles, and still fiddles
>> > > > with active_mm:
>> > >
>> > > Yeah, that's it. Note how it doesn't do a grab+drop for kernel->kernel,
>> > > where the current code would have.
>> > >
>> > > And also note that it only fiddles with active_mm if it does the
>> > > grab+drop thing (the below should have s/ifdef/ifndef/ to make
>> > > more
>> > > sense maybe).
>> >
>> > I'll kick off a test with your variant. I don't think we
>> > will see any performance difference on x86 (due to not
>> > using a refcount at all any more), but unless Ingo is in
>> > a hurry I guess there's no issue rewriting this part of
>> > the patch series :)
>> >
>> > Do the other patches look ok to you and Andy?
>> >
>>
>> The whole series other than the active_mm stuff looked okay to me.
>
> Does the active_mm stuff look like a step in the right
> direction with the bugfix, or would you prefer the code
> to go in an entirely different direction?
I think it's a big step in the right direction, but it still makes me
nervous. I'd be more comfortable with it if you at least had a
functional set of patches that result in active_mm being gone, because
that will mean that you actually audited the whole mess and fixed
anything that might rely on active_mm pointing somewhere or that might
be putting a value you didn't take into account into active_mm. IOW
I'm not totally thrilled by applying the patches as is if we're still
a bit unsure as to what might have gotten missed.
I don't think it's at all necessary to redo the patches.
Does that seem reasonable?
>
> If this looks like a step in the right direction, it
> may make sense to make this step before the merge window
> opens, and continue with more patches in this direction
> later.
>
> --
> All Rights Reversed.
On Mon, 2018-07-30 at 12:49 -0700, Andy Lutomirski wrote:
>
> I think it's a big step in the right direction, but it still makes me
> nervous. I'd be more comfortable with it if you at least had a
> functional set of patches that result in active_mm being gone,
> because
> that will mean that you actually audited the whole mess and fixed
> anything that might rely on active_mm pointing somewhere or that
> might
> be putting a value you didn't take into account into active_mm. IOW
> I'm not totally thrilled by applying the patches as is if we're still
> a bit unsure as to what might have gotten missed.
>
> I don't think it's at all necessary to redo the patches.
>
> Does that seem reasonable?
Absolutely. I tried to keep ->active_mm very similar
to before for exactly that reason.
Let's go through all the places where it is used, in
x86 and architecture independent code. I have not
checked other architectures.
It looks like we should be able to get rid of
->active_mm at some point, but a lot of it depends
on other architecture maintainers.
arch/x86/events/core.c:
- get_segment_base: get current->active_mm->context.ldt,
this appears to be for TIF_IA32 user programs only, so
we should be able to use current->mm here
arch/x86/kernel/cpu/common.c:
- current task's ->active_mm assigned in two places,
never read
arch/x86/lib/insn-eval.c:
- get_desc() gets current->active_mm->context.ldt, this
appears to be only for user space programs
arch/x86/mm/tlb.c:
- this series adds two places where current->active_mm is
written, it is never read
arch/x86/platform/efi/efi_64.c:
- current->active_mm is set to efi_mm for a little bit,
with irqs disabled, and then changed back, with irqs still
disabled; we should be able to get rid of ->active_mm here
- in the init code, ->active_mm is set to efi_mm as well,
presumably the kernel automatically switches that back on
the next context switch; this may be buggy, since preemption
is enabled and a GFP_KERNEL allocation is just a few lines
below
arch/x86/power/cpu.c:
- fix_processor_context() calls load_mm_ldt(current->active_mm);
we should be able to use cpu_tlbstate.loaded_mm instead
drivers/cpufreq/pmac32-cpufreq.c:
- pmu_set_cpu_speed() restores current->active_mm - don't know if
anyone still cares about 32 bit PPC :)
drivers/firmware/efi/arm-runtime.c:
- efi_virtmap_unload switches back the pgd to current->active_mm
from &efi_mm; that mm could be stored elsewhere if we excised
->active_mm everywhere
drivers/macintosh/via-pmu.c:
- same deal as pmac32-cpufreq.c above
mm/mmu_context.c:
- use_mm() tracks the ->active_mm a kernel thread is pointing to,
but the mm is also tracked in ->mm (simplified sketch after this list)
- unuse_mm() is the same deal as use_mm(); we should be able to
get rid of ->active_mm here once everybody stops using it and we
no longer refcount it anywhere
init/init_task.c:
- init_task.active_mm = &init_mm
fs/exec.c:
- exec_mmap() juggles both ->mm and ->active_mm, in order to
get refcounting right; without refcounting we can lose ->active_mm
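For reference, the use_mm() pattern mentioned above looks roughly like
this (simplified and from memory, so details may differ from the actual
mm/mmu_context.c):

void use_mm(struct mm_struct *mm)
{
	struct mm_struct *active_mm;
	struct task_struct *tsk = current;

	task_lock(tsk);
	active_mm = tsk->active_mm;
	if (active_mm != mm) {
		mmgrab(mm);			/* pin the new mm */
		tsk->active_mm = mm;
	}
	tsk->mm = mm;
	switch_mm(active_mm, mm, tsk);
	task_unlock(tsk);

	if (active_mm != mm)
		mmdrop(active_mm);		/* drop the old lazy reference */
}

Getting rid of ->active_mm here means remembering the old mm some other
way, or not needing to remember it at all once nothing refcounts lazy
references any more.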
--
All Rights Reversed.
> On Jul 30, 2018, at 2:46 PM, Rik van Riel <[email protected]> wrote:
>
>> On Mon, 2018-07-30 at 12:49 -0700, Andy Lutomirski wrote:
>>
>>
>> I think it's a big step in the right direction, but it still makes me
>> nervous. I'd be more comfortable with it if you at least had a
>> functional set of patches that result in active_mm being gone,
>> because
>> that will mean that you actually audited the whole mess and fixed
>> anything that might rely on active_mm pointing somewhere or that
>> might
>> be putting a value you didn't take into account into active_mm. IOW
>> I'm not totally thrilled by applying the patches as is if we're still
>> a bit unsure as to what might have gotten missed.
>>
>> I don't think it's at all necessary to redo the patches.
>>
>> Does that seem reasonable?
>
> Absolutely. I tried to keep ->active_mm very similar
> to before for exactly that reason.
>
> Let's go through all the places where it is used, in
> x86 and architecture independent code. I have not
> checked other architectures.
>
> It looks like we should be able to get rid of
> ->active_mm at some point, but a lot of it depends
> on other architecture maintainers.
>
>
> arch/x86/events/core.c:
> - get_segment_base: get current->active_mm->context.ldt,
> this appears to be for TIF_IA32 user programs only, so
> we should be able to use current->mm here
->mm sounds more correct anyway
>
> arch/x86/kernel/cpu/common.c:
> - current task's ->active_mm assigned in two places,
> never read
>
> arch/x86/lib/insn-eval.c:
> - get_desc() gets current->active_mm->context.ldt, this
> appears to be only for user space programs
Same as above
>
> arch/x86/mm/tlb.c:
> - this series adds two places where current->active_mm is
> written, it is never read
>
> arch/x86/platform/efi/efi_64.c:
> - current->active_mm is set to efi_mm for a little bit,
> with irqs disabled, and then changed back, with irqs still
> disabled; we should be able to get rid of ->active_mm here
> - in the init code, ->active_mm is set to efi_mm as well,
> presumably the kernel automatically switches that back on
> the next context switch; this may be buggy, since preemption
> is enabled and a GFP_KERNEL allocation is just a few lines
> below
Ick. This should mostly go away soon — most EFI code will move to a real thread.
>
> arch/x86/power/cpu.c:
> - fix_processor_context() calls load_mm_ldt(current->active_mm);
> we should be able to use cpu_tlbstate.loaded_mm instead
Agreed
>
> drivers/cpufreq/pmac32-cpufreq.c:
> - pmu_set_cpu_speed() restores current->active_mm - don't know if
> anyone still cares about 32 bit PPC :)
>
I think we should only remove active_mm when the new #define is set. So this doesn’t need to change.
> drivers/firmware/efi/arm-runtime.c:
> - efi_virtmap_unload switches back the pgd to current->active_mm
> from &efi_mm; that mm could be stored elsewhere if we excised
> ->active_mm everywhere
Ditto
IOW the active_mm refcounting may be genuinely useful for architectures that are not able to efficiently shoot down remote lazy mm references in exit_mmap(). I suspect that ARM64 may be in that category.
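That is, to drop the refcounting an architecture needs to be able to do
something like the following from exit_mmap() (completely hand-waved;
the per-cpu variable and the switch-away helper here are made up for
illustration):

/* hypothetical: which mm each CPU is lazily using */
static DEFINE_PER_CPU(struct mm_struct *, cpu_lazy_mm);

static void kick_lazy_mm(void *info)
{
	struct mm_struct *mm = info;

	if (this_cpu_read(cpu_lazy_mm) != mm)
		return;

	/* hypothetical helper: switch this CPU to init_mm so it stops
	 * referencing mm's page tables */
	unuse_lazy_mm();
	this_cpu_write(cpu_lazy_mm, &init_mm);
}

static void shoot_down_lazy_mm(struct mm_struct *mm)
{
	/* IPI every CPU that might still hold a lazy reference to mm */
	on_each_cpu_mask(mm_cpumask(mm), kick_lazy_mm, mm, 1);
}

An architecture that cannot do that cheaply is probably better off
keeping the refcount.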
On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
>
> So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and therefore
> ->active_mm == ->mm.
Close, but not true for kernel threads, which have a
NULL ->mm, but a non-null ->active_mm that gets passed
to enter_lazy_tlb().
I stuck to the structure of your code, but ended up
removing all the ifdefs because the final mmdrop requires
that we actually keep track of the ->active_mm across
potentially several kernel->kernel context switches.
Ifdefs around the reference counting code are also not
needed, because grab_lazy_mm and drop_lazy_mm already
contain the equivalent of an ifdef themselves.
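To illustrate, the helpers look roughly like this (a sketch of the
shape, not the exact code from the series):

static inline void grab_lazy_mm(struct mm_struct *mm)
{
#ifndef ARCH_NO_ACTIVE_MM
	mmgrab(mm);
#endif
}

static inline void drop_lazy_mm(struct mm_struct *mm)
{
#ifndef ARCH_NO_ACTIVE_MM
	mmdrop(mm);
#endif
}

so the call sites in the context switch code can stay unconditional.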
By morning I should know what the test results look
like. I expect they will be identical to my patches,
since the refcounting is disabled completely anyway :)
> > + /*
> > + * kernel -> kernel lazy + transfer active
> > + * user -> kernel lazy + mmgrab() active
> > + *
> > + * kernel -> user switch + mmdrop() active
> > + * user -> user switch
> > + */
> > + if (!next->mm) { // to kernel
> > + enter_lazy_tlb(prev->active_mm, next);
> > +
>
> #ifndef ARCH_NO_ACTIVE_MM
> > + next->active_mm = prev->active_mm;
> > + if (prev->mm) // from user
> > + mmgrab(prev->active_mm);
>
> else
> prev->active_mm = NULL;
> > +#endif
> > + } else { // to user
> > + switch_mm_irqs_off(prev->active_mm, next->mm, next);
> > +
>
> #ifndef ARCH_NO_ACTIVE_MM
> > + if (!prev->mm) { // from kernel
> > + /* will mmdrop() in finish_task_switch(). */
> > + rq->prev_mm = prev->active_mm;
> > + prev->active_mm = NULL;
> > + }
> > +#endif
>
>
--
All Rights Reversed.
On Mon, Jul 30, 2018 at 09:05:55PM -0400, Rik van Riel wrote:
> On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
> >
> > So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and therefore
> > ->active_mm == ->mm.
>
> Close, but not true for kernel threads, which have a
> NULL ->mm, but a non-null ->active_mm that gets passed
> to enter_lazy_tlb().
I'm confused on the need for this. We mark the CPU lazy, why do we still
care about this?
> On Jul 31, 2018, at 2:12 AM, Peter Zijlstra <[email protected]> wrote:
>
>> On Mon, Jul 30, 2018 at 09:05:55PM -0400, Rik van Riel wrote:
>>> On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
>>>
>>> So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and therefore
>>> ->active_mm == ->mm.
>>
>> Close, but not true for kernel threads, which have a
>> NULL ->mm, but a non-null ->active_mm that gets passed
>> to enter_lazy_tlb().
>
> I'm confused on the need for this. We mark the CPU lazy, why do we still
> care about this?
I have considered renaming enter_lazy_tlb() to something like lazy_switch_to_kernel_mm() (or an irqs_off variant) and making it take no parameters or maybe just task pointer parameters.
On Tue, 2018-07-31 at 07:29 -0700, Andy Lutomirski wrote:
> > On Jul 31, 2018, at 2:12 AM, Peter Zijlstra <[email protected]>
> > wrote:
> >
> > > On Mon, Jul 30, 2018 at 09:05:55PM -0400, Rik van Riel wrote:
> > > > On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
> > > >
> > > > So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and
> > > > therefore
> > > > ->active_mm == ->mm.
> > >
> > > Close, but not true for kernel threads, which have a
> > > NULL ->mm, but a non-null ->active_mm that gets passed
> > > to enter_lazy_tlb().
> >
> > I'm confused on the need for this. We mark the CPU lazy, why do we
> > still
> > care about this?
>
> I have considered renaming enter_lazy_tlb() to something like
> lazy_switch_to_kernel_mm() (or an irqs_off variant) and making it
> take no parameters or maybe just task pointer parameters.
Of all the architectures, only Alpha uses the "mm"
parameter to enter_lazy_tlb.
It uses it to store the pgd address of the current
active_mm context into the thread info of the next
task.
If it can get that from a per CPU variable, we can
get rid of the mm parameter to enter_lazy_tlb.
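For reference, the Alpha version is (arch/alpha/include/asm/mmu_context.h,
quoted from memory, so details may be slightly off):

static inline void
enter_lazy_tlb(struct mm_struct *mm, struct task_struct *next)
{
	task_thread_info(next)->pcb.ptbr
	  = ((unsigned long) mm->pgd - IDENT_ADDR) >> PAGE_SHIFT;
}

If that pgd address came from a per CPU variable instead, the mm
argument would have no users left.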
--
All Rights Reversed.
On Tue, Jul 31, 2018 at 11:03:03AM -0400, Rik van Riel wrote:
> On Tue, 2018-07-31 at 07:29 -0700, Andy Lutomirski wrote:
> > > On Jul 31, 2018, at 2:12 AM, Peter Zijlstra <[email protected]>
> > > wrote:
> > >
> > > > On Mon, Jul 30, 2018 at 09:05:55PM -0400, Rik van Riel wrote:
> > > > > On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
> > > > >
> > > > > So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and
> > > > > therefore
> > > > > ->active_mm == ->mm.
> > > >
> > > > Close, but not true for kernel threads, which have a
> > > > NULL ->mm, but a non-null ->active_mm that gets passed
> > > > to enter_lazy_tlb().
> > >
> > > I'm confused on the need for this. We mark the CPU lazy, why do we
> > > still
> > > care about this?
> >
> > I have considered renaming enter_lazy_tlb() to something like
> > lazy_switch_to_kernel_mm() (or an irqs_off variant) and making it
> > take no parameters or maybe just task pointer parameters.
>
> Of all the architectures, only Alpha uses the "mm"
> parameter to enter_lazy_tlb.
>
> It uses it to store the pgd address of the current
> active_mm context into the thread info of the next
> task.
>
> If it can get that from a per CPU variable, we can
> get rid of the mm parameter to enter_lazy_tlb.
We only really care for arches that select this new ARCH_NO_ACTIVE_MM
thingy. And x86 (the only one so far) really doesn't give a crap about
the argument.
We should just list/document that being one of the requirements for
selecting that symbol. All this will need semi-decent documentation
anyway, otherwise the poor sod trying to understand this all in about 6
months' time will go crazy :-)
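Something along these lines as a starting point, maybe (wording very
much TBD):

/*
 * An architecture may define ARCH_NO_ACTIVE_MM when:
 *
 *  - it keeps ->active_mm == ->mm for user tasks and never relies on
 *    ->active_mm pointing anywhere useful for kernel threads,
 *  - its enter_lazy_tlb() ignores the mm argument, and
 *  - it can shoot down remote lazy references to an mm (e.g. from
 *    exit_mmap()) instead of relying on mm_count refcounting to keep
 *    the lazy mm alive.
 */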