2020-11-28 22:03:56

by Nicholas Piggin

Subject: [PATCH 0/8] shoot lazy tlbs

This is now rebased on top of Arnd's asm-generic tree, which has
removed most of the fluff from this patch series.

The x86 refactoring is still in the way a bit; I hope to get some
movement on that rather than rebase the main patches off it, because
I think it's a good cleanup. I think it could go in via a generic
mm/scheduler series if we get arch acks, because it's really just
refactoring wrappers.

The main result is reduced contention on the lazy tlb mm refcount,
which helps very big systems.
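
A rough userspace illustration of this contention (an editorial sketch,
not part of the series; thread and iteration counts are arbitrary):
every switch to or from a kernel thread does an atomic op on the same
mm's reference count, so the cacheline holding that counter bounces
across all CPUs.

/*
 * Userspace analogy of refcounted lazy tlb: many threads hammering one
 * shared atomic counter, the way many CPUs hammer mm->mm_count when
 * lazy tlb references are refcounted.
 *
 *   cc -O2 -pthread lazy_refcount_analogy.c -o lazy_refcount_analogy
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS        64              /* stand-in for CPUs */
#define NSWITCHES       1000000L        /* "context switches" per thread */

static atomic_long shared_mm_count = 1; /* like init_mm's mm_count */

static void *worker(void *arg)
{
        long i;

        (void)arg;
        for (i = 0; i < NSWITCHES; i++) {
                /* "mmgrab" when switching to a kernel thread ... */
                atomic_fetch_add_explicit(&shared_mm_count, 1,
                                          memory_order_relaxed);
                /* ... "mmdrop" when the lazy reference is dropped. */
                atomic_fetch_sub_explicit(&shared_mm_count, 1,
                                          memory_order_relaxed);
        }
        return NULL;
}

int main(void)
{
        pthread_t threads[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                if (pthread_create(&threads[i], NULL, worker, NULL))
                        exit(1);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(threads[i], NULL);

        printf("final count: %ld\n", atomic_load(&shared_mm_count));
        return 0;
}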

Thanks,
Nick

Nicholas Piggin (8):
lazy tlb: introduce exit_lazy_tlb
x86: use exit_lazy_tlb rather than
membarrier_mm_sync_core_before_usermode
x86: remove ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
lazy tlb: introduce lazy mm refcount helper functions
lazy tlb: allow lazy tlb mm switching to be configurable
lazy tlb: shoot lazies, a non-refcounting lazy tlb option
powerpc: use lazy mm refcount helper functions
powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

.../membarrier-sync-core/arch-support.txt | 6 +-
arch/Kconfig | 24 +++++
arch/arm/mach-rpc/ecard.c | 3 +-
arch/powerpc/Kconfig | 1 +
arch/powerpc/kernel/smp.c | 2 +-
arch/powerpc/mm/book3s64/radix_tlb.c | 5 +-
arch/x86/Kconfig | 1 -
arch/x86/include/asm/mmu_context.h | 27 ++++++
arch/x86/kernel/alternative.c | 2 +-
arch/x86/kernel/cpu/mce/core.c | 2 +-
drivers/misc/sgi-gru/grufault.c | 2 +-
drivers/misc/sgi-gru/gruhandles.c | 2 +-
drivers/misc/sgi-gru/grukservices.c | 2 +-
fs/exec.c | 6 +-
include/asm-generic/mmu_context.h | 21 ++++
include/linux/sched/mm.h | 34 ++++---
include/linux/sync_core.h | 21 ----
init/Kconfig | 3 -
kernel/cpu.c | 6 +-
kernel/exit.c | 2 +-
kernel/fork.c | 53 ++++++++++
kernel/kthread.c | 12 ++-
kernel/sched/core.c | 97 +++++++++++++------
kernel/sched/sched.h | 4 +-
24 files changed, 247 insertions(+), 91 deletions(-)
delete mode 100644 include/linux/sync_core.h

--
2.23.0


2020-11-28 22:04:14

by Nicholas Piggin

Subject: [PATCH 3/8] x86: remove ARCH_HAS_SYNC_CORE_BEFORE_USERMODE

Switch the remaining x86-specific users to asm/sync_core.h, and remove
the linux/sync_core.h header and the ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
Kconfig option.

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/x86/Kconfig | 1 -
arch/x86/kernel/alternative.c | 2 +-
arch/x86/kernel/cpu/mce/core.c | 2 +-
drivers/misc/sgi-gru/grufault.c | 2 +-
drivers/misc/sgi-gru/gruhandles.c | 2 +-
drivers/misc/sgi-gru/grukservices.c | 2 +-
include/linux/sync_core.h | 21 ---------------------
init/Kconfig | 3 ---
8 files changed, 5 insertions(+), 30 deletions(-)
delete mode 100644 include/linux/sync_core.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..160d3ad90507 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -80,7 +80,6 @@ config X86
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
- select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_DEBUG_WX
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 2400ad62f330..9a7ab08f4157 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -17,7 +17,7 @@
#include <linux/kprobes.h>
#include <linux/mmu_context.h>
#include <linux/bsearch.h>
-#include <linux/sync_core.h>
+#include <asm/sync_core.h>
#include <asm/text-patching.h>
#include <asm/alternative.h>
#include <asm/sections.h>
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4102b866e7c0..282ea9942829 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -41,12 +41,12 @@
#include <linux/irq_work.h>
#include <linux/export.h>
#include <linux/set_memory.h>
-#include <linux/sync_core.h>
#include <linux/task_work.h>
#include <linux/hardirq.h>

#include <asm/intel-family.h>
#include <asm/processor.h>
+#include <asm/sync_core.h>
#include <asm/traps.h>
#include <asm/tlbflush.h>
#include <asm/mce.h>
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index 723825524ea0..48fd5b101de1 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -20,8 +20,8 @@
#include <linux/io.h>
#include <linux/uaccess.h>
#include <linux/security.h>
-#include <linux/sync_core.h>
#include <linux/prefetch.h>
+#include <asm/sync_core.h>
#include "gru.h"
#include "grutables.h"
#include "grulib.h"
diff --git a/drivers/misc/sgi-gru/gruhandles.c b/drivers/misc/sgi-gru/gruhandles.c
index 1d75d5e540bc..c8cba1c1b00f 100644
--- a/drivers/misc/sgi-gru/gruhandles.c
+++ b/drivers/misc/sgi-gru/gruhandles.c
@@ -16,7 +16,7 @@
#define GRU_OPERATION_TIMEOUT (((cycles_t) local_cpu_data->itc_freq)*10)
#define CLKS2NSEC(c) ((c) *1000000000 / local_cpu_data->itc_freq)
#else
-#include <linux/sync_core.h>
+#include <asm/sync_core.h>
#include <asm/tsc.h>
#define GRU_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000)
#define CLKS2NSEC(c) ((c) * 1000000 / tsc_khz)
diff --git a/drivers/misc/sgi-gru/grukservices.c b/drivers/misc/sgi-gru/grukservices.c
index 0ea923fe6371..860aea9deb45 100644
--- a/drivers/misc/sgi-gru/grukservices.c
+++ b/drivers/misc/sgi-gru/grukservices.c
@@ -16,11 +16,11 @@
#include <linux/miscdevice.h>
#include <linux/proc_fs.h>
#include <linux/interrupt.h>
-#include <linux/sync_core.h>
#include <linux/uaccess.h>
#include <linux/delay.h>
#include <linux/export.h>
#include <asm/io_apic.h>
+#include <asm/sync_core.h>
#include "gru.h"
#include "grulib.h"
#include "grutables.h"
diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
deleted file mode 100644
index 013da4b8b327..000000000000
--- a/include/linux/sync_core.h
+++ /dev/null
@@ -1,21 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SYNC_CORE_H
-#define _LINUX_SYNC_CORE_H
-
-#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-#include <asm/sync_core.h>
-#else
-/*
- * This is a dummy sync_core_before_usermode() implementation that can be used
- * on all architectures which return to user-space through core serializing
- * instructions.
- * If your architecture returns to user-space through non-core-serializing
- * instructions, you need to write your own functions.
- */
-static inline void sync_core_before_usermode(void)
-{
-}
-#endif
-
-#endif /* _LINUX_SYNC_CORE_H */
-
diff --git a/init/Kconfig b/init/Kconfig
index 02d13ae27abb..82f9b5c937cb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2334,9 +2334,6 @@ source "kernel/Kconfig.locks"
config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
bool

-config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
- bool
-
# It may be useful for an architecture to override the definitions of the
# SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
# and the COMPAT_ variants in <linux/compat.h>, in particular to use a
--
2.23.0

2020-11-28 22:05:20

by Nicholas Piggin

Subject: [PATCH 7/8] powerpc: use lazy mm refcount helper functions

Use _lazy_tlb functions for lazy mm refcounting in powerpc, to prepare
to move to MMU_LAZY_TLB_SHOOTDOWN.

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/powerpc/kernel/smp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c2857cbd960..93c0eaa6f4bf 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1395,7 +1395,7 @@ void start_secondary(void *unused)
{
unsigned int cpu = raw_smp_processor_id();

- mmgrab(&init_mm);
+ mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;

smp_store_cpu_info(cpu);
--
2.23.0

2020-11-28 22:05:45

by Nicholas Piggin

Subject: [PATCH 5/8] lazy tlb: allow lazy tlb mm switching to be configurable

NOMMU systems could easily go without this and save a bit of code
and the refcount atomics, because their mm switch is a no-op. I
haven't flipped them over because I haven't audited all arch code to
convert over to using the _lazy_tlb refcounting.

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/Kconfig | 11 +++++++
include/linux/sched/mm.h | 13 ++++++--
kernel/sched/core.c | 68 +++++++++++++++++++++++++++++-----------
kernel/sched/sched.h | 4 ++-
4 files changed, 75 insertions(+), 21 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc0e32d..596bf589d74b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -430,6 +430,17 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
irqs disabled over activate_mm. Architectures that do IPI based TLB
shootdowns should enable this.

+# Should make this depend on MMU, because there is little use for lazy mm switching
+# with NOMMU. Must audit NOMMU architecture code for lazy mm refcounting first.
+config MMU_LAZY_TLB
+ def_bool y
+ help
+ Enable "lazy TLB" mmu context switching for kernel threads.
+
+config MMU_LAZY_TLB_REFCOUNT
+ def_bool y
+ depends on MMU_LAZY_TLB
+
config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 7157c0f6fef8..bd0f27402d4b 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -51,12 +51,21 @@ static inline void mmdrop(struct mm_struct *mm)
/* Helpers for lazy TLB mm refcounting */
static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
- mmgrab(mm);
+ if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+ mmgrab(mm);
}

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
- mmdrop(mm);
+ if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+ mmdrop(mm);
+ } else {
+ /*
+ * mmdrop_lazy_tlb must provide a full memory barrier, see the
+ * membarrier comment finish_task_switch.
+ */
+ smp_mb();
+ }
}

/**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e372b613d514..3b79c6cc3a37 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3579,7 +3579,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
__releases(rq->lock)
{
struct rq *rq = this_rq();
- struct mm_struct *mm = rq->prev_mm;
+ struct mm_struct *mm = NULL;
long prev_state;

/*
@@ -3598,7 +3598,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
current->comm, current->pid, preempt_count()))
preempt_count_set(FORK_PREEMPT_COUNT);

- rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+ mm = rq->prev_lazy_mm;
+ rq->prev_lazy_mm = NULL;
+#endif

/*
* A task struct has one reference for the use as "current".
@@ -3630,6 +3633,8 @@ static struct rq *finish_task_switch(struct task_struct *prev)
* rq->curr, before returning to userspace, for
* {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by
* mmdrop_lazy_tlb().
+ *
+ * This same issue applies to other places that mmdrop_lazy_tlb().
*/
if (mm)
mmdrop_lazy_tlb(mm);
@@ -3719,22 +3724,10 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
calculate_sigpending();
}

-/*
- * context_switch - switch to the new MM and the new thread's register state.
- */
-static __always_inline struct rq *
-context_switch(struct rq *rq, struct task_struct *prev,
- struct task_struct *next, struct rq_flags *rf)
+static __always_inline void
+context_switch_mm(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next)
{
- prepare_task_switch(rq, prev, next);
-
- /*
- * For paravirt, this is coupled with an exit in switch_to to
- * combine the page table reload and the switch backend into
- * one hypercall.
- */
- arch_start_context_switch(prev);
-
/*
* kernel -> kernel lazy + transfer active
* user -> kernel lazy + mmgrab_lazy_tlb() active
@@ -3765,11 +3758,50 @@ context_switch(struct rq *rq, struct task_struct *prev,
if (!prev->mm) { // from kernel
exit_lazy_tlb(prev->active_mm, next);

+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
/* will mmdrop_lazy_tlb() in finish_task_switch(). */
- rq->prev_mm = prev->active_mm;
+ rq->prev_lazy_mm = prev->active_mm;
prev->active_mm = NULL;
+#else
+ /* See membarrier comment in finish_task_switch(). */
+ smp_mb();
+#endif
}
}
+}
+
+static __always_inline void
+context_switch_mm_nolazy(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next)
+{
+ if (!next->mm)
+ next->active_mm = &init_mm;
+ membarrier_switch_mm(rq, prev->active_mm, next->active_mm);
+ switch_mm_irqs_off(prev->active_mm, next->active_mm, next);
+ if (!prev->mm)
+ prev->active_mm = NULL;
+}
+
+/*
+ * context_switch - switch to the new MM and the new thread's register state.
+ */
+static __always_inline struct rq *
+context_switch(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next, struct rq_flags *rf)
+{
+ prepare_task_switch(rq, prev, next);
+
+ /*
+ * For paravirt, this is coupled with an exit in switch_to to
+ * combine the page table reload and the switch backend into
+ * one hypercall.
+ */
+ arch_start_context_switch(prev);
+
+ if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+ context_switch_mm(rq, prev, next);
+ else
+ context_switch_mm_nolazy(rq, prev, next);

rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df80bfcea92e..3b72aec5a2f2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -950,7 +950,9 @@ struct rq {
struct task_struct *idle;
struct task_struct *stop;
unsigned long next_balance;
- struct mm_struct *prev_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+ struct mm_struct *prev_lazy_mm;
+#endif

unsigned int clock_update_flags;
u64 clock;
--
2.23.0

2020-11-28 22:06:07

by Nicholas Piggin

Subject: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

And get rid of the generic sync_core_before_usermode facility. This is
functionally a no-op in the core scheduler code, but it also catches

This helper is the wrong way around I think. The idea that membarrier
state requires a core sync before returning to user is the easy one
that does not need hiding behind membarrier calls. The gap in core
synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
tricky detail that is better put in x86 lazy tlb code.

Consider if an arch did not synchronize core in switch_mm either, then
membarrier_mm_sync_core_before_usermode would be in the wrong place
but arch specific mmu context functions would still be the right place.
There is also an exit_lazy_tlb case that is not covered by this call,
which could be a bug (a kthread uses the membarrier process's mm, then
context switches back to the process without switching mm or a lazy mm
switch).

This makes lazy tlb code a bit more modular.

Signed-off-by: Nicholas Piggin <[email protected]>
---
.../membarrier-sync-core/arch-support.txt | 6 ++++-
arch/x86/include/asm/mmu_context.h | 27 +++++++++++++++++++
include/linux/sched/mm.h | 14 ----------
kernel/cpu.c | 4 ++-
kernel/sched/core.c | 16 +++++------
5 files changed, 42 insertions(+), 25 deletions(-)

diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
index 47e6903f47a5..0763a63a7097 100644
--- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
+++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
@@ -5,6 +5,10 @@
#
# Architecture requirements
#
+# If your architecture returns to user-space through non-core-serializing
+# instructions, you need to ensure these are done in switch_mm and exit_lazy_tlb
+# (if lazy tlb switching is implemented).
+#
# * arm/arm64/powerpc
#
# Rely on implicit context synchronization as a result of exception return
@@ -24,7 +28,7 @@
# instead on write_cr3() performed by switch_mm() to provide core serialization
# after changing the current mm, and deal with the special case of kthread ->
# uthread (temporarily keeping current mm into active_mm) by issuing a
-# sync_core_before_usermode() in that specific case.
+# serializing instruction in exit_lazy_tlb() in that specific case.
#
-----------------------
| arch |status|
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 36afcbea6a9f..8094893254f1 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -6,12 +6,14 @@
#include <linux/atomic.h>
#include <linux/mm_types.h>
#include <linux/pkeys.h>
+#include <linux/sched/mm.h>

#include <trace/events/tlb.h>

#include <asm/tlbflush.h>
#include <asm/paravirt.h>
#include <asm/debugreg.h>
+#include <asm/sync_core.h>

extern atomic64_t last_mm_ctx_id;

@@ -94,6 +96,31 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
#define enter_lazy_tlb enter_lazy_tlb
extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);

+#ifdef CONFIG_MEMBARRIER
+/*
+ * Ensure that a core serializing instruction is issued before returning
+ * to user-mode, if a SYNC_CORE was requested. x86 implements return to
+ * user-space through sysexit, sysretl, and sysretq, which are not core
+ * serializing.
+ *
+ * See the membarrier comment in finish_task_switch as to why this is done
+ * in exit_lazy_tlb.
+ */
+#define exit_lazy_tlb exit_lazy_tlb
+static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+ /* Switching mm is serializing with write_cr3 */
+ if (tsk->mm != mm)
+ return;
+
+ if (likely(!(atomic_read(&mm->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
+ return;
+
+ sync_core_before_usermode();
+}
+#endif
+
/*
* Init a new mm. Used on mm copies, like at fork()
* and on mm's that are brand-new, like at execve().
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index d5ece7a9a403..2c6bcdf76d99 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,7 +7,6 @@
#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/gfp.h>
-#include <linux/sync_core.h>

/*
* Routines for handling mm_structs
@@ -335,16 +334,6 @@ enum {
#include <asm/membarrier.h>
#endif

-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
-{
- if (current->mm != mm)
- return;
- if (likely(!(atomic_read(&mm->membarrier_state) &
- MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
- return;
- sync_core_before_usermode();
-}
-
extern void membarrier_exec_mmap(struct mm_struct *mm);

#else
@@ -358,9 +347,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
static inline void membarrier_exec_mmap(struct mm_struct *mm)
{
}
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
-{
-}
#endif

#endif /* _LINUX_SCHED_MM_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6ff2578ecf17..134688d79589 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -572,7 +572,9 @@ static int finish_cpu(unsigned int cpu)

/*
* idle_task_exit() will have switched to &init_mm, now
- * clean up any remaining active_mm state.
+ * clean up any remaining active_mm state. exit_lazy_tlb
+ * is not done; if an arch did any accounting in these
+ * functions it would have to be added.
*/
if (mm != &init_mm)
idle->active_mm = &init_mm;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dcc46039ade5..e4e8cebd82e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3620,22 +3620,19 @@ static struct rq *finish_task_switch(struct task_struct *prev)
kcov_finish_switch(current);

fire_sched_in_preempt_notifiers(current);
+
/*
* When switching through a kernel thread, the loop in
* membarrier_{private,global}_expedited() may have observed that
* kernel thread and not issued an IPI. It is therefore possible to
* schedule between user->kernel->user threads without passing though
- * switch_mm(). Membarrier requires a barrier after storing to
- * rq->curr, before returning to userspace, so provide them here:
- *
- * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
- * provided by mmdrop(),
- * - a sync_core for SYNC_CORE.
+ * switch_mm(). Membarrier requires a full barrier after storing to
+ * rq->curr, before returning to userspace, for
+ * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop().
*/
- if (mm) {
- membarrier_mm_sync_core_before_usermode(mm);
+ if (mm)
mmdrop(mm);
- }
+
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
@@ -6689,6 +6686,7 @@ void idle_task_exit(void)
BUG_ON(current != this_rq()->idle);

if (mm != &init_mm) {
+ /* enter_lazy_tlb is not done because we're about to go down */
switch_mm(mm, &init_mm, current);
finish_arch_post_lock_switch();
}
--
2.23.0

2020-11-28 22:06:08

by Nicholas Piggin

Subject: [PATCH 1/8] lazy tlb: introduce exit_lazy_tlb

This is called at points where a lazy mm is switched away or made not
lazy (by its owner switching back).

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/arm/mach-rpc/ecard.c | 1 +
arch/powerpc/mm/book3s64/radix_tlb.c | 1 +
fs/exec.c | 6 ++++--
include/asm-generic/mmu_context.h | 21 +++++++++++++++++++++
kernel/kthread.c | 1 +
kernel/sched/core.c | 2 ++
6 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index 827b50f1c73e..43eb1bfba466 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -253,6 +253,7 @@ static int ecard_init_mm(void)
current->mm = mm;
current->active_mm = mm;
activate_mm(active_mm, mm);
+ exit_lazy_tlb(active_mm, current);
mmdrop(active_mm);
ecard_init_pgtables(mm);
return 0;
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index b487b489d4b6..ac3fec03926a 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -661,6 +661,7 @@ static void do_exit_flush_lazy_tlb(void *arg)
mmgrab(&init_mm);
current->active_mm = &init_mm;
switch_mm_irqs_off(mm, &init_mm, current);
+ exit_lazy_tlb(mm, current);
mmdrop(mm);
}

diff --git a/fs/exec.c b/fs/exec.c
index 547a2390baf5..4b4dea1bb7ba 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1017,6 +1017,8 @@ static int exec_mmap(struct mm_struct *mm)
if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
activate_mm(active_mm, mm);
+ if (!old_mm)
+ exit_lazy_tlb(active_mm, tsk);
if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
tsk->mm->vmacache_seqnum = 0;
@@ -1028,9 +1030,9 @@ static int exec_mmap(struct mm_struct *mm)
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
mm_update_next_owner(old_mm);
mmput(old_mm);
- return 0;
+ } else {
+ mmdrop(active_mm);
}
- mmdrop(active_mm);
return 0;
}

diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
index 91727065bacb..4626d0020e65 100644
--- a/include/asm-generic/mmu_context.h
+++ b/include/asm-generic/mmu_context.h
@@ -24,6 +24,27 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
}
#endif

+/*
+ * exit_lazy_tlb - Called after switching away from a lazy TLB mode mm.
+ *
+ * mm: the lazy mm context that was switched
+ * tsk: the task that was switched to (with a non-lazy mm)
+ *
+ * mm may equal tsk->mm.
+ * mm and tsk->mm will not be NULL.
+ *
+ * Note this is not symmetrical to enter_lazy_tlb, this is not
+ * called when tasks switch into the lazy mm, it's called after the
+ * lazy mm becomes non-lazy (either switched to a different mm or the
+ * owner of the mm returns).
+ */
+#ifndef exit_lazy_tlb
+static inline void exit_lazy_tlb(struct mm_struct *mm,
+ struct task_struct *tsk)
+{
+}
+#endif
+
/**
* init_new_context - Initialize context of a new mm_struct.
* @tsk: task struct for the mm
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 933a625621b8..e380302aac13 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1250,6 +1250,7 @@ void kthread_use_mm(struct mm_struct *mm)
}
tsk->mm = mm;
switch_mm_irqs_off(active_mm, mm, tsk);
+ exit_lazy_tlb(active_mm, tsk);
local_irq_enable();
task_unlock(tsk);
#ifdef finish_arch_post_lock_switch
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e7e453492cff..dcc46039ade5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3765,6 +3765,8 @@ context_switch(struct rq *rq, struct task_struct *prev,
switch_mm_irqs_off(prev->active_mm, next->mm, next);

if (!prev->mm) { // from kernel
+ exit_lazy_tlb(prev->active_mm, next);
+
/* will mmdrop() in finish_task_switch(). */
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
--
2.23.0

2020-11-28 22:16:58

by Andy Lutomirski

Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <[email protected]> wrote:
>
> And get rid of the generic sync_core_before_usermode facility. This is
> functionally a no-op in the core scheduler code, but it also catches
>
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.
>
> Consider if an arch did not synchronize core in switch_mm either, then
> membarrier_mm_sync_core_before_usermode would be in the wrong place
> but arch specific mmu context functions would still be the right place.
> There is also an exit_lazy_tlb case that is not covered by this call,
> which could be a bug (a kthread uses the membarrier process's mm, then
> context switches back to the process without switching mm or a lazy mm
> switch).
>
> This makes lazy tlb code a bit more modular.

I have a couple of membarrier fixes that I want to send out today or
tomorrow, and they might eliminate the need for this patch. Let me
think about this a little bit. I'll cc you. The existing code is way
too subtle and the comments are far too confusing for me to be quickly
confident about any of my conclusions :)

2020-11-29 00:40:50

by Andy Lutomirski

Subject: Re: [PATCH 5/8] lazy tlb: allow lazy tlb mm switching to be configurable

On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <[email protected]> wrote:
>
> NOMMU systems could easily go without this and save a bit of code
> and the refcount atomics, because their mm switch is a no-op. I
> haven't flipped them over because I haven't audited all arch code to
> convert over to using the _lazy_tlb refcounting.
>
> Signed-off-by: Nicholas Piggin <[email protected]>
> ---
> arch/Kconfig | 11 +++++++
> include/linux/sched/mm.h | 13 ++++++--
> kernel/sched/core.c | 68 +++++++++++++++++++++++++++++-----------
> kernel/sched/sched.h | 4 ++-
> 4 files changed, 75 insertions(+), 21 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 56b6ccc0e32d..596bf589d74b 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -430,6 +430,17 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
> irqs disabled over activate_mm. Architectures that do IPI based TLB
> shootdowns should enable this.
>
> +# Should make this depend on MMU, because there is little use for lazy mm switching
> +# with NOMMU. Must audit NOMMU architecture code for lazy mm refcounting first.
> +config MMU_LAZY_TLB
> + def_bool y
> + help
> + Enable "lazy TLB" mmu context switching for kernel threads.
> +
> +config MMU_LAZY_TLB_REFCOUNT
> + def_bool y
> + depends on MMU_LAZY_TLB
> +

This could use some documentation as to what "no" means.

> config ARCH_HAVE_NMI_SAFE_CMPXCHG
> bool
>
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 7157c0f6fef8..bd0f27402d4b 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -51,12 +51,21 @@ static inline void mmdrop(struct mm_struct *mm)
> /* Helpers for lazy TLB mm refcounting */
> static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
> {
> - mmgrab(mm);
> + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
> + mmgrab(mm);
> }
>
> static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
> {
> - mmdrop(mm);
> + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
> + mmdrop(mm);
> + } else {
> + /*
> + * mmdrop_lazy_tlb must provide a full memory barrier, see the
> + * membarrier comment finish_task_switch.

"membarrier comment in finish_task_switch()", perhaps?

> + */
> + smp_mb();
> + }
> }
>
> /**
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e372b613d514..3b79c6cc3a37 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3579,7 +3579,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> __releases(rq->lock)
> {
> struct rq *rq = this_rq();
> - struct mm_struct *mm = rq->prev_mm;
> + struct mm_struct *mm = NULL;
> long prev_state;
>
> /*
> @@ -3598,7 +3598,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> current->comm, current->pid, preempt_count()))
> preempt_count_set(FORK_PREEMPT_COUNT);
>
> - rq->prev_mm = NULL;
> +#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
> + mm = rq->prev_lazy_mm;
> + rq->prev_lazy_mm = NULL;
> +#endif
>
> /*
> * A task struct has one reference for the use as "current".
> @@ -3630,6 +3633,8 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> * rq->curr, before returning to userspace, for
> * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by
> * mmdrop_lazy_tlb().
> + *
> + * This same issue applies to other places that mmdrop_lazy_tlb().
> */
> if (mm)
> mmdrop_lazy_tlb(mm);
> @@ -3719,22 +3724,10 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> calculate_sigpending();
> }
>
> -/*
> - * context_switch - switch to the new MM and the new thread's register state.
> - */
> -static __always_inline struct rq *
> -context_switch(struct rq *rq, struct task_struct *prev,
> - struct task_struct *next, struct rq_flags *rf)
> +static __always_inline void
> +context_switch_mm(struct rq *rq, struct task_struct *prev,
> + struct task_struct *next)
> {
> - prepare_task_switch(rq, prev, next);
> -
> - /*
> - * For paravirt, this is coupled with an exit in switch_to to
> - * combine the page table reload and the switch backend into
> - * one hypercall.
> - */
> - arch_start_context_switch(prev);
> -
> /*
> * kernel -> kernel lazy + transfer active
> * user -> kernel lazy + mmgrab_lazy_tlb() active
> @@ -3765,11 +3758,50 @@ context_switch(struct rq *rq, struct task_struct *prev,
> if (!prev->mm) { // from kernel
> exit_lazy_tlb(prev->active_mm, next);
>
> +#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
> /* will mmdrop_lazy_tlb() in finish_task_switch(). */
> - rq->prev_mm = prev->active_mm;
> + rq->prev_lazy_mm = prev->active_mm;
> prev->active_mm = NULL;
> +#else
> + /* See membarrier comment in finish_task_switch(). */
> + smp_mb();
> +#endif
> }
> }
> +}
> +

Comment here describing what this does, please.


> +static __always_inline void
> +context_switch_mm_nolazy(struct rq *rq, struct task_struct *prev,
> + struct task_struct *next)
> +{
> + if (!next->mm)
> + next->active_mm = &init_mm;
> + membarrier_switch_mm(rq, prev->active_mm, next->active_mm);
> + switch_mm_irqs_off(prev->active_mm, next->active_mm, next);
> + if (!prev->mm)
> + prev->active_mm = NULL;
> +}
> +
> +/*
> + * context_switch - switch to the new MM and the new thread's register state.
> + */
> +static __always_inline struct rq *
> +context_switch(struct rq *rq, struct task_struct *prev,
> + struct task_struct *next, struct rq_flags *rf)
> +{
> + prepare_task_switch(rq, prev, next);
> +
> + /*
> + * For paravirt, this is coupled with an exit in switch_to to
> + * combine the page table reload and the switch backend into
> + * one hypercall.
> + */
> + arch_start_context_switch(prev);
> +
> + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
> + context_switch_mm(rq, prev, next);
> + else
> + context_switch_mm_nolazy(rq, prev, next);
>
> rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index df80bfcea92e..3b72aec5a2f2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -950,7 +950,9 @@ struct rq {
> struct task_struct *idle;
> struct task_struct *stop;
> unsigned long next_balance;
> - struct mm_struct *prev_mm;
> +#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
> + struct mm_struct *prev_lazy_mm;
> +#endif
>
> unsigned int clock_update_flags;
> u64 clock;
> --
> 2.23.0
>

2020-11-29 00:42:16

by Andy Lutomirski

Subject: Re: [PATCH 1/8] lazy tlb: introduce exit_lazy_tlb

On Sat, Nov 28, 2020 at 8:01 AM Nicholas Piggin <[email protected]> wrote:
>
> This is called at points where a lazy mm is switched away or made not
> lazy (by its owner switching back).
>
> Signed-off-by: Nicholas Piggin <[email protected]>
> ---
> arch/arm/mach-rpc/ecard.c | 1 +
> arch/powerpc/mm/book3s64/radix_tlb.c | 1 +
> fs/exec.c | 6 ++++--
> include/asm-generic/mmu_context.h | 21 +++++++++++++++++++++
> kernel/kthread.c | 1 +
> kernel/sched/core.c | 2 ++
> 6 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
> index 827b50f1c73e..43eb1bfba466 100644
> --- a/arch/arm/mach-rpc/ecard.c
> +++ b/arch/arm/mach-rpc/ecard.c
> @@ -253,6 +253,7 @@ static int ecard_init_mm(void)
> current->mm = mm;
> current->active_mm = mm;
> activate_mm(active_mm, mm);
> + exit_lazy_tlb(active_mm, current);
> mmdrop(active_mm);
> ecard_init_pgtables(mm);
> return 0;
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index b487b489d4b6..ac3fec03926a 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -661,6 +661,7 @@ static void do_exit_flush_lazy_tlb(void *arg)
> mmgrab(&init_mm);
> current->active_mm = &init_mm;
> switch_mm_irqs_off(mm, &init_mm, current);
> + exit_lazy_tlb(mm, current);
> mmdrop(mm);
> }
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 547a2390baf5..4b4dea1bb7ba 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1017,6 +1017,8 @@ static int exec_mmap(struct mm_struct *mm)
> if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> local_irq_enable();
> activate_mm(active_mm, mm);
> + if (!old_mm)
> + exit_lazy_tlb(active_mm, tsk);
> if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> local_irq_enable();
> tsk->mm->vmacache_seqnum = 0;
> @@ -1028,9 +1030,9 @@ static int exec_mmap(struct mm_struct *mm)
> setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
> mm_update_next_owner(old_mm);
> mmput(old_mm);
> - return 0;
> + } else {
> + mmdrop(active_mm);
> }
> - mmdrop(active_mm);

This looks like an unrelated change.

> return 0;
> }
>
> diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
> index 91727065bacb..4626d0020e65 100644
> --- a/include/asm-generic/mmu_context.h
> +++ b/include/asm-generic/mmu_context.h
> @@ -24,6 +24,27 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
> }
> #endif
>
> +/*
> + * exit_lazy_tlb - Called after switching away from a lazy TLB mode mm.
> + *
> + * mm: the lazy mm context that was switched
> + * tsk: the task that was switched to (with a non-lazy mm)
> + *
> + * mm may equal tsk->mm.
> + * mm and tsk->mm will not be NULL.
> + *
> + * Note this is not symmetrical to enter_lazy_tlb, this is not
> + * called when tasks switch into the lazy mm, it's called after the
> + * lazy mm becomes non-lazy (either switched to a different mm or the
> + * owner of the mm returns).
> + */
> +#ifndef exit_lazy_tlb
> +static inline void exit_lazy_tlb(struct mm_struct *mm,

Maybe name this parameter prev_lazy_mm?

2020-11-30 15:00:41

by Mathieu Desnoyers

Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

----- On Nov 28, 2020, at 11:01 AM, Nicholas Piggin [email protected] wrote:

> And get rid of the generic sync_core_before_usermode facility. This is
> functionally a no-op in the core scheduler code, but it also catches

This sentence is incomplete.

>
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.

Ideally, yes, this complexity should sit within the x86 architecture code
if only that architecture requires it.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-12-02 02:52:28

by Nicholas Piggin

Subject: Re: [PATCH 1/8] lazy tlb: introduce exit_lazy_tlb

Excerpts from Andy Lutomirski's message of November 29, 2020 10:38 am:
> On Sat, Nov 28, 2020 at 8:01 AM Nicholas Piggin <[email protected]> wrote:
>>
>> This is called at points where a lazy mm is switched away or made not
>> lazy (by its owner switching back).
>>
>> Signed-off-by: Nicholas Piggin <[email protected]>
>> ---
>> arch/arm/mach-rpc/ecard.c | 1 +
>> arch/powerpc/mm/book3s64/radix_tlb.c | 1 +
>> fs/exec.c | 6 ++++--
>> include/asm-generic/mmu_context.h | 21 +++++++++++++++++++++
>> kernel/kthread.c | 1 +
>> kernel/sched/core.c | 2 ++
>> 6 files changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
>> index 827b50f1c73e..43eb1bfba466 100644
>> --- a/arch/arm/mach-rpc/ecard.c
>> +++ b/arch/arm/mach-rpc/ecard.c
>> @@ -253,6 +253,7 @@ static int ecard_init_mm(void)
>> current->mm = mm;
>> current->active_mm = mm;
>> activate_mm(active_mm, mm);
>> + exit_lazy_tlb(active_mm, current);
>> mmdrop(active_mm);
>> ecard_init_pgtables(mm);
>> return 0;
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index b487b489d4b6..ac3fec03926a 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -661,6 +661,7 @@ static void do_exit_flush_lazy_tlb(void *arg)
>> mmgrab(&init_mm);
>> current->active_mm = &init_mm;
>> switch_mm_irqs_off(mm, &init_mm, current);
>> + exit_lazy_tlb(mm, current);
>> mmdrop(mm);
>> }
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 547a2390baf5..4b4dea1bb7ba 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1017,6 +1017,8 @@ static int exec_mmap(struct mm_struct *mm)
>> if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
>> local_irq_enable();
>> activate_mm(active_mm, mm);
>> + if (!old_mm)
>> + exit_lazy_tlb(active_mm, tsk);
>> if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
>> local_irq_enable();
>> tsk->mm->vmacache_seqnum = 0;
>> @@ -1028,9 +1030,9 @@ static int exec_mmap(struct mm_struct *mm)
>> setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
>> mm_update_next_owner(old_mm);
>> mmput(old_mm);
>> - return 0;
>> + } else {
>> + mmdrop(active_mm);
>> }
>> - mmdrop(active_mm);
>
> This looks like an unrelated change.

I thought the old style was pointless and made me look twice to make
sure we weren't mmdrop()ing the lazy.

>
>> return 0;
>> }
>>
>> diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
>> index 91727065bacb..4626d0020e65 100644
>> --- a/include/asm-generic/mmu_context.h
>> +++ b/include/asm-generic/mmu_context.h
>> @@ -24,6 +24,27 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
>> }
>> #endif
>>
>> +/*
>> + * exit_lazy_tlb - Called after switching away from a lazy TLB mode mm.
>> + *
>> + * mm: the lazy mm context that was switched
>> + * tsk: the task that was switched to (with a non-lazy mm)
>> + *
>> + * mm may equal tsk->mm.
>> + * mm and tsk->mm will not be NULL.
>> + *
>> + * Note this is not symmetrical to enter_lazy_tlb, this is not
>> + * called when tasks switch into the lazy mm, it's called after the
>> + * lazy mm becomes non-lazy (either switched to a different mm or the
>> + * owner of the mm returns).
>> + */
>> +#ifndef exit_lazy_tlb
>> +static inline void exit_lazy_tlb(struct mm_struct *mm,
>
> Maybe name this parameter prev_lazy_mm?
>

mm is better because it's the mm that we're "exiting lazy" from; the
function name gives the context.

prev might suggest it was the previous mm (but it's the current one),
or that we're switching to another mm (but we may not be switching at
all).

Thanks,
Nick

2020-12-02 02:53:02

by Nicholas Piggin

Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

Excerpts from Andy Lutomirski's message of November 29, 2020 3:55 am:
> On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <[email protected]> wrote:
>>
>> And get rid of the generic sync_core_before_usermode facility. This is
>> functionally a no-op in the core scheduler code, but it also catches
>>
>> This helper is the wrong way around I think. The idea that membarrier
>> state requires a core sync before returning to user is the easy one
>> that does not need hiding behind membarrier calls. The gap in core
>> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
>> tricky detail that is better put in x86 lazy tlb code.
>>
>> Consider if an arch did not synchronize core in switch_mm either, then
>> membarrier_mm_sync_core_before_usermode would be in the wrong place
>> but arch specific mmu context functions would still be the right place.
>> There is also an exit_lazy_tlb case that is not covered by this call,
>> which could be a bug (a kthread uses the membarrier process's mm, then
>> context switches back to the process without switching mm or a lazy mm
>> switch).
>>
>> This makes lazy tlb code a bit more modular.
>
> I have a couple of membarrier fixes that I want to send out today or
> tomorrow, and they might eliminate the need for this patch. Let me
> think about this a little bit. I'll cc you. The existing code is way
> too subtle and the comments are far too confusing for me to be quickly
> confident about any of my conclusions :)
>

Thanks for the head's up. I'll have to have a better look through them
but I don't know that it eliminates the need for this entirely although
it might close some gaps and make this not a bug fix. The problem here
is x86 code wanted something to be called when a lazy mm is unlazied,
but it missed some spots and also the core scheduler doesn't need to
know about those x86 details if it has this generic call that annotates
the lazy handling better.

I'll go through the wording again and look at your patches a bit better
but I think they are somewhat orthogonal.

Thanks,
Nick

2020-12-02 02:53:46

by Nicholas Piggin

Subject: Re: [PATCH 5/8] lazy tlb: allow lazy tlb mm switching to be configurable

Excerpts from Andy Lutomirski's message of November 29, 2020 10:36 am:
> On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <[email protected]> wrote:
>>
>> NOMMU systems could easily go without this and save a bit of code
>> and the refcount atomics, because their mm switch is a no-op. I
>> haven't flipped them over because I haven't audited all arch code to
>> convert over to using the _lazy_tlb refcounting.
>>
>> Signed-off-by: Nicholas Piggin <[email protected]>
>> ---
>> arch/Kconfig | 11 +++++++
>> include/linux/sched/mm.h | 13 ++++++--
>> kernel/sched/core.c | 68 +++++++++++++++++++++++++++++-----------
>> kernel/sched/sched.h | 4 ++-
>> 4 files changed, 75 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index 56b6ccc0e32d..596bf589d74b 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -430,6 +430,17 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>> irqs disabled over activate_mm. Architectures that do IPI based TLB
>> shootdowns should enable this.
>>
>> +# Should make this depend on MMU, because there is little use for lazy mm switching
>> +# with NOMMU. Must audit NOMMU architecture code for lazy mm refcounting first.
>> +config MMU_LAZY_TLB
>> + def_bool y
>> + help
>> + Enable "lazy TLB" mmu context switching for kernel threads.
>> +
>> +config MMU_LAZY_TLB_REFCOUNT
>> + def_bool y
>> + depends on MMU_LAZY_TLB
>> +
>
> This could use some documentation as to what "no" means.

Sure I can add a bit more.

>
>> config ARCH_HAVE_NMI_SAFE_CMPXCHG
>> bool
>>
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index 7157c0f6fef8..bd0f27402d4b 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -51,12 +51,21 @@ static inline void mmdrop(struct mm_struct *mm)
>> /* Helpers for lazy TLB mm refcounting */
>> static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
>> {
>> - mmgrab(mm);
>> + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
>> + mmgrab(mm);
>> }
>>
>> static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
>> {
>> - mmdrop(mm);
>> + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
>> + mmdrop(mm);
>> + } else {
>> + /*
>> + * mmdrop_lazy_tlb must provide a full memory barrier, see the
>> + * membarrier comment finish_task_switch.
>
> "membarrier comment in finish_task_switch()", perhaps?

Sure.

Thanks,
Nick

2020-12-03 05:13:06

by Andy Lutomirski

Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

On Tue, Dec 1, 2020 at 6:50 PM Nicholas Piggin <[email protected]> wrote:
>
> Excerpts from Andy Lutomirski's message of November 29, 2020 3:55 am:
> > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <[email protected]> wrote:
> >>
> >> And get rid of the generic sync_core_before_usermode facility. This is
> >> functionally a no-op in the core scheduler code, but it also catches
> >>
> >> This helper is the wrong way around I think. The idea that membarrier
> >> state requires a core sync before returning to user is the easy one
> >> that does not need hiding behind membarrier calls. The gap in core
> >> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> >> tricky detail that is better put in x86 lazy tlb code.
> >>
> >> Consider if an arch did not synchronize core in switch_mm either, then
> >> membarrier_mm_sync_core_before_usermode would be in the wrong place
> >> but arch specific mmu context functions would still be the right place.
> >> There is also an exit_lazy_tlb case that is not covered by this call,
> >> which could be a bug (a kthread uses the membarrier process's mm, then
> >> context switches back to the process without switching mm or a lazy mm
> >> switch).
> >>
> >> This makes lazy tlb code a bit more modular.
> >
> > I have a couple of membarrier fixes that I want to send out today or
> > tomorrow, and they might eliminate the need for this patch. Let me
> > think about this a little bit. I'll cc you. The existing code is way
> > too subtle and the comments are far too confusing for me to be quickly
> > confident about any of my conclusions :)
> >
>
> Thanks for the head's up. I'll have to have a better look through them
> but I don't know that it eliminates the need for this entirely although
> it might close some gaps and make this not a bug fix. The problem here
> is x86 code wanted something to be called when a lazy mm is unlazied,
> but it missed some spots and also the core scheduler doesn't need to
> know about those x86 details if it has this generic call that annotates
> the lazy handling better.

I'll send v3 tomorrow. They add more sync_core_before_usermode() callers.

Having looked at your patches a bunch and the membarrier code a bunch,
I don't think I like the approach of pushing this logic out into new
core functions called by arch code. Right now, even with my
membarrier patches applied, understanding how (for example) the x86
switch_mm_irqs_off() plus the scheduler code provides the barriers
that membarrier needs is quite complicated, and it's not clear to me
that the code is even correct. At the very least I'm pretty sure that
the x86 comments are misleading. With your patches, someone trying to
audit the code would have to follow core code calling into arch code
and back out into core code to figure out what's going on. I think
the result is worse.

I wrote this incomplete patch which takes the opposite approach (sorry
for whitespace damage):

commit 928b5c60e93f475934892d6e0b357ebf0a2bf87d
Author: Andy Lutomirski <[email protected]>
Date: Wed Dec 2 17:24:02 2020 -0800

[WIP] x86/mm: Handle unlazying membarrier core sync in the arch code

The core scheduler isn't a great place for
membarrier_mm_sync_core_before_usermode() -- the core scheduler
doesn't actually know whether we are lazy. With the old code, if a
CPU is running a membarrier-registered task, goes idle, gets unlazied
via a TLB shootdown IPI, and switches back to the
membarrier-registered task, it will do an unnecessary core sync.

Conveniently, x86 is the only architecture that does anything in this
hook, so we can just move the code.

XXX: actually delete the old code.

Signed-off-by: Andy Lutomirski <[email protected]>

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3338a1feccf9..e27300fc865b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -496,6 +496,8 @@ void switch_mm_irqs_off(struct mm_struct *prev,
struct mm_struct *next,
* from one thread in a process to another thread in the same
* process. No TLB flush required.
*/
+
+ // XXX: why is this okay wrt membarrier?
if (!was_lazy)
return;

@@ -508,12 +510,24 @@ void switch_mm_irqs_off(struct mm_struct *prev,
struct mm_struct *next,
smp_mb();
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
- next_tlb_gen)
+ next_tlb_gen) {
+ /*
+ * We're reactivating an mm, and membarrier might
+ * need to serialize. Tell membarrier.
+ */
+
+ // XXX: I can't understand the logic in
+ // membarrier_mm_sync_core_before_usermode(). What's
+ // the mm check for?
+ membarrier_mm_sync_core_before_usermode(next);
return;
+ }

/*
* TLB contents went out of date while we were in lazy
* mode. Fall through to the TLB switching code below.
+ * No need for an explicit membarrier invocation -- the CR3
+ * write will serialize.
*/
new_asid = prev_asid;
need_flush = true;

2020-12-05 08:05:28

by Nicholas Piggin

Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

Excerpts from Andy Lutomirski's message of December 3, 2020 3:09 pm:
> On Tue, Dec 1, 2020 at 6:50 PM Nicholas Piggin <[email protected]> wrote:
>>
>> Excerpts from Andy Lutomirski's message of November 29, 2020 3:55 am:
>> > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <[email protected]> wrote:
>> >>
>> >> And get rid of the generic sync_core_before_usermode facility. This is
>> >> functionally a no-op in the core scheduler code, but it also catches
>> >>
>> >> This helper is the wrong way around I think. The idea that membarrier
>> >> state requires a core sync before returning to user is the easy one
>> >> that does not need hiding behind membarrier calls. The gap in core
>> >> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
>> >> tricky detail that is better put in x86 lazy tlb code.
>> >>
>> >> Consider if an arch did not synchronize core in switch_mm either, then
>> >> membarrier_mm_sync_core_before_usermode would be in the wrong place
>> >> but arch specific mmu context functions would still be the right place.
>> >> There is also an exit_lazy_tlb case that is not covered by this call,
>> >> which could be a bug (a kthread uses the membarrier process's mm, then
>> >> context switches back to the process without switching mm or a lazy mm
>> >> switch).
>> >>
>> >> This makes lazy tlb code a bit more modular.
>> >
>> > I have a couple of membarrier fixes that I want to send out today or
>> > tomorrow, and they might eliminate the need for this patch. Let me
>> > think about this a little bit. I'll cc you. The existing code is way
>> > too subtle and the comments are far too confusing for me to be quickly
>> > confident about any of my conclusions :)
>> >
>>
>> Thanks for the head's up. I'll have to have a better look through them
>> but I don't know that it eliminates the need for this entirely although
>> it might close some gaps and make this not a bug fix. The problem here
>> is x86 code wanted something to be called when a lazy mm is unlazied,
>> but it missed some spots and also the core scheduler doesn't need to
>> know about those x86 details if it has this generic call that annotates
>> the lazy handling better.
>
> I'll send v3 tomorrow. They add more sync_core_before_usermode() callers.
>
> Having looked at your patches a bunch and the membarrier code a bunch,
> I don't think I like the approach of pushing this logic out into new
> core functions called by arch code. Right now, even with my
> membarrier patches applied, understanding how (for example) the x86
> switch_mm_irqs_off() plus the scheduler code provides the barriers
> that membarrier needs is quite complicated, and it's not clear to me
> that the code is even correct. At the very least I'm pretty sure that
> the x86 comments are misleading.
>
> With your patches, someone trying to
> audit the code would have to follow core code calling into arch code
> and back out into core code to figure out what's going on. I think
> the result is worse.

Sorry I missed this. Rather than reply to the later version, I'll reply
here where you have a bigger comment.

I disagree. Until now nobody following it noticed that the mm gets
un-lazied in other cases, because that was not too clear from the
code (only indirectly using non-standard terminology in the arch
support document).

In other words, membarrier needs a special sync to deal with the case
when a kthread takes the mm. exit_lazy_tlb gives membarrier code that
exact hook that it wants from the core scheduler code.
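
To make the window concrete, a commented timeline (an illustration, not
from the thread; CPU and task names are made up, and the membarrier
behaviour is the one described in the finish_task_switch() comments
earlier in the thread):

/*
 * P is registered for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.
 *
 *   CPU0                                  CPU1
 *   ----                                  ----
 *   P runs in user mode
 *   switch to kthread K; P's mm stays
 *     loaded as the lazy mm
 *                                         another thread of P calls
 *                                         membarrier(SYNC_CORE); it sees
 *                                         CPU0 running a kernel thread
 *                                         and may not send an IPI
 *   switch back to P; same mm, so
 *     switch_mm() may not run or may
 *     not serialize the core
 *   return to user via sysret, which
 *     is not core serializing on x86
 *
 * exit_lazy_tlb(P's mm, P) runs at the "switch back to P" step, which is
 * why x86 can issue sync_core_before_usermode() from there.
 */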

>
> I wrote this incomplete patch which takes the opposite approach (sorry
> for whitespace damage):

That said, if you want to move the code entirely within the x86 arch,
from exit_lazy_tlb to switch_mm_irqs_off, it's trivial and touches no
core code after my series :) and I would have no problem with doing that.

I suspect it might actually be more readable to go the other way and
pull most of the real_prev == next membarrier code into exit_lazy_tlb
instead, but I could be wrong; I don't know exactly how the x86 lazy
state correlates with core lazy tlb state.

Thanks,
Nick

>
> commit 928b5c60e93f475934892d6e0b357ebf0a2bf87d
> Author: Andy Lutomirski <[email protected]>
> Date: Wed Dec 2 17:24:02 2020 -0800
>
> [WIP] x86/mm: Handle unlazying membarrier core sync in the arch code
>
> The core scheduler isn't a great place for
> membarrier_mm_sync_core_before_usermode() -- the core scheduler
> doesn't actually know whether we are lazy. With the old code, if a
> CPU is running a membarrier-registered task, goes idle, gets unlazied
> via a TLB shootdown IPI, and switches back to the
> membarrier-registered task, it will do an unnecessary core sync.
>
> Conveniently, x86 is the only architecture that does anything in this
> hook, so we can just move the code.
>
> XXX: actually delete the old code.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 3338a1feccf9..e27300fc865b 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -496,6 +496,8 @@ void switch_mm_irqs_off(struct mm_struct *prev,
> struct mm_struct *next,
> * from one thread in a process to another thread in the same
> * process. No TLB flush required.
> */
> +
> + // XXX: why is this okay wrt membarrier?
> if (!was_lazy)
> return;
>
> @@ -508,12 +510,24 @@ void switch_mm_irqs_off(struct mm_struct *prev,
> struct mm_struct *next,
> smp_mb();
> next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> - next_tlb_gen)
> + next_tlb_gen) {
> + /*
> + * We're reactivating an mm, and membarrier might
> + * need to serialize. Tell membarrier.
> + */
> +
> + // XXX: I can't understand the logic in
> + // membarrier_mm_sync_core_before_usermode(). What's
> + // the mm check for?
> + membarrier_mm_sync_core_before_usermode(next);
> return;
> + }
>
> /*
> * TLB contents went out of date while we were in lazy
> * mode. Fall through to the TLB switching code below.
> + * No need for an explicit membarrier invocation -- the CR3
> + * write will serialize.
> */
> new_asid = prev_asid;
> need_flush = true;
>

2020-12-05 17:29:23

by Andy Lutomirski

Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode


> On Dec 5, 2020, at 12:00 AM, Nicholas Piggin <[email protected]> wrote:
>
>
> I disagree. Until now nobody following it noticed that the mm gets
> un-lazied in other cases, because that was not too clear from the
> code (only indirectly using non-standard terminology in the arch
> support document).

> In other words, membarrier needs a special sync to deal with the case
> when a kthread takes the mm.

I don’t think this is actually true. Somehow the x86 oddities about CR3 writes leaked too much into the membarrier core code and comments. (I doubt this is x86 specific. The actual x86 specific part seems to be that we can return to user mode without syncing the instruction stream.)

As far as I can tell, membarrier doesn’t care at all about laziness. Membarrier cares about rq->curr->mm. The fact that a cpu can switch its actual loaded mm without scheduling at all (on x86 at least) is entirely beside the point except insofar as it has an effect on whether a subsequent switch_mm() call serializes. If we notify membarrier about x86’s asynchronous CR3 writes, then membarrier needs to understand what to do with them, which results in an unmaintainable mess in membarrier *and* in the x86 code.
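
For reference, a simplified sketch of the loop in question (not from
this thread, reconstructed approximately; registration checks, cpumask
allocation, hotplug handling and fast paths are omitted, and the
_sketch names are placeholders). The point is that CPUs are selected
purely by rq->curr->mm:

static void ipi_mb_sketch(void *info)
{
        smp_mb();       /* what the PRIVATE_EXPEDITED IPI handler does */
}

static void membarrier_private_expedited_sketch(struct mm_struct *mm,
                                                struct cpumask *tmpmask)
{
        int cpu;

        smp_mb();       /* order prior accesses against reading rq->curr */

        rcu_read_lock();
        for_each_online_cpu(cpu) {
                struct task_struct *p;

                /* Target only CPUs whose *current* task runs this mm. */
                p = rcu_dereference(cpu_rq(cpu)->curr);
                if (p && p->mm == mm)
                        __cpumask_set_cpu(cpu, tmpmask);
        }
        rcu_read_unlock();

        preempt_disable();
        smp_call_function_many(tmpmask, ipi_mb_sketch, NULL, 1);
        preempt_enable();

        smp_mb();       /* order later accesses against the IPIs */
}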

I’m currently trying to document how membarrier actually works, and hopefully this will result in untangling membarrier from mmdrop() and such.

A silly part of this is that x86 already has a high quality implementation of most of membarrier(): flush_tlb_mm(). If you flush an mm’s TLB, we carefully propagate the flush to all threads, with attention to memory ordering. We can’t use this directly as an arch-specific implementation of membarrier because it has the annoying side effect of flushing the TLB and because upcoming hardware might be able to flush without guaranteeing a core sync. (Upcoming means Zen 3, but the Zen 3 implementation is sadly not usable by Linux.)

2020-12-05 23:18:18

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

Excerpts from Andy Lutomirski's message of December 6, 2020 2:11 am:
>
>> On Dec 5, 2020, at 12:00 AM, Nicholas Piggin <[email protected]> wrote:
>>
>>
>> I disagree. Until now nobody following it noticed that the mm gets
>> un-lazied in other cases, because that was not too clear from the
>> code (only indirectly using non-standard terminology in the arch
>> support document).
>
>> In other words, membarrier needs a special sync to deal with the case
>> when a kthread takes the mm.
>
> I don’t think this is actually true. Somehow the x86 oddities about
> CR3 writes leaked too much into the membarrier core code and comments.
> (I doubt this is x86 specific. The actual x86 specific part seems to
> be that we can return to user mode without syncing the instruction
> stream.)
>
> As far as I can tell, membarrier doesn’t care at all about laziness.
> Membarrier cares about rq->curr->mm. The fact that a cpu can switch
> its actual loaded mm without scheduling at all (on x86 at least) is
> entirely beside the point except insofar as it has an effect on
> whether a subsequent switch_mm() call serializes.

Core membarrier itself doesn't care about laziness, which is why the
membarrier flush should go in exit_lazy_tlb() or other x86-specific
code (at least until more architectures do the same thing and we move
it into generic code). I just meant that the non-serialising return, as
documented in the membarrier arch enablement doc, is what creates the
lazy tlb requirement.

If an mm was lazy tlb for a kernel thread and then it becomes unlazy,
and if switch_mm is serialising but return to user is not, then you
need a serialising instruction somewhere before return to user. unlazy
is the logical place to add that, because the lazy tlb mm (i.e.,
switching to a kernel thread and back without switching mm) is what
opens the hole.
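
Concretely, the sort of arch hook I have in mind is roughly the
following (just a sketch, not the actual patch; the membarrier_state
check mirrors what membarrier_mm_sync_core_before_usermode() does, and
the argument list is only illustrative):

/*
 * Sketch: x86 exit_lazy_tlb() doing the core serialisation when an mm
 * stops being a lazy tlb mm and we may return to its userspace without
 * any other serialising instruction on the way out.
 */
static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
	/* Only matters if this task is about to run userspace of @mm. */
	if (!mm || tsk->mm != mm)
		return;

	if (atomic_read(&mm->membarrier_state) &
	    MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)
		sync_core_before_usermode();
}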

> If we notify
> membarrier about x86’s asynchronous CR3 writes, then membarrier needs
> to understand what to do with them, which results in an unmaintainable
> mess in membarrier *and* in the x86 code.

How do you mean? exit_lazy_tlb is the opposite: the core scheduler
notifying arch code when an mm becomes not-lazy, and it has nothing to
do with membarrier at all. It's a convenient hook to do your un-lazying.
I guess you can also do it by checking things in switch_mm and keeping
state in arch code, but I don't think that's necessarily the best place
to put it.

So membarrier code is unchanged (it cares that the serialise is done at
un-lazy time), core code is simpler (no knowledge of this membarrier
quirk and it already knows about lazy-tlb so the calls actually improve
the documentation), and x86 code I would argue becomes nicer (or no real
difference at worst) because you can move some exit lazy tlb handling to
that specific call rather than decipher it from switch_mm.

>
> I’m currently trying to document how membarrier actually works, and
> hopefully this will result in untangling membarrier from mmdrop() and
> such.

That would be nice.

>
> A silly part of this is that x86 already has a high quality
> implementation of most of membarrier(): flush_tlb_mm(). If you flush
> an mm’s TLB, we carefully propagate the flush to all threads, with
> attention to memory ordering. We can’t use this directly as an
> arch-specific implementation of membarrier because it has the annoying
> side effect of flushing the TLB and because upcoming hardware might be
> able to flush without guaranteeing a core sync. (Upcoming means Zen
> 3, but the Zen 3 implementation is sadly not usable by Linux.)
>

A hardware broadcast TLB flush, you mean? What makes it unusable by
Linux out of curiosity?

2020-12-06 00:41:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

On Sat, Dec 5, 2020 at 3:15 PM Nicholas Piggin <[email protected]> wrote:
>
> Excerpts from Andy Lutomirski's message of December 6, 2020 2:11 am:
> >

> If an mm was lazy tlb for a kernel thread and then it becomes unlazy,
> and if switch_mm is serialising but return to user is not, then you
> need a serialising instruction somewhere before return to user. unlazy
> is the logical place to add that, because the lazy tlb mm (i.e.,
> switching to a kernel thread and back without switching mm) is what
> opens the hole.

The issue here is that unlazying on x86 sometimes serializes and
sometimes doesn't. It's straightforward to add logic to the x86 code
to serialize specifically in the case where membarrier core sync is
registered and unlazying would otherwise not serialize, but trying to
define sensible semantics for this in a call to core code seems
complicated. (Specifically, the x86 code only sometimes sends IPIs to
lazy CPUs for TLB flushes. If an IPI is skipped, then unlazying will
flush the TLB, and that operation is serializing.)

The whole lazy thing is IMO a red herring for membarrier(). The
membarrier() logic requires that switching *logical* mms
(rq->curr->mm) serializes before user mode if the new mm is registered
for core sync. AFAICT the only architecture on which this isn't
automatic is x86, and somehow the logic turned into "actually changing
rq->curr->mm serializes, but unlazying only sometimes serializes, so
we need to do an extra serialization step on unlazying operations"
instead of "tell x86 to make sure it always serializes when it
switches logical mms". The latter is easy to specify and easy to
implement.

>
> How do you mean? exit_lazy_tlb is the opposite: the core scheduler
> notifying arch code when an mm becomes not-lazy, and it has nothing to
> do with membarrier at all. It's a convenient hook to do your un-lazying.
> I guess you can also do it by checking things in switch_mm and keeping
> state in arch code, but I don't think that's necessarily the best place
> to put it.

I'm confused. I just re-read your patches, and it looks like you have
arch code calling exit_lazy_tlb(). On x86, if we do a TLB shootdown
IPI to a lazy CPU, the IPI handler will unlazy that CPU (by switching
to init_mm for real), and we have no way to notify the core scheduler
about this, so we don't. The result is that the core scheduler state
and the x86 state gets out of sync. If the core scheduler
subsequently switches us back to the mm that it thinks we were still
using lazily, then from the x86 code's perspective, we're not
unlazying -- we're just doing a regular switch from init_mm to some
other mm. This is why x86's switch_mm_irqs_off() totally ignores its
'prev' argument.
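
(For reference, the lazy branch of that flush IPI handler boils down to
roughly the following -- paraphrased, not the literal tlb.c code:)

	/*
	 * Sketch of the lazy-CPU case in x86's TLB flush IPI handler:
	 * instead of flushing on behalf of an mm this CPU isn't running,
	 * just switch to init_mm for real.  The core scheduler never
	 * hears about this, so its idea of the lazy mm goes stale.
	 */
	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
		switch_mm_irqs_off(NULL, &init_mm, NULL);
		return;
	}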

I'm honestly a bit surprised that other architectures don't do the
same thing. I suppose that some architectures don't use shootdown
IPIs at all, in which case there doesn't seem to be any good reason to
aggressively unlazy.

(Oddly, despite the fact that, since Ivy Bridge, x86 has a "just flush
the TLB" instruction, that instruction is considerably slower than the
old "switch mm and flush" operation. So the operation "switch to
init_mm" is only ever any slower than "flush and stay lazy" if we get
lucky and unlazy to the same mm before we get a second TLB shootdown
*and* if unlazying to the same mm would not have needed to flush. I
spend quite a bit of time tuning this stuff and being quite surprised
at the bizarre performance properties of Intel's TLB management
instructions.)

>
> So membarrier code is unchanged (it cares that the serialise is done at
> un-lazy time), core code is simpler (no knowledge of this membarrier
> quirk and it already knows about lazy-tlb so the calls actually improve
> the documentation), and x86 code I would argue becomes nicer (or no real
> difference at worst) because you can move some exit lazy tlb handling to
> that specific call rather than decipher it from switch_mm.

As above, I can't move the exit-lazy handling because the scheduler
doesn't know when I'm unlazying.

>
> >
> > I’m currently trying to document how membarrier actually works, and
> > hopefully this will result in untangling membarrier from mmdrop() and
> > such.
>
> That would be nice.

It's still a work in progress. I haven't actually convinced myself
that the non-IPI case in membarrier() is correct, nor have I convinced
myself that it's incorrect.

Anyway, I think that my patch is a bit incorrect and I either need a
barrier somewhere (which may already exist) or a store-release to
lazy_mm to make sure that all accesses to the lazy mm are done before
lazy_mm is freed. On x86, even aside from the fact that all stores
are releases, this isn't needed -- stopping using an mm is itself a
full barrier. Will this be a performance issue on power?

>
> >
> > A silly part of this is that x86 already has a high quality
> > implementation of most of membarrier(): flush_tlb_mm(). If you flush
> > an mm’s TLB, we carefully propagate the flush to all threads, with
> > attention to memory ordering. We can’t use this directly as an
> > arch-specific implementation of membarrier because it has the annoying
>> > side effect of flushing the TLB and because upcoming hardware might be
> > able to flush without guaranteeing a core sync. (Upcoming means Zen
> > 3, but the Zen 3 implementation is sadly not usable by Linux.)
> >
>
> A hardware broadcast TLB flush, you mean? What makes it unusable by
> Linux out of curiosity?

The new instruction is INVLPGB. Unfortunately, x86's ASID field is
very narrow, and there's no way we can give each mm the same ASID
across all CPUs, which means we can't accurately target the flush at
the correct set of TLB entries. I've asked engineers at both Intel
and AMD to widen the ASID field, but that will end up being
complicated -- x86 has run out of bits in its absurdly overloaded CR3
encoding, and widening the ASID to any reasonable size would require
adding a new way to switch mms. There are lots of reasons that x86
should do that anyway [0], but it would be a big project and I'm not
sure that either company is interested in big projects like that.

[0] On x86, you can't switch between (64-bit execution, 48-bit virtual
address space) and (64-bit execution, 57-bit address space) without
exiting 64-bit mode in the middle. This is because the way that the
addressing mode is split among multiple registers prevents a single
instruction from switching between the two states. This is absolutely
delightful for anyone trying to boot an OS on a system with a very,
very large amount of memory.

2020-12-06 04:16:11

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

Excerpts from Andy Lutomirski's message of December 6, 2020 10:36 am:
> On Sat, Dec 5, 2020 at 3:15 PM Nicholas Piggin <[email protected]> wrote:
>>
>> Excerpts from Andy Lutomirski's message of December 6, 2020 2:11 am:
>> >
>
>> If an mm was lazy tlb for a kernel thread and then it becomes unlazy,
>> and if switch_mm is serialising but return to user is not, then you
>> need a serialising instruction somewhere before return to user. unlazy
>> is the logical place to add that, because the lazy tlb mm (i.e.,
>> switching to a kernel thread and back without switching mm) is what
>> opens the hole.
>
> The issue here is that unlazying on x86 sometimes serializes and
> sometimes doesn't.

That's additional state that x86 keeps around though, which is
fine. It can optimise that case if it knows it's already
serialised.

> It's straightforward to add logic to the x86 code
> to serialize specifically in the case where membarrier core sync is
> registered and unlazying would otherwise not serialize, but trying to
> define sensible semantics for this in a call to core code seems
> complicated.

It's not, though; it's a call from core code (to arch code).

> (Specifically, the x86 code only sometimes sends IPIs to
> lazy CPUs for TLB flushes. If an IPI is skipped, then unlazying will
> flush the TLB, and that operation is serializing.)
>
> The whole lazy thing is IMO a red herring for membarrier(). The
> membarrier() logic requires that switching *logical* mms
> (rq->curr->mm) serializes before user mode if the new mm is registered
> for core sync.

It's not a red herring; the reason the IPI gets skipped is that
we go to a kernel thread -- that's all core code and core lazy tlb
handling.

That x86 might do some additional serialising ops during un-lazy in
some cases doesn't make it a red herring; it just means that you
can take advantage of that and avoid doing an extra serialising op.

> AFAICT the only architecture on which this isn't
> automatic is x86, and somehow the logic turned into "actually changing
> rq->curr->mm serializes, but unlazying only sometimes serializes, so
> we need to do an extra serialization step on unlazying operations"
> instead of "tell x86 to make sure it always serializes when it
> switches logical mms". The latter is easy to specify and easy to
> implement.
>
>>
>> How do you mean? exit_lazy_tlb is the opposite: the core scheduler
>> notifying arch code when an mm becomes not-lazy, and it has nothing to
>> do with membarrier at all. It's a convenient hook to do your un-lazying.
>> I guess you can also do it by checking things in switch_mm and keeping
>> state in arch code, but I don't think that's necessarily the best place
>> to put it.
>
> I'm confused. I just re-read your patches, and it looks like you have
> arch code calling exit_lazy_tlb().

More for code-comment / consistency than anything else. They are
entirely arch hooks.

> On x86, if we do a TLB shootdown
> IPI to a lazy CPU, the IPI handler will unlazy that CPU (by switching
> to init_mm for real), and we have no way to notify the core scheduler
> about this, so we don't. The result is that the core scheduler state
> and the x86 state gets out of sync. If the core scheduler
> subsequently switches us back to the mm that it thinks we were still
> using lazily, then from the x86 code's perspective, we're not
> unlazying -- we're just doing a regular switch from init_mm to some
> other mm. This is why x86's switch_mm_irqs_off() totally ignores its
> 'prev' argument.

You actually do have such a way to do that now that we've (hopefully)
closed the races, and I think you should use it, which might make
things simpler for you. See do_shoot_lazy_tlb() in patch 6.
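
Roughly speaking (a sketch, not the patch verbatim), the IPI function
just checks whether this CPU is holding the mm purely as a lazy tlb mm
and, if so, kicks it over to init_mm so the mm can be freed without
lazy references pinning it:

static void do_shoot_lazy_tlb(void *arg)
{
	struct mm_struct *mm = arg;

	if (current->active_mm == mm) {
		/* Only a kernel thread can be using @mm purely lazily. */
		WARN_ON_ONCE(current->mm);
		current->active_mm = &init_mm;
		switch_mm(mm, &init_mm, current);
	}
}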

> I'm honestly a bit surprised that other architectures don't do the
> same thing. I suppose that some architectures don't use shootdown
> IPIs at all, in which case there doesn't seem to be any good reason to
> aggressively unlazy.

powerpc/radix has done the same (in some cases) for a few years now.
It just doesn't fully exploit the final TLB shootdown to always clean
them all up and avoid the subsequent shoot-lazies IPI, but it could be
more aggressive there.

The powerpc virtualised hash architecture is the traditional one and
isn't conducive to this (translation management is done via hcalls, and
the hypervisor maintains the TLB) so I suspect that's why it wasn't
done earlier there. That will continue to rely on shoot-lazies.

> (Oddly, despite the fact that, since Ivy Bridge, x86 has a "just flush
> the TLB" instruction, that instruction is considerably slower than the
> old "switch mm and flush" operation. So the operation "switch to
> init_mm" is only ever any slower than "flush and stay lazy" if we get
> lucky and unlazy to the same mm before we get a second TLB shootdown
> *and* if unlazying to the same mm would not have needed to flush. I
> spend quite a bit of time tuning this stuff and being quite surprised
> at the bizarre performance properties of Intel's TLB management
> instructions.)

Well, you also cause an extra mm switch in the case where you return
to the same mm, which probably isn't uncommon (app<->idle).

>>
>> So membarrier code is unchanged (it cares that the serialise is done at
>> un-lazy time), core code is simpler (no knowledge of this membarrier
>> quirk and it already knows about lazy-tlb so the calls actually improve
>> the documentation), and x86 code I would argue becomes nicer (or no real
>> difference at worst) because you can move some exit lazy tlb handling to
>> that specific call rather than decipher it from switch_mm.
>
> As above, I can't move the exit-lazy handling because the scheduler
> doesn't know when I'm unlazying.

As above, you can actually tell it. But even if you don't do that, in
the current scheme it's still telling you a superset of what you need,
so you'd just move your extra checks there.

>
>>
>> >
>> > I’m currently trying to document how membarrier actually works, and
>> > hopefully this will result in untangling membarrier from mmdrop() and
>> > such.
>>
>> That would be nice.
>
> It's still a work in progress. I haven't actually convinced myself
> that the non-IPI case in membarrier() is correct, nor have I convinced
> myself that it's incorrect.
>
> Anyway, I think that my patch is a bit incorrect and I either need a
> barrier somewhere (which may already exist) or a store-release to
> lazy_mm to make sure that all accesses to the lazy mm are done before
> lazy_mm is freed. On x86, even aside from the fact that all stores
> are releases, this isn't needed -- stopping using an mm is itself a
> full barrier. Will this be a performance issue on power?

store-release is lwsync on power. Not so bad as a full barrier, but
probably not wonderful. The fast path would be worse than shoot-lazies
of course, but may not be prohibitive.
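
i.e., for the scheme you describe, dropping the lazy reference would
end up looking something like the below (a sketch only; "lazy_mm" is
the variable from your WIP and "unlazy_this_cpu" is just a made-up
name, neither exists today):

/* Hypothetical per-cpu tracking of the mm used as the lazy tlb mm. */
static DEFINE_PER_CPU(struct mm_struct *, lazy_mm);

static void unlazy_this_cpu(void)
{
	/*
	 * Release ordering: all prior accesses through the lazy mm must
	 * be ordered before it can be observed as unused (and freed).
	 * This is lwsync on powerpc; on x86 a plain store suffices.
	 */
	smp_store_release(this_cpu_ptr(&lazy_mm), &init_mm);
}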

I'm still going to pursue shoot-lazies for the merge window. As you
see it's about a dozen lines and an if (IS_ENABLED(... in core code.
Your change is common code, but a significant complexity (which
affects all archs) so needs a lot more review and testing at this
point.
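
For reference, the core-code side is essentially a pair of helpers
along these lines (a sketch; the helper and Kconfig names here are
illustrative, see the series for the real ones):

/* Refcount the lazy tlb mm only when the arch hasn't opted out. */
static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
		mmgrab(mm);
}

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
		mmdrop(mm);
}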

If x86 is already shooting lazies in its final TLB flush, I don't
know why you're putting so much effort in, though; surely it's more
complexity and (even slightly) more cost there too.

>
>>
>> >
>> > A silly part of this is that x86 already has a high quality
>> > implementation of most of membarrier(): flush_tlb_mm(). If you flush
>> > an mm’s TLB, we carefully propagate the flush to all threads, with
>> > attention to memory ordering. We can’t use this directly as an
>> > arch-specific implementation of membarrier because it has the annoying
>> > side affect of flushing the TLB and because upcoming hardware might be
>> > able to flush without guaranteeing a core sync. (Upcoming means Zen
>> > 3, but the Zen 3 implementation is sadly not usable by Linux.)
>> >
>>
>> A hardware broadcast TLB flush, you mean? What makes it unusable by
>> Linux out of curiosity?
>
> The new instruction is INVLPGB. Unfortunately, x86's ASID field is
> very narrow, and there's no way we can give each mm the same ASID
> across all CPUs, which means we can't accurately target the flush at
> the correct set of TLB entries. I've asked engineers at both Intel
> and AMD to widen the ASID field, but that will end up being
> complicated -- x86 has run out of bits in its absurdly overloaded CR3
> encoding, and widening the ASID to any reasonable size would require
> adding a new way to switch mms. There are lots of reasons that x86
> should do that anyway [0], but it would be a big project and I'm not
> sure that either company is interested in big projects like that.

Interesting, thanks. powerpc has a PID register for guest ASIDs that
implements about 20 bits.

The IPI is very flexible, though: it allows more complex/fine-grained
flushes and also lets software state be updated, so we've started using
it a bit. I haven't seen much software where the performance of IPIs is
prohibitive these days. Maybe improvements to threaded malloc, JVMs,
databases etc. reduce the number of flushes.


> [0] On x86, you can't switch between (64-bit execution, 48-bit virtual
> address space) and (64-bit execution, 57-bit address space) without
> exiting 64-bit mode in the middle. This is because the way that the
> addressing mode is split among multiple registers prevents a single
> instruction from switching between the two states. This is absolutely
> delightful for anyone trying to boot an OS on a system with a very,
> very large amount of memory.
>

powerpc has some issues like that with context switching guest / host
state: there are several MMU registers involved that can't be switched
with a single instruction. It doesn't require such a big hammer, just
a careful sequence to switch things.

Thanks,
Nick

2020-12-11 10:32:52

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

> On Dec 5, 2020, at 7:59 PM, Nicholas Piggin <[email protected]> wrote:
>

> I'm still going to pursue shoot-lazies for the merge window. As you
> see it's about a dozen lines and an if (IS_ENABLED(... in core code.
> Your change is common code, but a significant complexity (which
> affects all archs) so needs a lot more review and testing at this
> point.

I don't think it's ready for this merge window. I read the early
patches again, and I think they make the membarrier code worse, not
better. I'm not fundamentally opposed to the shoot-lazies concept,
but it needs more thought and it needs a cleaner foundation.

2020-12-14 09:48:48

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

Excerpts from Andy Lutomirski's message of December 11, 2020 10:11 am:
>> On Dec 5, 2020, at 7:59 PM, Nicholas Piggin <[email protected]> wrote:
>>
>
>> I'm still going to pursue shoot-lazies for the merge window. As you
>> see it's about a dozen lines and an if (IS_ENABLED(... in core code.
>> Your change is common code, but a significant complexity (which
>> affects all archs) so needs a lot more review and testing at this
>> point.
>
> I don't think it's ready for this merge window.

Yes, I meant the next one (aka this one from a development perspective :)).

> I read the early
> patches again, and I think they make the membarrier code worse, not
> better.

Mathieu and I disagree, so we are at an impasse. I addressed your
comment about not being able to do the additional core sync avoidance
from the exit tlb call (you can indeed do so in your arch code) and
about exit_lazy_tlb being a call into the scheduler (it's not) and
about the arch code not being able to reconcile lazy tlb mm with the
core scheduler code (you can).

I fundamentally think the core sync is an issue with what the membarrier
/ arch specifics are doing with lazy tlb mm switching, and not something
the core scheduler needs to know about at all. I don't see the big
problem with essentially moving it from an explicit call into
exit_lazy_tlb (which, from the scheduler's POV, better describes what it
is doing rather than how).

> I'm not fundamentally opposed to the shoot-lazies concept,
> but it needs more thought and it needs a cleaner foundation.

Well, shoot-lazies doesn't really rely on that membarrier change at
all; it just came along as a nice-looking cleanup, so that part can be
dropped from the series. It's not really foundational.

Thanks,
Nick

2020-12-14 13:49:30

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

Excerpts from Nicholas Piggin's message of December 14, 2020 2:07 pm:
> Excerpts from Andy Lutomirski's message of December 11, 2020 10:11 am:
>>> On Dec 5, 2020, at 7:59 PM, Nicholas Piggin <[email protected]> wrote:
>>>
>>
>>> I'm still going to pursue shoot-lazies for the merge window. As you
>>> see it's about a dozen lines and an if (IS_ENABLED(... in core code.
>>> Your change is common code, but a significant complexity (which
>>> affects all archs) so needs a lot more review and testing at this
>>> point.
>>
>> I don't think it's ready for this merge window.
>
> Yes, I meant the next one (aka this one from a development perspective :)).
>
>> I read the early
>> patches again, and I think they make the membarrier code worse, not
>> better.
>
> Mathieu and I disagree, so we are at an impasse.

Well, actually not really: I went and cut the exit_lazy_tlb stuff out
of the patch series; those are better untangled anyway. I think an
earlier version had something in exit_lazy_tlb for the mm refcounting
change, but it's not required now anyway.

I'll split them out and just work on the shoot-lazies series for now;
I might revisit exit_lazy_tlb after the dust settles from that and the
current membarrier changes. I'll test and repost shortly.

Thanks,
Nick