2021-11-09 12:53:27

by Nicholas Piggin

Subject: [PATCH v5 0/4] shoot lazy tlbs

Since v4, this fixes a kthread_use_mm refcounting bug and adds comments
in the code and changelog around the kthread_use_mm change in patch 1
(prompted by akpm's comment -- thanks).

It also adds and improves comments in code, changelogs, and Kconfig
options. The overall design is unchanged, though. Please merge.

This series has had trouble getting agreement, so I would like to
address a few sticking points and misconceptions up front, in the hope
that this results in constructive disagreement and actionable feedback.

* That the lazy mm scheme is complicated or bug prone.

This is not true. The concept is trivial and the core code is extremely
simple, essentially unchanged since Linus' active_mm email 20 years
ago, in the 2.3 days.

This series leaves the lazy tlb switching and ->active_mm semantics
entirely unchanged. It does change the refcounting, but the effects
are hidden under wrappers. It adds nothing new for code outside those
few places to think about, except that they must use the _lazy_tlb
helpers when refcounting this particular type of reference. This is
not much of a problem, since lazy mm references never "escape" from
the specific switching sequences and become hard to track. Refs that
go out into the wider world are always normal ones (i.e., created by
explicit mmgrab or kthread_use_mm).
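
For illustration, this is roughly all the core scheduler sees after
patch 1 (a condensed sketch of the context_switch() hunk in that
patch, not the verbatim code); the lazy grab and drop simply gain a
_lazy_tlb suffix:

	/* context_switch(), condensed */
	if (!next->mm) {                        // to kernel
		next->active_mm = prev->active_mm;
		if (prev->mm)                   // from user
			mmgrab_lazy_tlb(prev->active_mm); /* was mmgrab() */
	} else {                                // to user
		switch_mm_irqs_off(prev->active_mm, next->mm, next);
		if (!prev->mm) {                // from kernel
			/* mmdrop_lazy_tlb()ed in finish_task_switch() */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}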

* That membarrier code is complicated.

This is true. My series changes nothing at all about membarrier. It is
entirely about lazy mm, which had been virtually unchanged for many
years before membarrier existed. The membarrier code takes advantage
of memory ordering in the scheduler switch code that the lazy mm
refcounting was providing, so this series adds one commented smp_mb()
ifdef there to replace the refcount op being removed. That does not
restrict the ability to change membarrier code in future, because the
refcounted path has to be accounted for there anyway.

In other words, for any future change to membarrier code that handles
the refcounted lazy mm path that exists today, also handling the
non-refcounted option is trivial.
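
Concretely, the ordering that the refcount op was providing is kept
inside the helper itself; this is the include/linux/sched/mm.h change
from patch 2:

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
        if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
                mmdrop(mm);
        } else {
                /*
                 * mmdrop_lazy_tlb must provide a full memory barrier,
                 * see the membarrier comment in finish_task_switch()
                 * which relies on this.
                 */
                smp_mb();
        }
}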

* That active_mm should be removed from core code.

I don't know how to address this other than to say it is not a good or
well thought out idea. It is not happening, and it is certainly not
related to my series, which does not change ->active_mm semantics at
all.

* That this series provides an option for archs to enable which results
in stale ->active_mm pointers, whereby it is up to the arch to
ensure nothing dereferences those pointers.

This is FUD. It has always been false. Archs that enable
MMU_LAZY_TLB_SHOOTDOWN never ever have stale ->active_mm pointers,
ever. If active_mm is non-NULL, then that gives exactly the same
guarantees as you have today.
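
The reason is mechanical: before the final __mmdrop() tears the mm
down, patch 3 IPIs every CPU that may be using it lazily, and each such
CPU repoints its ->active_mm at init_mm (this is do_shoot_lazy_tlb()
from kernel/fork.c in patch 3):

static void do_shoot_lazy_tlb(void *arg)
{
        struct mm_struct *mm = arg;

        if (current->active_mm == mm) {
                WARN_ON_ONCE(current->mm);
                current->active_mm = &init_mm;
                switch_mm(mm, &init_mm, current);
        }
}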

* That performance of IPIs or other things is a problem.

I posted actual numbers showing this was not a concern, and listed
some options that could reduce them further if needed. No numbers
were ever posted to support the other side of the argument.

* That the series is a powerpc specific thing.

Untrue. I have trivial sparc and alpha conversions; they were the first
two architectures I looked at, being the ones I have SMP qemu
environments for.

* That this series somehow prevents future changes or improvements.

It doesn't.

* That the series is very complex, code is bad or has problems.

Look at the patches. They seem pretty small and simple to me. I am
happy to address specific issues that are pointed out though, and
have done so.

* That x86 is relevant here.

This series does not touch or affect x86 in any way. x86 has gone off
and done its own horrendously complicated and under-documented thing
with active_mm and the lazy mm concept. But that has been entirely
hidden from core code by the arch context switching hooks. Core code
continues to operate on the concept of ->mm and ->active_mm, and this
series does not change that at all. x86 is no more or less divorced
from that after the series.

Nothing the series does constrains x86 or future changes to it. The
option cannot be used by x86 immediately, but there is no reason x86
could not be adapted to use it, or change its scheme to something
else entirely. Where code can be adapted to be shared or made usable by
x86, I have no problem with doing that.

If I've missed something or I've got anything wrong with the above,
I'm happy to hear it.

Thanks,
Nick

Nicholas Piggin (4):
lazy tlb: introduce lazy mm refcount helper functions
lazy tlb: allow lazy tlb mm refcounting to be configurable
lazy tlb: shoot lazies, a non-refcounting lazy tlb option
powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

Documentation/vm/active_mm.rst | 6 ++++
arch/Kconfig | 32 +++++++++++++++++
arch/arm/mach-rpc/ecard.c | 2 +-
arch/powerpc/Kconfig | 1 +
arch/powerpc/kernel/smp.c | 2 +-
arch/powerpc/mm/book3s64/radix_tlb.c | 4 +--
fs/exec.c | 2 +-
include/linux/sched/mm.h | 20 +++++++++++
kernel/cpu.c | 2 +-
kernel/exit.c | 2 +-
kernel/fork.c | 51 ++++++++++++++++++++++++++++
kernel/kthread.c | 21 +++++++-----
kernel/sched/core.c | 35 +++++++++++++------
kernel/sched/sched.h | 4 ++-
14 files changed, 158 insertions(+), 26 deletions(-)

--
2.23.0


2021-11-09 12:53:28

by Nicholas Piggin

Subject: [PATCH v5 1/4] lazy tlb: introduce lazy mm refcount helper functions

Add explicit _lazy_tlb annotated functions for lazy mm refcounting.
This makes lazy mm references more obvious, and allows explicit
refcounting to be removed if it is not used.

The non-trivial change in kthread_use_mm/kthread_unuse_mm is because
the existing code is clever with refcounting: if the kthread's lazy tlb
mm (active_mm) happens to be the same as the mm to be used, the code
doesn't touch the refcount but instead transfers the lazy refcount to
the used-mm refcount. Now that the lazy tlb mm refcount may not be
equivalent to a regular refcount, this trick no longer works, so we
must mmgrab a regular reference on the mm to use, and mmdrop_lazy_tlb
the previous active_mm.
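
Condensed from the kthread.c hunk below, the before/after of that
sequence is:

	/* Before: transfer the lazy ref when active_mm == mm */
	if (active_mm != mm) {
		mmgrab(mm);
		tsk->active_mm = mm;
	}
	...
	if (active_mm != mm)
		mmdrop(active_mm);
	else
		smp_mb();

	/* After: always take a regular ref, always drop the lazy one */
	mmgrab(mm);
	...
	if (active_mm != mm)
		tsk->active_mm = mm;
	...
	mmdrop_lazy_tlb(active_mm);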

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/arm/mach-rpc/ecard.c | 2 +-
arch/powerpc/kernel/smp.c | 2 +-
arch/powerpc/mm/book3s64/radix_tlb.c | 4 ++--
fs/exec.c | 2 +-
include/linux/sched/mm.h | 11 +++++++++++
kernel/cpu.c | 2 +-
kernel/exit.c | 2 +-
kernel/kthread.c | 21 +++++++++++++--------
kernel/sched/core.c | 15 ++++++++-------
9 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index 53813f9464a2..c30df1097c52 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -253,7 +253,7 @@ static int ecard_init_mm(void)
current->mm = mm;
current->active_mm = mm;
activate_mm(active_mm, mm);
- mmdrop(active_mm);
+ mmdrop_lazy_tlb(active_mm);
ecard_init_pgtables(mm);
return 0;
}
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 605bab448f84..88875387a347 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1582,7 +1582,7 @@ void start_secondary(void *unused)
if (IS_ENABLED(CONFIG_PPC32))
setup_kup();

- mmgrab(&init_mm);
+ mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;

smp_store_cpu_info(cpu);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 7724af19ed7e..59156c2d2ebe 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -786,10 +786,10 @@ void exit_lazy_flush_tlb(struct mm_struct *mm, bool always_flush)
if (current->active_mm == mm) {
WARN_ON_ONCE(current->mm != NULL);
/* Is a kernel thread and is using mm as the lazy tlb */
- mmgrab(&init_mm);
+ mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;
switch_mm_irqs_off(mm, &init_mm, current);
- mmdrop(mm);
+ mmdrop_lazy_tlb(mm);
}

/*
diff --git a/fs/exec.c b/fs/exec.c
index a098c133d8d7..30ba5449bb14 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1030,7 +1030,7 @@ static int exec_mmap(struct mm_struct *mm)
mmput(old_mm);
return 0;
}
- mmdrop(active_mm);
+ mmdrop_lazy_tlb(active_mm);
return 0;
}

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5561486fddef..f7a0b347fecb 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,17 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
}

+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
+{
+ mmgrab(mm);
+}
+
+static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
+{
+ mmdrop(mm);
+}
+
/**
* mmget() - Pin the address space associated with a &struct mm_struct.
* @mm: The address space to pin.
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 192e43a87407..fffe8b738201 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -613,7 +613,7 @@ static int finish_cpu(unsigned int cpu)
*/
if (mm != &init_mm)
idle->active_mm = &init_mm;
- mmdrop(mm);
+ mmdrop_lazy_tlb(mm);
return 0;
}

diff --git a/kernel/exit.c b/kernel/exit.c
index 91a43e57a32e..8e7b41702ad6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -475,7 +475,7 @@ static void exit_mm(void)
__set_current_state(TASK_RUNNING);
mmap_read_lock(mm);
}
- mmgrab(mm);
+ mmgrab_lazy_tlb(mm);
BUG_ON(mm != current->active_mm);
/* more a memory barrier than a real lock */
task_lock(current);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 5b37a8567168..83ed75d531b4 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1350,14 +1350,19 @@ void kthread_use_mm(struct mm_struct *mm)
WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
WARN_ON_ONCE(tsk->mm);

+ /*
+ * It's possible that tsk->active_mm == mm here, but we must
+ * still mmgrab(mm) and mmdrop_lazy_tlb(active_mm), because lazy
+ * mm may not have its own refcount (see mmgrab/drop_lazy_tlb()).
+ */
+ mmgrab(mm);
+
task_lock(tsk);
/* Hold off tlb flush IPIs while switching mm's */
local_irq_disable();
active_mm = tsk->active_mm;
- if (active_mm != mm) {
- mmgrab(mm);
+ if (active_mm != mm)
tsk->active_mm = mm;
- }
tsk->mm = mm;
membarrier_update_current_mm(mm);
switch_mm_irqs_off(active_mm, mm, tsk);
@@ -1374,12 +1379,9 @@ void kthread_use_mm(struct mm_struct *mm)
* memory barrier after storing to tsk->mm, before accessing
* user-space memory. A full memory barrier for membarrier
* {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
- * mmdrop(), or explicitly with smp_mb().
+ * mmdrop_lazy_tlb().
*/
- if (active_mm != mm)
- mmdrop(active_mm);
- else
- smp_mb();
+ mmdrop_lazy_tlb(active_mm);

to_kthread(tsk)->oldfs = force_uaccess_begin();
}
@@ -1411,10 +1413,13 @@ void kthread_unuse_mm(struct mm_struct *mm)
local_irq_disable();
tsk->mm = NULL;
membarrier_update_current_mm(NULL);
+ mmgrab_lazy_tlb(mm);
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, tsk);
local_irq_enable();
task_unlock(tsk);
+
+ mmdrop(mm);
}
EXPORT_SYMBOL_GPL(kthread_unuse_mm);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bba4128a3e6..480205b6a188 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4831,13 +4831,14 @@ static struct rq *finish_task_switch(struct task_struct *prev)
* rq->curr, before returning to userspace, so provide them here:
*
* - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
- * provided by mmdrop(),
+ * provided by mmdrop_lazy_tlb(),
* - a sync_core for SYNC_CORE.
*/
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
- mmdrop(mm);
+ mmdrop_lazy_tlb(mm);
}
+
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
@@ -4900,9 +4901,9 @@ context_switch(struct rq *rq, struct task_struct *prev,

/*
* kernel -> kernel lazy + transfer active
- * user -> kernel lazy + mmgrab() active
+ * user -> kernel lazy + mmgrab_lazy_tlb() active
*
- * kernel -> user switch + mmdrop() active
+ * kernel -> user switch + mmdrop_lazy_tlb() active
* user -> user switch
*/
if (!next->mm) { // to kernel
@@ -4910,7 +4911,7 @@ context_switch(struct rq *rq, struct task_struct *prev,

next->active_mm = prev->active_mm;
if (prev->mm) // from user
- mmgrab(prev->active_mm);
+ mmgrab_lazy_tlb(prev->active_mm);
else
prev->active_mm = NULL;
} else { // to user
@@ -4926,7 +4927,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
switch_mm_irqs_off(prev->active_mm, next->mm, next);

if (!prev->mm) { // from kernel
- /* will mmdrop() in finish_task_switch(). */
+ /* will mmdrop_lazy_tlb() in finish_task_switch(). */
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
}
@@ -9441,7 +9442,7 @@ void __init sched_init(void)
/*
* The boot idle thread does lazy MMU switching as well:
*/
- mmgrab(&init_mm);
+ mmgrab_lazy_tlb(&init_mm);
enter_lazy_tlb(&init_mm, current);

/*
--
2.23.0

2021-11-09 12:54:47

by Nicholas Piggin

Subject: [PATCH v5 3/4] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

On big systems, the mm refcount can become highly contended when doing
a lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme
that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
any remaining lazy ones.

Shootdown IPIs are of some concern, but they have not been observed to be
a big problem with this scheme (the powerpc implementation generated
314 additional interrupts on a 144 CPU system during a kernel compile).
There are a number of strategies that could be employed to reduce IPIs
if they turn out to be a problem for some workload.
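
For reference, once an architecture meets those requirements, enabling
the scheme is a one-line Kconfig change, as patch 4 in this series does
for powerpc:

        select MMU_LAZY_TLB_SHOOTDOWN if PPC_BOOK3S_64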

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/Kconfig | 15 +++++++++++++++
kernel/fork.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 66 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 73d98edc5cdc..2b70a9e7b142 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -444,6 +444,21 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
# already).
config MMU_LAZY_TLB_REFCOUNT
def_bool y
+ depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
+# mm as a lazy tlb beyond its last reference count, by shooting down these
+# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
+# be using the mm as a lazy tlb, so that they may switch themselves to using
+# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
+# may be using mm as a lazy tlb mm.
+#
+# To implement this, an arch *must*:
+# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
+# at least all possible CPUs in which the mm is lazy.
+# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
+config MMU_LAZY_TLB_SHOOTDOWN
+ bool

config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
diff --git a/kernel/fork.c b/kernel/fork.c
index 38681ad44c76..a7da9b0bc402 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,53 @@ static void check_mm(struct mm_struct *mm)
#define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
#define free_mm(mm) (kmem_cache_free(mm_cachep, (mm)))

+static void do_shoot_lazy_tlb(void *arg)
+{
+ struct mm_struct *mm = arg;
+
+ if (current->active_mm == mm) {
+ WARN_ON_ONCE(current->mm);
+ current->active_mm = &init_mm;
+ switch_mm(mm, &init_mm, current);
+ }
+}
+
+static void do_check_lazy_tlb(void *arg)
+{
+ struct mm_struct *mm = arg;
+
+ WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void shoot_lazy_tlbs(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+ /*
+ * IPI overheads have not been found to be expensive, but they could
+ * be reduced in a number of possible ways, for example (in
+ * roughly increasing order of complexity):
+ * - A batch of mms requiring IPIs could be gathered and freed
+ * at once.
+ * - CPUs could store their active mm somewhere that can be
+ * remotely checked without a lock, to filter out
+ * false-positives in the cpumask.
+ * - After mm_users or mm_count reaches zero, switching away
+ * from the mm could clear mm_cpumask to reduce some IPIs
+ * (some batching or delaying would help).
+ * - A delayed freeing and RCU-like quiescing sequence based on
+ * mm switching to avoid IPIs completely.
+ */
+ on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+ if (IS_ENABLED(CONFIG_DEBUG_VM))
+ on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+ } else {
+ /*
+ * In this case, lazy tlb mms are refcounted and would not reach
+ * __mmdrop until all CPUs have switched away and mmdrop()ed.
+ */
+ }
+}
+
/*
* Called when the last reference to the mm
* is dropped: either by a lazy thread or by
@@ -695,6 +742,10 @@ void __mmdrop(struct mm_struct *mm)
{
BUG_ON(mm == &init_mm);
WARN_ON_ONCE(mm == current->mm);
+
+ /* Ensure no CPUs are using this as their lazy tlb mm */
+ shoot_lazy_tlbs(mm);
+
WARN_ON_ONCE(mm == current->active_mm);
mm_free_pgd(mm);
destroy_context(mm);
--
2.23.0

2021-11-09 13:48:59

by Nicholas Piggin

Subject: [PATCH v5 4/4] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

On a 16-socket 192-core POWER8 system, running a context switching
benchmark with as many software threads as CPUs (so each switch will go
in and out of idle), upstream can achieve a rate of only about 1
million context switches per second, limited by contention on the mm
refcount.

powerpc/64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN,
so enable the option. This increases the above benchmark to 118 million
context switches per second.

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/powerpc/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index ba5b66189358..8a584414ef67 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -249,6 +249,7 @@ config PPC
select IRQ_FORCED_THREADING
select MMU_GATHER_PAGE_SIZE
select MMU_GATHER_RCU_TABLE_FREE
+ select MMU_LAZY_TLB_SHOOTDOWN if PPC_BOOK3S_64
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE if PPC64 || NOT_COHERENT_CACHE
select NEED_SG_DMA_LENGTH
--
2.23.0

2021-11-09 13:48:59

by Nicholas Piggin

Subject: [PATCH v5 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable

Add CONFIG_MMU_LAZY_TLB_REFCOUNT, which enables refcounting of the lazy
tlb mm when it is context switched. This can be disabled by
architectures that don't require the refcounting, provided they clean
up lazy tlb mms when the last refcount is dropped. Currently the option
is always enabled, which matches existing behaviour, so this patch is
effectively a no-op.

Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

Signed-off-by: Nicholas Piggin <[email protected]>
---
Documentation/vm/active_mm.rst | 6 ++++++
arch/Kconfig | 17 +++++++++++++++++
include/linux/sched/mm.h | 13 +++++++++++--
kernel/sched/core.c | 22 ++++++++++++++++++----
kernel/sched/sched.h | 4 +++-
5 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/Documentation/vm/active_mm.rst b/Documentation/vm/active_mm.rst
index 6f8269c284ed..2b0d08332400 100644
--- a/Documentation/vm/active_mm.rst
+++ b/Documentation/vm/active_mm.rst
@@ -4,6 +4,12 @@
Active MM
=========

+Note, the mm_count refcount may no longer include the "lazy" users
+(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
+with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
+references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
+helpers, which abstract this config option.
+
::

List: linux-kernel
diff --git a/arch/Kconfig b/arch/Kconfig
index 8df1c7102643..73d98edc5cdc 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -428,6 +428,23 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
irqs disabled over activate_mm. Architectures that do IPI based TLB
shootdowns should enable this.

+# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
+# to/from kernel threads when the same mm is running on a lot of CPUs (a large
+# multi-threaded application), by reducing contention on the mm refcount.
+#
+# This can be disabled if the architecture ensures no CPUs are using an mm as a
+# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
+# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
+# final exit(2) TLB flush, for example.
+#
+# To implement this, an arch *must*:
+# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when dropping the
+# lazy reference of a kthread's ->active_mm (non-arch code has been converted
+# already).
+config MMU_LAZY_TLB_REFCOUNT
+ def_bool y
+
config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index f7a0b347fecb..939dee9f5bc8 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -52,12 +52,21 @@ static inline void mmdrop(struct mm_struct *mm)
/* Helpers for lazy TLB mm refcounting */
static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
- mmgrab(mm);
+ if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+ mmgrab(mm);
}

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
- mmdrop(mm);
+ if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+ mmdrop(mm);
+ } else {
+ /*
+ * mmdrop_lazy_tlb must provide a full memory barrier, see the
+ * membarrier comment in finish_task_switch() which relies on this.
+ */
+ smp_mb();
+ }
}

/**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 480205b6a188..8fdc004b4f1f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4772,7 +4772,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
__releases(rq->lock)
{
struct rq *rq = this_rq();
- struct mm_struct *mm = rq->prev_mm;
+ struct mm_struct *mm = NULL;
long prev_state;

/*
@@ -4791,7 +4791,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
current->comm, current->pid, preempt_count()))
preempt_count_set(FORK_PREEMPT_COUNT);

- rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+ mm = rq->prev_lazy_mm;
+ rq->prev_lazy_mm = NULL;
+#endif

/*
* A task struct has one reference for the use as "current".
@@ -4927,9 +4930,20 @@ context_switch(struct rq *rq, struct task_struct *prev,
switch_mm_irqs_off(prev->active_mm, next->mm, next);

if (!prev->mm) { // from kernel
- /* will mmdrop_lazy_tlb() in finish_task_switch(). */
- rq->prev_mm = prev->active_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+ /* Will mmdrop_lazy_tlb() in finish_task_switch(). */
+ rq->prev_lazy_mm = prev->active_mm;
prev->active_mm = NULL;
+#else
+ /*
+ * Without MMU_LAZY_TLB_REFCOUNT there is no lazy
+ * tracking (because no rq->prev_lazy_mm) in
+ * finish_task_switch, so no mmdrop_lazy_tlb(), so no
+ * memory barrier for membarrier (see the membarrier
+ * comment in finish_task_switch()). Do it here.
+ */
+ smp_mb();
+#endif
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d3e5793e117..43e7fbe06557 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -977,7 +977,9 @@ struct rq {
struct task_struct *idle;
struct task_struct *stop;
unsigned long next_balance;
- struct mm_struct *prev_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+ struct mm_struct *prev_lazy_mm;
+#endif

unsigned int clock_update_flags;
u64 clock;
--
2.23.0