2020-09-14 04:53:45

by Nicholas Piggin

Subject: [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes

This is an attempt to fix a few different related issues around
switching mm, TLB flushing, and lazy tlb mm handling.

This will require all architectures to eventually move to disabling
irqs over activate_mm, but it's possible we could add another arch
call after irqs are re-enabled for those few which can't do their
entire activation with irqs disabled.

Testing so far indicates this has fixed a mm refcounting bug that
powerpc was running into (via distro report and backport). I haven't
had any real feedback on this series outside powerpc (and it doesn't
really affect other archs), so I propose patches 1,2,4 go via the
powerpc tree.

There is no dependency between them and patch 3; I put it there only
because it follows the history of the code (the powerpc code was written
using the sparc64 logic), but I guess they have to go via different arch
trees. Dave, I'll leave patch 3 with you.

Thanks,
Nick

Since v1:
- Updates from Michael Ellerman's review comments.

Nicholas Piggin (4):
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

arch/Kconfig | 7 +++
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/mmu_context.h | 2 +-
arch/powerpc/include/asm/tlb.h | 13 ------
arch/powerpc/mm/book3s64/radix_tlb.c | 23 ++++++---
arch/sparc/kernel/smp_64.c | 65 ++++++--------------------
fs/exec.c | 17 ++++++-
7 files changed, 54 insertions(+), 74 deletions(-)

--
2.23.0


2020-09-14 04:54:09

by Nicholas Piggin

Subject: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.

This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).

Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:

call_usermodehelper()
  kernel_execve()
    old_mm = current->mm;
    active_mm = current->active_mm;
    *** preempt *** -------------------->  schedule()
                                             prev->active_mm = NULL;
                                             mmdrop(prev active_mm);
                                           ...
                    <--------------------  schedule()
    current->mm = mm;
    current->active_mm = mm;
    if (!old_mm)
        mmdrop(active_mm);

If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.

Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().

So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.

This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
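
For reference, on an architecture that selects the new option the resulting
ordering in exec_mmap() looks roughly like this (a simplified sketch of the
fs/exec.c hunk below; error handling and the vmacache update are omitted):

        task_lock(tsk);
        membarrier_exec_mmap(mm);

        local_irq_disable();
        active_mm = tsk->active_mm;
        tsk->active_mm = mm;
        tsk->mm = mm;
        /*
         * irqs stay disabled over activate_mm(), so neither a context
         * switch nor an IPI-based TLB shootdown can observe ->mm and
         * ->active_mm in an intermediate state.
         */
        activate_mm(active_mm, mm);
        local_irq_enable();
        task_unlock(tsk);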

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/Kconfig | 7 +++++++
fs/exec.c | 17 +++++++++++++++--
2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..94821e3f94d1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
bool
depends on MMU_GATHER_TABLE_FREE

+config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+ bool
+ help
+ Temporary select until all architectures can be converted to have
+ irqs disabled over activate_mm. Architectures that do IPI based TLB
+ shootdowns should enable this.
+
config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool

diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..d4fb18baf1fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
}

task_lock(tsk);
- active_mm = tsk->active_mm;
membarrier_exec_mmap(mm);
- tsk->mm = mm;
+
+ local_irq_disable();
+ active_mm = tsk->active_mm;
tsk->active_mm = mm;
+ tsk->mm = mm;
+ /*
+ * This prevents preemption while active_mm is being loaded and
+ * it and mm are being updated, which could cause problems for
+ * lazy tlb mm refcounting when these are updated by context
+ * switches. Not all architectures can handle irqs off over
+ * activate_mm yet.
+ */
+ if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+ local_irq_enable();
activate_mm(active_mm, mm);
+ if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+ local_irq_enable();
tsk->mm->vmacache_seqnum = 0;
vmacache_flush(tsk);
task_unlock(tsk);
--
2.23.0

2020-09-14 04:54:20

by Nicholas Piggin

Subject: [PATCH v2 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM

powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/mmu_context.h | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65bed1fdeaad..587ba8352d01 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -149,6 +149,7 @@ config PPC
select ARCH_USE_QUEUED_RWLOCKS if PPC_QUEUED_SPINLOCKS
select ARCH_USE_QUEUED_SPINLOCKS if PPC_QUEUED_SPINLOCKS
select ARCH_WANT_IPC_PARSE_VERSION
+ select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
select BUILDTIME_TABLE_SORT
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index a3a12a8341b2..b42813359f49 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -244,7 +244,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
#define activate_mm activate_mm
static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
{
- switch_mm(prev, next, current);
+ switch_mm_irqs_off(prev, next, current);
}

/* We don't currently use enter_lazy_tlb() for anything */
--
2.23.0

2020-09-14 04:55:30

by Nicholas Piggin

Subject: [PATCH v2 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that mm_users
would not be incremented via a reference taken outside the process context
with mmget_not_zero(), with that reference then being used for
kthread_use_mm().
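
Concretely, the problematic interleaving looks roughly like this (an
illustrative reconstruction using the function names involved, not a trace
of an observed failure):

  CPU0 (mm's last user thread)            CPU1 (io_uring kthread)
  radix__flush_tlb_mm()
    mm_is_singlethreaded(mm) /* true */
                                          mmget_not_zero(mm)
                                          kthread_use_mm(mm)
                                            /* sets CPU1 in mm_cpumask(mm)
                                               and switches to mm */
    exit_flush_lazy_tlbs(mm)
      /* IPIs the old mm_cpumask members */
      mm_reset_thread_local(mm)
        /* wipes CPU1's bit again */
    /* later flushes of mm skip CPU1, which
       may now hold stale translations */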

That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.

Fix this by no longer relying on that assumption, but by having each CPU
check that the mm is not being used, and clear its own bit from the mask
only if the mm hasn't been switched to by the time the IPI is processed.

This relies on commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
switch sequences.

Reviewed-by: Michael Ellerman <[email protected]>
Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")
Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/powerpc/include/asm/tlb.h | 13 -------------
arch/powerpc/mm/book3s64/radix_tlb.c | 23 ++++++++++++++++-------
2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index fbc6f3002f23..d97f061fecac 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
return false;
return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
}
-static inline void mm_reset_thread_local(struct mm_struct *mm)
-{
- WARN_ON(atomic_read(&mm->context.copros) > 0);
- /*
- * It's possible for mm_access to take a reference on mm_users to
- * access the remote mm from another thread, but it's not allowed
- * to set mm_cpumask, so mm_users may be > 1 here.
- */
- WARN_ON(current->mm != mm);
- atomic_set(&mm->context.active_cpus, 1);
- cpumask_clear(mm_cpumask(mm));
- cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
-}
#else /* CONFIG_PPC_BOOK3S_64 */
static inline int mm_is_thread_local(struct mm_struct *mm)
{
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..143b4fd396f0 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
struct mm_struct *mm = arg;
unsigned long pid = mm->context.id;

+ /*
+ * A kthread could have done a mmget_not_zero() after the flushing CPU
+ * checked mm_is_singlethreaded, and be in the process of
+ * kthread_use_mm when interrupted here. In that case, current->mm will
+ * be set to mm, because kthread_use_mm() setting ->mm and switching to
+ * the mm is done with interrupts off.
+ */
if (current->mm == mm)
- return; /* Local CPU */
+ goto out_flush;

if (current->active_mm == mm) {
- /*
- * Must be a kernel thread because sender is single-threaded.
- */
- BUG_ON(current->mm);
+ WARN_ON_ONCE(current->mm != NULL);
+ /* Is a kernel thread and is using mm as the lazy tlb */
mmgrab(&init_mm);
- switch_mm(mm, &init_mm, current);
current->active_mm = &init_mm;
+ switch_mm_irqs_off(mm, &init_mm, current);
mmdrop(mm);
}
+
+ atomic_dec(&mm->context.active_cpus);
+ cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
+
+out_flush:
_tlbiel_pid(pid, RIC_FLUSH_ALL);
}

@@ -672,7 +682,6 @@ static void exit_flush_lazy_tlbs(struct mm_struct *mm)
*/
smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
(void *)mm, 1);
- mm_reset_thread_local(mm);
}

void radix__flush_tlb_mm(struct mm_struct *mm)
--
2.23.0

2020-09-14 04:57:22

by Nicholas Piggin

Subject: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had used
the mm.

The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to operate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).

io_uring 2b188cc1bb857 ("Add io_uring IO interface") added code which
does a kthread_use_mm() from a mmget_not_zero() refcount.

The problem with this is that code which previously assumed mm == current->mm
and mm->mm_users == 1 implies the mm will remain single-threaded (at least
until this thread creates another mm_users reference) is now broken.

arch/sparc/kernel/smp_64.c:

        if (atomic_read(&mm->mm_users) == 1) {
                cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
                goto local_flush_and_out;
        }

vs fs/io_uring.c

        if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
                     !mmget_not_zero(ctx->sqo_mm)))
                return -EFAULT;
        kthread_use_mm(ctx->sqo_mm);

mmget_not_zero() could come in right after the mm_users == 1 test, then
kthread_use_mm() which sets its CPU in the mm_cpumask. That update could
be lost if cpumask_copy() occurs afterward.
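
Spelled out as a timeline (an illustrative interleaving, not an observed
trace):

  CPU0 (task with mm, mm_users == 1)      CPU1 (io_uring sqpoll kthread)
  smp_flush_tlb_mm(mm)
    atomic_read(&mm->mm_users) == 1
                                          mmget_not_zero(ctx->sqo_mm)
                                          kthread_use_mm(mm)
                                            /* sets CPU1 in mm_cpumask(mm) */
    cpumask_copy(mm_cpumask(mm),
                 cpumask_of(cpu))
      /* CPU1's bit is lost; later flushes
         of mm never cross-call CPU1 */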

I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.

The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. This is more
tricky so I leave it as an exercise for someone with a sparc64 SMP.
powerpc has a (currently similarly broken) example.

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/sparc/kernel/smp_64.c | 65 ++++++++------------------------------
1 file changed, 14 insertions(+), 51 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index e286e2badc8a..e38d8bf454e8 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
* are flush_tlb_*() routines, and these run after flush_cache_*()
* which performs the flushw.
*
- * The SMP TLB coherency scheme we use works as follows:
- *
- * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
- * space has (potentially) executed on, this is the heuristic
- * we use to avoid doing cross calls.
- *
- * Also, for flushing from kswapd and also for clones, we
- * use cpu_vm_mask as the list of cpus to make run the TLB.
- *
- * 2) TLB context numbers are shared globally across all processors
- * in the system, this allows us to play several games to avoid
- * cross calls.
- *
- * One invariant is that when a cpu switches to a process, and
- * that processes tsk->active_mm->cpu_vm_mask does not have the
- * current cpu's bit set, that tlb context is flushed locally.
- *
- * If the address space is non-shared (ie. mm->count == 1) we avoid
- * cross calls when we want to flush the currently running process's
- * tlb state. This is done by clearing all cpu bits except the current
- * processor's in current->mm->cpu_vm_mask and performing the
- * flush locally only. This will force any subsequent cpus which run
- * this task to flush the context from the local tlb if the process
- * migrates to another cpu (again).
- *
- * 3) For shared address spaces (threads) and swapping we bite the
- * bullet for most cases and perform the cross call (but only to
- * the cpus listed in cpu_vm_mask).
- *
- * The performance gain from "optimizing" away the cross call for threads is
- * questionable (in theory the big win for threads is the massive sharing of
- * address space state across processors).
+ * mm->cpu_vm_mask is a bit mask of which cpus an address
+ * space has (potentially) executed on, this is the heuristic
+ * we use to limit cross calls.
*/

/* This currently is only used by the hugetlb arch pre-fault
@@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
void smp_flush_tlb_mm(struct mm_struct *mm)
{
u32 ctx = CTX_HWBITS(mm->context);
- int cpu = get_cpu();

- if (atomic_read(&mm->mm_users) == 1) {
- cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
- goto local_flush_and_out;
- }
+ get_cpu();

smp_cross_call_masked(&xcall_flush_tlb_mm,
ctx, 0, 0,
mm_cpumask(mm));

-local_flush_and_out:
__flush_tlb_mm(ctx, SECONDARY_CONTEXT);

put_cpu();
@@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
{
u32 ctx = CTX_HWBITS(mm->context);
struct tlb_pending_info info;
- int cpu = get_cpu();
+
+ get_cpu();

info.ctx = ctx;
info.nr = nr;
info.vaddrs = vaddrs;

- if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
- cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
- else
- smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
- &info, 1);
+ smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
+ &info, 1);

__flush_tlb_pending(ctx, nr, vaddrs);

@@ -1134,14 +1098,13 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
{
unsigned long context = CTX_HWBITS(mm->context);
- int cpu = get_cpu();

- if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
- cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
- else
- smp_cross_call_masked(&xcall_flush_tlb_page,
- context, vaddr, 0,
- mm_cpumask(mm));
+ get_cpu();
+
+ smp_cross_call_masked(&xcall_flush_tlb_page,
+ context, vaddr, 0,
+ mm_cpumask(mm));
+
__flush_tlb_page(context, vaddr);

put_cpu();
--
2.23.0

2020-09-14 07:01:46

by Nicholas Piggin

Subject: Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

Excerpts from Nicholas Piggin's message of September 14, 2020 2:52 pm:

[...]

> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
> optimisation could be effectively restored by sending IPIs to mm_cpumask
> members and having them remove themselves from mm_cpumask. This is more
> tricky so I leave it as an exercise for someone with a sparc64 SMP.
> powerpc has a (currently similarly broken) example.

So this compiles and boots on qemu, but qemu does not support any
sparc64 machines with SMP. Attempting some simple hacks doesn't get
me far because openbios isn't populating an SMP device tree, which
blows up everywhere.

The patch is _relatively_ simple, hopefully it shouldn't explode, so
it's probably ready for testing on real SMP hardware, if someone has
a few cycles.

Thanks,
Nick

>
> Signed-off-by: Nicholas Piggin <[email protected]>
> ---
> arch/sparc/kernel/smp_64.c | 65 ++++++++------------------------------
> 1 file changed, 14 insertions(+), 51 deletions(-)
>
> diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
> index e286e2badc8a..e38d8bf454e8 100644
> --- a/arch/sparc/kernel/smp_64.c
> +++ b/arch/sparc/kernel/smp_64.c
> @@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
> * are flush_tlb_*() routines, and these run after flush_cache_*()
> * which performs the flushw.
> *
> - * The SMP TLB coherency scheme we use works as follows:
> - *
> - * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
> - * space has (potentially) executed on, this is the heuristic
> - * we use to avoid doing cross calls.
> - *
> - * Also, for flushing from kswapd and also for clones, we
> - * use cpu_vm_mask as the list of cpus to make run the TLB.
> - *
> - * 2) TLB context numbers are shared globally across all processors
> - * in the system, this allows us to play several games to avoid
> - * cross calls.
> - *
> - * One invariant is that when a cpu switches to a process, and
> - * that processes tsk->active_mm->cpu_vm_mask does not have the
> - * current cpu's bit set, that tlb context is flushed locally.
> - *
> - * If the address space is non-shared (ie. mm->count == 1) we avoid
> - * cross calls when we want to flush the currently running process's
> - * tlb state. This is done by clearing all cpu bits except the current
> - * processor's in current->mm->cpu_vm_mask and performing the
> - * flush locally only. This will force any subsequent cpus which run
> - * this task to flush the context from the local tlb if the process
> - * migrates to another cpu (again).
> - *
> - * 3) For shared address spaces (threads) and swapping we bite the
> - * bullet for most cases and perform the cross call (but only to
> - * the cpus listed in cpu_vm_mask).
> - *
> - * The performance gain from "optimizing" away the cross call for threads is
> - * questionable (in theory the big win for threads is the massive sharing of
> - * address space state across processors).
> + * mm->cpu_vm_mask is a bit mask of which cpus an address
> + * space has (potentially) executed on, this is the heuristic
> + * we use to limit cross calls.
> */
>
> /* This currently is only used by the hugetlb arch pre-fault
> @@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
> void smp_flush_tlb_mm(struct mm_struct *mm)
> {
> u32 ctx = CTX_HWBITS(mm->context);
> - int cpu = get_cpu();
>
> - if (atomic_read(&mm->mm_users) == 1) {
> - cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> - goto local_flush_and_out;
> - }
> + get_cpu();
>
> smp_cross_call_masked(&xcall_flush_tlb_mm,
> ctx, 0, 0,
> mm_cpumask(mm));
>
> -local_flush_and_out:
> __flush_tlb_mm(ctx, SECONDARY_CONTEXT);
>
> put_cpu();
> @@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
> {
> u32 ctx = CTX_HWBITS(mm->context);
> struct tlb_pending_info info;
> - int cpu = get_cpu();
> +
> + get_cpu();
>
> info.ctx = ctx;
> info.nr = nr;
> info.vaddrs = vaddrs;
>
> - if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
> - cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> - else
> - smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
> - &info, 1);
> + smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
> + &info, 1);
>
> __flush_tlb_pending(ctx, nr, vaddrs);
>
> @@ -1134,14 +1098,13 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
> void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
> {
> unsigned long context = CTX_HWBITS(mm->context);
> - int cpu = get_cpu();
>
> - if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
> - cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> - else
> - smp_cross_call_masked(&xcall_flush_tlb_page,
> - context, vaddr, 0,
> - mm_cpumask(mm));
> + get_cpu();
> +
> + smp_cross_call_masked(&xcall_flush_tlb_page,
> + context, vaddr, 0,
> + mm_cpumask(mm));
> +
> __flush_tlb_page(context, vaddr);
>
> put_cpu();
> --
> 2.23.0
>
>

2020-09-14 10:25:20

by Anatoly Pugachev

Subject: Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

On Mon, Sep 14, 2020 at 10:00 AM Nicholas Piggin <[email protected]> wrote:
>
> Excerpts from Nicholas Piggin's message of September 14, 2020 2:52 pm:
>
> [...]
>
> > The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
> > optimisation could be effectively restored by sending IPIs to mm_cpumask
> > members and having them remove themselves from mm_cpumask. This is more
> > tricky so I leave it as an exercise for someone with a sparc64 SMP.
> > powerpc has a (currently similarly broken) example.
>
> So this compiles and boots on qemu, but qemu does not support any
> sparc64 machines with SMP. Attempting some simple hacks doesn't get
> me far because openbios isn't populating an SMP device tree, which
> blows up everywhere.
>
> The patch is _relatively_ simple, hopefully it shouldn't explode, so
> it's probably ready for testing on real SMP hardware, if someone has
> a few cycles.

Nick,

applied this patch on top of the 'v5.9-rc5' tag, using my test VM (ldom)
with 32 vcpus.
The machine boots, and a stress-ng test (run as
"stress-ng --cpu 8 --io 8 --vm 8 --vm-bytes 2G --fork 8 --timeout 15m")
finishes without errors.

2020-09-14 11:00:13

by Peter Zijlstra

Subject: Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
> Reading and modifying current->mm and current->active_mm and switching
> mm should be done with irqs off, to prevent races seeing an intermediate
> state.
>
> This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
> invalidate"). At exec-time when the new mm is activated, the old one
> should usually be single-threaded and no longer used, unless something
> else is holding an mm_users reference (which may be possible).
>
> Absent other mm_users, there is also a race with preemption and lazy tlb
> switching. Consider the kernel_execve case where the current thread is
> using a lazy tlb active mm:
>
> call_usermodehelper()
> kernel_execve()
> old_mm = current->mm;
> active_mm = current->active_mm;
> *** preempt *** --------------------> schedule()
> prev->active_mm = NULL;
> mmdrop(prev active_mm);
> ...
> <-------------------- schedule()
> current->mm = mm;
> current->active_mm = mm;
> if (!old_mm)
> mmdrop(active_mm);
>
> If we switch back to the kernel thread from a different mm, there is a
> double free of the old active_mm, and a missing free of the new one.
>
> Closing this race only requires interrupts to be disabled while ->mm
> and ->active_mm are being switched, but the TLB problem requires also
> holding interrupts off over activate_mm. Unfortunately not all archs
> can do that yet, e.g., arm defers the switch if irqs are disabled and
> expects finish_arch_post_lock_switch() to be called to complete the
> flush; um takes a blocking lock in activate_mm().
>
> So as a first step, disable interrupts across the mm/active_mm updates
> to close the lazy tlb preempt race, and provide an arch option to
> extend that to activate_mm which allows architectures doing IPI based
> TLB shootdowns to close the second race.
>
> This is a bit ugly, but in the interest of fixing the bug and backporting
> before all architectures are converted this is a compromise.
>
> Signed-off-by: Nicholas Piggin <[email protected]>

Acked-by: Peter Zijlstra (Intel) <[email protected]>

I'm thinking we want this selected on x86 as well. Andy?

> ---
> arch/Kconfig | 7 +++++++
> fs/exec.c | 17 +++++++++++++++--
> 2 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..94821e3f94d1 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
> bool
> depends on MMU_GATHER_TABLE_FREE
>
> +config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
> + bool
> + help
> + Temporary select until all architectures can be converted to have
> + irqs disabled over activate_mm. Architectures that do IPI based TLB
> + shootdowns should enable this.
> +
> config ARCH_HAVE_NMI_SAFE_CMPXCHG
> bool
>
> diff --git a/fs/exec.c b/fs/exec.c
> index a91003e28eaa..d4fb18baf1fb 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
> }
>
> task_lock(tsk);
> - active_mm = tsk->active_mm;
> membarrier_exec_mmap(mm);
> - tsk->mm = mm;
> +
> + local_irq_disable();
> + active_mm = tsk->active_mm;
> tsk->active_mm = mm;
> + tsk->mm = mm;
> + /*
> + * This prevents preemption while active_mm is being loaded and
> + * it and mm are being updated, which could cause problems for
> + * lazy tlb mm refcounting when these are updated by context
> + * switches. Not all architectures can handle irqs off over
> + * activate_mm yet.
> + */
> + if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> + local_irq_enable();
> activate_mm(active_mm, mm);
> + if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> + local_irq_enable();
> tsk->mm->vmacache_seqnum = 0;
> vmacache_flush(tsk);
> task_unlock(tsk);
> --
> 2.23.0
>

2020-09-14 20:00:57

by David Miller

Subject: Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

From: Nicholas Piggin <[email protected]>
Date: Mon, 14 Sep 2020 14:52:18 +1000

...
> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
> optimisation could be effectively restored by sending IPIs to mm_cpumask
> members and having them remove themselves from mm_cpumask. This is more
> tricky so I leave it as an exercise for someone with a sparc64 SMP.
> powerpc has a (currently similarly broken) example.
>
> Signed-off-by: Nicholas Piggin <[email protected]>

Sad to see this optimization go away, but what can I do:

Acked-by: David S. Miller <[email protected]>

2020-09-15 02:49:29

by Nicholas Piggin

Subject: Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

Excerpts from [email protected]'s message of September 14, 2020 8:56 pm:
> On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
>> Reading and modifying current->mm and current->active_mm and switching
>> mm should be done with irqs off, to prevent races seeing an intermediate
>> state.
>>
>> This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
>> invalidate"). At exec-time when the new mm is activated, the old one
>> should usually be single-threaded and no longer used, unless something
>> else is holding an mm_users reference (which may be possible).
>>
>> Absent other mm_users, there is also a race with preemption and lazy tlb
>> switching. Consider the kernel_execve case where the current thread is
>> using a lazy tlb active mm:
>>
>> call_usermodehelper()
>> kernel_execve()
>> old_mm = current->mm;
>> active_mm = current->active_mm;
>> *** preempt *** --------------------> schedule()
>> prev->active_mm = NULL;
>> mmdrop(prev active_mm);
>> ...
>> <-------------------- schedule()
>> current->mm = mm;
>> current->active_mm = mm;
>> if (!old_mm)
>> mmdrop(active_mm);
>>
>> If we switch back to the kernel thread from a different mm, there is a
>> double free of the old active_mm, and a missing free of the new one.
>>
>> Closing this race only requires interrupts to be disabled while ->mm
>> and ->active_mm are being switched, but the TLB problem requires also
>> holding interrupts off over activate_mm. Unfortunately not all archs
>> can do that yet, e.g., arm defers the switch if irqs are disabled and
>> expects finish_arch_post_lock_switch() to be called to complete the
>> flush; um takes a blocking lock in activate_mm().
>>
>> So as a first step, disable interrupts across the mm/active_mm updates
>> to close the lazy tlb preempt race, and provide an arch option to
>> extend that to activate_mm which allows architectures doing IPI based
>> TLB shootdowns to close the second race.
>>
>> This is a bit ugly, but in the interest of fixing the bug and backporting
>> before all architectures are converted this is a compromise.
>>
>> Signed-off-by: Nicholas Piggin <[email protected]>
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>
> I'm thinking we want this selected on x86 as well. Andy?

Thanks for the ack. The plan was to take it through the powerpc tree,
but if you'd want x86 to select it, maybe a topic branch? Although
Michael will be away during the next merge window so I don't want to
get too fancy. Would you mind doing it in a follow up merge after
powerpc, being that it's (I think) a small change?

I do think all archs should be selecting this, and we want to remove
the divergent code paths from here as soon as possible. I was planning
to send patches for the N+1 window at least for all the easy archs.
But the sooner the better really, we obviously want to share code
coverage with x86 :)

Thanks,
Nick

2020-09-15 02:50:35

by Nicholas Piggin

Subject: Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

Excerpts from Anatoly Pugachev's message of September 14, 2020 8:23 pm:
> On Mon, Sep 14, 2020 at 10:00 AM Nicholas Piggin <[email protected]> wrote:
>>
>> Excerpts from Nicholas Piggin's message of September 14, 2020 2:52 pm:
>>
>> [...]
>>
>> > The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
>> > optimisation could be effectively restored by sending IPIs to mm_cpumask
>> > members and having them remove themselves from mm_cpumask. This is more
>> > tricky so I leave it as an exercise for someone with a sparc64 SMP.
>> > powerpc has a (currently similarly broken) example.
>>
>> So this compiles and boots on qemu, but qemu does not support any
>> sparc64 machines with SMP. Attempting some simple hacks doesn't get
>> me far because openbios isn't populating an SMP device tree, which
>> blows up everywhere.
>>
>> The patch is _relatively_ simple, hopefully it shouldn't explode, so
>> it's probably ready for testing on real SMP hardware, if someone has
>> a few cycles.
>
> Nick,
>
> applied this patch on top of the 'v5.9-rc5' tag, using my test VM (ldom)
> with 32 vcpus.
> The machine boots, and a stress-ng test (run as
> "stress-ng --cpu 8 --io 8 --vm 8 --vm-bytes 2G --fork 8 --timeout 15m")
> finishes without errors.
>

Thank you very much Anatoly.

Thanks,
Nick

2020-09-15 03:25:11

by Nicholas Piggin

Subject: Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

Excerpts from David Miller's message of September 15, 2020 5:59 am:
> From: Nicholas Piggin <[email protected]>
> Date: Mon, 14 Sep 2020 14:52:18 +1000
>
> ...
>> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
>> optimisation could be effectively restored by sending IPIs to mm_cpumask
>> members and having them remove themselves from mm_cpumask. This is more
>> tricky so I leave it as an exercise for someone with a sparc64 SMP.
>> powerpc has a (currently similarly broken) example.
>>
>> Signed-off-by: Nicholas Piggin <[email protected]>
>
> Sad to see this optimization go away, but what can I do:
>
> Acked-by: David S. Miller <[email protected]>
>

Thanks Dave, any objection if we merge this via the powerpc tree
to keep the commits together?

Thanks,
Nick

2020-09-15 11:34:28

by Michael Ellerman

Subject: Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

Nicholas Piggin <[email protected]> writes:
> Excerpts from [email protected]'s message of September 14, 2020 8:56 pm:
>> On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
>>> Reading and modifying current->mm and current->active_mm and switching
>>> mm should be done with irqs off, to prevent races seeing an intermediate
>>> state.
...
>>>
>>> Signed-off-by: Nicholas Piggin <[email protected]>
>>
>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>>
>> I'm thinking we want this selected on x86 as well. Andy?
>
> Thanks for the ack. The plan was to take it through the powerpc tree,
> but if you'd want x86 to select it, maybe a topic branch? Although
> Michael will be away during the next merge window so I don't want to
> get too fancy. Would you mind doing it in a follow up merge after
> powerpc, being that it's (I think) a small change?

Or get akpm to take the series, including the x86 change.

cheers

2020-09-15 19:45:51

by David Miller

Subject: Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

From: Nicholas Piggin <[email protected]>
Date: Tue, 15 Sep 2020 13:24:07 +1000

> Excerpts from David Miller's message of September 15, 2020 5:59 am:
>> From: Nicholas Piggin <[email protected]>
>> Date: Mon, 14 Sep 2020 14:52:18 +1000
>>
>> ...
>>> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
>>> optimisation could be effectively restored by sending IPIs to mm_cpumask
>>> members and having them remove themselves from mm_cpumask. This is more
>>> tricky so I leave it as an exercise for someone with a sparc64 SMP.
>>> powerpc has a (currently similarly broken) example.
>>>
>>> Signed-off-by: Nicholas Piggin <[email protected]>
>>
>> Sad to see this optimization go away, but what can I do:
>>
>> Acked-by: David S. Miller <[email protected]>
>>
>
> Thanks Dave, any objection if we merge this via the powerpc tree
> to keep the commits together?

No objection.

2020-09-18 12:20:40

by Michael Ellerman

Subject: Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

Nicholas Piggin <[email protected]> writes:
> Excerpts from [email protected]'s message of September 14, 2020 8:56 pm:
>> On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
>>> Reading and modifying current->mm and current->active_mm and switching
>>> mm should be done with irqs off, to prevent races seeing an intermediate
>>> state.
...
>>>
>>> This is a bit ugly, but in the interest of fixing the bug and backporting
>>> before all architectures are converted this is a compromise.
>>>
>>> Signed-off-by: Nicholas Piggin <[email protected]>
>>
>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>>
>> I'm thinking we want this selected on x86 as well. Andy?
>
> Thanks for the ack. The plan was to take it through the powerpc tree,
> but if you'd want x86 to select it, maybe a topic branch?

I've put this series in a topic branch based on v5.9-rc2:

https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/log/?h=topic/irqs-off-activate-mm

I plan to merge it into the powerpc/next tree for v5.10, but if anyone
else wants to merge it that's fine too.

cheers

2020-09-24 12:31:42

by Michael Ellerman

Subject: Re: [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes

On Mon, 14 Sep 2020 14:52:15 +1000, Nicholas Piggin wrote:
> This is an attempt to fix a few different related issues around
> switching mm, TLB flushing, and lazy tlb mm handling.
>
> This will require all architectures to eventually move to disabling
> irqs over activate_mm, but it's possible we could add another arch
> call after irqs are re-enabled for those few which can't do their
> entire activation with irqs disabled.
>
> [...]

Applied to powerpc/next.

[1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
https://git.kernel.org/powerpc/c/d53c3dfb23c45f7d4f910c3a3ca84bf0a99c6143
[2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
https://git.kernel.org/powerpc/c/66acd46080bd9e5ad2be4b0eb1d498d5145d058e
[3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
https://git.kernel.org/powerpc/c/bafb056ce27940c9994ea905336aa8f27b4f7275
[4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
https://git.kernel.org/powerpc/c/a665eec0a22e11cdde708c1c256a465ebe768047

cheers