This patch series implements the cleanups suggested by Peter and Andy,
removes lazy TLB mm refcounting on x86, and shows how other architectures
could implement that same optimization.
The previous patch series already seems to have removed most of the
cache line contention I was seeing at context switch time, so CPU use
of the memcache and memcache-like workloads has not changed measurably
with this patch series.
However, the memory bandwidth used by the memcache system has been
reduced by about 1%, to serve the same number of queries per second.
This happens on two socket Haswell and Broadwell systems. Maybe on
larger systems (4 or 8 socket) one might also see a measurable drop
in the amount of CPU time used, with workloads where the previous
patch series does not remove all cache line contention on the mm.
This is against the latest -tip tree, and seems to be stable (on top
of another tree) with workloads that do over a million context switches
a second.
The code in on_each_cpu_cond sets CPUs in a locally allocated bitmask,
which should never be used by other CPUs simultaneously. There is no
need to use locked memory accesses to set the bits in this bitmap.
Switch to __cpumask_set_cpu.
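For reference, the two helpers differ only in whether the bit is set with
an atomic (locked) operation; roughly, from include/linux/cpumask.h:

        static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
        {
                set_bit(cpumask_check(cpu), cpumask_bits(dstp));        /* atomic */
        }

        static inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
        {
                __set_bit(cpumask_check(cpu), cpumask_bits(dstp));      /* non-atomic */
        }

Since the bitmask here is local to the function, the non-atomic version
is safe and avoids a locked memory operation for every CPU that is set.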
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
kernel/smp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3a2681..2dbc842dd385 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -680,7 +680,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
preempt_disable();
for_each_online_cpu(cpu)
if (cond_func(cpu, info))
- cpumask_set_cpu(cpu, cpus);
+ __cpumask_set_cpu(cpu, cpus);
on_each_cpu_mask(cpus, func, info, wait);
preempt_enable();
free_cpumask_var(cpus);
--
2.14.4
Turn the dummy tlb_flush_remove_tables* defines into inline functions,
in order to get compiler type checking of the mm argument.
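A small illustration of the difference (just a sketch, not kernel code):
a do-nothing macro silently accepts an argument of any type, while a
static inline stub gets type checked by the compiler.

        struct mm_struct;

        #define stub_macro(mm) do { } while (0)

        static inline void stub_inline(struct mm_struct *mm)
        {
        }

        void example(int *p)
        {
                stub_macro(p);  /* compiles silently, despite the wrong type */
                stub_inline(p); /* warning: incompatible pointer type */
        }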
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
include/asm-generic/tlb.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e811ef7b8350..eb57062d9dd3 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -309,8 +309,13 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
* pointing to about-to-be-freed page table memory.
*/
#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
-#define tlb_flush_remove_tables(mm) do {} while (0)
-#define tlb_flush_remove_tables_local(mm) do {} while (0)
+static inline void tlb_flush_remove_tables(struct mm_struct *mm)
+{
+}
+
+static inline void tlb_flush_remove_tables_local(struct mm_struct *mm)
+{
+}
#endif
#endif /* _ASM_GENERIC__TLB_H */
--
2.14.4
Clarify exactly what the memory barrier synchronizes with.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 752dbf4e0e50..5321e02c4e09 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -263,8 +263,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
/*
* Read the tlb_gen to check whether a flush is needed.
* If the TLB is up to date, just use it.
- * The barrier synchronizes with the tlb_gen increment in
- * the TLB shootdown code.
+ * The TLB shootdown code first increments tlb_gen, and then
+ * sends IPIs to CPUs that have this CPU loaded and are not
+ * in lazy TLB mode. The barrier ensures we handle
+ * cpu_tlbstate.is_lazy before tlb_gen, keeping this code
+ * synchronized with the TLB flush code.
*/
smp_mb();
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
--
2.14.4
Instead of open coding bitmap magic, use on_each_cpu_cond
to determine which CPUs to send TLB flush IPIs to.
This might be a little bit slower than examining the bitmaps,
but it should be a lot easier to maintain in the long run.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 75 +++++++++++--------------------------------------------
1 file changed, 15 insertions(+), 60 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5321e02c4e09..671cc66df801 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -582,12 +582,19 @@ static void flush_tlb_func_remote(void *info)
flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
}
+static bool tlb_is_lazy(int cpu, void *data)
+{
+ return per_cpu(cpu_tlbstate.is_lazy, cpu);
+}
+
+static bool tlb_is_not_lazy(int cpu, void *data)
+{
+ return !per_cpu(cpu_tlbstate.is_lazy, cpu);
+}
+
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
- cpumask_var_t lazymask;
- unsigned int cpu;
-
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
if (info->end == TLB_FLUSH_ALL)
trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -596,6 +603,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
(info->end - info->start) >> PAGE_SHIFT);
if (is_uv_system()) {
+ unsigned int cpu;
/*
* This whole special case is confused. UV has a "Broadcast
* Assist Unit", which seems to be a fancy way to send IPIs.
@@ -619,28 +627,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
return;
}
- /*
- * A temporary cpumask is used in order to skip sending IPIs
- * to CPUs in lazy TLB state, while keeping them in mm_cpumask(mm).
- * If the allocation fails, simply IPI every CPU in mm_cpumask.
- */
- if (!alloc_cpumask_var(&lazymask, GFP_ATOMIC)) {
- smp_call_function_many(cpumask, flush_tlb_func_remote,
- (void *)info, 1);
- return;
- }
-
- cpumask_copy(lazymask, cpumask);
-
- for_each_cpu(cpu, lazymask) {
- if (per_cpu(cpu_tlbstate.is_lazy, cpu))
- cpumask_clear_cpu(cpu, lazymask);
- }
-
- smp_call_function_many(lazymask, flush_tlb_func_remote,
- (void *)info, 1);
-
- free_cpumask_var(lazymask);
+ on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
+ (void *)info, 1, GFP_ATOMIC, cpumask);
}
/*
@@ -709,50 +697,17 @@ void tlb_flush_remove_tables_local(void *arg)
}
}
-static void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm,
- struct cpumask *lazy_cpus)
-{
- int cpu;
-
- for_each_cpu(cpu, mm_cpumask(mm)) {
- if (!per_cpu(cpu_tlbstate.is_lazy, cpu))
- cpumask_set_cpu(cpu, lazy_cpus);
- }
-}
-
void tlb_flush_remove_tables(struct mm_struct *mm)
{
int cpu = get_cpu();
- cpumask_var_t lazy_cpus;
if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids) {
put_cpu();
return;
}
- if (!zalloc_cpumask_var(&lazy_cpus, GFP_ATOMIC)) {
- /*
- * If the cpumask allocation fails, do a brute force flush
- * on all the CPUs that have this mm loaded.
- */
- smp_call_function_many(mm_cpumask(mm),
- tlb_flush_remove_tables_local, (void *)mm, 1);
- put_cpu();
- return;
- }
-
- /*
- * CPUs with !is_lazy either received a TLB flush IPI while the user
- * pages in this address range were unmapped, or have context switched
- * and reloaded %CR3 since then.
- *
- * Shootdown IPIs at page table freeing time only need to be sent to
- * CPUs that may have out of date TLB contents.
- */
- mm_fill_lazy_tlb_cpu_mask(mm, lazy_cpus);
- smp_call_function_many(lazy_cpus,
- tlb_flush_remove_tables_local, (void *)mm, 1);
- free_cpumask_var(lazy_cpus);
+ on_each_cpu_cond_mask(tlb_is_lazy, tlb_flush_remove_tables_local,
+ (void *)mm, 1, GFP_ATOMIC, mm_cpumask(mm));
put_cpu();
}
--
2.14.4
Add a config variable indicating that this architecture does not require
lazy TLB mm refcounting, because lazy TLB mms get shot down instantly
at exit_mmap time.
Signed-off-by: Rik van Riel <[email protected]>
---
arch/Kconfig | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/Kconfig b/arch/Kconfig
index 1aa59063f1fd..1baf1a176dbf 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -237,6 +237,10 @@ config ARCH_HAS_FORTIFY_SOURCE
config ARCH_HAS_SET_MEMORY
bool
+# Select if arch shoots down lazy TLB mms at exit time, instead of refcounting
+config ARCH_NO_ACTIVE_MM_REFCOUNTING
+ bool
+
# Select if arch init_task must go in the __init_task_data section
config ARCH_TASK_STRUCT_ON_STACK
bool
--
2.14.4
Shooting down lazy TLB references to an mm at exit_mmap time ensures
that no users of the mm_struct will be left anywhere in the system,
allowing it to be torn down and freed immediately.
Signed-off-by: Rik van Riel <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/mmu_context.h | 1 +
arch/x86/include/asm/tlbflush.h | 2 ++
arch/x86/mm/tlb.c | 15 +++++++++++++++
4 files changed, 19 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6d4774f203d0..ecdfc6933203 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -75,6 +75,7 @@ config X86
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
+ select ARCH_NO_ACTIVE_MM_REFCOUNTING
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_USE_BUILTIN_BSWAP
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index eeeb9289c764..529bf7bc5f75 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -238,6 +238,7 @@ static inline void arch_exit_mmap(struct mm_struct *mm)
{
paravirt_arch_exit_mmap(mm);
ldt_arch_exit_mmap(mm);
+ lazy_tlb_exit_mmap(mm);
}
#ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 511bf5fae8b8..3966a45367cd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -538,6 +538,8 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
native_flush_tlb_others(mask, info)
#endif
+extern void lazy_tlb_exit_mmap(struct mm_struct *mm);
+
extern void tlb_flush_remove_tables(struct mm_struct *mm);
extern void tlb_flush_remove_tables_local(void *arg);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ea4ef5ceaba2..7b1add904396 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -713,6 +713,21 @@ void tlb_flush_remove_tables(struct mm_struct *mm)
put_cpu();
}
+/*
+ * At exit or execve time, all other threads of a process have disappeared,
+ * but other CPUs could still be referencing this mm in lazy TLB mode.
+ * Get rid of those references before releasing the mm.
+ */
+void lazy_tlb_exit_mmap(struct mm_struct *mm)
+{
+ int cpu = get_cpu();
+
+ if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+ on_each_cpu_mask(mm_cpumask(mm), leave_mm, NULL, 1);
+
+ put_cpu();
+}
+
static void do_flush_tlb_all(void *info)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
--
2.14.4
Conditionally skip lazy TLB mm refcounting. When an architecture has
CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
lazy TLB mode anywhere will get shot down from exit_mmap, and there
is no need to incur the cache line bouncing overhead of refcounting
a lazy TLB mm.
Implement this by moving the refcounting of a lazy TLB mm to helper
functions, which skip the refcounting when it is not necessary.
Deal with use_mm and unuse_mm by fully splitting out the refcounting
of the lazy TLB mm a kernel thread may have when entering use_mm from
the refcounting of the mm that use_mm is about to start using.
Signed-off-by: Rik van Riel <[email protected]>
---
fs/exec.c | 2 +-
include/linux/sched/mm.h | 25 +++++++++++++++++++++++++
kernel/sched/core.c | 6 +++---
mm/mmu_context.c | 21 ++++++++++++++-------
4 files changed, 43 insertions(+), 11 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index bdd0eacefdf5..7a6d4811b02b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1043,7 +1043,7 @@ static int exec_mmap(struct mm_struct *mm)
mmput(old_mm);
return 0;
}
- mmdrop(active_mm);
+ drop_lazy_mm(active_mm);
return 0;
}
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 44d356f5e47c..7308bf38012f 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,31 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
}
+/*
+ * In lazy TLB mode, a CPU keeps the mm of the last process mapped while
+ * running a kernel thread or idle; we must make sure the lazy TLB mm and
+ * page tables do not disappear while a lazy TLB mode CPU uses them.
+ * There are two ways to handle the race between lazy TLB CPUs and exit_mmap:
+ * 1) Have a lazy TLB CPU hold a refcount on the lazy TLB mm.
+ * 2) Have the architecture code shoot down the lazy TLB mm from exit_mmap;
+ * in that case, refcounting can be skipped, reducing cache line bouncing.
+ */
+static inline void grab_lazy_mm(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
+ return;
+
+ mmgrab(mm);
+}
+
+static inline void drop_lazy_mm(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
+ return;
+
+ mmdrop(mm);
+}
+
/**
* mmget() - Pin the address space associated with a &struct mm_struct.
* @mm: The address space to pin.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c45de46fdf10..11724c9e88b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2691,7 +2691,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
*/
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
- mmdrop(mm);
+ drop_lazy_mm(mm);
}
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
@@ -2805,7 +2805,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
*/
if (!mm) {
next->active_mm = oldmm;
- mmgrab(oldmm);
+ grab_lazy_mm(oldmm);
enter_lazy_tlb(oldmm, next);
} else
switch_mm_irqs_off(oldmm, mm, next);
@@ -5532,7 +5532,7 @@ void idle_task_exit(void)
current->active_mm = &init_mm;
finish_arch_post_lock_switch();
}
- mmdrop(mm);
+ drop_lazy_mm(mm);
}
/*
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3e612ae748e9..d5c2524cdd9a 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -24,12 +24,15 @@ void use_mm(struct mm_struct *mm)
struct mm_struct *active_mm;
struct task_struct *tsk = current;
+ /* Kernel threads have a NULL tsk->mm when entering. */
+ WARN_ON(tsk->mm);
+
task_lock(tsk);
+ /* Previous ->active_mm was held in lazy TLB mode. */
active_mm = tsk->active_mm;
- if (active_mm != mm) {
- mmgrab(mm);
- tsk->active_mm = mm;
- }
+ /* Grab mm for reals; tsk->mm needs to stick around until unuse_mm. */
+ mmgrab(mm);
+ tsk->active_mm = mm;
tsk->mm = mm;
switch_mm(active_mm, mm, tsk);
task_unlock(tsk);
@@ -37,8 +40,9 @@ void use_mm(struct mm_struct *mm)
finish_arch_post_lock_switch();
#endif
- if (active_mm != mm)
- mmdrop(active_mm);
+ /* Drop the lazy TLB mode mm. */
+ if (active_mm)
+ drop_lazy_mm(active_mm);
}
EXPORT_SYMBOL_GPL(use_mm);
@@ -57,8 +61,11 @@ void unuse_mm(struct mm_struct *mm)
task_lock(tsk);
sync_mm_rss(mm);
tsk->mm = NULL;
- /* active_mm is still 'mm' */
+ /* active_mm is still 'mm'; grab it as a lazy TLB mm */
+ grab_lazy_mm(mm);
enter_lazy_tlb(mm, tsk);
+ /* drop the tsk->mm refcount */
+ mmdrop(mm);
task_unlock(tsk);
}
EXPORT_SYMBOL_GPL(unuse_mm);
--
2.14.4
When switching back from lazy TLB mode to a thread of the same process
that switched into lazy TLB mode, we still have the cr4 (and sometimes
LDT) of that process loaded, and there is no need to reload it.
When there was no TLB flush while the CPU was in lazy TLB mode, the
current code in switch_mm_irqs_off already avoids the reload, by
returning early.
However, when the TLB contents on the CPU are out of date, and we
flush the TLB for the task, we fall through to the regular context
switching code. This patch teaches that code to skip the cr4 and LDT
reloads when switching back to the same mm after a TLB flush.
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 671cc66df801..149fb64e4bf4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -367,8 +367,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, next);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
- load_mm_cr4(next);
- switch_ldt(real_prev, next);
+ if (next != real_prev) {
+ load_mm_cr4(next);
+ switch_ldt(real_prev, next);
+ }
}
/*
--
2.14.4
Introduce a variant of on_each_cpu_cond that iterates only over the
CPUs in a cpumask, in order to avoid making callbacks for every single
CPU in the system when we only need to test a subset.
Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/smp.h | 4 ++++
kernel/smp.c | 17 +++++++++++++----
kernel/up.c | 14 +++++++++++---
3 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 9fb239e12b82..a56f08ff3097 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -53,6 +53,10 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags);
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags, const struct cpumask *mask);
+
int smp_call_function_single_async(int cpu, call_single_data_t *csd);
#ifdef CONFIG_SMP
diff --git a/kernel/smp.c b/kernel/smp.c
index 2dbc842dd385..f4cf1b0bb3b8 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -667,9 +667,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
* You must not call this function with disabled interrupts or
* from a hardware interrupt handler or from a bottom half handler.
*/
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
- gfp_t gfp_flags)
+ gfp_t gfp_flags, const struct cpumask *mask)
{
cpumask_var_t cpus;
int cpu, ret;
@@ -678,7 +678,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
preempt_disable();
- for_each_online_cpu(cpu)
+ for_each_cpu(cpu, mask)
if (cond_func(cpu, info))
__cpumask_set_cpu(cpu, cpus);
on_each_cpu_mask(cpus, func, info, wait);
@@ -690,7 +690,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
* just have to IPI them one by one.
*/
preempt_disable();
- for_each_online_cpu(cpu)
+ for_each_cpu(cpu, mask)
if (cond_func(cpu, info)) {
ret = smp_call_function_single(cpu, func,
info, wait);
@@ -699,6 +699,15 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
preempt_enable();
}
}
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags)
+{
+ on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags,
+ cpu_online_mask);
+}
EXPORT_SYMBOL(on_each_cpu_cond);
static void do_nothing(void *unused)
diff --git a/kernel/up.c b/kernel/up.c
index 42c46bf3e0a5..ff536f9cc8a2 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -68,9 +68,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
* Preemption is disabled here to make sure the cond_func is called under the
* same condtions in UP and SMP.
*/
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
- smp_call_func_t func, void *info, bool wait,
- gfp_t gfp_flags)
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags, const struct cpumask *mask)
{
unsigned long flags;
@@ -82,6 +82,14 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
}
preempt_enable();
}
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+ smp_call_func_t func, void *info, bool wait,
+ gfp_t gfp_flags)
+{
+ on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags, NULL);
+}
EXPORT_SYMBOL(on_each_cpu_cond);
int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
--
2.14.4
The function leave_mm does not use its cpu argument, but always works on
the CPU where it is called. Change the argument to a void *, so leave_mm
can be passed directly to on_each_cpu_mask, and stop looking up the
CPU number in current leave_mm callers.
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/mmu.h | 2 +-
arch/x86/mm/tlb.c | 2 +-
arch/x86/xen/mmu_pv.c | 2 +-
drivers/acpi/processor_idle.c | 2 +-
4 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5ff3e8af2c20..ec27966f338f 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -61,6 +61,6 @@ typedef struct {
.ctx_id = 1, \
}
-void leave_mm(int cpu);
+void leave_mm(void * dummy);
#endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 149fb64e4bf4..ea4ef5ceaba2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -121,7 +121,7 @@ static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
write_cr3(new_mm_cr3);
}
-void leave_mm(int cpu)
+void leave_mm(void *dummy)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 52206ad81e4b..3402d2bf4ae0 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -984,7 +984,7 @@ static void drop_mm_ref_this_cpu(void *info)
struct mm_struct *mm = info;
if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm)
- leave_mm(smp_processor_id());
+ leave_mm(NULL);
/*
* If this cpu still has a stale cr3 reference, then make sure
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index abb559cd28d7..675ffacdd82b 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -714,7 +714,7 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
static void acpi_idle_enter_bm(struct acpi_processor *pr,
struct acpi_processor_cx *cx, bool timer_bc)
{
- acpi_unlazy_tlb(smp_processor_id());
+ acpi_unlazy_tlb(NULL);
/*
* Must be done before busmaster disable as we might need to
--
2.14.4
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> Introduce a variant of on_each_cpu_cond that iterates only over the
> CPUs in a cpumask, in order to avoid making callbacks for every single
> CPU in the system when we only need to test a subset.
Nice.
Although, if you want to be really fancy, you could optimize this (or
add a variant) that does the callback on the local CPU in parallel
with the remote ones. That would give a small boost to TLB flushes.
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> Instead of open coding bitmap magic, use on_each_cpu_cond
> to determine which CPUs to send TLB flush IPIs to.
>
> This might be a little bit slower than examining the bitmaps,
> but it should be a lot easier to maintain in the long run.
Looks good.
I assume it's not easy to get the remove-tables case to do a single
on_each_cpu_cond() instead of two? Currently it's doing the lazy ones
and the non-lazy ones separately.
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> The code in on_each_cpu_cond sets CPUs in a locally allocated bitmask,
> which should never be used by other CPUs simultaneously. There is no
> need to use locked memory accesses to set the bits in this bitmap.
>
> Switch to __cpumask_set_cpu.
Reviewed-by: Andy Lutomirski <[email protected]>
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> Clarify exactly what the memory barrier synchronizes with.
Reviewed-by: Andy Lutomirski <[email protected]>
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> Conditionally skip lazy TLB mm refcounting. When an architecture has
> CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> lazy TLB mode anywhere will get shot down from exit_mmap, and there
> in no need to incur the cache line bouncing overhead of refcounting
> a lazy TLB mm.
Unless I've misunderstood something, this patch results in idle tasks
whose active_mm has been freed still having active_mm pointing at
freed memory. This isn't strictly speaking a bug, but it's extremely
confusing and risks all kinds of nasty errors. That's why I prefer
the approach of actually removing the active_mm field on x86 rather
than merely removing the refcount.
I realize that this will add more ifdeffery and make the patch a bit
bigger, but I think it'll be much more robust. Not to mention that it
will save a pointer and a refcount per mm_struct, but that barely
matters.
On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> When switching back from lazy TLB mode to a thread of the same process
> that switched into lazy TLB mode, we still have the cr4 (and sometimes
> LDT) of that process loaded, and there is no need to reload it.
>
> When there was no TLB flush while the CPU was in lazy TLB mode, the
> current code in switch_mm_irqs_off already avoids the reload, by
> returning early.
>
> However, when the TLB contents on the CPU are out of date, and we
> flush the TLB for the task, we fall through to the regular context
> switching code. This patch teaches that code to skip the cr4 and LDT
> flushes when switching back to the same mm after a flush.
Acked-by: Andy Lutomirski <[email protected]>
On Sat, 2018-07-28 at 19:57 -0700, Andy Lutomirski wrote:
> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> wrote:
> > Introduce a variant of on_each_cpu_cond that iterates only over the
> > CPUs in a cpumask, in order to avoid making callbacks for every
> > single
> > CPU in the system when we only need to test a subset.
>
> Nice.
>
> Although, if you want to be really fancy, you could optimize this (or
> add a variant) that does the callback on the local CPU in parallel
> with the remote ones. That would give a small boost to TLB flushes.
The test_func callbacks are not run remotely, but on
the local CPU, before deciding who to send callbacks
to.
The actual IPIs are sent in parallel, if the cpumask
allocation succeeds (it always should in many kernel
configurations, and almost always in the rest).
--
All Rights Reversed.
On Sat, 2018-07-28 at 19:58 -0700, Andy Lutomirski wrote:
> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> wrote:
> > Instead of open coding bitmap magic, use on_each_cpu_cond
> > to determine which CPUs to send TLB flush IPIs to.
> >
> > This might be a little bit slower than examining the bitmaps,
> > but it should be a lot easier to maintain in the long run.
>
> Looks good.
>
> i assume it's not easy to get the remove-tables case to do a single
> on_each_cpu_cond() instead of two? Currently it's doing the lazy
> ones and the non-lazy ones separately.
Indeed. The TLB gather batch size means we need to send
IPIs to the non-lazy CPUs whenever we have gathered so
many pages that our tlb_gather data structure is full.
This could result in many IPIs during a large munmap.
The lazy CPUs get one IPI before page table freeing.
--
All Rights Reversed.
On Sat, 2018-07-28 at 21:21 -0700, Andy Lutomirski wrote:
> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> wrote:
> > Conditionally skip lazy TLB mm refcounting. When an architecture
> > has
> > CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> > lazy TLB mode anywhere will get shot down from exit_mmap, and there
> > in no need to incur the cache line bouncing overhead of refcounting
> > a lazy TLB mm.
>
> Unless I've misunderstood something, this patch results in idle tasks
> whose active_mm has been freed still having active_mm pointing at
> freed memory.
Patch 9/10 is supposed to ensure that the lazy TLB CPUs get
switched to init_mm before an mm is freed. No CPU should ever
have its active_mm pointing at a freed mm.
Your message made me re-read the code, and now I realize that
leave_mm does not actually do that.
Looking at the other callers of leave_mm, I might not be the
only one surprised by that; xen_drop_mm_ref comes to mind.
I guess I should add some code to leave_mm to have it actually
clear active_mm and call the conditional refcount drop helper
function.
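Probably something along these lines at the end of leave_mm (untested
sketch, using the conditional refcount helper from the last patch):

        switch_mm(NULL, &init_mm, NULL);
        /* Really drop the reference to the lazy TLB mm. */
        current->active_mm = &init_mm;
        drop_lazy_mm(loaded_mm);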
Does that clear up the confusion?
--
All Rights Reversed.
> On Jul 29, 2018, at 5:11 AM, Rik van Riel <[email protected]> wrote:
>
>> On Sat, 2018-07-28 at 21:21 -0700, Andy Lutomirski wrote:
>> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
>> wrote:
>>> Conditionally skip lazy TLB mm refcounting. When an architecture
>>> has
>>> CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
>>> lazy TLB mode anywhere will get shot down from exit_mmap, and there
>>> in no need to incur the cache line bouncing overhead of refcounting
>>> a lazy TLB mm.
>>
>> Unless I've misunderstood something, this patch results in idle tasks
>> whose active_mm has been freed still having active_mm pointing at
>> freed memory.
>
> Patch 9/10 is supposed to ensure that the lazy TLB CPUs get
> switched to init_mm before an mm is freed. No CPU should ever
> have its active_mm pointing at a freed mm.
>
> Your message made me re-read the code, and now I realize that
> leave_mm does not actually do that.
>
> Looking at the other callers of leave_mm, I might not be the
> only one surprised by that; xen_drop_mm_ref comes to mind.
>
> I guess I should some code to leave_mm to have it actually
> clear active_mm and call the conditional refcount drop helper
> function.
>
> Does that clear up the confusion?
>
Kind of. But what’s the point of keeping active_mm? On architectures that opt in to the new mode, there won’t be any code that cares about its value. What’s the benefit of keeping it around? If you ifdef it out, then it can’t possibly point to freed memory, and there’s nothing to worry about.
>> On Jul 29, 2018, at 5:00 AM, Rik van Riel <[email protected]> wrote:
>>
>> On Sat, 2018-07-28 at 19:57 -0700, Andy Lutomirski wrote:
>> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
>> wrote:
>>> Introduce a variant of on_each_cpu_cond that iterates only over the
>>> CPUs in a cpumask, in order to avoid making callbacks for every
>>> single
>>> CPU in the system when we only need to test a subset.
>>
>> Nice.
>>
>> Although, if you want to be really fancy, you could optimize this (or
>> add a variant) that does the callback on the local CPU in parallel
>> with the remote ones. That would give a small boost to TLB flushes.
>
> The test_func callbacks are not run remotely, but on
> the local CPU, before deciding who to send callbacks
> to.
>
> The actual IPIs are sent in parallel, if the cpumask
> allocation succeeds (it always should in many kernel
> configurations, and almost always in the rest).
What I meant is that on_each_cpu_mask does:
        smp_call_function_many(mask, func, info, wait);
        if (cpumask_test_cpu(cpu, mask)) {
                unsigned long flags;

                local_irq_save(flags);
                func(info);
                local_irq_restore(flags);
        }
So it IPIs all the remote CPUs in parallel, then waits, then does the local work. In principle, the local flush could be done after triggering the IPIs but before they all finish.
> --
> All Rights Reversed.
On Sun, 2018-07-29 at 08:29 -0700, Andy Lutomirski wrote:
> > On Jul 29, 2018, at 5:11 AM, Rik van Riel <[email protected]> wrote:
> >
> > > On Sat, 2018-07-28 at 21:21 -0700, Andy Lutomirski wrote:
> > > On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> > > wrote:
> > > > Conditionally skip lazy TLB mm refcounting. When an
> > > > architecture
> > > > has
> > > > CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is
> > > > used in
> > > > lazy TLB mode anywhere will get shot down from exit_mmap, and
> > > > there
> > > > in no need to incur the cache line bouncing overhead of
> > > > refcounting
> > > > a lazy TLB mm.
> > >
> > > Unless I've misunderstood something, this patch results in idle
> > > tasks
> > > whose active_mm has been freed still having active_mm pointing at
> > > freed memory.
> >
> > Patch 9/10 is supposed to ensure that the lazy TLB CPUs get
> > switched to init_mm before an mm is freed. No CPU should ever
> > have its active_mm pointing at a freed mm.
> >
> > Your message made me re-read the code, and now I realize that
> > leave_mm does not actually do that.
> >
> > Looking at the other callers of leave_mm, I might not be the
> > only one surprised by that; xen_drop_mm_ref comes to mind.
> >
> > I guess I should some code to leave_mm to have it actually
> > clear active_mm and call the conditional refcount drop helper
> > function.
> >
> > Does that clear up the confusion?
>
> Kind of. But what’s the point of keeping active_mm? On architectures
> that opt in to the new mode, there won’t be any code that cares about
> it’s value. What’s the benefit of keeping it around? If you ifdef
> it out, then it can’t possibly point to freed memory, and there’s
> nothing to worry about.
I would like to get to that point, but in a way that does
not leave the code too difficult to follow.
Getting rid of ->active_mm in context_switch() is straightforward,
but I am not sure at all what to do about idle_task_exit() for
example.
All the subtleties I ran into just with this phase of the
code suggest (to me at least) that we should probably do
this one step at a time.
I agree on the same end goal, though :)
--
All Rights Reversed.
On Sun, 2018-07-29 at 08:36 -0700, Andy Lutomirski wrote:
> On Jul 29, 2018, at 5:00 AM, Rik van Riel <[email protected]> wrote:
>
> > On Sat, 2018-07-28 at 19:57 -0700, Andy Lutomirski wrote:
> > > On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> > > wrote:
> > > > Introduce a variant of on_each_cpu_cond that iterates only over
> > > > the
> > > > CPUs in a cpumask, in order to avoid making callbacks for every
> > > > single
> > > > CPU in the system when we only need to test a subset.
> > > Nice.
> > > Although, if you want to be really fancy, you could optimize this
> > > (or
> > > add a variant) that does the callback on the local CPU in
> > > parallel
> > > with the remote ones. That would give a small boost to TLB
> > > flushes.
> >
> > The test_func callbacks are not run remotely, but on
> > the local CPU, before deciding who to send callbacks
> > to.
> >
> > The actual IPIs are sent in parallel, if the cpumask
> > allocation succeeds (it always should in many kernel
> > configurations, and almost always in the rest).
> >
>
> What I meant is that on_each_cpu_mask does:
>
> smp_call_function_many(mask, func, info, wait);
> if (cpumask_test_cpu(cpu, mask)) {
> unsigned long flags;
> local_irq_save(flags); func(info);
> local_irq_restore(flags);
> }
>
> So it IPIs all the remote CPUs in parallel, then waits, then does the
> local work. In principle, the local flush could be done after
> triggering the IPIs but before they all finish.
Sure, moving the function call for the local CPU
into smp_call_function_many might be a nice optimization.
A quick grep suggests it would touch stuff all over the tree,
so it could be a nice Outreachy intern project :)
--
All Rights Reversed.
On Sun, 2018-07-29 at 08:36 -0700, Andy Lutomirski wrote:
> On Jul 29, 2018, at 5:00 AM, Rik van Riel <[email protected]> wrote:
>
> > On Sat, 2018-07-28 at 19:57 -0700, Andy Lutomirski wrote:
> > > On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
> > > wrote:
> > > > Introduce a variant of on_each_cpu_cond that iterates only over
> > > > the
> > > > CPUs in a cpumask, in order to avoid making callbacks for every
> > > > single
> > > > CPU in the system when we only need to test a subset.
> > > Nice.
> > > Although, if you want to be really fancy, you could optimize this
> > > (or
> > > add a variant) that does the callback on the local CPU in
> > > parallel
> > > with the remote ones. That would give a small boost to TLB
> > > flushes.
> >
> > The test_func callbacks are not run remotely, but on
> > the local CPU, before deciding who to send callbacks
> > to.
> >
> > The actual IPIs are sent in parallel, if the cpumask
> > allocation succeeds (it always should in many kernel
> > configurations, and almost always in the rest).
> >
>
> What I meant is that on_each_cpu_mask does:
>
> smp_call_function_many(mask, func, info, wait);
> if (cpumask_test_cpu(cpu, mask)) {
> unsigned long flags;
> local_irq_save(flags); func(info);
> local_irq_restore(flags);
> }
>
> So it IPIs all the remote CPUs in parallel, then waits, then does the
> local work. In principle, the local flush could be done after
> triggering the IPIs but before they all finish.
Grepping around the code, I found a few examples where the
calling code appears to expect that smp_call_function_many
also calls "func" on the local CPU.
For example, kvm_emulate_wbinvd_noskip has this:
        if (kvm_x86_ops->has_wbinvd_exit()) {
                int cpu = get_cpu();

                cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
                smp_call_function_many(vcpu->arch.wbinvd_dirty_mask,
                                wbinvd_ipi, NULL, 1);
                put_cpu();
                cpumask_clear(vcpu->arch.wbinvd_dirty_mask);
        } else
                wbinvd();
This seems to result in systems with ->has_wbinvd_exit
only calling wbinvd_ipi on OTHER CPUs, and not on the
CPU where the guest exited with wbinvd?
This seems unintended.
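One possible fix (untested sketch, not part of this series) would be to
use on_each_cpu_mask, which also runs the function on the local CPU when
that CPU is set in the mask:

        if (kvm_x86_ops->has_wbinvd_exit()) {
                int cpu = get_cpu();

                cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
                on_each_cpu_mask(vcpu->arch.wbinvd_dirty_mask,
                                wbinvd_ipi, NULL, 1);
                put_cpu();
                cpumask_clear(vcpu->arch.wbinvd_dirty_mask);
        } else
                wbinvd();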
I guess looking into on_each_cpu_mask might be a little
higher priority than waiting until the next Outreachy
season :)
--
All Rights Reversed.
> On Jul 29, 2018, at 10:51 AM, Rik van Riel <[email protected]> wrote:
>
>> On Sun, 2018-07-29 at 08:36 -0700, Andy Lutomirski wrote:
>>> On Jul 29, 2018, at 5:00 AM, Rik van Riel <[email protected]> wrote:
>>>
>>>> On Sat, 2018-07-28 at 19:57 -0700, Andy Lutomirski wrote:
>>>> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]>
>>>> wrote:
>>>>> Introduce a variant of on_each_cpu_cond that iterates only over
>>>>> the
>>>>> CPUs in a cpumask, in order to avoid making callbacks for every
>>>>> single
>>>>> CPU in the system when we only need to test a subset.
>>>> Nice.
>>>> Although, if you want to be really fancy, you could optimize this
>>>> (or
>>>> add a variant) that does the callback on the local CPU in
>>>> parallel
>>>> with the remote ones. That would give a small boost to TLB
>>>> flushes.
>>>
>>> The test_func callbacks are not run remotely, but on
>>> the local CPU, before deciding who to send callbacks
>>> to.
>>>
>>> The actual IPIs are sent in parallel, if the cpumask
>>> allocation succeeds (it always should in many kernel
>>> configurations, and almost always in the rest).
>>>
>>
>> What I meant is that on_each_cpu_mask does:
>>
>> smp_call_function_many(mask, func, info, wait);
>> if (cpumask_test_cpu(cpu, mask)) {
>> unsigned long flags;
>> local_irq_save(flags); func(info);
>> local_irq_restore(flags);
>> }
>>
>> So it IPIs all the remote CPUs in parallel, then waits, then does the
>> local work. In principle, the local flush could be done after
>> triggering the IPIs but before they all finish.
>
> Grepping around the code, I found a few examples where the
> calling code appears to expect that smp_call_function_many
> also calls "func" on the local CPU.
>
> For example, kvm_emulate_wbinvd_noskip has this:
>
> if (kvm_x86_ops->has_wbinvd_exit()) {
> int cpu = get_cpu();
>
> cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
> smp_call_function_many(vcpu->arch.wbinvd_dirty_mask,
> wbinvd_ipi, NULL, 1);
> put_cpu();
> cpumask_clear(vcpu->arch.wbinvd_dirty_mask);
> } else
> wbinvd();
>
> This seems to result in systems with ->has_wbinvd_exit
> only calling wbinvd_ipi on OTHER CPUs, and not on the
> CPU where the guest exited with wbinvd?
>
> This seems unintended.
>
> I guess looking into on_each_cpu_mask might be a little
> higher priority than waiting until the next Outreachy
> season :)
>
The right approach might be a tree-wide rename from smp_call_... to on_other_cpus_mask() or similar. The current naming and semantics are extremely confusing.
Linus, this is the kind of thing you seem to like taking outside the merge window. What do you think about a straight-up search-and-replace to rename the smp_call_... functions to exactly match the corresponding on_each_cpu functions except with “each” replaced with “other”?
On Sat, 28 Jul 2018 21:21:17 -0700
Andy Lutomirski <[email protected]> wrote:
> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> > Conditionally skip lazy TLB mm refcounting. When an architecture has
> > CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> > lazy TLB mode anywhere will get shot down from exit_mmap, and there
> > in no need to incur the cache line bouncing overhead of refcounting
> > a lazy TLB mm.
>
> Unless I've misunderstood something, this patch results in idle tasks
> whose active_mm has been freed still having active_mm pointing at
> freed memory.
Below (plus the next email) should fix the bug you pointed
out, in a somewhat non-invasive way. Patches have survived
a few simple tests on my test system; I have not thrown a
full load at them yet.
I would like to save the full rewrite to remove ->active_mm
for a later series, because this is already as much churn
as I am comfortable with for this code :)
---8<---
Author: Rik van Riel <[email protected]>
Subject: [PATCH 10/11] x86,tlb: really leave mm on shootdown
When getting an mm shot down from under us in lazy TLB mode, don't
just switch the TLB over to the init_mm page tables, but really drop
our references to the lazy TLB mm.
This allows for faster (instant) freeing of a lazy TLB mm, which is
a precondition to getting rid of the refcounting of mms in lazy TLB mode.
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7b1add904396..425cb9fa2640 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,8 @@ void leave_mm(void *dummy)
WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
switch_mm(NULL, &init_mm, NULL);
+ current->active_mm = &init_mm;
+ mmdrop(loaded_mm);
}
EXPORT_SYMBOL_GPL(leave_mm);
@@ -483,6 +485,8 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* IPIs to lazy TLB mode CPUs.
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
+ current->active_mm = &init_mm;
+ mmdrop(loaded_mm);
return;
}
--
2.14.4
On Sat, 28 Jul 2018 21:21:17 -0700
Andy Lutomirski <[email protected]> wrote:
> On Sat, Jul 28, 2018 at 2:53 PM, Rik van Riel <[email protected]> wrote:
> > Conditionally skip lazy TLB mm refcounting. When an architecture has
> > CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
> > lazy TLB mode anywhere will get shot down from exit_mmap, and there
> > in no need to incur the cache line bouncing overhead of refcounting
> > a lazy TLB mm.
>
> Unless I've misunderstood something, this patch results in idle tasks
> whose active_mm has been freed still having active_mm pointing at
> freed memory.
---8<---
Author: Rik van Riel <[email protected]>
Subject: [PATCH 11/11] mm,sched: conditionally skip lazy TLB mm refcounting
Conditionally skip lazy TLB mm refcounting. When an architecture has
CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING enabled, an mm that is used in
lazy TLB mode anywhere will get shot down from exit_mmap, and there
is no need to incur the cache line bouncing overhead of refcounting
a lazy TLB mm.
Implement this by moving the refcounting of a lazy TLB mm to helper
functions, which skip the refcounting when it is not necessary.
Deal with use_mm and unuse_mm by fully splitting out the refcounting
of the lazy TLB mm a kernel thread may have when entering use_mm from
the refcounting of the mm that use_mm is about to start using.
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/mm/tlb.c | 5 +++--
fs/exec.c | 2 +-
include/linux/sched/mm.h | 25 +++++++++++++++++++++++++
kernel/sched/core.c | 6 +++---
mm/mmu_context.c | 21 ++++++++++++++-------
5 files changed, 46 insertions(+), 13 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 425cb9fa2640..d53d9c19b97d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,6 +8,7 @@
#include <linux/cpu.h>
#include <linux/debugfs.h>
#include <linux/gfp.h>
+#include <linux/sched/mm.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -141,7 +142,7 @@ void leave_mm(void *dummy)
switch_mm(NULL, &init_mm, NULL);
current->active_mm = &init_mm;
- mmdrop(loaded_mm);
+ drop_lazy_mm(loaded_mm);
}
EXPORT_SYMBOL_GPL(leave_mm);
@@ -486,7 +487,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
current->active_mm = &init_mm;
- mmdrop(loaded_mm);
+ drop_lazy_mm(loaded_mm);
return;
}
diff --git a/fs/exec.c b/fs/exec.c
index bdd0eacefdf5..7a6d4811b02b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1043,7 +1043,7 @@ static int exec_mmap(struct mm_struct *mm)
mmput(old_mm);
return 0;
}
- mmdrop(active_mm);
+ drop_lazy_mm(active_mm);
return 0;
}
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 44d356f5e47c..7308bf38012f 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,31 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
}
+/*
+ * In lazy TLB mode, a CPU keeps the mm of the last process mapped while
+ * running a kernel thread or idle; we must make sure the lazy TLB mm and
+ * page tables do not disappear while a lazy TLB mode CPU uses them.
+ * There are two ways to handle the race between lazy TLB CPUs and exit_mmap:
+ * 1) Have a lazy TLB CPU hold a refcount on the lazy TLB mm.
+ * 2) Have the architecture code shoot down the lazy TLB mm from exit_mmap;
+ * in that case, refcounting can be skipped, reducing cache line bouncing.
+ */
+static inline void grab_lazy_mm(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
+ return;
+
+ mmgrab(mm);
+}
+
+static inline void drop_lazy_mm(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_ARCH_NO_ACTIVE_MM_REFCOUNTING))
+ return;
+
+ mmdrop(mm);
+}
+
/**
* mmget() - Pin the address space associated with a &struct mm_struct.
* @mm: The address space to pin.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c45de46fdf10..11724c9e88b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2691,7 +2691,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
*/
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
- mmdrop(mm);
+ drop_lazy_mm(mm);
}
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
@@ -2805,7 +2805,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
*/
if (!mm) {
next->active_mm = oldmm;
- mmgrab(oldmm);
+ grab_lazy_mm(oldmm);
enter_lazy_tlb(oldmm, next);
} else
switch_mm_irqs_off(oldmm, mm, next);
@@ -5532,7 +5532,7 @@ void idle_task_exit(void)
current->active_mm = &init_mm;
finish_arch_post_lock_switch();
}
- mmdrop(mm);
+ drop_lazy_mm(mm);
}
/*
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3e612ae748e9..d5c2524cdd9a 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -24,12 +24,15 @@ void use_mm(struct mm_struct *mm)
struct mm_struct *active_mm;
struct task_struct *tsk = current;
+ /* Kernel threads have a NULL tsk->mm when entering. */
+ WARN_ON(tsk->mm);
+
task_lock(tsk);
+ /* Previous ->active_mm was held in lazy TLB mode. */
active_mm = tsk->active_mm;
- if (active_mm != mm) {
- mmgrab(mm);
- tsk->active_mm = mm;
- }
+ /* Grab mm for reals; tsk->mm needs to stick around until unuse_mm. */
+ mmgrab(mm);
+ tsk->active_mm = mm;
tsk->mm = mm;
switch_mm(active_mm, mm, tsk);
task_unlock(tsk);
@@ -37,8 +40,9 @@ void use_mm(struct mm_struct *mm)
finish_arch_post_lock_switch();
#endif
- if (active_mm != mm)
- mmdrop(active_mm);
+ /* Drop the lazy TLB mode mm. */
+ if (active_mm)
+ drop_lazy_mm(active_mm);
}
EXPORT_SYMBOL_GPL(use_mm);
@@ -57,8 +61,11 @@ void unuse_mm(struct mm_struct *mm)
task_lock(tsk);
sync_mm_rss(mm);
tsk->mm = NULL;
- /* active_mm is still 'mm' */
+ /* active_mm is still 'mm'; grab it as a lazy TLB mm */
+ grab_lazy_mm(mm);
enter_lazy_tlb(mm, tsk);
+ /* drop the tsk->mm refcount */
+ mmdrop(mm);
task_unlock(tsk);
}
EXPORT_SYMBOL_GPL(unuse_mm);
--
2.14.4
On Sun, Jul 29, 2018 at 11:55 AM Andy Lutomirski <[email protected]> wrote:
> > On Jul 29, 2018, at 10:51 AM, Rik van Riel <[email protected]> wrote:
> >
> > This seems to result in systems with ->has_wbinvd_exit
> > only calling wbinvd_ipi on OTHER CPUs, and not on the
> > CPU where the guest exited with wbinvd?
> >
> > This seems unintended.
> >
> > I guess looking into on_each_cpu_mask might be a little
> > higher priority than waiting until the next Outreachy
> > season :)
>
> The right approach might be a tree wise rename from smp_call_... to on_other_cpus_mask() it similar. The current naming and semantics are extremely confusing.
Ugh.
Renaming might be worth it, but at least one issue is that we are
simply not very consistent.
For example. smp_call_function_many() does indeed explicitly ignore
the current CPU.
But smp_call_function_any() (one "m" less) does _not_ ignore the
current CPU, and in fact prefers it.
So it's not that smp_call_... should *generally* be renamed. Only some
of the cases might be worth renaming.
And just a "rename and forget" isn't really great. As Rik's example
shows, existing users should be checked too..
Linus
On Sun, Jul 29, 2018 at 03:54:52PM -0400, Rik van Riel wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c45de46fdf10..11724c9e88b0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2691,7 +2691,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> */
> if (mm) {
> membarrier_mm_sync_core_before_usermode(mm);
> - mmdrop(mm);
> + drop_lazy_mm(mm);
> }
> if (unlikely(prev_state == TASK_DEAD)) {
> if (prev->sched_class->task_dead)
> @@ -2805,7 +2805,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> */
> if (!mm) {
> next->active_mm = oldmm;
> - mmgrab(oldmm);
> + grab_lazy_mm(oldmm);
> enter_lazy_tlb(oldmm, next);
> } else
> switch_mm_irqs_off(oldmm, mm, next);
What happened to the rework I did there? That not only avoided fiddling
with active_mm, but also avoids grab/drop cycles for the other
architectures when doing task->kthread->kthread->task things.
I agree with Andy that if you avoid the refcount fiddling, then you
should also not muck with active_mm.
That is, if you keep active_mm for now (which seems a reasonable first
step) then at least ensure you keep ->mm == ->active_mm at all times.
* Rik van Riel <[email protected]> wrote:
> This patch series implements the cleanups suggested by Peter and Andy,
> removes lazy TLB mm refcounting on x86, and shows how other architectures
> could implement that same optimization.
>
> The previous patch series already seems to have removed most of the
> cache line contention I was seeing at context switch time, so CPU use
> of the memcache and memcache-like workloads has not changed measurably
> with this patch series.
>
> However, the memory bandwidth used by the memcache system has been
> reduced by about 1%, to serve the same number of queries per second.
>
> This happens on two socket Haswell and Broadwell systems. Maybe on
> larger systems (4 or 8 socket) one might also see a measurable drop
> in the amount of CPU time used, with workloads where the previous
> patch series does not remove all cache line contention on the mm.
>
> This is against the latest -tip tree, and seems to be stable (on top
> of another tree) with workloads that do over a million context switches
> a second.
Just a quick logistics request: once all the review feedback from Andy and PeterZ
is sorted out, could you please (re-)send this series with the Reviewed-by
and Acked-by tags added?
If any patch is still under discussion then please leave it out from the next
series temporarily, so that I can just apply them all immediately to tip:x86/mm
before the next merge window opens.
( If the series reaches this state later today then don't hesitate to do a resend
with the tags added - I don't want to delay these improvements. )
Thanks!
Ingo
On Mon, 2018-07-30 at 11:55 +0200, Peter Zijlstra wrote:
> On Sun, Jul 29, 2018 at 03:54:52PM -0400, Rik van Riel wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index c45de46fdf10..11724c9e88b0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2691,7 +2691,7 @@ static struct rq *finish_task_switch(struct
> > task_struct *prev)
> > */
> > if (mm) {
> > membarrier_mm_sync_core_before_usermode(mm);
> > - mmdrop(mm);
> > + drop_lazy_mm(mm);
> > }
> > if (unlikely(prev_state == TASK_DEAD)) {
> > if (prev->sched_class->task_dead)
> > @@ -2805,7 +2805,7 @@ context_switch(struct rq *rq, struct
> > task_struct *prev,
> > */
> > if (!mm) {
> > next->active_mm = oldmm;
> > - mmgrab(oldmm);
> > + grab_lazy_mm(oldmm);
> > enter_lazy_tlb(oldmm, next);
> > } else
> > switch_mm_irqs_off(oldmm, mm, next);
>
> What happened to the rework I did there? That not only avoided
> fiddling
> with active_mm, but also avoids grab/drop cycles for the other
> architectures when doing task->kthread->kthread->task things.
I don't think I saw that. I only saw your email from
July 20th with this fragment of code, which does not
appear to avoid the grab/drop cycles, and still fiddles
with active_mm:
Date: Fri, 20 Jul 2018 11:32:39 +0200
From: Peter Zijlstra <[email protected]>
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
Message-ID: <[email protected]>
+ /*
+ * kernel -> kernel lazy + transfer active
+ * user -> kernel lazy + mmgrab() active
+ *
+ * kernel -> user switch + mmdrop() active
+ * user -> user switch
+ */
+ if (!next->mm) { // to kernel
+ enter_lazy_tlb(prev->active_mm, next);
+
+#ifdef ARCH_NO_ACTIVE_MM
+ next->active_mm = prev->active_mm;
+ if (prev->mm) // from user
+ mmgrab(prev->active_mm);
+#endif
+ } else { // to user
+ switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
+#ifdef ARCH_NO_ACTIVE_MM
+ if (!prev->mm) { // from kernel
+ /* will mmdrop() in finish_task_switch(). */
+ rq->prev_mm = prev->active_mm;
+ prev->active_mm = NULL;
+ }
+#endif
What email should I look for to find the thing you
referenced above?
> I agree with Andy that if you avoid the refcount fiddling, then you
> should also not muck with active_mm.
>
> That is, if you keep active_mm for now (which seems a reasonable
> first
> step) then at least ensure you keep ->mm == ->active_mm at all times.
There do not seem to be a lot of places left in
arch/x86/ that reference active_mm. I guess the
next patch series should excise those? :)
--
All Rights Reversed.
On Mon, Jul 30, 2018 at 10:30:11AM -0400, Rik van Riel wrote:
> > What happened to the rework I did there? That not only avoided
> > fiddling
> > with active_mm, but also avoids grab/drop cycles for the other
> > architectures when doing task->kthread->kthread->task things.
>
> I don't think I saw that. I only saw your email from
> July 20th with this fragment of code, which does not
> appear to avoid the grab/drop cycles, and still fiddles
> with active_mm:
Yeah, that's it. Note how it doesn't do a grab+drop for kernel->kernel,
where the current code would have.
And also note that it only fiddles with active_mm if it does the
grab+drop thing (the below should have s/ifdef/ifndef/ to make more
sense maybe).
So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and therefore
->active_mm == ->mm.
> + /*
> + * kernel -> kernel lazy + transfer active
> + * user -> kernel lazy + mmgrab() active
> + *
> + * kernel -> user switch + mmdrop() active
> + * user -> user switch
> + */
> + if (!next->mm) { // to kernel
> + enter_lazy_tlb(prev->active_mm, next);
> +
#ifndef ARCH_NO_ACTIVE_MM
> + next->active_mm = prev->active_mm;
> + if (prev->mm) // from user
> + mmgrab(prev->active_mm);
else
prev->active_mm = NULL;
> +#endif
> + } else { // to user
> + switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
#ifndef ARCH_NO_ACTIVE_MM
> + if (!prev->mm) { // from kernel
> + /* will mmdrop() in finish_task_switch(). */
> + rq->prev_mm = prev->active_mm;
> + prev->active_mm = NULL;
> + }
> +#endif
On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
> On Mon, Jul 30, 2018 at 10:30:11AM -0400, Rik van Riel wrote:
>
> > > What happened to the rework I did there? That not only avoided
> > > fiddling
> > > with active_mm, but also avoids grab/drop cycles for the other
> > > architectures when doing task->kthread->kthread->task things.
> >
> > I don't think I saw that. I only saw your email from
> > July 20th with this fragment of code, which does not
> > appear to avoid the grab/drop cycles, and still fiddles
> > with active_mm:
>
> Yeah, that's it. Note how it doesn't do a grab+drop for kernel-
> >kernel,
> where the current could would have.
>
> And also note that it only fiddles with active_mm if it does the
> grab+drop thing (the below should have s/ifdef/ifndef/ to make more
> sense maybe).
I'll kick off a test with your variant. I don't think we
will see any performance difference on x86 (due to not
using a refcount at all any more), but unless Ingo is in
a hurry I guess there's no issue rewriting this part of
the patch series :)
Do the other patches look ok to you and Andy?
> So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and therefore
> ->active_mm == ->mm.
>
> > + /*
> > + * kernel -> kernel lazy + transfer active
> > + * user -> kernel lazy + mmgrab() active
> > + *
> > + * kernel -> user switch + mmdrop() active
> > + * user -> user switch
> > + */
> > + if (!next->mm) { // to kernel
> > + enter_lazy_tlb(prev->active_mm, next);
> > +
>
> #ifndef ARCH_NO_ACTIVE_MM
> > + next->active_mm = prev->active_mm;
> > + if (prev->mm) // from user
> > + mmgrab(prev->active_mm);
>
> else
> prev->active_mm = NULL;
> > +#endif
> > + } else { // to user
> > + switch_mm_irqs_off(prev->active_mm, next->mm, next);
> > +
>
> #ifndef ARCH_NO_ACTIVE_MM
> > + if (!prev->mm) { // from kernel
> > + /* will mmdrop() in finish_task_switch(). */
> > + rq->prev_mm = prev->active_mm;
> > + prev->active_mm = NULL;
> > + }
> > +#endif
>
>
>
--
All Rights Reversed.
On Mon, Jul 30, 2018 at 12:15 PM, Rik van Riel <[email protected]> wrote:
> On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
>> On Mon, Jul 30, 2018 at 10:30:11AM -0400, Rik van Riel wrote:
>>
>> > > What happened to the rework I did there? That not only avoided
>> > > fiddling
>> > > with active_mm, but also avoids grab/drop cycles for the other
>> > > architectures when doing task->kthread->kthread->task things.
>> >
>> > I don't think I saw that. I only saw your email from
>> > July 20th with this fragment of code, which does not
>> > appear to avoid the grab/drop cycles, and still fiddles
>> > with active_mm:
>>
>> Yeah, that's it. Note how it doesn't do a grab+drop for kernel->kernel,
>> where the current code would have.
>>
>> And also note that it only fiddles with active_mm if it does the
>> grab+drop thing (the below should have s/ifdef/ifndef/ to make more
>> sense maybe).
>
> I'll kick off a test with your variant. I don't think we
> will see any performance difference on x86 (due to not
> using a refcount at all any more), but unless Ingo is in
> a hurry I guess there's no issue rewriting this part of
> the patch series :)
>
> Do the other patches look ok to you and Andy?
>
The whole series other than the active_mm stuff looked okay to me.
On Mon, 2018-07-30 at 12:30 -0700, Andy Lutomirski wrote:
> On Mon, Jul 30, 2018 at 12:15 PM, Rik van Riel <[email protected]>
> wrote:
> > On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
> > > On Mon, Jul 30, 2018 at 10:30:11AM -0400, Rik van Riel wrote:
> > >
> > > > > What happened to the rework I did there? That not only
> > > > > avoided
> > > > > fiddling
> > > > > with active_mm, but also avoids grab/drop cycles for the
> > > > > other
> > > > > architectures when doing task->kthread->kthread->task things.
> > > >
> > > > I don't think I saw that. I only saw your email from
> > > > July 20th with this fragment of code, which does not
> > > > appear to avoid the grab/drop cycles, and still fiddles
> > > > with active_mm:
> > >
> > > Yeah, that's it. Note how it doesn't do a grab+drop for kernel->kernel,
> > > where the current code would have.
> > >
> > > And also note that it only fiddles with active_mm if it does the
> > > grab+drop thing (the below should have s/ifdef/ifndef/ to make
> > > more
> > > sense maybe).
> >
> > I'll kick off a test with your variant. I don't think we
> > will see any performance difference on x86 (due to not
> > using a refcount at all any more), but unless Ingo is in
> > a hurry I guess there's no issue rewriting this part of
> > the patch series :)
> >
> > Do the other patches look ok to you and Andy?
> >
>
> The whole series other than the active_mm stuff looked okay to me.
Does the active_mm stuff look like a step in the right
direction with the bugfix, or would you prefer the code
to go in an entirely different direction?
If this looks like a step in the right direction, it
may make sense to make this step before the merge window
opens, and continue with more patches in this direction
later.
--
All Rights Reversed.
On Mon, Jul 30, 2018 at 12:36 PM, Rik van Riel <[email protected]> wrote:
> On Mon, 2018-07-30 at 12:30 -0700, Andy Lutomirski wrote:
>> On Mon, Jul 30, 2018 at 12:15 PM, Rik van Riel <[email protected]>
>> wrote:
>> > On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
>> > > On Mon, Jul 30, 2018 at 10:30:11AM -0400, Rik van Riel wrote:
>> > >
>> > > > > What happened to the rework I did there? That not only
>> > > > > avoided
>> > > > > fiddling
>> > > > > with active_mm, but also avoids grab/drop cycles for the
>> > > > > other
>> > > > > architectures when doing task->kthread->kthread->task things.
>> > > >
>> > > > I don't think I saw that. I only saw your email from
>> > > > July 20th with this fragment of code, which does not
>> > > > appear to avoid the grab/drop cycles, and still fiddles
>> > > > with active_mm:
>> > >
>> > > Yeah, that's it. Note how it doesn't do a grab+drop for kernel->kernel,
>> > > where the current code would have.
>> > >
>> > > And also note that it only fiddles with active_mm if it does the
>> > > grab+drop thing (the below should have s/ifdef/ifndef/ to make
>> > > more
>> > > sense maybe).
>> >
>> > I'll kick off a test with your variant. I don't think we
>> > will see any performance difference on x86 (due to not
>> > using a refcount at all any more), but unless Ingo is in
>> > a hurry I guess there's no issue rewriting this part of
>> > the patch series :)
>> >
>> > Do the other patches look ok to you and Andy?
>> >
>>
>> The whole series other than the active_mm stuff looked okay to me.
>
> Does the active_mm stuff look like a step in the right
> direction with the bugfix, or would you prefer the code
> to go in an entirely different direction?
I think it's a big step in the right direction, but it still makes me
nervous. I'd be more comfortable with it if you at least had a
functional set of patches that result in active_mm being gone, because
that will mean that you actually audited the whole mess and fixed
anything that might rely on active_mm pointing somewhere or that might
be putting a value you didn't take into account into active_mm. IOW
I'm not totally thrilled by applying the patches as is if we're still
a bit unsure as to what might have gotten missed.
I don't think it's at all necessary to redo the patches.
Does that seem reasonable?
>
> If this looks like a step in the right direction, it
> may make sense to make this step before the merge window
> opens, and continue with more patches in this direction
> later.
>
> --
> All Rights Reversed.
On Mon, 2018-07-30 at 12:49 -0700, Andy Lutomirski wrote:
>
> I think it's a big step in the right direction, but it still makes me
> nervous. I'd be more comfortable with it if you at least had a
> functional set of patches that result in active_mm being gone,
> because
> that will mean that you actually audited the whole mess and fixed
> anything that might rely on active_mm pointing somewhere or that
> might
> be putting a value you didn't take into account into active_mm. IOW
> I'm not totally thrilled by applying the patches as is if we're still
> a bit unsure as to what might have gotten missed.
>
> I don't think it's at all necessary to redo the patches.
>
> Does that seem reasonable?
Absolutely. I tried to keep ->active_mm very similar
to before for exactly that reason.
Let's go through all the places where it is used, in
x86 and architecture independent code. I have not
checked other architectures.
It looks like we should be able to get rid of
->active_mm at some point, but a lot of it depends
on other architecture maintainers.
arch/x86/events/core.c:
- get_segment_base: get current->active_mm->context.ldt,
this appears to be for TIF_IA32 user programs only, so
we should be able to use current->mm here
arch/x86/kernel/cpu/common.c:
- current task's ->active_mm assigned in two places,
never read
arch/x86/lib/insn-eval.c:
- get_desc() gets current->active_mm->context.ldt, this
appears to be only for user space programs
arch/x86/mm/tlb.c:
- this series adds two places where current->active_mm is
written, it is never read
arch/x86/platform/efi/efi_64.c:
- current->active_mm is set to efi_mm for a little bit,
with irqs disabled, and then changed back, with irqs still
disabled; we should be able to get rid of ->active_mm here
- in the init code, ->active_mm is set to efi_mm as well,
presumably the kernel automatically switches that back on
the next context switch; this may be buggy, since preemption
is enabled and a GFP_KERNEL allocation is just a few lines
below
arch/x86/power/cpu.c:
- fix_processor_context() calls load_mm_ldt(current->active_mm);
we should be able to use cpu_tlbstate.loaded_mm instead
drivers/cpufreq/pmac32-cpufreq.c:
- pmu_set_cpu_speed() restores current->active_mm - don't know if
anyone still cares about 32 bit PPC :)
drivers/firmware/efi/arm-runtime.c:
- efi_virtmap_unload switches back the pgd to current->active_mm
from &efi_mm; that mm could be stored elsewhere if we excised
->active_mm everywhere
drivers/macintosh/via-pmu.c:
- same deal as pmac32-cpufreq.c above
mm/mmu_context.c:
- use_mm() tracks the ->active_mm a kernel thread is pointing to,
but the mm is also tracked in ->mm (simplified sketch after this list)
- unuse_mm() is the same deal as use_mm(); we should be able to
get rid of ->active_mm here once everybody stops using it and we
no longer refcount it anywhere
init/init_task.c:
- init_task.active_mm = &init_mm
fs/exec.c:
- exec_mmap() juggles both ->mm and ->active_mm, in order to
get refcounting right; without refcounting we can lose ->active_mm
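For reference, the use_mm() pattern mentioned above looks roughly like
this (simplified and from memory, so details may differ from the actual
mm/mmu_context.c):

void use_mm(struct mm_struct *mm)
{
	struct mm_struct *active_mm;
	struct task_struct *tsk = current;

	task_lock(tsk);
	active_mm = tsk->active_mm;
	if (active_mm != mm) {
		mmgrab(mm);			/* pin the new mm */
		tsk->active_mm = mm;
	}
	tsk->mm = mm;
	switch_mm(active_mm, mm, tsk);
	task_unlock(tsk);

	if (active_mm != mm)
		mmdrop(active_mm);		/* drop the old lazy reference */
}

Getting rid of ->active_mm here means remembering the old mm some other
way, or not needing to remember it at all once nothing refcounts lazy
references any more.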
--
All Rights Reversed.
> On Jul 30, 2018, at 2:46 PM, Rik van Riel <[email protected]> wrote:
>
>> On Mon, 2018-07-30 at 12:49 -0700, Andy Lutomirski wrote:
>>
>>
>> I think it's a big step in the right direction, but it still makes me
>> nervous. I'd be more comfortable with it if you at least had a
>> functional set of patches that result in active_mm being gone,
>> because
>> that will mean that you actually audited the whole mess and fixed
>> anything that might rely on active_mm pointing somewhere or that
>> might
>> be putting a value you didn't take into account into active_mm. IOW
>> I'm not totally thrilled by applying the patches as is if we're still
>> a bit unsure as to what might have gotten missed.
>>
>> I don't think it's at all necessary to redo the patches.
>>
>> Does that seem reasonable?
>
> Absolutely. I tried to keep ->active_mm very similar
> to before for exactly that reason.
>
> Let's go through all the places where it is used, in
> x86 and architecture independent code. I have not
> checked other architectures.
>
> It looks like we should be able to get rid of
> ->active_mm at some point, but a lot of it depends
> on other architecture maintainers.
>
>
> arch/x86/events/core.c:
> - get_segment_base: get current->active_mm->context.ldt,
> this appears to be for TIF_IA32 user programs only, so
> we should be able to use current->mm here
->mm sounds more correct anyway
>
> arch/x86/kernel/cpu/common.c:
> - current task's ->active_mm assigned in two places,
> never read
>
> arch/x86/lib/insn-eval.c:
> - get_desc() gets current->active_mm->context.ldt, this
> appears to be only for user space programs
Same as above
>
> arch/x86/mm/tlb.c:
> - this series adds two places where current->active_mm is
> written, it is never read
>
> arch/x86/platform/efi/efi_64.c:
> - current->active_mm is set to efi_mm for a little bit,
> with irqs disabled, and then changed back, with irqs still
> disabled; we should be able to get rid of ->active_mm here
> - in the init code, ->active_mm is set to efi_mm as well,
> presumably the kernel automatically switches that back on
> the next context switch; this may be buggy, since preemption
> is enabled and a GFP_KERNEL allocation is just a few lines
> below
Ick. This should mostly go away soon — most EFI code will move to a real thread.
>
> arch/x86/power/cpu.c:
> - fix_processor_context() calls load_mm_ldt(current->active_mm);
> we should be able to use cpu_tlbstate.loaded_mm instead
Agreed
>
> drivers/cpufreq/pmac32-cpufreq.c:
> - pmu_set_cpu_speed() restores current->active_mm - don't know if
> anyone still cares about 32 bit PPC :)
>
I think we should only remove active_mm when the new #define is set. So this doesn’t need to change.
> drivers/firmware/efi/arm-runtime.c:
> - efi_virtmap_unload switches back the pgd to current->active_mm
> from &efi_mm; that mm could be stored elsewhere if we excised
> ->active_mm everywhere
Ditto
IOW the active_mm refcounting may be genuinely useful for architectures that are not able to efficiently shoot down remote lazy mm references in exit_mmap(). I suspect that ARM64 may be in that category.
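That is, to drop the refcounting an architecture needs to be able to do
something like the following from exit_mmap() (completely hand-waved;
the per-cpu variable and the switch-away helper here are made up for
illustration):

/* hypothetical: which mm each CPU is lazily using */
static DEFINE_PER_CPU(struct mm_struct *, cpu_lazy_mm);

static void kick_lazy_mm(void *info)
{
	struct mm_struct *mm = info;

	if (this_cpu_read(cpu_lazy_mm) != mm)
		return;

	/* hypothetical helper: switch this CPU to init_mm so it stops
	 * referencing mm's page tables */
	unuse_lazy_mm();
	this_cpu_write(cpu_lazy_mm, &init_mm);
}

static void shoot_down_lazy_mm(struct mm_struct *mm)
{
	/* IPI every CPU that might still hold a lazy reference to mm */
	on_each_cpu_mask(mm_cpumask(mm), kick_lazy_mm, mm, 1);
}

An architecture that cannot do that cheaply is probably better off
keeping the refcount.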
On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
>
> So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and therefore
> ->active_mm == ->mm.
Close, but not true for kernel threads, which have a
NULL ->mm, but a non-null ->active_mm that gets passed
to enter_lazy_tlb().
I stuck to the structure of your code, but ended up
removing all the ifdefs because the final mmdrop requires
that we actually keep track of the ->active_mm across
potentially several kernel->kernel context switches.
Ifdefs around the reference counting code are also not
needed, because grab_lazy_mm and drop_lazy_mm already
contain the equivalent of an ifdef themselves.
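To illustrate, the helpers look roughly like this (a sketch of the
shape, not the exact code from the series):

static inline void grab_lazy_mm(struct mm_struct *mm)
{
#ifndef ARCH_NO_ACTIVE_MM
	mmgrab(mm);
#endif
}

static inline void drop_lazy_mm(struct mm_struct *mm)
{
#ifndef ARCH_NO_ACTIVE_MM
	mmdrop(mm);
#endif
}

so the call sites in the context switch code can stay unconditional.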
By morning I should know what the test results look
like. I expect they will be identical to my patches,
since the refcounting is disabled completely anyway :)
> > + /*
> > + * kernel -> kernel lazy + transfer active
> > + * user -> kernel lazy + mmgrab() active
> > + *
> > + * kernel -> user switch + mmdrop() active
> > + * user -> user switch
> > + */
> > + if (!next->mm) { // to kernel
> > + enter_lazy_tlb(prev->active_mm, next);
> > +
>
> #ifndef ARCH_NO_ACTIVE_MM
> > + next->active_mm = prev->active_mm;
> > + if (prev->mm) // from user
> > + mmgrab(prev->active_mm);
>
> else
> prev->active_mm = NULL;
> > +#endif
> > + } else { // to user
> > + switch_mm_irqs_off(prev->active_mm, next->mm, next);
> > +
>
> #ifndef ARCH_NO_ACTIVE_MM
> > + if (!prev->mm) { // from kernel
> > + /* will mmdrop() in finish_task_switch(). */
> > + rq->prev_mm = prev->active_mm;
> > + prev->active_mm = NULL;
> > + }
> > +#endif
>
>
--
All Rights Reversed.
On Mon, Jul 30, 2018 at 09:05:55PM -0400, Rik van Riel wrote:
> On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
> >
> > So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and therefore
> > ->active_mm == ->mm.
>
> Close, but not true for kernel threads, which have a
> NULL ->mm, but a non-null ->active_mm that gets passed
> to enter_lazy_tlb().
I'm confused on the need for this. We mark the CPU lazy, why do we still
care about this?
> On Jul 31, 2018, at 2:12 AM, Peter Zijlstra <[email protected]> wrote:
>
>> On Mon, Jul 30, 2018 at 09:05:55PM -0400, Rik van Riel wrote:
>>> On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
>>>
>>> So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and therefore
>>> ->active_mm == ->mm.
>>
>> Close, but not true for kernel threads, which have a
>> NULL ->mm, but a non-null ->active_mm that gets passed
>> to enter_lazy_tlb().
>
> I'm confused on the need for this. We mark the CPU lazy, why do we still
> care about this?
I have considered renaming enter_lazy_tlb() to something like lazy_switch_to_kernel_mm() (or an irqs_off variant) and making it take no parameters or maybe just task pointer parameters.
On Tue, 2018-07-31 at 07:29 -0700, Andy Lutomirski wrote:
> > On Jul 31, 2018, at 2:12 AM, Peter Zijlstra <[email protected]>
> > wrote:
> >
> > > On Mon, Jul 30, 2018 at 09:05:55PM -0400, Rik van Riel wrote:
> > > > On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
> > > >
> > > > So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and
> > > > therefore
> > > > ->active_mm == ->mm.
> > >
> > > Close, but not true for kernel threads, which have a
> > > NULL ->mm, but a non-null ->active_mm that gets passed
> > > to enter_lazy_tlb().
> >
> > I'm confused on the need for this. We mark the CPU lazy, why do we
> > still
> > care about this?
>
> I have considered renaming enter_lazy_tlb() to something like
> lazy_switch_to_kernel_mm() (or an irqs_off variant) and making it
> take no parameters or maybe just task pointer parameters.
Of all the architectures, only Alpha uses the "mm"
parameter to enter_lazy_tlb.
It uses it to store the pgd address of the current
active_mm context into the thread info of the next
task.
If it can get that from a per CPU variable, we can
get rid of the mm parameter to enter_lazy_tlb.
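For reference, the Alpha version is (arch/alpha/include/asm/mmu_context.h,
quoted from memory, so details may be slightly off):

static inline void
enter_lazy_tlb(struct mm_struct *mm, struct task_struct *next)
{
	task_thread_info(next)->pcb.ptbr
	  = ((unsigned long) mm->pgd - IDENT_ADDR) >> PAGE_SHIFT;
}

If that pgd address came from a per CPU variable instead, the mm
argument would have no users left.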
--
All Rights Reversed.
On Tue, Jul 31, 2018 at 11:03:03AM -0400, Rik van Riel wrote:
> On Tue, 2018-07-31 at 07:29 -0700, Andy Lutomirski wrote:
> > > On Jul 31, 2018, at 2:12 AM, Peter Zijlstra <[email protected]>
> > > wrote:
> > >
> > > > On Mon, Jul 30, 2018 at 09:05:55PM -0400, Rik van Riel wrote:
> > > > > On Mon, 2018-07-30 at 18:26 +0200, Peter Zijlstra wrote:
> > > > >
> > > > > So for ARCH_NO_ACTIVE_MM we never touch ->active_mm and
> > > > > therefore
> > > > > ->active_mm == ->mm.
> > > >
> > > > Close, but not true for kernel threads, which have a
> > > > NULL ->mm, but a non-null ->active_mm that gets passed
> > > > to enter_lazy_tlb().
> > >
> > > I'm confused on the need for this. We mark the CPU lazy, why do we
> > > still
> > > care about this?
> >
> > I have considered renaming enter_lazy_tlb() to something like
> > lazy_switch_to_kernel_mm() (or an irqs_off variant) and making it
> > take no parameters or maybe just task pointer parameters.
>
> Of all the architectures, only Alpha uses the "mm"
> parameter to enter_lazy_tlb.
>
> It uses it to store the pgd address of the current
> active_mm context into the thread info of the next
> task.
>
> If it can get that from a per CPU variable, we can
> get rid of the mm parameter to enter_lazy_tlb.
We only really care for arches that select this new ARCH_NO_ACTIVE_MM
thingy. And x86 (the only one so far) really doesn't give a crap about
the argument.
We should just list/document that being one of the requirements for
selecting that symbol. All this will need semi-decent documentation
anyway, otherwise the poor sod trying to understand this all in about 6
months' time will go crazy :-)
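Something along these lines as a starting point, maybe (wording very
much TBD):

/*
 * An architecture may define ARCH_NO_ACTIVE_MM when:
 *
 *  - it keeps ->active_mm == ->mm for user tasks and never relies on
 *    ->active_mm pointing anywhere useful for kernel threads,
 *  - its enter_lazy_tlb() ignores the mm argument, and
 *  - it can shoot down remote lazy references to an mm (e.g. from
 *    exit_mmap()) instead of relying on mm_count refcounting to keep
 *    the lazy mm alive.
 */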