2016-04-26 16:39:14

by Andy Lutomirski

Subject: [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements

Hi all-

I've been playing with context switching lately, and I'm going to start
sending out some of the patches that should be mostly self-contained and
ready for -tip.

Here's a little batch to start improving switch_mm. It uninlines it
and makes it run with IRQs off. (AFAICT everyone who's modified it
thought it ran with IRQs off, but that's not always the case. I
don't know of any bugs that this fixes, but it'll be needed for PCID
to avoid introducing really nasty races.)
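
To make the hazard concrete, here's a rough sketch of the window that
opens when switch_mm runs with IRQs on. (Illustrative only; the names
match the existing code, but this fragment is not part of the series.)

	/* inside switch_mm, schematically: */
	cpumask_set_cpu(cpu, mm_cpumask(next));
	/*
	 * If an IRQ fires here, a TLB-flush or LDT-flush IPI can see
	 * this CPU in next's mm_cpumask while CR3 still points at
	 * prev->pgd.  Today that appears to be harmless; with PCID's
	 * extra per-mm TLB bookkeeping it would be a real race.
	 */
	load_cr3(next->pgd);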

This may also help a bit with FSGSBASE -- not sure yet. It certainly
won't hurt.

It contains a trivial off-topic ARM patch to avoid breaking the build.

Andy Lutomirski (5):
arm: Include linux/preempt.h from asm/mmu_context.h
sched: Add switch_mm_irqs_off and use it in the scheduler
x86/mm: Build arch/x86/mm/tlb.c even on !SMP
x86/mm: Uninline switch_mm
x86/mm: Turn off IRQs in switch_mm

arch/arm/include/asm/mmu_context.h | 1 +
arch/x86/include/asm/mmu_context.h | 101 ++------------------------------
arch/x86/mm/Makefile | 3 +-
arch/x86/mm/tlb.c | 116 +++++++++++++++++++++++++++++++++++++
include/linux/mmu_context.h | 7 +++
kernel/sched/core.c | 6 +-
6 files changed, 133 insertions(+), 101 deletions(-)

--
2.5.5


2016-04-26 16:39:20

by Andy Lutomirski

Subject: [PATCH v2 3/5] x86/mm: Build arch/x86/mm/tlb.c even on !SMP

Currently all of the functions that live in tlb.c are inlined on
!SMP builds. One can debate whether this is a good idea (in many
respects the code in tlb.c is better than the inlined UP code).

Regardless, I want to add code that needs to be built on UP and SMP
kernels and relates to tlb flushing, so arrange for tlb.c to be
compiled unconditionally.
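
Schematically, after this patch tlb.c is always compiled but its whole
body stays SMP-only (sketch, not part of the diff):

	/* arch/x86/mm/tlb.c, after this patch: */
	#ifdef CONFIG_SMP
	/* ... everything that was here before: flush_tlb_info,
	 *     leave_mm(), the flush IPI machinery ... */
	#endif /* CONFIG_SMP */

The uninlining patch later in the series then moves switch_mm into
this file outside that guard, so UP kernels pick it up too.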

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/mm/Makefile | 3 +--
arch/x86/mm/tlb.c | 4 ++++
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index f98913258c63..62c0043a5fd5 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -2,7 +2,7 @@
KCOV_INSTRUMENT_tlb.o := n

obj-y := init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
- pat.o pgtable.o physaddr.o gup.o setup_nx.o
+ pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o

# Make sure __phys_addr has no stackprotector
nostackp := $(call cc-option, -fno-stack-protector)
@@ -12,7 +12,6 @@ CFLAGS_setup_nx.o := $(nostackp)
CFLAGS_fault.o := -I$(src)/../include/asm/trace

obj-$(CONFIG_X86_PAT) += pat_rbtree.o
-obj-$(CONFIG_SMP) += tlb.o

obj-$(CONFIG_X86_32) += pgtable_32.o iomap_32.o

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8f4cc3dfac32..3f7a7939e65e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
* Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
*/

+#ifdef CONFIG_SMP
+
struct flush_tlb_info {
struct mm_struct *flush_mm;
unsigned long flush_start;
@@ -347,3 +349,5 @@ static int __init create_tlb_single_page_flush_ceiling(void)
return 0;
}
late_initcall(create_tlb_single_page_flush_ceiling);
+
+#endif /* CONFIG_SMP */
--
2.5.5

2016-04-26 16:39:16

by Andy Lutomirski

Subject: [PATCH v2 1/5] arm: Include linux/preempt.h from asm/mmu_context.h

arm's mmu_context.h uses preempt_enable_no_resched but doesn't
include anything that would pull in the declaration.

If I start including <asm/mmu_context.h> from <linux/mmu_context.h>
without this, the build breaks.
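
Sketch of the breakage (the helper name here is made up; the real
user is somewhere in arm's mmu_context.h):

	/* <asm/mmu_context.h> has something along these lines: */
	static inline void example_helper(void)	/* hypothetical name */
	{
		preempt_enable_no_resched();	/* declared in <linux/preempt.h> */
	}

It only built before because every existing includer happened to pull
in <linux/preempt.h> via some other path.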

Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/arm/include/asm/mmu_context.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h
index fa5b42d44985..ed73babc0dc9 100644
--- a/arch/arm/include/asm/mmu_context.h
+++ b/arch/arm/include/asm/mmu_context.h
@@ -15,6 +15,7 @@

#include <linux/compiler.h>
#include <linux/sched.h>
+#include <linux/preempt.h>
#include <asm/cacheflush.h>
#include <asm/cachetype.h>
#include <asm/proc-fns.h>
--
2.5.5

2016-04-26 16:39:26

by Andy Lutomirski

Subject: [PATCH v2 5/5] x86/mm: Turn off IRQs in switch_mm

Potential races between switch_mm and TLB-flush or LDT-flush IPIs
could be very messy. AFAICT the code is currently okay, whether by
accident or by careful design, but enabling PCID will make it
considerably more complicated and will no longer be obviously safe.

Fix it with a big hammer: run switch_mm with IRQs off.

To avoid a performance hit in the scheduler, we take advantage of
our knowledge that the scheduler already has IRQs disabled when it
calls switch_mm.
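
After this patch the two entry points look like this (sketch matching
the diff below):

	/* safe from any context: saves and restores the IRQ flag */
	switch_mm(prev, next, tsk);

	/* fast path: the caller guarantees IRQs are already off, as
	 * the scheduler does (patch 2 switched it over) */
	switch_mm_irqs_off(prev, next, tsk);

Using local_irq_save/restore in the wrapper, rather than a plain
disable/enable pair, keeps switch_mm itself callable whether or not
the caller already has IRQs off.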

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 3 +++
arch/x86/mm/tlb.c | 10 ++++++++++
2 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index bb911dd7cd01..396348196aa7 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -118,6 +118,9 @@ static inline void destroy_context(struct mm_struct *mm)
extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk);

+extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+ struct task_struct *tsk);
+#define switch_mm_irqs_off switch_mm_irqs_off

#define activate_mm(prev, next) \
do { \
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 07558fabc222..c93481569e3c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -64,6 +64,16 @@ EXPORT_SYMBOL_GPL(leave_mm);
void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ switch_mm_irqs_off(prev, next, tsk);
+ local_irq_restore(flags);
+}
+
+void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+ struct task_struct *tsk)
+{
unsigned cpu = smp_processor_id();

if (likely(prev != next)) {
--
2.5.5

2016-04-26 16:39:42

by Andy Lutomirski

Subject: [PATCH v2 4/5] x86/mm: Uninline switch_mm

It's fairly large and it has quite a few callers. This may also
help untangle some headers down the road.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 98 +----------------------------------
arch/x86/mm/tlb.c | 102 +++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+), 96 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 84280029cafd..bb911dd7cd01 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -115,103 +115,9 @@ static inline void destroy_context(struct mm_struct *mm)
destroy_context_ldt(mm);
}

-static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk)
-{
- unsigned cpu = smp_processor_id();
+extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+ struct task_struct *tsk);

- if (likely(prev != next)) {
-#ifdef CONFIG_SMP
- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
- this_cpu_write(cpu_tlbstate.active_mm, next);
-#endif
- cpumask_set_cpu(cpu, mm_cpumask(next));
-
- /*
- * Re-load page tables.
- *
- * This logic has an ordering constraint:
- *
- * CPU 0: Write to a PTE for 'next'
- * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
- * CPU 1: set bit 1 in next's mm_cpumask
- * CPU 1: load from the PTE that CPU 0 writes (implicit)
- *
- * We need to prevent an outcome in which CPU 1 observes
- * the new PTE value and CPU 0 observes bit 1 clear in
- * mm_cpumask. (If that occurs, then the IPI will never
- * be sent, and CPU 0's TLB will contain a stale entry.)
- *
- * The bad outcome can occur if either CPU's load is
- * reordered before that CPU's store, so both CPUs must
- * execute full barriers to prevent this from happening.
- *
- * Thus, switch_mm needs a full barrier between the
- * store to mm_cpumask and any operation that could load
- * from next->pgd. TLB fills are special and can happen
- * due to instruction fetches or for no reason at all,
- * and neither LOCK nor MFENCE orders them.
- * Fortunately, load_cr3() is serializing and gives the
- * ordering guarantee we need.
- *
- */
- load_cr3(next->pgd);
-
- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-
- /* Stop flush ipis for the previous mm */
- cpumask_clear_cpu(cpu, mm_cpumask(prev));
-
- /* Load per-mm CR4 state */
- load_mm_cr4(next);
-
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
- /*
- * Load the LDT, if the LDT is different.
- *
- * It's possible that prev->context.ldt doesn't match
- * the LDT register. This can happen if leave_mm(prev)
- * was called and then modify_ldt changed
- * prev->context.ldt but suppressed an IPI to this CPU.
- * In this case, prev->context.ldt != NULL, because we
- * never set context.ldt to NULL while the mm still
- * exists. That means that next->context.ldt !=
- * prev->context.ldt, because mms never share an LDT.
- */
- if (unlikely(prev->context.ldt != next->context.ldt))
- load_mm_ldt(next);
-#endif
- }
-#ifdef CONFIG_SMP
- else {
- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
- BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
-
- if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
- /*
- * On established mms, the mm_cpumask is only changed
- * from irq context, from ptep_clear_flush() while in
- * lazy tlb mode, and here. Irqs are blocked during
- * schedule, protecting us from simultaneous changes.
- */
- cpumask_set_cpu(cpu, mm_cpumask(next));
-
- /*
- * We were in lazy tlb mode and leave_mm disabled
- * tlb flush IPI delivery. We must reload CR3
- * to make sure to use no freed page tables.
- *
- * As above, load_cr3() is serializing and orders TLB
- * fills with respect to the mm_cpumask write.
- */
- load_cr3(next->pgd);
- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
- load_mm_cr4(next);
- load_mm_ldt(next);
- }
- }
-#endif
-}

#define activate_mm(prev, next) \
do { \
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3f7a7939e65e..07558fabc222 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -59,6 +59,108 @@ void leave_mm(int cpu)
}
EXPORT_SYMBOL_GPL(leave_mm);

+#endif /* CONFIG_SMP */
+
+void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+ struct task_struct *tsk)
+{
+ unsigned cpu = smp_processor_id();
+
+ if (likely(prev != next)) {
+#ifdef CONFIG_SMP
+ this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ this_cpu_write(cpu_tlbstate.active_mm, next);
+#endif
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+
+ /*
+ * Re-load page tables.
+ *
+ * This logic has an ordering constraint:
+ *
+ * CPU 0: Write to a PTE for 'next'
+ * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
+ * CPU 1: set bit 1 in next's mm_cpumask
+ * CPU 1: load from the PTE that CPU 0 writes (implicit)
+ *
+ * We need to prevent an outcome in which CPU 1 observes
+ * the new PTE value and CPU 0 observes bit 1 clear in
+ * mm_cpumask. (If that occurs, then the IPI will never
+ * be sent, and CPU 0's TLB will contain a stale entry.)
+ *
+ * The bad outcome can occur if either CPU's load is
+ * reordered before that CPU's store, so both CPUs must
+ * execute full barriers to prevent this from happening.
+ *
+ * Thus, switch_mm needs a full barrier between the
+ * store to mm_cpumask and any operation that could load
+ * from next->pgd. TLB fills are special and can happen
+ * due to instruction fetches or for no reason at all,
+ * and neither LOCK nor MFENCE orders them.
+ * Fortunately, load_cr3() is serializing and gives the
+ * ordering guarantee we need.
+ *
+ */
+ load_cr3(next->pgd);
+
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+
+ /* Stop flush ipis for the previous mm */
+ cpumask_clear_cpu(cpu, mm_cpumask(prev));
+
+ /* Load per-mm CR4 state */
+ load_mm_cr4(next);
+
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ /*
+ * Load the LDT, if the LDT is different.
+ *
+ * It's possible that prev->context.ldt doesn't match
+ * the LDT register. This can happen if leave_mm(prev)
+ * was called and then modify_ldt changed
+ * prev->context.ldt but suppressed an IPI to this CPU.
+ * In this case, prev->context.ldt != NULL, because we
+ * never set context.ldt to NULL while the mm still
+ * exists. That means that next->context.ldt !=
+ * prev->context.ldt, because mms never share an LDT.
+ */
+ if (unlikely(prev->context.ldt != next->context.ldt))
+ load_mm_ldt(next);
+#endif
+ }
+#ifdef CONFIG_SMP
+ else {
+ this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
+
+ if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+ /*
+ * On established mms, the mm_cpumask is only changed
+ * from irq context, from ptep_clear_flush() while in
+ * lazy tlb mode, and here. Irqs are blocked during
+ * schedule, protecting us from simultaneous changes.
+ */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+
+ /*
+ * We were in lazy tlb mode and leave_mm disabled
+ * tlb flush IPI delivery. We must reload CR3
+ * to make sure to use no freed page tables.
+ *
+ * As above, load_cr3() is serializing and orders TLB
+ * fills with respect to the mm_cpumask write.
+ */
+ load_cr3(next->pgd);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ load_mm_cr4(next);
+ load_mm_ldt(next);
+ }
+ }
+#endif
+}
+
+#ifdef CONFIG_SMP
+
/*
* The flush IPI assumes that a thread switch happens in this order:
* [cpu0: the cpu that switches]
--
2.5.5

2016-04-26 16:39:56

by Andy Lutomirski

Subject: [PATCH v2 2/5] sched: Add switch_mm_irqs_off and use it in the scheduler

By default, this is the same thing as switch_mm. x86 will override it
as an optimization.
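
The override mechanism is the self-referential #define idiom (sketch;
the generic half is in the hunks below, the x86 half lands in patch 5):

	/* include/linux/mmu_context.h -- generic fallback */
	#ifndef switch_mm_irqs_off
	#define switch_mm_irqs_off switch_mm
	#endif

	/* arch header that wants to override -- declare the real
	 * function, then define the macro to itself so the #ifndef
	 * above doesn't fire */
	extern void switch_mm_irqs_off(struct mm_struct *prev,
				       struct mm_struct *next,
				       struct task_struct *tsk);
	#define switch_mm_irqs_off switch_mm_irqs_off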

Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
include/linux/mmu_context.h | 7 +++++++
kernel/sched/core.c | 6 +++---
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index 70fffeba7495..f9e09613c113 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -1,9 +1,16 @@
#ifndef _LINUX_MMU_CONTEXT_H
#define _LINUX_MMU_CONTEXT_H

+#include <asm/mmu_context.h>
+
struct mm_struct;

void use_mm(struct mm_struct *mm);
void unuse_mm(struct mm_struct *mm);

+/* Architectures that care about IRQ state in switch_mm can override this. */
+#ifndef switch_mm_irqs_off
+#define switch_mm_irqs_off switch_mm
+#endif
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8465eeab8b3..3636abcafac0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -33,7 +33,7 @@
#include <linux/init.h>
#include <linux/uaccess.h>
#include <linux/highmem.h>
-#include <asm/mmu_context.h>
+#include <linux/mmu_context.h>
#include <linux/interrupt.h>
#include <linux/capability.h>
#include <linux/completion.h>
@@ -2712,7 +2712,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
atomic_inc(&oldmm->mm_count);
enter_lazy_tlb(oldmm, next);
} else
- switch_mm(oldmm, mm, next);
+ switch_mm_irqs_off(oldmm, mm, next);

if (!prev->mm) {
prev->active_mm = NULL;
@@ -5202,7 +5202,7 @@ void idle_task_exit(void)
BUG_ON(cpu_online(smp_processor_id()));

if (mm != &init_mm) {
- switch_mm(mm, &init_mm, current);
+ switch_mm_irqs_off(mm, &init_mm, current);
finish_arch_post_lock_switch();
}
mmdrop(mm);
--
2.5.5

2016-04-27 18:09:33

by Borislav Petkov

Subject: Re: [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements

On Tue, Apr 26, 2016 at 09:39:04AM -0700, Andy Lutomirski wrote:
> Hi all-
>
> I've been playing with context switching lately, and I'm going to start
> sending out some of the patches that should be mostly self-contained and
> ready for -tip.
>
> Here's a little batch to start improving switch_mm. It uninlines it
> and makes it run with IRQs off. (AFAICT everyone who's modified it
> thought it ran with IRQs off, but that's not always the case. I
> don't know of any bugs that this fixes, but it'll be needed for PCID
> to avoid introducing really nasty races.)
>
> This may also help a bit with FSGSBASE -- not sure yet. It certainly
> won't hurt.
>
> It contains a trivial off-topic ARM patch to avoid breaking the build.
>
> Andy Lutomirski (5):
> arm: Include linux/preempt.h from asm/mmu_context.h
> sched: Add switch_mm_irqs_off and use it in the scheduler
> x86/mm: Build arch/x86/mm/tlb.c even on !SMP
> x86/mm: Uninline switch_mm
> x86/mm: Turn off IRQs in switch_mm
>
> arch/arm/include/asm/mmu_context.h | 1 +
> arch/x86/include/asm/mmu_context.h | 101 ++------------------------------
> arch/x86/mm/Makefile | 3 +-
> arch/x86/mm/tlb.c | 116 +++++++++++++++++++++++++++++++++++++
> include/linux/mmu_context.h | 7 +++
> kernel/sched/core.c | 6 +-
> 6 files changed, 133 insertions(+), 101 deletions(-)

Reviewed-by: Borislav Petkov <[email protected]>

There was a concern that maybe disabling IRQs in
exec_mmap->activate_mm()->switch_mm() would be a little bit of a
slowdown, but that's not a hot path anyway.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Subject: [tip:sched/core] sched/core: Add switch_mm_irqs_off() and use it in the scheduler

Commit-ID: f98db6013c557c216da5038d9c52045be55cd039
Gitweb: http://git.kernel.org/tip/f98db6013c557c216da5038d9c52045be55cd039
Author: Andy Lutomirski <[email protected]>
AuthorDate: Tue, 26 Apr 2016 09:39:06 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 28 Apr 2016 11:44:19 +0200

sched/core: Add switch_mm_irqs_off() and use it in the scheduler

By default, this is the same thing as switch_mm().

x86 will override it as an optimization.

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/df401df47bdd6be3e389c6f1e3f5310d70e81b2c.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mmu_context.h | 7 +++++++
kernel/sched/core.c | 6 +++---
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index 70fffeb..a444178 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -1,9 +1,16 @@
#ifndef _LINUX_MMU_CONTEXT_H
#define _LINUX_MMU_CONTEXT_H

+#include <asm/mmu_context.h>
+
struct mm_struct;

void use_mm(struct mm_struct *mm);
void unuse_mm(struct mm_struct *mm);

+/* Architectures that care about IRQ state in switch_mm can override this. */
+#ifndef switch_mm_irqs_off
+# define switch_mm_irqs_off switch_mm
+#endif
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d84d60..adcafda 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -33,7 +33,7 @@
#include <linux/init.h>
#include <linux/uaccess.h>
#include <linux/highmem.h>
-#include <asm/mmu_context.h>
+#include <linux/mmu_context.h>
#include <linux/interrupt.h>
#include <linux/capability.h>
#include <linux/completion.h>
@@ -2733,7 +2733,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
atomic_inc(&oldmm->mm_count);
enter_lazy_tlb(oldmm, next);
} else
- switch_mm(oldmm, mm, next);
+ switch_mm_irqs_off(oldmm, mm, next);

if (!prev->mm) {
prev->active_mm = NULL;
@@ -5274,7 +5274,7 @@ void idle_task_exit(void)
BUG_ON(cpu_online(smp_processor_id()));

if (mm != &init_mm) {
- switch_mm(mm, &init_mm, current);
+ switch_mm_irqs_off(mm, &init_mm, current);
finish_arch_post_lock_switch();
}
mmdrop(mm);

Subject: [tip:sched/core] sched/core, ARM: Include linux/preempt.h from asm/mmu_context.h

Commit-ID: 88f10e37e150569a390be7a6161fa0f26b7372e9
Gitweb: http://git.kernel.org/tip/88f10e37e150569a390be7a6161fa0f26b7372e9
Author: Andy Lutomirski <[email protected]>
AuthorDate: Tue, 26 Apr 2016 09:39:05 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 28 Apr 2016 11:08:41 +0200

sched/core, ARM: Include linux/preempt.h from asm/mmu_context.h

arm's mmu_context.h uses preempt_enable_no_resched but doesn't
include anything that would pull in the declaration.

If I start including <asm/mmu_context.h> from <linux/mmu_context.h>
without this, the build breaks.

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/5b95730a70f2dafe12d4fbf38d20eb7330d67ba3.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/arm/include/asm/mmu_context.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h
index fa5b42d..ed73bab 100644
--- a/arch/arm/include/asm/mmu_context.h
+++ b/arch/arm/include/asm/mmu_context.h
@@ -15,6 +15,7 @@

#include <linux/compiler.h>
#include <linux/sched.h>
+#include <linux/preempt.h>
#include <asm/cacheflush.h>
#include <asm/cachetype.h>
#include <asm/proc-fns.h>

2016-04-28 10:42:34

by Russell King - ARM Linux

Subject: Re: [tip:sched/core] sched/core, ARM: Include linux/preempt.h from asm/mmu_context.h

I've not got around to looking at this yet as I've been away, and I'm
still catching up with email.

On Thu, Apr 28, 2016 at 03:29:57AM -0700, tip-bot for Andy Lutomirski wrote:
> Commit-ID: 88f10e37e150569a390be7a6161fa0f26b7372e9
> Gitweb: http://git.kernel.org/tip/88f10e37e150569a390be7a6161fa0f26b7372e9
> Author: Andy Lutomirski <[email protected]>
> AuthorDate: Tue, 26 Apr 2016 09:39:05 -0700
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Thu, 28 Apr 2016 11:08:41 +0200
>
> sched/core, ARM: Include linux/preempt.h from asm/mmu_context.h
>
> arm's mmu_context.h uses preempt_enable_no_resched but doesn't
> include anything that would pull in the declaration.
>
> If I start including <asm/mmu_context.h> from <linux/mmu_context.h>
> without this, the build breaks.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> Reviewed-by: Borislav Petkov <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Link: http://lkml.kernel.org/r/5b95730a70f2dafe12d4fbf38d20eb7330d67ba3.1461688545.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/arm/include/asm/mmu_context.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h
> index fa5b42d..ed73bab 100644
> --- a/arch/arm/include/asm/mmu_context.h
> +++ b/arch/arm/include/asm/mmu_context.h
> @@ -15,6 +15,7 @@
>
> #include <linux/compiler.h>
> #include <linux/sched.h>
> +#include <linux/preempt.h>
> #include <asm/cacheflush.h>
> #include <asm/cachetype.h>
> #include <asm/proc-fns.h>

--
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

Subject: [tip:sched/core] x86/mm: Build arch/x86/mm/tlb.c even on !SMP

Commit-ID: e1074888c326038340a1ada9129d679e661f2ea6
Gitweb: http://git.kernel.org/tip/e1074888c326038340a1ada9129d679e661f2ea6
Author: Andy Lutomirski <[email protected]>
AuthorDate: Tue, 26 Apr 2016 09:39:07 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 28 Apr 2016 11:44:19 +0200

x86/mm: Build arch/x86/mm/tlb.c even on !SMP

Currently all of the functions that live in tlb.c are inlined on
!SMP builds. One can debate whether this is a good idea (in many
respects the code in tlb.c is better than the inlined UP code).

Regardless, I want to add code that needs to be built on UP and SMP
kernels and relates to tlb flushing, so arrange for tlb.c to be
compiled unconditionally.

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/f0d778f0d828fc46e5d1946bca80f0aaf9abf032.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/Makefile | 3 +--
arch/x86/mm/tlb.c | 4 ++++
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index f989132..62c0043 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -2,7 +2,7 @@
KCOV_INSTRUMENT_tlb.o := n

obj-y := init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
- pat.o pgtable.o physaddr.o gup.o setup_nx.o
+ pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o

# Make sure __phys_addr has no stackprotector
nostackp := $(call cc-option, -fno-stack-protector)
@@ -12,7 +12,6 @@ CFLAGS_setup_nx.o := $(nostackp)
CFLAGS_fault.o := -I$(src)/../include/asm/trace

obj-$(CONFIG_X86_PAT) += pat_rbtree.o
-obj-$(CONFIG_SMP) += tlb.o

obj-$(CONFIG_X86_32) += pgtable_32.o iomap_32.o

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fe9b9f7..a4530e2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
* Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
*/

+#ifdef CONFIG_SMP
+
struct flush_tlb_info {
struct mm_struct *flush_mm;
unsigned long flush_start;
@@ -353,3 +355,5 @@ static int __init create_tlb_single_page_flush_ceiling(void)
return 0;
}
late_initcall(create_tlb_single_page_flush_ceiling);
+
+#endif /* CONFIG_SMP */

Subject: [tip:sched/core] x86/mm, sched/core: Uninline switch_mm()

Commit-ID: 69c0319aabba45bcf33178916a2f06967b4adede
Gitweb: http://git.kernel.org/tip/69c0319aabba45bcf33178916a2f06967b4adede
Author: Andy Lutomirski <[email protected]>
AuthorDate: Tue, 26 Apr 2016 09:39:08 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 28 Apr 2016 11:44:19 +0200

x86/mm, sched/core: Uninline switch_mm()

It's fairly large and it has quite a few callers. This may also
help untangle some headers down the road.

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/54f3367803e7f80b2be62c8a21879aa74b1a5f57.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 98 +----------------------------------
arch/x86/mm/tlb.c | 102 +++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+), 96 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 8428002..bb911dd 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -115,103 +115,9 @@ static inline void destroy_context(struct mm_struct *mm)
destroy_context_ldt(mm);
}

-static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk)
-{
- unsigned cpu = smp_processor_id();
+extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+ struct task_struct *tsk);

- if (likely(prev != next)) {
-#ifdef CONFIG_SMP
- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
- this_cpu_write(cpu_tlbstate.active_mm, next);
-#endif
- cpumask_set_cpu(cpu, mm_cpumask(next));
-
- /*
- * Re-load page tables.
- *
- * This logic has an ordering constraint:
- *
- * CPU 0: Write to a PTE for 'next'
- * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
- * CPU 1: set bit 1 in next's mm_cpumask
- * CPU 1: load from the PTE that CPU 0 writes (implicit)
- *
- * We need to prevent an outcome in which CPU 1 observes
- * the new PTE value and CPU 0 observes bit 1 clear in
- * mm_cpumask. (If that occurs, then the IPI will never
- * be sent, and CPU 0's TLB will contain a stale entry.)
- *
- * The bad outcome can occur if either CPU's load is
- * reordered before that CPU's store, so both CPUs must
- * execute full barriers to prevent this from happening.
- *
- * Thus, switch_mm needs a full barrier between the
- * store to mm_cpumask and any operation that could load
- * from next->pgd. TLB fills are special and can happen
- * due to instruction fetches or for no reason at all,
- * and neither LOCK nor MFENCE orders them.
- * Fortunately, load_cr3() is serializing and gives the
- * ordering guarantee we need.
- *
- */
- load_cr3(next->pgd);
-
- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-
- /* Stop flush ipis for the previous mm */
- cpumask_clear_cpu(cpu, mm_cpumask(prev));
-
- /* Load per-mm CR4 state */
- load_mm_cr4(next);
-
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
- /*
- * Load the LDT, if the LDT is different.
- *
- * It's possible that prev->context.ldt doesn't match
- * the LDT register. This can happen if leave_mm(prev)
- * was called and then modify_ldt changed
- * prev->context.ldt but suppressed an IPI to this CPU.
- * In this case, prev->context.ldt != NULL, because we
- * never set context.ldt to NULL while the mm still
- * exists. That means that next->context.ldt !=
- * prev->context.ldt, because mms never share an LDT.
- */
- if (unlikely(prev->context.ldt != next->context.ldt))
- load_mm_ldt(next);
-#endif
- }
-#ifdef CONFIG_SMP
- else {
- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
- BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
-
- if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
- /*
- * On established mms, the mm_cpumask is only changed
- * from irq context, from ptep_clear_flush() while in
- * lazy tlb mode, and here. Irqs are blocked during
- * schedule, protecting us from simultaneous changes.
- */
- cpumask_set_cpu(cpu, mm_cpumask(next));
-
- /*
- * We were in lazy tlb mode and leave_mm disabled
- * tlb flush IPI delivery. We must reload CR3
- * to make sure to use no freed page tables.
- *
- * As above, load_cr3() is serializing and orders TLB
- * fills with respect to the mm_cpumask write.
- */
- load_cr3(next->pgd);
- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
- load_mm_cr4(next);
- load_mm_ldt(next);
- }
- }
-#endif
-}

#define activate_mm(prev, next) \
do { \
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a4530e2..ce7a0c9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -59,6 +59,108 @@ void leave_mm(int cpu)
}
EXPORT_SYMBOL_GPL(leave_mm);

+#endif /* CONFIG_SMP */
+
+void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+ struct task_struct *tsk)
+{
+ unsigned cpu = smp_processor_id();
+
+ if (likely(prev != next)) {
+#ifdef CONFIG_SMP
+ this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ this_cpu_write(cpu_tlbstate.active_mm, next);
+#endif
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+
+ /*
+ * Re-load page tables.
+ *
+ * This logic has an ordering constraint:
+ *
+ * CPU 0: Write to a PTE for 'next'
+ * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
+ * CPU 1: set bit 1 in next's mm_cpumask
+ * CPU 1: load from the PTE that CPU 0 writes (implicit)
+ *
+ * We need to prevent an outcome in which CPU 1 observes
+ * the new PTE value and CPU 0 observes bit 1 clear in
+ * mm_cpumask. (If that occurs, then the IPI will never
+ * be sent, and CPU 0's TLB will contain a stale entry.)
+ *
+ * The bad outcome can occur if either CPU's load is
+ * reordered before that CPU's store, so both CPUs must
+ * execute full barriers to prevent this from happening.
+ *
+ * Thus, switch_mm needs a full barrier between the
+ * store to mm_cpumask and any operation that could load
+ * from next->pgd. TLB fills are special and can happen
+ * due to instruction fetches or for no reason at all,
+ * and neither LOCK nor MFENCE orders them.
+ * Fortunately, load_cr3() is serializing and gives the
+ * ordering guarantee we need.
+ *
+ */
+ load_cr3(next->pgd);
+
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+
+ /* Stop flush ipis for the previous mm */
+ cpumask_clear_cpu(cpu, mm_cpumask(prev));
+
+ /* Load per-mm CR4 state */
+ load_mm_cr4(next);
+
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ /*
+ * Load the LDT, if the LDT is different.
+ *
+ * It's possible that prev->context.ldt doesn't match
+ * the LDT register. This can happen if leave_mm(prev)
+ * was called and then modify_ldt changed
+ * prev->context.ldt but suppressed an IPI to this CPU.
+ * In this case, prev->context.ldt != NULL, because we
+ * never set context.ldt to NULL while the mm still
+ * exists. That means that next->context.ldt !=
+ * prev->context.ldt, because mms never share an LDT.
+ */
+ if (unlikely(prev->context.ldt != next->context.ldt))
+ load_mm_ldt(next);
+#endif
+ }
+#ifdef CONFIG_SMP
+ else {
+ this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
+
+ if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+ /*
+ * On established mms, the mm_cpumask is only changed
+ * from irq context, from ptep_clear_flush() while in
+ * lazy tlb mode, and here. Irqs are blocked during
+ * schedule, protecting us from simultaneous changes.
+ */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+
+ /*
+ * We were in lazy tlb mode and leave_mm disabled
+ * tlb flush IPI delivery. We must reload CR3
+ * to make sure to use no freed page tables.
+ *
+ * As above, load_cr3() is serializing and orders TLB
+ * fills with respect to the mm_cpumask write.
+ */
+ load_cr3(next->pgd);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ load_mm_cr4(next);
+ load_mm_ldt(next);
+ }
+ }
+#endif
+}
+
+#ifdef CONFIG_SMP
+
/*
* The flush IPI assumes that a thread switch happens in this order:
* [cpu0: the cpu that switches]

Subject: [tip:sched/core] x86/mm, sched/core: Turn off IRQs in switch_mm()

Commit-ID: 078194f8e9fe3cf54c8fd8bded48a1db5bd8eb8a
Gitweb: http://git.kernel.org/tip/078194f8e9fe3cf54c8fd8bded48a1db5bd8eb8a
Author: Andy Lutomirski <[email protected]>
AuthorDate: Tue, 26 Apr 2016 09:39:09 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 28 Apr 2016 11:44:20 +0200

x86/mm, sched/core: Turn off IRQs in switch_mm()

Potential races between switch_mm() and TLB-flush or LDT-flush IPIs
could be very messy. AFAICT the code is currently okay, whether by
accident or by careful design, but enabling PCID will make it
considerably more complicated and will no longer be obviously safe.

Fix it with a big hammer: run switch_mm() with IRQs off.

To avoid a performance hit in the scheduler, we take advantage of
our knowledge that the scheduler already has IRQs disabled when it
calls switch_mm().

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/f19baf759693c9dcae64bbff76189db77cb13398.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 3 +++
arch/x86/mm/tlb.c | 10 ++++++++++
2 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index bb911dd..39634819 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -118,6 +118,9 @@ static inline void destroy_context(struct mm_struct *mm)
extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk);

+extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+ struct task_struct *tsk);
+#define switch_mm_irqs_off switch_mm_irqs_off

#define activate_mm(prev, next) \
do { \
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ce7a0c9..5643fd0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -64,6 +64,16 @@ EXPORT_SYMBOL_GPL(leave_mm);
void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ switch_mm_irqs_off(prev, next, tsk);
+ local_irq_restore(flags);
+}
+
+void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+ struct task_struct *tsk)
+{
unsigned cpu = smp_processor_id();

if (likely(prev != next)) {

2016-04-28 13:27:32

by Russell King - ARM Linux

Subject: Re: [PATCH v2 1/5] arm: Include linux/preempt.h from asm/mmu_context.h

On Tue, Apr 26, 2016 at 09:39:05AM -0700, Andy Lutomirski wrote:
> arm's mmu_context.h uses preempt_enable_no_resched but doesn't
> include anything that would pull in the declaration.
>
> If I start including <asm/mmu_context.h> from <linux/mmu_context.h>
> without this, the build breaks.

From an inspection of the include paths, this looks okay.
>
> Cc: Russell King <[email protected]>

Acked-by: Russell King <[email protected]>

Thanks.

--
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.