2017-06-29 15:53:38

by Andy Lutomirski

Subject: [PATCH v4 00/10] PCID and improved laziness

*** Ingo, even if this misses 4.13, please apply the first patch before
*** the merge window.

There are three performance benefits here:

1. TLB flushing is slow. (I.e. the flush itself takes a while.)
This avoids many of them when switching tasks by using PCID. In
a stupid little benchmark I did, it saves about 100ns on my laptop
per context switch. I'll try to improve that benchmark.

2. Mms that have been used recently on a given CPU might get to keep
their TLB entries alive across process switches with this patch
set. TLB fills are pretty fast on modern CPUs, but they're even
faster when they don't happen.

3. Lazy TLB is way better. We used to do two stupid things when we
ran kernel threads: we'd send IPIs to flush user contexts on their
CPUs and then we'd write to CR3 for no particular reason as an excuse
to stop further IPIs. With this patch, we do neither.

This will, in general, perform suboptimally if paravirt TLB flushing
is in use (currently just Xen, I think, but Hyper-V is in the works).
The code is structured so we could fix it in one of two ways: we
could take a spinlock when touching the percpu state so we can update
it remotely after a paravirt flush, or we could be more careful about
exactly how we access the state and use cmpxchg16b to do atomic
remote updates. (On SMP systems without cmpxchg16b, we'd just skip
the optimization entirely.)

This is still missing a final comment-only patch to add overall
documentation for the whole thing, but I didn't want to block sending
the maybe-hopefully-final code on that.

This is based on tip:x86/mm. The branch is here if you want to play:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid

In general, performance seems to exceed my expectations. Here are
some performance numbers copy-and-pasted from the changelogs for
"Rework lazy TLB mode and TLB freshness" and "Try to preserve old
TLB entries using PCID":

MADV_DONTNEED; touch the page; switch CPUs using sched_setaffinity. In
an unpatched kernel, MADV_DONTNEED will send an IPI to the previous CPU.
This is intended to be a nearly worst-case test.
patched: 13.4µs
unpatched: 21.6µs

Vitaly's pthread_mmap microbenchmark with 8 threads (on four cores),
nrounds = 100, 256M data
patched: 1.1 seconds or so
unpatched: 1.9 seconds or so

ping-pong between two mms on the same CPU using eventfd:
patched: 1.22µs
patched, nopcid: 1.33µs
unpatched: 1.34µs

Same ping-pong, but now touch 512 pages (all zero-page to minimize
cache misses) each iteration. dTLB misses are measured by
dtlb_load_misses.miss_causes_a_walk:
patched: 1.8µs 11M dTLB misses
patched, nopcid: 6.2µs, 207M dTLB misses
unpatched: 6.1µs, 190M dTLB misses

Changes from v3:
- Lots more acks.
- Move comment deletion to the beginning.
- Misc cleanups from lots of reviewers.

Changes from v2:
- Add some Acks
- Move the reentrancy issue to the beginning.
(I also sent the same patch as a standalone fix -- it's just in here
so that this series applies to x86/mm.)
- Fix some comments.

Changes from RFC:
- flush_tlb_func_common() no longer gets reentered (Nadav)
- Fix ASID corruption on unlazying (kbuild bot)
- Move Xen init to the right place
- Misc cleanups

Andy Lutomirski (10):
x86/mm: Don't reenter flush_tlb_func_common()
x86/mm: Delete a big outdated comment about TLB flushing
x86/mm: Give each mm TLB flush generation a unique ID
x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
x86/mm: Rework lazy TLB mode and TLB freshness tracking
x86/mm: Stop calling leave_mm() in idle code
x86/mm: Disable PCID on 32-bit kernels
x86/mm: Add nopcid to turn off PCID
x86/mm: Enable CR4.PCIDE on supported systems
x86/mm: Try to preserve old TLB entries using PCID

Documentation/admin-guide/kernel-parameters.txt | 2 +
arch/ia64/include/asm/acpi.h | 2 -
arch/x86/include/asm/acpi.h | 2 -
arch/x86/include/asm/disabled-features.h | 4 +-
arch/x86/include/asm/mmu.h | 25 +-
arch/x86/include/asm/mmu_context.h | 15 +-
arch/x86/include/asm/processor-flags.h | 2 +
arch/x86/include/asm/tlbflush.h | 87 +++++-
arch/x86/kernel/cpu/bugs.c | 8 +
arch/x86/kernel/cpu/common.c | 40 +++
arch/x86/mm/init.c | 2 +-
arch/x86/mm/tlb.c | 382 ++++++++++++++++--------
arch/x86/xen/enlighten_pv.c | 6 +
arch/x86/xen/mmu_pv.c | 5 +-
drivers/acpi/processor_idle.c | 2 -
drivers/idle/intel_idle.c | 9 +-
16 files changed, 446 insertions(+), 147 deletions(-)

--
2.9.4


2017-06-29 15:53:48

by Andy Lutomirski

Subject: [PATCH v4 01/10] x86/mm: Don't reenter flush_tlb_func_common()

It was historically possible to have two concurrent TLB flushes
targeting the same CPU: one initiated locally and one initiated
remotely. This can now cause an OOPS in leave_mm() at
arch/x86/mm/tlb.c:47:

if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
BUG();

with this call trace:
flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317

Without reentrancy, this OOPS is impossible: leave_mm() is only
called if we're not in TLBSTATE_OK, yet the BUG() fires because we
are unexpectedly back in TLBSTATE_OK by the time leave_mm() runs.

This can be caused by flush_tlb_func_remote() happening between
the two checks and calling leave_mm(), resulting in two consecutive
leave_mm() calls on the same CPU with no intervening switch_mm()
calls.
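
One interleaving consistent with the description above (an illustration,
not a literal trace):

    /*
     * CPU N, local flush with IRQs enabled:
     *
     *   flush_tlb_mm_range()
     *     flush_tlb_func_local()
     *       flush_tlb_func_common()    sees state != TLBSTATE_OK
     *         leave_mm()
     *           loaded_mm != &init_mm  (first check passes)
     *             <-- IRQ: flush_tlb_func_remote() also calls leave_mm(),
     *                 which switches to init_mm and thereby sets
     *                 TLBSTATE_OK
     *           state == TLBSTATE_OK   (second check)
     *             BUG()
     */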

We never saw this OOPS before because the old leave_mm()
implementation didn't put us back in TLBSTATE_OK, so the assertion
didn't fire.

Nadav noticed the reentrancy issue in a different context, but
at the time neither of us realized that it caused a problem.

Cc: Dave Hansen <[email protected]>
Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reported-by: "Levin, Alexander (Sasha Levin)" <[email protected]>
Fixes: 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm")
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/mm/tlb.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b2485d69f7c2..1cc47838d1e8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -192,6 +192,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
static void flush_tlb_func_common(const struct flush_tlb_info *f,
bool local, enum tlb_flush_reason reason)
{
+ /* This code cannot presently handle being reentered. */
+ VM_WARN_ON(!irqs_disabled());
+
if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
leave_mm(smp_processor_id());
return;
@@ -297,8 +300,13 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
info.end = TLB_FLUSH_ALL;
}

- if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
+ if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+ VM_WARN_ON(irqs_disabled());
+ local_irq_disable();
flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+ local_irq_enable();
+ }
+
if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
put_cpu();
@@ -354,8 +362,13 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)

int cpu = get_cpu();

- if (cpumask_test_cpu(cpu, &batch->cpumask))
+ if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+ VM_WARN_ON(irqs_disabled());
+ local_irq_disable();
flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
+ local_irq_enable();
+ }
+
if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
flush_tlb_others(&batch->cpumask, &info);
cpumask_clear(&batch->cpumask);
--
2.9.4

2017-06-29 15:53:59

by Andy Lutomirski

Subject: [PATCH v4 03/10] x86/mm: Give each mm TLB flush generation a unique ID

This adds two new variables to mmu_context_t: ctx_id and tlb_gen.
ctx_id uniquely identifies the mm_struct and will never be reused.
For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic
count of the number of times that a TLB flush has been requested.
The pair (ctx_id, tlb_gen) can be used as an identifier for TLB
flush actions and will be used in subsequent patches to reliably
determine whether all needed TLB flushes have occurred on a given
CPU.
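
As a rough sketch of how the pair is meant to be used (the helper below
is purely illustrative -- the real checks arrive in later patches):

    /* Illustration only: not added by this patch. */
    static bool cpu_caught_up(struct mm_struct *mm,
                              u64 cpu_ctx_id, u64 cpu_tlb_gen)
    {
            /* Entries that came from a different mm are never current. */
            if (cpu_ctx_id != mm->context.ctx_id)
                    return false;

            /* Otherwise we're current iff we've seen the latest flush. */
            return cpu_tlb_gen >= atomic64_read(&mm->context.tlb_gen);
    }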

This patch is split out for ease of review. By itself, it has no
real effect other than creating and updating the new variables.

Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/mmu.h | 25 +++++++++++++++++++++++--
arch/x86/include/asm/mmu_context.h | 6 ++++++
arch/x86/include/asm/tlbflush.h | 18 ++++++++++++++++++
arch/x86/mm/tlb.c | 6 ++++--
4 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 79b647a7ebd0..bb8c597c2248 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -3,12 +3,28 @@

#include <linux/spinlock.h>
#include <linux/mutex.h>
+#include <linux/atomic.h>

/*
- * The x86 doesn't have a mmu context, but
- * we put the segment information here.
+ * x86 has arch-specific MMU state beyond what lives in mm_struct.
*/
typedef struct {
+ /*
+ * ctx_id uniquely identifies this mm_struct. A ctx_id will never
+ * be reused, and zero is not a valid ctx_id.
+ */
+ u64 ctx_id;
+
+ /*
+ * Any code that needs to do any sort of TLB flushing for this
+ * mm will first make its changes to the page tables, then
+ * increment tlb_gen, then flush. This lets the low-level
+ * flushing code keep track of what needs flushing.
+ *
+ * This is not used on Xen PV.
+ */
+ atomic64_t tlb_gen;
+
#ifdef CONFIG_MODIFY_LDT_SYSCALL
struct ldt_struct *ldt;
#endif
@@ -37,6 +53,11 @@ typedef struct {
#endif
} mm_context_t;

+#define INIT_MM_CONTEXT(mm) \
+ .context = { \
+ .ctx_id = 1, \
+ }
+
void leave_mm(int cpu);

#endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ecfcb6643c9b..ae19b9d11259 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -12,6 +12,9 @@
#include <asm/tlbflush.h>
#include <asm/paravirt.h>
#include <asm/mpx.h>
+
+extern atomic64_t last_mm_ctx_id;
+
#ifndef CONFIG_PARAVIRT
static inline void paravirt_activate_mm(struct mm_struct *prev,
struct mm_struct *next)
@@ -132,6 +135,9 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
static inline int init_new_context(struct task_struct *tsk,
struct mm_struct *mm)
{
+ mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
+ atomic64_set(&mm->context.tlb_gen, 0);
+
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
/* pkey 0 is the default and always allocated */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 50ea3482e1d1..ad2135385699 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -57,6 +57,23 @@ static inline void invpcid_flush_all_nonglobals(void)
__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
}

+static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
+{
+ u64 new_tlb_gen;
+
+ /*
+ * Bump the generation count. This also serves as a full barrier
+ * that synchronizes with switch_mm(): callers are required to order
+ * their read of mm_cpumask after their writes to the paging
+ * structures.
+ */
+ smp_mb__before_atomic();
+ new_tlb_gen = atomic64_inc_return(&mm->context.tlb_gen);
+ smp_mb__after_atomic();
+
+ return new_tlb_gen;
+}
+
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else
@@ -262,6 +279,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm)
{
+ inc_mm_tlb_gen(mm);
cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
}

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 014d07a80053..14f4f8f66aa8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
* Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
*/

+atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
+
void leave_mm(int cpu)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -250,8 +252,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,

cpu = get_cpu();

- /* Synchronize with switch_mm. */
- smp_mb();
+ /* This is also a barrier that synchronizes with switch_mm(). */
+ inc_mm_tlb_gen(mm);

/* Should we flush just the requested range? */
if ((end != TLB_FLUSH_ALL) &&
--
2.9.4

2017-06-29 15:54:09

by Andy Lutomirski

Subject: [PATCH v4 02/10] x86/mm: Delete a big outdated comment about TLB flushing

The comment describes the old explicit IPI-based flush logic, which
is long gone.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/mm/tlb.c | 36 ------------------------------------
1 file changed, 36 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1cc47838d1e8..014d07a80053 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -153,42 +153,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
switch_ldt(real_prev, next);
}

-/*
- * The flush IPI assumes that a thread switch happens in this order:
- * [cpu0: the cpu that switches]
- * 1) switch_mm() either 1a) or 1b)
- * 1a) thread switch to a different mm
- * 1a1) set cpu_tlbstate to TLBSTATE_OK
- * Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
- * if cpu0 was in lazy tlb mode.
- * 1a2) update cpu active_mm
- * Now cpu0 accepts tlb flushes for the new mm.
- * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
- * Now the other cpus will send tlb flush ipis.
- * 1a4) change cr3.
- * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
- * Stop ipi delivery for the old mm. This is not synchronized with
- * the other cpus, but flush_tlb_func ignore flush ipis for the wrong
- * mm, and in the worst case we perform a superfluous tlb flush.
- * 1b) thread switch without mm change
- * cpu active_mm is correct, cpu0 already handles flush ipis.
- * 1b1) set cpu_tlbstate to TLBSTATE_OK
- * 1b2) test_and_set the cpu bit in cpu_vm_mask.
- * Atomically set the bit [other cpus will start sending flush ipis],
- * and test the bit.
- * 1b3) if the bit was 0: leave_mm was called, flush the tlb.
- * 2) switch %%esp, ie current
- *
- * The interrupt must handle 2 special cases:
- * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm.
- * - the cpu performs speculative tlb reads, i.e. even if the cpu only
- * runs in kernel space, the cpu could load tlb entries for user space
- * pages.
- *
- * The good news is that cpu_tlbstate is local to each cpu, no
- * write/read ordering problems.
- */
-
static void flush_tlb_func_common(const struct flush_tlb_info *f,
bool local, enum tlb_flush_reason reason)
{
--
2.9.4

2017-06-29 15:54:22

by Andy Lutomirski

Subject: [PATCH v4 04/10] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

There are two kernel features that would benefit from tracking
how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
it up to date in real time:

- Lazy mm switching currently works by switching to init_mm when
it would otherwise flush. This is wasteful: there isn't fundamentally
any need to update CR3 at all when going lazy or when returning from
lazy mode, nor is there any need to receive flush IPIs at all. Instead,
we should just stop trying to keep the TLB coherent when we go lazy and,
when unlazying, check whether we missed any flushes.

- PCID will let us keep recent user contexts alive in the TLB. If we
start doing this, we need a way to decide whether those contexts are
up to date.

On some paravirt systems, remote TLBs can be flushed without IPIs.
This won't update the target CPUs' tlb_gens, which may cause
unnecessary local flushes later on. We can address this if it becomes
a problem by carefully updating the target CPU's tlb_gen directly.

By itself, this patch is a very minor optimization that avoids
unnecessary flushes when multiple TLB flushes targeting the same CPU
race. The complexity in this patch would not be worth it on its own,
but it will enable improved lazy TLB tracking and PCID.
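
As a worked example with made-up numbers: suppose this CPU's
local_tlb_gen is 5 and a partial flush arrives with new_tlb_gen == 6.
If mm_tlb_gen is also 6, that partial flush by itself brings us fully
up to date, so we flush only the requested range. If mm_tlb_gen has
already advanced to 7 because another flush raced with this one, we do
a full flush instead and jump local_tlb_gen straight to 7. And if
local_tlb_gen already equals mm_tlb_gen by the time the IPI lands, a
concurrent flush beat us to it and there is nothing left to do.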

Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 43 +++++++++++++++--
arch/x86/mm/tlb.c | 102 +++++++++++++++++++++++++++++++++++++---
2 files changed, 135 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index ad2135385699..d7df54cc7e4d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,11 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
#define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
#endif

+struct tlb_context {
+ u64 ctx_id;
+ u64 tlb_gen;
+};
+
struct tlb_state {
/*
* cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
@@ -97,6 +102,21 @@ struct tlb_state {
* disabling interrupts when modifying either one.
*/
unsigned long cr4;
+
+ /*
+ * This is a list of all contexts that might exist in the TLB.
+ * Since we don't yet use PCID, there is only one context.
+ *
+ * For each context, ctx_id indicates which mm the TLB's user
+ * entries came from. As an invariant, the TLB will never
+ * contain entries that are out-of-date as when that mm reached
+ * the tlb_gen in the list.
+ *
+ * To be clear, this means that it's legal for the TLB code to
+ * flush the TLB without updating tlb_gen. This can happen
+ * (for now, at least) due to paravirt remote flushes.
+ */
+ struct tlb_context ctxs[1];
};
DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

@@ -248,9 +268,26 @@ static inline void __flush_tlb_one(unsigned long addr)
* and page-granular flushes are available only on i486 and up.
*/
struct flush_tlb_info {
- struct mm_struct *mm;
- unsigned long start;
- unsigned long end;
+ /*
+ * We support several kinds of flushes.
+ *
+ * - Fully flush a single mm. .mm will be set, .end will be
+ * TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to
+ * which the IPI sender is trying to catch us up.
+ *
+ * - Partially flush a single mm. .mm will be set, .start and
+ * .end will indicate the range, and .new_tlb_gen will be set
+ * such that the changes between generation .new_tlb_gen-1 and
+ * .new_tlb_gen are entirely contained in the indicated range.
+ *
+ * - Fully flush all mms whose tlb_gens have been updated. .mm
+ * will be NULL, .end will be TLB_FLUSH_ALL, and .new_tlb_gen
+ * will be zero.
+ */
+ struct mm_struct *mm;
+ unsigned long start;
+ unsigned long end;
+ u64 new_tlb_gen;
};

#define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 14f4f8f66aa8..4e5a5ddb9e4d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -105,6 +105,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
}

this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, atomic64_read(&next->context.tlb_gen));

WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
cpumask_set_cpu(cpu, mm_cpumask(next));
@@ -155,25 +157,102 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
switch_ldt(real_prev, next);
}

+/*
+ * flush_tlb_func_common()'s memory ordering requirement is that any
+ * TLB fills that happen after we flush the TLB are ordered after we
+ * read active_mm's tlb_gen. We don't need any explicit barriers
+ * because all x86 flush operations are serializing and the
+ * atomic64_read operation won't be reordered by the compiler.
+ */
static void flush_tlb_func_common(const struct flush_tlb_info *f,
bool local, enum tlb_flush_reason reason)
{
+ /*
+ * We have three different tlb_gen values in here. They are:
+ *
+ * - mm_tlb_gen: the latest generation.
+ * - local_tlb_gen: the generation that this CPU has already caught
+ * up to.
+ * - f->new_tlb_gen: the generation that the requester of the flush
+ * wants us to catch up to.
+ */
+ struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
+ u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
+
/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());

+ VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ loaded_mm->context.ctx_id);
+
if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
+ /*
+ * leave_mm() is adequate to handle any type of flush, and
+ * we would prefer not to receive further IPIs. leave_mm()
+ * clears this CPU's bit in mm_cpumask().
+ */
leave_mm(smp_processor_id());
return;
}

- if (f->end == TLB_FLUSH_ALL) {
- local_flush_tlb();
- if (local)
- count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
- trace_tlb_flush(reason, TLB_FLUSH_ALL);
- } else {
+ if (unlikely(local_tlb_gen == mm_tlb_gen)) {
+ /*
+ * There's nothing to do: we're already up to date. This can
+ * happen if two concurrent flushes happen -- the first flush to
+ * be handled can catch us all the way up, leaving no work for
+ * the second flush.
+ */
+ return;
+ }
+
+ WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
+ WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
+
+ /*
+ * If we get to this point, we know that our TLB is out of date.
+ * This does not strictly imply that we need to flush (it's
+ * possible that f->new_tlb_gen <= local_tlb_gen), but we're
+ * going to need to flush in the very near future, so we might
+ * as well get it over with.
+ *
+ * The only question is whether to do a full or partial flush.
+ *
+ * We do a partial flush if requested and two extra conditions
+ * are met:
+ *
+ * 1. f->new_tlb_gen == local_tlb_gen + 1. We have an invariant that
+ * we've always done all needed flushes to catch up to
+ * local_tlb_gen. If, for example, local_tlb_gen == 2 and
+ * f->new_tlb_gen == 3, then we know that the flush needed to bring
+ * us up to date for tlb_gen 3 is the partial flush we're
+ * processing.
+ *
+ * As an example of why this check is needed, suppose that there
+ * are two concurrent flushes. The first is a full flush that
+ * changes context.tlb_gen from 1 to 2. The second is a partial
+ * flush that changes context.tlb_gen from 2 to 3. If they get
+ * processed on this CPU in reverse order, we'll see
+ * local_tlb_gen == 1, mm_tlb_gen == 3, and end != TLB_FLUSH_ALL.
+ * If we were to use __flush_tlb_single() and set local_tlb_gen to
+ * 3, we'd break the invariant: we'd update local_tlb_gen above
+ * 1 without the full flush that's needed for tlb_gen 2.
+ *
+ * 2. f->new_tlb_gen == mm_tlb_gen. This is purely an optimization.
+ * Partial TLB flushes are not all that much cheaper than full TLB
+ * flushes, so it seems unlikely that it would be a performance win
+ * to do a partial flush if that won't bring our TLB fully up to
+ * date. By doing a full flush instead, we can increase
+ * local_tlb_gen all the way to mm_tlb_gen and we can probably
+ * avoid another flush in the very near future.
+ */
+ if (f->end != TLB_FLUSH_ALL &&
+ f->new_tlb_gen == local_tlb_gen + 1 &&
+ f->new_tlb_gen == mm_tlb_gen) {
+ /* Partial flush */
unsigned long addr;
unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
+
addr = f->start;
while (addr < f->end) {
__flush_tlb_single(addr);
@@ -182,7 +261,16 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
if (local)
count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
trace_tlb_flush(reason, nr_pages);
+ } else {
+ /* Full flush. */
+ local_flush_tlb();
+ if (local)
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+ trace_tlb_flush(reason, TLB_FLUSH_ALL);
}
+
+ /* Both paths above update our state to mm_tlb_gen. */
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
}

static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
@@ -253,7 +341,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
cpu = get_cpu();

/* This is also a barrier that synchronizes with switch_mm(). */
- inc_mm_tlb_gen(mm);
+ info.new_tlb_gen = inc_mm_tlb_gen(mm);

/* Should we flush just the requested range? */
if ((end != TLB_FLUSH_ALL) &&
--
2.9.4

2017-06-29 15:54:34

by Andy Lutomirski

Subject: [PATCH v4 10/10] x86/mm: Try to preserve old TLB entries using PCID

PCID is a "process context ID" -- it's what other architectures call
an address space ID. Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB. x86's
PCID is 12 bits.

This is an unorthodox approach to using PCID. x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster
systems with over 4096 CPUs. To make matters worse, past attempts
to use all 12 PCID bits have resulted in slowdowns instead of
speedups.

This patch uses PCID differently. We use a PCID to identify a
recently-used mm on a per-cpu basis. An mm has no fixed PCID
binding at all; instead, we give it a fresh PCID each time it's
loaded except in cases where we want to preserve the TLB, in which
case we reuse a recent value.
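
Concretely, once CR4.PCIDE is set, the new code composes CR3 values as
in this sketch (the helper is illustrative; the patch open-codes the
same thing in switch_mm_irqs_off()):

    /* Illustration only: how CR3 is built once PCIDE is enabled. */
    static unsigned long make_cr3(pgd_t *pgd, u16 asid, bool need_flush)
    {
            /* The low 12 bits of CR3 select the PCID ("ASID" here). */
            unsigned long cr3 = __pa(pgd) | asid;

            /* Bit 63 (CR3_NOFLUSH) keeps this PCID's old TLB entries. */
            if (!need_flush)
                    cr3 |= CR3_NOFLUSH;

            return cr3;
    }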

Here are some benchmark results, done on a Skylake laptop at 2.3 GHz
(turbo off, intel_pstate requesting max performance) under KVM with
the guest using idle=poll (to avoid artifacts when bouncing between
CPUs). I haven't done any real statistics here -- I just ran them
in a loop and picked the fastest results that didn't look like
outliers. Unpatched means commit a4eb8b993554, so all the
bookkeeping overhead is gone.

ping-pong between two mms on the same CPU using eventfd:
patched: 1.22µs
patched, nopcid: 1.33µs
unpatched: 1.34µs

Same ping-pong, but now touch 512 pages (all zero-page to minimize
cache misses) each iteration. dTLB misses are measured by
dtlb_load_misses.miss_causes_a_walk:
patched: 1.8µs 11M dTLB misses
patched, nopcid: 6.2µs, 207M dTLB misses
unpatched: 6.1µs, 190M dTLB misses

Reviewed-by: Nadav Amit <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 3 ++
arch/x86/include/asm/processor-flags.h | 2 +
arch/x86/include/asm/tlbflush.h | 18 +++++++-
arch/x86/mm/init.c | 1 +
arch/x86/mm/tlb.c | 82 +++++++++++++++++++++++++++-------
5 files changed, 88 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 85f6b5575aad..14b3cdccf4f9 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -300,6 +300,9 @@ static inline unsigned long __get_current_cr3_fast(void)
{
unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);

+ if (static_cpu_has(X86_FEATURE_PCID))
+ cr3 |= this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || !in_atomic());

diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 79aa2f98398d..791b60199aa4 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -35,6 +35,7 @@
/* Mask off the address space ID bits. */
#define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull
#define CR3_PCID_MASK 0xFFFull
+#define CR3_NOFLUSH (1UL << 63)
#else
/*
* CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -42,6 +43,7 @@
*/
#define CR3_ADDR_MASK 0xFFFFFFFFull
#define CR3_PCID_MASK 0ull
+#define CR3_NOFLUSH 0
#endif

#endif /* _ASM_X86_PROCESSOR_FLAGS_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6397275008db..d23e61dc0640 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,12 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
#define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
#endif

+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define TLB_NR_DYN_ASIDS 6
+
struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
@@ -95,6 +101,8 @@ struct tlb_state {
* mode even if we've already switched back to swapper_pg_dir.
*/
struct mm_struct *loaded_mm;
+ u16 loaded_mm_asid;
+ u16 next_asid;

/*
* Access to this CR4 shadow and to H/W CR4 is protected by
@@ -104,7 +112,8 @@ struct tlb_state {

/*
* This is a list of all contexts that might exist in the TLB.
- * Since we don't yet use PCID, there is only one context.
+ * There is one per ASID that we use, and the ASID (what the
+ * CPU calls PCID) is the index into ctxs.
*
* For each context, ctx_id indicates which mm the TLB's user
* entries came from. As an invariant, the TLB will never
@@ -114,8 +123,13 @@ struct tlb_state {
* To be clear, this means that it's legal for the TLB code to
* flush the TLB without updating tlb_gen. This can happen
* (for now, at least) due to paravirt remote flushes.
+ *
+ * NB: context 0 is a bit special, since it's also used by
+ * various bits of init code. This is fine -- code that
+ * isn't aware of PCID will end up harmlessly flushing
+ * context 0.
*/
- struct tlb_context ctxs[1];
+ struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
};
DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 4d353efb2838..65ae17d45c4a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,6 +812,7 @@ void __init zone_sizes_init(void)

DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
+ .next_asid = 1,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
};
EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2c1b8881e9d3..63a5b451c128 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,40 @@

atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);

+static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
+ u16 *new_asid, bool *need_flush)
+{
+ u16 asid;
+
+ if (!static_cpu_has(X86_FEATURE_PCID)) {
+ *new_asid = 0;
+ *need_flush = true;
+ return;
+ }
+
+ for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
+ if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
+ next->context.ctx_id)
+ continue;
+
+ *new_asid = asid;
+ *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
+ next_tlb_gen);
+ return;
+ }
+
+ /*
+ * We don't currently own an ASID slot on this CPU.
+ * Allocate a slot.
+ */
+ *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
+ if (*new_asid >= TLB_NR_DYN_ASIDS) {
+ *new_asid = 0;
+ this_cpu_write(cpu_tlbstate.next_asid, 1);
+ }
+ *need_flush = true;
+}
+
void leave_mm(int cpu)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -65,6 +99,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+ u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;

@@ -84,12 +119,13 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
/*
* Verify that CR3 is what we think it is. This will catch
* hypothetical buggy code that directly switches to swapper_pg_dir
- * without going through leave_mm() / switch_mm_irqs_off().
+ * without going through leave_mm() / switch_mm_irqs_off() or that
+ * does something like write_cr3(read_cr3_pa()).
*/
- VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));
+ VM_BUG_ON(__read_cr3() != (__pa(real_prev->pgd) | prev_asid));

if (real_prev == next) {
- VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);

if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
@@ -104,18 +140,20 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,

/* Resume remote flushes and then read tlb_gen. */
cpumask_set_cpu(cpu, mm_cpumask(next));
+ smp_mb__after_atomic();
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

- if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) < next_tlb_gen) {
+ if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
+ next_tlb_gen) {
/*
* Ideally, we'd have a flush_tlb() variant that
* takes the known CR3 value as input. This would
* be faster on Xen PV and on hypothetical CPUs
* on which INVPCID is fast.
*/
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+ this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
next_tlb_gen);
- write_cr3(__pa(next->pgd));
+ write_cr3(__pa(next->pgd) | prev_asid);
trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
TLB_FLUSH_ALL);
}
@@ -126,8 +164,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* are not reflected in tlb_gen.)
*/
} else {
- VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
- next->context.ctx_id);
+ u16 new_asid;
+ bool need_flush;

if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
@@ -152,14 +190,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* Start remote flushes and then read tlb_gen.
*/
cpumask_set_cpu(cpu, mm_cpumask(next));
+ smp_mb__after_atomic();
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

- this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, next_tlb_gen);
- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- write_cr3(__pa(next->pgd));
+ choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);

- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ if (need_flush) {
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
+ write_cr3(__pa(next->pgd) | new_asid);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
+ } else {
+ /* The new ASID is already up to date. */
+ write_cr3(__pa(next->pgd) | new_asid | CR3_NOFLUSH);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
+ }
+
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
}

load_mm_cr4(next);
@@ -186,13 +235,14 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* wants us to catch up to.
*/
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
- u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
+ u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);

/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());

- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
loaded_mm->context.ctx_id);

if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
@@ -280,7 +330,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
}

/* Both paths above update our state to mm_tlb_gen. */
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
+ this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
}

static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
--
2.9.4

2017-06-29 15:54:42

by Andy Lutomirski

Subject: [PATCH v4 09/10] x86/mm: Enable CR4.PCIDE on supported systems

We can use PCID if the CPU has PCID and PGE and we're not on Xen.

By itself, this has no effect. The next patch will start using
PCID.

Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 8 ++++++++
arch/x86/kernel/cpu/common.c | 22 ++++++++++++++++++++++
arch/x86/xen/enlighten_pv.c | 6 ++++++
3 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 06e997a36d49..6397275008db 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -243,6 +243,14 @@ static inline void __flush_tlb_all(void)
__flush_tlb_global();
else
__flush_tlb();
+
+ /*
+ * Note: if we somehow had PCID but not PGE, then this wouldn't work --
+ * we'd end up flushing kernel translations for the current ASID but
+ * we might fail to flush kernel translations for other cached ASIDs.
+ *
+ * To avoid this issue, we force PCID off if PGE is off.
+ */
}

static inline void __flush_tlb_one(unsigned long addr)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 904485e7b230..b95cd94ca97b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -329,6 +329,25 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c)
}
}

+static void setup_pcid(struct cpuinfo_x86 *c)
+{
+ if (cpu_has(c, X86_FEATURE_PCID)) {
+ if (cpu_has(c, X86_FEATURE_PGE)) {
+ cr4_set_bits(X86_CR4_PCIDE);
+ } else {
+ /*
+ * flush_tlb_all(), as currently implemented, won't
+ * work if PCID is on but PGE is not. Since that
+ * combination doesn't exist on real hardware, there's
+ * no reason to try to fully support it, but it's
+ * polite to avoid corrupting data if we're on
+ * an improperly configured VM.
+ */
+ clear_cpu_cap(c, X86_FEATURE_PCID);
+ }
+ }
+}
+
/*
* Protection Keys are not available in 32-bit mode.
*/
@@ -1143,6 +1162,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_smep(c);
setup_smap(c);

+ /* Set up PCID */
+ setup_pcid(c);
+
/*
* The vendor-specific functions might have changed features.
* Now we do "generic changes."
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index f33eef4ebd12..a136aac543c3 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -295,6 +295,12 @@ static void __init xen_init_capabilities(void)
setup_clear_cpu_cap(X86_FEATURE_ACC);
setup_clear_cpu_cap(X86_FEATURE_X2APIC);

+ /*
+ * Xen PV would need some work to support PCID: CR3 handling as well
+ * as xen_flush_tlb_others() would need updating.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+
if (!xen_initial_domain())
setup_clear_cpu_cap(X86_FEATURE_ACPI);

--
2.9.4

2017-06-29 15:54:50

by Andy Lutomirski

Subject: [PATCH v4 08/10] x86/mm: Add nopcid to turn off PCID

The parameter is only present on x86_64 systems to save a few bytes,
as PCID is always disabled on x86_32.
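
For example, booting with "nopcid" (exactly that string, with nothing
appended) clears X86_FEATURE_PCID, so the rest of this series never
enables CR4.PCIDE even on hardware that supports it.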

Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 2 ++
arch/x86/kernel/cpu/common.c | 18 ++++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 7737ab5d04b2..705666a44ba2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2648,6 +2648,8 @@
nopat [X86] Disable PAT (page attribute table extension of
pagetables) support.

+ nopcid [X86-64] Disable the PCID cpu feature.
+
norandmaps Don't use address space randomization. Equivalent to
echo 0 > /proc/sys/kernel/randomize_va_space

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c8b39870f33e..904485e7b230 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -168,6 +168,24 @@ static int __init x86_mpx_setup(char *s)
}
__setup("nompx", x86_mpx_setup);

+#ifdef CONFIG_X86_64
+static int __init x86_pcid_setup(char *s)
+{
+ /* require an exact match without trailing characters */
+ if (strlen(s))
+ return 0;
+
+ /* do not emit a message if the feature is not present */
+ if (!boot_cpu_has(X86_FEATURE_PCID))
+ return 1;
+
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+ pr_info("nopcid: PCID feature disabled\n");
+ return 1;
+}
+__setup("nopcid", x86_pcid_setup);
+#endif
+
static int __init x86_noinvpcid_setup(char *s)
{
/* noinvpcid doesn't accept parameters */
--
2.9.4

2017-06-29 15:55:01

by Andy Lutomirski

Subject: [PATCH v4 07/10] x86/mm: Disable PCID on 32-bit kernels

32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode. Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.

Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/disabled-features.h | 4 +++-
arch/x86/kernel/cpu/bugs.c | 8 ++++++++
2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dff775af7cd..c10c9128f54e 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -21,11 +21,13 @@
# define DISABLE_K6_MTRR (1<<(X86_FEATURE_K6_MTRR & 31))
# define DISABLE_CYRIX_ARR (1<<(X86_FEATURE_CYRIX_ARR & 31))
# define DISABLE_CENTAUR_MCR (1<<(X86_FEATURE_CENTAUR_MCR & 31))
+# define DISABLE_PCID 0
#else
# define DISABLE_VME 0
# define DISABLE_K6_MTRR 0
# define DISABLE_CYRIX_ARR 0
# define DISABLE_CENTAUR_MCR 0
+# define DISABLE_PCID (1<<(X86_FEATURE_PCID & 31))
#endif /* CONFIG_X86_64 */

#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
@@ -49,7 +51,7 @@
#define DISABLED_MASK1 0
#define DISABLED_MASK2 0
#define DISABLED_MASK3 (DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR)
-#define DISABLED_MASK4 0
+#define DISABLED_MASK4 (DISABLE_PCID)
#define DISABLED_MASK5 0
#define DISABLED_MASK6 0
#define DISABLED_MASK7 0
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0af86d9242da..db684880d74a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -21,6 +21,14 @@

void __init check_bugs(void)
{
+#ifdef CONFIG_X86_32
+ /*
+ * Regardless of whether PCID is enumerated, the SDM says
+ * that it can't be enabled in 32-bit mode.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+#endif
+
identify_boot_cpu();

if (!IS_ENABLED(CONFIG_SMP)) {
--
2.9.4

2017-06-29 15:55:14

by Andy Lutomirski

Subject: [PATCH v4 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking

x86's lazy TLB mode used to be fairly weak -- it would switch to
init_mm the first time it tried to flush a lazy TLB. This meant an
unnecessary CR3 write and, if the flush was remote, an unnecessary
IPI.

Rewrite it entirely. When we enter lazy mode, we simply remove the
CPU from mm_cpumask. This means that we need a way to figure out
whether we've missed a flush when we switch back out of lazy mode.
I use the tlb_gen machinery to track whether a context is up to
date.
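
In outline (a sketch of the flow the diff below implements, not a new
interface):

    /*
     * Reworked lazy-TLB flow:
     *
     *   enter_lazy_tlb():        clear this CPU's bit in mm_cpumask(mm),
     *                            which stops remote flush IPIs.
     *
     *   flush_tlb_func_common(): if our bit is clear, we're lazy --
     *                            skip the flush entirely.
     *
     *   switch_mm_irqs_off()     with prev == next (unlazying):
     *                            set the bit again, compare our cached
     *                            tlb_gen with mm->context.tlb_gen, and
     *                            write CR3 only if we missed a flush.
     */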

Note to reviewers: this patch, by itself, looks a bit odd. I'm
using an array of length 1 containing (ctx_id, tlb_gen) rather than
just storing tlb_gen, and making it an array isn't necessary yet.
I'm doing this because the next few patches add PCID support, and,
with PCID, we need ctx_id, and the array will end up with a length
greater than 1. Making it an array now means that there will be
less churn and therefore less stress on your eyeballs.

NB: This is dubious but, AFAICT, still correct on Xen and UV.
xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
patch changes the way that mm_cpumask() works. This should be okay,
since Xen *also* iterates all online CPUs to find all the CPUs it
needs to twiddle.

The UV tlbflush code is rather dated and should be changed.

Here are some benchmark results, done on a Skylake laptop at 2.3 GHz
(turbo off, intel_pstate requesting max performance) under KVM with
the guest using idle=poll (to avoid artifacts when bouncing between
CPUs). I haven't done any real statistics here -- I just ran them
in a loop and picked the fastest results that didn't look like
outliers. Unpatched means commit a4eb8b993554, so all the
bookkeeping overhead is gone.

MADV_DONTNEED; touch the page; switch CPUs using sched_setaffinity. In
an unpatched kernel, MADV_DONTNEED will send an IPI to the previous CPU.
This is intended to be a nearly worst-case test.
patched: 13.4µs
unpatched: 21.6µs

Vitaly's pthread_mmap microbenchmark with 8 threads (on four cores),
nrounds = 100, 256M data
patched: 1.1 seconds or so
unpatched: 1.9 seconds or so

The speedup on Vitaly's test appears to be because it spends a lot
of time blocked on mmap_sem, and this patch avoids sending IPIs to
blocked CPUs.

Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Banman <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Dimitri Sivanich <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 6 +-
arch/x86/include/asm/tlbflush.h | 4 -
arch/x86/mm/init.c | 1 -
arch/x86/mm/tlb.c | 197 ++++++++++++++++++++++---------------
arch/x86/xen/mmu_pv.c | 5 +-
5 files changed, 124 insertions(+), 89 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ae19b9d11259..85f6b5575aad 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -128,8 +128,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)

static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
- if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
- this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
+ int cpu = smp_processor_id();
+
+ if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
+ cpumask_clear_cpu(cpu, mm_cpumask(mm));
}

static inline int init_new_context(struct task_struct *tsk,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d7df54cc7e4d..06e997a36d49 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -95,7 +95,6 @@ struct tlb_state {
* mode even if we've already switched back to swapper_pg_dir.
*/
struct mm_struct *loaded_mm;
- int state;

/*
* Access to this CR4 shadow and to H/W CR4 is protected by
@@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info);

-#define TLBSTATE_OK 1
-#define TLBSTATE_LAZY 2
-
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm)
{
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 673541eb3b3f..4d353efb2838 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,6 @@ void __init zone_sizes_init(void)

DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
- .state = 0,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
};
EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4e5a5ddb9e4d..0982c997d36f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -45,8 +45,8 @@ void leave_mm(int cpu)
if (loaded_mm == &init_mm)
return;

- if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
- BUG();
+ /* Warn if we're not lazy. */
+ WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));

switch_mm(NULL, &init_mm, NULL);
}
@@ -65,94 +65,117 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
- unsigned cpu = smp_processor_id();
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+ unsigned cpu = smp_processor_id();
+ u64 next_tlb_gen;

/*
- * NB: The scheduler will call us with prev == next when
- * switching from lazy TLB mode to normal mode if active_mm
- * isn't changing. When this happens, there is no guarantee
- * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
+ * NB: The scheduler will call us with prev == next when switching
+ * from lazy TLB mode to normal mode if active_mm isn't changing.
+ * When this happens, we don't assume that CR3 (and hence
+ * cpu_tlbstate.loaded_mm) matches next.
*
* NB: leave_mm() calls us with prev == NULL and tsk == NULL.
*/

- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ /* We don't want flush_tlb_func_* to run concurrently with us. */
+ if (IS_ENABLED(CONFIG_PROVE_LOCKING))
+ WARN_ON_ONCE(!irqs_disabled());
+
+ /*
+ * Verify that CR3 is what we think it is. This will catch
+ * hypothetical buggy code that directly switches to swapper_pg_dir
+ * without going through leave_mm() / switch_mm_irqs_off().
+ */
+ VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));

if (real_prev == next) {
- /*
- * There's nothing to do: we always keep the per-mm control
- * regs in sync with cpu_tlbstate.loaded_mm. Just
- * sanity-check mm_cpumask.
- */
- if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
- cpumask_set_cpu(cpu, mm_cpumask(next));
- return;
- }
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ next->context.ctx_id);
+
+ if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
+ /*
+ * There's nothing to do: we weren't lazy, and we
+ * aren't changing our mm. We don't need to flush
+ * anything, nor do we need to update CR3, CR4, or
+ * LDTR.
+ */
+ return;
+ }
+
+ /* Resume remote flushes and then read tlb_gen. */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+
+ if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) < next_tlb_gen) {
+ /*
+ * Ideally, we'd have a flush_tlb() variant that
+ * takes the known CR3 value as input. This would
+ * be faster on Xen PV and on hypothetical CPUs
+ * on which INVPCID is fast.
+ */
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+ next_tlb_gen);
+ write_cr3(__pa(next->pgd));
+
+ /*
+ * This gets called via leave_mm() in the idle path
+ * where RCU functions differently. Tracing normally
+ * uses RCU, so we have to call the tracepoint
+ * specially here.
+ */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
+ }

- if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
- * If our current stack is in vmalloc space and isn't
- * mapped in the new pgd, we'll double-fault. Forcibly
- * map it.
+ * We just exited lazy mode, which means that CR4 and/or LDTR
+ * may be stale. (Changes to the required CR4 and LDTR states
+ * are not reflected in tlb_gen.)
*/
- unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
-
- pgd_t *pgd = next->pgd + stack_pgd_index;
-
- if (unlikely(pgd_none(*pgd)))
- set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
- }
+ } else {
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
+ next->context.ctx_id);
+
+ if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+ /*
+ * If our current stack is in vmalloc space and isn't
+ * mapped in the new pgd, we'll double-fault. Forcibly
+ * map it.
+ */
+ unsigned int index = pgd_index(current_stack_pointer());
+ pgd_t *pgd = next->pgd + index;
+
+ if (unlikely(pgd_none(*pgd)))
+ set_pgd(pgd, init_mm.pgd[index]);
+ }

- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, atomic64_read(&next->context.tlb_gen));
+ /* Stop remote flushes for the previous mm */
+ if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
+ cpumask_clear_cpu(cpu, mm_cpumask(real_prev));

- WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
- cpumask_set_cpu(cpu, mm_cpumask(next));
+ VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));

- /*
- * Re-load page tables.
- *
- * This logic has an ordering constraint:
- *
- * CPU 0: Write to a PTE for 'next'
- * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
- * CPU 1: set bit 1 in next's mm_cpumask
- * CPU 1: load from the PTE that CPU 0 writes (implicit)
- *
- * We need to prevent an outcome in which CPU 1 observes
- * the new PTE value and CPU 0 observes bit 1 clear in
- * mm_cpumask. (If that occurs, then the IPI will never
- * be sent, and CPU 0's TLB will contain a stale entry.)
- *
- * The bad outcome can occur if either CPU's load is
- * reordered before that CPU's store, so both CPUs must
- * execute full barriers to prevent this from happening.
- *
- * Thus, switch_mm needs a full barrier between the
- * store to mm_cpumask and any operation that could load
- * from next->pgd. TLB fills are special and can happen
- * due to instruction fetches or for no reason at all,
- * and neither LOCK nor MFENCE orders them.
- * Fortunately, load_cr3() is serializing and gives the
- * ordering guarantee we need.
- */
- load_cr3(next->pgd);
+ /*
+ * Start remote flushes and then read tlb_gen.
+ */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);

- /*
- * This gets called via leave_mm() in the idle path where RCU
- * functions differently. Tracing normally uses RCU, so we have to
- * call the tracepoint specially here.
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, next_tlb_gen);
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ write_cr3(__pa(next->pgd));

- /* Stop flush ipis for the previous mm */
- WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
- real_prev != &init_mm);
- cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+ /*
+ * This gets called via leave_mm() in the idle path where RCU
+ * functions differently. Tracing normally uses RCU, so we
+ * have to call the tracepoint specially here.
+ */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
+ }

- /* Load per-mm CR4 and LDTR state */
load_mm_cr4(next);
switch_ldt(real_prev, next);
}
@@ -186,13 +209,13 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
loaded_mm->context.ctx_id);

- if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
+ if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
/*
- * leave_mm() is adequate to handle any type of flush, and
- * we would prefer not to receive further IPIs. leave_mm()
- * clears this CPU's bit in mm_cpumask().
+ * We're in lazy mode -- don't flush. We can get here on
+ * remote flushes due to races and on local flushes if a
+ * kernel thread coincidentally flushes the mm it's lazily
+ * still using.
*/
- leave_mm(smp_processor_id());
return;
}

@@ -203,6 +226,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* be handled can catch us all the way up, leaving no work for
* the second flush.
*/
+ trace_tlb_flush(reason, 0);
return;
}

@@ -304,6 +328,21 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
(info->end - info->start) >> PAGE_SHIFT);

if (is_uv_system()) {
+ /*
+ * This whole special case is confused. UV has a "Broadcast
+ * Assist Unit", which seems to be a fancy way to send IPIs.
+ * Back when x86 used an explicit TLB flush IPI, UV was
+ * optimized to use its own mechanism. These days, x86 uses
+ * smp_call_function_many(), but UV still uses a manual IPI,
+ * and that IPI's action is out of date -- it does a manual
+ * flush instead of calling flush_tlb_func_remote(). This
+ * means that the percpu tlb_gen variables won't be updated
+ * and we'll do pointless flushes on future context switches.
+ *
+ * Rather than hooking native_flush_tlb_others() here, I think
+ * that UV should be updated so that smp_call_function_many(),
+ * etc, are optimal on UV.
+ */
unsigned int cpu;

cpu = smp_processor_id();
@@ -363,6 +402,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,

if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
+
put_cpu();
}

@@ -371,8 +411,6 @@ static void do_flush_tlb_all(void *info)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
__flush_tlb_all();
- if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_LAZY)
- leave_mm(smp_processor_id());
}

void flush_tlb_all(void)
@@ -425,6 +463,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)

if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
flush_tlb_others(&batch->cpumask, &info);
+
cpumask_clear(&batch->cpumask);

put_cpu();
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 1d7a7213a310..0472790ec20b 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1005,14 +1005,12 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
/* Get the "official" set of cpus referring to our pagetable. */
if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
for_each_online_cpu(cpu) {
- if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
- && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
+ if (per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
continue;
smp_call_function_single(cpu, drop_mm_ref_this_cpu, mm, 1);
}
return;
}
- cpumask_copy(mask, mm_cpumask(mm));

/*
* It's possible that a vcpu may have a stale reference to our
@@ -1021,6 +1019,7 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
* look at its actual current cr3 value, and force it to flush
* if needed.
*/
+ cpumask_clear(mask);
for_each_online_cpu(cpu) {
if (per_cpu(xen_current_cr3, cpu) == __pa(mm->pgd))
cpumask_set_cpu(cpu, mask);
--
2.9.4

2017-06-29 15:55:10

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH v4 06/10] x86/mm: Stop calling leave_mm() in idle code

Now that lazy TLB suppresses all flush IPIs (as opposed to all but
the first), there's no need to leave_mm() when going idle.

This means we can get rid of the rcuidle hack in
switch_mm_irqs_off() and we can unexport leave_mm().

This also removes acpi_unlazy_tlb() from the x86 and ia64 headers,
since it has no callers any more.

Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/ia64/include/asm/acpi.h | 2 --
arch/x86/include/asm/acpi.h | 2 --
arch/x86/mm/tlb.c | 20 +++-----------------
drivers/acpi/processor_idle.c | 2 --
drivers/idle/intel_idle.c | 9 ++++-----
5 files changed, 7 insertions(+), 28 deletions(-)

diff --git a/arch/ia64/include/asm/acpi.h b/arch/ia64/include/asm/acpi.h
index a3d0211970e9..c86a947f5368 100644
--- a/arch/ia64/include/asm/acpi.h
+++ b/arch/ia64/include/asm/acpi.h
@@ -112,8 +112,6 @@ static inline void arch_acpi_set_pdc_bits(u32 *buf)
buf[2] |= ACPI_PDC_EST_CAPABILITY_SMP;
}

-#define acpi_unlazy_tlb(x)
-
#ifdef CONFIG_ACPI_NUMA
extern cpumask_t early_cpu_possible_map;
#define for_each_possible_early_cpu(cpu) \
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 2efc768e4362..562286fa151f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -150,8 +150,6 @@ static inline void disable_acpi(void) { }
extern int x86_acpi_numa_init(void);
#endif /* CONFIG_ACPI_NUMA */

-#define acpi_unlazy_tlb(x) leave_mm(x)
-
#ifdef CONFIG_ACPI_APEI
static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 0982c997d36f..2c1b8881e9d3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -50,7 +50,6 @@ void leave_mm(int cpu)

switch_mm(NULL, &init_mm, NULL);
}
-EXPORT_SYMBOL_GPL(leave_mm);

void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
@@ -117,15 +116,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
next_tlb_gen);
write_cr3(__pa(next->pgd));
-
- /*
- * This gets called via leave_mm() in the idle path
- * where RCU functions differently. Tracing normally
- * uses RCU, so we have to call the tracepoint
- * specially here.
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
- TLB_FLUSH_ALL);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
}

/*
@@ -167,13 +159,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, next);
write_cr3(__pa(next->pgd));

- /*
- * This gets called via leave_mm() in the idle path where RCU
- * functions differently. Tracing normally uses RCU, so we
- * have to call the tracepoint specially here.
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
- TLB_FLUSH_ALL);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
}

load_mm_cr4(next);
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 5c8aa9cf62d7..fe3d2a40f311 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -708,8 +708,6 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
static void acpi_idle_enter_bm(struct acpi_processor *pr,
struct acpi_processor_cx *cx, bool timer_bc)
{
- acpi_unlazy_tlb(smp_processor_id());
-
/*
* Must be done before busmaster disable as we might need to
* access HPET !
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 216d7ec88c0c..2ae43f59091d 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -912,16 +912,15 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
struct cpuidle_state *state = &drv->states[index];
unsigned long eax = flg2MWAIT(state->flags);
unsigned int cstate;
- int cpu = smp_processor_id();

cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;

/*
- * leave_mm() to avoid costly and often unnecessary wakeups
- * for flushing the user TLB's associated with the active mm.
+ * NB: if CPUIDLE_FLAG_TLB_FLUSHED is set, this idle transition
+ * will probably flush the TLB. It's not guaranteed to flush
+ * the TLB, though, so it's not clear that we can do anything
+ * useful with this knowledge.
*/
- if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
- leave_mm(cpu);

if (!(lapic_timer_reliable_states & (1 << (cstate))))
tick_broadcast_enter();
--
2.9.4

2017-06-30 12:44:26

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness

On Thu, 29 Jun, at 08:53:12AM, Andy Lutomirski wrote:
> *** Ingo, even if this misses 4.13, please apply the first patch before
> *** the merge window.
>
> There are three performance benefits here:
>
> 1. TLB flushing is slow. (I.e. the flush itself takes a while.)
> This avoids many of them when switching tasks by using PCID. In
> a stupid little benchmark I did, it saves about 100ns on my laptop
> per context switch. I'll try to improve that benchmark.
>
> 2. Mms that have been used recently on a given CPU might get to keep
> their TLB entries alive across process switches with this patch
> set. TLB fills are pretty fast on modern CPUs, but they're even
> faster when they don't happen.
>
> 3. Lazy TLB is way better. We used to do two stupid things when we
> ran kernel threads: we'd send IPIs to flush user contexts on their
> CPUs and then we'd write to CR3 for no particular reason as an excuse
> to stop further IPIs. With this patch, we do neither.

Heads up, I'm gonna queue this for a run on SUSE's performance test
grid.

Subject: [tip:x86/mm] x86/mm: Don't reenter flush_tlb_func_common()

Commit-ID: bc0d5a89fbe3c83ac45438d7ba88309f4713615d
Gitweb: http://git.kernel.org/tip/bc0d5a89fbe3c83ac45438d7ba88309f4713615d
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 29 Jun 2017 08:53:13 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 30 Jun 2017 10:12:35 +0200

x86/mm: Don't reenter flush_tlb_func_common()

It was historically possible to have two concurrent TLB flushes
targeting the same CPU: one initiated locally and one initiated
remotely. This can now cause an OOPS in leave_mm() at
arch/x86/mm/tlb.c:47:

if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
BUG();

with this call trace:
flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317

Without reentrancy, this OOPS is impossible: leave_mm() is only
called if we're not in TLBSTATE_OK, but then we're unexpectedly
in TLBSTATE_OK in leave_mm().

This can be caused by flush_tlb_func_remote() happening between
the two checks and calling leave_mm(), resulting in two consecutive
leave_mm() calls on the same CPU with no intervening switch_mm()
calls.

We never saw this OOPS before because the old leave_mm()
implementation didn't put us back in TLBSTATE_OK, so the assertion
didn't fire.

Nadav noticed the reentrancy issue in a different context, but
neither of us realized that it caused a problem yet.

Reported-by: Levin, Alexander (Sasha Levin) <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Fixes: 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm")
Link: http://lkml.kernel.org/r/855acf733268d521c9f2e191faee2dcc23a29729.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/tlb.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b2485d6..1cc4783 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -192,6 +192,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
static void flush_tlb_func_common(const struct flush_tlb_info *f,
bool local, enum tlb_flush_reason reason)
{
+ /* This code cannot presently handle being reentered. */
+ VM_WARN_ON(!irqs_disabled());
+
if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
leave_mm(smp_processor_id());
return;
@@ -297,8 +300,13 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
info.end = TLB_FLUSH_ALL;
}

- if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
+ if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+ VM_WARN_ON(irqs_disabled());
+ local_irq_disable();
flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+ local_irq_enable();
+ }
+
if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
put_cpu();
@@ -354,8 +362,13 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)

int cpu = get_cpu();

- if (cpumask_test_cpu(cpu, &batch->cpumask))
+ if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+ VM_WARN_ON(irqs_disabled());
+ local_irq_disable();
flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
+ local_irq_enable();
+ }
+
if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
flush_tlb_others(&batch->cpumask, &info);
cpumask_clear(&batch->cpumask);

Subject: [tip:x86/mm] x86/mm: Delete a big outdated comment about TLB flushing

Commit-ID: 8781fb7e9749da424e01daacd14834b674658c63
Gitweb: http://git.kernel.org/tip/8781fb7e9749da424e01daacd14834b674658c63
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 29 Jun 2017 08:53:14 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 30 Jun 2017 10:12:35 +0200

x86/mm: Delete a big outdated comment about TLB flushing

The comment describes the old explicit IPI-based flush logic, which
is long gone.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/55e44997e56086528140c5180f8337dc53fb7ffc.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/tlb.c | 36 ------------------------------------
1 file changed, 36 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1cc4783..014d07a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -153,42 +153,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
switch_ldt(real_prev, next);
}

-/*
- * The flush IPI assumes that a thread switch happens in this order:
- * [cpu0: the cpu that switches]
- * 1) switch_mm() either 1a) or 1b)
- * 1a) thread switch to a different mm
- * 1a1) set cpu_tlbstate to TLBSTATE_OK
- * Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
- * if cpu0 was in lazy tlb mode.
- * 1a2) update cpu active_mm
- * Now cpu0 accepts tlb flushes for the new mm.
- * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
- * Now the other cpus will send tlb flush ipis.
- * 1a4) change cr3.
- * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
- * Stop ipi delivery for the old mm. This is not synchronized with
- * the other cpus, but flush_tlb_func ignore flush ipis for the wrong
- * mm, and in the worst case we perform a superfluous tlb flush.
- * 1b) thread switch without mm change
- * cpu active_mm is correct, cpu0 already handles flush ipis.
- * 1b1) set cpu_tlbstate to TLBSTATE_OK
- * 1b2) test_and_set the cpu bit in cpu_vm_mask.
- * Atomically set the bit [other cpus will start sending flush ipis],
- * and test the bit.
- * 1b3) if the bit was 0: leave_mm was called, flush the tlb.
- * 2) switch %%esp, ie current
- *
- * The interrupt must handle 2 special cases:
- * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm.
- * - the cpu performs speculative tlb reads, i.e. even if the cpu only
- * runs in kernel space, the cpu could load tlb entries for user space
- * pages.
- *
- * The good news is that cpu_tlbstate is local to each cpu, no
- * write/read ordering problems.
- */
-
static void flush_tlb_func_common(const struct flush_tlb_info *f,
bool local, enum tlb_flush_reason reason)
{

2017-07-03 10:56:54

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86/mm: Try to preserve old TLB entries using PCID

On Thu, 29 Jun 2017, Andy Lutomirski wrote:
> ping-pong between two mms on the same CPU using eventfd:
> patched: 1.22µs
> patched, nopcid: 1.33µs
> unpatched: 1.34µs
>
> Same ping-pong, but now touch 512 pages (all zero-page to minimize
> cache misses) each iteration. dTLB misses are measured by
> dtlb_load_misses.miss_causes_a_walk:
> patched: 1.8µs 11M dTLB misses
> patched, nopcid: 6.2µs, 207M dTLB misses
> unpatched: 6.1µs, 190M dTLB misses
>
> Reviewed-by: Nadav Amit <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2017-07-05 08:57:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness


* Andy Lutomirski <[email protected]> wrote:

> *** Ingo, even if this misses 4.13, please apply the first patch before
> *** the merge window.

> Andy Lutomirski (10):
> x86/mm: Don't reenter flush_tlb_func_common()
> x86/mm: Delete a big outdated comment about TLB flushing
> x86/mm: Give each mm TLB flush generation a unique ID
> x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
> x86/mm: Rework lazy TLB mode and TLB freshness tracking
> x86/mm: Stop calling leave_mm() in idle code
> x86/mm: Disable PCID on 32-bit kernels
> x86/mm: Add nopcid to turn off PCID
> x86/mm: Enable CR4.PCIDE on supported systems
> x86/mm: Try to preserve old TLB entries using PCID

So this series is really nice, and the first two patches are already upstream, and
I've just applied all but the final patch to tip:x86/mm (out of caution - I'm a wimp).

That should already offer some improvements and enables the CR4 bit - but doesn't
actually use the PCID hardware yet.

I'll push it all out when it passes testing.

If it's all super stable I plan to tempt Linus with a late merge window pull
request for all these preparatory patches. (Unless he objects that is. Hint, hint.)

Any objections?

Thanks,

Ingo

Subject: [tip:x86/mm] x86/mm: Give each mm TLB flush generation a unique ID

Commit-ID: f39681ed0f48498b80455095376f11535feea332
Gitweb: http://git.kernel.org/tip/f39681ed0f48498b80455095376f11535feea332
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 29 Jun 2017 08:53:15 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 5 Jul 2017 10:52:56 +0200

x86/mm: Give each mm TLB flush generation a unique ID

This adds two new variables to mmu_context_t: ctx_id and tlb_gen.
ctx_id uniquely identifies the mm_struct and will never be reused.
For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic
count of the number of times that a TLB flush has been requested.
The pair (ctx_id, tlb_gen) can be used as an identifier for TLB
flush actions and will be used in subsequent patches to reliably
determine whether all needed TLB flushes have occurred on a given
CPU.

This patch is split out for ease of review. By itself, it has no
real effect other than creating and updating the new variables.
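
As a rough illustration of the bookkeeping described above, here is a small
user-space model of the (ctx_id, tlb_gen) pair. The names mirror the patch,
but the types, helpers, and the "CPU state" struct are simplified stand-ins,
not the kernel's actual code.

/*
 * User-space model of the (ctx_id, tlb_gen) scheme. Illustrative only:
 * the kernel stores the per-CPU side in cpu_tlbstate, not a plain struct.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct mm_context {
	uint64_t ctx_id;		/* never reused; identifies the mm */
	atomic_uint_fast64_t tlb_gen;	/* bumped before each flush request */
};

struct cpu_tlb_context {
	uint64_t ctx_id;	/* which mm the cached entries came from */
	uint64_t tlb_gen;	/* generation this CPU has caught up to */
};

static uint64_t inc_mm_tlb_gen(struct mm_context *ctx)
{
	/* Bump the generation; the caller then asks CPUs to catch up. */
	return atomic_fetch_add(&ctx->tlb_gen, 1) + 1;
}

static bool cpu_needs_flush(const struct cpu_tlb_context *cpu,
			    struct mm_context *ctx)
{
	/* Entries from a different mm are handled by switching CR3. */
	if (cpu->ctx_id != ctx->ctx_id)
		return true;
	/* Same mm: flush only if the CPU is behind the latest generation. */
	return cpu->tlb_gen < atomic_load(&ctx->tlb_gen);
}

int main(void)
{
	struct mm_context mm = { .ctx_id = 1 };
	struct cpu_tlb_context cpu = { .ctx_id = 1, .tlb_gen = 0 };

	atomic_init(&mm.tlb_gen, 0);
	inc_mm_tlb_gen(&mm);	/* someone changed the page tables */
	printf("needs flush: %d\n", cpu_needs_flush(&cpu, &mm));	/* 1 */
	cpu.tlb_gen = atomic_load(&mm.tlb_gen);	/* flushed and caught up */
	printf("needs flush: %d\n", cpu_needs_flush(&cpu, &mm));	/* 0 */
	return 0;
}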

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/413a91c24dab3ed0caa5f4e4d017d87b0857f920.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/mmu.h | 25 +++++++++++++++++++++++--
arch/x86/include/asm/mmu_context.h | 6 ++++++
arch/x86/include/asm/tlbflush.h | 18 ++++++++++++++++++
arch/x86/mm/tlb.c | 6 ++++--
4 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 79b647a..bb8c597 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -3,12 +3,28 @@

#include <linux/spinlock.h>
#include <linux/mutex.h>
+#include <linux/atomic.h>

/*
- * The x86 doesn't have a mmu context, but
- * we put the segment information here.
+ * x86 has arch-specific MMU state beyond what lives in mm_struct.
*/
typedef struct {
+ /*
+ * ctx_id uniquely identifies this mm_struct. A ctx_id will never
+ * be reused, and zero is not a valid ctx_id.
+ */
+ u64 ctx_id;
+
+ /*
+ * Any code that needs to do any sort of TLB flushing for this
+ * mm will first make its changes to the page tables, then
+ * increment tlb_gen, then flush. This lets the low-level
+ * flushing code keep track of what needs flushing.
+ *
+ * This is not used on Xen PV.
+ */
+ atomic64_t tlb_gen;
+
#ifdef CONFIG_MODIFY_LDT_SYSCALL
struct ldt_struct *ldt;
#endif
@@ -37,6 +53,11 @@ typedef struct {
#endif
} mm_context_t;

+#define INIT_MM_CONTEXT(mm) \
+ .context = { \
+ .ctx_id = 1, \
+ }
+
void leave_mm(int cpu);

#endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ecfcb66..ae19b9d 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -12,6 +12,9 @@
#include <asm/tlbflush.h>
#include <asm/paravirt.h>
#include <asm/mpx.h>
+
+extern atomic64_t last_mm_ctx_id;
+
#ifndef CONFIG_PARAVIRT
static inline void paravirt_activate_mm(struct mm_struct *prev,
struct mm_struct *next)
@@ -132,6 +135,9 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
static inline int init_new_context(struct task_struct *tsk,
struct mm_struct *mm)
{
+ mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
+ atomic64_set(&mm->context.tlb_gen, 0);
+
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
/* pkey 0 is the default and always allocated */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 50ea348..ad21353 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -57,6 +57,23 @@ static inline void invpcid_flush_all_nonglobals(void)
__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
}

+static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
+{
+ u64 new_tlb_gen;
+
+ /*
+ * Bump the generation count. This also serves as a full barrier
+ * that synchronizes with switch_mm(): callers are required to order
+ * their read of mm_cpumask after their writes to the paging
+ * structures.
+ */
+ smp_mb__before_atomic();
+ new_tlb_gen = atomic64_inc_return(&mm->context.tlb_gen);
+ smp_mb__after_atomic();
+
+ return new_tlb_gen;
+}
+
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else
@@ -262,6 +279,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm)
{
+ inc_mm_tlb_gen(mm);
cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
}

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 014d07a..14f4f8f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
* Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
*/

+atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
+
void leave_mm(int cpu)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -250,8 +252,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,

cpu = get_cpu();

- /* Synchronize with switch_mm. */
- smp_mb();
+ /* This is also a barrier that synchronizes with switch_mm(). */
+ inc_mm_tlb_gen(mm);

/* Should we flush just the requested range? */
if ((end != TLB_FLUSH_ALL) &&

Subject: [tip:x86/mm] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

Commit-ID: b0579ade7cd82391360e959cc844e50a160e8a96
Gitweb: http://git.kernel.org/tip/b0579ade7cd82391360e959cc844e50a160e8a96
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 29 Jun 2017 08:53:16 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 5 Jul 2017 10:52:56 +0200

x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

There are two kernel features that would benefit from tracking
how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
it up to date in real time:

- Lazy mm switching currently works by switching to init_mm when
it would otherwise flush. This is wasteful: there isn't fundamentally
any need to update CR3 at all when going lazy or when returning from
lazy mode, nor is there any need to receive flush IPIs at all. Instead,
we should just stop trying to keep the TLB coherent when we go lazy and,
when unlazying, check whether we missed any flushes.

- PCID will let us keep recent user contexts alive in the TLB. If we
start doing this, we need a way to decide whether those contexts are
up to date.

On some paravirt systems, remote TLBs can be flushed without IPIs.
This won't update the target CPUs' tlb_gens, which may cause
unnecessary local flushes later on. We can address this if it becomes
a problem by carefully updating the target CPU's tlb_gen directly.

By itself, this patch is a very minor optimization that avoids
unnecessary flushes when multiple TLB flushes targeting the same CPU
race. The complexity in this patch would not be worth it on its own,
but it will enable improved lazy TLB tracking and PCID.

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/1210fb244bc9cbe7677f7f0b72db4d359675f24b.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 43 +++++++++++++++--
arch/x86/mm/tlb.c | 102 +++++++++++++++++++++++++++++++++++++---
2 files changed, 135 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index ad21353..d7df54c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,11 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
#define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
#endif

+struct tlb_context {
+ u64 ctx_id;
+ u64 tlb_gen;
+};
+
struct tlb_state {
/*
* cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
@@ -97,6 +102,21 @@ struct tlb_state {
* disabling interrupts when modifying either one.
*/
unsigned long cr4;
+
+ /*
+ * This is a list of all contexts that might exist in the TLB.
+ * Since we don't yet use PCID, there is only one context.
+ *
+ * For each context, ctx_id indicates which mm the TLB's user
+ * entries came from. As an invariant, the TLB will never
+ * contain entries that are out-of-date as when that mm reached
+ * the tlb_gen in the list.
+ *
+ * To be clear, this means that it's legal for the TLB code to
+ * flush the TLB without updating tlb_gen. This can happen
+ * (for now, at least) due to paravirt remote flushes.
+ */
+ struct tlb_context ctxs[1];
};
DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

@@ -248,9 +268,26 @@ static inline void __flush_tlb_one(unsigned long addr)
* and page-granular flushes are available only on i486 and up.
*/
struct flush_tlb_info {
- struct mm_struct *mm;
- unsigned long start;
- unsigned long end;
+ /*
+ * We support several kinds of flushes.
+ *
+ * - Fully flush a single mm. .mm will be set, .end will be
+ * TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to
+ * which the IPI sender is trying to catch us up.
+ *
+ * - Partially flush a single mm. .mm will be set, .start and
+ * .end will indicate the range, and .new_tlb_gen will be set
+ * such that the changes between generation .new_tlb_gen-1 and
+ * .new_tlb_gen are entirely contained in the indicated range.
+ *
+ * - Fully flush all mms whose tlb_gens have been updated. .mm
+ * will be NULL, .end will be TLB_FLUSH_ALL, and .new_tlb_gen
+ * will be zero.
+ */
+ struct mm_struct *mm;
+ unsigned long start;
+ unsigned long end;
+ u64 new_tlb_gen;
};

#define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 14f4f8f..4e5a5dd 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -105,6 +105,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
}

this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, atomic64_read(&next->context.tlb_gen));

WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
cpumask_set_cpu(cpu, mm_cpumask(next));
@@ -155,25 +157,102 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
switch_ldt(real_prev, next);
}

+/*
+ * flush_tlb_func_common()'s memory ordering requirement is that any
+ * TLB fills that happen after we flush the TLB are ordered after we
+ * read active_mm's tlb_gen. We don't need any explicit barriers
+ * because all x86 flush operations are serializing and the
+ * atomic64_read operation won't be reordered by the compiler.
+ */
static void flush_tlb_func_common(const struct flush_tlb_info *f,
bool local, enum tlb_flush_reason reason)
{
+ /*
+ * We have three different tlb_gen values in here. They are:
+ *
+ * - mm_tlb_gen: the latest generation.
+ * - local_tlb_gen: the generation that this CPU has already caught
+ * up to.
+ * - f->new_tlb_gen: the generation that the requester of the flush
+ * wants us to catch up to.
+ */
+ struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
+ u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
+
/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());

+ VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ loaded_mm->context.ctx_id);
+
if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
+ /*
+ * leave_mm() is adequate to handle any type of flush, and
+ * we would prefer not to receive further IPIs. leave_mm()
+ * clears this CPU's bit in mm_cpumask().
+ */
leave_mm(smp_processor_id());
return;
}

- if (f->end == TLB_FLUSH_ALL) {
- local_flush_tlb();
- if (local)
- count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
- trace_tlb_flush(reason, TLB_FLUSH_ALL);
- } else {
+ if (unlikely(local_tlb_gen == mm_tlb_gen)) {
+ /*
+ * There's nothing to do: we're already up to date. This can
+ * happen if two concurrent flushes happen -- the first flush to
+ * be handled can catch us all the way up, leaving no work for
+ * the second flush.
+ */
+ return;
+ }
+
+ WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
+ WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
+
+ /*
+ * If we get to this point, we know that our TLB is out of date.
+ * This does not strictly imply that we need to flush (it's
+ * possible that f->new_tlb_gen <= local_tlb_gen), but we're
+ * going to need to flush in the very near future, so we might
+ * as well get it over with.
+ *
+ * The only question is whether to do a full or partial flush.
+ *
+ * We do a partial flush if requested and two extra conditions
+ * are met:
+ *
+ * 1. f->new_tlb_gen == local_tlb_gen + 1. We have an invariant that
+ * we've always done all needed flushes to catch up to
+ * local_tlb_gen. If, for example, local_tlb_gen == 2 and
+ * f->new_tlb_gen == 3, then we know that the flush needed to bring
+ * us up to date for tlb_gen 3 is the partial flush we're
+ * processing.
+ *
+ * As an example of why this check is needed, suppose that there
+ * are two concurrent flushes. The first is a full flush that
+ * changes context.tlb_gen from 1 to 2. The second is a partial
+ * flush that changes context.tlb_gen from 2 to 3. If they get
+ * processed on this CPU in reverse order, we'll see
+ * local_tlb_gen == 1, mm_tlb_gen == 3, and end != TLB_FLUSH_ALL.
+ * If we were to use __flush_tlb_single() and set local_tlb_gen to
+ * 3, we'd break the invariant: we'd update local_tlb_gen above
+ * 1 without the full flush that's needed for tlb_gen 2.
+ *
+ * 2. f->new_tlb_gen == mm_tlb_gen. This is purely an optimization.
+ * Partial TLB flushes are not all that much cheaper than full TLB
+ * flushes, so it seems unlikely that it would be a performance win
+ * to do a partial flush if that won't bring our TLB fully up to
+ * date. By doing a full flush instead, we can increase
+ * local_tlb_gen all the way to mm_tlb_gen and we can probably
+ * avoid another flush in the very near future.
+ */
+ if (f->end != TLB_FLUSH_ALL &&
+ f->new_tlb_gen == local_tlb_gen + 1 &&
+ f->new_tlb_gen == mm_tlb_gen) {
+ /* Partial flush */
unsigned long addr;
unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
+
addr = f->start;
while (addr < f->end) {
__flush_tlb_single(addr);
@@ -182,7 +261,16 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
if (local)
count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
trace_tlb_flush(reason, nr_pages);
+ } else {
+ /* Full flush. */
+ local_flush_tlb();
+ if (local)
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+ trace_tlb_flush(reason, TLB_FLUSH_ALL);
}
+
+ /* Both paths above update our state to mm_tlb_gen. */
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
}

static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
@@ -253,7 +341,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
cpu = get_cpu();

/* This is also a barrier that synchronizes with switch_mm(). */
- inc_mm_tlb_gen(mm);
+ info.new_tlb_gen = inc_mm_tlb_gen(mm);

/* Should we flush just the requested range? */
if ((end != TLB_FLUSH_ALL) &&

Subject: [tip:x86/mm] x86/mm: Rework lazy TLB mode and TLB freshness tracking

Commit-ID: 94b1b03b519b81c494900cb112aa00ed205cc2d9
Gitweb: http://git.kernel.org/tip/94b1b03b519b81c494900cb112aa00ed205cc2d9
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 29 Jun 2017 08:53:17 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 5 Jul 2017 10:52:57 +0200

x86/mm: Rework lazy TLB mode and TLB freshness tracking

x86's lazy TLB mode used to be fairly weak -- it would switch to
init_mm the first time it tried to flush a lazy TLB. This meant an
unnecessary CR3 write and, if the flush was remote, an unnecessary
IPI.

Rewrite it entirely. When we enter lazy mode, we simply remove the
CPU from mm_cpumask. This means that we need a way to figure out
whether we've missed a flush when we switch back out of lazy mode.
I use the tlb_gen machinery to track whether a context is up to
date.
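
As a sketch of the flow described above, here is a user-space model of the
lazy enter/unlazy steps. A bitmask stands in for mm_cpumask(), the "flush" is
just a printf, and all names besides tlb_gen are illustrative assumptions
rather than kernel interfaces.

/*
 * User-space model of the lazy-TLB flow: clear our bit when going lazy,
 * set it again when unlazying, then catch up if we missed any flushes.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct mm_model {
	atomic_uint_fast64_t cpumask;	/* one bit per CPU using this mm */
	atomic_uint_fast64_t tlb_gen;	/* bumped on every flush request */
};

struct cpu_model {
	int cpu;
	uint64_t local_tlb_gen;		/* generation this CPU has caught up to */
};

static void enter_lazy(struct cpu_model *c, struct mm_model *mm)
{
	/* Going lazy: stop receiving flush requests for this mm. */
	atomic_fetch_and(&mm->cpumask, ~(1ULL << c->cpu));
}

static void unlazy(struct cpu_model *c, struct mm_model *mm)
{
	uint64_t next_gen;

	/* Resume receiving flush requests, then check what we missed. */
	atomic_fetch_or(&mm->cpumask, 1ULL << c->cpu);
	next_gen = atomic_load(&mm->tlb_gen);
	if (c->local_tlb_gen < next_gen) {
		printf("cpu %d: missed flushes, flushing now\n", c->cpu);
		c->local_tlb_gen = next_gen;	/* "flush" and catch up */
	}
}

int main(void)
{
	struct mm_model mm;
	struct cpu_model c = { .cpu = 0, .local_tlb_gen = 0 };

	atomic_init(&mm.cpumask, 1ULL << c.cpu);
	atomic_init(&mm.tlb_gen, 0);

	enter_lazy(&c, &mm);			/* e.g. run a kernel thread */
	atomic_fetch_add(&mm.tlb_gen, 1);	/* a flush happens; no IPI for us */
	unlazy(&c, &mm);			/* switch back: catch up if stale */
	return 0;
}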

Note to reviewers: this patch, by itself, looks a bit odd. I'm
using an array of length 1 containing (ctx_id, tlb_gen) rather than
just storing tlb_gen, and making it an array isn't necessary yet.
I'm doing this because the next few patches add PCID support, and,
with PCID, we need ctx_id, and the array will end up with a length
greater than 1. Making it an array now means that there will be
less churn and therefore less stress on your eyeballs.

NB: This is dubious but, AFAICT, still correct on Xen and UV.
xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
patch changes the way that mm_cpumask() works. This should be okay,
since Xen *also* iterates all online CPUs to find all the CPUs it
needs to twiddle.

The UV tlbflush code is rather dated and should be changed.

Here are some benchmark results, done on a Skylake laptop at 2.3 GHz
(turbo off, intel_pstate requesting max performance) under KVM with
the guest using idle=poll (to avoid artifacts when bouncing between
CPUs). I haven't done any real statistics here -- I just ran them
in a loop and picked the fastest results that didn't look like
outliers. Unpatched means commit a4eb8b993554, so all the
bookkeeping overhead is gone.

MADV_DONTNEED; touch the page; switch CPUs using sched_setaffinity. In
an unpatched kernel, MADV_DONTNEED will send an IPI to the previous CPU.
This is intended to be a nearly worst-case test.

patched: 13.4µs
unpatched: 21.6µs

Vitaly's pthread_mmap microbenchmark with 8 threads (on four cores),
nrounds = 100, 256M data

patched: 1.1 seconds or so
unpatched: 1.9 seconds or so

The speedup on Vitaly's test appears to be because it spends a lot
of time blocked on mmap_sem, and this patch avoids sending IPIs to
blocked CPUs.
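
For reference, the MADV_DONTNEED test above looks roughly like the sketch
below. The CPU numbers, iteration count, and timing method are assumptions
for illustration, not the exact harness used for the figures quoted here;
it assumes at least two online CPUs.

/* Rough sketch of the MADV_DONTNEED + sched_setaffinity microbenchmark. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec t0, t1;
	int i, iters = 100000;

	if (p == MAP_FAILED)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		madvise(p, page, MADV_DONTNEED);	/* queue a TLB flush */
		p[0] = 1;				/* touch the page */
		pin_to_cpu(i & 1);			/* bounce between CPUs 0 and 1 */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns/iter\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters);
	return 0;
}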

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Banman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dimitri Sivanich <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/ddf2c92962339f4ba39d8fc41b853936ec0b44f1.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 6 +-
arch/x86/include/asm/tlbflush.h | 4 -
arch/x86/mm/init.c | 1 -
arch/x86/mm/tlb.c | 197 ++++++++++++++++++++++---------------
arch/x86/xen/mmu_pv.c | 5 +-
5 files changed, 124 insertions(+), 89 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ae19b9d..85f6b55 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -128,8 +128,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)

static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
- if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
- this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
+ int cpu = smp_processor_id();
+
+ if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
+ cpumask_clear_cpu(cpu, mm_cpumask(mm));
}

static inline int init_new_context(struct task_struct *tsk,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d7df54c..06e997a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -95,7 +95,6 @@ struct tlb_state {
* mode even if we've already switched back to swapper_pg_dir.
*/
struct mm_struct *loaded_mm;
- int state;

/*
* Access to this CR4 shadow and to H/W CR4 is protected by
@@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info);

-#define TLBSTATE_OK 1
-#define TLBSTATE_LAZY 2
-
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm)
{
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 673541e..4d353ef 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,6 @@ void __init zone_sizes_init(void)

DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
- .state = 0,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
};
EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4e5a5dd..0982c99 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -45,8 +45,8 @@ void leave_mm(int cpu)
if (loaded_mm == &init_mm)
return;

- if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
- BUG();
+ /* Warn if we're not lazy. */
+ WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));

switch_mm(NULL, &init_mm, NULL);
}
@@ -65,94 +65,117 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
- unsigned cpu = smp_processor_id();
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+ unsigned cpu = smp_processor_id();
+ u64 next_tlb_gen;

/*
- * NB: The scheduler will call us with prev == next when
- * switching from lazy TLB mode to normal mode if active_mm
- * isn't changing. When this happens, there is no guarantee
- * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
+ * NB: The scheduler will call us with prev == next when switching
+ * from lazy TLB mode to normal mode if active_mm isn't changing.
+ * When this happens, we don't assume that CR3 (and hence
+ * cpu_tlbstate.loaded_mm) matches next.
*
* NB: leave_mm() calls us with prev == NULL and tsk == NULL.
*/

- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ /* We don't want flush_tlb_func_* to run concurrently with us. */
+ if (IS_ENABLED(CONFIG_PROVE_LOCKING))
+ WARN_ON_ONCE(!irqs_disabled());
+
+ /*
+ * Verify that CR3 is what we think it is. This will catch
+ * hypothetical buggy code that directly switches to swapper_pg_dir
+ * without going through leave_mm() / switch_mm_irqs_off().
+ */
+ VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));

if (real_prev == next) {
- /*
- * There's nothing to do: we always keep the per-mm control
- * regs in sync with cpu_tlbstate.loaded_mm. Just
- * sanity-check mm_cpumask.
- */
- if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
- cpumask_set_cpu(cpu, mm_cpumask(next));
- return;
- }
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ next->context.ctx_id);
+
+ if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
+ /*
+ * There's nothing to do: we weren't lazy, and we
+ * aren't changing our mm. We don't need to flush
+ * anything, nor do we need to update CR3, CR4, or
+ * LDTR.
+ */
+ return;
+ }
+
+ /* Resume remote flushes and then read tlb_gen. */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+
+ if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) < next_tlb_gen) {
+ /*
+ * Ideally, we'd have a flush_tlb() variant that
+ * takes the known CR3 value as input. This would
+ * be faster on Xen PV and on hypothetical CPUs
+ * on which INVPCID is fast.
+ */
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+ next_tlb_gen);
+ write_cr3(__pa(next->pgd));
+
+ /*
+ * This gets called via leave_mm() in the idle path
+ * where RCU functions differently. Tracing normally
+ * uses RCU, so we have to call the tracepoint
+ * specially here.
+ */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
+ }

- if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
- * If our current stack is in vmalloc space and isn't
- * mapped in the new pgd, we'll double-fault. Forcibly
- * map it.
+ * We just exited lazy mode, which means that CR4 and/or LDTR
+ * may be stale. (Changes to the required CR4 and LDTR states
+ * are not reflected in tlb_gen.)
*/
- unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
-
- pgd_t *pgd = next->pgd + stack_pgd_index;
-
- if (unlikely(pgd_none(*pgd)))
- set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
- }
+ } else {
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
+ next->context.ctx_id);
+
+ if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+ /*
+ * If our current stack is in vmalloc space and isn't
+ * mapped in the new pgd, we'll double-fault. Forcibly
+ * map it.
+ */
+ unsigned int index = pgd_index(current_stack_pointer());
+ pgd_t *pgd = next->pgd + index;
+
+ if (unlikely(pgd_none(*pgd)))
+ set_pgd(pgd, init_mm.pgd[index]);
+ }

- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, atomic64_read(&next->context.tlb_gen));
+ /* Stop remote flushes for the previous mm */
+ if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
+ cpumask_clear_cpu(cpu, mm_cpumask(real_prev));

- WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
- cpumask_set_cpu(cpu, mm_cpumask(next));
+ VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));

- /*
- * Re-load page tables.
- *
- * This logic has an ordering constraint:
- *
- * CPU 0: Write to a PTE for 'next'
- * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
- * CPU 1: set bit 1 in next's mm_cpumask
- * CPU 1: load from the PTE that CPU 0 writes (implicit)
- *
- * We need to prevent an outcome in which CPU 1 observes
- * the new PTE value and CPU 0 observes bit 1 clear in
- * mm_cpumask. (If that occurs, then the IPI will never
- * be sent, and CPU 0's TLB will contain a stale entry.)
- *
- * The bad outcome can occur if either CPU's load is
- * reordered before that CPU's store, so both CPUs must
- * execute full barriers to prevent this from happening.
- *
- * Thus, switch_mm needs a full barrier between the
- * store to mm_cpumask and any operation that could load
- * from next->pgd. TLB fills are special and can happen
- * due to instruction fetches or for no reason at all,
- * and neither LOCK nor MFENCE orders them.
- * Fortunately, load_cr3() is serializing and gives the
- * ordering guarantee we need.
- */
- load_cr3(next->pgd);
+ /*
+ * Start remote flushes and then read tlb_gen.
+ */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);

- /*
- * This gets called via leave_mm() in the idle path where RCU
- * functions differently. Tracing normally uses RCU, so we have to
- * call the tracepoint specially here.
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, next_tlb_gen);
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ write_cr3(__pa(next->pgd));

- /* Stop flush ipis for the previous mm */
- WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
- real_prev != &init_mm);
- cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+ /*
+ * This gets called via leave_mm() in the idle path where RCU
+ * functions differently. Tracing normally uses RCU, so we
+ * have to call the tracepoint specially here.
+ */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
+ }

- /* Load per-mm CR4 and LDTR state */
load_mm_cr4(next);
switch_ldt(real_prev, next);
}
@@ -186,13 +209,13 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
loaded_mm->context.ctx_id);

- if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
+ if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
/*
- * leave_mm() is adequate to handle any type of flush, and
- * we would prefer not to receive further IPIs. leave_mm()
- * clears this CPU's bit in mm_cpumask().
+ * We're in lazy mode -- don't flush. We can get here on
+ * remote flushes due to races and on local flushes if a
+ * kernel thread coincidentally flushes the mm it's lazily
+ * still using.
*/
- leave_mm(smp_processor_id());
return;
}

@@ -203,6 +226,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* be handled can catch us all the way up, leaving no work for
* the second flush.
*/
+ trace_tlb_flush(reason, 0);
return;
}

@@ -304,6 +328,21 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
(info->end - info->start) >> PAGE_SHIFT);

if (is_uv_system()) {
+ /*
+ * This whole special case is confused. UV has a "Broadcast
+ * Assist Unit", which seems to be a fancy way to send IPIs.
+ * Back when x86 used an explicit TLB flush IPI, UV was
+ * optimized to use its own mechanism. These days, x86 uses
+ * smp_call_function_many(), but UV still uses a manual IPI,
+ * and that IPI's action is out of date -- it does a manual
+ * flush instead of calling flush_tlb_func_remote(). This
+ * means that the percpu tlb_gen variables won't be updated
+ * and we'll do pointless flushes on future context switches.
+ *
+ * Rather than hooking native_flush_tlb_others() here, I think
+ * that UV should be updated so that smp_call_function_many(),
+ * etc, are optimal on UV.
+ */
unsigned int cpu;

cpu = smp_processor_id();
@@ -363,6 +402,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,

if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
+
put_cpu();
}

@@ -371,8 +411,6 @@ static void do_flush_tlb_all(void *info)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
__flush_tlb_all();
- if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_LAZY)
- leave_mm(smp_processor_id());
}

void flush_tlb_all(void)
@@ -425,6 +463,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)

if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
flush_tlb_others(&batch->cpumask, &info);
+
cpumask_clear(&batch->cpumask);

put_cpu();
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 1d7a721..0472790 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1005,14 +1005,12 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
/* Get the "official" set of cpus referring to our pagetable. */
if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
for_each_online_cpu(cpu) {
- if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
- && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
+ if (per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
continue;
smp_call_function_single(cpu, drop_mm_ref_this_cpu, mm, 1);
}
return;
}
- cpumask_copy(mask, mm_cpumask(mm));

/*
* It's possible that a vcpu may have a stale reference to our
@@ -1021,6 +1019,7 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
* look at its actual current cr3 value, and force it to flush
* if needed.
*/
+ cpumask_clear(mask);
for_each_online_cpu(cpu) {
if (per_cpu(xen_current_cr3, cpu) == __pa(mm->pgd))
cpumask_set_cpu(cpu, mask);

Subject: [tip:x86/mm] x86/mm: Stop calling leave_mm() in idle code

Commit-ID: 43858b4f25cf0adc5c2ca9cf5ce5fdf2532941e5
Gitweb: http://git.kernel.org/tip/43858b4f25cf0adc5c2ca9cf5ce5fdf2532941e5
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 29 Jun 2017 08:53:18 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 5 Jul 2017 10:52:57 +0200

x86/mm: Stop calling leave_mm() in idle code

Now that lazy TLB suppresses all flush IPIs (as opposed to all but
the first), there's no need to leave_mm() when going idle.

This means we can get rid of the rcuidle hack in
switch_mm_irqs_off() and we can unexport leave_mm().

This also removes acpi_unlazy_tlb() from the x86 and ia64 headers,
since it has no callers any more.

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/03c699cfd6021e467be650d6b73deaccfe4b4bd7.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/ia64/include/asm/acpi.h | 2 --
arch/x86/include/asm/acpi.h | 2 --
arch/x86/mm/tlb.c | 20 +++-----------------
drivers/acpi/processor_idle.c | 2 --
drivers/idle/intel_idle.c | 9 ++++-----
5 files changed, 7 insertions(+), 28 deletions(-)

diff --git a/arch/ia64/include/asm/acpi.h b/arch/ia64/include/asm/acpi.h
index a3d0211..c86a947 100644
--- a/arch/ia64/include/asm/acpi.h
+++ b/arch/ia64/include/asm/acpi.h
@@ -112,8 +112,6 @@ static inline void arch_acpi_set_pdc_bits(u32 *buf)
buf[2] |= ACPI_PDC_EST_CAPABILITY_SMP;
}

-#define acpi_unlazy_tlb(x)
-
#ifdef CONFIG_ACPI_NUMA
extern cpumask_t early_cpu_possible_map;
#define for_each_possible_early_cpu(cpu) \
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 2efc768..562286f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -150,8 +150,6 @@ static inline void disable_acpi(void) { }
extern int x86_acpi_numa_init(void);
#endif /* CONFIG_ACPI_NUMA */

-#define acpi_unlazy_tlb(x) leave_mm(x)
-
#ifdef CONFIG_ACPI_APEI
static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 0982c99..2c1b888 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -50,7 +50,6 @@ void leave_mm(int cpu)

switch_mm(NULL, &init_mm, NULL);
}
-EXPORT_SYMBOL_GPL(leave_mm);

void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
@@ -117,15 +116,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
next_tlb_gen);
write_cr3(__pa(next->pgd));
-
- /*
- * This gets called via leave_mm() in the idle path
- * where RCU functions differently. Tracing normally
- * uses RCU, so we have to call the tracepoint
- * specially here.
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
- TLB_FLUSH_ALL);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
}

/*
@@ -167,13 +159,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, next);
write_cr3(__pa(next->pgd));

- /*
- * This gets called via leave_mm() in the idle path where RCU
- * functions differently. Tracing normally uses RCU, so we
- * have to call the tracepoint specially here.
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
- TLB_FLUSH_ALL);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
}

load_mm_cr4(next);
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 5c8aa9c..fe3d2a4 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -708,8 +708,6 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
static void acpi_idle_enter_bm(struct acpi_processor *pr,
struct acpi_processor_cx *cx, bool timer_bc)
{
- acpi_unlazy_tlb(smp_processor_id());
-
/*
* Must be done before busmaster disable as we might need to
* access HPET !
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 216d7ec..2ae43f5 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -912,16 +912,15 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
struct cpuidle_state *state = &drv->states[index];
unsigned long eax = flg2MWAIT(state->flags);
unsigned int cstate;
- int cpu = smp_processor_id();

cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;

/*
- * leave_mm() to avoid costly and often unnecessary wakeups
- * for flushing the user TLB's associated with the active mm.
+ * NB: if CPUIDLE_FLAG_TLB_FLUSHED is set, this idle transition
+ * will probably flush the TLB. It's not guaranteed to flush
+ * the TLB, though, so it's not clear that we can do anything
+ * useful with this knowledge.
*/
- if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
- leave_mm(cpu);

if (!(lapic_timer_reliable_states & (1 << (cstate))))
tick_broadcast_enter();

Subject: [tip:x86/mm] x86/mm: Add the 'nopcid' boot option to turn off PCID

Commit-ID: 0790c9aad84901ca1bdc14746175549c8b5da215
Gitweb: http://git.kernel.org/tip/0790c9aad84901ca1bdc14746175549c8b5da215
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 29 Jun 2017 08:53:20 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 5 Jul 2017 10:52:57 +0200

x86/mm: Add the 'nopcid' boot option to turn off PCID

The parameter is only present on x86_64 systems to save a few bytes,
as PCID is always disabled on x86_32.

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/8bbb2e65bcd249a5f18bfb8128b4689f08ac2b60.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 2 ++
arch/x86/kernel/cpu/common.c | 18 ++++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9b0b3de..7037a0f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2657,6 +2657,8 @@
nopat [X86] Disable PAT (page attribute table extension of
pagetables) support.

+ nopcid [X86-64] Disable the PCID cpu feature.
+
norandmaps Don't use address space randomization. Equivalent to
echo 0 > /proc/sys/kernel/randomize_va_space

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c8b3987..904485e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -168,6 +168,24 @@ static int __init x86_mpx_setup(char *s)
}
__setup("nompx", x86_mpx_setup);

+#ifdef CONFIG_X86_64
+static int __init x86_pcid_setup(char *s)
+{
+ /* require an exact match without trailing characters */
+ if (strlen(s))
+ return 0;
+
+ /* do not emit a message if the feature is not present */
+ if (!boot_cpu_has(X86_FEATURE_PCID))
+ return 1;
+
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+ pr_info("nopcid: PCID feature disabled\n");
+ return 1;
+}
+__setup("nopcid", x86_pcid_setup);
+#endif
+
static int __init x86_noinvpcid_setup(char *s)
{
/* noinvpcid doesn't accept parameters */

Subject: [tip:x86/mm] x86/mm: Enable CR4.PCIDE on supported systems

Commit-ID: 660da7c9228f685b2ebe664f9fd69aaddcc420b5
Gitweb: http://git.kernel.org/tip/660da7c9228f685b2ebe664f9fd69aaddcc420b5
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 29 Jun 2017 08:53:21 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 5 Jul 2017 10:52:58 +0200

x86/mm: Enable CR4.PCIDE on supported systems

We can use PCID if the CPU has PCID and PGE and we're not on Xen.

By itself, this has no effect. A followup patch will start using PCID.
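
For the curious, the two CPUID bits involved can be inspected from user space
with a few lines of C; this is purely illustrative, since the kernel itself
uses its cpu_has()/boot_cpu_has() helpers rather than raw CPUID.

/* Check CPUID.01H for PCID (ECX bit 17) and PGE (EDX bit 13). x86 only. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("PCID: %s\n", (ecx & (1u << 17)) ? "yes" : "no");
	printf("PGE:  %s\n", (edx & (1u << 13)) ? "yes" : "no");
	return 0;
}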

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/6327ecd907b32f79d5aa0d466f04503bbec5df88.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 8 ++++++++
arch/x86/kernel/cpu/common.c | 22 ++++++++++++++++++++++
arch/x86/xen/enlighten_pv.c | 6 ++++++
3 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 06e997a..6397275 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -243,6 +243,14 @@ static inline void __flush_tlb_all(void)
__flush_tlb_global();
else
__flush_tlb();
+
+ /*
+ * Note: if we somehow had PCID but not PGE, then this wouldn't work --
+ * we'd end up flushing kernel translations for the current ASID but
+ * we might fail to flush kernel translations for other cached ASIDs.
+ *
+ * To avoid this issue, we force PCID off if PGE is off.
+ */
}

static inline void __flush_tlb_one(unsigned long addr)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 904485e..b95cd94 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -329,6 +329,25 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c)
}
}

+static void setup_pcid(struct cpuinfo_x86 *c)
+{
+ if (cpu_has(c, X86_FEATURE_PCID)) {
+ if (cpu_has(c, X86_FEATURE_PGE)) {
+ cr4_set_bits(X86_CR4_PCIDE);
+ } else {
+ /*
+ * flush_tlb_all(), as currently implemented, won't
+ * work if PCID is on but PGE is not. Since that
+ * combination doesn't exist on real hardware, there's
+ * no reason to try to fully support it, but it's
+ * polite to avoid corrupting data if we're on
+ * an improperly configured VM.
+ */
+ clear_cpu_cap(c, X86_FEATURE_PCID);
+ }
+ }
+}
+
/*
* Protection Keys are not available in 32-bit mode.
*/
@@ -1143,6 +1162,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_smep(c);
setup_smap(c);

+ /* Set up PCID */
+ setup_pcid(c);
+
/*
* The vendor-specific functions might have changed features.
* Now we do "generic changes."
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index f33eef4..a136aac 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -295,6 +295,12 @@ static void __init xen_init_capabilities(void)
setup_clear_cpu_cap(X86_FEATURE_ACC);
setup_clear_cpu_cap(X86_FEATURE_X2APIC);

+ /*
+ * Xen PV would need some work to support PCID: CR3 handling as well
+ * as xen_flush_tlb_others() would need updating.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+
if (!xen_initial_domain())
setup_clear_cpu_cap(X86_FEATURE_ACPI);


Subject: [tip:x86/mm] x86/mm: Disable PCID on 32-bit kernels

Commit-ID: cba4671af7550e008f7a7835f06df0763825bf3e
Gitweb: http://git.kernel.org/tip/cba4671af7550e008f7a7835f06df0763825bf3e
Author: Andy Lutomirski <[email protected]>
AuthorDate: Thu, 29 Jun 2017 08:53:19 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 5 Jul 2017 10:52:57 +0200

x86/mm: Disable PCID on 32-bit kernels

32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode. Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Nadav Amit <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/2e391769192a4d31b808410c383c6bf0734bc6ea.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/disabled-features.h | 4 +++-
arch/x86/kernel/cpu/bugs.c | 8 ++++++++
2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dff775..c10c912 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -21,11 +21,13 @@
# define DISABLE_K6_MTRR (1<<(X86_FEATURE_K6_MTRR & 31))
# define DISABLE_CYRIX_ARR (1<<(X86_FEATURE_CYRIX_ARR & 31))
# define DISABLE_CENTAUR_MCR (1<<(X86_FEATURE_CENTAUR_MCR & 31))
+# define DISABLE_PCID 0
#else
# define DISABLE_VME 0
# define DISABLE_K6_MTRR 0
# define DISABLE_CYRIX_ARR 0
# define DISABLE_CENTAUR_MCR 0
+# define DISABLE_PCID (1<<(X86_FEATURE_PCID & 31))
#endif /* CONFIG_X86_64 */

#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
@@ -49,7 +51,7 @@
#define DISABLED_MASK1 0
#define DISABLED_MASK2 0
#define DISABLED_MASK3 (DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR)
-#define DISABLED_MASK4 0
+#define DISABLED_MASK4 (DISABLE_PCID)
#define DISABLED_MASK5 0
#define DISABLED_MASK6 0
#define DISABLED_MASK7 0
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0af86d9..db68488 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -21,6 +21,14 @@

void __init check_bugs(void)
{
+#ifdef CONFIG_X86_32
+ /*
+ * Regardless of whether PCID is enumerated, the SDM says
+ * that it can't be enabled in 32-bit mode.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+#endif
+
identify_boot_cpu();

if (!IS_ENABLED(CONFIG_SMP)) {

2017-07-05 10:58:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm: Give each mm TLB flush generation a unique ID

On Wed, Jul 05, 2017 at 03:31:03AM -0700, tip-bot for Andy Lutomirski wrote:

> @@ -132,6 +135,9 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> static inline int init_new_context(struct task_struct *tsk,
> struct mm_struct *mm)
> {
> + mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);

You could use atomic64_inc_return_relaxed() here; but since it's
x86-specific there is no difference.

> + atomic64_set(&mm->context.tlb_gen, 0);
> +
> #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
> /* pkey 0 is the default and always allocated */

> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 50ea348..ad21353 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -57,6 +57,23 @@ static inline void invpcid_flush_all_nonglobals(void)
> __invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
> }
>
> +static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
> +{
> + u64 new_tlb_gen;
> +
> + /*
> + * Bump the generation count. This also serves as a full barrier
> + * that synchronizes with switch_mm(): callers are required to order
> + * their read of mm_cpumask after their writes to the paging
> + * structures.
> + */
> + smp_mb__before_atomic();
> + new_tlb_gen = atomic64_inc_return(&mm->context.tlb_gen);
> + smp_mb__after_atomic();

that's wrong... smp_mb__{before,after}_atomic() are entirely superfluous
here:

- they're no-ops on x86
- atomic_*_return() is already fully serializing
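
In other words, the barrier calls could presumably just be dropped and the
ordering requirement documented instead, something like the sketch below
(untested; the comment wording is illustrative, not taken from any posted
patch):

	static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
	{
		/*
		 * Bump the generation count.  atomic64_inc_return() is fully
		 * ordered by itself, so callers still get the guarantee that
		 * their writes to the paging structures are ordered before
		 * their subsequent read of mm_cpumask().
		 */
		return atomic64_inc_return(&mm->context.tlb_gen);
	}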


2017-07-05 12:18:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86/mm: Try to preserve old TLB entries using PCID

On Thu, Jun 29, 2017 at 08:53:22AM -0700, Andy Lutomirski wrote:
> @@ -104,18 +140,20 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>
> /* Resume remote flushes and then read tlb_gen. */
> cpumask_set_cpu(cpu, mm_cpumask(next));

Barriers should have a comment... what is being ordered here against
what?

> + smp_mb__after_atomic();
> next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>
> + if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
> + next_tlb_gen) {
> /*
> * Ideally, we'd have a flush_tlb() variant that
> * takes the known CR3 value as input. This would
> * be faster on Xen PV and on hypothetical CPUs
> * on which INVPCID is fast.
> */
> + this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
> next_tlb_gen);
> + write_cr3(__pa(next->pgd) | prev_asid);
> trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
> TLB_FLUSH_ALL);
> }

> @@ -152,14 +190,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> * Start remote flushes and then read tlb_gen.
> */
> cpumask_set_cpu(cpu, mm_cpumask(next));
> + smp_mb__after_atomic();

idem

> next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>
> + choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
>
> + if (need_flush) {
> + this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
> + this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
> + write_cr3(__pa(next->pgd) | new_asid);
> + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
> + TLB_FLUSH_ALL);
> + } else {
> + /* The new ASID is already up to date. */
> + write_cr3(__pa(next->pgd) | new_asid | CR3_NOFLUSH);
> + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
> + }
> +
> + this_cpu_write(cpu_tlbstate.loaded_mm, next);
> + this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
> }
>
> load_mm_cr4(next);

2017-07-05 12:25:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86/mm: Try to preserve old TLB entries using PCID

On Thu, Jun 29, 2017 at 08:53:22AM -0700, Andy Lutomirski wrote:
> +static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> + u16 *new_asid, bool *need_flush)
> +{
> + u16 asid;
> +
> + if (!static_cpu_has(X86_FEATURE_PCID)) {
> + *new_asid = 0;
> + *need_flush = true;
> + return;
> + }
> +
> + for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
> + if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
> + next->context.ctx_id)
> + continue;
> +
> + *new_asid = asid;
> + *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
> + next_tlb_gen);
> + return;
> + }
> +
> + /*
> + * We don't currently own an ASID slot on this CPU.
> + * Allocate a slot.
> + */
> + *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;

So this basically round-robins (RR) the ASID slots. Have you tried slightly
more complex replacement policies like CLOCK?

> + if (*new_asid >= TLB_NR_DYN_ASIDS) {
> + *new_asid = 0;
> + this_cpu_write(cpu_tlbstate.next_asid, 1);
> + }
> + *need_flush = true;
> +}

2017-07-05 16:05:02

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86/mm: Try to preserve old TLB entries using PCID

On Wed, Jul 5, 2017 at 5:18 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jun 29, 2017 at 08:53:22AM -0700, Andy Lutomirski wrote:
>> @@ -104,18 +140,20 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>>
>> /* Resume remote flushes and then read tlb_gen. */
>> cpumask_set_cpu(cpu, mm_cpumask(next));
>
> Barriers should have a comment... what is being ordered here against
> what?

How's this comment?

/*
* Resume remote flushes and then read tlb_gen. We need to do
* it in this order: any inc_mm_tlb_gen() caller that writes a
* larger tlb_gen than we read here must see our cpu set in
* mm_cpumask() so that it will know to flush us. The barrier
* here synchronizes with inc_mm_tlb_gen().
*/

2017-07-05 16:10:25

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86/mm: Try to preserve old TLB entries using PCID

On Wed, Jul 5, 2017 at 5:25 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jun 29, 2017 at 08:53:22AM -0700, Andy Lutomirski wrote:
>> +static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>> + u16 *new_asid, bool *need_flush)
>> +{
>> + u16 asid;
>> +
>> + if (!static_cpu_has(X86_FEATURE_PCID)) {
>> + *new_asid = 0;
>> + *need_flush = true;
>> + return;
>> + }
>> +
>> + for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
>> + if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
>> + next->context.ctx_id)
>> + continue;
>> +
>> + *new_asid = asid;
>> + *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
>> + next_tlb_gen);
>> + return;
>> + }
>> +
>> + /*
>> + * We don't currently own an ASID slot on this CPU.
>> + * Allocate a slot.
>> + */
>> + *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
>
> So this basically RR the ASID slots. Have you tried slightly more
> complex replacement policies like CLOCK ?

No, mainly because I'm lazy and because CLOCK requires scavenging a
bit. (Which we can certainly do, but it will further complicate the
code.) It could be worth playing with better replacement algorithms
as a followup, though.

I've also considered a slight elaboration of RR in which we make sure
not to reuse the most recent ASID slot, which would guarantee that, if
we switch from task A to B and back to A, we don't flush on the way
back to A. (Currently, if B is not in the cache, there's a 1/6 chance
we'll flush on the way back.)
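
A minimal sketch of that elaboration, assuming the caller passes in the ASID
slot it is switching away from (untested and illustrative only -- the wrap
handling differs from the posted round-robin code):

	static u16 choose_victim_asid(u16 prev_asid)
	{
		u16 asid = this_cpu_read(cpu_tlbstate.next_asid);

		/* Never recycle the slot we are switching away from. */
		if (asid == prev_asid)
			asid = (asid + 1) % TLB_NR_DYN_ASIDS;

		this_cpu_write(cpu_tlbstate.next_asid,
			       (asid + 1) % TLB_NR_DYN_ASIDS);
		return asid;
	}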

2017-07-05 16:53:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness

On Wed, Jul 5, 2017 at 1:56 AM, Ingo Molnar <[email protected]> wrote:
>
> If it's all super stable I plan to tempt Linus with a late merge window pull
> request for all these preparatory patches. (Unless he objects that is. Hint, hint.)

I don't think I'll object. At some point the best testing is "lots of users".

TLB issues are a bitch to debug, but at the same time this is clearly
a "..but at some point we need to bite the bullet" case. I doubt the
series is going to get a lot better.

But yes, please do give it as much testing as humanly possible even
without the wider coverage by random people.

Linus

2017-07-05 17:02:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86/mm: Try to preserve old TLB entries using PCID

On Wed, Jul 05, 2017 at 09:04:39AM -0700, Andy Lutomirski wrote:
> On Wed, Jul 5, 2017 at 5:18 AM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, Jun 29, 2017 at 08:53:22AM -0700, Andy Lutomirski wrote:
> >> @@ -104,18 +140,20 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> >>
> >> /* Resume remote flushes and then read tlb_gen. */
> >> cpumask_set_cpu(cpu, mm_cpumask(next));
> >
> > Barriers should have a comment... what is being ordered here against
> > what?
>
> How's this comment?
>
> /*
> * Resume remote flushes and then read tlb_gen. We need to do
> * it in this order: any inc_mm_tlb_gen() caller that writes a
> * larger tlb_gen than we read here must see our cpu set in
> * mm_cpumask() so that it will know to flush us. The barrier
> * here synchronizes with inc_mm_tlb_gen().
> */

Slightly confusing, you mean this, right?


	switch_mm_irqs_off():                                      inc_mm_tlb_gen() caller:

	cpumask_set_cpu(cpu, mm_cpumask());                        inc_mm_tlb_gen();

	MB                                                         MB

	next_tlb_gen = atomic64_read(&next->context.tlb_gen);     flush_tlb_others(mm_cpumask());


which seems to make sense.

2017-07-11 11:32:37

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness

On Fri, 30 Jun, at 01:44:22PM, Matt Fleming wrote:
> On Thu, 29 Jun, at 08:53:12AM, Andy Lutomirski wrote:
> > *** Ingo, even if this misses 4.13, please apply the first patch before
> > *** the merge window.
> >
> > There are three performance benefits here:
> >
> > 1. TLB flushing is slow. (I.e. the flush itself takes a while.)
> > This avoids many of them when switching tasks by using PCID. In
> > a stupid little benchmark I did, it saves about 100ns on my laptop
> > per context switch. I'll try to improve that benchmark.
> >
> > 2. Mms that have been used recently on a given CPU might get to keep
> > their TLB entries alive across process switches with this patch
> > set. TLB fills are pretty fast on modern CPUs, but they're even
> > faster when they don't happen.
> >
> > 3. Lazy TLB is way better. We used to do two stupid things when we
> > ran kernel threads: we'd send IPIs to flush user contexts on their
> > CPUs and then we'd write to CR3 for no particular reason as an excuse
> > to stop further IPIs. With this patch, we do neither.
>
> Heads up, I'm gonna queue this for a run on SUSE's performance test
> grid.

FWIW, I didn't see any change in performance with this series on a
PCID-capable machine. On the plus side, I didn't see any weird-looking
bugs either.

Are your benchmarks available anywhere?

2017-07-11 15:01:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness

On Tue, Jul 11, 2017 at 4:32 AM, Matt Fleming <[email protected]> wrote:
> On Fri, 30 Jun, at 01:44:22PM, Matt Fleming wrote:
>> On Thu, 29 Jun, at 08:53:12AM, Andy Lutomirski wrote:
>> > *** Ingo, even if this misses 4.13, please apply the first patch before
>> > *** the merge window.
>> >
>> > There are three performance benefits here:
>> >
>> > 1. TLB flushing is slow. (I.e. the flush itself takes a while.)
>> > This avoids many of them when switching tasks by using PCID. In
>> > a stupid little benchmark I did, it saves about 100ns on my laptop
>> > per context switch. I'll try to improve that benchmark.
>> >
>> > 2. Mms that have been used recently on a given CPU might get to keep
>> > their TLB entries alive across process switches with this patch
>> > set. TLB fills are pretty fast on modern CPUs, but they're even
>> > faster when they don't happen.
>> >
>> > 3. Lazy TLB is way better. We used to do two stupid things when we
>> > ran kernel threads: we'd send IPIs to flush user contexts on their
>> > CPUs and then we'd write to CR3 for no particular reason as an excuse
>> > to stop further IPIs. With this patch, we do neither.
>>
>> Heads up, I'm gonna queue this for a run on SUSE's performance test
>> grid.
>
> FWIW, I didn't see any change in performance with this series on a
> PCID-capable machine. On the plus side, I didn't see any weird-looking
> bugs either.
>
> Are your benchmarks available anywhere?

https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/

I did:

$ ./context_switch_latency_64 0 process same

and

$ ./madvise_bounce_64 10k [IIRC -- it might have been a different loop count]

--Andy

2017-07-13 19:36:10

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness

On Tue, 11 Jul, at 08:00:47AM, Andy Lutomirski wrote:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/
>
> I did:
>
> $ ./context_switch_latency_64 0 process same

Ah, that's better. I see about a 3.3% speedup with your patches when
running the context-switch benchmark.

2017-07-17 09:57:23

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness

On Wed, Jul 05, 2017 at 10:56:57AM +0200, Ingo Molnar wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
> > *** Ingo, even if this misses 4.13, please apply the first patch before
> > *** the merge window.
>
> > Andy Lutomirski (10):
> > x86/mm: Don't reenter flush_tlb_func_common()
> > x86/mm: Delete a big outdated comment about TLB flushing
> > x86/mm: Give each mm TLB flush generation a unique ID
> > x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
> > x86/mm: Rework lazy TLB mode and TLB freshness tracking
> > x86/mm: Stop calling leave_mm() in idle code
> > x86/mm: Disable PCID on 32-bit kernels
> > x86/mm: Add nopcid to turn off PCID
> > x86/mm: Enable CR4.PCIDE on supported systems
> > x86/mm: Try to preserve old TLB entries using PCID
>
> So this series is really nice, and the first two patches are already upstream, and
> I've just applied all but the final patch to tip:x86/mm (out of caution - I'm a wimp).
>
> That should already offer some improvements and enables the CR4 bit - but doesn't
> actually use the PCID hardware yet.
>
> I'll push it all out when it passes testing.
>
> If it's all super stable I plan to tempt Linus with a late merge window pull
> request for all these preparatory patches. (Unless he objects that is. Hint, hint.)
>
> Any objections?
>

What was the final verdict here? I have a patch ready that should be layered
on top which will need a backport. PCID support does not appear to have
made it in this merge window so I'm wondering if I should send the patch
as-is for placement on top of Andy's work or go with the backport and
apply a follow-on patch after Andy's work gets merged.

Thanks.

--
Mel Gorman
SUSE Labs

2017-07-17 15:06:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness


* Mel Gorman <[email protected]> wrote:

> On Wed, Jul 05, 2017 at 10:56:57AM +0200, Ingo Molnar wrote:
> >
> > * Andy Lutomirski <[email protected]> wrote:
> >
> > > *** Ingo, even if this misses 4.13, please apply the first patch before
> > > *** the merge window.
> >
> > > Andy Lutomirski (10):
> > > x86/mm: Don't reenter flush_tlb_func_common()
> > > x86/mm: Delete a big outdated comment about TLB flushing
> > > x86/mm: Give each mm TLB flush generation a unique ID
> > > x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
> > > x86/mm: Rework lazy TLB mode and TLB freshness tracking
> > > x86/mm: Stop calling leave_mm() in idle code
> > > x86/mm: Disable PCID on 32-bit kernels
> > > x86/mm: Add nopcid to turn off PCID
> > > x86/mm: Enable CR4.PCIDE on supported systems
> > > x86/mm: Try to preserve old TLB entries using PCID
> >
> > So this series is really nice, and the first two patches are already upstream, and
> > I've just applied all but the final patch to tip:x86/mm (out of caution - I'm a wimp).
> >
> > That should already offer some improvements and enables the CR4 bit - but doesn't
> > actually use the PCID hardware yet.
> >
> > I'll push it all out when it passes testing.
> >
> > If it's all super stable I plan to tempt Linus with a late merge window pull
> > request for all these preparatory patches. (Unless he objects that is. Hint, hint.)
> >
> > Any objections?
> >
>
> What was the final verdict here? I have a patch ready that should be layered
> on top which will need a backport. PCID support does not appear to have
> made it in this merge window so I'm wondering if I should send the patch
> as-is for placement on top of Andy's work or go with the backport and
> apply a follow-on patch after Andy's work gets merged.

It's en route for v4.14 - it narrowly missed v4.13.

Thanks,

Ingo

2017-07-17 15:56:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness

On Mon, Jul 17, 2017 at 05:06:25PM +0200, Ingo Molnar wrote:
> > > I'll push it all out when it passes testing.
> > >
> > > If it's all super stable I plan to tempt Linus with a late merge window pull
> > > request for all these preparatory patches. (Unless he objects that is. Hint, hint.)
> > >
> > > Any objections?
> > >
> >
> > What was the final verdict here? I have a patch ready that should be layered
> > on top which will need a backport. PCID support does not appear to have
> > made it in this merge window so I'm wondering if I should send the patch
> > as-is for placement on top of Andy's work or go with the backport and
> > apply a follow-on patch after Andy's work gets merged.
>
> It's en route for v4.14 - it narrowly missed v4.13.
>

Grand. I sent out a version that doesn't depend on Andy's work to Andrew
as it's purely a mm patch. If that passes inspection then I'll send a
follow-on patch to apply on top of the PCID work.

Thanks Ingo.

--
Mel Gorman
SUSE Labs

2017-07-18 08:53:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86/mm: Try to preserve old TLB entries using PCID


* Peter Zijlstra <[email protected]> wrote:

> On Wed, Jul 05, 2017 at 09:04:39AM -0700, Andy Lutomirski wrote:
> > On Wed, Jul 5, 2017 at 5:18 AM, Peter Zijlstra <[email protected]> wrote:
> > > On Thu, Jun 29, 2017 at 08:53:22AM -0700, Andy Lutomirski wrote:
> > >> @@ -104,18 +140,20 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > >>
> > >> /* Resume remote flushes and then read tlb_gen. */
> > >> cpumask_set_cpu(cpu, mm_cpumask(next));
> > >
> > > Barriers should have a comment... what is being ordered here against
> > > what?
> >
> > How's this comment?
> >
> > /*
> > * Resume remote flushes and then read tlb_gen. We need to do
> > * it in this order: any inc_mm_tlb_gen() caller that writes a
> > * larger tlb_gen than we read here must see our cpu set in
> > * mm_cpumask() so that it will know to flush us. The barrier
> > * here synchronizes with inc_mm_tlb_gen().
> > */
>
> Slightly confusing, you mean this, right?
>
>
> 	switch_mm_irqs_off():                                      inc_mm_tlb_gen() caller:
>
> 	cpumask_set_cpu(cpu, mm_cpumask());                        inc_mm_tlb_gen();
>
> 	MB                                                         MB
>
> 	next_tlb_gen = atomic64_read(&next->context.tlb_gen);     flush_tlb_others(mm_cpumask());
>
>
> which seems to make sense.

Btw., I'll wait for a v5 iteration before applying this last patch to tip:x86/mm.

Thanks,

Ingo

2017-07-18 17:06:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86/mm: Try to preserve old TLB entries using PCID

On Tue, Jul 18, 2017 at 1:53 AM, Ingo Molnar <[email protected]> wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
>> On Wed, Jul 05, 2017 at 09:04:39AM -0700, Andy Lutomirski wrote:
>> > On Wed, Jul 5, 2017 at 5:18 AM, Peter Zijlstra <[email protected]> wrote:
>> > > On Thu, Jun 29, 2017 at 08:53:22AM -0700, Andy Lutomirski wrote:
>> > >> @@ -104,18 +140,20 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>> > >>
>> > >> /* Resume remote flushes and then read tlb_gen. */
>> > >> cpumask_set_cpu(cpu, mm_cpumask(next));
>> > >
>> > > Barriers should have a comment... what is being ordered here against
>> > > what?
>> >
>> > How's this comment?
>> >
>> > /*
>> > * Resume remote flushes and then read tlb_gen. We need to do
>> > * it in this order: any inc_mm_tlb_gen() caller that writes a
>> > * larger tlb_gen than we read here must see our cpu set in
>> > * mm_cpumask() so that it will know to flush us. The barrier
>> > * here synchronizes with inc_mm_tlb_gen().
>> > */
>>
>> Slightly confusing, you mean this, right?
>>
>>
>> 	switch_mm_irqs_off():                                      inc_mm_tlb_gen() caller:
>>
>> 	cpumask_set_cpu(cpu, mm_cpumask());                        inc_mm_tlb_gen();
>>
>> 	MB                                                         MB
>>
>> 	next_tlb_gen = atomic64_read(&next->context.tlb_gen);     flush_tlb_others(mm_cpumask());
>>
>>
>> which seems to make sense.
>
> Btw., I'll wait for a v5 iteration before applying this last patch to tip:x86/mm.

I'll send it shortly. I think I'll also add a patch to factor out the
flush calls a bit more to prepare for Mel's upcoming fix.

>
> Thanks,
>
> Ingo

2017-07-28 13:49:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86/mm: Try to preserve old TLB entries using PCID

On Wed, Jul 05, 2017 at 09:10:00AM -0700, Andy Lutomirski wrote:
> On Wed, Jul 5, 2017 at 5:25 AM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, Jun 29, 2017 at 08:53:22AM -0700, Andy Lutomirski wrote:
> >> +static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> >> + u16 *new_asid, bool *need_flush)
> >> +{
> >> + u16 asid;
> >> +
> >> + if (!static_cpu_has(X86_FEATURE_PCID)) {
> >> + *new_asid = 0;
> >> + *need_flush = true;
> >> + return;
> >> + }
> >> +
> >> + for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
> >> + if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
> >> + next->context.ctx_id)
> >> + continue;
> >> +
> >> + *new_asid = asid;
> >> + *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
> >> + next_tlb_gen);
> >> + return;
> >> + }
> >> +
> >> + /*
> >> + * We don't currently own an ASID slot on this CPU.
> >> + * Allocate a slot.
> >> + */
> >> + *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
> >
> > So this basically RR the ASID slots. Have you tried slightly more
> > complex replacement policies like CLOCK ?
>
> No, mainly because I'm lazy and because CLOCK requires scavenging a
> bit. (Which we can certainly do, but it will further complicate the
> code.) It could be worth playing with better replacement algorithms
> as a followup, though.
>
> I've also considered a slight elaboration of RR in which we make sure
> not to reuse the most recent ASID slot, which would guarantee that, if
> we switch from task A to B and back to A, we don't flush on the way
> back to A. (Currently, if B is not in the cache, there's a 1/6 chance
> we'll flush on the way back.)

How's this?

---
arch/x86/include/asm/mmu.h | 2 +-
arch/x86/include/asm/mmu_context.h | 2 +-
arch/x86/include/asm/tlbflush.h | 2 +-
arch/x86/mm/init.c | 2 +-
arch/x86/mm/tlb.c | 47 ++++++++++++++++++++++++++------------
5 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index bb8c597c2248..9f26ea900df0 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -55,7 +55,7 @@ typedef struct {

#define INIT_MM_CONTEXT(mm) \
.context = { \
- .ctx_id = 1, \
+ .ctx_id = 2, \
}

void leave_mm(int cpu);
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index d25d9f4abb15..f7866733875d 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -137,7 +137,7 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
static inline int init_new_context(struct task_struct *tsk,
struct mm_struct *mm)
{
- mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
+ mm->context.ctx_id = atomic64_add_return(2, &last_mm_ctx_id);
atomic64_set(&mm->context.tlb_gen, 0);

#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d23e61dc0640..43a4af25d78a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -102,7 +102,7 @@ struct tlb_state {
*/
struct mm_struct *loaded_mm;
u16 loaded_mm_asid;
- u16 next_asid;
+ u16 last_asid;

/*
* Access to this CR4 shadow and to H/W CR4 is protected by
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 65ae17d45c4a..570979714d49 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,7 @@ void __init zone_sizes_init(void)

DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
- .next_asid = 1,
+ .last_asid = 0,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
};
EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ce104b962a17..aacb87f03428 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,12 +28,28 @@
* Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
*/

-atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
+atomic64_t last_mm_ctx_id = ATOMIC64_INIT(2);
+
+static inline u64 asid_ctx_id(int asid)
+{
+ return this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) & ~1ULL;
+}
+
+static inline void asid_hit(int asid)
+{
+ this_cpu_or(cpu_tlbstate.ctxs[asid].ctx_id, 1);
+}
+
+static inline bool asid_age(int asid)
+{
+ return this_cpu_xchg(cpu_tlbstate.ctxs[asid].ctx_id, asid_ctx_id(asid)) & 1ULL;
+}

static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
u16 *new_asid, bool *need_flush)
{
u16 asid;
+ int i;

if (!static_cpu_has(X86_FEATURE_PCID)) {
*new_asid = 0;
@@ -42,10 +58,11 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
}

for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
- if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
- next->context.ctx_id)
+ if (asid_ctx_id(asid) != next->context.ctx_id)
continue;

+ asid_hit(asid);
+
*new_asid = asid;
*need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
next_tlb_gen);
@@ -53,14 +70,17 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
}

/*
- * We don't currently own an ASID slot on this CPU.
- * Allocate a slot.
+ * CLOCK - each entry has a single 'used' bit. Cycle through the array
+ * clearing this bit until we find an entry that doesn't have it set.
*/
- *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
- if (*new_asid >= TLB_NR_DYN_ASIDS) {
- *new_asid = 0;
- this_cpu_write(cpu_tlbstate.next_asid, 1);
- }
+ i = this_cpu_read(cpu_tlbstate.last_asid);
+ do {
+ if (--i < 0)
+ i = TLB_NR_DYN_ASIDS - 1;
+ } while (!asid_age(i));
+ this_cpu_write(cpu_tlbstate.last_asid, i);
+
+ *new_asid = i;
*need_flush = true;
}

@@ -125,8 +145,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
VM_BUG_ON(__read_cr3() != (__sme_pa(real_prev->pgd) | prev_asid));

if (real_prev == next) {
- VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
- next->context.ctx_id);
+ VM_BUG_ON(asid_ctx_id(prev_asid) != next->context.ctx_id);

if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
/*
@@ -239,9 +258,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,

/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());
-
- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
- loaded_mm->context.ctx_id);
+ VM_WARN_ON(asid_ctx_id(loaded_mm_asid) != loaded_mm->context.ctx_id);

if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
/*
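
The trick above works because ctx_ids are now handed out in steps of two, so
bit 0 of each per-CPU cached ctx_id is free to act as a per-slot "referenced"
flag that the sweep clears. A tiny stand-alone model of those three helpers
-- made-up names, a single slot instead of the real per-CPU array, purely
illustrative:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	static uint64_t slot_ctx_id;		/* even ctx_id, bit 0 = referenced */

	static uint64_t model_ctx_id(void)	/* cf. asid_ctx_id() */
	{
		return slot_ctx_id & ~1ULL;
	}

	static void model_hit(void)		/* cf. asid_hit() */
	{
		slot_ctx_id |= 1ULL;
	}

	static bool model_age(void)		/* cf. asid_age(): clear flag, return old value */
	{
		bool referenced = slot_ctx_id & 1ULL;

		slot_ctx_id = model_ctx_id();
		return referenced;
	}

	int main(void)
	{
		slot_ctx_id = 4;		/* some mm's ctx_id; always even */
		model_hit();
		printf("referenced on first sweep:  %d\n", model_age());
		printf("referenced on second sweep: %d\n", model_age());
		return 0;
	}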

2017-09-12 19:36:43

by Prakhya, Sai Praneeth

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness

> From: Andy Lutomirski <[email protected]>
> Date: Thu, Jun 29, 2017 at 8:53 AM
> Subject: [PATCH v4 00/10] PCID and improved laziness
> To: [email protected]
> Cc: [email protected], Borislav Petkov <[email protected]>,
> Linus Torvalds <[email protected]>, Andrew Morton
> <[email protected]>, Mel Gorman <[email protected]>,
> "[email protected]" <[email protected]>, Nadav Amit
> <[email protected]>, Rik van Riel <[email protected]>, Dave Hansen
> <[email protected]>, Arjan van de Ven <[email protected]>,
> Peter Zijlstra <[email protected]>, Andy Lutomirski
> <[email protected]>
>
>
> *** Ingo, even if this misses 4.13, please apply the first patch
> before
> *** the merge window.
>
> There are three performance benefits here:
>
> 1. TLB flushing is slow. (I.e. the flush itself takes a while.)
> This avoids many of them when switching tasks by using PCID. In
> a stupid little benchmark I did, it saves about 100ns on my laptop
> per context switch. I'll try to improve that benchmark.
>
> 2. Mms that have been used recently on a given CPU might get to keep
> their TLB entries alive across process switches with this patch
> set. TLB fills are pretty fast on modern CPUs, but they're even
> faster when they don't happen.
>
> 3. Lazy TLB is way better. We used to do two stupid things when we
> ran kernel threads: we'd send IPIs to flush user contexts on their
> CPUs and then we'd write to CR3 for no particular reason as an
> excuse
> to stop further IPIs. With this patch, we do neither.
>
> This will, in general, perform suboptimally if paravirt TLB flushing
> is in use (currently just Xen, I think, but Hyper-V is in the works).
> The code is structured so we could fix it in one of two ways: we
> could take a spinlock when touching the percpu state so we can update
> it remotely after a paravirt flush, or we could be more careful about
> our exactly how we access the state and use cmpxchg16b to do atomic
> remote updates. (On SMP systems without cmpxchg16b, we'd just skip
> the optimization entirely.)
>
> This is still missing a final comment-only patch to add overall
> documentation for the whole thing, but I didn't want to block sending
> the maybe-hopefully-final code on that.
>
> This is based on tip:x86/mm. The branch is here if you want to play:
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
>
> In general, performance seems to exceed my expectations. Here are
> some performance numbers copy-and-pasted from the changelogs for
> "Rework lazy TLB mode and TLB freshness" and "Try to preserve old
> TLB entries using PCID":
>
>

Hi Andy,

I have booted Linus's tree (8fac2f96ab86b0e14ec4e42851e21e9b518bdc55) on a
Skylake server and noticed that it reboots automatically.

When I boot the same kernel with the command line argument "nopcid", it works
fine. Please find a snippet of dmesg below. Please let me know if you need
more info to debug.

[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.13.0+
root=UUID=3b8e9636-6e23-4785-a4e2-5954bfe86fd9 ro console=tty0
console=ttyS0,115200n8
[ 0.000000] log_buf_len individual max cpu contribution: 4096 bytes
[ 0.000000] log_buf_len total cpu_extra contributions: 258048 bytes
[ 0.000000] log_buf_len min size: 262144 bytes
[ 0.000000] log_buf_len: 524288 bytes
[ 0.000000] early log buf free: 212560(81%)
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] WARNING: CPU: 0 PID: 0 at arch/x86/mm/tlb.c:245
initialize_tlbstate_and_flush+0x6c/0xf0
[ 0.000000] Modules linked in:
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.13.0+ #5
[ 0.000000] task: ffffffff8960f480 task.stack: ffffffff89600000
[ 0.000000] RIP: 0010:initialize_tlbstate_and_flush+0x6c/0xf0
[ 0.000000] RSP: 0000:ffffffff89603e60 EFLAGS: 00010046
[ 0.000000] RAX: 00000000000406b0 RBX: ffff9f1700a17880 RCX:
ffffffff8965de60
[ 0.000000] RDX: 0000008383a0a000 RSI: 000000000960a000 RDI:
0000008383a0a000
[ 0.000000] RBP: ffffffff89603e60 R08: 0000000000000000 R09:
0000ffffffffffff
[ 0.000000] R10: ffffffff89603ee8 R11: ffffffff0000ffff R12:
0000000000000000
[ 0.000000] R13: ffff9f1700a0c3e0 R14: ffffffff8960f480 R15:
0000000000000000
[ 0.000000] FS: 0000000000000000(0000) GS:ffff9f1700a00000(0000)
knlGS:0000000000000000
[ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.000000] CR2: ffff9fa7bffff000 CR3: 0000008383a0a000 CR4:
00000000000406b0
[ 0.000000] Call Trace:
[ 0.000000] cpu_init+0x206/0x4f0
[ 0.000000] ? __set_pte_vaddr+0x1d/0x30
[ 0.000000] trap_init+0x3e/0x50
[ 0.000000] ? trap_init+0x3e/0x50
[ 0.000000] start_kernel+0x1e2/0x3f2
[ 0.000000] x86_64_start_reservations+0x24/0x26
[ 0.000000] x86_64_start_kernel+0x6f/0x72
[ 0.000000] secondary_startup_64+0xa5/0xa5
[ 0.000000] Code: de 00 48 01 f0 48 39 c7 0f 85 92 00 00 00 48 8b 05
ee e2 ee 00 a9 00 00 02 00 74 11 65 48 8b 05 8b 9d 7c 77 a9 00 00 02 00
75 02 <0f> ff 48 81 e2 00 f0 ff ff 0f 22 da 65 66 c7 05 66 9d 7c 77 00
[ 0.000000] ---[ end trace c258f2d278fe031f ]---
[ 0.000000] Memory: 791050356K/803934656K available (9585K kernel
code, 1313K rwdata, 3000K rodata, 1176K init, 680K bss, 12884300K
reserved, 0K cma-reserved)
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=64,
Nodes=4
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] RCU event tracing is enabled.
[ 0.000000] NR_IRQS: 4352, nr_irqs: 3928, preallocated irqs: 16
[ 0.000000] Console: colour dummy device 80x25
[ 0.000000] console [tty0] enabled
[ 0.000000] console [ttyS0] enabled
[ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles:
0xffffffff, max_idle_ns: 79635855245 ns
[ 0.001000] tsc: Detected 2000.000 MHz processor
[ 0.002000] Calibrating delay loop (skipped), value calculated using
timer frequency.. 4000.00 BogoMIPS (lpj=2000000)
[ 0.003003] pid_max: default: 65536 minimum: 512
[ 0.004030] ACPI: Core revision 20170728
[ 0.091853] ACPI: 6 ACPI AML tables successfully acquired and loaded
[ 0.094143] Security Framework initialized
[ 0.095004] SELinux: Initializing.
[ 0.145612] Dentry cache hash table entries: 33554432 (order: 16,
268435456 bytes)
[ 0.170544] Inode-cache hash table entries: 16777216 (order: 15,
134217728 bytes)
[ 0.172699] Mount-cache hash table entries: 524288 (order: 10,
4194304 bytes)
[ 0.174441] Mountpoint-cache hash table entries: 524288 (order: 10,
4194304 bytes)
[ 0.176351] CPU: Physical Processor ID: 0
[ 0.177003] CPU: Processor Core ID: 0
[ 0.178007] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[ 0.179003] ENERGY_PERF_BIAS: View and update with
x86_energy_perf_policy(8)
[ 0.180013] mce: CPU supports 20 MCE banks
[ 0.181018] CPU0: Thermal monitoring enabled (TM1)
[ 0.182057] process: using mwait in idle threads
[ 0.183005] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
[ 0.184003] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
[ 0.185223] Freeing SMP alternatives memory: 36K
[ 0.193912] smpboot: Max logical packages: 8
[ 0.194017] Switched APIC routing to physical flat.
[ 0.196496] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.206252] smpboot: CPU0: Intel(R) Xeon(R) Platinum 8164 CPU @
2.00GHz (family: 0x6, model: 0x55, stepping: 0x4)
[ 0.207131] Performance Events: PEBS fmt3+, Skylake events, 32-deep
LBR, full-width counters, Intel PMU driver.
[ 0.208003] ... version: 4
[ 0.209001] ... bit width: 48
[ 0.210001] ... generic registers: 4
[ 0.211001] ... value mask: 0000ffffffffffff
[ 0.212001] ... max period: 00007fffffffffff
[ 0.213001] ... fixed-purpose events: 3
[ 0.214001] ... event mask: 000000070000000f
[ 0.215078] Hierarchical SRCU implementation.
[ 0.216867] smp: Bringing up secondary CPUs ...
[ 0.217085] x86: Booting SMP configuration:
[ 0.218001] .... node #0, CPUs: #1
[ 0.001000] ------------[ cut here ]------------
[ 0.001000] WARNING: CPU: 1 PID: 0 at arch/x86/mm/tlb.c:245
initialize_tlbstate_and_flush+0x6c/0xf0
[ 0.001000] Modules linked in:
[ 0.001000] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W
4.13.0+ #5
[ 0.001000] task: ffff9f16fa393e40 task.stack: ffffaf0e98afc000
[ 0.001000] RIP: 0010:initialize_tlbstate_and_flush+0x6c/0xf0
[ 0.001000] RSP: 0000:ffffaf0e98affeb0 EFLAGS: 00010046
[ 0.001000] RAX: 00000000000000a0 RBX: ffff9f1700a57880 RCX:
ffffffff8965de60
[ 0.001000] RDX: 0000008383a0a000 RSI: 000000000960a000 RDI:
0000008383a0a000
[ 0.001000] RBP: ffffaf0e98affeb0 R08: 0000000000000000 R09:
0000000000000000
[ 0.001000] R10: ffffaf0e98affe78 R11: ffffaf0e98affdb6 R12:
0000000000000001
[ 0.001000] R13: ffff9f1700a4c3e0 R14: ffff9f16fa393e40 R15:
0000000000000001
[ 0.001000] FS: 0000000000000000(0000) GS:ffff9f1700a40000(0000)
knlGS:0000000000000000
[ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.001000] CR2: 0000000000000000 CR3: 0000008383a0a000 CR4:
00000000000000a0
[ 0.001000] invalid opcode: 0000 [#1] SMP
[ 0.001000] Modules linked in:
[ 0.001000] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W
4.13.0+ #5
[ 0.001000] task: ffff9f16fa393e40 task.stack: ffffaf0e98afc000
[ 0.001000] RIP: 0010:__show_regs+0x255/0x290
[ 0.001000] RSP: 0000:ffffaf0e98affbc0 EFLAGS: 00010002
[ 0.001000] RAX: 0000000000000018 RBX: 0000000000000000 RCX:
0000000000000000
[ 0.001000] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffffffff898a978c
[ 0.001000] RBP: ffffaf0e98affc10 R08: 0000000000000001 R09:
0000000000000373
[ 0.001000] R10: ffffffff8884fb8c R11: ffffffff898ab7cd R12:
00000000ffff0ff0
[ 0.001000] R13: 0000000000000400 R14: ffff9f1700a40000 R15:
0000000000000000
[ 0.001000] FS: 0000000000000000(0000) GS:ffff9f1700a40000(0000)
knlGS:0000000000000000
[ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.001000] CR2: 0000000000000000 CR3: 0000008383a0a000 CR4:
00000000000000a0
--------------------<snip>--------------------------------------
[ 0.001000] invalid opcode: 0000 [#20] SMP
[ 0.001000] Modules linked in:
[ 0.001000] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W
4.13.0+ #5
[ 0.001000] task: ffff9f16fa393e40 task.stack: ffffaf0e98afc000
[ 0.001000] RIP: 0010:__show_regs+0x255/0x290
[ 0.001000] RSP: 0000:ffffaf0e98afc788 EFLAGS: 00010002
[ 0.001000] RAX: 0000000000000018 RBX: 0000000000000000 RCX:
0000000000000000
[ 0.001000] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffffffff898a978c
[ 0.001000] RBP: ffffaf0e98afc7d8 R08: 0000000000000001 R09:
0000000000000490
[ 0.001000] R10: ffffffff88818785 R11: ffffffff898ab7cd R12:
00000000ffff0ff0
[ 0.001000] R13: 0000000000000400 R14: ffff9f1700a40000 R15:
0000000000000000
[ 0.001000] FS: 0000000000000000(0000) GS:ffff9f1700a40000(0000)
knlGS:0000000000000000
[ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.001000] CR2: 0000000000000000 CR3: 0000008383a0a000 CR4:
00000000000000a0
Force an S5 exit path.

Regards,
Sai


2017-09-13 04:19:02

by Prakhya, Sai Praneeth

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness

>
>
> Hi Andy,
>
> I have booted Linus's tree (8fac2f96ab86b0e14ec4e42851e21e9b518bdc55) on a
> Skylake server and noticed that it reboots automatically.
>
> When I boot the same kernel with the command line argument "nopcid", it works
> fine. Please find a snippet of dmesg below. Please let me know if you need
> more info to debug.
>
> --------------------<snip: dmesg identical to the log in the previous message>--------------------

Hi All,

Since I didn't find the conversation between Andy and myself on the mailing
list, I wanted to keep this thread updated, so sorry for spamming your
inbox if it's a duplicate.

Andy asked me to try a patch that's already in his git repo, and it
fixes the issue:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb88ae619b4c3d832d224f2c641849dc02aed864


Regards,
Sai

2017-09-13 07:43:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness


* Andy Lutomirski <[email protected]> wrote:

> I'm on my way to LPC, so I can't easily work on this right this instant.
>
> Can you try this branch, though?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb88ae619b4c3d832d224f2c641849dc02aed864

Any objections against me applying these fixes directly and getting them to Linus
later today, to not widen the window of breakage any further?

I'll also apply:

x86/mm/64: Initialize CR4.PCIDE early

Thanks,

Ingo

2017-09-13 07:45:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 00/10] PCID and improved laziness


* Ingo Molnar <[email protected]> wrote:

>
> * Andy Lutomirski <[email protected]> wrote:
>
> > I'm on my way to LPC, so I can't easily work on this right this instant.
> >
> > Can you try this branch, though?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb88ae619b4c3d832d224f2c641849dc02aed864
>
> Any objections against me applying these fixes directly and getting them to Linus
> later today, to not widen the window of breakage any further?
>
> I'll also apply:
>
> x86/mm/64: Initialize CR4.PCIDE early

So now that I've looked at the branch, I think I'll apply these three:

cb88ae619b4c: x86/mm/64: Initialize CR4.PCIDE early
6ec68017784f: x86/hibernate/64: Mask off CR3's PCID bits in the saved CR3
5d1298206d7b: x86/mm: Get rid of VM_BUG_ON in switch_tlb_irqs_off()

The first one is likely what fixed the crash that Sai Praneeth Prakhya reported.

Thanks,

Ingo