2017-06-21 05:22:22

by Andy Lutomirski

Subject: [PATCH v3 00/11] PCID and improved laziness

There are three performance benefits here:

1. TLB flushing is slow. (I.e. the flush itself takes a while.)
This avoids many of them when switching tasks by using PCID. In
a stupid little benchmark I did, it saves about 100ns on my laptop
per context switch. I'll try to improve that benchmark.

2. Mms that have been used recently on a given CPU might get to keep
their TLB entries alive across process switches with this patch
set. TLB fills are pretty fast on modern CPUs, but they're even
faster when they don't happen.

3. Lazy TLB is way better. We used to do two stupid things when we
ran kernel threads: we'd send IPIs to flush user contexts on their
CPUs and then we'd write to CR3 for no particular reason as an excuse
to stop further IPIs. With this patch, we do neither.

This will, in general, perform suboptimally if paravirt TLB flushing
is in use (currently just Xen, I think, but Hyper-V is in the works).
The code is structured so we could fix it in one of two ways: we
could take a spinlock when touching the percpu state so we can update
it remotely after a paravirt flush, or we could be more careful about
exactly how we access the state and use cmpxchg16b to do atomic
remote updates. (On SMP systems without cmpxchg16b, we'd just skip
the optimization entirely.)
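
A rough sketch of the first option (illustrative only -- the struct,
field, and function names below are made up and are not part of this
series; locking details such as irqsave are elided) might look like:

struct remote_tlb_state {
        spinlock_t lock;        /* protects ctx_id and tlb_gen */
        u64 ctx_id;
        u64 tlb_gen;
};

static void remote_note_flush(struct remote_tlb_state *st,
                              u64 ctx_id, u64 flushed_gen)
{
        spin_lock(&st->lock);
        /* Only advance if the slot still belongs to the flushed mm. */
        if (st->ctx_id == ctx_id && st->tlb_gen < flushed_gen)
                st->tlb_gen = flushed_gen;
        spin_unlock(&st->lock);
}

The cmpxchg16b variant would instead pack ctx_id and tlb_gen into a
single 16-byte slot and update both halves in one atomic operation,
avoiding the lock.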

This is based on tip:x86/mm. The branch is here if you want to play:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid

Changes from v2:
- Add some Acks
- Move the reentrancy fix to the beginning of the series.
(I also sent the same patch as a standalone fix -- it's just in here
so that this series applies to x86/mm.)
- Fix some comments.

Changes from RFC:
- flush_tlb_func_common() no longer gets reentered (Nadav)
- Fix ASID corruption on unlazying (kbuild bot)
- Move Xen init to the right place
- Misc cleanups

Andy Lutomirski (11):
x86/mm: Don't reenter flush_tlb_func_common()
x86/ldt: Simplify LDT switching logic
x86/mm: Remove reset_lazy_tlbstate()
x86/mm: Give each mm TLB flush generation a unique ID
x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
x86/mm: Rework lazy TLB mode and TLB freshness tracking
x86/mm: Stop calling leave_mm() in idle code
x86/mm: Disable PCID on 32-bit kernels
x86/mm: Add nopcid to turn off PCID
x86/mm: Enable CR4.PCIDE on supported systems
x86/mm: Try to preserve old TLB entries using PCID

Documentation/admin-guide/kernel-parameters.txt | 2 +
arch/ia64/include/asm/acpi.h | 2 -
arch/x86/include/asm/acpi.h | 2 -
arch/x86/include/asm/disabled-features.h | 4 +-
arch/x86/include/asm/mmu.h | 25 +-
arch/x86/include/asm/mmu_context.h | 40 ++-
arch/x86/include/asm/processor-flags.h | 2 +
arch/x86/include/asm/tlbflush.h | 89 +++++-
arch/x86/kernel/cpu/bugs.c | 8 +
arch/x86/kernel/cpu/common.c | 33 +++
arch/x86/kernel/smpboot.c | 1 -
arch/x86/mm/init.c | 2 +-
arch/x86/mm/tlb.c | 368 +++++++++++++++---------
arch/x86/xen/enlighten_pv.c | 6 +
arch/x86/xen/mmu_pv.c | 3 +-
drivers/acpi/processor_idle.c | 2 -
drivers/idle/intel_idle.c | 9 +-
17 files changed, 430 insertions(+), 168 deletions(-)

--
2.9.4


2017-06-21 05:22:25

by Andy Lutomirski

Subject: [PATCH v3 04/11] x86/mm: Give each mm TLB flush generation a unique ID

This adds two new variables to mmu_context_t: ctx_id and tlb_gen.
ctx_id uniquely identifies the mm_struct and will never be reused.
For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic
count of the number of times that a TLB flush has been requested.
The pair (ctx_id, tlb_gen) can be used as an identifier for TLB
flush actions and will be used in subsequent patches to reliably
determine whether all needed TLB flushes have occurred on a given
CPU.
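
As an illustration only (the helper below is hypothetical and is not
part of this patch; the per-CPU side of the comparison is added by a
later patch in the series), the intended use of the pair looks
roughly like this:

/*
 * Hypothetical sketch: decide whether a CPU's TLB is current for an
 * mm, given the (ctx_id, tlb_gen) that CPU last recorded for itself.
 */
static bool tlb_is_up_to_date(u64 cpu_ctx_id, u64 cpu_tlb_gen,
                              struct mm_struct *mm)
{
        /*
         * A different ctx_id means the cached entries belong to some
         * other (possibly freed and reallocated) mm.
         */
        if (cpu_ctx_id != mm->context.ctx_id)
                return false;

        /* Up to date iff every requested flush has been processed. */
        return cpu_tlb_gen >= atomic64_read(&mm->context.tlb_gen);
}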

This patch is split out for ease of review. By itself, it has no
real effect other than creating and updating the new variables.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/mmu.h | 25 +++++++++++++++++++++++--
arch/x86/include/asm/mmu_context.h | 5 +++++
arch/x86/include/asm/tlbflush.h | 18 ++++++++++++++++++
arch/x86/mm/tlb.c | 6 ++++--
4 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 79b647a7ebd0..bb8c597c2248 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -3,12 +3,28 @@

#include <linux/spinlock.h>
#include <linux/mutex.h>
+#include <linux/atomic.h>

/*
- * The x86 doesn't have a mmu context, but
- * we put the segment information here.
+ * x86 has arch-specific MMU state beyond what lives in mm_struct.
*/
typedef struct {
+ /*
+ * ctx_id uniquely identifies this mm_struct. A ctx_id will never
+ * be reused, and zero is not a valid ctx_id.
+ */
+ u64 ctx_id;
+
+ /*
+ * Any code that needs to do any sort of TLB flushing for this
+ * mm will first make its changes to the page tables, then
+ * increment tlb_gen, then flush. This lets the low-level
+ * flushing code keep track of what needs flushing.
+ *
+ * This is not used on Xen PV.
+ */
+ atomic64_t tlb_gen;
+
#ifdef CONFIG_MODIFY_LDT_SYSCALL
struct ldt_struct *ldt;
#endif
@@ -37,6 +53,11 @@ typedef struct {
#endif
} mm_context_t;

+#define INIT_MM_CONTEXT(mm) \
+ .context = { \
+ .ctx_id = 1, \
+ }
+
void leave_mm(int cpu);

#endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ecfcb6643c9b..e5295d485899 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -129,9 +129,14 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
}

+extern atomic64_t last_mm_ctx_id;
+
static inline int init_new_context(struct task_struct *tsk,
struct mm_struct *mm)
{
+ mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
+ atomic64_set(&mm->context.tlb_gen, 0);
+
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
/* pkey 0 is the default and always allocated */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 50ea3482e1d1..1eb946c0507e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -57,6 +57,23 @@ static inline void invpcid_flush_all_nonglobals(void)
__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
}

+static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
+{
+ u64 new_tlb_gen;
+
+ /*
+ * Bump the generation count. This also serves as a full barrier
+ * that synchronizes with switch_mm: callers are required to order
+ * their read of mm_cpumask after their writes to the paging
+ * structures.
+ */
+ smp_mb__before_atomic();
+ new_tlb_gen = atomic64_inc_return(&mm->context.tlb_gen);
+ smp_mb__after_atomic();
+
+ return new_tlb_gen;
+}
+
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else
@@ -262,6 +279,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm)
{
+ bump_mm_tlb_gen(mm);
cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
}

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fd593833a854..6d9d37323a43 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
* Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
*/

+atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
+
void leave_mm(int cpu)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -286,8 +288,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,

cpu = get_cpu();

- /* Synchronize with switch_mm. */
- smp_mb();
+ /* This is also a barrier that synchronizes with switch_mm(). */
+ bump_mm_tlb_gen(mm);

/* Should we flush just the requested range? */
if ((end != TLB_FLUSH_ALL) &&
--
2.9.4

2017-06-21 05:22:29

by Andy Lutomirski

Subject: [PATCH v3 03/11] x86/mm: Remove reset_lazy_tlbstate()

The only call site also calls idle_task_exit(), and idle_task_exit()
puts us into a clean state by explicitly switching to init_mm.

Reviewed-by: Rik van Riel <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 8 --------
arch/x86/kernel/smpboot.c | 1 -
2 files changed, 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5f78c6a77578..50ea3482e1d1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,14 +259,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
#define TLBSTATE_OK 1
#define TLBSTATE_LAZY 2

-static inline void reset_lazy_tlbstate(void)
-{
- this_cpu_write(cpu_tlbstate.state, 0);
- this_cpu_write(cpu_tlbstate.loaded_mm, &init_mm);
-
- WARN_ON(read_cr3_pa() != __pa_symbol(swapper_pg_dir));
-}
-
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm)
{
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f04479a8f74f..6169a56aab49 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1589,7 +1589,6 @@ void native_cpu_die(unsigned int cpu)
void play_dead_common(void)
{
idle_task_exit();
- reset_lazy_tlbstate();

/* Ack it */
(void)cpu_report_death();
--
2.9.4

2017-06-21 05:22:32

by Andy Lutomirski

Subject: [PATCH v3 08/11] x86/mm: Disable PCID on 32-bit kernels

32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode. Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/disabled-features.h | 4 +++-
arch/x86/kernel/cpu/bugs.c | 8 ++++++++
2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dff775af7cd..c10c9128f54e 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -21,11 +21,13 @@
# define DISABLE_K6_MTRR (1<<(X86_FEATURE_K6_MTRR & 31))
# define DISABLE_CYRIX_ARR (1<<(X86_FEATURE_CYRIX_ARR & 31))
# define DISABLE_CENTAUR_MCR (1<<(X86_FEATURE_CENTAUR_MCR & 31))
+# define DISABLE_PCID 0
#else
# define DISABLE_VME 0
# define DISABLE_K6_MTRR 0
# define DISABLE_CYRIX_ARR 0
# define DISABLE_CENTAUR_MCR 0
+# define DISABLE_PCID (1<<(X86_FEATURE_PCID & 31))
#endif /* CONFIG_X86_64 */

#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
@@ -49,7 +51,7 @@
#define DISABLED_MASK1 0
#define DISABLED_MASK2 0
#define DISABLED_MASK3 (DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR)
-#define DISABLED_MASK4 0
+#define DISABLED_MASK4 (DISABLE_PCID)
#define DISABLED_MASK5 0
#define DISABLED_MASK6 0
#define DISABLED_MASK7 0
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0af86d9242da..db684880d74a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -21,6 +21,14 @@

void __init check_bugs(void)
{
+#ifdef CONFIG_X86_32
+ /*
+ * Regardless of whether PCID is enumerated, the SDM says
+ * that it can't be enabled in 32-bit mode.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+#endif
+
identify_boot_cpu();

if (!IS_ENABLED(CONFIG_SMP)) {
--
2.9.4

2017-06-21 05:22:40

by Andy Lutomirski

Subject: [PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common()

It was historically possible to have two concurrent TLB flushes
targeting the same CPU: one initiated locally and one initiated
remotely. This can now cause an OOPS in leave_mm() at
arch/x86/mm/tlb.c:47:

if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
BUG();

with this call trace:
flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317

Without reentrancy, this OOPS is impossible: leave_mm() is only
called when we're not in TLBSTATE_OK, yet leave_mm() then finds the
CPU unexpectedly back in TLBSTATE_OK.

This can be caused by flush_tlb_func_remote() happening between
the two checks and calling leave_mm(), resulting in two consecutive
leave_mm() calls on the same CPU with no intervening switch_mm()
calls.

We never saw this OOPS before because the old leave_mm()
implementation didn't put us back in TLBSTATE_OK, so the assertion
didn't fire.

Nadav noticed the reentrancy issue in a different context, but at
the time neither of us realized that it caused a problem.

Cc: Nadav Amit <[email protected]>
Cc: Dave Hansen <[email protected]>
Reported-by: "Levin, Alexander (Sasha Levin)" <[email protected]>
Fixes: 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm")
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/mm/tlb.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2a5e851f2035..f06239c6919f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -208,6 +208,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
static void flush_tlb_func_common(const struct flush_tlb_info *f,
bool local, enum tlb_flush_reason reason)
{
+ /* This code cannot presently handle being reentered. */
+ VM_WARN_ON(!irqs_disabled());
+
if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
leave_mm(smp_processor_id());
return;
@@ -313,8 +316,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
info.end = TLB_FLUSH_ALL;
}

- if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
+ if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+ local_irq_disable();
flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+ local_irq_enable();
+ }
+
if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
put_cpu();
@@ -370,8 +377,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)

int cpu = get_cpu();

- if (cpumask_test_cpu(cpu, &batch->cpumask))
+ if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+ local_irq_disable();
flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
+ local_irq_enable();
+ }
+
if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
flush_tlb_others(&batch->cpumask, &info);
cpumask_clear(&batch->cpumask);
--
2.9.4

2017-06-21 05:22:46

by Andy Lutomirski

Subject: [PATCH v3 07/11] x86/mm: Stop calling leave_mm() in idle code

Now that lazy TLB suppresses all flush IPIs (as opposed to all but
the first), there's no need to leave_mm() when going idle.

This means we can get rid of the rcuidle hack in
switch_mm_irqs_off() and we can unexport leave_mm().

This also removes acpi_unlazy_tlb() from the x86 and ia64 headers,
since it has no callers any more.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/ia64/include/asm/acpi.h | 2 --
arch/x86/include/asm/acpi.h | 2 --
arch/x86/mm/tlb.c | 19 +++----------------
drivers/acpi/processor_idle.c | 2 --
drivers/idle/intel_idle.c | 9 ++++-----
5 files changed, 7 insertions(+), 27 deletions(-)

diff --git a/arch/ia64/include/asm/acpi.h b/arch/ia64/include/asm/acpi.h
index a3d0211970e9..c86a947f5368 100644
--- a/arch/ia64/include/asm/acpi.h
+++ b/arch/ia64/include/asm/acpi.h
@@ -112,8 +112,6 @@ static inline void arch_acpi_set_pdc_bits(u32 *buf)
buf[2] |= ACPI_PDC_EST_CAPABILITY_SMP;
}

-#define acpi_unlazy_tlb(x)
-
#ifdef CONFIG_ACPI_NUMA
extern cpumask_t early_cpu_possible_map;
#define for_each_possible_early_cpu(cpu) \
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 2efc768e4362..562286fa151f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -150,8 +150,6 @@ static inline void disable_acpi(void) { }
extern int x86_acpi_numa_init(void);
#endif /* CONFIG_ACPI_NUMA */

-#define acpi_unlazy_tlb(x) leave_mm(x)
-
#ifdef CONFIG_ACPI_APEI
static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fea2b07ac7d8..5f932fd80881 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -50,7 +50,6 @@ void leave_mm(int cpu)

switch_mm(NULL, &init_mm, NULL);
}
-EXPORT_SYMBOL_GPL(leave_mm);

void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
@@ -113,14 +112,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
next_tlb_gen);
write_cr3(__pa(next->pgd));
- /*
- * This gets called via leave_mm() in the idle path
- * where RCU functions differently. Tracing normally
- * uses RCU, so we have to call the tracepoint
- * specially here.
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
- TLB_FLUSH_ALL);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
}

/*
@@ -166,13 +159,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, next);
write_cr3(__pa(next->pgd));

- /*
- * This gets called via leave_mm() in the idle path where RCU
- * functions differently. Tracing normally uses RCU, so we
- * have to call the tracepoint specially here.
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
- TLB_FLUSH_ALL);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
}

load_mm_cr4(next);
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 5c8aa9cf62d7..fe3d2a40f311 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -708,8 +708,6 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
static void acpi_idle_enter_bm(struct acpi_processor *pr,
struct acpi_processor_cx *cx, bool timer_bc)
{
- acpi_unlazy_tlb(smp_processor_id());
-
/*
* Must be done before busmaster disable as we might need to
* access HPET !
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 216d7ec88c0c..2ae43f59091d 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -912,16 +912,15 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
struct cpuidle_state *state = &drv->states[index];
unsigned long eax = flg2MWAIT(state->flags);
unsigned int cstate;
- int cpu = smp_processor_id();

cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;

/*
- * leave_mm() to avoid costly and often unnecessary wakeups
- * for flushing the user TLB's associated with the active mm.
+ * NB: if CPUIDLE_FLAG_TLB_FLUSHED is set, this idle transition
+ * will probably flush the TLB. It's not guaranteed to flush
+ * the TLB, though, so it's not clear that we can do anything
+ * useful with this knowledge.
*/
- if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
- leave_mm(cpu);

if (!(lapic_timer_reliable_states & (1 << (cstate))))
tick_broadcast_enter();
--
2.9.4

2017-06-21 05:23:06

by Andy Lutomirski

Subject: [PATCH v3 10/11] x86/mm: Enable CR4.PCIDE on supported systems

We can use PCID if the CPU has PCID and PGE and we're not running as
a Xen PV guest.

By itself, this has no effect. The next patch will start using
PCID.

Cc: Juergen Gross <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 8 ++++++++
arch/x86/kernel/cpu/common.c | 15 +++++++++++++++
arch/x86/xen/enlighten_pv.c | 6 ++++++
3 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 87b13e51e867..57b305e13c4c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -243,6 +243,14 @@ static inline void __flush_tlb_all(void)
__flush_tlb_global();
else
__flush_tlb();
+
+ /*
+ * Note: if we somehow had PCID but not PGE, then this wouldn't work --
+ * we'd end up flushing kernel translations for the current ASID but
+ * we might fail to flush kernel translations for other cached ASIDs.
+ *
+ * To avoid this issue, we force PCID off if PGE is off.
+ */
}

static inline void __flush_tlb_one(unsigned long addr)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 904485e7b230..01caf66b270f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1143,6 +1143,21 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_smep(c);
setup_smap(c);

+ /* Set up PCID */
+ if (cpu_has(c, X86_FEATURE_PCID)) {
+ if (cpu_has(c, X86_FEATURE_PGE)) {
+ cr4_set_bits(X86_CR4_PCIDE);
+ } else {
+ /*
+ * flush_tlb_all(), as currently implemented, won't
+ * work if PCID is on but PGE is not. Since that
+ * combination doesn't exist on real hardware, there's
+ * no reason to try to fully support it.
+ */
+ clear_cpu_cap(c, X86_FEATURE_PCID);
+ }
+ }
+
/*
* The vendor-specific functions might have changed features.
* Now we do "generic changes."
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index f33eef4ebd12..a136aac543c3 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -295,6 +295,12 @@ static void __init xen_init_capabilities(void)
setup_clear_cpu_cap(X86_FEATURE_ACC);
setup_clear_cpu_cap(X86_FEATURE_X2APIC);

+ /*
+ * Xen PV would need some work to support PCID: CR3 handling as well
+ * as xen_flush_tlb_others() would need updating.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+
if (!xen_initial_domain())
setup_clear_cpu_cap(X86_FEATURE_ACPI);

--
2.9.4

2017-06-21 05:22:39

by Andy Lutomirski

Subject: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

PCID is a "process context ID" -- it's what other architectures call
an address space ID. Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB. x86's
PCID is 12 bits.

This is an unorthodox approach to using PCID. x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster
systems with over 4096 CPUs. To make matters worse, past attempts
to use all 12 PCID bits have resulted in slowdowns instead of
speedups.

This patch uses PCID differently. We use a PCID to identify a
recently-used mm on a per-cpu basis. An mm has no fixed PCID
binding at all; instead, we give it a fresh PCID each time it's
loaded except in cases where we want to preserve the TLB, in which
case we reuse a recent value.

This seems to save about 100ns on context switches between mms.
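
For illustration only (this helper does not exist in the patch; it
just restates the CR3 layout that the code below relies on), the CR3
value written on a context switch is assembled roughly like this:

/*
 * Sketch of the CR3 layout used here: the ASID (hardware PCID)
 * occupies bits 11:0, the page-table base sits above it, and bit 63
 * (CR3_NOFLUSH) asks the CPU to keep the old entries for that PCID
 * instead of flushing them.
 */
static inline unsigned long sketch_build_cr3(pgd_t *pgd, u16 asid,
                                             bool noflush)
{
        unsigned long cr3 = __pa(pgd) | asid;

        if (noflush)
                cr3 |= CR3_NOFLUSH;
        return cr3;
}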

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 3 ++
arch/x86/include/asm/processor-flags.h | 2 +
arch/x86/include/asm/tlbflush.h | 18 +++++++-
arch/x86/mm/init.c | 1 +
arch/x86/mm/tlb.c | 82 ++++++++++++++++++++++++++--------
5 files changed, 86 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 69a4f1ee86ac..2537ec03c9b7 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -299,6 +299,9 @@ static inline unsigned long __get_current_cr3_fast(void)
{
unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);

+ if (static_cpu_has(X86_FEATURE_PCID))
+ cr3 |= this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || !in_atomic());

diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 79aa2f98398d..791b60199aa4 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -35,6 +35,7 @@
/* Mask off the address space ID bits. */
#define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull
#define CR3_PCID_MASK 0xFFFull
+#define CR3_NOFLUSH (1UL << 63)
#else
/*
* CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -42,6 +43,7 @@
*/
#define CR3_ADDR_MASK 0xFFFFFFFFull
#define CR3_PCID_MASK 0ull
+#define CR3_NOFLUSH 0
#endif

#endif /* _ASM_X86_PROCESSOR_FLAGS_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 57b305e13c4c..a9a5aa6f45f7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,12 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
#define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
#endif

+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define NR_DYNAMIC_ASIDS 6
+
struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
@@ -95,6 +101,8 @@ struct tlb_state {
* mode even if we've already switched back to swapper_pg_dir.
*/
struct mm_struct *loaded_mm;
+ u16 loaded_mm_asid;
+ u16 next_asid;

/*
* Access to this CR4 shadow and to H/W CR4 is protected by
@@ -104,7 +112,8 @@ struct tlb_state {

/*
* This is a list of all contexts that might exist in the TLB.
- * Since we don't yet use PCID, there is only one context.
+ * There is one per ASID that we use, and the ASID (what the
+ * CPU calls PCID) is the index into ctxts.
*
* For each context, ctx_id indicates which mm the TLB's user
* entries came from. As an invariant, the TLB will never
@@ -114,8 +123,13 @@ struct tlb_state {
* To be clear, this means that it's legal for the TLB code to
* flush the TLB without updating tlb_gen. This can happen
* (for now, at least) due to paravirt remote flushes.
+ *
+ * NB: context 0 is a bit special, since it's also used by
+ * various bits of init code. This is fine -- code that
+ * isn't aware of PCID will end up harmlessly flushing
+ * context 0.
*/
- struct tlb_context ctxs[1];
+ struct tlb_context ctxs[NR_DYNAMIC_ASIDS];
};
DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 7d6fa4676af9..9c9570d300ba 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,6 +812,7 @@ void __init zone_sizes_init(void)

DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
+ .next_asid = 1,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
};
EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5f932fd80881..cd7f604ee818 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,40 @@

atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);

+static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
+ u16 *new_asid, bool *need_flush)
+{
+ u16 asid;
+
+ if (!static_cpu_has(X86_FEATURE_PCID)) {
+ *new_asid = 0;
+ *need_flush = true;
+ return;
+ }
+
+ for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
+ if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
+ next->context.ctx_id)
+ continue;
+
+ *new_asid = asid;
+ *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
+ next_tlb_gen);
+ return;
+ }
+
+ /*
+ * We don't currently own an ASID slot on this CPU.
+ * Allocate a slot.
+ */
+ *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
+ if (*new_asid >= NR_DYNAMIC_ASIDS) {
+ *new_asid = 0;
+ this_cpu_write(cpu_tlbstate.next_asid, 1);
+ }
+ *need_flush = true;
+}
+
void leave_mm(int cpu)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -66,6 +100,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
{
unsigned cpu = smp_processor_id();
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+ u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
u64 next_tlb_gen;

/*
@@ -81,7 +116,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
WARN_ON_ONCE(!irqs_disabled());

- VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));
+ VM_BUG_ON(__read_cr3() != (__pa(real_prev->pgd) | prev_asid));

if (real_prev == next) {
if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
@@ -98,10 +133,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

- VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);

- if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
+ if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
next_tlb_gen) {
/*
* Ideally, we'd have a flush_tlb() variant that
@@ -109,9 +144,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* be faster on Xen PV and on hypothetical CPUs
* on which INVPCID is fast.
*/
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+ this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
next_tlb_gen);
- write_cr3(__pa(next->pgd));
+ write_cr3(__pa(next->pgd) | prev_asid);
trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
TLB_FLUSH_ALL);
}
@@ -122,6 +157,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* are not reflected in tlb_gen.)
*/
} else {
+ u16 new_asid;
+ bool need_flush;
+
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
* If our current stack is in vmalloc space and isn't
@@ -141,7 +179,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
cpumask_clear_cpu(cpu, mm_cpumask(real_prev));

- WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+ VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));

/*
* Start remote flushes and then read tlb_gen.
@@ -149,17 +187,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

- VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
- next->context.ctx_id);
+ choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);

- this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,
- next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
- next_tlb_gen);
- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- write_cr3(__pa(next->pgd));
+ if (need_flush) {
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id,
+ next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen,
+ next_tlb_gen);
+ write_cr3(__pa(next->pgd) | new_asid);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
+ } else {
+ /* The new ASID is already up to date. */
+ write_cr3(__pa(next->pgd) | new_asid | CR3_NOFLUSH);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
+ }

- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
}

load_mm_cr4(next);
@@ -170,6 +215,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
bool local, enum tlb_flush_reason reason)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);

/*
* Our memory ordering requirement is that any TLB fills that
@@ -179,12 +225,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* atomic64_read operation won't be reordered by the compiler.
*/
u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
- u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
+ u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);

/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());

- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
loaded_mm->context.ctx_id);

if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
@@ -256,7 +302,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
}

/* Both paths above update our state to mm_tlb_gen. */
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
+ this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
}

static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
--
2.9.4

2017-06-21 05:23:31

by Andy Lutomirski

Subject: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

x86's lazy TLB mode used to be fairly weak -- it would switch to
init_mm the first time it tried to flush a lazy TLB. This meant an
unnecessary CR3 write and, if the flush was remote, an unnecessary
IPI.

Rewrite it entirely. When we enter lazy mode, we simply remove the
cpu from mm_cpumask. This means that we need a way to figure out
whether we've missed a flush when we switch back out of lazy mode.
I use the tlb_gen machinery to track whether a context is up to
date.

Note to reviewers: this patch, by itself, looks a bit odd. I'm
using an array of length 1 containing (ctx_id, tlb_gen) rather than
just storing tlb_gen, and making it an array isn't necessary yet.
I'm doing this because the next few patches add PCID support, and,
with PCID, we need ctx_id, and the array will end up with a length
greater than 1. Making it an array now means that there will be
less churn and therefore less stress on your eyeballs.

NB: This is dubious but, AFAICT, still correct on Xen and UV.
xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
patch changes the way that mm_cpumask() works. This should be okay,
since Xen *also* iterates all online CPUs to find all the CPUs it
needs to twiddle.

The UV tlbflush code is rather dated and should be changed.

Cc: Andrew Banman <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Dimitri Sivanich <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 6 +-
arch/x86/include/asm/tlbflush.h | 4 -
arch/x86/mm/init.c | 1 -
arch/x86/mm/tlb.c | 227 +++++++++++++++++++------------------
arch/x86/xen/mmu_pv.c | 3 +-
5 files changed, 119 insertions(+), 122 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index e5295d485899..69a4f1ee86ac 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)

static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
- if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
- this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
+ int cpu = smp_processor_id();
+
+ if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
+ cpumask_clear_cpu(cpu, mm_cpumask(mm));
}

extern atomic64_t last_mm_ctx_id;
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4f6c30d6ec39..87b13e51e867 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -95,7 +95,6 @@ struct tlb_state {
* mode even if we've already switched back to swapper_pg_dir.
*/
struct mm_struct *loaded_mm;
- int state;

/*
* Access to this CR4 shadow and to H/W CR4 is protected by
@@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info);

-#define TLBSTATE_OK 1
-#define TLBSTATE_LAZY 2
-
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm)
{
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 88ee942cb47d..7d6fa4676af9 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,6 @@ void __init zone_sizes_init(void)

DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
- .state = 0,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
};
EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9f5ef7a5e74a..fea2b07ac7d8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -45,8 +45,8 @@ void leave_mm(int cpu)
if (loaded_mm == &init_mm)
return;

- if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
- BUG();
+ /* Warn if we're not lazy. */
+ WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));

switch_mm(NULL, &init_mm, NULL);
}
@@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
{
unsigned cpu = smp_processor_id();
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+ u64 next_tlb_gen;

/*
- * NB: The scheduler will call us with prev == next when
- * switching from lazy TLB mode to normal mode if active_mm
- * isn't changing. When this happens, there is no guarantee
- * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
+ * NB: The scheduler will call us with prev == next when switching
+ * from lazy TLB mode to normal mode if active_mm isn't changing.
+ * When this happens, we don't assume that CR3 (and hence
+ * cpu_tlbstate.loaded_mm) matches next.
*
* NB: leave_mm() calls us with prev == NULL and tsk == NULL.
*/

- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ /* We don't want flush_tlb_func_* to run concurrently with us. */
+ if (IS_ENABLED(CONFIG_PROVE_LOCKING))
+ WARN_ON_ONCE(!irqs_disabled());
+
+ VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));

if (real_prev == next) {
- /*
- * There's nothing to do: we always keep the per-mm control
- * regs in sync with cpu_tlbstate.loaded_mm. Just
- * sanity-check mm_cpumask.
- */
- if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
- cpumask_set_cpu(cpu, mm_cpumask(next));
- return;
- }
+ if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
+ /*
+ * There's nothing to do: we weren't lazy, and we
+ * aren't changing our mm. We don't need to flush
+ * anything, nor do we need to update CR3, CR4, or
+ * LDTR.
+ */
+ return;
+ }
+
+ /* Resume remote flushes and then read tlb_gen. */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ next->context.ctx_id);
+
+ if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
+ next_tlb_gen) {
+ /*
+ * Ideally, we'd have a flush_tlb() variant that
+ * takes the known CR3 value as input. This would
+ * be faster on Xen PV and on hypothetical CPUs
+ * on which INVPCID is fast.
+ */
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+ next_tlb_gen);
+ write_cr3(__pa(next->pgd));
+ /*
+ * This gets called via leave_mm() in the idle path
+ * where RCU functions differently. Tracing normally
+ * uses RCU, so we have to call the tracepoint
+ * specially here.
+ */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
+ }

- if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
- * If our current stack is in vmalloc space and isn't
- * mapped in the new pgd, we'll double-fault. Forcibly
- * map it.
+ * We just exited lazy mode, which means that CR4 and/or LDTR
+ * may be stale. (Changes to the required CR4 and LDTR states
+ * are not reflected in tlb_gen.)
*/
- unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
+ } else {
+ if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+ /*
+ * If our current stack is in vmalloc space and isn't
+ * mapped in the new pgd, we'll double-fault. Forcibly
+ * map it.
+ */
+ unsigned int stack_pgd_index =
+ pgd_index(current_stack_pointer());
+
+ pgd_t *pgd = next->pgd + stack_pgd_index;
+
+ if (unlikely(pgd_none(*pgd)))
+ set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+ }

- pgd_t *pgd = next->pgd + stack_pgd_index;
+ /* Stop remote flushes for the previous mm */
+ if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
+ cpumask_clear_cpu(cpu, mm_cpumask(real_prev));

- if (unlikely(pgd_none(*pgd)))
- set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
- }
+ WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));

- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
- atomic64_read(&next->context.tlb_gen));
+ /*
+ * Start remote flushes and then read tlb_gen.
+ */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);

- WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
- cpumask_set_cpu(cpu, mm_cpumask(next));
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
+ next->context.ctx_id);

- /*
- * Re-load page tables.
- *
- * This logic has an ordering constraint:
- *
- * CPU 0: Write to a PTE for 'next'
- * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
- * CPU 1: set bit 1 in next's mm_cpumask
- * CPU 1: load from the PTE that CPU 0 writes (implicit)
- *
- * We need to prevent an outcome in which CPU 1 observes
- * the new PTE value and CPU 0 observes bit 1 clear in
- * mm_cpumask. (If that occurs, then the IPI will never
- * be sent, and CPU 0's TLB will contain a stale entry.)
- *
- * The bad outcome can occur if either CPU's load is
- * reordered before that CPU's store, so both CPUs must
- * execute full barriers to prevent this from happening.
- *
- * Thus, switch_mm needs a full barrier between the
- * store to mm_cpumask and any operation that could load
- * from next->pgd. TLB fills are special and can happen
- * due to instruction fetches or for no reason at all,
- * and neither LOCK nor MFENCE orders them.
- * Fortunately, load_cr3() is serializing and gives the
- * ordering guarantee we need.
- */
- load_cr3(next->pgd);
-
- /*
- * This gets called via leave_mm() in the idle path where RCU
- * functions differently. Tracing normally uses RCU, so we have to
- * call the tracepoint specially here.
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,
+ next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+ next_tlb_gen);
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ write_cr3(__pa(next->pgd));

- /* Stop flush ipis for the previous mm */
- WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
- real_prev != &init_mm);
- cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+ /*
+ * This gets called via leave_mm() in the idle path where RCU
+ * functions differently. Tracing normally uses RCU, so we
+ * have to call the tracepoint specially here.
+ */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
+ }

- /* Load per-mm CR4 and LDTR state */
load_mm_cr4(next);
switch_ldt(real_prev, next);
}

-/*
- * The flush IPI assumes that a thread switch happens in this order:
- * [cpu0: the cpu that switches]
- * 1) switch_mm() either 1a) or 1b)
- * 1a) thread switch to a different mm
- * 1a1) set cpu_tlbstate to TLBSTATE_OK
- * Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
- * if cpu0 was in lazy tlb mode.
- * 1a2) update cpu active_mm
- * Now cpu0 accepts tlb flushes for the new mm.
- * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
- * Now the other cpus will send tlb flush ipis.
- * 1a4) change cr3.
- * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
- * Stop ipi delivery for the old mm. This is not synchronized with
- * the other cpus, but flush_tlb_func ignore flush ipis for the wrong
- * mm, and in the worst case we perform a superfluous tlb flush.
- * 1b) thread switch without mm change
- * cpu active_mm is correct, cpu0 already handles flush ipis.
- * 1b1) set cpu_tlbstate to TLBSTATE_OK
- * 1b2) test_and_set the cpu bit in cpu_vm_mask.
- * Atomically set the bit [other cpus will start sending flush ipis],
- * and test the bit.
- * 1b3) if the bit was 0: leave_mm was called, flush the tlb.
- * 2) switch %%esp, ie current
- *
- * The interrupt must handle 2 special cases:
- * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm.
- * - the cpu performs speculative tlb reads, i.e. even if the cpu only
- * runs in kernel space, the cpu could load tlb entries for user space
- * pages.
- *
- * The good news is that cpu_tlbstate is local to each cpu, no
- * write/read ordering problems.
- */
-
static void flush_tlb_func_common(const struct flush_tlb_info *f,
bool local, enum tlb_flush_reason reason)
{
@@ -215,12 +200,13 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
loaded_mm->context.ctx_id);

- if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
+ if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
/*
- * leave_mm() is adequate to handle any type of flush, and
- * we would prefer not to receive further IPIs.
+ * We're in lazy mode -- don't flush. We can get here on
+ * remote flushes due to races and on local flushes if a
+ * kernel thread coincidentally flushes the mm it's lazily
+ * still using.
*/
- leave_mm(smp_processor_id());
return;
}

@@ -317,6 +303,21 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
(info->end - info->start) >> PAGE_SHIFT);

if (is_uv_system()) {
+ /*
+ * This whole special case is confused. UV has a "Broadcast
+ * Assist Unit", which seems to be a fancy way to send IPIs.
+ * Back when x86 used an explicit TLB flush IPI, UV was
+ * optimized to use its own mechanism. These days, x86 uses
+ * smp_call_function_many(), but UV still uses a manual IPI,
+ * and that IPI's action is out of date -- it does a manual
+ * flush instead of calling flush_tlb_func_remote(). This
+ * means that the percpu tlb_gen variables won't be updated
+ * and we'll do pointless flushes on future context switches.
+ *
+ * Rather than hooking native_flush_tlb_others() here, I think
+ * that UV should be updated so that smp_call_function_many(),
+ * etc, are optimal on UV.
+ */
unsigned int cpu;

cpu = smp_processor_id();
@@ -375,6 +376,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,

if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
+
put_cpu();
}

@@ -383,8 +385,6 @@ static void do_flush_tlb_all(void *info)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
__flush_tlb_all();
- if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_LAZY)
- leave_mm(smp_processor_id());
}

void flush_tlb_all(void)
@@ -436,6 +436,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)

if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
flush_tlb_others(&batch->cpumask, &info);
+
cpumask_clear(&batch->cpumask);

put_cpu();
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 1d7a7213a310..f5df56fb8b5c 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1005,8 +1005,7 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
/* Get the "official" set of cpus referring to our pagetable. */
if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
for_each_online_cpu(cpu) {
- if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
- && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
+ if (per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
continue;
smp_call_function_single(cpu, drop_mm_ref_this_cpu, mm, 1);
}
--
2.9.4

2017-06-21 05:24:08

by Andy Lutomirski

Subject: [PATCH v3 09/11] x86/mm: Add nopcid to turn off PCID

The parameter is only present on x86_64 systems to save a few bytes,
as PCID is always disabled on x86_32.
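
For example, an otherwise ordinary kernel command line using the new
parameter might look like this (the other options are placeholders):

        root=/dev/sda1 ro quiet nopcid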

Signed-off-by: Andy Lutomirski <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 2 ++
arch/x86/kernel/cpu/common.c | 18 ++++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0f5c3b4347c6..aa385109ae58 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2648,6 +2648,8 @@
nopat [X86] Disable PAT (page attribute table extension of
pagetables) support.

+ nopcid [X86-64] Disable the PCID cpu feature.
+
norandmaps Don't use address space randomization. Equivalent to
echo 0 > /proc/sys/kernel/randomize_va_space

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c8b39870f33e..904485e7b230 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -168,6 +168,24 @@ static int __init x86_mpx_setup(char *s)
}
__setup("nompx", x86_mpx_setup);

+#ifdef CONFIG_X86_64
+static int __init x86_pcid_setup(char *s)
+{
+ /* require an exact match without trailing characters */
+ if (strlen(s))
+ return 0;
+
+ /* do not emit a message if the feature is not present */
+ if (!boot_cpu_has(X86_FEATURE_PCID))
+ return 1;
+
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+ pr_info("nopcid: PCID feature disabled\n");
+ return 1;
+}
+__setup("nopcid", x86_pcid_setup);
+#endif
+
static int __init x86_noinvpcid_setup(char *s)
{
/* noinvpcid doesn't accept parameters */
--
2.9.4

2017-06-21 05:24:28

by Andy Lutomirski

Subject: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

There are two kernel features that would benefit from tracking
how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
it up to date in real time:

- Lazy mm switching currently works by switching to init_mm when
it would otherwise flush. This is wasteful: there isn't fundamentally
any need to update CR3 at all when going lazy or when returning from
lazy mode, nor is there any need to receive flush IPIs at all. Instead,
we should just stop trying to keep the TLB coherent when we go lazy and,
when unlazying, check whether we missed any flushes.

- PCID will let us keep recent user contexts alive in the TLB. If we
start doing this, we need a way to decide whether those contexts are
up to date.

On some paravirt systems, remote TLBs can be flushed without IPIs.
This won't update the target CPUs' tlb_gens, which may cause
unnecessary local flushes later on. We can address this if it becomes
a problem by carefully updating the target CPU's tlb_gen directly.

By itself, this patch is a very minor optimization that avoids
unnecessary flushes when multiple TLB flushes targeting the same CPU
race.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 37 +++++++++++++++++++
arch/x86/mm/tlb.c | 79 +++++++++++++++++++++++++++++++++++++----
2 files changed, 109 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1eb946c0507e..4f6c30d6ec39 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,11 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
#define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
#endif

+struct tlb_context {
+ u64 ctx_id;
+ u64 tlb_gen;
+};
+
struct tlb_state {
/*
* cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
@@ -97,6 +102,21 @@ struct tlb_state {
* disabling interrupts when modifying either one.
*/
unsigned long cr4;
+
+ /*
+ * This is a list of all contexts that might exist in the TLB.
+ * Since we don't yet use PCID, there is only one context.
+ *
+ * For each context, ctx_id indicates which mm the TLB's user
+ * entries came from. As an invariant, the TLB will never
+ * contain entries that are out-of-date as when that mm reached
+ * the tlb_gen in the list.
+ *
+ * To be clear, this means that it's legal for the TLB code to
+ * flush the TLB without updating tlb_gen. This can happen
+ * (for now, at least) due to paravirt remote flushes.
+ */
+ struct tlb_context ctxs[1];
};
DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

@@ -248,9 +268,26 @@ static inline void __flush_tlb_one(unsigned long addr)
* and page-granular flushes are available only on i486 and up.
*/
struct flush_tlb_info {
+ /*
+ * We support several kinds of flushes.
+ *
+ * - Fully flush a single mm. flush_mm will be set, flush_end will be
+ * TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the
+ * IPI sender is trying to catch us up.
+ *
+ * - Partially flush a single mm. flush_mm will be set, flush_start
+ * and flush_end will indicate the range, and new_tlb_gen will be
+ * set such that the changes between generation new_tlb_gen-1 and
+ * new_tlb_gen are entirely contained in the indicated range.
+ *
+ * - Fully flush all mms whose tlb_gens have been updated. flush_mm
+ * will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen
+ * will be zero.
+ */
struct mm_struct *mm;
unsigned long start;
unsigned long end;
+ u64 new_tlb_gen;
};

#define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6d9d37323a43..9f5ef7a5e74a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -105,6 +105,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
}

this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+ atomic64_read(&next->context.tlb_gen));

WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
cpumask_set_cpu(cpu, mm_cpumask(next));
@@ -194,20 +197,73 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
static void flush_tlb_func_common(const struct flush_tlb_info *f,
bool local, enum tlb_flush_reason reason)
{
+ struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+
+ /*
+ * Our memory ordering requirement is that any TLB fills that
+ * happen after we flush the TLB are ordered after we read
+ * active_mm's tlb_gen. We don't need any explicit barrier
+ * because all x86 flush operations are serializing and the
+ * atomic64_read operation won't be reordered by the compiler.
+ */
+ u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
+ u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
+
/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());

+ VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ loaded_mm->context.ctx_id);
+
if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
+ /*
+ * leave_mm() is adequate to handle any type of flush, and
+ * we would prefer not to receive further IPIs.
+ */
leave_mm(smp_processor_id());
return;
}

- if (f->end == TLB_FLUSH_ALL) {
- local_flush_tlb();
- if (local)
- count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
- trace_tlb_flush(reason, TLB_FLUSH_ALL);
- } else {
+ if (local_tlb_gen == mm_tlb_gen) {
+ /*
+ * There's nothing to do: we're already up to date. This can
+ * happen if two concurrent flushes happen -- the first IPI to
+ * be handled can catch us all the way up, leaving no work for
+ * the second IPI to be handled.
+ */
+ return;
+ }
+
+ WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
+ WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
+
+ /*
+ * If we get to this point, we know that our TLB is out of date.
+ * This does not strictly imply that we need to flush (it's
+ * possible that f->new_tlb_gen <= local_tlb_gen), but we're
+ * going to need to flush in the very near future, so we might
+ * as well get it over with.
+ *
+ * The only question is whether to do a full or partial flush.
+ *
+ * A partial TLB flush is safe and worthwhile if two conditions are
+ * met:
+ *
+ * 1. We wouldn't be skipping a tlb_gen. If the requester bumped
+ * the mm's tlb_gen from p to p+1, a partial flush is only correct
+ * if we would be bumping the local CPU's tlb_gen from p to p+1 as
+ * well.
+ *
+ * 2. If there are no more flushes on their way. Partial TLB
+ * flushes are not all that much cheaper than full TLB
+ * flushes, so it seems unlikely that it would be a
+ * performance win to do a partial flush if that won't bring
+ * our TLB fully up to date.
+ */
+ if (f->end != TLB_FLUSH_ALL &&
+ f->new_tlb_gen == local_tlb_gen + 1 &&
+ f->new_tlb_gen == mm_tlb_gen) {
+ /* Partial flush */
unsigned long addr;
unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
addr = f->start;
@@ -218,7 +274,16 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
if (local)
count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
trace_tlb_flush(reason, nr_pages);
+ } else {
+ /* Full flush. */
+ local_flush_tlb();
+ if (local)
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+ trace_tlb_flush(reason, TLB_FLUSH_ALL);
}
+
+ /* Both paths above update our state to mm_tlb_gen. */
+ this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
}

static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
@@ -289,7 +354,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
cpu = get_cpu();

/* This is also a barrier that synchronizes with switch_mm(). */
- bump_mm_tlb_gen(mm);
+ info.new_tlb_gen = bump_mm_tlb_gen(mm);

/* Should we flush just the requested range? */
if ((end != TLB_FLUSH_ALL) &&
--
2.9.4

2017-06-21 05:24:42

by Andy Lutomirski

Subject: [PATCH v3 02/11] x86/ldt: Simplify LDT switching logic

Originally, Linux reloaded the LDT whenever the prev mm or the next
mm had an LDT. It was changed in 0bbed3beb4f2 ("[PATCH]
Thread-Local Storage (TLS) support") (from the historical tree) like
this:

- /* load_LDT, if either the previous or next thread
- * has a non-default LDT.
+ /*
+ * load the LDT, if the LDT is different:
*/
- if (next->context.size+prev->context.size)
+ if (unlikely(prev->context.ldt != next->context.ldt))
load_LDT(&next->context);

The current code is unlikely to avoid any LDT reloads, since different
mms won't share an LDT.

When we redo lazy mode to stop flush IPIs without switching to
init_mm, though, the current logic would become incorrect: it will
be possible to have real_prev == next but nonetheless have a stale
LDT descriptor.

Simplify the code to update LDTR if either the previous or the next
mm has an LDT, i.e. effectively restore the historical logic.
While we're at it, clean up the code by moving all the ifdeffery to
a header where it belongs.

Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 26 ++++++++++++++++++++++++++
arch/x86/mm/tlb.c | 20 ++------------------
2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 1458f530948b..ecfcb6643c9b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -93,6 +93,32 @@ static inline void load_mm_ldt(struct mm_struct *mm)
#else
clear_LDT();
#endif
+}
+
+static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
+{
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ /*
+ * Load the LDT if either the old or new mm had an LDT.
+ *
+ * An mm will never go from having an LDT to not having an LDT. Two
+ * mms never share an LDT, so we don't gain anything by checking to
+ * see whether the LDT changed. There's also no guarantee that
+ * prev->context.ldt actually matches LDTR, but, if LDTR is non-NULL,
+ * then prev->context.ldt will also be non-NULL.
+ *
+ * If we really cared, we could optimize the case where prev == next
+ * and we're exiting lazy mode. Most of the time, if this happens,
+ * we don't actually need to reload LDTR, but modify_ldt() is mostly
+ * used by legacy code and emulators where we don't need this level of
+ * performance.
+ *
+ * This uses | instead of || because it generates better code.
+ */
+ if (unlikely((unsigned long)prev->context.ldt |
+ (unsigned long)next->context.ldt))
+ load_mm_ldt(next);
+#endif

DEBUG_LOCKS_WARN_ON(preemptible());
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f06239c6919f..fd593833a854 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -148,25 +148,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
real_prev != &init_mm);
cpumask_clear_cpu(cpu, mm_cpumask(real_prev));

- /* Load per-mm CR4 state */
+ /* Load per-mm CR4 and LDTR state */
load_mm_cr4(next);
-
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
- /*
- * Load the LDT, if the LDT is different.
- *
- * It's possible that prev->context.ldt doesn't match
- * the LDT register. This can happen if leave_mm(prev)
- * was called and then modify_ldt changed
- * prev->context.ldt but suppressed an IPI to this CPU.
- * In this case, prev->context.ldt != NULL, because we
- * never set context.ldt to NULL while the mm still
- * exists. That means that next->context.ldt !=
- * prev->context.ldt, because mms never share an LDT.
- */
- if (unlikely(real_prev->context.ldt != next->context.ldt))
- load_mm_ldt(next);
-#endif
+ switch_ldt(real_prev, next);
}

/*
--
2.9.4
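
As an aside on the "| instead of ||" comment in switch_ldt() above, the
pattern can be sketched outside the kernel like this; either_has_ldt() is a
hypothetical helper, shown only to illustrate why the bitwise form tends to
generate a single test instead of two branches.

struct ctx { void *ldt; };

/*
 * With ||, the compiler typically tests each pointer separately and may
 * emit two branches; OR-ing the pointer bits first reduces the check to a
 * single test of the combined value, which is the trick used above.
 */
static int either_has_ldt(const struct ctx *prev, const struct ctx *next)
{
        return ((unsigned long)prev->ldt | (unsigned long)next->ldt) != 0;
}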

2017-06-21 08:02:04

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common()

On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> Nadav noticed the reentrancy issue in a different context, but
> neither of us realized that it caused a problem yet.
>
> Cc: Nadav Amit <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Reported-by: "Levin, Alexander (Sasha Levin)" <[email protected]>
> Fixes: 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm")
> Signed-off-by: Andy Lutomirski <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 08:03:18

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 02/11] x86/ldt: Simplify LDT switching logic

On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> When we redo lazy mode to stop flush IPIs without switching to
> init_mm, though, the current logic would become incorrect: it will
> be possible to have real_prev == next but nonetheless have a stale
> LDT descriptor.
>
> Simplify the code to update LDTR if either the previous or the next
> mm has an LDT, i.e. effectively restore the historical logic.
> While we're at it, clean up the code by moving all the ifdeffery to
> a header where it belongs.
>
> Acked-by: Rik van Riel <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 08:03:46

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] x86/mm: Remove reset_lazy_tlbstate()

On Tue, 20 Jun 2017, Andy Lutomirski wrote:

> The only call site also calls idle_task_exit(), and idle_task_exit()
> puts us into a clean state by explicitly switching to init_mm.
>
> Reviewed-by: Rik van Riel <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 08:06:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] x86/mm: Give each mm TLB flush generation a unique ID

On Tue, 20 Jun 2017, Andy Lutomirski wrote:

> This adds two new variables to mmu_context_t: ctx_id and tlb_gen.
> ctx_id uniquely identifies the mm_struct and will never be reused.
> For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic
> count of the number of times that a TLB flush has been requested.
> The pair (ctx_id, tlb_gen) can be used as an identifier for TLB
> flush actions and will be used in subsequent patches to reliably
> determine whether all needed TLB flushes have occurred on a given
> CPU.
>
> This patch is split out for ease of review. By itself, it has no
> real effect other than creating and updating the new variables.

Thanks for splitting this apart!

>
> Signed-off-by: Andy Lutomirski <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 08:32:25

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> struct flush_tlb_info {
> + /*
> + * We support several kinds of flushes.
> + *
> + * - Fully flush a single mm. flush_mm will be set, flush_end will be

flush_mm is the *mm member in the struct, right? You might rename that as a
preparatory step so comments and implementation match.

> + * TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the
> + * IPI sender is trying to catch us up.
> + *
> + * - Partially flush a single mm. flush_mm will be set, flush_start
> + * and flush_end will indicate the range, and new_tlb_gen will be
> + * set such that the changes between generation new_tlb_gen-1 and
> + * new_tlb_gen are entirely contained in the indicated range.
> + *
> + * - Fully flush all mms whose tlb_gens have been updated. flush_mm
> + * will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen
> + * will be zero.
> + */
> struct mm_struct *mm;
> unsigned long start;
> unsigned long end;
> + u64 new_tlb_gen;

Nit. While at it could you please make that struct tabular aligned as we
usually do in x86?

> static void flush_tlb_func_common(const struct flush_tlb_info *f,
> bool local, enum tlb_flush_reason reason)
> {
> + struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> +
> + /*
> + * Our memory ordering requirement is that any TLB fills that
> + * happen after we flush the TLB are ordered after we read
> + * active_mm's tlb_gen. We don't need any explicit barrier
> + * because all x86 flush operations are serializing and the
> + * atomic64_read operation won't be reordered by the compiler.
> + */

Can you please move the comment above the loaded_mm assignment?

> + u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
> + u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
> +
> /* This code cannot presently handle being reentered. */
> VM_WARN_ON(!irqs_disabled());
>
> + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
> + loaded_mm->context.ctx_id);
> +
> if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> + /*
> + * leave_mm() is adequate to handle any type of flush, and
> + * we would prefer not to receive further IPIs.

While I know what you mean, it might be useful to have a more elaborate
explanation why this prevents new IPIs.

> + */
> leave_mm(smp_processor_id());
> return;
> }
>
> - if (f->end == TLB_FLUSH_ALL) {
> - local_flush_tlb();
> - if (local)
> - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> - trace_tlb_flush(reason, TLB_FLUSH_ALL);
> - } else {
> + if (local_tlb_gen == mm_tlb_gen) {
> + /*
> + * There's nothing to do: we're already up to date. This can
> + * happen if two concurrent flushes happen -- the first IPI to
> + * be handled can catch us all the way up, leaving no work for
> + * the second IPI to be handled.

That's not restricted to IPIs, right? A local flush / IPI combo can do that
as well.

Other than those nits;

Reviewed-by: Thomas Gleixner <[email protected]>
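
For reference, the tabular alignment requested above would make the struct
look roughly like the sketch below (kernel types as in the quoted patch;
this is an illustration, not the final version of the patch).

struct flush_tlb_info {
        struct mm_struct        *mm;
        unsigned long           start;
        unsigned long           end;
        u64                     new_tlb_gen;
};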

2017-06-21 08:49:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common()

On Tue, Jun 20, 2017 at 10:22:07PM -0700, Andy Lutomirski wrote:
> It was historically possible to have two concurrent TLB flushes
> targeting the same CPU: one initiated locally and one initiated
> remotely. This can now cause an OOPS in leave_mm() at
> arch/x86/mm/tlb.c:47:
>
> if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> BUG();
>
> with this call trace:
> flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
> flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317

These line numbers would most likely mean nothing soon. I think you
should rather explain why the bug can happen so that future readers of
that code can find the spot...

>
> Without reentrancy, this OOPS is impossible: leave_mm() is only
> called if we're not in TLBSTATE_OK, but then we're unexpectedly
> in TLBSTATE_OK in leave_mm().
>
> This can be caused by flush_tlb_func_remote() happening between
> the two checks and calling leave_mm(), resulting in two consecutive
> leave_mm() calls on the same CPU with no intervening switch_mm()
> calls.

...like this, for example. That should be more future-code-changes-proof.

> We never saw this OOPS before because the old leave_mm()
> implementation didn't put us back in TLBSTATE_OK, so the assertion
> didn't fire.
>
> Nadav noticed the reentrancy issue in a different context, but
> neither of us realized that it caused a problem yet.
>
> Cc: Nadav Amit <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Reported-by: "Levin, Alexander (Sasha Levin)" <[email protected]>
> Fixes: 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm")
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/mm/tlb.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 2a5e851f2035..f06239c6919f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -208,6 +208,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> static void flush_tlb_func_common(const struct flush_tlb_info *f,
> bool local, enum tlb_flush_reason reason)
> {
> + /* This code cannot presently handle being reentered. */
> + VM_WARN_ON(!irqs_disabled());
> +
> if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> leave_mm(smp_processor_id());
> return;
> @@ -313,8 +316,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> info.end = TLB_FLUSH_ALL;
> }
>
> - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
> + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> + local_irq_disable();
> flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
> + local_irq_enable();
> + }

I'm assuming this is going away in a future patch, as disabling IRQs
around a TLB flush is kinda expensive. I guess I'll see if I continue
reading...

:)

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-21 09:01:46

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> -/*
> - * The flush IPI assumes that a thread switch happens in this order:
> - * [cpu0: the cpu that switches]
> - * 1) switch_mm() either 1a) or 1b)
> - * 1a) thread switch to a different mm
> - * 1a1) set cpu_tlbstate to TLBSTATE_OK
> - * Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
> - * if cpu0 was in lazy tlb mode.
> - * 1a2) update cpu active_mm
> - * Now cpu0 accepts tlb flushes for the new mm.
> - * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
> - * Now the other cpus will send tlb flush ipis.
> - * 1a4) change cr3.
> - * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
> - * Stop ipi delivery for the old mm. This is not synchronized with
> - * the other cpus, but flush_tlb_func ignore flush ipis for the wrong
> - * mm, and in the worst case we perform a superfluous tlb flush.
> - * 1b) thread switch without mm change
> - * cpu active_mm is correct, cpu0 already handles flush ipis.
> - * 1b1) set cpu_tlbstate to TLBSTATE_OK
> - * 1b2) test_and_set the cpu bit in cpu_vm_mask.
> - * Atomically set the bit [other cpus will start sending flush ipis],
> - * and test the bit.
> - * 1b3) if the bit was 0: leave_mm was called, flush the tlb.
> - * 2) switch %%esp, ie current
> - *
> - * The interrupt must handle 2 special cases:
> - * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm.
> - * - the cpu performs speculative tlb reads, i.e. even if the cpu only
> - * runs in kernel space, the cpu could load tlb entries for user space
> - * pages.
> - *
> - * The good news is that cpu_tlbstate is local to each cpu, no
> - * write/read ordering problems.

While the new code is really well commented, it would be a good thing to
have a single place where all of this including the ordering constraints
are documented.

> @@ -215,12 +200,13 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
> VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
> loaded_mm->context.ctx_id);
>
> - if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> + if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
> /*
> - * leave_mm() is adequate to handle any type of flush, and
> - * we would prefer not to receive further IPIs.
> + * We're in lazy mode -- don't flush. We can get here on
> + * remote flushes due to races and on local flushes if a
> + * kernel thread coincidentally flushes the mm it's lazily
> + * still using.

Ok. That's more informative.

Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 09:22:52

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] x86/mm: Stop calling leave_mm() in idle code

On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 216d7ec88c0c..2ae43f59091d 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -912,16 +912,15 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
> struct cpuidle_state *state = &drv->states[index];
> unsigned long eax = flg2MWAIT(state->flags);
> unsigned int cstate;
> - int cpu = smp_processor_id();
>
> cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
>
> /*
> - * leave_mm() to avoid costly and often unnecessary wakeups
> - * for flushing the user TLB's associated with the active mm.
> + * NB: if CPUIDLE_FLAG_TLB_FLUSHED is set, this idle transition
> + * will probably flush the TLB. It's not guaranteed to flush
> + * the TLB, though, so it's not clear that we can do anything
> + * useful with this knowledge.

So my understanding here is:

The C-state transition might flush the TLB, when cstate->flags has
CPUIDLE_FLAG_TLB_FLUSHED set. The idle transition already took the
CPU out of the set of CPUs which are remotely flushed, so the
knowledge about this potential flush is not useful for the kernels
view of the TLB state.

Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 09:26:32

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 08/11] x86/mm: Disable PCID on 32-bit kernels

On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> 32-bit kernels on new hardware will see PCID in CPUID, but PCID can
> only be used in 64-bit mode. Rather than making all PCID code
> conditional, just disable the feature on 32-bit builds.
>
> Signed-off-by: Andy Lutomirski <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 09:27:31

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 09/11] x86/mm: Add nopcid to turn off PCID

On Tue, 20 Jun 2017, Andy Lutomirski wrote:

> The parameter is only present on x86_64 systems to save a few bytes,
> as PCID is always disabled on x86_32.
>
> Signed-off-by: Andy Lutomirski <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 09:39:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 10/11] x86/mm: Enable CR4.PCIDE on supported systems

On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> + /* Set up PCID */
> + if (cpu_has(c, X86_FEATURE_PCID)) {
> + if (cpu_has(c, X86_FEATURE_PGE)) {
> + cr4_set_bits(X86_CR4_PCIDE);

So I assume that you made sure that the PCID bits in CR3 are zero under all
circumstances as setting PCIDE would cause a #GP if not.

Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 09:40:29

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 02/11] x86/ldt: Simplify LDT switching logic

On Tue, Jun 20, 2017 at 10:22:08PM -0700, Andy Lutomirski wrote:
> Originally, Linux reloaded the LDT whenever the prev mm or the next
> mm had an LDT. It was changed in 0bbed3beb4f2 ("[PATCH]
> Thread-Local Storage (TLS) support") (from the historical tree) like
> this:
>
> - /* load_LDT, if either the previous or next thread
> - * has a non-default LDT.
> + /*
> + * load the LDT, if the LDT is different:
> */
> - if (next->context.size+prev->context.size)
> + if (unlikely(prev->context.ldt != next->context.ldt))
> load_LDT(&next->context);
>
> The current code is unlikely to avoid any LDT reloads, since different
> mms won't share an LDT.
>
> When we redo lazy mode to stop flush IPIs without switching to
> init_mm, though, the current logic would become incorrect: it will
> be possible to have real_prev == next but nonetheless have a stale
> LDT descriptor.
>
> Simplify the code to update LDTR if either the previous or the next
> mm has an LDT, i.e. effectively restore the historical logic.
> While we're at it, clean up the code by moving all the ifdeffery to
> a header where it belongs.
>
> Acked-by: Rik van Riel <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/mmu_context.h | 26 ++++++++++++++++++++++++++
> arch/x86/mm/tlb.c | 20 ++------------------
> 2 files changed, 28 insertions(+), 18 deletions(-)

Reviewed-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-21 09:50:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] x86/mm: Remove reset_lazy_tlbstate()

On Tue, Jun 20, 2017 at 10:22:09PM -0700, Andy Lutomirski wrote:
> The only call site also calls idle_task_exit(), and idle_task_exit()
> puts us into a clean state by explicitly switching to init_mm.
>
> Reviewed-by: Rik van Riel <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/tlbflush.h | 8 --------
> arch/x86/kernel/smpboot.c | 1 -
> 2 files changed, 9 deletions(-)

Reviewed-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-21 10:33:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] x86/mm: Give each mm TLB flush generation a unique ID

On Tue, Jun 20, 2017 at 10:22:10PM -0700, Andy Lutomirski wrote:
> - * The x86 doesn't have a mmu context, but
> - * we put the segment information here.
> + * x86 has arch-specific MMU state beyond what lives in mm_struct.
> */
> typedef struct {
> + /*
> + * ctx_id uniquely identifies this mm_struct. A ctx_id will never
> + * be reused, and zero is not a valid ctx_id.
> + */
> + u64 ctx_id;
> +
> + /*
> + * Any code that needs to do any sort of TLB flushing for this
> + * mm will first make its changes to the page tables, then
> + * increment tlb_gen, then flush. This lets the low-level
> + * flushing code keep track of what needs flushing.
> + *
> + * This is not used on Xen PV.
> + */
> + atomic64_t tlb_gen;
> +
> #ifdef CONFIG_MODIFY_LDT_SYSCALL
> struct ldt_struct *ldt;
> #endif
> @@ -37,6 +53,11 @@ typedef struct {
> #endif
> } mm_context_t;
>
> +#define INIT_MM_CONTEXT(mm) \
> + .context = { \
> + .ctx_id = 1, \

So ctx_id of 0 is invalid?

Let's state that explicitly. We could even use it to sanity-check mms or
whatever.

> + }
> +
> void leave_mm(int cpu);
>
> #endif /* _ASM_X86_MMU_H */
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index ecfcb6643c9b..e5295d485899 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -129,9 +129,14 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
> }
>
> +extern atomic64_t last_mm_ctx_id;

I think we prefer externs/variable defines at the beginning of the file,
not intermixed with functions.

> +
> static inline int init_new_context(struct task_struct *tsk,
> struct mm_struct *mm)
> {
> + mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
> + atomic64_set(&mm->context.tlb_gen, 0);
> +
> #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
> /* pkey 0 is the default and always allocated */
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 50ea3482e1d1..1eb946c0507e 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -57,6 +57,23 @@ static inline void invpcid_flush_all_nonglobals(void)
> __invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
> }
>
> +static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)

inc_mm_tlb_gen() I guess. git grep says like "inc" more :-)

> +{
> + u64 new_tlb_gen;
> +
> + /*
> + * Bump the generation count. This also serves as a full barrier
> + * that synchronizes with switch_mm: callers are required to order

Please end function names with parentheses.

> + * their read of mm_cpumask after their writes to the paging
> + * structures.
> + */
> + smp_mb__before_atomic();
> + new_tlb_gen = atomic64_inc_return(&mm->context.tlb_gen);
> + smp_mb__after_atomic();
> +
> + return new_tlb_gen;
> +}
> +
> #ifdef CONFIG_PARAVIRT
> #include <asm/paravirt.h>
> #else

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
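
The point about zero being reserved can be seen in a small user-space sketch
of the allocation scheme. alloc_ctx_id() is an invented name; the kernel uses
atomic64_inc_return() on last_mm_ctx_id, which starts at 1, so the effect is
the same: init_mm keeps ctx_id 1, every later mm gets a strictly larger
value, and 0 is never handed out and can serve as an "invalid" sentinel.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t last_ctx_id = 1;        /* like ATOMIC64_INIT(1) */

static uint64_t alloc_ctx_id(void)
{
        /* fetch_add returns the old value; +1 gives inc_return() semantics */
        return atomic_fetch_add(&last_ctx_id, 1) + 1;
}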

2017-06-21 13:38:19

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> This patch uses PCID differently. We use a PCID to identify a
> recently-used mm on a per-cpu basis. An mm has no fixed PCID
> binding at all; instead, we give it a fresh PCID each time it's
> loaded except in cases where we want to preserve the TLB, in which
> case we reuse a recent value.
>
> This seems to save about 100ns on context switches between mms.

Depending on the workload, I assume. For a CPU switching between a large
number of processes consecutively it won't make a difference. In fact it will
be slower due to the extra few cycles required for rotating the asid, but I
doubt that this can be measured.

> +/*
> + * 6 because 6 should be plenty and struct tlb_state will fit in
> + * two cache lines.
> + */
> +#define NR_DYNAMIC_ASIDS 6

That requires a conditional branch

if (asid >= NR_DYNAMIC_ASIDS) {
asid = 0;
....
}

The question is whether 4 IDs would be sufficient, which trades the branch
for a mask operation. Or you go for 8 and spend another cache line.

> atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
>
> +static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> + u16 *new_asid, bool *need_flush)
> +{
> + u16 asid;
> +
> + if (!static_cpu_has(X86_FEATURE_PCID)) {
> + *new_asid = 0;
> + *need_flush = true;
> + return;
> + }
> +
> + for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
> + if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
> + next->context.ctx_id)
> + continue;
> +
> + *new_asid = asid;
> + *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
> + next_tlb_gen);
> + return;
> + }

Hmm. So this loop needs to be taken unconditionally even if the task stays
on the same CPU. And of course the number of dynamic IDs has to be short in
order to makes this loop suck performance wise.

Something like the completely dysfunctional draft below might be worthwhile to
explore. At least arch/x86/mm/ compiles :)

It gets rid of the loop search and lifts the limit of dynamic ids by
trading it with a percpu variable in mm_context_t.

Thanks,

tglx

8<--------------------

--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -25,6 +25,8 @@ typedef struct {
*/
atomic64_t tlb_gen;

+ u32 __percpu *asids;
+
#ifdef CONFIG_MODIFY_LDT_SYSCALL
struct ldt_struct *ldt;
#endif
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -156,11 +156,23 @@ static inline void destroy_context(struc
destroy_context_ldt(mm);
}

-extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk);
+extern void __switch_mm(struct mm_struct *prev, struct mm_struct *next);
+
+static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+ struct task_struct *tsk)
+{
+ __switch_mm(prev, next);
+}
+
+extern void __switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next);
+
+static inline void switch_mm_irqs_off(struct mm_struct *prev,
+ struct mm_struct *next,
+ struct task_struct *tsk)
+{
+ __switch_mm_irqs_off(prev, next);
+}

-extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk);
#define switch_mm_irqs_off switch_mm_irqs_off

#define activate_mm(prev, next) \
@@ -299,6 +311,9 @@ static inline unsigned long __get_curren
{
unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);

+ if (static_cpu_has(X86_FEATURE_PCID))
+ cr3 |= this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || !in_atomic());

--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -35,6 +35,7 @@
/* Mask off the address space ID bits. */
#define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull
#define CR3_PCID_MASK 0xFFFull
+#define CR3_NOFLUSH (1UL << 63)
#else
/*
* CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -42,6 +43,7 @@
*/
#define CR3_ADDR_MASK 0xFFFFFFFFull
#define CR3_PCID_MASK 0ull
+#define CR3_NOFLUSH 0
#endif

#endif /* _ASM_X86_PROCESSOR_FLAGS_H */
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,15 @@ static inline u64 bump_mm_tlb_gen(struct
#define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
#endif

+/*
+ * NR_DYNAMIC_ASIDS must be a power of 2. 4 makes tlb_state fit into two
+ * cache lines.
+ */
+#define NR_DYNAMIC_ASIDS_BITS 2
+#define NR_DYNAMIC_ASIDS (1U << NR_DYNAMIC_ASIDS_BITS)
+#define DYNAMIC_ASIDS_MASK (NR_DYNAMIC_ASIDS - 1)
+#define ASID_NEEDS_FLUSH (1U << 16)
+
struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
@@ -95,6 +104,8 @@ struct tlb_state {
* mode even if we've already switched back to swapper_pg_dir.
*/
struct mm_struct *loaded_mm;
+ u16 loaded_mm_asid;
+ u16 next_asid;

/*
* Access to this CR4 shadow and to H/W CR4 is protected by
@@ -104,7 +115,8 @@ struct tlb_state {

/*
* This is a list of all contexts that might exist in the TLB.
- * Since we don't yet use PCID, there is only one context.
+ * There is one per ASID that we use, and the ASID (what the
+ * CPU calls PCID) is the index into ctxts.
*
* For each context, ctx_id indicates which mm the TLB's user
* entries came from. As an invariant, the TLB will never
@@ -114,8 +126,13 @@ struct tlb_state {
* To be clear, this means that it's legal for the TLB code to
* flush the TLB without updating tlb_gen. This can happen
* (for now, at least) due to paravirt remote flushes.
+ *
+ * NB: context 0 is a bit special, since it's also used by
+ * various bits of init code. This is fine -- code that
+ * isn't aware of PCID will end up harmlessly flushing
+ * context 0.
*/
- struct tlb_context ctxs[1];
+ struct tlb_context ctxs[NR_DYNAMIC_ASIDS];
};
DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,6 +812,7 @@ void __init zone_sizes_init(void)

DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
+ .next_asid = 1,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
};
EXPORT_SYMBOL_GPL(cpu_tlbstate);
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,30 @@

atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);

+static u32 choose_new_asid(mm_context_t *mmctx, u64 next_tlb_gen)
+{
+ u64 next_ctx_id = mmctx->ctx_id;
+ struct tlb_context *ctx;
+ u32 asid;
+
+ if (!static_cpu_has(X86_FEATURE_PCID))
+ return ASID_NEEDS_FLUSH;
+
+ asid = this_cpu_read(*mmctx->asids);
+ ctx = this_cpu_ptr(cpu_tlbstate.ctxs + asid);
+ if (likely(ctx->ctx_id == next_ctx_id)) {
+ return ctx->tlb_gen == next_tlb_gen ?
+ asid : asid | ASID_NEEDS_FLUSH ;
+ }
+
+ /* Slot got reused. Get a new one */
+ asid = this_cpu_inc_return(cpu_tlbstate.next_asid) - 1;
+ asid &= DYNAMIC_ASIDS_MASK;
+ this_cpu_write(*mmctx->asids, asid);
+ ctx = this_cpu_ptr(cpu_tlbstate.ctxs + asid);
+ return asid | ASID_NEEDS_FLUSH;
+}
+
void leave_mm(int cpu)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -48,24 +72,23 @@ void leave_mm(int cpu)
/* Warn if we're not lazy. */
WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));

- switch_mm(NULL, &init_mm, NULL);
+ __switch_mm(NULL, &init_mm);
}

-void switch_mm(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk)
+void __switch_mm(struct mm_struct *prev, struct mm_struct *next)
{
unsigned long flags;

local_irq_save(flags);
- switch_mm_irqs_off(prev, next, tsk);
+ __switch_mm_irqs_off(prev, next);
local_irq_restore(flags);
}

-void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk)
+void __switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next)
{
unsigned cpu = smp_processor_id();
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+ u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
u64 next_tlb_gen;

/*
@@ -81,7 +104,7 @@ void switch_mm_irqs_off(struct mm_struct
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
WARN_ON_ONCE(!irqs_disabled());

- VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));
+ VM_BUG_ON(__read_cr3() != (__pa(real_prev->pgd) | prev_asid));

if (real_prev == next) {
if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
@@ -98,10 +121,10 @@ void switch_mm_irqs_off(struct mm_struct
cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

- VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);

- if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
+ if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
next_tlb_gen) {
/*
* Ideally, we'd have a flush_tlb() variant that
@@ -109,9 +132,9 @@ void switch_mm_irqs_off(struct mm_struct
* be faster on Xen PV and on hypothetical CPUs
* on which INVPCID is fast.
*/
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+ this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
next_tlb_gen);
- write_cr3(__pa(next->pgd));
+ write_cr3(__pa(next->pgd) | prev_asid);
trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
TLB_FLUSH_ALL);
}
@@ -122,6 +145,8 @@ void switch_mm_irqs_off(struct mm_struct
* are not reflected in tlb_gen.)
*/
} else {
+ u32 new_asid;
+
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
* If our current stack is in vmalloc space and isn't
@@ -141,7 +166,7 @@ void switch_mm_irqs_off(struct mm_struct
if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
cpumask_clear_cpu(cpu, mm_cpumask(real_prev));

- WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+ VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));

/*
* Start remote flushes and then read tlb_gen.
@@ -149,17 +174,25 @@ void switch_mm_irqs_off(struct mm_struct
cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

- VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
- next->context.ctx_id);
+ new_asid = choose_new_asid(&next->context, next_tlb_gen);

- this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,
- next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
- next_tlb_gen);
- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- write_cr3(__pa(next->pgd));
+ if (new_asid & ASID_NEEDS_FLUSH) {
+ new_asid &= DYNAMIC_ASIDS_MASK;
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id,
+ next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen,
+ next_tlb_gen);
+ write_cr3(__pa(next->pgd) | new_asid);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_FLUSH_ALL);
+ } else {
+ /* The new ASID is already up to date. */
+ write_cr3(__pa(next->pgd) | new_asid | CR3_NOFLUSH);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
+ }

- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
}

load_mm_cr4(next);
@@ -170,6 +203,7 @@ static void flush_tlb_func_common(const
bool local, enum tlb_flush_reason reason)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);

/*
* Our memory ordering requirement is that any TLB fills that
@@ -179,12 +213,12 @@ static void flush_tlb_func_common(const
* atomic64_read operation won't be reordered by the compiler.
*/
u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
- u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
+ u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);

/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());

- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+ VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
loaded_mm->context.ctx_id);

if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
@@ -256,7 +290,7 @@ static void flush_tlb_func_common(const
}

/* Both paths above update our state to mm_tlb_gen. */
- this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
+ this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
}

static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
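
To keep the CR3 composition in the diff above straight, here is a hypothetical
helper showing how the pieces fit together. build_cr3() is an invented name;
the masks are the ones defined in processor-flags.h in the diff.

#include <stdint.h>

#define CR3_PCID_MASK   0xFFFull                /* ASID/PCID in bits 11:0 */
#define CR3_NOFLUSH     (1ull << 63)            /* "don't flush" hint on CR3 load */

/*
 * Page-table base in the address bits, ASID in the low 12 bits, and the
 * no-flush bit set only when the chosen ASID's TLB contents are already up
 * to date, matching the two write_cr3() calls in the diff above.
 */
static uint64_t build_cr3(uint64_t pgd_pa, uint16_t asid, int noflush)
{
        uint64_t cr3 = pgd_pa | ((uint64_t)asid & CR3_PCID_MASK);

        if (noflush)
                cr3 |= CR3_NOFLUSH;
        return cr3;
}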


2017-06-21 13:40:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 10/11] x86/mm: Enable CR4.PCIDE on supported systems

On Wed, 21 Jun 2017, Thomas Gleixner wrote:

> On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> > + /* Set up PCID */
> > + if (cpu_has(c, X86_FEATURE_PCID)) {
> > + if (cpu_has(c, X86_FEATURE_PGE)) {
> > + cr4_set_bits(X86_CR4_PCIDE);
>
> So I assume that you made sure that the PCID bits in CR3 are zero under all
> circumstances as setting PCIDE would cause a #GP if not.

And what happens on kexec etc? We need to reset the asid and PCIDE I assume.

Thanks,

tglx

2017-06-21 13:41:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

On Wed, 21 Jun 2017, Thomas Gleixner wrote:
> > + for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
> > + if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
> > + next->context.ctx_id)
> > + continue;
> > +
> > + *new_asid = asid;
> > + *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
> > + next_tlb_gen);
> > + return;
> > + }
>
> Hmm. So this loop needs to be taken unconditionally even if the task stays
> on the same CPU. And of course the number of dynamic IDs has to be short in
> order to makes this loop suck performance wise.

... not suck ...

2017-06-21 15:11:38

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Wed, Jun 21, 2017 at 1:32 AM, Thomas Gleixner <[email protected]> wrote:
> On Tue, 20 Jun 2017, Andy Lutomirski wrote:
>> struct flush_tlb_info {
>> + /*
>> + * We support several kinds of flushes.
>> + *
>> + * - Fully flush a single mm. flush_mm will be set, flush_end will be
>
> flush_mm is the *mm member in the struct, right? You might rename that as a
> preparatory step so comments and implementation match.

The comment is outdated. Fixed now.

>
>> + * TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the
>> + * IPI sender is trying to catch us up.
>> + *
>> + * - Partially flush a single mm. flush_mm will be set, flush_start
>> + * and flush_end will indicate the range, and new_tlb_gen will be
>> + * set such that the changes between generation new_tlb_gen-1 and
>> + * new_tlb_gen are entirely contained in the indicated range.
>> + *
>> + * - Fully flush all mms whose tlb_gens have been updated. flush_mm
>> + * will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen
>> + * will be zero.
>> + */
>> struct mm_struct *mm;
>> unsigned long start;
>> unsigned long end;
>> + u64 new_tlb_gen;
>
> Nit. While at it could you please make that struct tabular aligned as we
> usually do in x86?

Sure.

>
>> static void flush_tlb_func_common(const struct flush_tlb_info *f,
>> bool local, enum tlb_flush_reason reason)
>> {
>> + struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
>> +
>> + /*
>> + * Our memory ordering requirement is that any TLB fills that
>> + * happen after we flush the TLB are ordered after we read
>> + * active_mm's tlb_gen. We don't need any explicit barrier
>> + * because all x86 flush operations are serializing and the
>> + * atomic64_read operation won't be reordered by the compiler.
>> + */
>
> Can you please move the comment above the loaded_mm assignment?

I'll move it above the function entirely. It's more of a general
comment about how the function works than any particular part of the
function.

>
>> + u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
>> + u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
>> +
>> /* This code cannot presently handle being reentered. */
>> VM_WARN_ON(!irqs_disabled());
>>
>> + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
>> + loaded_mm->context.ctx_id);
>> +
>> if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
>> + /*
>> + * leave_mm() is adequate to handle any type of flush, and
>> + * we would prefer not to receive further IPIs.
>
> While I know what you mean, it might be useful to have a more elaborate
> explanation why this prevents new IPIs.

Added, although it just gets deleted again later in the series.

>
>> + */
>> leave_mm(smp_processor_id());
>> return;
>> }
>>
>> - if (f->end == TLB_FLUSH_ALL) {
>> - local_flush_tlb();
>> - if (local)
>> - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>> - trace_tlb_flush(reason, TLB_FLUSH_ALL);
>> - } else {
>> + if (local_tlb_gen == mm_tlb_gen) {
>> + /*
>> + * There's nothing to do: we're already up to date. This can
>> + * happen if two concurrent flushes happen -- the first IPI to
>> + * be handled can catch us all the way up, leaving no work for
>> + * the second IPI to be handled.
>
> That's not restricted to IPIs, right? A local flush / IPI combo can do that
> as well.

Indeed. Comment fixed.

>
> Other than those nits;
>
> Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 15:16:02

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common()

On Wed, Jun 21, 2017 at 1:49 AM, Borislav Petkov <[email protected]> wrote:
> On Tue, Jun 20, 2017 at 10:22:07PM -0700, Andy Lutomirski wrote:
>> It was historically possible to have two concurrent TLB flushes
>> targeting the same CPU: one initiated locally and one initiated
>> remotely. This can now cause an OOPS in leave_mm() at
>> arch/x86/mm/tlb.c:47:
>>
>> if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
>> BUG();
>>
>> with this call trace:
>> flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
>> flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317
>
> These line numbers would most likely mean nothing soon. I think you
> should rather explain why the bug can happen so that future readers of
> that code can find the spot...
>

That's why I gave function names and the actual code :)

> I'm assuming this is going away in a future patch, as disabling IRQs
> around a TLB flush is kinda expensive. I guess I'll see if I continue
> reading...

No, it's still there. It's possible that it could be removed with
lots of care, but I'm not convinced it's worth it.
local_irq_disable() and local_irq_enable() are fast, though (3 cycles
each last time I benchmarked them?) -- it's local_irq_save() that
really hurts.

--Andy

2017-06-21 16:20:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] x86/mm: Stop calling leave_mm() in idle code

On Wed, Jun 21, 2017 at 2:22 AM, Thomas Gleixner <[email protected]> wrote:
> On Tue, 20 Jun 2017, Andy Lutomirski wrote:
>> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
>> index 216d7ec88c0c..2ae43f59091d 100644
>> --- a/drivers/idle/intel_idle.c
>> +++ b/drivers/idle/intel_idle.c
>> @@ -912,16 +912,15 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
>> struct cpuidle_state *state = &drv->states[index];
>> unsigned long eax = flg2MWAIT(state->flags);
>> unsigned int cstate;
>> - int cpu = smp_processor_id();
>>
>> cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
>>
>> /*
>> - * leave_mm() to avoid costly and often unnecessary wakeups
>> - * for flushing the user TLB's associated with the active mm.
>> + * NB: if CPUIDLE_FLAG_TLB_FLUSHED is set, this idle transition
>> + * will probably flush the TLB. It's not guaranteed to flush
>> + * the TLB, though, so it's not clear that we can do anything
>> + * useful with this knowledge.
>
> So my understanding here is:
>
> The C-state transition might flush the TLB, when cstate->flags has
> CPUIDLE_FLAG_TLB_FLUSHED set. The idle transition already took the
> CPU out of the set of CPUs which are remotely flushed, so the
> knowledge about this potential flush is not useful for the kernel's
> view of the TLB state.

Indeed. I assume the theory behind the old code was that leave_mm()
was expensive and that CPUIDLE_FLAG_TLB_FLUSHED would be a decent
heuristic for when to do it.

2017-06-21 16:23:58

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

On Wed, Jun 21, 2017 at 2:01 AM, Thomas Gleixner <[email protected]> wrote:
> On Tue, 20 Jun 2017, Andy Lutomirski wrote:
>> -/*
>> - * The flush IPI assumes that a thread switch happens in this order:
>> - * [cpu0: the cpu that switches]
>> - * 1) switch_mm() either 1a) or 1b)
>> - * 1a) thread switch to a different mm
>> - * 1a1) set cpu_tlbstate to TLBSTATE_OK
>> - * Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
>> - * if cpu0 was in lazy tlb mode.
>> - * 1a2) update cpu active_mm
>> - * Now cpu0 accepts tlb flushes for the new mm.
>> - * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
>> - * Now the other cpus will send tlb flush ipis.
>> - * 1a4) change cr3.
>> - * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
>> - * Stop ipi delivery for the old mm. This is not synchronized with
>> - * the other cpus, but flush_tlb_func ignore flush ipis for the wrong
>> - * mm, and in the worst case we perform a superfluous tlb flush.
>> - * 1b) thread switch without mm change
>> - * cpu active_mm is correct, cpu0 already handles flush ipis.
>> - * 1b1) set cpu_tlbstate to TLBSTATE_OK
>> - * 1b2) test_and_set the cpu bit in cpu_vm_mask.
>> - * Atomically set the bit [other cpus will start sending flush ipis],
>> - * and test the bit.
>> - * 1b3) if the bit was 0: leave_mm was called, flush the tlb.
>> - * 2) switch %%esp, ie current
>> - *
>> - * The interrupt must handle 2 special cases:
>> - * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm.
>> - * - the cpu performs speculative tlb reads, i.e. even if the cpu only
>> - * runs in kernel space, the cpu could load tlb entries for user space
>> - * pages.
>> - *
>> - * The good news is that cpu_tlbstate is local to each cpu, no
>> - * write/read ordering problems.
>
> While the new code is really well commented, it would be a good thing to
> have a single place where all of this including the ordering constraints
> are documented.

I'll look at the end of the whole series and see if I can come up with
something good.

>
>> @@ -215,12 +200,13 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
>> VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
>> loaded_mm->context.ctx_id);
>>
>> - if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
>> + if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
>> /*
>> - * leave_mm() is adequate to handle any type of flush, and
>> - * we would prefer not to receive further IPIs.
>> + * We're in lazy mode -- don't flush. We can get here on
>> + * remote flushes due to races and on local flushes if a
>> + * kernel thread coincidentally flushes the mm it's lazily
>> + * still using.
>
> Ok. That's more informative.
>
> Reviewed-by: Thomas Gleixner <[email protected]>

2017-06-21 16:38:57

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] x86/mm: Give each mm TLB flush generation a unique ID

On Wed, Jun 21, 2017 at 3:33 AM, Borislav Petkov <[email protected]> wrote:
> On Tue, Jun 20, 2017 at 10:22:10PM -0700, Andy Lutomirski wrote:
>> +#define INIT_MM_CONTEXT(mm) \
>> + .context = { \
>> + .ctx_id = 1, \
>
> So ctx_id of 0 is invalid?
>
> Let's state that explicitly. We could even use it to sanity-check mms or
> whatever.

It's stated explicitly in the comment where it's declared in the same file.

>
>> + }
>> +
>> void leave_mm(int cpu);
>>
>> #endif /* _ASM_X86_MMU_H */
>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>> index ecfcb6643c9b..e5295d485899 100644
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -129,9 +129,14 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>> this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
>> }
>>
>> +extern atomic64_t last_mm_ctx_id;
>
> I think we prefer externs/variable defines at the beginning of the file,
> not intermixed with functions.

Done

>
>> +static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
>
> inc_mm_tlb_gen() I guess. git grep says like "inc" more :-)

Done

>> + * that synchronizes with switch_mm: callers are required to order
>
> Please end function names with parentheses.

Done.

2017-06-21 17:06:36

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] x86/mm: Give each mm TLB flush generation a unique ID

On Wed, Jun 21, 2017 at 08:23:07AM -0700, Andy Lutomirski wrote:
> It's stated explicitly in the comment where it's declared in the same file.

Doh, it says "zero" there. I should learn how to read.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-21 17:29:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

On Wed, Jun 21, 2017 at 09:04:48AM -0700, Andy Lutomirski wrote:
> I'll look at the end of the whole series and see if I can come up with
> something good.

... along with the logic what we flush when, please. I.e., the text in
struct flush_tlb_info.

Thanks.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-21 17:43:39

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] x86/mm: Give each mm TLB flush generation a unique ID

On Tue, Jun 20, 2017 at 10:22:10PM -0700, Andy Lutomirski wrote:
> - * The x86 doesn't have a mmu context, but
> - * we put the segment information here.
> + * x86 has arch-specific MMU state beyond what lives in mm_struct.
> */
> typedef struct {
> + /*
> + * ctx_id uniquely identifies this mm_struct. A ctx_id will never
> + * be reused, and zero is not a valid ctx_id.
> + */
> + u64 ctx_id;
> +
> + /*
> + * Any code that needs to do any sort of TLB flushing for this
> + * mm will first make its changes to the page tables, then
> + * increment tlb_gen, then flush. This lets the low-level
> + * flushing code keep track of what needs flushing.
> + *
> + * This is not used on Xen PV.
> + */
> + atomic64_t tlb_gen;

Btw, can this just be a 4-byte int instead? I.e., simply atomic_t. I
mean, it should be enough for all the TLB generations in flight, no?

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-21 18:23:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] PCID and improved laziness

On Tue, Jun 20, 2017 at 10:22 PM, Andy Lutomirski <[email protected]> wrote:
> There are three performance benefits here:

Side note: can you post the actual performance numbers, even if only
from some silly test program on just one platform? Things like lmbench
pipe benchmark or something?

Or maybe you did, and I just missed it. But when talking about
performance, I'd really like to always see some actual numbers.

Linus

2017-06-21 18:44:43

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Tue, Jun 20, 2017 at 10:22:11PM -0700, Andy Lutomirski wrote:
> There are two kernel features that would benefit from tracking
> how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
> it up to date in real time:
>
> - Lazy mm switching currently works by switching to init_mm when
> it would otherwise flush. This is wasteful: there isn't fundamentally
> any need to update CR3 at all when going lazy or when returning from
> lazy mode, nor is there any need to receive flush IPIs at all. Instead,
> we should just stop trying to keep the TLB coherent when we go lazy and,
> when unlazying, check whether we missed any flushes.
>
> - PCID will let us keep recent user contexts alive in the TLB. If we
> start doing this, we need a way to decide whether those contexts are
> up to date.
>
> On some paravirt systems, remote TLBs can be flushed without IPIs.
> This won't update the target CPUs' tlb_gens, which may cause
> unnecessary local flushes later on. We can address this if it becomes
> a problem by carefully updating the target CPU's tlb_gen directly.
>
> By itself, this patch is a very minor optimization that avoids
> unnecessary flushes when multiple TLB flushes targeting the same CPU
> race.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/tlbflush.h | 37 +++++++++++++++++++
> arch/x86/mm/tlb.c | 79 +++++++++++++++++++++++++++++++++++++----
> 2 files changed, 109 insertions(+), 7 deletions(-)

...

> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 6d9d37323a43..9f5ef7a5e74a 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -105,6 +105,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> }
>
> this_cpu_write(cpu_tlbstate.loaded_mm, next);
> + this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> + atomic64_read(&next->context.tlb_gen));

Just let it stick out:

this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, atomic64_read(&next->context.tlb_gen));

Should be a bit better readable this way.

>
> WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
> cpumask_set_cpu(cpu, mm_cpumask(next));
> @@ -194,20 +197,73 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> static void flush_tlb_func_common(const struct flush_tlb_info *f,
> bool local, enum tlb_flush_reason reason)
> {
> + struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> +
> + /*
> + * Our memory ordering requirement is that any TLB fills that
> + * happen after we flush the TLB are ordered after we read
> + * active_mm's tlb_gen. We don't need any explicit barrier
> + * because all x86 flush operations are serializing and the
> + * atomic64_read operation won't be reordered by the compiler.
> + */
> + u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
> + u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
> +
> /* This code cannot presently handle being reentered. */
> VM_WARN_ON(!irqs_disabled());
>
> + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
> + loaded_mm->context.ctx_id);
> +
> if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> + /*
> + * leave_mm() is adequate to handle any type of flush, and
> + * we would prefer not to receive further IPIs.
> + */
> leave_mm(smp_processor_id());
> return;
> }
>
> - if (f->end == TLB_FLUSH_ALL) {
> - local_flush_tlb();
> - if (local)
> - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> - trace_tlb_flush(reason, TLB_FLUSH_ALL);
> - } else {
> + if (local_tlb_gen == mm_tlb_gen) {

if (unlikely(...

maybe?

Sounds to me like the concurrent flushes case would be the
uncommon one...

> + /*
> + * There's nothing to do: we're already up to date. This can
> + * happen if two concurrent flushes happen -- the first IPI to
> + * be handled can catch us all the way up, leaving no work for
> + * the second IPI to be handled.
> + */
> + return;
> + }


> +
> + WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
> + WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
> +
> + /*
> + * If we get to this point, we know that our TLB is out of date.
> + * This does not strictly imply that we need to flush (it's
> + * possible that f->new_tlb_gen <= local_tlb_gen), but we're
> + * going to need to flush in the very near future, so we might
> + * as well get it over with.
> + *
> + * The only question is whether to do a full or partial flush.
> + *
> + * A partial TLB flush is safe and worthwhile if two conditions are
> + * met:
> + *
> + * 1. We wouldn't be skipping a tlb_gen. If the requester bumped
> + * the mm's tlb_gen from p to p+1, a partial flush is only correct
> + * if we would be bumping the local CPU's tlb_gen from p to p+1 as
> + * well.
> + *
> + * 2. If there are no more flushes on their way. Partial TLB
> + * flushes are not all that much cheaper than full TLB
> + * flushes, so it seems unlikely that it would be a
> + * performance win to do a partial flush if that won't bring
> + * our TLB fully up to date.
> + */
> + if (f->end != TLB_FLUSH_ALL &&
> + f->new_tlb_gen == local_tlb_gen + 1 &&
> + f->new_tlb_gen == mm_tlb_gen) {

I'm certainly still missing something here:

We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
do once

bump_mm_tlb_gen(mm);

and once

info.new_tlb_gen = bump_mm_tlb_gen(mm);

and in both cases, the bumping is done on mm->context.tlb_gen.

So why isn't that enough to do the flushing and we have to consult
info.new_tlb_gen too?

> + /* Partial flush */
> unsigned long addr;
> unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;

<---- newline here.

> addr = f->start;
> @@ -218,7 +274,16 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
> if (local)
> count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
> trace_tlb_flush(reason, nr_pages);
> + } else {
> + /* Full flush. */
> + local_flush_tlb();
> + if (local)
> + count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> + trace_tlb_flush(reason, TLB_FLUSH_ALL);
> }
> +
> + /* Both paths above update our state to mm_tlb_gen. */
> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
> }
>
> static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
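
One concrete interleaving shows why f->new_tlb_gen is consulted in addition
to mm_tlb_gen. The generation numbers below are invented for illustration;
the conditions are the ones from the quoted patch.

#include <assert.h>
#include <stdint.h>

int main(void)
{
        /*
         * Flush A is a ranged flush that bumped tlb_gen 1 -> 2; flush B is
         * a full flush that bumped it 2 -> 3.  A's IPI is handled only
         * after B's bump, so mm_tlb_gen is already 3 while A asked for 2.
         */
        uint64_t local_tlb_gen = 1;     /* this CPU's last recorded generation */
        uint64_t mm_tlb_gen = 3;        /* both bumps already visible */
        uint64_t a_new_tlb_gen = 2;     /* what flush A's sender requested */

        /*
         * The partial-flush test must reject this case: doing only A's
         * range and then recording generation 3 would silently skip B's
         * full flush.
         */
        assert(!(a_new_tlb_gen == local_tlb_gen + 1 &&
                 a_new_tlb_gen == mm_tlb_gen));

        /* So the handler falls back to a full flush and records mm_tlb_gen. */
        local_tlb_gen = mm_tlb_gen;
        (void)local_tlb_gen;
        return 0;
}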

2017-06-21 20:34:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 10/11] x86/mm: Enable CR4.PCIDE on supported systems

On Wed, Jun 21, 2017 at 2:39 AM, Thomas Gleixner <[email protected]> wrote:
> On Tue, 20 Jun 2017, Andy Lutomirski wrote:
>> + /* Set up PCID */
>> + if (cpu_has(c, X86_FEATURE_PCID)) {
>> + if (cpu_has(c, X86_FEATURE_PGE)) {
>> + cr4_set_bits(X86_CR4_PCIDE);
>
> So I assume that you made sure that the PCID bits in CR3 are zero under all
> circumstances as setting PCIDE would cause a #GP if not.

Yes. All existing code just shoves a PA of a page table in there. As
far as I know, neither Linux nor anyone else uses the silly PCD and
PWT bits. It's not even clear to me that they are functional if PAT
is enabled.

>
> And what happens on kexec etc? We need to reset the asid and PCIDE I assume.
>

I assume it works roughly the same way as suspend, etc --
mmu_cr4_features has the desired CR4 and the init code deals with it.
And PGE, PKE, etc all work correctly. I'm not sure why PCIDE needs to
be cleared -- the init code will work just fine even if PCIDE is
unexpectedly set.

That being said, I haven't managed to understand what exactly the
kexec code is doing. But I think the relevant bit is here in
relocate_kernel_64.S:

/*
* Set cr4 to a known state:
* - physical address extension enabled
*/
movl $X86_CR4_PAE, %eax
movq %rax, %cr4

Kexec folks, is it safe to assume that kexec can already deal with the
new and old kernels disagreeing on what CR4 should be?

2017-06-21 23:26:48

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common()

Andy Lutomirski <[email protected]> wrote:

> index 2a5e851f2035..f06239c6919f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -208,6 +208,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> static void flush_tlb_func_common(const struct flush_tlb_info *f,
> bool local, enum tlb_flush_reason reason)
> {
> + /* This code cannot presently handle being reentered. */
> + VM_WARN_ON(!irqs_disabled());
> +
> if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> leave_mm(smp_processor_id());
> return;
> @@ -313,8 +316,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> info.end = TLB_FLUSH_ALL;
> }
>
> - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
> + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {

Perhaps you want to add:

VM_WARN_ON(irqs_disabled());

here

> + local_irq_disable();
> flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
> + local_irq_enable();
> + }
> +
> if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> flush_tlb_others(mm_cpumask(mm), &info);
> put_cpu();
> @@ -370,8 +377,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>
> int cpu = get_cpu();
>
> - if (cpumask_test_cpu(cpu, &batch->cpumask))
> + if (cpumask_test_cpu(cpu, &batch->cpumask)) {

and here?

> + local_irq_disable();
> flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
> + local_irq_enable();
> + }
> +
> if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
> flush_tlb_others(&batch->cpumask, &info);
> cpumask_clear(&batch->cpumask);
> --
> 2.9.4



2017-06-22 02:28:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common()

On Wed, Jun 21, 2017 at 4:26 PM, Nadav Amit <[email protected]> wrote:
> Andy Lutomirski <[email protected]> wrote:
>
>> index 2a5e851f2035..f06239c6919f 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -208,6 +208,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>> static void flush_tlb_func_common(const struct flush_tlb_info *f,
>> bool local, enum tlb_flush_reason reason)
>> {
>> + /* This code cannot presently handle being reentered. */
>> + VM_WARN_ON(!irqs_disabled());
>> +
>> if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
>> leave_mm(smp_processor_id());
>> return;
>> @@ -313,8 +316,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>> info.end = TLB_FLUSH_ALL;
>> }
>>
>> - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
>> + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>
> Perhaps you want to add:
>
> VM_WARN_ON(irqs_disabled());
>
> here
>
>> + local_irq_disable();
>> flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
>> + local_irq_enable();
>> + }
>> +
>> if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>> flush_tlb_others(mm_cpumask(mm), &info);
>> put_cpu();
>> @@ -370,8 +377,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>>
>> int cpu = get_cpu();
>>
>> - if (cpumask_test_cpu(cpu, &batch->cpumask))
>> + if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>
> and here?
>

Will do.

What I really want is lockdep_assert_irqs_disabled() or, even better,
for this to be implicit when calling local_irq_disable(). Ingo?
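
(As a sketch, even something as dumb as

	#define lockdep_assert_irqs_disabled()	WARN_ON_ONCE(!irqs_disabled())

would do for a start -- the name is just what I'd call it; no such macro
exists today.)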

--Andy

2017-06-22 02:34:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 04/11] x86/mm: Give each mm TLB flush generation a unique ID

On Wed, Jun 21, 2017 at 10:43 AM, Borislav Petkov <[email protected]> wrote:
> On Tue, Jun 20, 2017 at 10:22:10PM -0700, Andy Lutomirski wrote:
>> - * The x86 doesn't have a mmu context, but
>> - * we put the segment information here.
>> + * x86 has arch-specific MMU state beyond what lives in mm_struct.
>> */
>> typedef struct {
>> + /*
>> + * ctx_id uniquely identifies this mm_struct. A ctx_id will never
>> + * be reused, and zero is not a valid ctx_id.
>> + */
>> + u64 ctx_id;
>> +
>> + /*
>> + * Any code that needs to do any sort of TLB flushing for this
>> + * mm will first make its changes to the page tables, then
>> + * increment tlb_gen, then flush. This lets the low-level
>> + * flushing code keep track of what needs flushing.
>> + *
>> + * This is not used on Xen PV.
>> + */
>> + atomic64_t tlb_gen;
>
> Btw, can this just be a 4-byte int instead? I.e., simply atomic_t. I
> mean, it should be enough for all the TLB generations in flight, no?

There can only be NR_CPUS generations that actually mean anything at
any given time, but I think they can be arbitrarily discontinuous.
Imagine a malicious program that does:

set affinity to CPU 1
mmap()
set affinity to CPU 0
for (i = 0; i < (1ULL<<32); i++) {
munmap();
mmap();
}
set affinity to CPU 1

With just atomic_t, this could blow up.
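
Spelled out as a purely illustrative userspace sketch (untested; pin_to()
is a made-up sched_setaffinity() helper, not anything in the tree):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>

static void pin_to(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
	unsigned long long i;
	void *p;

	pin_to(1);
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pin_to(0);

	for (i = 0; i < (1ULL << 32); i++) {
		*(volatile char *)p = 1;	/* populate the page */
		munmap(p, 4096);		/* bumps the mm's tlb_gen */
		p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	}

	pin_to(1);	/* a 32-bit tlb_gen could have wrapped by now */
	return 0;
}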

2017-06-22 02:46:28

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Wed, Jun 21, 2017 at 11:44 AM, Borislav Petkov <[email protected]> wrote:
> On Tue, Jun 20, 2017 at 10:22:11PM -0700, Andy Lutomirski wrote:
>> + this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
>> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
>> + atomic64_read(&next->context.tlb_gen));
>
> Just let it stick out:
>
> this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
> this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, atomic64_read(&next->context.tlb_gen));
>
> Should be a bit better readable this way.

Done

>> + if (local_tlb_gen == mm_tlb_gen) {
>
> if (unlikely(...
>
> maybe?
>
> Sounds to me like the concurrent flushes case would be the
> uncommon one...

Agreed.

>> +
>> + WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
>> + WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
>> +
>> + /*
>> + * If we get to this point, we know that our TLB is out of date.
>> + * This does not strictly imply that we need to flush (it's
>> + * possible that f->new_tlb_gen <= local_tlb_gen), but we're
>> + * going to need to flush in the very near future, so we might
>> + * as well get it over with.
>> + *
>> + * The only question is whether to do a full or partial flush.
>> + *
>> + * A partial TLB flush is safe and worthwhile if two conditions are
>> + * met:
>> + *
>> + * 1. We wouldn't be skipping a tlb_gen. If the requester bumped
>> + * the mm's tlb_gen from p to p+1, a partial flush is only correct
>> + * if we would be bumping the local CPU's tlb_gen from p to p+1 as
>> + * well.
>> + *
>> + * 2. If there are no more flushes on their way. Partial TLB
>> + * flushes are not all that much cheaper than full TLB
>> + * flushes, so it seems unlikely that it would be a
>> + * performance win to do a partial flush if that won't bring
>> + * our TLB fully up to date.
>> + */
>> + if (f->end != TLB_FLUSH_ALL &&
>> + f->new_tlb_gen == local_tlb_gen + 1 &&
>> + f->new_tlb_gen == mm_tlb_gen) {
>
> I'm certainly still missing something here:
>
> We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
> do once
>
> bump_mm_tlb_gen(mm);
>
> and once
>
> info.new_tlb_gen = bump_mm_tlb_gen(mm);
>
> and in both cases, the bumping is done on mm->context.tlb_gen.
>
> So why isn't that enough to do the flushing and we have to consult
> info.new_tlb_gen too?

The issue is a possible race. Suppose we start at tlb_gen == 1 and
then two concurrent flushes happen. The first flush is a full flush
and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen
to 3. If the second flush gets propagated to a given CPU first and it
were to do an actual partial flush (INVLPG) and set the percpu tlb_gen
to 3, then the first flush won't do anything and we'll fail to flush
all the pages we need to flush.

My solution was to say that we're only allowed to do INVLPG if we're
making exactly the same change to the local tlb_gen that the requester
made to context.tlb_gen.

I'll add a comment to this effect.
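
Something along these lines, using the same numbers as above (sketch of
the comment, not final wording):

	/*
	 * Example: local_tlb_gen == 1.  A full flush bumps mm_tlb_gen to 2
	 * (f1->new_tlb_gen == 2) and a concurrent partial flush bumps it to
	 * 3 (f2->new_tlb_gen == 3).  If f2's IPI arrives first, then
	 * f2->new_tlb_gen (3) != local_tlb_gen + 1 (2), so we do a full
	 * flush and set local_tlb_gen to 3; f1's IPI then sees
	 * local_tlb_gen == mm_tlb_gen and does nothing.  Had we done INVLPG
	 * for f2 and written 3 into local_tlb_gen, f1's full flush would
	 * have silently become a no-op.
	 */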

>
>> + /* Partial flush */
>> unsigned long addr;
>> unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
>
> <---- newline here.

Yup.

2017-06-22 02:58:11

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

On Wed, Jun 21, 2017 at 6:38 AM, Thomas Gleixner <[email protected]> wrote:
> On Tue, 20 Jun 2017, Andy Lutomirski wrote:
>> This patch uses PCID differently. We use a PCID to identify a
>> recently-used mm on a per-cpu basis. An mm has no fixed PCID
>> binding at all; instead, we give it a fresh PCID each time it's
>> loaded except in cases where we want to preserve the TLB, in which
>> case we reuse a recent value.
>>
>> This seems to save about 100ns on context switches between mms.
>
> Depending on the work load I assume. For a CPU switching between a large
> number of processes consecutively it won't make a change. In fact it will
> be slower due to the extra few cycles required for rotating the asid, but I
> doubt that this can be measured.

True. I suspect this can be improved -- see below.

>
>> +/*
>> + * 6 because 6 should be plenty and struct tlb_state will fit in
>> + * two cache lines.
>> + */
>> +#define NR_DYNAMIC_ASIDS 6
>
> That requires a conditional branch
>
> if (asid >= NR_DYNAMIC_ASIDS) {
> asid = 0;
> ....
> }
>
> The question is whether 4 IDs would be sufficient which trades the branch
> for a mask operation. Or you go for 8 and spend another cache line.

Interesting. I'm inclined to either leave it at 6 or reduce it to 4
for now and to optimize later.

>
>> atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
>>
>> +static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>> + u16 *new_asid, bool *need_flush)
>> +{
>> + u16 asid;
>> +
>> + if (!static_cpu_has(X86_FEATURE_PCID)) {
>> + *new_asid = 0;
>> + *need_flush = true;
>> + return;
>> + }
>> +
>> + for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
>> + if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
>> + next->context.ctx_id)
>> + continue;
>> +
>> + *new_asid = asid;
>> + *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
>> + next_tlb_gen);
>> + return;
>> + }
>
> Hmm. So this loop needs to be taken unconditionally even if the task stays
> on the same CPU. And of course the number of dynamic IDs has to be short in
> order to makes this loop suck performance wise.
>
> Something like the completely disfunctional below might be worthwhile to
> explore. At least arch/x86/mm/ compiles :)
>
> It gets rid of the loop search and lifts the limit of dynamic ids by
> trading it with a percpu variable in mm_context_t.

That would work, but it would take a lot more memory on large systems
with lots of processes, and I'd also be concerned that we might run
out of dynamic percpu space.

How about a different idea: make the percpu data structure look like a
4-way set associative cache. The ctxs array could be, say, 1024
entries long without using crazy amounts of memory. We'd divide it
into 256 buckets, so you'd index it like ctxs[4*bucket + slot]. For
each mm, we choose a random bucket (from 0 through 255), and then we'd
just loop over the four slots in the bucket in choose_asid(). This
would require very slightly more arithmetic (I'd guess only one or two
cycles, though) but, critically, wouldn't touch any more cachelines.

The downside of both of these approaches over the one in this patch is
that the chance that the percpu cacheline we need is not in the cache
is quite a bit higher since it's potentially a different cacheline for
each mm. It would probably still be a win because avoiding the flush
is really quite valuable.

What do you think? The added code would be tiny.
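
Very roughly, with made-up names (untested sketch, not an actual patch;
asid_bucket would be a per-mm field chosen randomly at mm creation):

#define ASID_WAYS	4
#define ASID_BUCKETS	256	/* 4 * 256 = 1024 tlb_context slots */

static void choose_asid(struct mm_struct *next, u64 next_tlb_gen,
			u16 *new_asid, bool *need_flush)
{
	u16 base = next->context.asid_bucket * ASID_WAYS;
	u16 i;

	for (i = 0; i < ASID_WAYS; i++) {
		u16 asid = base + i;

		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
		    next->context.ctx_id)
			continue;

		*new_asid = asid;
		*need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
			       next_tlb_gen);
		return;
	}

	/* Miss: evict one of the four slots in our bucket. */
	*new_asid = base + (prandom_u32() % ASID_WAYS);
	*need_flush = true;
}

With 16-byte tlb_context entries, the four slots of a bucket are exactly
one cacheline, so the hit path touches the same single cacheline every
time for a given mm.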

(P.S. Why doesn't random_p32() try arch_random_int()?)

--Andy

2017-06-22 05:19:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 00/11] PCID and improved laziness

On Wed, Jun 21, 2017 at 11:23 AM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Jun 20, 2017 at 10:22 PM, Andy Lutomirski <[email protected]> wrote:
>> There are three performance benefits here:
>
> Side note: can you post the actual performance numbers, even if only
> from some silly test program on just one platform? Things like lmbench
> pipe benchmark or something?
>
> Or maybe you did, and I just missed it. But when talking about
> performance, I'd really like to always see some actual numbers.

Here are some timings using KVM:

pingpong between two processes using eventfd:
patched: 883ns
unpatched: 1046ns (with considerably higher variance)

madvise(MADV_DONTNEED); write to the page; switch CPUs:
patched: ~12.5us
unpatched: 19us

The latter test is a somewhat contrived example to show off the
improved laziness. Current kernels send an IPI on each iteration if
the system is otherwise idle.
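
For reference, the second test is roughly this shape (illustrative only,
not the actual benchmark; pin_to() is a hypothetical sched_setaffinity()
wrapper and p is a private anonymous mapping):

	for (i = 0; i < iters; i++) {
		pin_to(0);
		madvise(p, 4096, MADV_DONTNEED);
		*(volatile char *)p = 1;	/* write to the page */
		pin_to(1);			/* switch CPUs */
	}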

2017-06-22 07:25:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote:
> > I'm certainly still missing something here:
> >
> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
> > do once
> >
> > bump_mm_tlb_gen(mm);
> >
> > and once
> >
> > info.new_tlb_gen = bump_mm_tlb_gen(mm);
> >
> > and in both cases, the bumping is done on mm->context.tlb_gen.
> >
> > So why isn't that enough to do the flushing and we have to consult
> > info.new_tlb_gen too?
>
> The issue is a possible race. Suppose we start at tlb_gen == 1 and
> then two concurrent flushes happen. The first flush is a full flush
> and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen
> to 3. If the second flush gets propagated to a given CPU first and it

Maybe I'm still missing something, which is likely...

but if the second flush gets propagated to the CPU first, the CPU will
have local tlb_gen 1 and thus enforce a full flush anyway because we
will go 1 -> 3 on that particular CPU. Or?

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-22 07:32:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common()


* Andy Lutomirski <[email protected]> wrote:

> On Wed, Jun 21, 2017 at 4:26 PM, Nadav Amit <[email protected]> wrote:
> > Andy Lutomirski <[email protected]> wrote:
> >
> >> index 2a5e851f2035..f06239c6919f 100644
> >> --- a/arch/x86/mm/tlb.c
> >> +++ b/arch/x86/mm/tlb.c
> >> @@ -208,6 +208,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> >> static void flush_tlb_func_common(const struct flush_tlb_info *f,
> >> bool local, enum tlb_flush_reason reason)
> >> {
> >> + /* This code cannot presently handle being reentered. */
> >> + VM_WARN_ON(!irqs_disabled());
> >> +
> >> if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> >> leave_mm(smp_processor_id());
> >> return;
> >> @@ -313,8 +316,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >> info.end = TLB_FLUSH_ALL;
> >> }
> >>
> >> - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
> >> + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> >
> > Perhaps you want to add:
> >
> > VM_WARN_ON(irqs_disabled());
> >
> > here
> >
> >> + local_irq_disable();
> >> flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
> >> + local_irq_enable();
> >> + }
> >> +
> >> if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> >> flush_tlb_others(mm_cpumask(mm), &info);
> >> put_cpu();
> >> @@ -370,8 +377,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> >>
> >> int cpu = get_cpu();
> >>
> >> - if (cpumask_test_cpu(cpu, &batch->cpumask))
> >> + if (cpumask_test_cpu(cpu, &batch->cpumask)) {
> >
> > and here?
> >
>
> Will do.
>
> What I really want is lockdep_assert_irqs_disabled() or, even better,
> for this to be implicit when calling local_irq_disable(). Ingo?

I tried that once many years ago and IIRC there were problems - but maybe we could
try it again and enforce it, as I agree that the following pattern:

local_irq_disable();
...
local_irq_disable();
...
local_irq_enable();
...
local_irq_enable();

.. is actively dangerous.
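
The safe pattern for code that may be called with interrupts already
disabled is of course local_irq_save()/local_irq_restore():

	unsigned long flags;

	local_irq_save(flags);
	...
	local_irq_restore(flags);

but that doesn't help us catch the places which use the plain
disable/enable pair by mistake.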

Thanks,

Ingo

Subject: [tip:x86/mm] x86/ldt: Simplify the LDT switching logic

Commit-ID: 7353425881b170a24990b4d3bdcd14b1156fa8bd
Gitweb: http://git.kernel.org/tip/7353425881b170a24990b4d3bdcd14b1156fa8bd
Author: Andy Lutomirski <[email protected]>
AuthorDate: Tue, 20 Jun 2017 22:22:08 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 22 Jun 2017 10:57:50 +0200

x86/ldt: Simplify the LDT switching logic

Originally, Linux reloaded the LDT whenever the prev mm or the next
mm had an LDT. It was changed in 2002 in:

0bbed3beb4f2 ("[PATCH] Thread-Local Storage (TLS) support")

(commit from the historical tree), like this:

- /* load_LDT, if either the previous or next thread
- * has a non-default LDT.
+ /*
+ * load the LDT, if the LDT is different:
*/
- if (next->context.size+prev->context.size)
+ if (unlikely(prev->context.ldt != next->context.ldt))
load_LDT(&next->context);

The current code is unlikely to avoid any LDT reloads, since different
mms won't share an LDT.

When we redo lazy mode to stop flush IPIs without switching to
init_mm, though, the current logic would become incorrect: it will
be possible to have real_prev == next but nonetheless have a stale
LDT descriptor.

Simplify the code to update LDTR if either the previous or the next
mm has an LDT, i.e. effectively restore the historical logic.
While we're at it, clean up the code by moving all the ifdeffery to
a header where it belongs.

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/2a859ac01245f9594c58f9d0a8b2ed8a7cd2507e.1498022414.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 26 ++++++++++++++++++++++++++
arch/x86/mm/tlb.c | 20 ++------------------
2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 1458f53..ecfcb66 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -93,6 +93,32 @@ static inline void load_mm_ldt(struct mm_struct *mm)
#else
clear_LDT();
#endif
+}
+
+static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
+{
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ /*
+ * Load the LDT if either the old or new mm had an LDT.
+ *
+ * An mm will never go from having an LDT to not having an LDT. Two
+ * mms never share an LDT, so we don't gain anything by checking to
+ * see whether the LDT changed. There's also no guarantee that
+ * prev->context.ldt actually matches LDTR, but, if LDTR is non-NULL,
+ * then prev->context.ldt will also be non-NULL.
+ *
+ * If we really cared, we could optimize the case where prev == next
+ * and we're exiting lazy mode. Most of the time, if this happens,
+ * we don't actually need to reload LDTR, but modify_ldt() is mostly
+ * used by legacy code and emulators where we don't need this level of
+ * performance.
+ *
+ * This uses | instead of || because it generates better code.
+ */
+ if (unlikely((unsigned long)prev->context.ldt |
+ (unsigned long)next->context.ldt))
+ load_mm_ldt(next);
+#endif

DEBUG_LOCKS_WARN_ON(preemptible());
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2a5e851..b2485d6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -148,25 +148,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
real_prev != &init_mm);
cpumask_clear_cpu(cpu, mm_cpumask(real_prev));

- /* Load per-mm CR4 state */
+ /* Load per-mm CR4 and LDTR state */
load_mm_cr4(next);
-
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
- /*
- * Load the LDT, if the LDT is different.
- *
- * It's possible that prev->context.ldt doesn't match
- * the LDT register. This can happen if leave_mm(prev)
- * was called and then modify_ldt changed
- * prev->context.ldt but suppressed an IPI to this CPU.
- * In this case, prev->context.ldt != NULL, because we
- * never set context.ldt to NULL while the mm still
- * exists. That means that next->context.ldt !=
- * prev->context.ldt, because mms never share an LDT.
- */
- if (unlikely(real_prev->context.ldt != next->context.ldt))
- load_mm_ldt(next);
-#endif
+ switch_ldt(real_prev, next);
}

/*

Subject: [tip:x86/mm] x86/mm: Remove reset_lazy_tlbstate()

Commit-ID: d54368127a11c6da0776c109a4c65a7b6a815f32
Gitweb: http://git.kernel.org/tip/d54368127a11c6da0776c109a4c65a7b6a815f32
Author: Andy Lutomirski <[email protected]>
AuthorDate: Tue, 20 Jun 2017 22:22:09 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 22 Jun 2017 10:57:50 +0200

x86/mm: Remove reset_lazy_tlbstate()

The only call site also calls idle_task_exit(), and idle_task_exit()
puts us into a clean state by explicitly switching to init_mm.

Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/3acc7ad02a2ec060d2321a1e0f6de1cb90069517.1498022414.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 8 --------
arch/x86/kernel/smpboot.c | 1 -
2 files changed, 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5f78c6a..50ea348 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,14 +259,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
#define TLBSTATE_OK 1
#define TLBSTATE_LAZY 2

-static inline void reset_lazy_tlbstate(void)
-{
- this_cpu_write(cpu_tlbstate.state, 0);
- this_cpu_write(cpu_tlbstate.loaded_mm, &init_mm);
-
- WARN_ON(read_cr3_pa() != __pa_symbol(swapper_pg_dir));
-}
-
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm)
{
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f04479a..6169a56 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1589,7 +1589,6 @@ void native_cpu_die(unsigned int cpu)
void play_dead_common(void)
{
idle_task_exit();
- reset_lazy_tlbstate();

/* Ack it */
(void)cpu_report_death();

2017-06-22 12:22:02

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

On Wed, 21 Jun 2017, Andy Lutomirski wrote:
> On Wed, Jun 21, 2017 at 6:38 AM, Thomas Gleixner <[email protected]> wrote:
> > That requires a conditional branch
> >
> > if (asid >= NR_DYNAMIC_ASIDS) {
> > asid = 0;
> > ....
> > }
> >
> > The question is whether 4 IDs would be sufficient which trades the branch
> > for a mask operation. Or you go for 8 and spend another cache line.
>
> Interesting. I'm inclined to either leave it at 6 or reduce it to 4
> for now and to optimize later.

:)

> > Hmm. So this loop needs to be taken unconditionally even if the task stays
> > on the same CPU. And of course the number of dynamic IDs has to be short in
> > order to makes this loop suck performance wise.
> >
> > Something like the completely disfunctional below might be worthwhile to
> > explore. At least arch/x86/mm/ compiles :)
> >
> > It gets rid of the loop search and lifts the limit of dynamic ids by
> > trading it with a percpu variable in mm_context_t.
>
> That would work, but it would take a lot more memory on large systems
> with lots of processes, and I'd also be concerned that we might run
> out of dynamic percpu space.

Yeah, did not think about the dynamic percpu space.

> How about a different idea: make the percpu data structure look like a
> 4-way set associative cache. The ctxs array could be, say, 1024
> entries long without using crazy amounts of memory. We'd divide it
> into 256 buckets, so you'd index it like ctxs[4*bucket + slot]. For
> each mm, we choose a random bucket (from 0 through 256), and then we'd
> just loop over the four slots in the bucket in choose_asid(). This
> would require very slightly more arithmetic (I'd guess only one or two
> cycles, though) but, critically, wouldn't touch any more cachelines.
>
> The downside of both of these approaches over the one in this patch is
> that the change that the percpu cacheline we need is not in the cache
> is quite a bit higher since it's potentially a different cacheline for
> each mm. It would probably still be a win because avoiding the flush
> is really quite valuable.
>
> What do you think? The added code would be tiny.

That might be worth a try.

Now one other optimization which should be trivial to add is to keep the 4
asid context entries in cpu_tlbstate and cache the last asid in thread
info. If that's still valid then use it otherwise unconditionally get a new
one. That avoids the whole loop machinery and thread info is cache hot in
the context switch anyway. Delta patch on top of your version below.

> (P.S. Why doesn't random_p32() try arch_random_int()?)

Could you please ask questions which do not require crystalballs for
answering?

Thanks,

tglx

8<-------------------
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -159,8 +159,16 @@ static inline void destroy_context(struc
extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk);

-extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk);
+extern void __switch_mm_irqs_off(struct mm_struct *prev,
+ struct mm_struct *next, u32 *last_asid);
+
+static inline void switch_mm_irqs_off(struct mm_struct *prev,
+ struct mm_struct *next,
+ struct task_struct *tsk)
+{
+ __switch_mm_irqs_off(prev, next, &tsk->thread_info.asid);
+}
+
#define switch_mm_irqs_off switch_mm_irqs_off

#define activate_mm(prev, next) \
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -54,6 +54,7 @@ struct task_struct;

struct thread_info {
unsigned long flags; /* low level flags */
+ u32 asid;
};

#define INIT_THREAD_INFO(tsk) \
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -83,10 +83,13 @@ static inline u64 bump_mm_tlb_gen(struct
#endif

/*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
+ * NR_DYNAMIC_ASIDS must be a power of 2. 4 makes tlb_state fit into two
+ * cache lines.
*/
-#define NR_DYNAMIC_ASIDS 6
+#define NR_DYNAMIC_ASIDS_BITS 2
+#define NR_DYNAMIC_ASIDS (1U << NR_DYNAMIC_ASIDS_BITS)
+#define DYNAMIC_ASIDS_MASK (NR_DYNAMIC_ASIDS - 1)
+#define ASID_NEEDS_FLUSH (1U << 16)

struct tlb_context {
u64 ctx_id;
@@ -102,7 +105,8 @@ struct tlb_state {
*/
struct mm_struct *loaded_mm;
u16 loaded_mm_asid;
- u16 next_asid;
+ u16 curr_asid;
+ u32 notask_asid;

/*
* Access to this CR4 shadow and to H/W CR4 is protected by
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,7 @@ void __init zone_sizes_init(void)

DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
- .next_asid = 1,
+ .curr_asid = 0,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
};
EXPORT_SYMBOL_GPL(cpu_tlbstate);
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,43 +30,32 @@

atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);

-static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
- u16 *new_asid, bool *need_flush)
+static u32 choose_new_asid(mm_context_t *nctx, u32 *last_asid, u64 next_tlb_gen)
{
- u16 asid;
+ struct tlb_context *tctx;
+ u32 asid;

- if (!static_cpu_has(X86_FEATURE_PCID)) {
- *new_asid = 0;
- *need_flush = true;
- return;
- }
+ if (!static_cpu_has(X86_FEATURE_PCID))
+ return ASID_NEEDS_FLUSH;

- for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
- if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
- next->context.ctx_id)
- continue;
-
- *new_asid = asid;
- *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
- next_tlb_gen);
- return;
+ asid = *last_asid;
+ tctx = this_cpu_ptr(cpu_tlbstate.ctxs + asid);
+ if (likely(tctx->ctx_id == nctx->ctx_id)) {
+ if (tctx->tlb_gen != next_tlb_gen)
+ asid |= ASID_NEEDS_FLUSH;
+ return asid;
}

- /*
- * We don't currently own an ASID slot on this CPU.
- * Allocate a slot.
- */
- *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
- if (*new_asid >= NR_DYNAMIC_ASIDS) {
- *new_asid = 0;
- this_cpu_write(cpu_tlbstate.next_asid, 1);
- }
- *need_flush = true;
+ asid = this_cpu_inc_return(cpu_tlbstate.curr_asid);
+ asid &= DYNAMIC_ASIDS_MASK;
+ *last_asid = asid;
+ return asid | ASID_NEEDS_FLUSH;
}

void leave_mm(int cpu)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ unsigned long flags;

/*
* It's plausible that we're in lazy TLB mode while our mm is init_mm.
@@ -82,21 +71,27 @@ void leave_mm(int cpu)
/* Warn if we're not lazy. */
WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));

- switch_mm(NULL, &init_mm, NULL);
+ local_irq_save(flags);
+ switch_mm_irqs_off(NULL, &init_mm, NULL);
+ local_irq_restore(flags);
}

void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
unsigned long flags;
+ u32 *last_asid;
+
+ last_asid = tsk ? &tsk->thread_info.asid :
+ this_cpu_ptr(&cpu_tlbstate.notask_asid);

local_irq_save(flags);
- switch_mm_irqs_off(prev, next, tsk);
+ __switch_mm_irqs_off(prev, next, last_asid);
local_irq_restore(flags);
}

-void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk)
+void __switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+ u32 *last_asid)
{
unsigned cpu = smp_processor_id();
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -157,8 +152,7 @@ void switch_mm_irqs_off(struct mm_struct
* are not reflected in tlb_gen.)
*/
} else {
- u16 new_asid;
- bool need_flush;
+ u32 new_asid;

if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
@@ -187,9 +181,11 @@ void switch_mm_irqs_off(struct mm_struct
cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

- choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+ new_asid = choose_new_asid(&next->context, last_asid,
+ next_tlb_gen);

- if (need_flush) {
+ if (new_asid & ASID_NEEDS_FLUSH) {
+ new_asid &= DYNAMIC_ASIDS_MASK;
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id,
next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen,

2017-06-22 14:48:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Thu, Jun 22, 2017 at 12:24 AM, Borislav Petkov <[email protected]> wrote:
> On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote:
>> > I'm certainly still missing something here:
>> >
>> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
>> > do once
>> >
>> > bump_mm_tlb_gen(mm);
>> >
>> > and once
>> >
>> > info.new_tlb_gen = bump_mm_tlb_gen(mm);
>> >
>> > and in both cases, the bumping is done on mm->context.tlb_gen.
>> >
>> > So why isn't that enough to do the flushing and we have to consult
>> > info.new_tlb_gen too?
>>
>> The issue is a possible race. Suppose we start at tlb_gen == 1 and
>> then two concurrent flushes happen. The first flush is a full flush
>> and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen
>> to 3. If the second flush gets propagated to a given CPU first and it
>
> Maybe I'm still missing something, which is likely...
>
> but if the second flush gets propagated to the CPU first, the CPU will
> have local tlb_gen 1 and thus enforce a full flush anyway because we
> will go 1 -> 3 on that particular CPU. Or?
>

Yes, exactly. Which means I'm probably just misunderstanding your
original question. Can you re-ask it?

--Andy

2017-06-22 14:50:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

On Tue, Jun 20, 2017 at 10:22:12PM -0700, Andy Lutomirski wrote:
> x86's lazy TLB mode used to be fairly weak -- it would switch to
> init_mm the first time it tried to flush a lazy TLB. This meant an
> unnecessary CR3 write and, if the flush was remote, an unnecessary
> IPI.
>
> Rewrite it entirely. When we enter lazy mode, we simply remove the
> cpu from mm_cpumask. This means that we need a way to figure out

s/cpu/CPU/

> whether we've missed a flush when we switch back out of lazy mode.
> I use the tlb_gen machinery to track whether a context is up to
> date.
>
> Note to reviewers: this patch, my itself, looks a bit odd. I'm
> using an array of length 1 containing (ctx_id, tlb_gen) rather than
> just storing tlb_gen, and making it at array isn't necessary yet.
> I'm doing this because the next few patches add PCID support, and,
> with PCID, we need ctx_id, and the array will end up with a length
> greater than 1. Making it an array now means that there will be
> less churn and therefore less stress on your eyeballs.
>
> NB: This is dubious but, AFAICT, still correct on Xen and UV.
> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
> patch changes the way that mm_cpumask() works. This should be okay,
> since Xen *also* iterates all online CPUs to find all the CPUs it
> needs to twiddle.

This whole text should be under the "---" line below if we don't want it
in the commit message.

>
> The UV tlbflush code is rather dated and should be changed.
>
> Cc: Andrew Banman <[email protected]>
> Cc: Mike Travis <[email protected]>
> Cc: Dimitri Sivanich <[email protected]>
> Cc: Juergen Gross <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/mmu_context.h | 6 +-
> arch/x86/include/asm/tlbflush.h | 4 -
> arch/x86/mm/init.c | 1 -
> arch/x86/mm/tlb.c | 227 +++++++++++++++++++------------------
> arch/x86/xen/mmu_pv.c | 3 +-
> 5 files changed, 119 insertions(+), 122 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index e5295d485899..69a4f1ee86ac 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
>
> static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> {
> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> - this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
> + int cpu = smp_processor_id();
> +
> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> + cpumask_clear_cpu(cpu, mm_cpumask(mm));

It seems we haz a helper for that: cpumask_test_and_clear_cpu() which
does BTR straightaway.

> extern atomic64_t last_mm_ctx_id;
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 4f6c30d6ec39..87b13e51e867 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -95,7 +95,6 @@ struct tlb_state {
> * mode even if we've already switched back to swapper_pg_dir.
> */
> struct mm_struct *loaded_mm;
> - int state;
>
> /*
> * Access to this CR4 shadow and to H/W CR4 is protected by
> @@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
> void native_flush_tlb_others(const struct cpumask *cpumask,
> const struct flush_tlb_info *info);
>
> -#define TLBSTATE_OK 1
> -#define TLBSTATE_LAZY 2
> -
> static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> struct mm_struct *mm)
> {
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 88ee942cb47d..7d6fa4676af9 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -812,7 +812,6 @@ void __init zone_sizes_init(void)
>
> DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
> .loaded_mm = &init_mm,
> - .state = 0,
> .cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
> };
> EXPORT_SYMBOL_GPL(cpu_tlbstate);
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 9f5ef7a5e74a..fea2b07ac7d8 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -45,8 +45,8 @@ void leave_mm(int cpu)
> if (loaded_mm == &init_mm)
> return;
>
> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> - BUG();
> + /* Warn if we're not lazy. */
> + WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));

We don't BUG() anymore?

>
> switch_mm(NULL, &init_mm, NULL);
> }
> @@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> {
> unsigned cpu = smp_processor_id();
> struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> + u64 next_tlb_gen;

Please sort function local variables declaration in a reverse christmas
tree order:

<type> longest_variable_name;
<type> shorter_var_name;
<type> even_shorter;
<type> i;

>
> /*
> - * NB: The scheduler will call us with prev == next when
> - * switching from lazy TLB mode to normal mode if active_mm
> - * isn't changing. When this happens, there is no guarantee
> - * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
> + * NB: The scheduler will call us with prev == next when switching
> + * from lazy TLB mode to normal mode if active_mm isn't changing.
> + * When this happens, we don't assume that CR3 (and hence
> + * cpu_tlbstate.loaded_mm) matches next.
> *
> * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
> */
>
> - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> + /* We don't want flush_tlb_func_* to run concurrently with us. */
> + if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> + WARN_ON_ONCE(!irqs_disabled());
> +
> + VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));

Why do we need that check? Can that ever happen?

> if (real_prev == next) {
> - /*
> - * There's nothing to do: we always keep the per-mm control
> - * regs in sync with cpu_tlbstate.loaded_mm. Just
> - * sanity-check mm_cpumask.
> - */
> - if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
> - cpumask_set_cpu(cpu, mm_cpumask(next));
> - return;
> - }
> + if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
> + /*
> + * There's nothing to do: we weren't lazy, and we
> + * aren't changing our mm. We don't need to flush
> + * anything, nor do we need to update CR3, CR4, or
> + * LDTR.
> + */

Nice comment.

> + return;
> + }
> +
> + /* Resume remote flushes and then read tlb_gen. */
> + cpumask_set_cpu(cpu, mm_cpumask(next));
> + next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> +
> + VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
> + next->context.ctx_id);

I guess this check should be right under the if (real_prev == next), right?

> +
> + if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
> + next_tlb_gen) {

Yeah, let it stick out - that trailing '<' doesn't make it any nicer.

> + /*
> + * Ideally, we'd have a flush_tlb() variant that
> + * takes the known CR3 value as input. This would
> + * be faster on Xen PV and on hypothetical CPUs
> + * on which INVPCID is fast.
> + */
> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> + next_tlb_gen);
> + write_cr3(__pa(next->pgd));

<---- newline here.

> + /*
> + * This gets called via leave_mm() in the idle path
> + * where RCU functions differently. Tracing normally
> + * uses RCU, so we have to call the tracepoint
> + * specially here.
> + */
> + trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
> + TLB_FLUSH_ALL);
> + }
>
> - if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> /*
> - * If our current stack is in vmalloc space and isn't
> - * mapped in the new pgd, we'll double-fault. Forcibly
> - * map it.
> + * We just exited lazy mode, which means that CR4 and/or LDTR
> + * may be stale. (Changes to the required CR4 and LDTR states
> + * are not reflected in tlb_gen.)

We need that comment because... ? I mean, we do update CR4/LDTR at the
end of the function.

> */
> - unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
> + } else {
> + if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> + /*
> + * If our current stack is in vmalloc space and isn't
> + * mapped in the new pgd, we'll double-fault. Forcibly
> + * map it.
> + */
> + unsigned int stack_pgd_index =

Shorten that var name and make it fit into 80ish cols so that the
linebreak is gone.

> + pgd_index(current_stack_pointer());
> +
> + pgd_t *pgd = next->pgd + stack_pgd_index;
> +
> + if (unlikely(pgd_none(*pgd)))
> + set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
> + }
>
> - pgd_t *pgd = next->pgd + stack_pgd_index;
> + /* Stop remote flushes for the previous mm */
> + if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
> + cpumask_clear_cpu(cpu, mm_cpumask(real_prev));

cpumask_test_and_clear_cpu()

>
> - if (unlikely(pgd_none(*pgd)))
> - set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
> - }
> + WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));

We warn if the next task is not lazy because...?

> - this_cpu_write(cpu_tlbstate.loaded_mm, next);
> - this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
> - this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> - atomic64_read(&next->context.tlb_gen));
> + /*
> + * Start remote flushes and then read tlb_gen.
> + */
> + cpumask_set_cpu(cpu, mm_cpumask(next));
> + next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>
> - WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
> - cpumask_set_cpu(cpu, mm_cpumask(next));
> + VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
> + next->context.ctx_id);

Also put it as the first statement after the else { ?

> - /*
> - * Re-load page tables.
> - *
> - * This logic has an ordering constraint:
> - *
> - * CPU 0: Write to a PTE for 'next'
> - * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
> - * CPU 1: set bit 1 in next's mm_cpumask
> - * CPU 1: load from the PTE that CPU 0 writes (implicit)
> - *
> - * We need to prevent an outcome in which CPU 1 observes
> - * the new PTE value and CPU 0 observes bit 1 clear in
> - * mm_cpumask. (If that occurs, then the IPI will never
> - * be sent, and CPU 0's TLB will contain a stale entry.)
> - *
> - * The bad outcome can occur if either CPU's load is
> - * reordered before that CPU's store, so both CPUs must
> - * execute full barriers to prevent this from happening.
> - *
> - * Thus, switch_mm needs a full barrier between the
> - * store to mm_cpumask and any operation that could load
> - * from next->pgd. TLB fills are special and can happen
> - * due to instruction fetches or for no reason at all,
> - * and neither LOCK nor MFENCE orders them.
> - * Fortunately, load_cr3() is serializing and gives the
> - * ordering guarantee we need.
> - */
> - load_cr3(next->pgd);
> -
> - /*
> - * This gets called via leave_mm() in the idle path where RCU
> - * functions differently. Tracing normally uses RCU, so we have to
> - * call the tracepoint specially here.
> - */
> - trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> + this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,
> + next->context.ctx_id);
> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> + next_tlb_gen);

Yeah, just let all those three stick out.

Also, no:

if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
next_tlb_gen) {

check?

> + this_cpu_write(cpu_tlbstate.loaded_mm, next);
> + write_cr3(__pa(next->pgd));
>
> - /* Stop flush ipis for the previous mm */
> - WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
> - real_prev != &init_mm);
> - cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
> + /*
> + * This gets called via leave_mm() in the idle path where RCU
> + * functions differently. Tracing normally uses RCU, so we
> + * have to call the tracepoint specially here.
> + */
> + trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
> + TLB_FLUSH_ALL);
> + }

That's repeated as in the if-branch above. Move it out I guess.

>
> - /* Load per-mm CR4 and LDTR state */
> load_mm_cr4(next);
> switch_ldt(real_prev, next);
> }
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-22 14:59:30

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Thu, Jun 22, 2017 at 07:48:21AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 22, 2017 at 12:24 AM, Borislav Petkov <[email protected]> wrote:
> > On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote:
> >> > I'm certainly still missing something here:
> >> >
> >> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
> >> > do once
> >> >
> >> > bump_mm_tlb_gen(mm);
> >> >
> >> > and once
> >> >
> >> > info.new_tlb_gen = bump_mm_tlb_gen(mm);
> >> >
> >> > and in both cases, the bumping is done on mm->context.tlb_gen.
> >> >
> >> > So why isn't that enough to do the flushing and we have to consult
> >> > info.new_tlb_gen too?
> >>
> >> The issue is a possible race. Suppose we start at tlb_gen == 1 and
> >> then two concurrent flushes happen. The first flush is a full flush
> >> and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen
> >> to 3. If the second flush gets propagated to a given CPU first and it
> >
> > Maybe I'm still missing something, which is likely...
> >
> > but if the second flush gets propagated to the CPU first, the CPU will
> > have local tlb_gen 1 and thus enforce a full flush anyway because we
> > will go 1 -> 3 on that particular CPU. Or?
> >
>
> Yes, exactly. Which means I'm probably just misunderstanding your
> original question. Can you re-ask it?

Ah, simple: we control the flushing with info.new_tlb_gen and
mm->context.tlb_gen. I.e., this check:


if (f->end != TLB_FLUSH_ALL &&
f->new_tlb_gen == local_tlb_gen + 1 &&
f->new_tlb_gen == mm_tlb_gen) {

why can't we write:

if (f->end != TLB_FLUSH_ALL &&
mm_tlb_gen == local_tlb_gen + 1)

?

If mm_tlb_gen is + 2, then we'll do a full flush, if it is + 1, then
partial.

If the second flush, as you say is a partial one and still gets
propagated first, the check will force a full flush anyway.

When the first flush propagates after the second, we'll ignore it
because local_tlb_gen has advanced already due to the second flush.

As a matter of fact, we could simplify the logic: if local_tlb_gen is
only mm_tlb_gen - 1, then do the requested flush type.

If mm_tlb_gen has advanced more than 1 generation, just do a full flush
unconditionally. ... and I think we do something like that already but I
think the logic could be simplified, unless I'm missing something, that is.

Thanks.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-22 15:56:00

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Thu, Jun 22, 2017 at 7:59 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Jun 22, 2017 at 07:48:21AM -0700, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2017 at 12:24 AM, Borislav Petkov <[email protected]> wrote:
>> > On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote:
>> >> > I'm certainly still missing something here:
>> >> >
>> >> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
>> >> > do once
>> >> >
>> >> > bump_mm_tlb_gen(mm);
>> >> >
>> >> > and once
>> >> >
>> >> > info.new_tlb_gen = bump_mm_tlb_gen(mm);
>> >> >
>> >> > and in both cases, the bumping is done on mm->context.tlb_gen.
>> >> >
>> >> > So why isn't that enough to do the flushing and we have to consult
>> >> > info.new_tlb_gen too?
>> >>
>> >> The issue is a possible race. Suppose we start at tlb_gen == 1 and
>> >> then two concurrent flushes happen. The first flush is a full flush
>> >> and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen
>> >> to 3. If the second flush gets propagated to a given CPU first and it
>> >
>> > Maybe I'm still missing something, which is likely...
>> >
>> > but if the second flush gets propagated to the CPU first, the CPU will
>> > have local tlb_gen 1 and thus enforce a full flush anyway because we
>> > will go 1 -> 3 on that particular CPU. Or?
>> >
>>
>> Yes, exactly. Which means I'm probably just misunderstanding your
>> original question. Can you re-ask it?
>
> Ah, simple: we control the flushing with info.new_tlb_gen and
> mm->context.tlb_gen. I.e., this check:
>
>
> if (f->end != TLB_FLUSH_ALL &&
> f->new_tlb_gen == local_tlb_gen + 1 &&
> f->new_tlb_gen == mm_tlb_gen) {
>
> why can't we write:
>
> if (f->end != TLB_FLUSH_ALL &&
> mm_tlb_gen == local_tlb_gen + 1)
>
> ?

Ah, I thought you were asking about why I needed mm_tlb_gen ==
local_tlb_gen + 1. This is just an optimization, or at least I hope
it is. The idea is that, if we know that another flush is coming, it
seems likely that it would be faster to do a full flush and increase
local_tlb_gen all the way to mm_tlb_gen rather than doing a partial
flush, increasing local_tlb_gen to something less than mm_tlb_gen, and
needing to flush again very soon.
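
Concretely, with illustrative numbers:

	/*
	 * local_tlb_gen == 1, f->new_tlb_gen == 2, mm_tlb_gen == 3:
	 * a partial flush would only advance us to 2 and another flush
	 * would hit almost immediately, so do one full flush and write 3
	 * into the local tlb_gen instead.
	 */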

>
> If mm_tlb_gen is + 2, then we'll do a full flush, if it is + 1, then
> partial.
>
> If the second flush, as you say is a partial one and still gets
> propagated first, the check will force a full flush anyway.
>
> When the first flush propagates after the second, we'll ignore it
> because local_tlb_gen has advanced adready due to the second flush.
>
> As a matter of fact, we could simplify the logic: if local_tlb_gen is
> only mm_tlb_gen - 1, then do the requested flush type.

Hmm. I'd be nervous that there are more subtle races if we do this.
For example, suppose that a partial flush increments tlb_gen from 1 to
2 and a full flush increments tlb_gen from 2 to 3. Meanwhile, the CPU
is busy switching back and forth between mms, so the partial flush
sees the cpu set in mm_cpumask but the full flush doesn't see the cpu
set in mm_cpumask. The flush IPI hits after a switch_mm_irqs_off()
call notices the change from 1 to 2. switch_mm_irqs_off() will do a
full flush and increment the local tlb_gen to 2, and the IPI handler
for the partial flush will see local_tlb_gen == mm_tlb_gen - 1
(because local_tlb_gen == 2 and mm_tlb_gen == 3) and do a partial
flush. The problem here is that it's not obvious to me that this
actually ends up flushing everything that's needed. Maybe all the
memory ordering gets this right, but I can imagine scenarios in which
switch_mm_irqs_off() does its flush early enough that the TLB picks up
an entry that was supposed to get zapped by the full flush.

IOW it *might* be valid, but I think it would need very careful review
and documentation.

--Andy

2017-06-22 16:10:08

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

Andy Lutomirski <[email protected]> wrote:

>
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -812,6 +812,7 @@ void __init zone_sizes_init(void)
>
> DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
> .loaded_mm = &init_mm,
> + .next_asid = 1,

I think this is a leftover from a previous version of the patches, no? It
does not seem necessary and may be confusing (ctx_id 0 is reserved, but not
asid 0).

Other than that, if you want, you can put for the entire series:

Reviewed-by: Nadav Amit <[email protected]>

2017-06-22 17:22:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Thu, Jun 22, 2017 at 08:55:36AM -0700, Andy Lutomirski wrote:
> > Ah, simple: we control the flushing with info.new_tlb_gen and
> > mm->context.tlb_gen. I.e., this check:
> >
> >
> > if (f->end != TLB_FLUSH_ALL &&
> > f->new_tlb_gen == local_tlb_gen + 1 &&
> > f->new_tlb_gen == mm_tlb_gen) {
> >
> > why can't we write:
> >
> > if (f->end != TLB_FLUSH_ALL &&
> > mm_tlb_gen == local_tlb_gen + 1)
> >
> > ?
>
> Ah, I thought you were asking about why I needed mm_tlb_gen ==
> local_tlb_gen + 1. This is just an optimization, or at least I hope
> it is. The idea is that, if we know that another flush is coming, it
> seems likely that it would be faster to do a full flush and increase
> local_tlb_gen all the way to mm_tlb_gen rather than doing a partial
> flush, increasing local_tlb_gen to something less than mm_tlb_gen, and
> needing to flush again very soon.

Thus the f->new_tlb_gen check whether it is local_tlb_gen + 1.

Btw, do you see how confusing this check is: you have new_tlb_gen from
a variable passed from the function call IPI, local_tlb_gen which is
the CPU's own and then there's also mm_tlb_gen which we've written into
new_tlb_gen from:

info.new_tlb_gen = bump_mm_tlb_gen(mm);

which incremented mm_tlb_gen too.

> Hmm. I'd be nervous that there are more subtle races if we do this.
> For example, suppose that a partial flush increments tlb_gen from 1 to
> 2 and a full flush increments tlb_gen from 2 to 3. Meanwhile, the CPU
> is busy switching back and forth between mms, so the partial flush
> sees the cpu set in mm_cpumask but the full flush doesn't see the cpu
> set in mm_cpumask.

Lemme see if I understand this correctly: you mean, the full flush will
exit early due to the

if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {

test?

> The flush IPI hits after a switch_mm_irqs_off() call notices the
> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
> increment the local tlb_gen to 2, and the IPI handler for the partial
> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
> == 2 and mm_tlb_gen == 3) and do a partial flush.

Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.

That's why you have this thing in addition to the tlb_gen.

What we end up doing in this case is promoting the partial flush to a
full one, so a partial flush and a full flush that happen close together
get converted into two full flushes.

> The problem here is that it's not obvious to me that this actually
> ends up flushing everything that's needed. Maybe all the memory
> ordering gets this right, but I can imagine scenarios in which
> switch_mm_irqs_off() does its flush early enough that the TLB picks up
> an entry that was supposed to get zapped by the full flush.

See above.

And I don't think that having two full flushes back-to-back is going to
cost a lot as the second one won't flush a whole lot.

> IOW it *might* be valid, but I think it would need very careful review
> and documentation.

Always.

Btw, I get the sense this TLB-flush-avoidance scheme is becoming pretty
complex for diminishing returns.

[ Or maybe I'm not seeing them - I'm always open to corrections. ]

Especially if the intermediate levels from the page table walker are cached
and reestablishing the TLB entries seldom means a full walk.

You should do a full-fledged benchmark to see whether this whole
complexity is even worth it, methinks.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-22 17:47:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

On Thu, Jun 22, 2017 at 7:50 AM, Borislav Petkov <[email protected]> wrote:
> On Tue, Jun 20, 2017 at 10:22:12PM -0700, Andy Lutomirski wrote:
>> Rewrite it entirely. When we enter lazy mode, we simply remove the
>> cpu from mm_cpumask. This means that we need a way to figure out
>
> s/cpu/CPU/

Done.

>
>> whether we've missed a flush when we switch back out of lazy mode.
>> I use the tlb_gen machinery to track whether a context is up to
>> date.
>>
>> Note to reviewers: this patch, by itself, looks a bit odd. I'm
>> using an array of length 1 containing (ctx_id, tlb_gen) rather than
>> just storing tlb_gen, and making it an array isn't necessary yet.
>> I'm doing this because the next few patches add PCID support, and,
>> with PCID, we need ctx_id, and the array will end up with a length
>> greater than 1. Making it an array now means that there will be
>> less churn and therefore less stress on your eyeballs.
>>
>> NB: This is dubious but, AFAICT, still correct on Xen and UV.
>> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
>> patch changes the way that mm_cpumask() works. This should be okay,
>> since Xen *also* iterates all online CPUs to find all the CPUs it
>> needs to twiddle.
>
> This whole text should be under the "---" line below if we don't want it
> in the commit message.

I figured that some future reader of this patch might actually want to
see this text, though.

>
>>
>> The UV tlbflush code is rather dated and should be changed.

And I'd definitely like the UV maintainers to notice this part, now or
in the future :) I don't want to personally touch the UV code with a
ten-foot pole, but it really should be updated by someone who has a
chance of getting it right and being able to test it.

>> +
>> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
>> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
>
> It seems we haz a helper for that: cpumask_test_and_clear_cpu() which
> does BTR straightaway.

Yeah, but I'm doing this for performance. I think that all the
various one-line helpers do a LOCKed op right away, and I think it's
faster to see if we can avoid the LOCKed op by trying an ordinary read
first. OTOH, maybe this is misguided -- if the cacheline lives
somewhere else and we do end up needing to update it, we'll end up
first sharing it and then making it exclusive, which increases the
amount of cache coherency traffic, so maybe I'm optimizing for the
wrong thing. What do you think?
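
For reference, the trade-off being discussed looks roughly like the
sketch below. It is plain userspace C with simplified stand-ins for the
cpumask helpers (the names and the single-word mask are illustrative,
not the kernel API); the point is only that the test-first variant can
skip the locked read-modify-write entirely when the bit is already
clear.

/*
 * Sketch: plain test, then locked clear, vs. an unconditional atomic
 * test-and-clear.  Models one word of mm_cpumask() as a single long.
 */
#include <stdbool.h>
#include <stdio.h>

static unsigned long mask;

static bool test_and_clear_bit_locked(int bit, unsigned long *word)
{
	/* Always a locked RMW, even if the bit was already clear. */
	return __atomic_fetch_and(word, ~(1UL << bit), __ATOMIC_SEQ_CST)
		& (1UL << bit);
}

static void clear_bit_locked(int bit, unsigned long *word)
{
	__atomic_fetch_and(word, ~(1UL << bit), __ATOMIC_SEQ_CST);
}

static void stop_remote_flushes(int cpu)
{
	/* Ordinary load first: if the bit is already clear, no locked op. */
	if (mask & (1UL << cpu))
		clear_bit_locked(cpu, &mask);
}

int main(void)
{
	mask = 1UL << 3;
	test_and_clear_bit_locked(3, &mask);	/* locked RMW */
	test_and_clear_bit_locked(3, &mask);	/* locked RMW again, for nothing */

	mask = 1UL << 3;
	stop_remote_flushes(3);			/* locked RMW once */
	stop_remote_flushes(3);			/* plain load only */

	printf("mask = %#lx\n", mask);
	return 0;
}

The cacheline-sharing concern above is the flip side: the extra plain
load can pull the line in shared state before the locked op has to
upgrade it to exclusive.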

>> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
>> - BUG();
>> + /* Warn if we're not lazy. */
>> + WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
>
> We don't BUG() anymore?

We could. But, when the whole patch series is applied, the only
caller left is a somewhat dubious Xen optimization, and if we blindly
continue executing, I think the worst that happens is that we OOPS
later or that we get segfaults when we shouldn't get segfaults.

>
>>
>> switch_mm(NULL, &init_mm, NULL);
>> }
>> @@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>> {
>> unsigned cpu = smp_processor_id();
>> struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>> + u64 next_tlb_gen;
>
> Please sort function local variables declaration in a reverse christmas
> tree order:
>
> <type> longest_variable_name;
> <type> shorter_var_name;
> <type> even_shorter;
> <type> i;
>
>>
>> /*
>> - * NB: The scheduler will call us with prev == next when
>> - * switching from lazy TLB mode to normal mode if active_mm
>> - * isn't changing. When this happens, there is no guarantee
>> - * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
>> + * NB: The scheduler will call us with prev == next when switching
>> + * from lazy TLB mode to normal mode if active_mm isn't changing.
>> + * When this happens, we don't assume that CR3 (and hence
>> + * cpu_tlbstate.loaded_mm) matches next.
>> *
>> * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
>> */
>>
>> - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
>> + /* We don't want flush_tlb_func_* to run concurrently with us. */
>> + if (IS_ENABLED(CONFIG_PROVE_LOCKING))
>> + WARN_ON_ONCE(!irqs_disabled());
>> +
>> + VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));
>
> Why do we need that check? Can that ever happen?

It did in one particular buggy incarnation. It would also trigger if,
say, suspend/resume corrupts CR3. Admittedly this is unlikely, but
I'd rather catch it. Once PCID is on, corruption seems a bit less
farfetched -- this assertion will catch anyone who accidentally does
write_cr3(read_cr3_pa()).

>
>> if (real_prev == next) {
>> - /*
>> - * There's nothing to do: we always keep the per-mm control
>> - * regs in sync with cpu_tlbstate.loaded_mm. Just
>> - * sanity-check mm_cpumask.
>> - */
>> - if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
>> - cpumask_set_cpu(cpu, mm_cpumask(next));
>> - return;
>> - }
>> + if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
>> + /*
>> + * There's nothing to do: we weren't lazy, and we
>> + * aren't changing our mm. We don't need to flush
>> + * anything, nor do we need to update CR3, CR4, or
>> + * LDTR.
>> + */
>
> Nice comment.
>
>> + return;
>> + }
>> +
>> + /* Resume remote flushes and then read tlb_gen. */
>> + cpumask_set_cpu(cpu, mm_cpumask(next));
>> + next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>> +
>> + VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
>> + next->context.ctx_id);
>
> I guess this check should be right under the if (real_prev == next), right?

Moved.

>
>> +
>> + if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
>> + next_tlb_gen) {
>
> Yeah, let it stick out - that trailing '<' doesn't make it any nicer.
>
>> + /*
>> + * Ideally, we'd have a flush_tlb() variant that
>> + * takes the known CR3 value as input. This would
>> + * be faster on Xen PV and on hypothetical CPUs
>> + * on which INVPCID is fast.
>> + */
>> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
>> + next_tlb_gen);
>> + write_cr3(__pa(next->pgd));
>
> <---- newline here.
>
>> + /*
>> + * This gets called via leave_mm() in the idle path
>> + * where RCU functions differently. Tracing normally
>> + * uses RCU, so we have to call the tracepoint
>> + * specially here.
>> + */
>> + trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
>> + TLB_FLUSH_ALL);
>> + }
>>
>> - if (IS_ENABLED(CONFIG_VMAP_STACK)) {
>> /*
>> - * If our current stack is in vmalloc space and isn't
>> - * mapped in the new pgd, we'll double-fault. Forcibly
>> - * map it.
>> + * We just exited lazy mode, which means that CR4 and/or LDTR
>> + * may be stale. (Changes to the required CR4 and LDTR states
>> + * are not reflected in tlb_gen.)
>
> We need that comment because... ? I mean, we do update CR4/LDTR at the
> end of the function.

I'm trying to explain to the potentially confused reader why we fall
through and update CR4 and LDTR even if we decided not to update the
TLB.

>
>> */
>> - unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
>> + } else {
>> + if (IS_ENABLED(CONFIG_VMAP_STACK)) {
>> + /*
>> + * If our current stack is in vmalloc space and isn't
>> + * mapped in the new pgd, we'll double-fault. Forcibly
>> + * map it.
>> + */
>> + unsigned int stack_pgd_index =
>
> Shorten that var name and make it fit into 80ish cols so that the
> linebreak is gone.

Done.

>
>> + pgd_index(current_stack_pointer());
>> +
>> + pgd_t *pgd = next->pgd + stack_pgd_index;
>> +
>> + if (unlikely(pgd_none(*pgd)))
>> + set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
>> + }
>>
>> - pgd_t *pgd = next->pgd + stack_pgd_index;
>> + /* Stop remote flushes for the previous mm */
>> + if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
>> + cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
>
> cpumask_test_and_clear_cpu()

Same as before. I can change this, but it'll have different
performance characteristics. This one here will optimize the case
where we go lazy and then switch away.

>
>>
>> - if (unlikely(pgd_none(*pgd)))
>> - set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
>> - }
>> + WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
>
> We warn if the next task is not lazy because...?

The next task had better not be in mm_cpumask(), but I agree that this
is minor. I'll change it to VM_WARN_ON_ONCE.

>
>> - this_cpu_write(cpu_tlbstate.loaded_mm, next);
>> - this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
>> - this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
>> - atomic64_read(&next->context.tlb_gen));
>> + /*
>> + * Start remote flushes and then read tlb_gen.
>> + */
>> + cpumask_set_cpu(cpu, mm_cpumask(next));
>> + next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>>
>> - WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
>> - cpumask_set_cpu(cpu, mm_cpumask(next));
>> + VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
>> + next->context.ctx_id);
>
> Also put it as the first statement after the else { ?

Done.

>
>> - /*
>> - * Re-load page tables.
>> - *
>> - * This logic has an ordering constraint:
>> - *
>> - * CPU 0: Write to a PTE for 'next'
>> - * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
>> - * CPU 1: set bit 1 in next's mm_cpumask
>> - * CPU 1: load from the PTE that CPU 0 writes (implicit)
>> - *
>> - * We need to prevent an outcome in which CPU 1 observes
>> - * the new PTE value and CPU 0 observes bit 1 clear in
>> - * mm_cpumask. (If that occurs, then the IPI will never
>> - * be sent, and CPU 0's TLB will contain a stale entry.)
>> - *
>> - * The bad outcome can occur if either CPU's load is
>> - * reordered before that CPU's store, so both CPUs must
>> - * execute full barriers to prevent this from happening.
>> - *
>> - * Thus, switch_mm needs a full barrier between the
>> - * store to mm_cpumask and any operation that could load
>> - * from next->pgd. TLB fills are special and can happen
>> - * due to instruction fetches or for no reason at all,
>> - * and neither LOCK nor MFENCE orders them.
>> - * Fortunately, load_cr3() is serializing and gives the
>> - * ordering guarantee we need.
>> - */
>> - load_cr3(next->pgd);
>> -
>> - /*
>> - * This gets called via leave_mm() in the idle path where RCU
>> - * functions differently. Tracing normally uses RCU, so we have to
>> - * call the tracepoint specially here.
>> - */
>> - trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>> + this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,
>> + next->context.ctx_id);
>> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
>> + next_tlb_gen);
>
> Yeah, just let all those three stick out.
>
> Also, no:
>
> if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
> next_tlb_gen) {
>
> check?

What would the check do? Without PCID, we don't have a choice as to
whether we flush.

>
>> + this_cpu_write(cpu_tlbstate.loaded_mm, next);
>> + write_cr3(__pa(next->pgd));
>>
>> - /* Stop flush ipis for the previous mm */
>> - WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
>> - real_prev != &init_mm);
>> - cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
>> + /*
>> + * This gets called via leave_mm() in the idle path where RCU
>> + * functions differently. Tracing normally uses RCU, so we
>> + * have to call the tracepoint specially here.
>> + */
>> + trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
>> + TLB_FLUSH_ALL);
>> + }
>
> That's repeated as in the if-branch above. Move it out I guess.

I would have, but it changes later in the series and this reduces churn.

2017-06-22 18:09:02

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Thu, Jun 22, 2017 at 10:22 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Jun 22, 2017 at 08:55:36AM -0700, Andy Lutomirski wrote:
>> > Ah, simple: we control the flushing with info.new_tlb_gen and
>> > mm->context.tlb_gen. I.e., this check:
>> >
>> >
>> > if (f->end != TLB_FLUSH_ALL &&
>> > f->new_tlb_gen == local_tlb_gen + 1 &&
>> > f->new_tlb_gen == mm_tlb_gen) {
>> >
>> > why can't we write:
>> >
>> > if (f->end != TLB_FLUSH_ALL &&
>> > mm_tlb_gen == local_tlb_gen + 1)
>> >
>> > ?
>>
>> Ah, I thought you were asking about why I needed mm_tlb_gen ==
>> local_tlb_gen + 1. This is just an optimization, or at least I hope
>> it is. The idea is that, if we know that another flush is coming, it
>> seems likely that it would be faster to do a full flush and increase
>> local_tlb_gen all the way to mm_tlb_gen rather than doing a partial
>> flush, increasing local_tlb_gen to something less than mm_tlb_gen, and
>> needing to flush again very soon.
>
> Thus the f->new_tlb_gen check whether it is local_tlb_gen + 1.
>
> Btw, do you see how confusing this check is: you have new_tlb_gen from
> a variable passed from the function call IPI, local_tlb_gen which is
> the CPU's own and then there's also mm_tlb_gen which we've written into
> new_tlb_gen from:
>
> info.new_tlb_gen = bump_mm_tlb_gen(mm);
>
> which incremented mm_tlb_gen too.

Yes, I agree it's confusing. There really are three numbers. Those
numbers are: the latest generation, the generation that this CPU has
caught up to, and the generation that the requester of the flush we're
currently handling has asked us to catch up to. I don't see a way to
reduce the complexity.
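
A minimal, standalone model of those three numbers and of the check
quoted above (names follow the thread; real flushing is replaced by
printf, and reading mm->context.tlb_gen is replaced by a plain
argument, so this illustrates only the decision, not the kernel
function):

/*
 * mm_tlb_gen:     latest generation for the mm
 * local_tlb_gen:  generation this CPU has caught up to
 * f->new_tlb_gen: generation the requester asked us to catch up to
 */
#include <stdint.h>
#include <stdio.h>

#define TLB_FLUSH_ALL ((unsigned long)-1)

struct flush_info {
	unsigned long start, end;
	uint64_t new_tlb_gen;
};

static void flush_tlb_func(const struct flush_info *f,
			   uint64_t mm_tlb_gen, uint64_t *local_tlb_gen)
{
	if (*local_tlb_gen == mm_tlb_gen)
		return;				/* already up to date */

	if (f->end != TLB_FLUSH_ALL &&
	    f->new_tlb_gen == *local_tlb_gen + 1 &&
	    f->new_tlb_gen == mm_tlb_gen)
		printf("partial flush %#lx-%#lx\n", f->start, f->end);
	else
		printf("full flush\n");		/* anything else is promoted */

	*local_tlb_gen = mm_tlb_gen;
}

int main(void)
{
	uint64_t local = 1;
	struct flush_info partial = { 0x1000, 0x2000, 2 };

	flush_tlb_func(&partial, 2, &local);	/* safe: stays partial */
	flush_tlb_func(&partial, 3, &local);	/* gen moved on: full flush */
	return 0;
}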

>
>> Hmm. I'd be nervous that there are more subtle races if we do this.
>> For example, suppose that a partial flush increments tlb_gen from 1 to
>> 2 and a full flush increments tlb_gen from 2 to 3. Meanwhile, the CPU
>> is busy switching back and forth between mms, so the partial flush
>> sees the cpu set in mm_cpumask but the full flush doesn't see the cpu
>> set in mm_cpumask.
>
> Lemme see if I understand this correctly: you mean, the full flush will
> exit early due to the
>
> if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
>
> test?

Yes, or at least that's what I'm imagining.

>
>> The flush IPI hits after a switch_mm_irqs_off() call notices the
>> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
>> increment the local tlb_gen to 2, and the IPI handler for the partial
>> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
>> == 2 and mm_tlb_gen == 3) and do a partial flush.
>
> Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.
>
> That's why you have this thing in addition to the tlb_gen.

Yes. The idea is that we only do remote partial flushes when it's
100% obvious that it's safe.

>
> What we end up doing in this case, is promote the partial flush to a
> full one and thus have a partial and a full flush which are close by
> converted to two full flushes.

It could be converted to two full flushes or to just one, I think,
depending on what order everything happens in.

>
>> The problem here is that it's not obvious to me that this actually
>> ends up flushing everything that's needed. Maybe all the memory
>> ordering gets this right, but I can imagine scenarios in which
>> switch_mm_irqs_off() does its flush early enough that the TLB picks up
>> an entry that was supposed to get zapped by the full flush.
>
> See above.
>
> And I don't think that having two full flushes back-to-back is going to
> cost a lot as the second one won't flush a whole lot.

From limited benchmarking on new Intel chips, a full flush is very
expensive no matter what. I think this is silly because I suspect
that the PCID circuitry could internally simulate a full flush at very
little cost, but it seems that it doesn't. I haven't tried to
benchmark INVLPG.

>
>> IOW it *might* be valid, but I think it would need very careful review
>> and documentation.
>
> Always.
>
> Btw, I get the sense this TLB flush avoiding scheme becomes pretty
> complex for diminishing reasons.
>
> [ Or maybe I'm not seeing them - I'm always open to corrections. ]
>
> Especially if intermediary levels from the pagetable walker are cached
> and reestablishing the TLB entries seldom means a full walk.
>
> You should do a full fledged benchmark to see whether this whole
> complexity is even worth it, methinks.

I agree that, by itself, this patch is waaaaaay too complex to be
justifiable. The thing is that, after quite a few false starts, I
couldn't find a clean way to get PCID and improved laziness without
this thing as a prerequisite. Both of those features depend on having
a heuristic for when a flush can be avoided, and that heuristic must
*never* say that a flush can be skipped when it can't be skipped.
This patch gives a way to do that.

I tried a few other approaches:

- Keeping a cpumask of which CPUs are up to date. Someone at Intel
tried this once and I inherited that code, but I scrapped it all after
it had both performance and correctness issues. I tried the approach
again from scratch and paulmck poked all kinds of holes in it.

- Using a lock to make sure that only one flush can be in progress on
a given mm at a given time. The performance is just fine -- flushes
can't usefully happen in parallel anyway. The problem is that the
batched unmap code in the core mm (which is apparently a huge win on
some workloads) can introduce arbitrarily long delays between
initiating a flush and actually requesting that the IPIs be sent. I
could have worked around this with fancy data structures, but getting
them right so they wouldn't deadlock if called during reclaim and
preventing lock ordering issues would have been really nasty.

- Poking remote cpus' data structures even when they're lazy. This
wouldn't scale on systems with many cpus, since a given mm can easily
be lazy on every single other cpu all at once.

- Ditching remote partial flushes entirely. But those were recently
re-optimized by some Intel folks (Dave and others, IIRC) and came with
nice benchmarks showing that they were useful on some workloads.
(munmapping a small range, presumably.)

But this approach of using three separate tlb_gen values seems to
cover all the bases, and I don't think it's *that* bad.

--Andy

2017-06-22 18:10:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

On Thu, Jun 22, 2017 at 9:09 AM, Nadav Amit <[email protected]> wrote:
> Andy Lutomirski <[email protected]> wrote:
>
>>
>> --- a/arch/x86/mm/init.c
>> +++ b/arch/x86/mm/init.c
>> @@ -812,6 +812,7 @@ void __init zone_sizes_init(void)
>>
>> DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
>> .loaded_mm = &init_mm,
>> + .next_asid = 1,
>
> I think this is a remainder from previous version of the patches, no? It
> does not seem necessary and may be confusing (ctx_id 0 is reserved, but not
> asid 0).

Hmm. It's no longer needed for correctness, but init_mm still lands
in slot 0, and it seems friendly to avoid immediately stomping it.
Admittedly, this won't make any practical difference since it'll only
happen once per cpu.

>
> Other than that, if you want, you can put for the entire series:
>
> Reviewed-by: Nadav Amit <[email protected]>
>

Thanks!

2017-06-22 18:13:09

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

On Thu, Jun 22, 2017 at 5:21 AM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 21 Jun 2017, Andy Lutomirski wrote:
>> On Wed, Jun 21, 2017 at 6:38 AM, Thomas Gleixner <[email protected]> wrote:
>> > That requires a conditional branch
>> >
>> > if (asid >= NR_DYNAMIC_ASIDS) {
>> > asid = 0;
>> > ....
>> > }
>> >
>> > The question is whether 4 IDs would be sufficient which trades the branch
>> > for a mask operation. Or you go for 8 and spend another cache line.
>>
>> Interesting. I'm inclined to either leave it at 6 or reduce it to 4
>> for now and to optimize later.
>
> :)
>
>> > Hmm. So this loop needs to be taken unconditionally even if the task stays
>> > on the same CPU. And of course the number of dynamic IDs has to be short in
>> > order to makes this loop suck performance wise.
>> >
>> > Something like the completely disfunctional below might be worthwhile to
>> > explore. At least arch/x86/mm/ compiles :)
>> >
>> > It gets rid of the loop search and lifts the limit of dynamic ids by
>> > trading it with a percpu variable in mm_context_t.
>>
>> That would work, but it would take a lot more memory on large systems
>> with lots of processes, and I'd also be concerned that we might run
>> out of dynamic percpu space.
>
> Yeah, did not think about the dynamic percpu space.
>
>> How about a different idea: make the percpu data structure look like a
>> 4-way set associative cache. The ctxs array could be, say, 1024
>> entries long without using crazy amounts of memory. We'd divide it
>> into 256 buckets, so you'd index it like ctxs[4*bucket + slot]. For
>> each mm, we choose a random bucket (from 0 through 256), and then we'd
>> just loop over the four slots in the bucket in choose_asid(). This
>> would require very slightly more arithmetic (I'd guess only one or two
>> cycles, though) but, critically, wouldn't touch any more cachelines.
>>
>> The downside of both of these approaches over the one in this patch is
>> that the chance that the percpu cacheline we need is not in the cache
>> is quite a bit higher since it's potentially a different cacheline for
>> each mm. It would probably still be a win because avoiding the flush
>> is really quite valuable.
>>
>> What do you think? The added code would be tiny.
>
> That might be worth a try.
>
> Now one other optimization which should be trivial to add is to keep the 4
> asid context entries in cpu_tlbstate and cache the last asid in thread
> info. If that's still valid then use it otherwise unconditionally get a new
> one. That avoids the whole loop machinery and thread info is cache hot in
> the context switch anyway. Delta patch on top of your version below.

I'm not sure I understand. If an mm has ASID 0 on CPU 0 and ASID 1 on
CPU 1 and a thread in that mm bounces back and forth between those
CPUs, won't your patch cause it to flush every time?

--Andy
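
For what it's worth, the 4-way set-associative idea quoted above could
look something like the sketch below. This is entirely hypothetical:
the bucket count, the hash standing in for the per-mm random bucket,
the eviction policy and the helper names are all made up for
illustration, and it is not what this series implements.

/*
 * 1024 (ctx_id, tlb_gen) slots per CPU, split into 256 buckets of 4,
 * with each mm mapped to one bucket.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_BUCKETS	256
#define WAYS		4

struct tlb_ctx {
	uint64_t ctx_id;	/* 0 means "slot unused" (ctx_id 0 is reserved) */
	uint64_t tlb_gen;
};

static struct tlb_ctx ctxs[NR_BUCKETS * WAYS];	/* per-CPU in reality */

static unsigned int mm_bucket(uint64_t ctx_id)
{
	return ctx_id % NR_BUCKETS;	/* stand-in for a per-mm random bucket */
}

/* Returns the slot to load and whether its old TLB contents must go. */
static unsigned int choose_asid(uint64_t ctx_id, uint64_t tlb_gen,
				bool *need_flush)
{
	unsigned int base = mm_bucket(ctx_id) * WAYS;
	unsigned int i;

	for (i = 0; i < WAYS; i++) {
		if (ctxs[base + i].ctx_id == ctx_id) {
			*need_flush = ctxs[base + i].tlb_gen < tlb_gen;
			ctxs[base + i].tlb_gen = tlb_gen;
			return base + i;
		}
	}

	/* Miss: evict way 0 of this bucket (a real version would pick better). */
	*need_flush = true;
	ctxs[base].ctx_id = ctx_id;
	ctxs[base].tlb_gen = tlb_gen;
	return base;
}

int main(void)
{
	bool flush;
	unsigned int slot = choose_asid(42, 7, &flush);

	printf("slot %u, need_flush=%d\n", slot, flush);
	slot = choose_asid(42, 7, &flush);	/* hit: no flush needed */
	printf("slot %u, need_flush=%d\n", slot, flush);
	return 0;
}

As noted above, the cost relative to the series' small per-CPU array is
that the bucket's cacheline is much less likely to already be hot.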

2017-06-22 19:05:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

On Thu, Jun 22, 2017 at 10:47:29AM -0700, Andy Lutomirski wrote:
> I figured that some future reader of this patch might actually want to
> see this text, though.

Oh, don't get me wrong: with commit messages more is more, in the
general case. That's why I said "if".

> >> The UV tlbflush code is rather dated and should be changed.
>
> And I'd definitely like the UV maintainers to notice this part, now or
> in the future :) I don't want to personally touch the UV code with a
> ten-foot pole, but it really should be updated by someone who has a
> chance of getting it right and being able to test it.

Ah, could be because they moved recently and have hpe addresses now.
Lemme add them.

> >> +
> >> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> >> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
> >
> > It seems we haz a helper for that: cpumask_test_and_clear_cpu() which
> > does BTR straightaway.
>
> Yeah, but I'm doing this for performance. I think that all the
> various one-line helpers do a LOCKed op right away, and I think it's
> faster to see if we can avoid the LOCKed op by trying an ordinary read
> first.

Right, the test part of the operation is unlocked so if that is the
likely case, it is a win.

> OTOH, maybe this is misguided -- if the cacheline lives somewhere else
> and we do end up needing to update it, we'll end up first sharing it
> and then making it exclusive, which increases the amount of cache
> coherency traffic, so maybe I'm optimizing for the wrong thing. What
> do you think?

Yeah, but we'll have to do that anyway for the locked operation. Ok,
let's leave it split like it is.

> It did in one particular buggy incarnation. It would also trigger if,
> say, suspend/resume corrupts CR3. Admittedly this is unlikely, but
> I'd rather catch it. Once PCID is on, corruption seems a bit less
> farfetched -- this assertion will catch anyone who accidentally does
> write_cr3(read_cr3_pa()).

Ok, but let's put a comment over it pls as it is not obvious when
something like that can happen.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-22 21:22:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

On Thu, 22 Jun 2017, Andy Lutomirski wrote:
> On Thu, Jun 22, 2017 at 5:21 AM, Thomas Gleixner <[email protected]> wrote:
> > Now one other optimization which should be trivial to add is to keep the 4
> > asid context entries in cpu_tlbstate and cache the last asid in thread
> > info. If that's still valid then use it otherwise unconditionally get a new
> > one. That avoids the whole loop machinery and thread info is cache hot in
> > the context switch anyway. Delta patch on top of your version below.
>
> I'm not sure I understand. If an mm has ASID 0 on CPU 0 and ASID 1 on
> CPU 1 and a thread in that mm bounces back and forth between those
> CPUs, won't your patch cause it to flush every time?

Yeah, I was too focussed on the non migratory case, where two tasks from
different processes play rapid ping pong. That's what I was looking at for
various reasons.

There the cached asid really helps by avoiding the loop completely, but
yes, the search needs to be done for the bouncing between CPUs case.

So maybe a combo of those might be interesting.

Thanks,

tglx

2017-06-23 03:09:53

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

On Thu, Jun 22, 2017 at 2:22 PM, Thomas Gleixner <[email protected]> wrote:
> On Thu, 22 Jun 2017, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2017 at 5:21 AM, Thomas Gleixner <[email protected]> wrote:
>> > Now one other optimization which should be trivial to add is to keep the 4
>> > asid context entries in cpu_tlbstate and cache the last asid in thread
>> > info. If that's still valid then use it otherwise unconditionally get a new
>> > one. That avoids the whole loop machinery and thread info is cache hot in
>> > the context switch anyway. Delta patch on top of your version below.
>>
>> I'm not sure I understand. If an mm has ASID 0 on CPU 0 and ASID 1 on
>> CPU 1 and a thread in that mm bounces back and forth between those
>> CPUs, won't your patch cause it to flush every time?
>
> Yeah, I was too focussed on the non migratory case, where two tasks from
> different processes play rapid ping pong. That's what I was looking at for
> various reasons.
>
> There the cached asid really helps by avoiding the loop completely, but
> yes, the search needs to be done for the bouncing between CPUs case.
>
> So maybe a combo of those might be interesting.
>

I'm not too worried about optimizing away the loop. It's a loop over
four or six things that are all in cachelines that we need anyway. I
suspect that we'll never be able to see it in any microbenchmark, let
alone real application.

2017-06-23 07:29:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

On Thu, 22 Jun 2017, Andy Lutomirski wrote:
> On Thu, Jun 22, 2017 at 2:22 PM, Thomas Gleixner <[email protected]> wrote:
> > On Thu, 22 Jun 2017, Andy Lutomirski wrote:
> >> On Thu, Jun 22, 2017 at 5:21 AM, Thomas Gleixner <[email protected]> wrote:
> >> > Now one other optimization which should be trivial to add is to keep the 4
> >> > asid context entries in cpu_tlbstate and cache the last asid in thread
> >> > info. If that's still valid then use it otherwise unconditionally get a new
> >> > one. That avoids the whole loop machinery and thread info is cache hot in
> >> > the context switch anyway. Delta patch on top of your version below.
> >>
> >> I'm not sure I understand. If an mm has ASID 0 on CPU 0 and ASID 1 on
> >> CPU 1 and a thread in that mm bounces back and forth between those
> >> CPUs, won't your patch cause it to flush every time?
> >
> > Yeah, I was too focussed on the non migratory case, where two tasks from
> > different processes play rapid ping pong. That's what I was looking at for
> > various reasons.
> >
> > There the cached asid really helps by avoiding the loop completely, but
> > yes, the search needs to be done for the bouncing between CPUs case.
> >
> > So maybe a combo of those might be interesting.
> >
>
> I'm not too worried about optimizing away the loop. It's a loop over
> four or six things that are all in cachelines that we need anyway. I
> suspect that we'll never be able to see it in any microbenchmark, let
> alone real application.

Fair enough.

2017-06-23 08:42:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Thu, Jun 22, 2017 at 11:08:38AM -0700, Andy Lutomirski wrote:
> Yes, I agree it's confusing. There really are three numbers. Those
> numbers are: the latest generation, the generation that this CPU has
> caught up to, and the generation that the requester of the flush we're
> currently handling has asked us to catch up to. I don't see a way to
> reduce the complexity.

Yeah, can you pls put that clarification of what is what over it. It
explains it nicely what the check is supposed to do.

> >> The flush IPI hits after a switch_mm_irqs_off() call notices the
> >> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
> >> increment the local tlb_gen to 2, and the IPI handler for the partial
> >> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
> >> == 2 and mm_tlb_gen == 3) and do a partial flush.
> >
> > Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.
> >
> > That's why you have this thing in addition to the tlb_gen.
>
> Yes. The idea is that we only do remote partial flushes when it's
> 100% obvious that it's safe.

So why wouldn't my simplified suggestion work then?

if (f->end != TLB_FLUSH_ALL &&
mm_tlb_gen == local_tlb_gen + 1)

1->2 is a partial flush - gets promoted to a full one
2->3 is a full flush - it will get executed as one due to the f->end setting to
TLB_FLUSH_ALL.

> It could be converted to two full flushes or to just one, I think,
> depending on what order everything happens in.

Right. One flush at the right time would be optimal.

> But this approach of using three separate tlb_gen values seems to
> cover all the bases, and I don't think it's *that* bad.

Sure.

As I said in IRC, let's document that complexity then so that when we
stumble over it in the future, we at least know why it was done this
way.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-23 09:08:04

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 07/11] x86/mm: Stop calling leave_mm() in idle code

On Tue, Jun 20, 2017 at 10:22:13PM -0700, Andy Lutomirski wrote:
> Now that lazy TLB suppresses all flush IPIs (as opposed to all but
> the first), there's no need to leave_mm() when going idle.
>
> This means we can get rid of the rcuidle hack in
> switch_mm_irqs_off() and we can unexport leave_mm().
>
> This also removes acpi_unlazy_tlb() from the x86 and ia64 headers,
> since it has no callers any more.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/ia64/include/asm/acpi.h | 2 --
> arch/x86/include/asm/acpi.h | 2 --
> arch/x86/mm/tlb.c | 19 +++----------------
> drivers/acpi/processor_idle.c | 2 --
> drivers/idle/intel_idle.c | 9 ++++-----
> 5 files changed, 7 insertions(+), 27 deletions(-)

Reviewed-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-23 09:24:25

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 08/11] x86/mm: Disable PCID on 32-bit kernels

On Tue, Jun 20, 2017 at 10:22:14PM -0700, Andy Lutomirski wrote:
> 32-bit kernels on new hardware will see PCID in CPUID, but PCID can
> only be used in 64-bit mode. Rather than making all PCID code
> conditional, just disable the feature on 32-bit builds.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/disabled-features.h | 4 +++-
> arch/x86/kernel/cpu/bugs.c | 8 ++++++++
> 2 files changed, 11 insertions(+), 1 deletion(-)

Reviewed-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-23 09:34:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 09/11] x86/mm: Add nopcid to turn off PCID

On Tue, Jun 20, 2017 at 10:22:15PM -0700, Andy Lutomirski wrote:
> The parameter is only present on x86_64 systems to save a few bytes,
> as PCID is always disabled on x86_32.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> Documentation/admin-guide/kernel-parameters.txt | 2 ++
> arch/x86/kernel/cpu/common.c | 18 ++++++++++++++++++
> 2 files changed, 20 insertions(+)

Reviewed-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-23 11:50:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 10/11] x86/mm: Enable CR4.PCIDE on supported systems

On Tue, Jun 20, 2017 at 10:22:16PM -0700, Andy Lutomirski wrote:
> We can use PCID if the CPU has PCID and PGE and we're not on Xen.
>
> By itself, this has no effect. The next patch will start using
> PCID.
>
> Cc: Juergen Gross <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/tlbflush.h | 8 ++++++++
> arch/x86/kernel/cpu/common.c | 15 +++++++++++++++
> arch/x86/xen/enlighten_pv.c | 6 ++++++
> 3 files changed, 29 insertions(+)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 87b13e51e867..57b305e13c4c 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -243,6 +243,14 @@ static inline void __flush_tlb_all(void)
> __flush_tlb_global();
> else
> __flush_tlb();
> +
> + /*
> + * Note: if we somehow had PCID but not PGE, then this wouldn't work --
> + * we'd end up flushing kernel translations for the current ASID but
> + * we might fail to flush kernel translations for other cached ASIDs.
> + *
> + * To avoid this issue, we force PCID off if PGE is off.
> + */
> }
>
> static inline void __flush_tlb_one(unsigned long addr)
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 904485e7b230..01caf66b270f 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1143,6 +1143,21 @@ static void identify_cpu(struct cpuinfo_x86 *c)
> setup_smep(c);
> setup_smap(c);
>
> + /* Set up PCID */
> + if (cpu_has(c, X86_FEATURE_PCID)) {
> + if (cpu_has(c, X86_FEATURE_PGE)) {

What are we protecting ourselves here against? Funny virtualization guests?

Because PGE should be ubiquitous by now. Or have you heard something?

> + cr4_set_bits(X86_CR4_PCIDE);
> + } else {
> + /*
> + * flush_tlb_all(), as currently implemented, won't
> + * work if PCID is on but PGE is not. Since that
> + * combination doesn't exist on real hardware, there's
> + * no reason to try to fully support it.
> + */
> + clear_cpu_cap(c, X86_FEATURE_PCID);
> + }
> + }

This whole in setup_pcid() I guess, like the rest of the features.

> +
> /*
> * The vendor-specific functions might have changed features.
> * Now we do "generic changes."
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index f33eef4ebd12..a136aac543c3 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -295,6 +295,12 @@ static void __init xen_init_capabilities(void)
> setup_clear_cpu_cap(X86_FEATURE_ACC);
> setup_clear_cpu_cap(X86_FEATURE_X2APIC);
>
> + /*
> + * Xen PV would need some work to support PCID: CR3 handling as well
> + * as xen_flush_tlb_others() would need updating.
> + */
> + setup_clear_cpu_cap(X86_FEATURE_PCID);
> +
> if (!xen_initial_domain())
> setup_clear_cpu_cap(X86_FEATURE_ACPI);
>
> --
> 2.9.4
>

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-06-23 13:35:41

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking


> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 1d7a7213a310..f5df56fb8b5c 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -1005,8 +1005,7 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
> /* Get the "official" set of cpus referring to our pagetable. */
> if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
> for_each_online_cpu(cpu) {
> - if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
> - && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
> + if (per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
> continue;
> smp_call_function_single(cpu, drop_mm_ref_this_cpu, mm, 1);
> }
>


I wonder then whether
cpumask_copy(mask, mm_cpumask(mm));
immediately below is needed.

-boris

2017-06-23 13:36:48

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v3 10/11] x86/mm: Enable CR4.PCIDE on supported systems


> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index f33eef4ebd12..a136aac543c3 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -295,6 +295,12 @@ static void __init xen_init_capabilities(void)
> setup_clear_cpu_cap(X86_FEATURE_ACC);
> setup_clear_cpu_cap(X86_FEATURE_X2APIC);
>
> + /*
> + * Xen PV would need some work to support PCID: CR3 handling as well
> + * as xen_flush_tlb_others() would need updating.
> + */
> + setup_clear_cpu_cap(X86_FEATURE_PCID);
> +
> if (!xen_initial_domain())
> setup_clear_cpu_cap(X86_FEATURE_ACPI);
>


Reviewed-by: Boris Ostrovsky <[email protected]>

2017-06-23 15:23:23

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

On Fri, Jun 23, 2017 at 6:34 AM, Boris Ostrovsky
<[email protected]> wrote:
>
>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>> index 1d7a7213a310..f5df56fb8b5c 100644
>> --- a/arch/x86/xen/mmu_pv.c
>> +++ b/arch/x86/xen/mmu_pv.c
>> @@ -1005,8 +1005,7 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
>> /* Get the "official" set of cpus referring to our pagetable. */
>> if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
>> for_each_online_cpu(cpu) {
>> - if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
>> - && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
>> + if (per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
>> continue;
>> smp_call_function_single(cpu, drop_mm_ref_this_cpu, mm, 1);
>> }
>>
>
>
> I wonder then whether
> cpumask_copy(mask, mm_cpumask(mm));
> immediately below is needed.

Probably not. I'll change it to cpumask_clear(). Then the two cases
in that function match better.

>
> -boris

2017-06-23 15:28:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 10/11] x86/mm: Enable CR4.PCIDE on supported systems

On Fri, Jun 23, 2017 at 4:50 AM, Borislav Petkov <[email protected]> wrote:
> On Tue, Jun 20, 2017 at 10:22:16PM -0700, Andy Lutomirski wrote:
>> We can use PCID if the CPU has PCID and PGE and we're not on Xen.
>>
>> By itself, this has no effect. The next patch will start using
>> PCID.
>>
>> Cc: Juergen Gross <[email protected]>
>> Cc: Boris Ostrovsky <[email protected]>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> arch/x86/include/asm/tlbflush.h | 8 ++++++++
>> arch/x86/kernel/cpu/common.c | 15 +++++++++++++++
>> arch/x86/xen/enlighten_pv.c | 6 ++++++
>> 3 files changed, 29 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>> index 87b13e51e867..57b305e13c4c 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -243,6 +243,14 @@ static inline void __flush_tlb_all(void)
>> __flush_tlb_global();
>> else
>> __flush_tlb();
>> +
>> + /*
>> + * Note: if we somehow had PCID but not PGE, then this wouldn't work --
>> + * we'd end up flushing kernel translations for the current ASID but
>> + * we might fail to flush kernel translations for other cached ASIDs.
>> + *
>> + * To avoid this issue, we force PCID off if PGE is off.
>> + */
>> }
>>
>> static inline void __flush_tlb_one(unsigned long addr)
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index 904485e7b230..01caf66b270f 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -1143,6 +1143,21 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>> setup_smep(c);
>> setup_smap(c);
>>
>> + /* Set up PCID */
>> + if (cpu_has(c, X86_FEATURE_PCID)) {
>> + if (cpu_has(c, X86_FEATURE_PGE)) {
>
> What are we protecting ourselves here against? Funny virtualization guests?
>
> Because PGE should be ubiquitous by now. Or have you heard something?

Yes, funny VM guests. I've been known to throw weird options at qemu
myself, and I prefer when the system works. In this particular case,
I think the failure mode would be stale kernel TLB entries, and that
would be really annoying.

>
>> + cr4_set_bits(X86_CR4_PCIDE);
>> + } else {
>> + /*
>> + * flush_tlb_all(), as currently implemented, won't
>> + * work if PCID is on but PGE is not. Since that
>> + * combination doesn't exist on real hardware, there's
>> + * no reason to try to fully support it.
>> + */
>> + clear_cpu_cap(c, X86_FEATURE_PCID);
>> + }
>> + }
>
> This whole in setup_pcid() I guess, like the rest of the features.

Done.
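
Folding that into a helper next to setup_smep()/setup_smap() could look
roughly like this; it is only a sketch reusing the calls from the hunk
quoted above, and the final patch may differ in detail:

static void setup_pcid(struct cpuinfo_x86 *c)
{
	if (!cpu_has(c, X86_FEATURE_PCID))
		return;

	if (cpu_has(c, X86_FEATURE_PGE)) {
		cr4_set_bits(X86_CR4_PCIDE);
	} else {
		/*
		 * flush_tlb_all(), as currently implemented, won't work if
		 * PCID is on but PGE is not.  Since that combination doesn't
		 * exist on real hardware, there's no reason to try to fully
		 * support it.
		 */
		clear_cpu_cap(c, X86_FEATURE_PCID);
	}
}

The call site in identify_cpu() then shrinks to a single setup_pcid(c)
next to the other feature setup calls.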

2017-06-23 15:47:04

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

On Fri, Jun 23, 2017 at 1:42 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Jun 22, 2017 at 11:08:38AM -0700, Andy Lutomirski wrote:
>> Yes, I agree it's confusing. There really are three numbers. Those
>> numbers are: the latest generation, the generation that this CPU has
>> caught up to, and the generation that the requester of the flush we're
>> currently handling has asked us to catch up to. I don't see a way to
>> reduce the complexity.
>
> Yeah, can you pls put that clarification what what is, over it. It
> explains it nicely what the check is supposed to do.

Done. I've tried to improve a bunch of the comments in this function.

>
>> >> The flush IPI hits after a switch_mm_irqs_off() call notices the
>> >> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
>> >> increment the local tlb_gen to 2, and the IPI handler for the partial
>> >> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
>> >> == 2 and mm_tlb_gen == 3) and do a partial flush.
>> >
>> > Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.
>> >
>> > That's why you have this thing in addition to the tlb_gen.
>>
>> Yes. The idea is that we only do remote partial flushes when it's
>> 100% obvious that it's safe.
>
> So why wouldn't my simplified suggestion work then?
>
> if (f->end != TLB_FLUSH_ALL &&
> mm_tlb_gen == local_tlb_gen + 1)
>
> 1->2 is a partial flush - gets promoted to a full one
> 2->3 is a full flush - it will get executed as one due to the f->end setting to
> TLB_FLUSH_ALL.

This could still fail in some cases, I think. Suppose 1->2 is a
partial flush and 2->3 is a full flush. We could have this order of
events:

- CPU 1: Partial flush. Increase context.tlb_gen to 2 and send IPI.
- CPU 0: switch_mm(), observe mm_tlb_gen == 2, set local_tlb_gen to 2.
- CPU 2: Full flush. Increase context.tlb_gen to 3 and send IPI.
- CPU 0: Receive partial flush IPI. mm_tlb_gen == 3 and
local_tlb_gen == 2, so the simplified check passes. Do
__flush_tlb_single() and set local_tlb_gen to 3.

Our invariant is now broken: CPU 0's percpu tlb_gen is now ahead of
its actual TLB state.

- CPU 0: Receive full flush IPI and skip the flush. Oops.

I think my condition makes it clear that the invariants we need hold
no matter what.
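
Making that interleaving concrete, here is the racy partial-flush IPI
evaluated against both conditions (plain C, just to show the numbers;
the variable names mirror the discussion):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	unsigned long long local_tlb_gen = 2;	/* set by switch_mm() on CPU 0 */
	unsigned long long mm_tlb_gen    = 3;	/* CPU 2's full flush already bumped it */
	unsigned long long new_tlb_gen   = 2;	/* what the partial-flush IPI asks for */
	bool is_partial = true;			/* f->end != TLB_FLUSH_ALL */

	bool simplified = is_partial && mm_tlb_gen == local_tlb_gen + 1;
	bool stricter   = is_partial && new_tlb_gen == local_tlb_gen + 1 &&
			  new_tlb_gen == mm_tlb_gen;

	printf("simplified check stays partial: %d\n", simplified);	/* 1: unsafe */
	printf("stricter check stays partial:   %d\n", stricter);	/* 0: promoted to full */
	return 0;
}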

>
>> It could be converted to two full flushes or to just one, I think,
>> depending on what order everything happens in.
>
> Right. One flush at the right time would be optimal.
>
>> But this approach of using three separate tlb_gen values seems to
>> cover all the bases, and I don't think it's *that* bad.
>
> Sure.
>
> As I said in IRC, let's document that complexity then so that when we
> stumble over it in the future, we at least know why it was done this
> way.

I've given it a try. Hopefully v4 is more clear.

2017-06-26 15:59:03

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

On Tue, Jun 20, 2017 at 10:22:17PM -0700, Andy Lutomirski wrote:
> PCID is a "process context ID" -- it's what other architectures call
> an address space ID. Every non-global TLB entry is tagged with a
> PCID, only TLB entries that match the currently selected PCID are
> used, and we can switch PGDs without flushing the TLB. x86's
> PCID is 12 bits.
>
> This is an unorthodox approach to using PCID. x86's PCID is far too
> short to uniquely identify a process, and we can't even really
> uniquely identify a running process because there are monster
> systems with over 4096 CPUs. To make matters worse, past attempts
> to use all 12 PCID bits have resulted in slowdowns instead of
> speedups.
>
> This patch uses PCID differently. We use a PCID to identify a
> recently-used mm on a per-cpu basis. An mm has no fixed PCID
> binding at all; instead, we give it a fresh PCID each time it's
> loaded except in cases where we want to preserve the TLB, in which
> case we reuse a recent value.
>
> This seems to save about 100ns on context switches between mms.

"... with my microbenchmark of ping-ponging." :)

>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/mmu_context.h | 3 ++
> arch/x86/include/asm/processor-flags.h | 2 +
> arch/x86/include/asm/tlbflush.h | 18 +++++++-
> arch/x86/mm/init.c | 1 +
> arch/x86/mm/tlb.c | 82 ++++++++++++++++++++++++++--------
> 5 files changed, 86 insertions(+), 20 deletions(-)

...

> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 57b305e13c4c..a9a5aa6f45f7 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -82,6 +82,12 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
> #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
> #endif
>
> +/*
> + * 6 because 6 should be plenty and struct tlb_state will fit in
> + * two cache lines.
> + */
> +#define NR_DYNAMIC_ASIDS 6

TLB_NR_DYN_ASIDS

Properly prefixed, I guess.

The rest later, when you're done experimenting. :)

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
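
For orientation, the per-CPU state these fields live in looks roughly
like the sketch below, reconstructed from the hunks quoted in this
thread; the exact types, field order and any members not mentioned in
the thread are guesses, not the authoritative definition:

#define TLB_NR_DYN_ASIDS 6		/* "6 should be plenty": two cache lines */

struct tlb_context {
	u64 ctx_id;			/* which mm this slot last held (0 = unused) */
	u64 tlb_gen;			/* generation that slot's TLB contents reflect */
};

struct tlb_state {
	struct mm_struct *loaded_mm;	/* the mm CR3 currently points at */
	unsigned int next_asid;		/* next slot to recycle on a miss */
	struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
};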

2017-07-27 19:54:01

by Andrew Banman

[permalink] [raw]
Subject: Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

On Thu, Jun 22, 2017 at 10:47:29AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 22, 2017 at 7:50 AM, Borislav Petkov <[email protected]> wrote:
> > On Tue, Jun 20, 2017 at 10:22:12PM -0700, Andy Lutomirski wrote:
> >> Rewrite it entirely. When we enter lazy mode, we simply remove the
> >> cpu from mm_cpumask. This means that we need a way to figure out
> >
> > s/cpu/CPU/
>
> Done.
>
> >
> >> whether we've missed a flush when we switch back out of lazy mode.
> >> I use the tlb_gen machinery to track whether a context is up to
> >> date.
> >>
> >> Note to reviewers: this patch, by itself, looks a bit odd. I'm
> >> using an array of length 1 containing (ctx_id, tlb_gen) rather than
> >> just storing tlb_gen, and making it an array isn't necessary yet.
> >> I'm doing this because the next few patches add PCID support, and,
> >> with PCID, we need ctx_id, and the array will end up with a length
> >> greater than 1. Making it an array now means that there will be
> >> less churn and therefore less stress on your eyeballs.
> >>
> >> NB: This is dubious but, AFAICT, still correct on Xen and UV.
> >> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
> >> patch changes the way that mm_cpumask() works. This should be okay,
> >> since Xen *also* iterates all online CPUs to find all the CPUs it
> >> needs to twiddle.
> >
> > This whole text should be under the "---" line below if we don't want it
> > in the commit message.
>
> I figured that some future reader of this patch might actually want to
> see this text, though.
>
> >
> >>
> >> The UV tlbflush code is rather dated and should be changed.
>
> And I'd definitely like the UV maintainers to notice this part, now or
> in the future :) I don't want to personally touch the UV code with a
> ten-foot pole, but it really should be updated by someone who has a
> chance of getting it right and being able to test it.

Noticed! We're aware of these changes and we're planning on updating this
code in the future. Presently the BAU tlb shootdown feature is working well
on our recent hardware.

Thank you,

Andrew
HPE <[email protected]>

>
> >> +
> >> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> >> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
> >
> > It seems we haz a helper for that: cpumask_test_and_clear_cpu() which
> > does BTR straightaway.
>
> Yeah, but I'm doing this for performance. I think that all the
> various one-line helpers do a LOCKed op right away, and I think it's
> faster to see if we can avoid the LOCKed op by trying an ordinary read
> first. OTOH, maybe this is misguided -- if the cacheline lives
> somewhere else and we do end up needing to update it, we'll end up
> first sharing it and then making it exclusive, which increases the
> amount of cache coherency traffic, so maybe I'm optimizing for the
> wrong thing. What do you think?
>
> >> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> >> - BUG();
> >> + /* Warn if we're not lazy. */
> >> + WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
> >
> > We don't BUG() anymore?
>
> We could. But, when the whole patch series is applied, the only
> caller left is a somewhat dubious Xen optimization, and if we blindly
> continue executing, I think the worst that happens is that we OOPS
> later or that we get segfaults when we shouldn't get segfaults.
>
> >
> >>
> >> switch_mm(NULL, &init_mm, NULL);
> >> }
> >> @@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> >> {
> >> unsigned cpu = smp_processor_id();
> >> struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> >> + u64 next_tlb_gen;
> >
> > Please sort function local variables declaration in a reverse christmas
> > tree order:
> >
> > <type> longest_variable_name;
> > <type> shorter_var_name;
> > <type> even_shorter;
> > <type> i;
> >
> >>
> >> /*
> >> - * NB: The scheduler will call us with prev == next when
> >> - * switching from lazy TLB mode to normal mode if active_mm
> >> - * isn't changing. When this happens, there is no guarantee
> >> - * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
> >> + * NB: The scheduler will call us with prev == next when switching
> >> + * from lazy TLB mode to normal mode if active_mm isn't changing.
> >> + * When this happens, we don't assume that CR3 (and hence
> >> + * cpu_tlbstate.loaded_mm) matches next.
> >> *
> >> * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
> >> */
> >>
> >> - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> >> + /* We don't want flush_tlb_func_* to run concurrently with us. */
> >> + if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> >> + WARN_ON_ONCE(!irqs_disabled());
> >> +
> >> + VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));
> >
> > Why do we need that check? Can that ever happen?
>
> It did in one particular buggy incarnation. It would also trigger if,
> say, suspend/resume corrupts CR3. Admittedly this is unlikely, but
> I'd rather catch it. Once PCID is on, corruption seems a bit less
> farfetched -- this assertion will catch anyone who accidentally does
> write_cr3(read_cr3_pa()).
>
> >
> >> if (real_prev == next) {
> >> - /*
> >> - * There's nothing to do: we always keep the per-mm control
> >> - * regs in sync with cpu_tlbstate.loaded_mm. Just
> >> - * sanity-check mm_cpumask.
> >> - */
> >> - if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
> >> - cpumask_set_cpu(cpu, mm_cpumask(next));
> >> - return;
> >> - }
> >> + if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
> >> + /*
> >> + * There's nothing to do: we weren't lazy, and we
> >> + * aren't changing our mm. We don't need to flush
> >> + * anything, nor do we need to update CR3, CR4, or
> >> + * LDTR.
> >> + */
> >
> > Nice comment.
> >
> >> + return;
> >> + }
> >> +
> >> + /* Resume remote flushes and then read tlb_gen. */
> >> + cpumask_set_cpu(cpu, mm_cpumask(next));
> >> + next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> >> +
> >> + VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
> >> + next->context.ctx_id);
> >
> > I guess this check should be right under the if (real_prev == next), right?
>
> Moved.
>
> >
> >> +
> >> + if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
> >> + next_tlb_gen) {
> >
> > Yeah, let it stick out - that trailing '<' doesn't make it any nicer.
> >
> >> + /*
> >> + * Ideally, we'd have a flush_tlb() variant that
> >> + * takes the known CR3 value as input. This would
> >> + * be faster on Xen PV and on hypothetical CPUs
> >> + * on which INVPCID is fast.
> >> + */
> >> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> >> + next_tlb_gen);
> >> + write_cr3(__pa(next->pgd));
> >
> > <---- newline here.
> >
> >> + /*
> >> + * This gets called via leave_mm() in the idle path
> >> + * where RCU functions differently. Tracing normally
> >> + * uses RCU, so we have to call the tracepoint
> >> + * specially here.
> >> + */
> >> + trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
> >> + TLB_FLUSH_ALL);
> >> + }
> >>
> >> - if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> >> /*
> >> - * If our current stack is in vmalloc space and isn't
> >> - * mapped in the new pgd, we'll double-fault. Forcibly
> >> - * map it.
> >> + * We just exited lazy mode, which means that CR4 and/or LDTR
> >> + * may be stale. (Changes to the required CR4 and LDTR states
> >> + * are not reflected in tlb_gen.)
> >
> > We need that comment because... ? I mean, we do update CR4/LDTR at the
> > end of the function.
>
> I'm trying to explain to the potentially confused reader why we fall
> through and update CR4 and LDTR even if we decided not to update the
> TLB.
>
> >
> >> */
> >> - unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
> >> + } else {
> >> + if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> >> + /*
> >> + * If our current stack is in vmalloc space and isn't
> >> + * mapped in the new pgd, we'll double-fault. Forcibly
> >> + * map it.
> >> + */
> >> + unsigned int stack_pgd_index =
> >
> > Shorten that var name and make it fit into 80ish cols so that the
> > linebreak is gone.
>
> Done.
>
> >
> >> + pgd_index(current_stack_pointer());
> >> +
> >> + pgd_t *pgd = next->pgd + stack_pgd_index;
> >> +
> >> + if (unlikely(pgd_none(*pgd)))
> >> + set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
> >> + }
> >>
> >> - pgd_t *pgd = next->pgd + stack_pgd_index;
> >> + /* Stop remote flushes for the previous mm */
> >> + if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
> >> + cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
> >
> > cpumask_test_and_clear_cpu()
>
> Same as before. I can change this, but it'll have different
> performance characteristics. This one here will optimize the case
> where we go lazy and then switch away.
>
> >
> >>
> >> - if (unlikely(pgd_none(*pgd)))
> >> - set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
> >> - }
> >> + WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
> >
> > We warn if the next task is not lazy because...?
>
> The next task had better not be in mm_cpumask(), but I agree that this
> is minor. I'll change it to VM_WARN_ON_ONCE.
>
> >
> >> - this_cpu_write(cpu_tlbstate.loaded_mm, next);
> >> - this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
> >> - this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> >> - atomic64_read(&next->context.tlb_gen));
> >> + /*
> >> + * Start remote flushes and then read tlb_gen.
> >> + */
> >> + cpumask_set_cpu(cpu, mm_cpumask(next));
> >> + next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> >>
> >> - WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
> >> - cpumask_set_cpu(cpu, mm_cpumask(next));
> >> + VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
> >> + next->context.ctx_id);
> >
> > Also put it as the first statement after the else { ?
>
> Done.
>
> >
> >> - /*
> >> - * Re-load page tables.
> >> - *
> >> - * This logic has an ordering constraint:
> >> - *
> >> - * CPU 0: Write to a PTE for 'next'
> >> - * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
> >> - * CPU 1: set bit 1 in next's mm_cpumask
> >> - * CPU 1: load from the PTE that CPU 0 writes (implicit)
> >> - *
> >> - * We need to prevent an outcome in which CPU 1 observes
> >> - * the new PTE value and CPU 0 observes bit 1 clear in
> >> - * mm_cpumask. (If that occurs, then the IPI will never
> >> - * be sent, and CPU 0's TLB will contain a stale entry.)
> >> - *
> >> - * The bad outcome can occur if either CPU's load is
> >> - * reordered before that CPU's store, so both CPUs must
> >> - * execute full barriers to prevent this from happening.
> >> - *
> >> - * Thus, switch_mm needs a full barrier between the
> >> - * store to mm_cpumask and any operation that could load
> >> - * from next->pgd. TLB fills are special and can happen
> >> - * due to instruction fetches or for no reason at all,
> >> - * and neither LOCK nor MFENCE orders them.
> >> - * Fortunately, load_cr3() is serializing and gives the
> >> - * ordering guarantee we need.
> >> - */
> >> - load_cr3(next->pgd);
> >> -
> >> - /*
> >> - * This gets called via leave_mm() in the idle path where RCU
> >> - * functions differently. Tracing normally uses RCU, so we have to
> >> - * call the tracepoint specially here.
> >> - */
> >> - trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> >> + this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,
> >> + next->context.ctx_id);
> >> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> >> + next_tlb_gen);
> >
> > Yeah, just let all those three stick out.
> >
> > Also, no:
> >
> > if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
> > next_tlb_gen) {
> >
> > check?
>
> What would the check do? Without PCID, we write the new CR3
> unconditionally in this branch, and that write flushes everything, so
> we don't have a choice as to whether we flush.
>
> >
> >> + this_cpu_write(cpu_tlbstate.loaded_mm, next);
> >> + write_cr3(__pa(next->pgd));
> >>
> >> - /* Stop flush ipis for the previous mm */
> >> - WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
> >> - real_prev != &init_mm);
> >> - cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
> >> + /*
> >> + * This gets called via leave_mm() in the idle path where RCU
> >> + * functions differently. Tracing normally uses RCU, so we
> >> + * have to call the tracepoint specially here.
> >> + */
> >> + trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
> >> + TLB_FLUSH_ALL);
> >> + }
> >
> > That's repeated from the if-branch above. Move it out, I guess.
>
> I would have, but it changes later in the series and this reduces churn.

2017-07-28 02:06:09

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

> On Jul 27, 2017, at 3:53 PM, Andrew Banman <[email protected]> wrote:
>
>> On Thu, Jun 22, 2017 at 10:47:29AM -0700, Andy Lutomirski wrote:
>>> On Thu, Jun 22, 2017 at 7:50 AM, Borislav Petkov <[email protected]> wrote:
>>>> On Tue, Jun 20, 2017 at 10:22:12PM -0700, Andy Lutomirski wrote:
>>>> Rewrite it entirely. When we enter lazy mode, we simply remove the
>>>> cpu from mm_cpumask. This means that we need a way to figure out
>>>
>>> s/cpu/CPU/
>>
>> Done.
>>
>>>
>>>> whether we've missed a flush when we switch back out of lazy mode.
>>>> I use the tlb_gen machinery to track whether a context is up to
>>>> date.
>>>>
>>>> Note to reviewers: this patch, by itself, looks a bit odd. I'm
>>>> using an array of length 1 containing (ctx_id, tlb_gen) rather than
>>>> just storing tlb_gen, and making it an array isn't necessary yet.
>>>> I'm doing this because the next few patches add PCID support, and,
>>>> with PCID, we need ctx_id, and the array will end up with a length
>>>> greater than 1. Making it an array now means that there will be
>>>> less churn and therefore less stress on your eyeballs.
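>>
>> (For readers following along: the per-CPU state this refers to is
>> roughly
>>
>> 	struct tlb_context {
>> 		u64 ctx_id;
>> 		u64 tlb_gen;
>> 	};
>>
>> i.e. struct tlb_state grows a "struct tlb_context ctxs[1];" member in
>> this patch, and the array grows to more than one entry once PCID
>> support lands later in the series.)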
>>>>
>>>> NB: This is dubious but, AFAICT, still correct on Xen and UV.
>>>> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
>>>> patch changes the way that mm_cpumask() works. This should be okay,
>>>> since Xen *also* iterates all online CPUs to find all the CPUs it
>>>> needs to twiddle.
>>>
>>> This whole text should be under the "---" line below if we don't want it
>>> in the commit message.
>>
>> I figured that some future reader of this patch might actually want to
>> see this text, though.
>>
>>>
>>>>
>>>> The UV tlbflush code is rather dated and should be changed.
>>
>> And I'd definitely like the UV maintainers to notice this part, now or
>> in the future :) I don't want to personally touch the UV code with a
>> ten-foot pole, but it really should be updated by someone who has a
>> chance of getting it right and being able to test it.
>
> Noticed! We're aware of these changes and we're planning on updating this
> code in the future. Presently the BAU TLB shootdown feature is working well
> on our recent hardware.

:)

I would suggest reworking it to hook the SMP function call
infrastructure instead of the TLB shootdown code.
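
Very roughly, the idea on the generic side is (a sketch; names follow
this series, but treat them as illustrative):

	static void native_flush_tlb_others(const struct cpumask *cpumask,
					    const struct flush_tlb_info *info)
	{
		/* A remote TLB flush is just a cross-CPU function call. */
		smp_call_function_many(cpumask, flush_tlb_func_remote,
				       (void *)info, 1);
	}

A platform with a fast broadcast mechanism (the BAU, say) could plug in
underneath the function-call IPI path instead of overriding the
TLB-specific hook, and it would then automatically stay in sync with
the mm_cpumask() and tlb_gen semantics.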