2018-07-16 19:04:35

by Rik van Riel

Subject: [PATCH v6 0/7] x86,tlb,mm: make lazy TLB mode even lazier

Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
kernels, using 1.9% of a 48 CPU system during a netperf run. Digging
into the profile, the atomic operations in cpumask_clear_cpu and
cpumask_set_cpu are responsible for about half of that CPU use.

However, the CPUs running netperf are simply switching back and
forth between netperf and the idle task, which would not require any
changes to the mm_cpumask if lazy TLB mode were used.

Additionally, the init_mm cpumask ends up being the most heavily
contended one in the system, for no reason at all.

By making really lazy TLB mode work again on modern kernels, sending
a shootdown IPI only when page table pages are being unmapped, we get
back some performance.

v6 addresses the comment and Signed-off-by issues pointed out by Ingo.

On memcache workloads on 2 socket systems, this patch series seems
to reduce total system CPU use by 1-2%. On Song's netbench tests,
CPU use in the context switch path is roughly cut in half.

These patches also provide a little memory savings by shrinking
the size of mm_struct, especially on distro kernels compiled with
a gigantically large NR_CPUS.
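
To illustrate the idea (a heavily simplified sketch, not the actual
patches below; ASID handling, the warnings, and the temporary cpumask
used in native_flush_tlb_others are all omitted):

  /* switching to a kernel thread: just mark this CPU lazy */
  void enter_lazy_tlb_sketch(struct mm_struct *mm, struct task_struct *tsk)
  {
          this_cpu_write(cpu_tlbstate.is_lazy, true);
          /* the CPU stays set in mm_cpumask(mm); no atomics, no CR3 write */
  }

  /* remote TLB flush: skip lazy CPUs, except at page table free time */
  void flush_others_sketch(struct mm_struct *mm,
                           const struct flush_tlb_info *info)
  {
          int cpu;

          for_each_cpu(cpu, mm_cpumask(mm)) {
                  if (per_cpu(cpu_tlbstate.is_lazy, cpu))
                          continue;       /* lazy CPU will catch up later */
                  smp_call_function_single(cpu, flush_tlb_func_remote,
                                           (void *)info, 1);
          }
  }

A lazy CPU that skipped these IPIs checks the mm's tlb_gen at its next
real context switch, and only flushes if something changed in between.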




2018-07-16 19:04:48

by Rik van Riel

Subject: [PATCH 7/7] x86,switch_mm: skip atomic operations for init_mm

Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
kernels, using 1.8% of a 48 CPU system during a netperf to localhost run.
Digging into the profile, we noticed that cpumask_clear_cpu and
cpumask_set_cpu together take about half of the CPU time taken by
switch_mm_irqs_off.

However, the CPUs running netperf end up switching back and forth
between netperf and the idle task, which does not require changes
to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
the most heavily contended one in the system.

Simply skipping changes to mm_cpumask(&init_mm) reduces overhead.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Reported-and-tested-by: Song Liu <[email protected]>
---
arch/x86/mm/tlb.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 493559cae2d5..f086195f644c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -310,15 +310,22 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
sync_current_stack_to_mm(next);
}

- /* Stop remote flushes for the previous mm */
- VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
- real_prev != &init_mm);
- cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+ /*
+ * Stop remote flushes for the previous mm.
+ * Skip kernel threads; we never send init_mm TLB flushing IPIs,
+ * but the bitmap manipulation can cause cache line contention.
+ */
+ if (real_prev != &init_mm) {
+ VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu,
+ mm_cpumask(real_prev)));
+ cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+ }

/*
* Start remote flushes and then read tlb_gen.
*/
- cpumask_set_cpu(cpu, mm_cpumask(next));
+ if (next != &init_mm)
+ cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
--
2.14.4


2018-07-16 19:04:59

by Rik van Riel

Subject: [PATCH 3/7] x86,mm: restructure switch_mm_irqs_off

Move some code that will be needed for the lazy -> !lazy state
transition when a lazy TLB CPU has gotten out of date.

No functional changes, since the if (real_prev == next) branch
always returns.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
---
arch/x86/mm/tlb.c | 60 +++++++++++++++++++++++++++----------------------------
1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9a893673c56b..4b73fe835c95 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -187,6 +187,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
+ bool need_flush;
+ u16 new_asid;

/*
* NB: The scheduler will call us with prev == next when switching
@@ -252,8 +254,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,

return;
} else {
- u16 new_asid;
- bool need_flush;
u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);

/*
@@ -297,41 +297,41 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+ }

- if (need_flush) {
- this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
- load_new_mm_cr3(next->pgd, new_asid, true);
-
- /*
- * NB: This gets called via leave_mm() in the idle path
- * where RCU functions differently. Tracing normally
- * uses RCU, so we need to use the _rcuidle variant.
- *
- * (There is no good reason for this. The idle code should
- * be rearranged to call this before rcu_idle_enter().)
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
- } else {
- /* The new ASID is already up to date. */
- load_new_mm_cr3(next->pgd, new_asid, false);
-
- /* See above wrt _rcuidle. */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
- }
+ if (need_flush) {
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
+ load_new_mm_cr3(next->pgd, new_asid, true);

/*
- * Record last user mm's context id, so we can avoid
- * flushing branch buffer with IBPB if we switch back
- * to the same user.
+ * NB: This gets called via leave_mm() in the idle path
+ * where RCU functions differently. Tracing normally
+ * uses RCU, so we need to use the _rcuidle variant.
+ *
+ * (There is no good reason for this. The idle code should
+ * be rearranged to call this before rcu_idle_enter().)
*/
- if (next != &init_mm)
- this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ } else {
+ /* The new ASID is already up to date. */
+ load_new_mm_cr3(next->pgd, new_asid, false);

- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+ /* See above wrt _rcuidle. */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
}

+ /*
+ * Record last user mm's context id, so we can avoid
+ * flushing branch buffer with IBPB if we switch back
+ * to the same user.
+ */
+ if (next != &init_mm)
+ this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
+
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+
load_mm_cr4(next);
switch_ldt(real_prev, next);
}
--
2.14.4


2018-07-16 19:05:05

by Rik van Riel

Subject: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids

The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and simply have one bitmask at the end of the
mm_struct for the mm_cpumask.

This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.

The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the cpu bitmap) and the size of the cpumask.

For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.

Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
getting confused by the dynamically sized array.
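
As a userspace analogue of the layout (illustrative only; the struct,
BITS_TO_LONGS and obj_bitmap below are stand-ins, not the kernel
definitions):

  #include <stdio.h>
  #include <stdlib.h>
  #include <stddef.h>

  #define BITS_PER_LONG    (8 * sizeof(unsigned long))
  #define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

  struct obj {
          struct {
                  int id;                 /* "everything else", randomizable */
          };
          unsigned long cpu_bitmap[];     /* dynamically sized, must be last */
  };

  /* pointer arithmetic via offsetof, as in mm_init_cpumask() */
  static unsigned long *obj_bitmap(struct obj *o)
  {
          return (unsigned long *)((char *)o +
                                   offsetof(struct obj, cpu_bitmap));
  }

  int main(void)
  {
          unsigned int nr_cpus = 48;      /* stand-in for nr_cpu_ids */
          size_t size = sizeof(struct obj) +
                        BITS_TO_LONGS(nr_cpus) * sizeof(unsigned long);
          struct obj *o = calloc(1, size);

          if (!o)
                  return 1;
          obj_bitmap(o)[0] |= 1UL << 3;   /* "set CPU 3" */
          printf("allocated %zu bytes, bit 3 = %lu\n", size,
                 (obj_bitmap(o)[0] >> 3) & 1);
          free(o);
          return 0;
  }

Statically defined mm_structs (init_mm, efi_mm) are not allocated from
the slab, so they instead use a designated initializer to give the
trailing array a compile-time size large enough for NR_CPUS bits.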

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Tested-by: Song Liu <[email protected]>
---
drivers/firmware/efi/efi.c | 1 +
include/linux/mm_types.h | 241 +++++++++++++++++++++++----------------------
kernel/fork.c | 15 +--
mm/init-mm.c | 11 +++
4 files changed, 145 insertions(+), 123 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 232f4915223b..7f0b19410a95 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -82,6 +82,7 @@ struct mm_struct efi_mm = {
.mmap_sem = __RWSEM_INITIALIZER(efi_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(efi_mm.mmlist),
+ .cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
};

static bool disable_runtime;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..efdc24dd9e97 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,176 +335,183 @@ struct core_state {

struct kioctx_table;
struct mm_struct {
- struct vm_area_struct *mmap; /* list of VMAs */
- struct rb_root mm_rb;
- u32 vmacache_seqnum; /* per-thread vmacache */
+ struct {
+ struct vm_area_struct *mmap; /* list of VMAs */
+ struct rb_root mm_rb;
+ u32 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU
- unsigned long (*get_unmapped_area) (struct file *filp,
+ unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
#endif
- unsigned long mmap_base; /* base of mmap area */
- unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
+ unsigned long mmap_base; /* base of mmap area */
+ unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
- /* Base adresses for compatible mmap() */
- unsigned long mmap_compat_base;
- unsigned long mmap_compat_legacy_base;
+ /* Base adresses for compatible mmap() */
+ unsigned long mmap_compat_base;
+ unsigned long mmap_compat_legacy_base;
#endif
- unsigned long task_size; /* size of task vm space */
- unsigned long highest_vm_end; /* highest vma end address */
- pgd_t * pgd;
-
- /**
- * @mm_users: The number of users including userspace.
- *
- * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
- * to 0 (i.e. when the task exits and there are no other temporary
- * reference holders), we also release a reference on @mm_count
- * (which may then free the &struct mm_struct if @mm_count also
- * drops to 0).
- */
- atomic_t mm_users;
-
- /**
- * @mm_count: The number of references to &struct mm_struct
- * (@mm_users count as 1).
- *
- * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
- * &struct mm_struct is freed.
- */
- atomic_t mm_count;
+ unsigned long task_size; /* size of task vm space */
+ unsigned long highest_vm_end; /* highest vma end address */
+ pgd_t * pgd;
+
+ /**
+ * @mm_users: The number of users including userspace.
+ *
+ * Use mmget()/mmget_not_zero()/mmput() to modify. When this
+ * drops to 0 (i.e. when the task exits and there are no other
+ * temporary reference holders), we also release a reference on
+ * @mm_count (which may then free the &struct mm_struct if
+ * @mm_count also drops to 0).
+ */
+ atomic_t mm_users;
+
+ /**
+ * @mm_count: The number of references to &struct mm_struct
+ * (@mm_users count as 1).
+ *
+ * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+ * &struct mm_struct is freed.
+ */
+ atomic_t mm_count;

#ifdef CONFIG_MMU
- atomic_long_t pgtables_bytes; /* PTE page table pages */
+ atomic_long_t pgtables_bytes; /* PTE page table pages */
#endif
- int map_count; /* number of VMAs */
+ int map_count; /* number of VMAs */

- spinlock_t page_table_lock; /* Protects page tables and some counters */
- struct rw_semaphore mmap_sem;
+ spinlock_t page_table_lock; /* Protects page tables and some
+ * counters
+ */
+ struct rw_semaphore mmap_sem;

- struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
- * together off init_mm.mmlist, and are protected
- * by mmlist_lock
- */
+ struct list_head mmlist; /* List of maybe swapped mm's. These
+ * are globally strung together off
+ * init_mm.mmlist, and are protected
+ * by mmlist_lock
+ */


- unsigned long hiwater_rss; /* High-watermark of RSS usage */
- unsigned long hiwater_vm; /* High-water virtual memory usage */
+ unsigned long hiwater_rss; /* High-watermark of RSS usage */
+ unsigned long hiwater_vm; /* High-water virtual memory usage */

- unsigned long total_vm; /* Total pages mapped */
- unsigned long locked_vm; /* Pages that have PG_mlocked set */
- unsigned long pinned_vm; /* Refcount permanently increased */
- unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
- unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
- unsigned long stack_vm; /* VM_STACK */
- unsigned long def_flags;
+ unsigned long total_vm; /* Total pages mapped */
+ unsigned long locked_vm; /* Pages that have PG_mlocked set */
+ unsigned long pinned_vm; /* Refcount permanently increased */
+ unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
+ unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
+ unsigned long stack_vm; /* VM_STACK */
+ unsigned long def_flags;

- spinlock_t arg_lock; /* protect the below fields */
- unsigned long start_code, end_code, start_data, end_data;
- unsigned long start_brk, brk, start_stack;
- unsigned long arg_start, arg_end, env_start, env_end;
+ spinlock_t arg_lock; /* protect the below fields */
+ unsigned long start_code, end_code, start_data, end_data;
+ unsigned long start_brk, brk, start_stack;
+ unsigned long arg_start, arg_end, env_start, env_end;

- unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+ unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

- /*
- * Special counters, in some configurations protected by the
- * page_table_lock, in other configurations by being atomic.
- */
- struct mm_rss_stat rss_stat;
-
- struct linux_binfmt *binfmt;
+ /*
+ * Special counters, in some configurations protected by the
+ * page_table_lock, in other configurations by being atomic.
+ */
+ struct mm_rss_stat rss_stat;

- cpumask_var_t cpu_vm_mask_var;
+ struct linux_binfmt *binfmt;

- /* Architecture-specific MM context */
- mm_context_t context;
+ /* Architecture-specific MM context */
+ mm_context_t context;

- unsigned long flags; /* Must use atomic bitops to access the bits */
+ unsigned long flags; /* Must use atomic bitops to access */

- struct core_state *core_state; /* coredumping support */
+ struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_MEMBARRIER
- atomic_t membarrier_state;
+ atomic_t membarrier_state;
#endif
#ifdef CONFIG_AIO
- spinlock_t ioctx_lock;
- struct kioctx_table __rcu *ioctx_table;
+ spinlock_t ioctx_lock;
+ struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
- /*
- * "owner" points to a task that is regarded as the canonical
- * user/owner of this mm. All of the following must be true in
- * order for it to be changed:
- *
- * current == mm->owner
- * current->mm != mm
- * new_owner->mm == mm
- * new_owner->alloc_lock is held
- */
- struct task_struct __rcu *owner;
+ /*
+ * "owner" points to a task that is regarded as the canonical
+ * user/owner of this mm. All of the following must be true in
+ * order for it to be changed:
+ *
+ * current == mm->owner
+ * current->mm != mm
+ * new_owner->mm == mm
+ * new_owner->alloc_lock is held
+ */
+ struct task_struct __rcu *owner;
#endif
- struct user_namespace *user_ns;
+ struct user_namespace *user_ns;

- /* store ref to file /proc/<pid>/exe symlink points to */
- struct file __rcu *exe_file;
+ /* store ref to file /proc/<pid>/exe symlink points to */
+ struct file __rcu *exe_file;
#ifdef CONFIG_MMU_NOTIFIER
- struct mmu_notifier_mm *mmu_notifier_mm;
+ struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
- pgtable_t pmd_huge_pte; /* protected by page_table_lock */
-#endif
-#ifdef CONFIG_CPUMASK_OFFSTACK
- struct cpumask cpumask_allocation;
+ pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_NUMA_BALANCING
- /*
- * numa_next_scan is the next time that the PTEs will be marked
- * pte_numa. NUMA hinting faults will gather statistics and migrate
- * pages to new nodes if necessary.
- */
- unsigned long numa_next_scan;
+ /*
+ * numa_next_scan is the next time that the PTEs will be marked
+ * pte_numa. NUMA hinting faults will gather statistics and
+ * migrate pages to new nodes if necessary.
+ */
+ unsigned long numa_next_scan;

- /* Restart point for scanning and setting pte_numa */
- unsigned long numa_scan_offset;
+ /* Restart point for scanning and setting pte_numa */
+ unsigned long numa_scan_offset;

- /* numa_scan_seq prevents two threads setting pte_numa */
- int numa_scan_seq;
+ /* numa_scan_seq prevents two threads setting pte_numa */
+ int numa_scan_seq;
#endif
- /*
- * An operation with batched TLB flushing is going on. Anything that
- * can move process memory needs to flush the TLB when moving a
- * PROT_NONE or PROT_NUMA mapped page.
- */
- atomic_t tlb_flush_pending;
+ /*
+ * An operation with batched TLB flushing is going on. Anything
+ * that can move process memory needs to flush the TLB when
+ * moving a PROT_NONE or PROT_NUMA mapped page.
+ */
+ atomic_t tlb_flush_pending;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
- /* See flush_tlb_batched_pending() */
- bool tlb_flush_batched;
+ /* See flush_tlb_batched_pending() */
+ bool tlb_flush_batched;
#endif
- struct uprobes_state uprobes_state;
+ struct uprobes_state uprobes_state;
#ifdef CONFIG_HUGETLB_PAGE
- atomic_long_t hugetlb_usage;
+ atomic_long_t hugetlb_usage;
#endif
- struct work_struct async_put_work;
+ struct work_struct async_put_work;

#if IS_ENABLED(CONFIG_HMM)
- /* HMM needs to track a few things per mm */
- struct hmm *hmm;
+ /* HMM needs to track a few things per mm */
+ struct hmm *hmm;
#endif
-} __randomize_layout;
+ } __randomize_layout;
+
+ /*
+ * The mm_cpumask needs to be at the end of mm_struct, because it
+ * is dynamically sized based on nr_cpu_ids.
+ */
+ unsigned long cpu_bitmap[];
+};

extern struct mm_struct init_mm;

+/* Pointer magic because the dynamic array size confuses some compilers. */
static inline void mm_init_cpumask(struct mm_struct *mm)
{
-#ifdef CONFIG_CPUMASK_OFFSTACK
- mm->cpu_vm_mask_var = &mm->cpumask_allocation;
-#endif
- cpumask_clear(mm->cpu_vm_mask_var);
+ unsigned long cpu_bitmap = (unsigned long)mm;
+
+ cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+ cpumask_clear((struct cpumask *)cpu_bitmap);
}

/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
{
- return mm->cpu_vm_mask_var;
+ return (struct cpumask *)&mm->cpu_bitmap;
}

struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61b925c..5b64c1b8461e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2253,6 +2253,8 @@ static void sighand_ctor(void *data)

void __init proc_caches_init(void)
{
+ unsigned int mm_size;
+
sighand_cachep = kmem_cache_create("sighand_cache",
sizeof(struct sighand_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -2269,15 +2271,16 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
+
/*
- * FIXME! The "sizeof(struct mm_struct)" currently includes the
- * whole struct cpumask for the OFFSTACK case. We could change
- * this to *only* allocate as much of it as required by the
- * maximum number of CPU's we can ever have. The cpumask_allocation
- * is at the end of the structure, exactly for that reason.
+ * The mm_cpumask is located at the end of mm_struct, and is
+ * dynamically sized based on the maximum CPU number this system
+ * can have, taking hotplug into account (nr_cpu_ids).
*/
+ mm_size = sizeof(struct mm_struct) + cpumask_size();
+
mm_cachep = kmem_cache_create_usercopy("mm_struct",
- sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
+ mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
offsetof(struct mm_struct, saved_auxv),
sizeof_field(struct mm_struct, saved_auxv),
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f0179c9c04c2..a787a319211e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,16 @@
#define INIT_MM_CONTEXT(name)
#endif

+/*
+ * For dynamically allocated mm_structs, there is a dynamically sized cpumask
+ * at the end of the structure, the size of which depends on the maximum CPU
+ * number the system can see. That way we allocate only as much memory for
+ * mm_cpumask() as needed for the hundreds, or thousands of processes that
+ * a system typically runs.
+ *
+ * Since there is only one init_mm in the entire system, keep it simple
+ * and size this cpu_bitmask to NR_CPUS.
+ */
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
.pgd = swapper_pg_dir,
@@ -25,5 +35,6 @@ struct mm_struct init_mm = {
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
.user_ns = &init_user_ns,
+ .cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
INIT_MM_CONTEXT(init_mm)
};
--
2.14.4


2018-07-16 19:05:12

by Rik van Riel

Subject: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.

Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.

The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.
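
Sketched out, the two sides pair up like this (simplified; the real
code is in the switch_mm_irqs_off() hunk below):

  /* TLB invalidation side */
  inc_mm_tlb_gen(mm);           /* atomic increment, implies a full barrier */
  /* ... then look at each CPU's cpu_tlbstate.is_lazy to decide on IPIs */

  /* context switch side, prev == next lazy fast path */
  was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
  /* ... */
  smp_mb();                     /* pairs with the increment above */
  next_tlb_gen = atomic64_read(&next->context.tlb_gen);
  if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == next_tlb_gen)
          return;               /* TLB still up to date, nothing to flush */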

Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set
are not removed from the mm_cpumask(mm), since that would prevent the TLB
flush IPIs at page table free time from being sent to all the CPUs
that need them.

This patch reduces total CPU use in the system by about 1-2% for a
memcache workload on two socket systems, and by about 1% for a heavily
multi-process netperf between two systems.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Tested-by: Song Liu <[email protected]>
---
arch/x86/mm/tlb.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 59 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4b73fe835c95..26542cc17043 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -7,6 +7,7 @@
#include <linux/export.h>
#include <linux/cpu.h>
#include <linux/debugfs.h>
+#include <linux/gfp.h>

#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -185,6 +186,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
{
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
bool need_flush;
@@ -242,17 +244,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
next->context.ctx_id);

/*
- * We don't currently support having a real mm loaded without
- * our cpu set in mm_cpumask(). We have all the bookkeeping
- * in place to figure out whether we would need to flush
- * if our cpu were cleared in mm_cpumask(), but we don't
- * currently use it.
+ * Even in lazy TLB mode, the CPU should stay set in the
+ * mm_cpumask. The TLB shootdown code can figure out
+ * from cpu_tlbstate.is_lazy whether or not to send an IPI.
*/
if (WARN_ON_ONCE(real_prev != &init_mm &&
!cpumask_test_cpu(cpu, mm_cpumask(next))))
cpumask_set_cpu(cpu, mm_cpumask(next));

- return;
+ /*
+ * If the CPU is not in lazy TLB mode, we are just switching
+ * from one thread in a process to another thread in the same
+ * process. No TLB flush required.
+ */
+ if (!was_lazy)
+ return;
+
+ /*
+ * Read the tlb_gen to check whether a flush is needed.
+ * If the TLB is up to date, just use it.
+ * The barrier synchronizes with the tlb_gen increment in
+ * the TLB shootdown code.
+ */
+ smp_mb();
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+ if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
+ next_tlb_gen)
+ return;
+
+ /*
+ * TLB contents went out of date while we were in lazy
+ * mode. Fall through to the TLB switching code below.
+ */
+ new_asid = prev_asid;
+ need_flush = true;
} else {
u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);

@@ -454,6 +479,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* paging-structure cache to avoid speculatively reading
* garbage into our TLB. Since switching to init_mm is barely
* slower than a minimal flush, just switch to init_mm.
+ *
+ * This should be rare, with native_flush_tlb_others skipping
+ * IPIs to lazy TLB mode CPUs.
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
return;
@@ -560,6 +588,9 @@ static void flush_tlb_func_remote(void *info)
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
+ cpumask_var_t lazymask;
+ unsigned int cpu;
+
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
if (info->end == TLB_FLUSH_ALL)
trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -583,8 +614,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
* that UV should be updated so that smp_call_function_many(),
* etc, are optimal on UV.
*/
- unsigned int cpu;
-
cpu = smp_processor_id();
cpumask = uv_flush_tlb_others(cpumask, info);
if (cpumask)
@@ -592,8 +621,29 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
(void *)info, 1);
return;
}
- smp_call_function_many(cpumask, flush_tlb_func_remote,
+
+ /*
+ * A temporary cpumask is used in order to skip sending IPIs
+ * to CPUs in lazy TLB state, while keeping them in mm_cpumask(mm).
+ * If the allocation fails, simply IPI every CPU in mm_cpumask.
+ */
+ if (!alloc_cpumask_var(&lazymask, GFP_ATOMIC)) {
+ smp_call_function_many(cpumask, flush_tlb_func_remote,
+ (void *)info, 1);
+ return;
+ }
+
+ cpumask_copy(lazymask, cpumask);
+
+ for_each_cpu(cpu, lazymask) {
+ if (per_cpu(cpu_tlbstate.is_lazy, cpu))
+ cpumask_clear_cpu(cpu, lazymask);
+ }
+
+ smp_call_function_many(lazymask, flush_tlb_func_remote,
(void *)info, 1);
+
+ free_cpumask_var(lazymask);
}

/*
--
2.14.4


2018-07-16 19:05:37

by Rik van Riel

Subject: [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs

CPUs in !is_lazy have either received TLB flush IPIs earlier on during
the munmap (when the user memory was unmapped), or have context switched
and reloaded during that stage of the munmap.

Page table free TLB flushes only need to be sent to CPUs in lazy TLB
mode, whose TLB contents might not be up to date.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Tested-by: Song Liu <[email protected]>
---
arch/x86/mm/tlb.c | 43 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 26542cc17043..e4156e37aa71 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -712,15 +712,50 @@ void tlb_flush_remove_tables_local(void *arg)
}
}

+static void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm,
+ struct cpumask *lazy_cpus)
+{
+ int cpu;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ if (!per_cpu(cpu_tlbstate.is_lazy, cpu))
+ cpumask_set_cpu(cpu, lazy_cpus);
+ }
+}
+
void tlb_flush_remove_tables(struct mm_struct *mm)
{
int cpu = get_cpu();
+ cpumask_var_t lazy_cpus;
+
+ if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids) {
+ put_cpu();
+ return;
+ }
+
+ if (!zalloc_cpumask_var(&lazy_cpus, GFP_ATOMIC)) {
+ /*
+ * If the cpumask allocation fails, do a brute force flush
+ * on all the CPUs that have this mm loaded.
+ */
+ smp_call_function_many(mm_cpumask(mm),
+ tlb_flush_remove_tables_local, (void *)mm, 1);
+ put_cpu();
+ return;
+ }
+
/*
- * XXX: this really only needs to be called for CPUs in lazy TLB mode.
+ * CPUs with !is_lazy either received a TLB flush IPI while the user
+ * pages in this address range were unmapped, or have context switched
+ * and reloaded %CR3 since then.
+ *
+ * Shootdown IPIs at page table freeing time only need to be sent to
+ * CPUs that may have out of date TLB contents.
*/
- if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
- smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
-
+ mm_fill_lazy_tlb_cpu_mask(mm, lazy_cpus);
+ smp_call_function_many(lazy_cpus,
+ tlb_flush_remove_tables_local, (void *)mm, 1);
+ free_cpumask_var(lazy_cpus);
put_cpu();
}

--
2.14.4


2018-07-16 19:05:57

by Rik van Riel

Subject: [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time

Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system, when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and point into la-la land.

This problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second one is to only send the TLB shootdown at page table
freeing time.

The second should result in fewer IPIs, since operations like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Tested-by: Song Liu <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 5 +++++
arch/x86/mm/tlb.c | 27 +++++++++++++++++++++++++++
include/asm-generic/tlb.h | 10 ++++++++++
mm/memory.c | 22 ++++++++++++++--------
4 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6690cd3fc8b1..3aa3204b5dc0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -554,4 +554,9 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
native_flush_tlb_others(mask, info)
#endif

+extern void tlb_flush_remove_tables(struct mm_struct *mm);
+extern void tlb_flush_remove_tables_local(void *arg);
+
+#define HAVE_TLB_FLUSH_REMOVE_TABLES
+
#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6eb1f34c3c85..9a893673c56b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -646,6 +646,33 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
put_cpu();
}

+void tlb_flush_remove_tables_local(void *arg)
+{
+ struct mm_struct *mm = arg;
+
+ if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
+ this_cpu_read(cpu_tlbstate.is_lazy)) {
+ /*
+ * We're in lazy mode. We need to at least flush our
+ * paging-structure cache to avoid speculatively reading
+ * garbage into our TLB. Since switching to init_mm is barely
+ * slower than a minimal flush, just switch to init_mm.
+ */
+ switch_mm_irqs_off(NULL, &init_mm, NULL);
+ }
+}
+
+void tlb_flush_remove_tables(struct mm_struct *mm)
+{
+ int cpu = get_cpu();
+ /*
+ * XXX: this really only needs to be called for CPUs in lazy TLB mode.
+ */
+ if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+ smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
+
+ put_cpu();
+}

static void do_flush_tlb_all(void *info)
{
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index faddde44de8c..f0b462ddfa15 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -295,4 +295,14 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,

#define tlb_migrate_finish(mm) do {} while (0)

+/*
+ * Used to flush the TLB when page tables are removed, when lazy
+ * TLB mode may cause a CPU to retain intermediate translations
+ * pointing to about-to-be-freed page table memory.
+ */
+#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
+#define tlb_flush_remove_tables(mm) do {} while (0)
+#define tlb_flush_remove_tables_local(mm) do {} while (0)
+#endif
+
#endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/memory.c b/mm/memory.c
index 9d472e00fc2d..b4117272dc7f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -326,16 +326,20 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_

#ifdef CONFIG_HAVE_RCU_TABLE_FREE

-/*
- * See the comment near struct mmu_table_batch.
- */
-
static void tlb_remove_table_smp_sync(void *arg)
{
- /* Simply deliver the interrupt */
+ struct mm_struct __maybe_unused *mm = arg;
+ /*
+ * On most architectures this does nothing. Simply delivering the
+ * interrupt is enough to prevent races with software page table
+ * walking like that done in get_user_pages_fast.
+ *
+ * See the comment near struct mmu_table_batch.
+ */
+ tlb_flush_remove_tables_local(mm);
}

-static void tlb_remove_table_one(void *table)
+static void tlb_remove_table_one(void *table, struct mmu_gather *tlb)
{
/*
* This isn't an RCU grace period and hence the page-tables cannot be
@@ -344,7 +348,7 @@ static void tlb_remove_table_one(void *table)
* It is however sufficient for software page-table walkers that rely on
* IRQ disabling. See the comment near struct mmu_table_batch.
*/
- smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+ smp_call_function(tlb_remove_table_smp_sync, tlb->mm, 1);
__tlb_remove_table(table);
}

@@ -365,6 +369,8 @@ void tlb_table_flush(struct mmu_gather *tlb)
{
struct mmu_table_batch **batch = &tlb->batch;

+ tlb_flush_remove_tables(tlb->mm);
+
if (*batch) {
call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
*batch = NULL;
@@ -387,7 +393,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
if (*batch == NULL) {
*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
if (*batch == NULL) {
- tlb_remove_table_one(table);
+ tlb_remove_table_one(table, tlb);
return;
}
(*batch)->nr = 0;
--
2.14.4


2018-07-16 19:06:29

by Rik van Riel

Subject: [PATCH 6/7] x86,mm: always use lazy TLB mode

Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Tested-by: Song Liu <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 16 ----------------
arch/x86/mm/tlb.c | 15 +--------------
2 files changed, 1 insertion(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3aa3204b5dc0..511bf5fae8b8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -148,22 +148,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
#define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)
#endif

-static inline bool tlb_defer_switch_to_init_mm(void)
-{
- /*
- * If we have PCID, then switching to init_mm is reasonably
- * fast. If we don't have PCID, then switching to init_mm is
- * quite slow, so we try to defer it in the hopes that we can
- * avoid it entirely. The latter approach runs the risk of
- * receiving otherwise unnecessary IPIs.
- *
- * This choice is just a heuristic. The tlb code can handle this
- * function returning true or false regardless of whether we have
- * PCID.
- */
- return !static_cpu_has(X86_FEATURE_PCID);
-}
-
struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e4156e37aa71..493559cae2d5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -379,20 +379,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
return;

- if (tlb_defer_switch_to_init_mm()) {
- /*
- * There's a significant optimization that may be possible
- * here. We have accurate enough TLB flush tracking that we
- * don't need to maintain coherence of TLB per se when we're
- * lazy. We do, however, need to maintain coherence of
- * paging-structure caches. We could, in principle, leave our
- * old mm loaded and only switch to init_mm when
- * tlb_remove_page() happens.
- */
- this_cpu_write(cpu_tlbstate.is_lazy, true);
- } else {
- switch_mm(NULL, &init_mm, NULL);
- }
+ this_cpu_write(cpu_tlbstate.is_lazy, true);
}

/*
--
2.14.4


Subject: [tip:x86/mm] mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids

Commit-ID: c1a2f7f0c06454387c2cd7b93ff1491c715a8c69
Gitweb: https://git.kernel.org/tip/c1a2f7f0c06454387c2cd7b93ff1491c715a8c69
Author: Rik van Riel <[email protected]>
AuthorDate: Mon, 16 Jul 2018 15:03:31 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 17 Jul 2018 09:35:30 +0200

mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids

The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and simply have one bitmask at the end of the
mm_struct for the mm_cpumask.

This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.

The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the CPU bitmap) and the size of the cpumask.

For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.

Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
getting confused by the dynamically sized array.

Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
drivers/firmware/efi/efi.c | 1 +
include/linux/mm_types.h | 241 +++++++++++++++++++++++----------------------
kernel/fork.c | 15 +--
mm/init-mm.c | 11 +++
4 files changed, 145 insertions(+), 123 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 232f4915223b..7f0b19410a95 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -82,6 +82,7 @@ struct mm_struct efi_mm = {
.mmap_sem = __RWSEM_INITIALIZER(efi_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(efi_mm.mmlist),
+ .cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
};

static bool disable_runtime;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..efdc24dd9e97 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,176 +335,183 @@ struct core_state {

struct kioctx_table;
struct mm_struct {
- struct vm_area_struct *mmap; /* list of VMAs */
- struct rb_root mm_rb;
- u32 vmacache_seqnum; /* per-thread vmacache */
+ struct {
+ struct vm_area_struct *mmap; /* list of VMAs */
+ struct rb_root mm_rb;
+ u32 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU
- unsigned long (*get_unmapped_area) (struct file *filp,
+ unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
#endif
- unsigned long mmap_base; /* base of mmap area */
- unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
+ unsigned long mmap_base; /* base of mmap area */
+ unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
- /* Base adresses for compatible mmap() */
- unsigned long mmap_compat_base;
- unsigned long mmap_compat_legacy_base;
+ /* Base adresses for compatible mmap() */
+ unsigned long mmap_compat_base;
+ unsigned long mmap_compat_legacy_base;
#endif
- unsigned long task_size; /* size of task vm space */
- unsigned long highest_vm_end; /* highest vma end address */
- pgd_t * pgd;
-
- /**
- * @mm_users: The number of users including userspace.
- *
- * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
- * to 0 (i.e. when the task exits and there are no other temporary
- * reference holders), we also release a reference on @mm_count
- * (which may then free the &struct mm_struct if @mm_count also
- * drops to 0).
- */
- atomic_t mm_users;
-
- /**
- * @mm_count: The number of references to &struct mm_struct
- * (@mm_users count as 1).
- *
- * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
- * &struct mm_struct is freed.
- */
- atomic_t mm_count;
+ unsigned long task_size; /* size of task vm space */
+ unsigned long highest_vm_end; /* highest vma end address */
+ pgd_t * pgd;
+
+ /**
+ * @mm_users: The number of users including userspace.
+ *
+ * Use mmget()/mmget_not_zero()/mmput() to modify. When this
+ * drops to 0 (i.e. when the task exits and there are no other
+ * temporary reference holders), we also release a reference on
+ * @mm_count (which may then free the &struct mm_struct if
+ * @mm_count also drops to 0).
+ */
+ atomic_t mm_users;
+
+ /**
+ * @mm_count: The number of references to &struct mm_struct
+ * (@mm_users count as 1).
+ *
+ * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+ * &struct mm_struct is freed.
+ */
+ atomic_t mm_count;

#ifdef CONFIG_MMU
- atomic_long_t pgtables_bytes; /* PTE page table pages */
+ atomic_long_t pgtables_bytes; /* PTE page table pages */
#endif
- int map_count; /* number of VMAs */
+ int map_count; /* number of VMAs */

- spinlock_t page_table_lock; /* Protects page tables and some counters */
- struct rw_semaphore mmap_sem;
+ spinlock_t page_table_lock; /* Protects page tables and some
+ * counters
+ */
+ struct rw_semaphore mmap_sem;

- struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
- * together off init_mm.mmlist, and are protected
- * by mmlist_lock
- */
+ struct list_head mmlist; /* List of maybe swapped mm's. These
+ * are globally strung together off
+ * init_mm.mmlist, and are protected
+ * by mmlist_lock
+ */


- unsigned long hiwater_rss; /* High-watermark of RSS usage */
- unsigned long hiwater_vm; /* High-water virtual memory usage */
+ unsigned long hiwater_rss; /* High-watermark of RSS usage */
+ unsigned long hiwater_vm; /* High-water virtual memory usage */

- unsigned long total_vm; /* Total pages mapped */
- unsigned long locked_vm; /* Pages that have PG_mlocked set */
- unsigned long pinned_vm; /* Refcount permanently increased */
- unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
- unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
- unsigned long stack_vm; /* VM_STACK */
- unsigned long def_flags;
+ unsigned long total_vm; /* Total pages mapped */
+ unsigned long locked_vm; /* Pages that have PG_mlocked set */
+ unsigned long pinned_vm; /* Refcount permanently increased */
+ unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
+ unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
+ unsigned long stack_vm; /* VM_STACK */
+ unsigned long def_flags;

- spinlock_t arg_lock; /* protect the below fields */
- unsigned long start_code, end_code, start_data, end_data;
- unsigned long start_brk, brk, start_stack;
- unsigned long arg_start, arg_end, env_start, env_end;
+ spinlock_t arg_lock; /* protect the below fields */
+ unsigned long start_code, end_code, start_data, end_data;
+ unsigned long start_brk, brk, start_stack;
+ unsigned long arg_start, arg_end, env_start, env_end;

- unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+ unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

- /*
- * Special counters, in some configurations protected by the
- * page_table_lock, in other configurations by being atomic.
- */
- struct mm_rss_stat rss_stat;
-
- struct linux_binfmt *binfmt;
+ /*
+ * Special counters, in some configurations protected by the
+ * page_table_lock, in other configurations by being atomic.
+ */
+ struct mm_rss_stat rss_stat;

- cpumask_var_t cpu_vm_mask_var;
+ struct linux_binfmt *binfmt;

- /* Architecture-specific MM context */
- mm_context_t context;
+ /* Architecture-specific MM context */
+ mm_context_t context;

- unsigned long flags; /* Must use atomic bitops to access the bits */
+ unsigned long flags; /* Must use atomic bitops to access */

- struct core_state *core_state; /* coredumping support */
+ struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_MEMBARRIER
- atomic_t membarrier_state;
+ atomic_t membarrier_state;
#endif
#ifdef CONFIG_AIO
- spinlock_t ioctx_lock;
- struct kioctx_table __rcu *ioctx_table;
+ spinlock_t ioctx_lock;
+ struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
- /*
- * "owner" points to a task that is regarded as the canonical
- * user/owner of this mm. All of the following must be true in
- * order for it to be changed:
- *
- * current == mm->owner
- * current->mm != mm
- * new_owner->mm == mm
- * new_owner->alloc_lock is held
- */
- struct task_struct __rcu *owner;
+ /*
+ * "owner" points to a task that is regarded as the canonical
+ * user/owner of this mm. All of the following must be true in
+ * order for it to be changed:
+ *
+ * current == mm->owner
+ * current->mm != mm
+ * new_owner->mm == mm
+ * new_owner->alloc_lock is held
+ */
+ struct task_struct __rcu *owner;
#endif
- struct user_namespace *user_ns;
+ struct user_namespace *user_ns;

- /* store ref to file /proc/<pid>/exe symlink points to */
- struct file __rcu *exe_file;
+ /* store ref to file /proc/<pid>/exe symlink points to */
+ struct file __rcu *exe_file;
#ifdef CONFIG_MMU_NOTIFIER
- struct mmu_notifier_mm *mmu_notifier_mm;
+ struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
- pgtable_t pmd_huge_pte; /* protected by page_table_lock */
-#endif
-#ifdef CONFIG_CPUMASK_OFFSTACK
- struct cpumask cpumask_allocation;
+ pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_NUMA_BALANCING
- /*
- * numa_next_scan is the next time that the PTEs will be marked
- * pte_numa. NUMA hinting faults will gather statistics and migrate
- * pages to new nodes if necessary.
- */
- unsigned long numa_next_scan;
+ /*
+ * numa_next_scan is the next time that the PTEs will be marked
+ * pte_numa. NUMA hinting faults will gather statistics and
+ * migrate pages to new nodes if necessary.
+ */
+ unsigned long numa_next_scan;

- /* Restart point for scanning and setting pte_numa */
- unsigned long numa_scan_offset;
+ /* Restart point for scanning and setting pte_numa */
+ unsigned long numa_scan_offset;

- /* numa_scan_seq prevents two threads setting pte_numa */
- int numa_scan_seq;
+ /* numa_scan_seq prevents two threads setting pte_numa */
+ int numa_scan_seq;
#endif
- /*
- * An operation with batched TLB flushing is going on. Anything that
- * can move process memory needs to flush the TLB when moving a
- * PROT_NONE or PROT_NUMA mapped page.
- */
- atomic_t tlb_flush_pending;
+ /*
+ * An operation with batched TLB flushing is going on. Anything
+ * that can move process memory needs to flush the TLB when
+ * moving a PROT_NONE or PROT_NUMA mapped page.
+ */
+ atomic_t tlb_flush_pending;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
- /* See flush_tlb_batched_pending() */
- bool tlb_flush_batched;
+ /* See flush_tlb_batched_pending() */
+ bool tlb_flush_batched;
#endif
- struct uprobes_state uprobes_state;
+ struct uprobes_state uprobes_state;
#ifdef CONFIG_HUGETLB_PAGE
- atomic_long_t hugetlb_usage;
+ atomic_long_t hugetlb_usage;
#endif
- struct work_struct async_put_work;
+ struct work_struct async_put_work;

#if IS_ENABLED(CONFIG_HMM)
- /* HMM needs to track a few things per mm */
- struct hmm *hmm;
+ /* HMM needs to track a few things per mm */
+ struct hmm *hmm;
#endif
-} __randomize_layout;
+ } __randomize_layout;
+
+ /*
+ * The mm_cpumask needs to be at the end of mm_struct, because it
+ * is dynamically sized based on nr_cpu_ids.
+ */
+ unsigned long cpu_bitmap[];
+};

extern struct mm_struct init_mm;

+/* Pointer magic because the dynamic array size confuses some compilers. */
static inline void mm_init_cpumask(struct mm_struct *mm)
{
-#ifdef CONFIG_CPUMASK_OFFSTACK
- mm->cpu_vm_mask_var = &mm->cpumask_allocation;
-#endif
- cpumask_clear(mm->cpu_vm_mask_var);
+ unsigned long cpu_bitmap = (unsigned long)mm;
+
+ cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+ cpumask_clear((struct cpumask *)cpu_bitmap);
}

/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
{
- return mm->cpu_vm_mask_var;
+ return (struct cpumask *)&mm->cpu_bitmap;
}

struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61b925c..5b64c1b8461e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2253,6 +2253,8 @@ static void sighand_ctor(void *data)

void __init proc_caches_init(void)
{
+ unsigned int mm_size;
+
sighand_cachep = kmem_cache_create("sighand_cache",
sizeof(struct sighand_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -2269,15 +2271,16 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
+
/*
- * FIXME! The "sizeof(struct mm_struct)" currently includes the
- * whole struct cpumask for the OFFSTACK case. We could change
- * this to *only* allocate as much of it as required by the
- * maximum number of CPU's we can ever have. The cpumask_allocation
- * is at the end of the structure, exactly for that reason.
+ * The mm_cpumask is located at the end of mm_struct, and is
+ * dynamically sized based on the maximum CPU number this system
+ * can have, taking hotplug into account (nr_cpu_ids).
*/
+ mm_size = sizeof(struct mm_struct) + cpumask_size();
+
mm_cachep = kmem_cache_create_usercopy("mm_struct",
- sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
+ mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
offsetof(struct mm_struct, saved_auxv),
sizeof_field(struct mm_struct, saved_auxv),
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f0179c9c04c2..a787a319211e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,16 @@
#define INIT_MM_CONTEXT(name)
#endif

+/*
+ * For dynamically allocated mm_structs, there is a dynamically sized cpumask
+ * at the end of the structure, the size of which depends on the maximum CPU
+ * number the system can see. That way we allocate only as much memory for
+ * mm_cpumask() as needed for the hundreds, or thousands of processes that
+ * a system typically runs.
+ *
+ * Since there is only one init_mm in the entire system, keep it simple
+ * and size this cpu_bitmask to NR_CPUS.
+ */
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
.pgd = swapper_pg_dir,
@@ -25,5 +35,6 @@ struct mm_struct init_mm = {
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
.user_ns = &init_user_ns,
+ .cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
INIT_MM_CONTEXT(init_mm)
};

Subject: [tip:x86/mm] x86/mm/tlb: Leave lazy TLB mode at page table free time

Commit-ID: 2ff6ddf19c0ec40633bd14d8fe28a289816bd98d
Gitweb: https://git.kernel.org/tip/2ff6ddf19c0ec40633bd14d8fe28a289816bd98d
Author: Rik van Riel <[email protected]>
AuthorDate: Mon, 16 Jul 2018 15:03:32 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 17 Jul 2018 09:35:31 +0200

x86/mm/tlb: Leave lazy TLB mode at page table free time

Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system, when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and point into la-la land.

This problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second one is to only send the TLB shootdown at page table
freeing time.

The second should result in fewer IPIs, since operations like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages.

Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 5 +++++
arch/x86/mm/tlb.c | 27 +++++++++++++++++++++++++++
include/asm-generic/tlb.h | 10 ++++++++++
mm/memory.c | 22 ++++++++++++++--------
4 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6690cd3fc8b1..3aa3204b5dc0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -554,4 +554,9 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
native_flush_tlb_others(mask, info)
#endif

+extern void tlb_flush_remove_tables(struct mm_struct *mm);
+extern void tlb_flush_remove_tables_local(void *arg);
+
+#define HAVE_TLB_FLUSH_REMOVE_TABLES
+
#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6eb1f34c3c85..9a893673c56b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -646,6 +646,33 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
put_cpu();
}

+void tlb_flush_remove_tables_local(void *arg)
+{
+ struct mm_struct *mm = arg;
+
+ if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
+ this_cpu_read(cpu_tlbstate.is_lazy)) {
+ /*
+ * We're in lazy mode. We need to at least flush our
+ * paging-structure cache to avoid speculatively reading
+ * garbage into our TLB. Since switching to init_mm is barely
+ * slower than a minimal flush, just switch to init_mm.
+ */
+ switch_mm_irqs_off(NULL, &init_mm, NULL);
+ }
+}
+
+void tlb_flush_remove_tables(struct mm_struct *mm)
+{
+ int cpu = get_cpu();
+ /*
+ * XXX: this really only needs to be called for CPUs in lazy TLB mode.
+ */
+ if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+ smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
+
+ put_cpu();
+}

static void do_flush_tlb_all(void *info)
{
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 3063125197ad..e811ef7b8350 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -303,4 +303,14 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,

#define tlb_migrate_finish(mm) do {} while (0)

+/*
+ * Used to flush the TLB when page tables are removed, when lazy
+ * TLB mode may cause a CPU to retain intermediate translations
+ * pointing to about-to-be-freed page table memory.
+ */
+#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
+#define tlb_flush_remove_tables(mm) do {} while (0)
+#define tlb_flush_remove_tables_local(mm) do {} while (0)
+#endif
+
#endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..18355e0b971a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -326,16 +326,20 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_

#ifdef CONFIG_HAVE_RCU_TABLE_FREE

-/*
- * See the comment near struct mmu_table_batch.
- */
-
static void tlb_remove_table_smp_sync(void *arg)
{
- /* Simply deliver the interrupt */
+ struct mm_struct __maybe_unused *mm = arg;
+ /*
+ * On most architectures this does nothing. Simply delivering the
+ * interrupt is enough to prevent races with software page table
+ * walking like that done in get_user_pages_fast.
+ *
+ * See the comment near struct mmu_table_batch.
+ */
+ tlb_flush_remove_tables_local(mm);
}

-static void tlb_remove_table_one(void *table)
+static void tlb_remove_table_one(void *table, struct mmu_gather *tlb)
{
/*
* This isn't an RCU grace period and hence the page-tables cannot be
@@ -344,7 +348,7 @@ static void tlb_remove_table_one(void *table)
* It is however sufficient for software page-table walkers that rely on
* IRQ disabling. See the comment near struct mmu_table_batch.
*/
- smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+ smp_call_function(tlb_remove_table_smp_sync, tlb->mm, 1);
__tlb_remove_table(table);
}

@@ -365,6 +369,8 @@ void tlb_table_flush(struct mmu_gather *tlb)
{
struct mmu_table_batch **batch = &tlb->batch;

+ tlb_flush_remove_tables(tlb->mm);
+
if (*batch) {
call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
*batch = NULL;
@@ -387,7 +393,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
if (*batch == NULL) {
*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
if (*batch == NULL) {
- tlb_remove_table_one(table);
+ tlb_remove_table_one(table, tlb);
return;
}
(*batch)->nr = 0;

Subject: [tip:x86/mm] x86/mm/tlb: Restructure switch_mm_irqs_off()

Commit-ID: 61d0beb5796ab11f7f3bf38cb2eccc6579aaa70b
Gitweb: https://git.kernel.org/tip/61d0beb5796ab11f7f3bf38cb2eccc6579aaa70b
Author: Rik van Riel <[email protected]>
AuthorDate: Mon, 16 Jul 2018 15:03:33 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 17 Jul 2018 09:35:32 +0200

x86/mm/tlb: Restructure switch_mm_irqs_off()

Move some code that will be needed for the lazy -> !lazy state
transition when a lazy TLB CPU has gotten out of date.

No functional changes, since the if (real_prev == next) branch
always returns.

Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/tlb.c | 60 +++++++++++++++++++++++++++----------------------------
1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9a893673c56b..4b73fe835c95 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -187,6 +187,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
+ bool need_flush;
+ u16 new_asid;

/*
* NB: The scheduler will call us with prev == next when switching
@@ -252,8 +254,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,

return;
} else {
- u16 new_asid;
- bool need_flush;
u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);

/*
@@ -297,41 +297,41 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+ }

- if (need_flush) {
- this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
- load_new_mm_cr3(next->pgd, new_asid, true);
-
- /*
- * NB: This gets called via leave_mm() in the idle path
- * where RCU functions differently. Tracing normally
- * uses RCU, so we need to use the _rcuidle variant.
- *
- * (There is no good reason for this. The idle code should
- * be rearranged to call this before rcu_idle_enter().)
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
- } else {
- /* The new ASID is already up to date. */
- load_new_mm_cr3(next->pgd, new_asid, false);
-
- /* See above wrt _rcuidle. */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
- }
+ if (need_flush) {
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
+ load_new_mm_cr3(next->pgd, new_asid, true);

/*
- * Record last user mm's context id, so we can avoid
- * flushing branch buffer with IBPB if we switch back
- * to the same user.
+ * NB: This gets called via leave_mm() in the idle path
+ * where RCU functions differently. Tracing normally
+ * uses RCU, so we need to use the _rcuidle variant.
+ *
+ * (There is no good reason for this. The idle code should
+ * be rearranged to call this before rcu_idle_enter().)
*/
- if (next != &init_mm)
- this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ } else {
+ /* The new ASID is already up to date. */
+ load_new_mm_cr3(next->pgd, new_asid, false);

- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+ /* See above wrt _rcuidle. */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
}

+ /*
+ * Record last user mm's context id, so we can avoid
+ * flushing branch buffer with IBPB if we switch back
+ * to the same user.
+ */
+ if (next != &init_mm)
+ this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
+
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+
load_mm_cr4(next);
switch_ldt(real_prev, next);
}

Subject: [tip:x86/mm] x86/mm/tlb: Make lazy TLB mode lazier

Commit-ID: ac0315896970d8589291e9d8a1569fc65967b7f1
Gitweb: https://git.kernel.org/tip/ac0315896970d8589291e9d8a1569fc65967b7f1
Author: Rik van Riel <[email protected]>
AuthorDate: Mon, 16 Jul 2018 15:03:34 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 17 Jul 2018 09:35:33 +0200

x86/mm/tlb: Make lazy TLB mode lazier

Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.

Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.

The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.

Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set
are not removed from the mm_cpumask(mm), since that would prevent the TLB
flush IPIs at page table free time from being sent to all the CPUs
that need them.

This patch reduces total CPU use in the system by about 1-2% for a
memcache workload on two socket systems, and by about 1% for a heavily
multi-process netperf between two systems.

Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/tlb.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 59 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4b73fe835c95..26542cc17043 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -7,6 +7,7 @@
#include <linux/export.h>
#include <linux/cpu.h>
#include <linux/debugfs.h>
+#include <linux/gfp.h>

#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -185,6 +186,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
{
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
bool need_flush;
@@ -242,17 +244,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
next->context.ctx_id);

/*
- * We don't currently support having a real mm loaded without
- * our cpu set in mm_cpumask(). We have all the bookkeeping
- * in place to figure out whether we would need to flush
- * if our cpu were cleared in mm_cpumask(), but we don't
- * currently use it.
+ * Even in lazy TLB mode, the CPU should stay set in the
+ * mm_cpumask. The TLB shootdown code can figure out from
+ * from cpu_tlbstate.is_lazy whether or not to send an IPI.
*/
if (WARN_ON_ONCE(real_prev != &init_mm &&
!cpumask_test_cpu(cpu, mm_cpumask(next))))
cpumask_set_cpu(cpu, mm_cpumask(next));

- return;
+ /*
+ * If the CPU is not in lazy TLB mode, we are just switching
+ * from one thread in a process to another thread in the same
+ * process. No TLB flush required.
+ */
+ if (!was_lazy)
+ return;
+
+ /*
+ * Read the tlb_gen to check whether a flush is needed.
+ * If the TLB is up to date, just use it.
+ * The barrier synchronizes with the tlb_gen increment in
+ * the TLB shootdown code.
+ */
+ smp_mb();
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+ if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
+ next_tlb_gen)
+ return;
+
+ /*
+ * TLB contents went out of date while we were in lazy
+ * mode. Fall through to the TLB switching code below.
+ */
+ new_asid = prev_asid;
+ need_flush = true;
} else {
u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);

@@ -454,6 +479,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* paging-structure cache to avoid speculatively reading
* garbage into our TLB. Since switching to init_mm is barely
* slower than a minimal flush, just switch to init_mm.
+ *
+ * This should be rare, with native_flush_tlb_others skipping
+ * IPIs to lazy TLB mode CPUs.
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
return;
@@ -560,6 +588,9 @@ static void flush_tlb_func_remote(void *info)
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
+ cpumask_var_t lazymask;
+ unsigned int cpu;
+
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
if (info->end == TLB_FLUSH_ALL)
trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -583,8 +614,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
* that UV should be updated so that smp_call_function_many(),
* etc, are optimal on UV.
*/
- unsigned int cpu;
-
cpu = smp_processor_id();
cpumask = uv_flush_tlb_others(cpumask, info);
if (cpumask)
@@ -592,8 +621,29 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
(void *)info, 1);
return;
}
- smp_call_function_many(cpumask, flush_tlb_func_remote,
+
+ /*
+ * A temporary cpumask is used in order to skip sending IPIs
+ * to CPUs in lazy TLB state, while keeping them in mm_cpumask(mm).
+ * If the allocation fails, simply IPI every CPU in mm_cpumask.
+ */
+ if (!alloc_cpumask_var(&lazymask, GFP_ATOMIC)) {
+ smp_call_function_many(cpumask, flush_tlb_func_remote,
+ (void *)info, 1);
+ return;
+ }
+
+ cpumask_copy(lazymask, cpumask);
+
+ for_each_cpu(cpu, lazymask) {
+ if (per_cpu(cpu_tlbstate.is_lazy, cpu))
+ cpumask_clear_cpu(cpu, lazymask);
+ }
+
+ smp_call_function_many(lazymask, flush_tlb_func_remote,
(void *)info, 1);
+
+ free_cpumask_var(lazymask);
}

/*

Subject: [tip:x86/mm] x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs

Commit-ID: 64482aafe55fc7e84d0741c356f8176ee7bde357
Gitweb: https://git.kernel.org/tip/64482aafe55fc7e84d0741c356f8176ee7bde357
Author: Rik van Riel <[email protected]>
AuthorDate: Mon, 16 Jul 2018 15:03:35 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 17 Jul 2018 09:35:33 +0200

x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs

CPUs with !is_lazy have either received TLB flush IPIs earlier on during
the munmap (when the user memory was unmapped), or have context switched
and reloaded %CR3 during that stage of the munmap.

Page table free TLB flushes only need to be sent to CPUs in lazy TLB
mode, whose TLB contents might not be up to date.

Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/tlb.c | 43 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 26542cc17043..e4156e37aa71 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -712,15 +712,50 @@ void tlb_flush_remove_tables_local(void *arg)
}
}

+static void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm,
+ struct cpumask *lazy_cpus)
+{
+ int cpu;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ if (!per_cpu(cpu_tlbstate.is_lazy, cpu))
+ cpumask_set_cpu(cpu, lazy_cpus);
+ }
+}
+
void tlb_flush_remove_tables(struct mm_struct *mm)
{
int cpu = get_cpu();
+ cpumask_var_t lazy_cpus;
+
+ if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids) {
+ put_cpu();
+ return;
+ }
+
+ if (!zalloc_cpumask_var(&lazy_cpus, GFP_ATOMIC)) {
+ /*
+ * If the cpumask allocation fails, do a brute force flush
+ * on all the CPUs that have this mm loaded.
+ */
+ smp_call_function_many(mm_cpumask(mm),
+ tlb_flush_remove_tables_local, (void *)mm, 1);
+ put_cpu();
+ return;
+ }
+
/*
- * XXX: this really only needs to be called for CPUs in lazy TLB mode.
+ * CPUs with !is_lazy either received a TLB flush IPI while the user
+ * pages in this address range were unmapped, or have context switched
+ * and reloaded %CR3 since then.
+ *
+ * Shootdown IPIs at page table freeing time only need to be sent to
+ * CPUs that may have out of date TLB contents.
*/
- if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
- smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
-
+ mm_fill_lazy_tlb_cpu_mask(mm, lazy_cpus);
+ smp_call_function_many(lazy_cpus,
+ tlb_flush_remove_tables_local, (void *)mm, 1);
+ free_cpumask_var(lazy_cpus);
put_cpu();
}


Subject: [tip:x86/mm] x86/mm/tlb: Always use lazy TLB mode

Commit-ID: 95b0e6357d3e4e05349668940d7ff8f3b7e7e11e
Gitweb: https://git.kernel.org/tip/95b0e6357d3e4e05349668940d7ff8f3b7e7e11e
Author: Rik van Riel <[email protected]>
AuthorDate: Mon, 16 Jul 2018 15:03:36 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 17 Jul 2018 09:35:34 +0200

x86/mm/tlb: Always use lazy TLB mode

Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.

Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 16 ----------------
arch/x86/mm/tlb.c | 15 +--------------
2 files changed, 1 insertion(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3aa3204b5dc0..511bf5fae8b8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -148,22 +148,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
#define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)
#endif

-static inline bool tlb_defer_switch_to_init_mm(void)
-{
- /*
- * If we have PCID, then switching to init_mm is reasonably
- * fast. If we don't have PCID, then switching to init_mm is
- * quite slow, so we try to defer it in the hopes that we can
- * avoid it entirely. The latter approach runs the risk of
- * receiving otherwise unnecessary IPIs.
- *
- * This choice is just a heuristic. The tlb code can handle this
- * function returning true or false regardless of whether we have
- * PCID.
- */
- return !static_cpu_has(X86_FEATURE_PCID);
-}
-
struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e4156e37aa71..493559cae2d5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -379,20 +379,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
return;

- if (tlb_defer_switch_to_init_mm()) {
- /*
- * There's a significant optimization that may be possible
- * here. We have accurate enough TLB flush tracking that we
- * don't need to maintain coherence of TLB per se when we're
- * lazy. We do, however, need to maintain coherence of
- * paging-structure caches. We could, in principle, leave our
- * old mm loaded and only switch to init_mm when
- * tlb_remove_page() happens.
- */
- this_cpu_write(cpu_tlbstate.is_lazy, true);
- } else {
- switch_mm(NULL, &init_mm, NULL);
- }
+ this_cpu_write(cpu_tlbstate.is_lazy, true);
}

/*

Subject: [tip:x86/mm] x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()

Commit-ID: e9d8c61557687b7126101e9550bdf243223f0d8f
Gitweb: https://git.kernel.org/tip/e9d8c61557687b7126101e9550bdf243223f0d8f
Author: Rik van Riel <[email protected]>
AuthorDate: Mon, 16 Jul 2018 15:03:37 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 17 Jul 2018 09:35:34 +0200

x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()

Song Liu noticed switch_mm_irqs_off() taking a lot of CPU time in recent
kernels, using 1.8% of a 48 CPU system during a netperf to localhost run.
Digging into the profile, we noticed that cpumask_clear_cpu and
cpumask_set_cpu together take about half of the CPU time taken by
switch_mm_irqs_off().

However, the CPUs running netperf end up switching back and forth
between netperf and the idle task, which does not require changes
to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
the most heavily contended one in the system.

Simply skipping changes to mm_cpumask(&init_mm) reduces overhead.

Reported-and-tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/tlb.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 493559cae2d5..f086195f644c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -310,15 +310,22 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
sync_current_stack_to_mm(next);
}

- /* Stop remote flushes for the previous mm */
- VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
- real_prev != &init_mm);
- cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+ /*
+ * Stop remote flushes for the previous mm.
+ * Skip kernel threads; we never send init_mm TLB flushing IPIs,
+ * but the bitmap manipulation can cause cache line contention.
+ */
+ if (real_prev != &init_mm) {
+ VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu,
+ mm_cpumask(real_prev)));
+ cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+ }

/*
* Start remote flushes and then read tlb_gen.
*/
- cpumask_set_cpu(cpu, mm_cpumask(next));
+ if (next != &init_mm)
+ cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);

choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);

2018-07-17 11:34:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm/tlb: Make lazy TLB mode lazier

On Tue, Jul 17, 2018 at 02:35:08AM -0700, tip-bot for Rik van Riel wrote:
> @@ -242,17 +244,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> next->context.ctx_id);
>
> /*
> + * Even in lazy TLB mode, the CPU should stay set in the
> + * mm_cpumask. The TLB shootdown code can figure out from
> + * from cpu_tlbstate.is_lazy whether or not to send an IPI.
> */
> if (WARN_ON_ONCE(real_prev != &init_mm &&
> !cpumask_test_cpu(cpu, mm_cpumask(next))))
> cpumask_set_cpu(cpu, mm_cpumask(next));
>
> + /*
> + * If the CPU is not in lazy TLB mode, we are just switching
> + * from one thread in a process to another thread in the same
> + * process. No TLB flush required.
> + */
> + if (!was_lazy)
> + return;
> +
> + /*
> + * Read the tlb_gen to check whether a flush is needed.
> + * If the TLB is up to date, just use it.
> + * The barrier synchronizes with the tlb_gen increment in
> + * the TLB shootdown code.
> + */
> + smp_mb();

What exactly is this smp_mb() ordering? The above comment is
insufficient. Is it the cpumask_set_cpu() vs the atomic64_read() ?

If so, should this not be smp_mb__after_atomic() (iow a NO-OP on x86)

If it is not so, please fix the comment to explain things.

> + next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> + if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> + next_tlb_gen)
> + return;
> +
> + /*
> + * TLB contents went out of date while we were in lazy
> + * mode. Fall through to the TLB switching code below.
> + */
> + new_asid = prev_asid;
> + need_flush = true;

2018-07-17 11:40:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs

On Tue, Jul 17, 2018 at 02:35:41AM -0700, tip-bot for Rik van Riel wrote:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 26542cc17043..e4156e37aa71 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -712,15 +712,50 @@ void tlb_flush_remove_tables_local(void *arg)
> }
> }
>
> +static void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm,
> + struct cpumask *lazy_cpus)
> +{
> + int cpu;
> +
> + for_each_cpu(cpu, mm_cpumask(mm)) {
> + if (!per_cpu(cpu_tlbstate.is_lazy, cpu))
> + cpumask_set_cpu(cpu, lazy_cpus);

That really wants to be __cpumask_set_cpu(). Using LOCK prefix
instructions to set local bits is kinda pointless and expensive.

> + }
> +}
> +
> void tlb_flush_remove_tables(struct mm_struct *mm)
> {
> int cpu = get_cpu();
> + cpumask_var_t lazy_cpus;
> +
> + if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids) {
> + put_cpu();
> + return;
> + }
> +
> + if (!zalloc_cpumask_var(&lazy_cpus, GFP_ATOMIC)) {
> + /*
> + * If the cpumask allocation fails, do a brute force flush
> + * on all the CPUs that have this mm loaded.
> + */
> + smp_call_function_many(mm_cpumask(mm),
> + tlb_flush_remove_tables_local, (void *)mm, 1);
> + put_cpu();
> + return;
> + }
> +
> /*
> - * XXX: this really only needs to be called for CPUs in lazy TLB mode.
> + * CPUs with !is_lazy either received a TLB flush IPI while the user
> + * pages in this address range were unmapped, or have context switched
> + * and reloaded %CR3 since then.
> + *
> + * Shootdown IPIs at page table freeing time only need to be sent to
> + * CPUs that may have out of date TLB contents.
> */
> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> - smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
> -
> + mm_fill_lazy_tlb_cpu_mask(mm, lazy_cpus);
> + smp_call_function_many(lazy_cpus,
> + tlb_flush_remove_tables_local, (void *)mm, 1);
> + free_cpumask_var(lazy_cpus);
> put_cpu();
> }

Also, was there a reason to re-implement on_each_cpu_cond() ? (which btw
also wants that __cpumask_set_bit fix).
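
For reference, a rough sketch of what using on_each_cpu_cond() here could
look like, assuming the 4.18-era signature (cond_func, func, info, wait,
gfp_flags); the tlb_is_lazy_cpu() helper is hypothetical and follows the
commit message's intent of only IPIing CPUs that may still hold stale
paging-structure caches:

static bool tlb_is_lazy_cpu(int cpu, void *data)
{
	struct mm_struct *mm = data;

	/* Only consider CPUs that have this mm loaded and are lazy. */
	return cpumask_test_cpu(cpu, mm_cpumask(mm)) &&
	       per_cpu(cpu_tlbstate.is_lazy, cpu);
}

void tlb_flush_remove_tables(struct mm_struct *mm)
{
	/*
	 * on_each_cpu_cond() allocates its own cpumask and falls back to
	 * per-CPU calls when that allocation fails, so the open-coded
	 * zalloc_cpumask_var() fallback above would no longer be needed.
	 */
	on_each_cpu_cond(tlb_is_lazy_cpu, tlb_flush_remove_tables_local,
			 mm, true, GFP_ATOMIC);
}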

2018-07-17 11:48:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm/tlb: Leave lazy TLB mode at page table free time

On Tue, Jul 17, 2018 at 02:34:07AM -0700, tip-bot for Rik van Riel wrote:
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 3063125197ad..e811ef7b8350 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -303,4 +303,14 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
>
> #define tlb_migrate_finish(mm) do {} while (0)
>
> +/*
> + * Used to flush the TLB when page tables are removed, when lazy
> + * TLB mode may cause a CPU to retain intermediate translations
> + * pointing to about-to-be-freed page table memory.
> + */
> +#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
> +#define tlb_flush_remove_tables(mm) do {} while (0)
> +#define tlb_flush_remove_tables_local(mm) do {} while (0)
> +#endif

Is there a reason these are not inline functions, which gets us type
checking and the like?
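
A minimal sketch of the inline-stub alternative being asked about, keeping
the same no-op semantics but gaining argument type checking (assuming the
generic header keeps the HAVE_TLB_FLUSH_REMOVE_TABLES guard):

#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
static inline void tlb_flush_remove_tables(struct mm_struct *mm)
{
}

static inline void tlb_flush_remove_tables_local(void *arg)
{
}
#endif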

2018-07-17 20:06:18

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Mon, Jul 16, 2018 at 12:03 PM, Rik van Riel <[email protected]> wrote:
> Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
> when all it really needs to do is reload %CR3 at the next context switch,
> assuming no page table pages got freed.
>
> Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
> which checks whether .tlb_gen changed, and the TLB invalidation code, which
> increments .tlb_gen whenever page table entries get invalidated.
>
> The atomic increment in inc_mm_tlb_gen is its own barrier; the context
> switch code adds an explicit barrier between reading tlbstate.is_lazy and
> next->context.tlb_gen.
>
> Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set
> are not removed from the mm_cpumask(mm), since that would prevent the TLB
> flush IPIs at page table free time from being sent to all the CPUs
> that need them.
>
> This patch reduces total CPU use in the system by about 1-2% for a
> memcache workload on two socket systems, and by about 1% for a heavily
> multi-process netperf between two systems.
>

I'm not 100% certain I'm replying to the right email, and I haven't
gotten the tip-bot notification at all, but:

I think you've introduced a minor-ish performance regression due to
changing the old (admittedly terribly documented) control flow a bit.
Before, if real_prev == next, we would skip:

load_mm_cr4(next);
switch_ldt(real_prev, next);

Now we don't any more. I think you should reinstate that
optimization. It's probably as simple as wrapping them in an if
(real_priv != next) with a comment like /* Remote changes that would
require a cr4 or ldt reload will unconditionally send an IPI even to
lazy CPUs. So, if we aren't changing our mm, we don't need to refresh
cr4 or the ldt */
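
As a rough, untested sketch of that suggestion, using the names from the
patches above, the tail of switch_mm_irqs_off() would become something
like:

	/*
	 * Remote changes that would require a CR4 or LDT reload send an
	 * IPI even to lazy CPUs, so if we are not switching to a
	 * different mm there is nothing to refresh here.
	 */
	if (real_prev != next) {
		load_mm_cr4(next);
		switch_ldt(real_prev, next);
	}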

Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to
fix that part or should I?

--Andy

2018-07-17 21:30:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Tue, Jul 17, 2018 at 1:16 PM, Rik van Riel <[email protected]> wrote:
> Can I skip both the cr4 and let switches when the TLB contents
> are no longer valid and got reloaded?
>
> If the TLB contents are still valid, either because we never went
> into lazy TLB mode, or because no invalidates happened while
> we were lazy, we immediately return.
>
> The cr4 and ldt reloads only happen if the TLB was invalidated
> while we were in lazy TLB mode.

Yes, since the only events that would change the LDT or the required
CR4 value will unconditionally broadcast to every CPU in mm_cpumask
regardless of whether they're lazy. The interesting case is that you
go lazy, you miss an invalidation IPI because you were lazy, then you
go unlazy, notice the tlb_gen change, and flush. If this happens, you
know that you only missed a page table update and not an LDT update or
a CR4 update, because the latter would have sent the IPI even though
you were lazy. So you should skip the CR4 and LDT updates.

I suppose a different approach would be to fix the issue below and to
try to track when the LDT actually needs reloading. But that latter
part seems a bit complicated for minimal gain.

(Do you believe me? If not, please argue back!)

>> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to
>> fix that part or should I?
>
> I would be happy to send in a patch for this, and one for
> the above optimization you pointed out.
>

Yes please!

2018-07-17 22:06:46

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier



> On Jul 17, 2018, at 5:29 PM, Andy Lutomirski <[email protected]> wrote:
>
> On Tue, Jul 17, 2018 at 1:16 PM, Rik van Riel <[email protected]> wrote:
>> Can I skip both the cr4 and let switches when the TLB contents
>> are no longer valid and got reloaded?
>>
>> If the TLB contents are still valid, either because we never went
>> into lazy TLB mode, or because no invalidates happened while
>> we were lazy, we immediately return.
>>
>> The cr4 and ldt reloads only happen if the TLB was invalidated
>> while we were in lazy TLB mode.
>
> Yes, since the only events that would change the LDT or the required
> CR4 value will unconditionally broadcast to every CPU in mm_cpumask
> regardless of whether they're lazy. The interesting case is that you
> go lazy, you miss an invalidation IPI because you were lazy, then you
> go unlazy, notice the tlb_gen change, and flush. If this happens, you
> know that you only missed a page table update and not an LDT update or
> a CR4 update, because the latter would have sent the IPI even though
> you were lazy. So you should skip the CR4 and LDT updates.
>
> I suppose a different approach would be to fix the issue below and to
> try to track when the LDT actually needs reloading. But that latter
> part seems a bit complicated for minimal gain.
>
> (Do you believe me? If not, please argue back!)
>
I believe you :)

>>> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to
>>> fix that part or should I?
>>
>> I would be happy to send in a patch for this, and one for
>> the above optimization you pointed out.
>>
>
> Yes please!
>
There is a third optimization left to do. Currently every time
we switch into lazy tlb mode, we take a refcount on the mm,
even when switching from one kernel thread to another, or
when repeatedly switching between the same mm and kernel
threads.

We could keep that refcount (on a per cpu basis) from the time
we first switch to that mm in lazy tlb mode, to when we switch
the CPU to a different mm.

That would allow us to not bounce the cache line with the
mm_struct reference count on every lazy TLB context switch.

Does that seem like a reasonable optimization?

Am I overlooking anything?

I'll try to get all three optimizations working, and will run them
through some testing here before posting upstream.


2018-07-17 22:28:50

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier



> On Jul 17, 2018, at 12:05 PM, Rik van Riel <[email protected]> wrote:
>
>
>
>> On Jul 17, 2018, at 5:29 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> On Tue, Jul 17, 2018 at 1:16 PM, Rik van Riel <[email protected]> wrote:
>>> Can I skip both the cr4 and let switches when the TLB contents
>>> are no longer valid and got reloaded?
>>>
>>> If the TLB contents are still valid, either because we never went
>>> into lazy TLB mode, or because no invalidates happened while
>>> we were lazy, we immediately return.
>>>
>>> The cr4 and ldt reloads only happen if the TLB was invalidated
>>> while we were in lazy TLB mode.
>>
>> Yes, since the only events that would change the LDT or the required
>> CR4 value will unconditionally broadcast to every CPU in mm_cpumask
>> regardless of whether they're lazy. The interesting case is that you
>> go lazy, you miss an invalidation IPI because you were lazy, then you
>> go unlazy, notice the tlb_gen change, and flush. If this happens, you
>> know that you only missed a page table update and not an LDT update or
>> a CR4 update, because the latter would have sent the IPI even though
>> you were lazy. So you should skip the CR4 and LDT updates.
>>
>> I suppose a different approach would be to fix the issue below and to
>> try to track when the LDT actually needs reloading. But that latter
>> part seems a bit complicated for minimal gain.
>>
>> (Do you believe me? If not, please argue back!)
>>
> I believe you :)
>
>>>> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to
>>>> fix that part or should I?
>>>
>>> I would be happy to send in a patch for this, and one for
>>> the above optimization you pointed out.
>>>
>>
>> Yes please!
>>
> There is a third optimization left to do. Currently every time
> we switch into lazy tlb mode, we take a refcount on the mm,
> even when switching from one kernel thread to another, or
> when repeatedly switching between the same mm and kernel
> threads.
>
> We could keep that refcount (on a per cpu basis) from the time
> we first switch to that mm in lazy tlb mode, to when we switch
> the CPU to a different mm.
>
> That would allow us to not bounce the cache line with the
> mm_struct reference count on every lazy TLB context switch.
>
> Does that seem like a reasonable optimization?

Are you referring to the core sched code that deals with mm_count and active_mm? If so, last time I looked at it, I convinced myself that it was totally useless, at least on x86. I think my reasoning was that, when mm_users went to zero, we already waited for RCU before tearing down page tables.

Things may have changed, but I strongly suspect that it should be possible for at least x86 to opt out of mm_count and maybe even active_mm entirely. If nothing else, you’re shooting the mm out of CR3 on all CPUs whenever the pagetables get freed, and more or less the same logic should be sufficient so that, whenever mm_users hits zero, we can synchronously or via RCU callback kill the mm entirely.

Want to take a look at that?

>
> Am I overlooking anything?
>
> I'll try to get all three optimizations working, and will run them
> through some testing here before posting upstream.
>

2018-07-18 15:35:19

by Rik van Riel

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm/tlb: Make lazy TLB mode lazier



> On Jul 17, 2018, at 7:33 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Jul 17, 2018 at 02:35:08AM -0700, tip-bot for Rik van Riel wrote:
>> @@ -242,17 +244,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>> next->context.ctx_id);
>>
>> /*
>> + * Even in lazy TLB mode, the CPU should stay set in the
>> + * mm_cpumask. The TLB shootdown code can figure out from
>> + * from cpu_tlbstate.is_lazy whether or not to send an IPI.
>> */
>> if (WARN_ON_ONCE(real_prev != &init_mm &&
>> !cpumask_test_cpu(cpu, mm_cpumask(next))))
>> cpumask_set_cpu(cpu, mm_cpumask(next));
>>
>> + /*
>> + * If the CPU is not in lazy TLB mode, we are just switching
>> + * from one thread in a process to another thread in the same
>> + * process. No TLB flush required.
>> + */
>> + if (!was_lazy)
>> + return;
>> +
>> + /*
>> + * Read the tlb_gen to check whether a flush is needed.
>> + * If the TLB is up to date, just use it.
>> + * The barrier synchronizes with the tlb_gen increment in
>> + * the TLB shootdown code.
>> + */
>> + smp_mb();
>
> What exactly is this smp_mb() ordering? The above comment is
> insufficient. Is it the cpumask_set_cpu() vs the atomic64_read() ?
>
> If so, should this not be smp_mb__after_atomic() (iow a NO-OP on x86)
>
> If it is not so, please fix the comment to explain things.
>
>
The tlb flush code first increments mm->context.tlb_gen, and then sends
shootdown IPIs to CPUs that have this mm loaded and are not in lazy
TLB mode.

At context switch time, we have to ensure that we check the tlb_gen after
we load the old is_lazy state.

Maybe something like this?

/*
* Read the tlb_gen to check whether a flush is needed.
* If the TLB is up to date, just use it.
* The TLB shootdown code first increments tlb_gen, and then
* sends IPIs to CPUs that have this mm loaded and are not
* in lazy TLB mode. The barrier ensures we handle
* cpu_tlbstate.is_lazy before tlb_gen, keeping this code
* synchronized with the TLB flush code.
*/



2018-07-18 16:01:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm/tlb: Make lazy TLB mode lazier

On Wed, Jul 18, 2018 at 11:33:02AM -0400, Rik van Riel wrote:
> The tlb flush code first increments mm->context.tlb_gen, and then sends

> shootdown IPIs to CPUs that have this mm loaded and are not in lazy
> TLB mode.
>
> At context switch time, we have to ensure that we check the tlb_gen after
> we load the old is_lazy state.
>
> Maybe something like this?
>
> /*
> * Read the tlb_gen to check whether a flush is needed.
> * If the TLB is up to date, just use it.
> * The TLB shootdown code first increments tlb_gen, and then
> * sends IPIs to CPUs that have this CPU loaded and are not
> * in lazy TLB mode. The barrier ensures we handle
> * cpu_tlbstate.is_lazy before tlb_gen, keeping this code
> * synchronized with the TLB flush code.
> */

Let me try and draw a diagram; that always works better for me than
text. So the relevant ordering is something like:

  CPU0 - switch_mm()                    CPU1 - flush_tlb_mm_range()

  [W] cpu_tlbstate.is_lazy = false;     [RmW] next->tlb_gen++
  smp_mb()                              MB (implied)
  [R] tlb_gen = next->tlb_gen           native_flush_tlb_others()
                                          [R] cpu_tlbstate.is_lazy


Such that CPU1 either observes !lazy and flushes and/or CPU0 observes
the generation increment and forces a flush itself.

Either way, CPU0 gets flushed.


Also, I don't suppose you've looked at the paravirt instances of
flush_tlb_other() ? They don't elide the flushes because of lazy.

2018-07-18 16:20:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs

On Wed, Jul 18, 2018 at 11:50:33AM -0400, Rik van Riel wrote:
> > On Jul 18, 2018, at 11:42 AM, Rik van Riel <[email protected]> wrote:
> >> On Jul 17, 2018, at 7:39 AM, Peter Zijlstra <[email protected] <mailto:[email protected]>> wrote:

> >> Also, was there a reason to re-implement on_each_cpu_cond() ? (which btw
> >> also wants that __cpumask_set_bit fix).
> >
> > I did not use on_each_cpu_cond() because I had no idea it
> > existed. A quick grep suggests very few users of that function :)

Yeah, only reason I know it existed was because I helped write it or
something like that :-)

> > I'll make sure things are done the right way.
> >
> OK, looking at it some more, I think open coding may be faster in
> case of the TLB shootdown code, because that way we only iterate
> over the CPUs in the mm_cpumask, instead of iterating over every
> single online CPU in the system, and calling the helper function for
> every CPU, like on_each_cpu_cond() does.
>
> However, the difference in overhead might be small enough that
> we might not notice. Preferences?

Yeah, so the difference is the case where the mask allocation fails; in
that case we're under severe memory pressure and performance sucks
anyway, right?

In which case using on_each_cpu_cond() seems the simpler option.

2018-07-18 18:25:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm/tlb: Make lazy TLB mode lazier

On Wed, Jul 18, 2018 at 01:22:19PM -0400, Rik van Riel wrote:
> > On Jul 18, 2018, at 12:00 PM, Peter Zijlstra <[email protected]> wrote:

> > Also, I don't suppose you've looked at the paravirt instances of
> > flush_tlb_other() ? They don't elide the flushes because of lazy.
>
> Let me look at those now :)

<snip xen>

> kvm_flush_tlb_other takes out preempted VCPUs from the flush mask,
> before calling native_flush_tlb_others, so it should get the optimization
> automatically.

Ah, ok. I wasn't entirely sure the new lazy was purely for the idle
case. But yes, the KVM paravirt thing should get the idle case right.

> Hyperv does ... sad magic.

Yeah,.. it got us to switch to 'rcu' freeing of page-tables and enabled
the kvm thing though, so that all is good :-)

2018-07-18 18:53:30

by Rik van Riel

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm/tlb: Make lazy TLB mode lazier



> On Jul 18, 2018, at 2:23 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Jul 18, 2018 at 01:22:19PM -0400, Rik van Riel wrote:
>>> On Jul 18, 2018, at 12:00 PM, Peter Zijlstra <[email protected]> wrote:
>
>>> Also, I don't suppose you've looked at the paravirt instances of
>>> flush_tlb_other() ? They don't elide the flushes because of lazy.
>>
>> Let me look at those now :)
>
> <snip xen>
>
>> kvm_flush_tlb_other takes out preempted VCPUs from the flush mask,
>> before calling native_flush_tlb_others, so it should get the optimization
>> automatically.
>
> Ah, ok. I wasn't entirely sure the new lazy was purely for the idle
> case. But yes, the KVM paravirt thing should get the idle case right.
>
Not just idle, but also running in kernel threads like ksoftirqd,
kworker, kswapd, etc. However, kvm_flush_tlb_other calls
native_flush_tlb_other, so it should get that optimization automatically
from my patch series.


2018-07-18 21:00:05

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier



> On Jul 17, 2018, at 4:04 PM, Andy Lutomirski <[email protected]> wrote:
>
>
> I think you've introduced a minor-ish performance regression due to
> changing the old (admittedly terribly documented) control flow a bit.
> Before, if real_prev == next, we would skip:
>
> load_mm_cr4(next);
> switch_ldt(real_prev, next);
>
> Now we don't any more. I think you should reinstate that
> optimization. It's probably as simple as wrapping them in an if
> (real_priv != next) with a comment like /* Remote changes that would
> require a cr4 or ldt reload will unconditionally send an IPI even to
> lazy CPUs. So, if we aren't changing our mm, we don't need to refresh
> cr4 or the ldt */

Looks like switch_ldt already skips reloading the LDT when prev equals
next, or when they simply have the same LDT values:

	if (unlikely((unsigned long)prev->context.ldt |
		     (unsigned long)next->context.ldt))
		load_mm_ldt(next);

It appears that the cr4 bits have a similar optimization:

static inline void cr4_set_bits(unsigned long mask)
{
	unsigned long cr4, flags;

	local_irq_save(flags);
	cr4 = this_cpu_read(cpu_tlbstate.cr4);
	if ((cr4 | mask) != cr4)
		__cr4_set(cr4 | mask);
	local_irq_restore(flags);
}

>
> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to
> fix that part or should I?
>
Looks like there might not be anything to do here, after all.

On to the lazy TLB mm_struct refcounting stuff :)


2018-07-18 23:14:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier



> On Jul 18, 2018, at 10:58 AM, Rik van Riel <[email protected]> wrote:
>
>
>
>> On Jul 17, 2018, at 4:04 PM, Andy Lutomirski <[email protected]> wrote:
>>
>>
>> I think you've introduced a minor-ish performance regression due to
>> changing the old (admittedly terribly documented) control flow a bit.
>> Before, if real_prev == next, we would skip:
>>
>> load_mm_cr4(next);
>> switch_ldt(real_prev, next);
>>
>> Now we don't any more. I think you should reinstate that
>> optimization. It's probably as simple as wrapping them in an if
>> (real_priv != next) with a comment like /* Remote changes that would
>> require a cr4 or ldt reload will unconditionally send an IPI even to
>> lazy CPUs. So, if we aren't changing our mm, we don't need to refresh
>> cr4 or the ldt */
>
> Looks like switch_ldt already skips reloading the LDT when prev equals
> next, or when they simply have the same LDT values:
>
> if (unlikely((unsigned long)prev->context.ldt |
> (unsigned long)next->context.ldt))
> load_mm_ldt(next);
>

Read that again? It will reload if there’s an LDT, even if it’s the same one.

> It appears that the cr4 bits have a similar optimization:
>
> static inline void cr4_set_bits(unsigned long mask)
> {
> unsigned long cr4, flags;
>
> local_irq_save(flags);
> cr4 = this_cpu_read(cpu_tlbstate.cr4);
> if ((cr4 | mask) != cr4)
> __cr4_set(cr4 | mask);
> local_irq_restore(flags);
> }
>>
>> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to
>> fix that part or should I?
>>
> Looks like there might not be anything to do here, after all.

But if init_mm and the thread that just went idle have different selected cr4 values, we’ll still write it. With your lazy TLB work, it’s less of a big deal, but still.

I’m happy to fix this myself, though.
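
A rough sketch of that bypass, assuming load_mm_cr4() keeps roughly its
4.18 shape (the static_key_false()/perf_rdpmc_allowed logic below is only
reproduced for context and may differ in detail):

static inline void load_mm_cr4(struct mm_struct *mm)
{
	/*
	 * Kernel threads run on init_mm and never need X86_CR4_PCE
	 * toggled; skipping them avoids a CR4 write when bouncing
	 * between a user task and the idle/kernel threads.
	 */
	if (mm == &init_mm)
		return;

	if (static_key_false(&rdpmc_always_available) ||
	    atomic_read(&mm->context.perf_rdpmc_allowed))
		cr4_set_bits(X86_CR4_PCE);
	else
		cr4_clear_bits(X86_CR4_PCE);
}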

>
> On to the lazy TLB mm_struct refcounting stuff :)
>

Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about mm_count. My suggestion is to get rid of mm_count instead of trying to optimize it.

2018-07-19 09:16:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm/tlb: Make lazy TLB mode lazier

On Wed, Jul 18, 2018 at 02:51:45PM -0400, Rik van Riel wrote:

> > Ah, ok. I wasn't entirely sure the new lazy was purely for the idle
> > case. But yes, the KVM paravirt thing should get the idle case right.
> >
> Not just idle, but also running in kernel threads like ksoftirqd,
> kworker, kswapd, etc.

Right, that's what I thought.

> However, kvm_flush_tlb_other calls
> native_flush_tlb_other, so it should get that optimization automatically
> from my patch series.

Aah, seems I forgot that part, then yes indeed, it all works.

2018-07-19 16:47:02

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

[I added PeterZ and Vitaly -- can you see any way in which this would
break something obscure? I don't.]

On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel <[email protected]> wrote:
> I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
> next?

Yes, AFAICS.

>
> On to the lazy TLB mm_struct refcounting stuff :)
>
>>
>> Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about
>> mm_count. My suggestion is to get rid of mm_count instead of trying to
>> optimize it.
>
>
> Do you have any suggestions on how? :)
>
> The TLB shootdown sent at __exit_mm time does not get rid of the
> kernelthread->active_mm
> pointer pointing at the mm that is exiting.
>

Ah, but that's conceptually very easy to fix. Add a #define like
ARCH_NO_TASK_ACTIVE_MM. Then just get rid of active_mm if that
#define is set. After some grepping, there are very few users. The
only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
are involved in the rather complicated dance of refcounting active_mm.
If that field goes away, it doesn't need to be refcounted. Instead, I
think the refcounting can get replaced with something like:

/*
 * Release any arch-internal references to mm. Only called when mm_users
 * is zero and all tasks using mm have either been switch_mm()'d away or
 * have had enter_lazy_tlb() called.
 */
extern void arch_shoot_down_dead_mm(struct mm_struct *mm);

which the kernel calls in __mmput() after tearing down all the page
tables. The body can be something like:

if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) {
/* send an IPI. Maybe just call tlb_flush_remove_tables() */
}

(You'll also have to fix up the highly questionable users in
arch/x86/platform/efi/efi_64.c, but that's easy.)

Does all that make sense? Basically, as I understand it, the
expensive atomic ops you're seeing are all pointless because they're
enabling an optimization that hasn't actually worked for a long time,
if ever.

2018-07-19 17:06:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski <[email protected]> wrote:
> [I added PeterZ and Vitaly -- can you see any way in which this would
> break something obscure? I don't.]
>
> On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel <[email protected]> wrote:
>> I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
>> next?
>
> Yes, AFAICS.
>
>>
>> On to the lazy TLB mm_struct refcounting stuff :)
>>
>>>
>>> Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about
>>> mm_count. My suggestion is to get rid of mm_count instead of trying to
>>> optimize it.
>>
>>
>> Do you have any suggestions on how? :)
>>
>> The TLB shootdown sent at __exit_mm time does not get rid of the
>> kernelthread->active_mm
>> pointer pointing at the mm that is exiting.
>>
>
> Ah, but that's conceptually very easy to fix. Add a #define like
> ARCH_NO_TASK_ACTIVE_MM. Then just get rid of active_mm if that
> #define is set. After some grepping, there are very few users. The
> only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> are involved in the rather complicated dance of refcounting active_mm.
> If that field goes away, it doesn't need to be refcounted. Instead, I
> think the refcounting can get replaced with something like:
>
> /*
> * Release any arch-internal references to mm. Only called when
> mm_users is zero
> * and all tasks using mm have either been switch_mm()'d away or have had
> * enter_lazy_tlb() called.
> */
> extern void arch_shoot_down_dead_mm(struct mm_struct *mm);
>
> which the kernel calls in __mmput() after tearing down all the page
> tables. The body can be something like:
>
> if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) {
> /* send an IPI. Maybe just call tlb_flush_remove_tables() */
> }
>
> (You'll also have to fix up the highly questionable users in
> arch/x86/platform/efi/efi_64.c, but that's easy.)
>
> Does all that make sense? Basically, as I understand it, the
> expensive atomic ops you're seeing are all pointless because they're
> enabling an optimization that hasn't actually worked for a long time,
> if ever.

Hmm. Xen PV has a big hack in xen_exit_mmap(), which is called from
arch_exit_mmap(), I think. It's a heavier weight version of more or
less the same thing that arch_shoot_down_dead_mm() would be, except
that it happens before exit_mmap(). But maybe Xen actually has the
right idea. In other words, rather than doing the big pagetable free in
exit_mmap() while there may still be other CPUs pointing at the page
tables, the other order might make more sense. So maybe, if
ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible
for getting rid of all secret arch references to the mm.

Hmm. ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name.

I added some more arch maintainers. The idea here is that, on x86 at
least, task->active_mm and all its refcounting is pure overhead. When
a process exits, __mmput() gets called, but the core kernel has a
longstanding "optimization" in which other tasks (kernel threads and
idle tasks) may have ->active_mm pointing at this mm. This is nasty,
complicated, and hurts performance on large systems, since it requires
extra atomic operations whenever a CPU switches between real user
threads and idle/kernel threads.

It's also almost completely worthless on x86 at least, since __mmput()
frees pagetables, and that operation *already* forces a remote TLB
flush, so we might as well zap all the active_mm references at the
same time.

But arm64 has real HW remote flushes. Does arm64 actually benefit
from the active_mm optimization? What happens on arm64 when a process
exits? How about s390? I suspect that s390 has rather larger systems
than arm64, where the cost of the reference counting can be much
higher.

(Also, Rik, x86 on Hyper-V has remote flushes, too. How does that
interact with your previous patch set?)

2018-07-19 17:20:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Thu, Jul 19, 2018 at 10:15 AM, Rik van Riel <[email protected]> wrote:
>
>
> Given that CPUs in lazy TLB mode stay part of the mm_cpumask,
> that WARN_ON seems misplaced. You are right though, that the
> mm_cpumask alone should provide enough information for us to
> avoid a need for both tsk->active_mm and the refcounting.
>

If you do this extra shootdown after freeing pagetables, it would be
odd if mm_cpumask() wasn't empty. But you're right, the warn is
probably silly. And if you move it into arch_exit_mmap(), the warn is
definitely wrong.

>
> Does all that make sense? Basically, as I understand it, the
> expensive atomic ops you're seeing are all pointless because they're
> enabling an optimization that hasn't actually worked for a long time,
> if ever.
>
>
> Our benchmark results suggest that lazy TLB mode works, and makes
> a measurable performance difference. Getting rid of the atomic ops
> should make it a little better, though :)
>

I'm not saying lazy mode is useless. I'm saying that active_mm isn't
needed for x86's lazy mode :)

2018-07-20 05:00:22

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Thu, 2018-07-19 at 10:04 -0700, Andy Lutomirski wrote:
> On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski <[email protected]> wrote:
> > [I added PeterZ and Vitaly -- can you see any way in which this would
> > break something obscure? I don't.]

Added Nick and Aneesh. We do have HW remote flushes on powerpc.

> > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel <[email protected]> wrote:
> > > I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
> > > next?
> >
> > Yes, AFAICS.
> >
> > >
> > > On to the lazy TLB mm_struct refcounting stuff :)
> > >
> > > >
> > > > Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about
> > > > mm_count. My suggestion is to get rid of mm_count instead of trying to
> > > > optimize it.
> > >
> > >
> > > Do you have any suggestions on how? :)
> > >
> > > The TLB shootdown sent at __exit_mm time does not get rid of the
> > > kernelthread->active_mm
> > > pointer pointing at the mm that is exiting.
> > >
> >
> > Ah, but that's conceptually very easy to fix. Add a #define like
> > ARCH_NO_TASK_ACTIVE_MM. Then just get rid of active_mm if that
> > #define is set. After some grepping, there are very few users. The
> > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> > are involved in the rather complicated dance of refcounting active_mm.
> > If that field goes away, it doesn't need to be refcounted. Instead, I
> > think the refcounting can get replaced with something like:
> >
> > /*
> > * Release any arch-internal references to mm. Only called when
> > mm_users is zero
> > * and all tasks using mm have either been switch_mm()'d away or have had
> > * enter_lazy_tlb() called.
> > */
> > extern void arch_shoot_down_dead_mm(struct mm_struct *mm);
> >
> > which the kernel calls in __mmput() after tearing down all the page
> > tables. The body can be something like:
> >
> > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) {
> > /* send an IPI. Maybe just call tlb_flush_remove_tables() */
> > }
> >
> > (You'll also have to fix up the highly questionable users in
> > arch/x86/platform/efi/efi_64.c, but that's easy.)
> >
> > Does all that make sense? Basically, as I understand it, the
> > expensive atomic ops you're seeing are all pointless because they're
> > enabling an optimization that hasn't actually worked for a long time,
> > if ever.
>
> Hmm. Xen PV has a big hack in xen_exit_mmap(), which is called from
> arch_exit_mmap(), I think. It's a heavier weight version of more or
> less the same thing that arch_shoot_down_dead_mm() would be, except
> that it happens before exit_mmap(). But maybe Xen actually has the
> right idea. In other words, rather doing the big pagetable free in
> exit_mmap() while there may still be other CPUs pointing at the page
> tables, the other order might make more sense. So maybe, if
> ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible
> for getting rid of all secret arch references to the mm.
>
> Hmm. ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name.
>
> I added some more arch maintainers. The idea here is that, on x86 at
> least, task->active_mm and all its refcounting is pure overhead. When
> a process exits, __mmput() gets called, but the core kernel has a
> longstanding "optimization" in which other tasks (kernel threads and
> idle tasks) may have ->active_mm pointing at this mm. This is nasty,
> complicated, and hurts performance on large systems, since it requires
> extra atomic operations whenever a CPU switches between real users
> threads and idle/kernel threads.
>
> It's also almost completely worthless on x86 at least, since __mmput()
> frees pagetables, and that operation *already* forces a remote TLB
> flush, so we might as well zap all the active_mm references at the
> same time.
>
> But arm64 has real HW remote flushes. Does arm64 actually benefit
> from the active_mm optimization? What happens on arm64 when a process
> exits? How about s390? I suspect that x390 has rather larger systems
> than arm64, where the cost of the reference counting can be much
> higher.
>
> (Also, Rik, x86 on Hyper-V has remote flushes, too. How does that
> interact with your previous patch set?)

2018-07-20 08:03:46

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

Andy Lutomirski <[email protected]> writes:

> [I added PeterZ and Vitaly -- can you see any way in which this would
> break something obscure? I don't.]

Thanks for CCing me,

I don't see how this can break things either. At first glance, however,
I'm afraid we may add a performance penalty to virtualized guests which
don't use native_flush_tlb_others() (Hyper-V, KVM): we will be reloading
CR3 needlessly, since the PV TLB flush functions do not look at lazy
mode.

We can either check in switch_mm_irqs_off() that
native_flush_tlb_others() is in use, or teach the PV TLB flush functions
to look at lazy mode too.

(Sorry if I'm missing something important here or if this was already
discussed. I just became aware of this work)

[...]

--
Vitaly

2018-07-20 08:32:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Thu, Jul 19, 2018 at 10:04:09AM -0700, Andy Lutomirski wrote:
> I added some more arch maintainers. The idea here is that, on x86 at
> least, task->active_mm and all its refcounting is pure overhead. When
> a process exits, __mmput() gets called, but the core kernel has a
> longstanding "optimization" in which other tasks (kernel threads and
> idle tasks) may have ->active_mm pointing at this mm. This is nasty,
> complicated, and hurts performance on large systems, since it requires
> extra atomic operations whenever a CPU switches between real user
> threads and idle/kernel threads.
>
> It's also almost completely worthless on x86 at least, since __mmput()
> frees pagetables, and that operation *already* forces a remote TLB
> flush, so we might as well zap all the active_mm references at the
> same time.

So I disagree that active_mm is complicated (the code is less than ideal
but that is actually fixable). And aside from the process exit case, it
does avoid CR3 writes when switching between user and kernel threads
(which can happen far more often than exit if you have longer-running
tasks).

Now agreed, recent x86 work has made that less important.

And I of course also agree that not doing those refcount atomics is
better.

2018-07-20 09:34:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Thu, Jul 19, 2018 at 09:45:47AM -0700, Andy Lutomirski wrote:
> After some grepping, there are very few users. The
> only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> are involved in the rather complicated dance of refcounting active_mm.

Something like so should work I suppose.

It keeps the field, but ensures that for all tasks tsk->active_mm ==
tsk->mm. The implication is that switch_mm(.prev == NULL) should work
(it does for x86, we don't care about @prev).


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d84eafcf5e29..7bfd850c6bf7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2780,12 +2780,8 @@ static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next, struct rq_flags *rf)
{
- struct mm_struct *mm, *oldmm;
-
prepare_task_switch(rq, prev, next);

- mm = next->mm;
- oldmm = prev->active_mm;
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
@@ -2800,16 +2796,32 @@ context_switch(struct rq *rq, struct task_struct *prev,
* membarrier after storing to rq->curr, before returning to
* user-space.
*/
- if (!mm) {
- next->active_mm = oldmm;
- mmgrab(oldmm);
- enter_lazy_tlb(oldmm, next);
- } else
- switch_mm_irqs_off(oldmm, mm, next);
-
- if (!prev->mm) {
- prev->active_mm = NULL;
- rq->prev_mm = oldmm;
+
+ /*
+ * kernel -> kernel lazy + transfer active
+ * user -> kernel lazy + mmgrab() active
+ *
+ * kernel -> user switch + mmdrop() active
+ * user -> user switch
+ */
+ if (!next->mm) { // to kernel
+ enter_lazy_tlb(prev->active_mm, next);
+
+#ifdef ARCH_NO_ACTIVE_MM
+ next->active_mm = prev->active_mm;
+ if (prev->mm) // from user
+ mmgrab(prev->active_mm);
+#endif
+ } else { // to user
+ switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
+#ifdef ARCH_NO_ACTIVE_MM
+ if (!prev->mm) { // from kernel
+ /* will mmdrop() in finish_task_switch(). */
+ rq->prev_mm = prev->active_mm;
+ prev->active_mm = NULL;
+ }
+#endif
}

rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3e612ae748e9..770bb615d115 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -12,6 +12,10 @@

#include <asm/mmu_context.h>

+#ifndef finish_arch_post_lock_switch
+# define finish_arch_post_lock_switch() do { } while (0)
+#endif
+
/*
* use_mm
* Makes the calling kernel thread take on the specified
@@ -33,12 +37,14 @@ void use_mm(struct mm_struct *mm)
tsk->mm = mm;
switch_mm(active_mm, mm, tsk);
task_unlock(tsk);
-#ifdef finish_arch_post_lock_switch
finish_arch_post_lock_switch();
-#endif

+#ifdef ARCH_NO_ACTIVE_MM
+ WARN_ON_ONCE(active_mm != NULL);
+#else
if (active_mm != mm)
mmdrop(active_mm);
+#endif
}
EXPORT_SYMBOL_GPL(use_mm);

@@ -57,8 +63,22 @@ void unuse_mm(struct mm_struct *mm)
task_lock(tsk);
sync_mm_rss(mm);
tsk->mm = NULL;
+
+ WARN_ON_ONCE(tsk->active_mm != mm);
+
+#ifdef ARCH_NO_ACTIVE_MM
+ switch_mm(tsk->active_mm, &init_mm, tsk);
+ mmdrop(tsk->active_mm);
+ tsk->active_mm = NULL;
+#else
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, tsk);
+#endif
+
task_unlock(tsk);
+
+#ifdef ARCH_NO_ACTIVE_MM
+ finish_arch_post_lock_switch();
+#endif
}
EXPORT_SYMBOL_GPL(unuse_mm);

2018-07-20 09:51:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Fri, Jul 20, 2018 at 10:02:10AM +0200, Vitaly Kuznetsov wrote:
> Andy Lutomirski <[email protected]> writes:
>
> > [I added PeterZ and Vitaly -- can you see any way in which this would
> > break something obscure? I don't.]
>
> Thanks for CCing me,
>
> I don't see how this can break things either. At first glance, however,
> I'm afraid we may add a performance penalty to virtualized guests which
> don't use native_flush_tlb_others() (Hyper-V, KVM): we will be reloading
> CR3 needlessly, since the PV TLB flush functions do not look at lazy
> mode.
>
> We can either check in switch_mm_irqs_off() that
> native_flush_tlb_others() is in use, or teach the PV TLB flush functions
> to look at lazy mode too.

As Rik noted elsewhere in the thread, kvm_flush_tlb_others() ends up
calling native_flush_tlb_others() for all running vCPU threads.

The Hyper-V thing is magical, we can't really do anything about it
there. Let them worry about it.

2018-07-20 10:19:29

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

Peter Zijlstra <[email protected]> writes:

> On Fri, Jul 20, 2018 at 10:02:10AM +0200, Vitaly Kuznetsov wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>> > [I added PeterZ and Vitaly -- can you see any way in which this would
>> > break something obscure? I don't.]
>>
>> Thanks for CCing me,
>>
>> I don't see how this can break things either. At first glance, however,
>> I'm afraid we may add a performance penalty to virtualized guests which
>> don't use native_flush_tlb_others() (Hyper-V, KVM): we will be reloading
>> CR3 needlessly, since the PV TLB flush functions do not look at lazy
>> mode.
>>
>> We can either check in switch_mm_irqs_off() that
>> native_flush_tlb_others() is in use, or teach the PV TLB flush functions
>> to look at lazy mode too.
>
> As Rik noted elsewhere in the thread, kvm_flush_tlb_others() ends up
> calling native_flush_tlb_others() for all running vCPU threads.

Ah, right!

>
> The Hyper-V thing is magical, we can't really do anything about it
> there. Let them worry about it.

Well, we kinda know how this magic works: we just ask the hypervisor to
flush the TLB for us (if the particular vCPU is running) :-) Anyway,
nothing stops us from duplicating the lazy mode logic from
native_flush_tlb_others() into hyperv_flush_tlb_others(): if the TLB
state is lazy, omit the TLB flush.

--
Vitaly
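
A rough sketch of the lazy-mode check Vitaly describes, as it might look
in a PV flush path (the helper name and the caller-owned scratch cpumask
are assumptions; skipping lazy CPUs is only safe in combination with the
shootdown-at-page-table-free logic from this series, which deals with
lazy CPUs before page tables are freed):

/* Drop lazy-TLB CPUs from the set handed to the hypervisor. */
static void trim_lazy_cpus(struct cpumask *dst, const struct cpumask *src)
{
	unsigned int cpu;

	cpumask_copy(dst, src);
	for_each_cpu(cpu, src) {
		if (per_cpu(cpu_tlbstate.is_lazy, cpu))
			cpumask_clear_cpu(cpu, dst);
	}
}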

2018-07-20 11:05:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Fri, Jul 20, 2018 at 11:32:39AM +0200, Peter Zijlstra wrote:

> + if (!next->mm) { // to kernel
> + enter_lazy_tlb(prev->active_mm, next);
> +
> +#ifdef ARCH_NO_ACTIVE_MM
> + next->active_mm = prev->active_mm;
> + if (prev->mm) // from user
> + mmgrab(prev->active_mm);
else
prev->active_mm = NULL;
> +#endif
> + } else { // to user
> + switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
> +#ifdef ARCH_NO_ACTIVE_MM
> + if (!prev->mm) { // from kernel
> + /* will mmdrop() in finish_task_switch(). */
> + rq->prev_mm = prev->active_mm;
> + prev->active_mm = NULL;
> + }
> +#endif
> }
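
Folding Peter's unquoted two-line correction into the quoted hunk, the
branch he appears to be proposing would read roughly as follows (a
sketch of the proposal, not a tested patch):

	if (!next->mm) {				// to kernel
		enter_lazy_tlb(prev->active_mm, next);

#ifdef ARCH_NO_ACTIVE_MM
		next->active_mm = prev->active_mm;
		if (prev->mm)				// from user
			mmgrab(prev->active_mm);
		else
			prev->active_mm = NULL;
#endif
	} else {					// to user
		switch_mm_irqs_off(prev->active_mm, next->mm, next);

#ifdef ARCH_NO_ACTIVE_MM
		if (!prev->mm) {			// from kernel
			/* will mmdrop() in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
#endif
	}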

2018-07-23 12:28:19

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Fri, 2018-07-20 at 10:30 +0200, Peter Zijlstra wrote:
> On Thu, Jul 19, 2018 at 10:04:09AM -0700, Andy Lutomirski wrote:
> > I added some more arch maintainers. The idea here is that, on x86 at
> > least, task->active_mm and all its refcounting is pure overhead. When
> > a process exits, __mmput() gets called, but the core kernel has a
> > longstanding "optimization" in which other tasks (kernel threads and
> > idle tasks) may have ->active_mm pointing at this mm. This is nasty,
> > complicated, and hurts performance on large systems, since it requires
> > extra atomic operations whenever a CPU switches between real user
> > threads and idle/kernel threads.
> >
> > It's also almost completely worthless on x86 at least, since __mmput()
> > frees pagetables, and that operation *already* forces a remote TLB
> > flush, so we might as well zap all the active_mm references at the
> > same time.
>
> So I disagree that active_mm is complicated (the code is less than
> ideal but that is actually fixable). And aside from the process exit
> case, it does avoid CR3 writes when switching between user and kernel
> threads (which can happen far more often than exit if you have
> longer-running tasks).
>
> Now agreed, recent x86 work has made that less important.
>
> And I of course also agree that not doing those refcount atomics is
> better.

It might be cleaner to keep the ->active_mm pointer
in place for now (at least in the first patch), even
on architectures where we end up dropping the refcounting.

That way the code is more similar everywhere, and
we just get rid of the expensive instructions.

Let me try coding this up...

--
All Rights Reversed.



2018-07-24 16:35:10

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

Hi Andy,

Sorry, I missed the arm64 question at the end of this...

On Thu, Jul 19, 2018 at 10:04:09AM -0700, Andy Lutomirski wrote:
> On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski <[email protected]> wrote:
> > [I added PeterZ and Vitaly -- can you see any way in which this would
> > break something obscure? I don't.]
> >
> > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel <[email protected]> wrote:
> >> I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
> >> next?
> >
> > Yes, AFAICS.
> >
> >>
> >> On to the lazy TLB mm_struct refcounting stuff :)
> >>
> >>>
> >>> Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about
> >>> mm_count. My suggestion is to get rid of mm_count instead of trying to
> >>> optimize it.
> >>
> >>
> >> Do you have any suggestions on how? :)
> >>
> >> The TLB shootdown sent at __exit_mm time does not get rid of the
> >> kernelthread->active_mm
> >> pointer pointing at the mm that is exiting.
> >>
> >
> > Ah, but that's conceptually very easy to fix. Add a #define like
> > ARCH_NO_TASK_ACTIVE_MM. Then just get rid of active_mm if that
> > #define is set. After some grepping, there are very few users. The
> > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> > are involved in the rather complicated dance of refcounting active_mm.
> > If that field goes away, it doesn't need to be refcounted. Instead, I
> > think the refcounting can get replaced with something like:
> >
> > /*
> > * Release any arch-internal references to mm. Only called when
> > * mm_users is zero and all tasks using mm have either been
> > * switch_mm()'d away or have had enter_lazy_tlb() called.
> > */
> > extern void arch_shoot_down_dead_mm(struct mm_struct *mm);
> >
> > which the kernel calls in __mmput() after tearing down all the page
> > tables. The body can be something like:
> >
> > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...))) {
> > /* send an IPI. Maybe just call tlb_flush_remove_tables() */
> > }
> >
> > (You'll also have to fix up the highly questionable users in
> > arch/x86/platform/efi/efi_64.c, but that's easy.)
> >
> > Does all that make sense? Basically, as I understand it, the
> > expensive atomic ops you're seeing are all pointless because they're
> > enabling an optimization that hasn't actually worked for a long time,
> > if ever.
>
> Hmm. Xen PV has a big hack in xen_exit_mmap(), which is called from
> arch_exit_mmap(), I think. It's a heavier weight version of more or
> less the same thing that arch_shoot_down_dead_mm() would be, except
> that it happens before exit_mmap(). But maybe Xen actually has the
> right idea. In other words, rather than doing the big pagetable free in
> exit_mmap() while there may still be other CPUs pointing at the page
> tables, the other order might make more sense. So maybe, if
> ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible
> for getting rid of all secret arch references to the mm.
>
> Hmm. ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name.
>
> I added some more arch maintainers. The idea here is that, on x86 at
> least, task->active_mm and all its refcounting is pure overhead. When
> a process exits, __mmput() gets called, but the core kernel has a
> longstanding "optimization" in which other tasks (kernel threads and
> idle tasks) may have ->active_mm pointing at this mm. This is nasty,
> complicated, and hurts performance on large systems, since it requires
> extra atomic operations whenever a CPU switches between real user
> threads and idle/kernel threads.
>
> It's also almost completely worthless on x86 at least, since __mmput()
> frees pagetables, and that operation *already* forces a remote TLB
> flush, so we might as well zap all the active_mm references at the
> same time.
>
> But arm64 has real HW remote flushes. Does arm64 actually benefit
> from the active_mm optimization? What happens on arm64 when a process
> exits? How about s390? I suspect that s390 has rather larger systems
> than arm64, where the cost of the reference counting can be much
> higher.

IIRC, the TLB invalidation on task exit has the fullmm field set in the
mmu_gather structure, so we don't actually do any TLB invalidation at all.
Instead, we just don't re-allocate the ASID, and we invalidate the whole
TLB when we run out of ASIDs (they're 16-bit on most Armv8 CPUs).

Does that answer your question?

Will
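
A simplified sketch of the pattern Will describes (not the exact arm64
code; the function body is an assumption for illustration):

static inline void tlb_flush(struct mmu_gather *tlb)
{
	if (tlb->fullmm) {
		/*
		 * The whole mm is going away: skip the TLB invalidation
		 * and let the ASID allocator retire the ASID. The old
		 * ASID is not handed out again until the allocator rolls
		 * over and invalidates the whole TLB.
		 */
		return;
	}
	/* ... range-based invalidation for partial teardown ... */
}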

2018-07-25 01:34:24

by Anders Roxell

[permalink] [raw]
Subject: Re: [tip:x86/mm] x86/mm/tlb: Leave lazy TLB mode at page table free time

On 2018-07-17 13:46, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 02:34:07AM -0700, tip-bot for Rik van Riel wrote:
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index 3063125197ad..e811ef7b8350 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -303,4 +303,14 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
> >
> > #define tlb_migrate_finish(mm) do {} while (0)
> >
> > +/*
> > + * Used to flush the TLB when page tables are removed, when lazy
> > + * TLB mode may cause a CPU to retain intermediate translations
> > + * pointing to about-to-be-freed page table memory.
> > + */
> > +#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
> > +#define tlb_flush_remove_tables(mm) do {} while (0)
> > +#define tlb_flush_remove_tables_local(mm) do {} while (0)
> > +#endif
>
> Is there a reason these are not inline functions, which gets us type
> checking and the like?

In addition, when building (linux-next tag: next-20180724) on 32-bit arm
there is the build error below: when CONFIG_MMU is enabled,
asm-generic/tlb.h isn't included (see arch/arm/include/asm/tlb.h).

CC mm/memory.o
mm/memory.c: In function ‘tlb_remove_table_smp_sync’:
mm/memory.c:339:2: error: implicit declaration of function ‘tlb_flush_remove_tables_local’; did you mean ‘tlb_remove_table’? [-Werror=implicit-function-declaration]
tlb_flush_remove_tables_local(mm);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tlb_remove_table
mm/memory.c: In function ‘tlb_table_flush’:
mm/memory.c:372:2: error: implicit declaration of function ‘tlb_flush_remove_tables’; did you mean ‘tlb_remove_table’? [-Werror=implicit-function-declaration]
tlb_flush_remove_tables(tlb->mm);
^~~~~~~~~~~~~~~~~~~~~~~
tlb_remove_table


An alternative could be to define these in every architecture's
asm/tlb.h, but that doesn't make much sense. Another way would be to add
the definitions to arch/arm/include/asm/tlb.h, though I'm not sure that
would be best. Is there any other place these could be defined to fix
the issue on 32-bit arm?


Cheers,
Anders
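
One possible shape for the stubs, written as static inlines so they get
the type checking Peter asked about (where to put them so that 32-bit
arm actually sees them is the open question; the signatures below are
assumed from the x86 implementation in this series):

#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
static inline void tlb_flush_remove_tables(struct mm_struct *mm) { }
static inline void tlb_flush_remove_tables_local(void *arg) { }
#endif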

2018-08-04 22:29:59

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids

On Mon, Jul 16, 2018 at 03:03:31PM -0400, Rik van Riel wrote:
> The mm_struct always contains a cpumask bitmap, regardless of
> CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
> simplify things, and simply have one bitmask at the end of the
> mm_struct for the mm_cpumask.
>
> This does necessitate moving everything else in mm_struct into
> an anonymous sub-structure, which can be randomized when struct
> randomization is enabled.
>
> The second step is to determine the correct size for the
> mm_struct slab object from the size of the mm_struct
> (excluding the cpu bitmap) and the size the cpumask.
>
> For init_mm we can simply allocate the maximum size this
> kernel is compiled for, since we only have one init_mm
> in the system, anyway.
>
> Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
> getting confused by the dynamically sized array.
>
> Signed-off-by: Rik van Riel <[email protected]>
> Signed-off-by: Mike Galbraith <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>
> Acked-by: Dave Hansen <[email protected]>
> Tested-by: Song Liu <[email protected]>

Hi,

this patch causes unicore32 build failures.

In file included from include/linux/mm.h:17,
from arch/unicore32/kernel/asm-offsets.c:17:
include/linux/mm_types.h:497: error:
flexible array member in otherwise empty struct

Build reference: next-20180803
gcc version: unicore32-linux-gcc (UC4_1.0.5_20100917) 4.4.2

I understand this is an old compiler, but it is the only available
version as far as I know.

Guenter
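
A condensed sketch of the layout the patch description refers to (only
the relevant parts of mm_struct are shown; member names follow the
patch, everything else is elided):

struct mm_struct {
	struct {
		/*
		 * All pre-existing mm_struct members live in this
		 * anonymous struct, so they can be randomized when
		 * struct randomization is enabled. (Members elided.)
		 */
	} __randomize_layout;
	/*
	 * The mm_cpumask bitmap, sized at runtime from nr_cpu_ids;
	 * it must stay at the end of mm_struct.
	 */
	unsigned long cpu_bitmap[];
};

The slab object size then becomes roughly sizeof(struct mm_struct) +
cpumask_size(), with init_mm statically sized for the maximum the kernel
is compiled for. The gcc 4.4.2 diagnostic above indicates that the old
compiler sees the struct as empty apart from the flexible array member.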

2018-08-16 05:49:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time

On Mon, Jul 16, 2018 at 12:03 PM, Rik van Riel <[email protected]> wrote:
> Andy discovered that speculative memory accesses while in lazy
> TLB mode can crash a system, when a CPU tries to dereference a
> speculative access using memory contents that used to be valid
> page table memory, but have since been reused for something else
> and point into la-la land.

Hi Rik-

I was looking through this, and I see:

> -static void tlb_remove_table_one(void *table)
> +static void tlb_remove_table_one(void *table, struct mmu_gather *tlb)
> {
> /*
> * This isn't an RCU grace period and hence the page-tables cannot be
> @@ -344,7 +348,7 @@ static void tlb_remove_table_one(void *table)
> * It is however sufficient for software page-table walkers that rely on
> * IRQ disabling. See the comment near struct mmu_table_batch.
> */
> - smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
> + smp_call_function(tlb_remove_table_smp_sync, tlb->mm, 1);
> __tlb_remove_table(table);
> }

But tlb_remove_table() doesn't always call tlb_remove_table_one(). Do
the other paths through tlb_remove_table() do the right thing?

--Andy

2018-08-16 09:23:14

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time

On Wed, 2018-08-15 at 18:54 -0700, Andy Lutomirski wrote:
> On Mon, Jul 16, 2018 at 12:03 PM, Rik van Riel <[email protected]>
> wrote:

> Hi Rik-
>
> I was looking through this, and I see:
>
> > -static void tlb_remove_table_one(void *table)
> > +static void tlb_remove_table_one(void *table, struct mmu_gather *tlb)
> > {
> > /*
> > * This isn't an RCU grace period and hence the page-tables cannot be
> > @@ -344,7 +348,7 @@ static void tlb_remove_table_one(void *table)
> > * It is however sufficient for software page-table walkers that rely on
> > * IRQ disabling. See the comment near struct mmu_table_batch.
> > */
> > - smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
> > + smp_call_function(tlb_remove_table_smp_sync, tlb->mm, 1);
> > __tlb_remove_table(table);
> > }
>
> But tlb_remove_table() doesn't always call tlb_remove_table_one().
> Do the other paths through tlb_remove_table() do the right thing?

Urghhh, reading through the code it seems that
I got one case wrong.

The obvious cases, where tlb->mm->mm_users >= 2,
are both correct. The functions tlb_remove_table_one()
and tlb_table_flush() both get rid of lazy TLB users
of the MM.

However, there is one case where we don't but should:
	/*
	 * When there's less then two users of this mm there cannot be a
	 * concurrent page-table walk.
	 */
	if (atomic_read(&tlb->mm->mm_users) < 2) {
		__tlb_remove_table(table);
		return;
	}

I guess the obvious fix is to call tlb_flush_remove_tables()
in the branch above. The first time it might shoot down a CPU
in lazy TLB mode, while the second time it is called that CPU
will have already switched to init_mm and no IPI is sent.

Alternatively, we could always do the batching, even when
mm_users is 1, and examine the bitmap with the other CPUs
less frequently in the exit path.

I can send in a patch for this tomorrow, or early next week,
depending on what other stuff comes up...

--
All Rights Reversed.
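
A sketch of the first option Rik mentions, applied to the quoted
tlb_remove_table() branch (untested; the surrounding batching path is
unchanged and elided):

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	/*
	 * When there are fewer than two users of this mm there cannot
	 * be a concurrent page-table walk, but a remote CPU may still
	 * be in lazy TLB mode on this mm: shoot it down before the
	 * table is freed.
	 */
	if (atomic_read(&tlb->mm->mm_users) < 2) {
		tlb_flush_remove_tables(tlb->mm);
		__tlb_remove_table(table);
		return;
	}

	/* ... existing batching path ... */
}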


Subject: [tip:x86/mm] x86/mm/tlb: Always use lazy TLB mode

Commit-ID: 5462bc3a9a3c38328bbbd276d51164c7cf21d6a8
Gitweb: https://git.kernel.org/tip/5462bc3a9a3c38328bbbd276d51164c7cf21d6a8
Author: Rik van Riel <[email protected]>
AuthorDate: Tue, 25 Sep 2018 23:58:38 -0400
Committer: Peter Zijlstra <[email protected]>
CommitDate: Tue, 9 Oct 2018 16:51:11 +0200

x86/mm/tlb: Always use lazy TLB mode

On most workloads, the number of context switches far exceeds the
number of TLB flushes sent. Optimizing the context switches, by always
using lazy TLB mode, speeds up those workloads.

This patch results in about a 1% reduction in CPU use on a two socket
Broadwell system running a memcache like workload.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
(cherry picked from commit 95b0e6357d3e4e05349668940d7ff8f3b7e7e11e)
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/tlbflush.h | 16 ----------------
arch/x86/mm/tlb.c | 15 +--------------
2 files changed, 1 insertion(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 671f65309ce7..d6c0cd9e9591 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -148,22 +148,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
#define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)
#endif

-static inline bool tlb_defer_switch_to_init_mm(void)
-{
- /*
- * If we have PCID, then switching to init_mm is reasonably
- * fast. If we don't have PCID, then switching to init_mm is
- * quite slow, so we try to defer it in the hopes that we can
- * avoid it entirely. The latter approach runs the risk of
- * receiving otherwise unnecessary IPIs.
- *
- * This choice is just a heuristic. The tlb code can handle this
- * function returning true or false regardless of whether we have
- * PCID.
- */
- return !static_cpu_has(X86_FEATURE_PCID);
-}
-
struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6aa195796dec..54a5870190a6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -368,20 +368,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
return;

- if (tlb_defer_switch_to_init_mm()) {
- /*
- * There's a significant optimization that may be possible
- * here. We have accurate enough TLB flush tracking that we
- * don't need to maintain coherence of TLB per se when we're
- * lazy. We do, however, need to maintain coherence of
- * paging-structure caches. We could, in principle, leave our
- * old mm loaded and only switch to init_mm when
- * tlb_remove_page() happens.
- */
- this_cpu_write(cpu_tlbstate.is_lazy, true);
- } else {
- switch_mm(NULL, &init_mm, NULL);
- }
+ this_cpu_write(cpu_tlbstate.is_lazy, true);
}

/*

Subject: [tip:x86/mm] x86/mm/tlb: Restructure switch_mm_irqs_off()

Commit-ID: 12c4d978fd170ccdd7260ec11f93b11e46904228
Gitweb: https://git.kernel.org/tip/12c4d978fd170ccdd7260ec11f93b11e46904228
Author: Rik van Riel <[email protected]>
AuthorDate: Tue, 25 Sep 2018 23:58:39 -0400
Committer: Peter Zijlstra <[email protected]>
CommitDate: Tue, 9 Oct 2018 16:51:11 +0200

x86/mm/tlb: Restructure switch_mm_irqs_off()

Move some code that will be needed for the lazy -> !lazy state
transition when a lazy TLB CPU has gotten out of date.

No functional changes, since the if (real_prev == next) branch
always returns.

(cherry picked from commit 61d0beb5796ab11f7f3bf38cb2eccc6579aaa70b)
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/x86/mm/tlb.c | 66 +++++++++++++++++++++++++++----------------------------
1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 54a5870190a6..9fb30d27854b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -187,6 +187,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
+ bool need_flush;
+ u16 new_asid;

/*
* NB: The scheduler will call us with prev == next when switching
@@ -252,8 +254,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,

return;
} else {
- u16 new_asid;
- bool need_flush;
u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);

/*
@@ -308,44 +308,44 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
/* Let nmi_uaccess_okay() know that we're changing CR3. */
this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
barrier();
+ }

- if (need_flush) {
- this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
- load_new_mm_cr3(next->pgd, new_asid, true);
-
- /*
- * NB: This gets called via leave_mm() in the idle path
- * where RCU functions differently. Tracing normally
- * uses RCU, so we need to use the _rcuidle variant.
- *
- * (There is no good reason for this. The idle code should
- * be rearranged to call this before rcu_idle_enter().)
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
- } else {
- /* The new ASID is already up to date. */
- load_new_mm_cr3(next->pgd, new_asid, false);
-
- /* See above wrt _rcuidle. */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
- }
+ if (need_flush) {
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
+ load_new_mm_cr3(next->pgd, new_asid, true);

/*
- * Record last user mm's context id, so we can avoid
- * flushing branch buffer with IBPB if we switch back
- * to the same user.
+ * NB: This gets called via leave_mm() in the idle path
+ * where RCU functions differently. Tracing normally
+ * uses RCU, so we need to use the _rcuidle variant.
+ *
+ * (There is no good reason for this. The idle code should
+ * be rearranged to call this before rcu_idle_enter().)
*/
- if (next != &init_mm)
- this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
-
- /* Make sure we write CR3 before loaded_mm. */
- barrier();
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ } else {
+ /* The new ASID is already up to date. */
+ load_new_mm_cr3(next->pgd, new_asid, false);

- this_cpu_write(cpu_tlbstate.loaded_mm, next);
- this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+ /* See above wrt _rcuidle. */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
}

+ /*
+ * Record last user mm's context id, so we can avoid
+ * flushing branch buffer with IBPB if we switch back
+ * to the same user.
+ */
+ if (next != &init_mm)
+ this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
+
+ /* Make sure we write CR3 before loaded_mm. */
+ barrier();
+
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
+ this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+
load_mm_cr4(next);
switch_ldt(real_prev, next);
}