Thanks to the reviewers and Andy Lutomirski for suggesting the use of
ctx_id, which gets rid of the problem of mm pointer recycling.
Here's an update of this patch based on Andy's suggestion.
We could switch to a kernel idle thread and then back to the original
process, for example:

process A -> idle -> process A

In such a scenario, we do not have to do IBPB here even though the
process is non-dumpable, as we are switching back to the same process
after a hiatus.
We track the last user mm's context id before switching to init_mm,
which happens via leave_mm() when tlb_defer_switch_to_init_mm() returns
false (PCID available).

The cost is an extra u64 in the per-cpu tlb_state to track the context
id of the last mm we were using before switching to the init_mm used
by idle. Avoiding the extra IBPB is probably worth the extra memory
for this common scenario.
For the cases where tlb_defer_switch_to_init_mm() returns true (no
PCID), lazy TLB defers the switch to init_mm, so we do not change the
mm across the process A -> idle -> process A switch and IBPB is
skipped for that case as well.
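
To illustrate the intended behavior, here is a toy userspace model of
the decision (not the kernel code itself; INIT_MM_CTX_ID, the ctx_id
arguments and the printf() calls below are stand-ins for &init_mm,
mm->context.ctx_id and indirect_branch_prediction_barrier()):

/*
 * Toy model of the IBPB-avoidance logic in this patch; "dumpable"
 * models get_dumpable(mm) == SUID_DUMP_USER.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define INIT_MM_CTX_ID	0		/* stand-in for &init_mm */

static uint64_t last_ctx_id;		/* models cpu_tlbstate.last_ctx_id */

static void switch_to(uint64_t next_ctx_id, bool dumpable)
{
	/* IBPB only when entering a different, non-dumpable user mm. */
	if (next_ctx_id != INIT_MM_CTX_ID &&
	    next_ctx_id != last_ctx_id && !dumpable)
		printf("IBPB before ctx %llu\n",
		       (unsigned long long)next_ctx_id);
	else
		printf("no IBPB for ctx %llu\n",
		       (unsigned long long)next_ctx_id);

	/* Only remember real user mms, never the init_mm used by idle. */
	if (next_ctx_id != INIT_MM_CTX_ID)
		last_ctx_id = next_ctx_id;
}

int main(void)
{
	switch_to(1, false);			/* process A: IBPB */
	switch_to(INIT_MM_CTX_ID, true);	/* idle: no IBPB, keep last_ctx_id */
	switch_to(1, false);			/* back to A: IBPB skipped */
	switch_to(2, false);			/* different process B: IBPB */
	return 0;
}

Running this does IBPB only for the first switch to A and for the
switch to B; the A -> idle -> A return skips it, which is exactly the
case the last_ctx_id tracking is meant to cover.
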
v2:
1. Save the last user context id instead of the last user mm to avoid the problem of a recycled mm pointer
Signed-off-by: Tim Chen <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 2 ++
arch/x86/mm/tlb.c | 23 ++++++++++++++++-------
2 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3effd3c..4405c4b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -174,6 +174,8 @@ struct tlb_state {
struct mm_struct *loaded_mm;
u16 loaded_mm_asid;
u16 next_asid;
+ /* last user mm's ctx id */
+ u64 last_ctx_id;
/*
* We can be in one of several states:
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 33f5f97..2179b90 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -220,6 +220,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
} else {
u16 new_asid;
bool need_flush;
+ u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);
/*
* Avoid user/user BTB poisoning by flushing the branch predictor
@@ -230,14 +231,13 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* switching into processes that disable dumping.
*
* This will not flush branches when switching into kernel
- * threads, but it would flush them when switching to the
- * idle thread and back.
- *
- * It might be useful to have a one-off cache here
- * to also not flush the idle case, but we would need some
- * kind of stable sequence number to remember the previous mm.
+ * threads. It will also not flush if we switch to idle
+ * thread and back to the same process. It will flush if we
+ * switch to a different non-dumpable process.
*/
- if (tsk && tsk->mm && get_dumpable(tsk->mm) != SUID_DUMP_USER)
+ if (tsk && tsk->mm &&
+ tsk->mm->context.ctx_id != last_ctx_id &&
+ get_dumpable(tsk->mm) != SUID_DUMP_USER)
indirect_branch_prediction_barrier();
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -288,6 +288,14 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
}
+ /*
+ * Record last user mm's context id, so we can avoid
+ * flushing branch buffer with IBPB if we switch back
+ * to the same user.
+ */
+ if (next != &init_mm)
+ this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
+
this_cpu_write(cpu_tlbstate.loaded_mm, next);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
}
@@ -365,6 +373,7 @@ void initialize_tlbstate_and_flush(void)
write_cr3(build_cr3(mm->pgd, 0));
/* Reinitialize tlbstate. */
+ this_cpu_write(cpu_tlbstate.last_ctx_id, mm->context.ctx_id);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
--
2.9.4
* Tim Chen <[email protected]> wrote:
> arch/x86/include/asm/tlbflush.h | 2 ++
> arch/x86/mm/tlb.c | 23 ++++++++++++++++-------
> 2 files changed, 18 insertions(+), 7 deletions(-)
What tree is this patch against? It doesn't apply to linus's latest, nor to
tip:master.
Thanks,
Ingo
On Sun, 2018-01-28 at 10:56 +0100, Ingo Molnar wrote:
>
> What tree is this patch against? It doesn't apply to linus's latest, nor to
> tip:master.
It's in my tree at
http://git.infradead.org/users/dwmw2/linux-retpoline.git/shortlog/refs/heads/ibpb
which is gradually being finalized and flushed via tip/x86/pti.
The IBPB on context switch parts are currently the first three patches
there, which roughly suggests they might be the next to get sent out
for real, if we have reached a consensus. The three patches there
probably want collapsing into one, but I've left them as-is for now
while we're discussing it.
The other thing that's next on the list is exposing the MSRs to guests.
The IBPB one is fairly simple, and Karim is working on exposing IBRS to
guests too, using Paolo's per-vCPU MSR bitmap to do the same trick
we've done in Xen, to expose it only after the guest first touches it
(to avoid the cost of swapping it when it's always 0←→0 in the common
case). I think Ashok was talking about doing the same thing? We'll see
who gets there first :)