From: Lai Jiangshan <[email protected]>
This path gets used called from:
1. #NMI return.
2. paranoid_exit (i.e. #MCE, #VC, #DB and #DF return)
Contrary to the implication in commit 21e94459110252 ("x86/mm: Optimize
RESTORE_CR3"), we never modify CR3 in any of these exceptions, except
for switching from user to kernel pagetables under PTI. That means that
most of the time when returning from an exception that interrupted the
kernel no CR3 restore is necessary. Writing CR3 is expensive on some
machines, so this commit avoids redundant writes.
I said "most of the time" because we might have interrupted the kernel
entry before the user->kernel CR3 switch or the exit after the
kernel->user switch. In the former case skipping the restore might
actually be be fine, but definitely not the latter. So we do still need
to check the saved CR3 and restore it if it's a user CR3.
To reflect the new behaviour RESTORE_CR3 is given a longer name, and a
comment that was describing its behaviour at the call site is removed.
We can also simplify the code around the SET_NOFLUSH_BIT invocation
as we no longer need to branch to it from above.
Signed-off-by: Lai Jiangshan <[email protected]>
[Rewrote commit message; responded to review comments]
Signed-off-by: Brendan Jackman <[email protected]>
---
Notes:
V1: https://lore.kernel.org/lkml/[email protected]/
v1->v2: Rewrote some comments, added a proper commit message, cleaned up
the code per tglx's suggestion.
I've kept Lai as the Author. If you prefer for the blame to
record the last person that touched it then that's also fine
though, I can credit Lai as Co-developed-by.
arch/x86/entry/calling.h | 25 +++++++++----------------
arch/x86/entry/entry_64.S | 7 +++----
2 files changed, 12 insertions(+), 20 deletions(-)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index f6907627172b..84b1e32c27a1 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -233,17 +233,18 @@ For 32-bit we have the following conventions - kernel is built with
.Ldone_\@:
.endm
-.macro RESTORE_CR3 scratch_reg:req save_reg:req
+.macro RESTORE_CR3_IF_USER scratch_reg:req save_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
- ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
-
/*
- * KERNEL pages can always resume with NOFLUSH as we do
- * explicit flushes.
+ * If CR3 contained the kernel page tables at the paranoid exception
+ * entry, then there is nothing to restore as CR3 is not modified while
+ * handling the exception.
*/
bt $PTI_USER_PGTABLE_BIT, \save_reg
- jnc .Lnoflush_\@
+ jnc .Lend_\@
+
+ ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
/*
* Check if there's a pending flush for the user ASID we're
@@ -251,20 +252,12 @@ For 32-bit we have the following conventions - kernel is built with
*/
movq \save_reg, \scratch_reg
andq $(0x7FF), \scratch_reg
- bt \scratch_reg, THIS_CPU_user_pcid_flush_mask
- jnc .Lnoflush_\@
-
btr \scratch_reg, THIS_CPU_user_pcid_flush_mask
- jmp .Lwrcr3_\@
+ jc .Lwrcr3_\@
-.Lnoflush_\@:
SET_NOFLUSH_BIT \save_reg
.Lwrcr3_\@:
- /*
- * The CR3 write could be avoided when not changing its value,
- * but would require a CR3 read *and* a scratch register.
- */
movq \save_reg, %cr3
.Lend_\@:
.endm
@@ -279,7 +272,7 @@ For 32-bit we have the following conventions - kernel is built with
.endm
.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
.endm
-.macro RESTORE_CR3 scratch_reg:req save_reg:req
+.macro RESTORE_CR3_IF_USER scratch_reg:req save_reg:req
.endm
#endif
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 43606de22511..ff73767b5d1f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1019,14 +1019,14 @@ SYM_CODE_START_LOCAL(paranoid_exit)
IBRS_EXIT save_reg=%r15
/*
- * The order of operations is important. RESTORE_CR3 requires
+ * The order of operations is important. RESTORE_CR3_IF_USER requires
* kernel GSBASE.
*
* NB to anyone to try to optimize this code: this code does
* not execute at all for exceptions from user mode. Those
* exceptions go through error_return instead.
*/
- RESTORE_CR3 scratch_reg=%rax save_reg=%r14
+ RESTORE_CR3_IF_USER scratch_reg=%rax save_reg=%r14
/* Handle the three GSBASE cases */
ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE
@@ -1457,8 +1457,7 @@ end_repeat_nmi:
/* Always restore stashed SPEC_CTRL value (see paranoid_entry) */
IBRS_EXIT save_reg=%r15
- /* Always restore stashed CR3 value (see paranoid_entry) */
- RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
+ RESTORE_CR3_IF_USER scratch_reg=%r15 save_reg=%r14
/*
* The above invocation of paranoid_entry stored the GSBASE
--
2.42.0.459.ge4e396fd5e-goog
Hi Thomas, others, any thoughts on this one?
On Fri, Oct 27 2023 at 14:22, Brendan Jackman wrote:
> Hi Thomas, others, any thoughts on this one?
Oops. This clearly fell through the cracks and now it's a tad late for
6.7. Let me stare at it though.
Thanks,
tglx
On Wed, Sep 20, 2023 at 03:04:43PM +0000, Brendan Jackman wrote:
> From: Lai Jiangshan <[email protected]>
>
> This path gets used called from:
>
> 1. #NMI return.
> 2. paranoid_exit (i.e. #MCE, #VC, #DB and #DF return)
>
> Contrary to the implication in commit 21e94459110252 ("x86/mm: Optimize
> RESTORE_CR3"), we never modify CR3 in any of these exceptions, except
> for switching from user to kernel pagetables under PTI. That means that
> most of the time when returning from an exception that interrupted the
> kernel no CR3 restore is necessary. Writing CR3 is expensive on some
> machines, so this commit avoids redundant writes.
>
> I said "most of the time" because we might have interrupted the kernel
> entry before the user->kernel CR3 switch or the exit after the
> kernel->user switch. In the former case skipping the restore might
> actually be be fine, but definitely not the latter. So we do still need
> to check the saved CR3 and restore it if it's a user CR3.
>
> To reflect the new behaviour RESTORE_CR3 is given a longer name, and a
> comment that was describing its behaviour at the call site is removed.
> We can also simplify the code around the SET_NOFLUSH_BIT invocation
> as we no longer need to branch to it from above.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> [Rewrote commit message; responded to review comments]
> Signed-off-by: Brendan Jackman <[email protected]>
> ---
Seems sensible, although I do wonder what made you care enough to
optimize the PTI paranoid path... :-)
Acked-by: Peter Zijlstra (Intel) <[email protected]>
On 9/20/23 08:04, Brendan Jackman wrote:
> From: Lai Jiangshan <[email protected]>
>
> This path gets used called from:
>
> 1. #NMI return.
> 2. paranoid_exit (i.e. #MCE, #VC, #DB and #DF return)
>
> Contrary to the implication in commit 21e94459110252 ("x86/mm: Optimize
> RESTORE_CR3"), we never modify CR3 in any of these exceptions, except
> for switching from user to kernel pagetables under PTI. That means that
> most of the time when returning from an exception that interrupted the
> kernel no CR3 restore is necessary. Writing CR3 is expensive on some
> machines, so this commit avoids redundant writes.
Please avoid "we's" in changelogs.
One thing that I think is key to this patch, but which is muddled up in
that changelog:
RESTORE_CR3 is *ONLY* used when returning to the kernel.
It must handle user CR3 values because the kernel can run in the
entry/exit code with user CR3 values. That means that restoring
user CR3 values is important functionally but is actually rare
in practice.
That's _not_ obvious from just glancing at RESTORE_CR3 call sites or
reading your changelog. It also makes it obvious why this patch is
worth it: it optimizes the overwhelmingly common case.
> I said "most of the time" because we might have interrupted the kernel
> entry before the user->kernel CR3 switch or the exit after the
> kernel->user switch. In the former case skipping the restore might
> actually be be fine, but definitely not the latter. So we do still need
> to check the saved CR3 and restore it if it's a user CR3.
>
> To reflect the new behaviour RESTORE_CR3 is given a longer name, and a
> comment that was describing its behaviour at the call site is removed.
> We can also simplify the code around the SET_NOFLUSH_BIT invocation
> as we no longer need to branch to it from above.
Also, I don't feel _that_ strongly about the naming, but I'd kinda
rather it be:
PARANOID_RESTORE_CR3
That would at least label it explicitly for these paranoid exit cases.
Sure, this _can_ get used in other contexts theoretically, but it's
really only suitable for the two paranoid exit paths.
It also avoids confusion about whether the "USER" refers to the context
or the CR3 value.
I would probably also give this CR3 function a declared purpose:
/* Restore CR3 from a kernel context. May restore a user CR3 value. */