From: Lai Jiangshan <[email protected]>
This path is used from:
1. #NMI return.
2. paranoid_exit (i.e. #MCE, #VC, #DB and #DF return)
Contrary to the implication in commit 21e94459110252 ("x86/mm: Optimize
RESTORE_CR3"), the kernel never modifies CR3 in any of these exceptions,
except for switching from user to kernel pagetables under PTI. That
means that most of the time when returning from an exception that
interrupted the kernel no CR3 restore is necessary. Writing CR3 is
expensive on some machines, so this commit avoids redundant writes.
I said "most of the time" because the interrupt might have come during
kernel entry before the user->kernel CR3 switch or during exit after
the kernel->user switch. In the former case skipping the restore might
actually be fine, but definitely not in the latter. So we do still need
to check the saved CR3 and restore it if it's a user CR3.
Note this code is ONLY used for returning _to kernel code_. So the only
times where the CR3 write is necessary are in those rather special cases
mentioned above where we are in kernel _code_ but with a userspace CR3.
While changing this logic the macro is given a new name to clarify its
usage, and a comment that was describing its behaviour at the call site
is removed. We can also simplify the code around the SET_NOFLUSH_BIT
invocation as we no longer need to branch to it from above.
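
For readers who find the control flow easier to follow in C, here is a rough
user-space model of the new macro logic. It is only a sketch: struct cpu_model
and paranoid_restore_cr3() are hypothetical names invented for illustration,
the bit positions (12 for the user pagetable bit, 63 for NOFLUSH) are
assumptions about the kernel's definitions, and the flush mask is narrowed to
16 bits for simplicity. It is not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions; see the kernel headers for the real definitions. */
#define PTI_USER_PGTABLE_BIT 12 /* set in CR3 when user page tables are live */
#define CR3_NOFLUSH_BIT      63 /* CR3 write keeps TLB entries for the PCID  */

/* Hypothetical stand-in for per-CPU state; not a kernel structure. */
struct cpu_model {
	bool     has_pti;              /* models X86_FEATURE_PTI                */
	bool     has_pcid;             /* models X86_FEATURE_PCID               */
	uint64_t cr3;                  /* modelled CR3 register                 */
	uint16_t user_pcid_flush_mask; /* one pending-flush bit per user ASID   */
	unsigned cr3_writes;           /* counts modelled CR3 writes            */
};

/* Models PARANOID_RESTORE_CR3: write CR3 only if the saved value is a user CR3. */
static void paranoid_restore_cr3(struct cpu_model *cpu, uint64_t saved_cr3)
{
	if (!cpu->has_pti)
		return; /* no PTI: CR3 was never switched */

	/* Kernel page tables were live at entry: nothing to restore. */
	if (!(saved_cr3 & (1ULL << PTI_USER_PGTABLE_BIT)))
		return;

	if (cpu->has_pcid) {
		unsigned asid = saved_cr3 & 0x7FF;
		/* Model simplification: fold the ASID into a 16-bit mask. */
		uint16_t bit = (uint16_t)(1u << (asid & 15));
		bool flush_pending = (cpu->user_pcid_flush_mask & bit) != 0;

		/* Like btr: test the pending bit and clear it in one step. */
		cpu->user_pcid_flush_mask &= (uint16_t)~bit;

		/* No flush pending: resume without flushing the TLB. */
		if (!flush_pending)
			saved_cr3 |= 1ULL << CR3_NOFLUSH_BIT;
	}

	cpu->cr3 = saved_cr3;
	cpu->cr3_writes++;
}
```

In this model a saved kernel CR3 never reaches the write, which is the
saving this patch is after. A saved user CR3 is still written, with the
NOFLUSH bit set only when no flush was pending for that ASID.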
Signed-off-by: Lai Jiangshan <[email protected]>
[Rewrote commit message; responded to review comments]
Signed-off-by: Brendan Jackman <[email protected]>
Change-Id: I6e56978c4753fb943a7897ff101f519514fa0827
---
Notes:
v1: https://lore.kernel.org/lkml/[email protected]/
v1->v2: Rewrote some comments, added a proper commit message, cleaned up
the code per tglx's suggestion.
I've kept Lai as the Author. If you prefer for the blame to
record the last person that touched it then that's also fine
though, I can credit Lai as Co-developed-by.
v2: https://lore.kernel.org/lkml/[email protected]/
v2->v3: Clarified the commit message per Dave's suggestion and renamed the
macro. I did not carry PeterZ's ack since I have made some changes.
original v3 (no responses):
https://lore.kernel.org/lkml/[email protected]/
Thanks for the reviews :)
arch/x86/entry/calling.h | 26 ++++++++++----------------
arch/x86/entry/entry_64.S | 7 +++----
2 files changed, 13 insertions(+), 20 deletions(-)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index f6907627172b..25cbfba1fe46 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -233,17 +233,19 @@ For 32-bit we have the following conventions - kernel is built with
.Ldone_\@:
.endm
-.macro RESTORE_CR3 scratch_reg:req save_reg:req
+/* Restore CR3 from a kernel context. May restore a user CR3 value. */
+.macro PARANOID_RESTORE_CR3 scratch_reg:req save_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
- ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
-
/*
- * KERNEL pages can always resume with NOFLUSH as we do
- * explicit flushes.
+ * If CR3 contained the kernel page tables at the paranoid exception
+ * entry, then there is nothing to restore as CR3 is not modified while
+ * handling the exception.
*/
bt $PTI_USER_PGTABLE_BIT, \save_reg
- jnc .Lnoflush_\@
+ jnc .Lend_\@
+
+ ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
/*
* Check if there's a pending flush for the user ASID we're
@@ -251,20 +253,12 @@ For 32-bit we have the following conventions - kernel is built with
*/
movq \save_reg, \scratch_reg
andq $(0x7FF), \scratch_reg
- bt \scratch_reg, THIS_CPU_user_pcid_flush_mask
- jnc .Lnoflush_\@
-
btr \scratch_reg, THIS_CPU_user_pcid_flush_mask
- jmp .Lwrcr3_\@
+ jc .Lwrcr3_\@
-.Lnoflush_\@:
SET_NOFLUSH_BIT \save_reg
.Lwrcr3_\@:
- /*
- * The CR3 write could be avoided when not changing its value,
- * but would require a CR3 read *and* a scratch register.
- */
movq \save_reg, %cr3
.Lend_\@:
.endm
@@ -279,7 +273,7 @@ For 32-bit we have the following conventions - kernel is built with
.endm
.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
.endm
-.macro RESTORE_CR3 scratch_reg:req save_reg:req
+.macro PARANOID_RESTORE_CR3 scratch_reg:req save_reg:req
.endm
#endif
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index de6469dffe3a..d65182500bfe 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -957,14 +957,14 @@ SYM_CODE_START_LOCAL(paranoid_exit)
IBRS_EXIT save_reg=%r15
/*
- * The order of operations is important. RESTORE_CR3 requires
+ * The order of operations is important. PARANOID_RESTORE_CR3 requires
* kernel GSBASE.
*
* NB to anyone to try to optimize this code: this code does
* not execute at all for exceptions from user mode. Those
* exceptions go through error_return instead.
*/
- RESTORE_CR3 scratch_reg=%rax save_reg=%r14
+ PARANOID_RESTORE_CR3 scratch_reg=%rax save_reg=%r14
/* Handle the three GSBASE cases */
ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE
@@ -1393,8 +1393,7 @@ end_repeat_nmi:
/* Always restore stashed SPEC_CTRL value (see paranoid_entry) */
IBRS_EXIT save_reg=%r15
- /* Always restore stashed CR3 value (see paranoid_entry) */
- RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
+ PARANOID_RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
/*
* The above invocation of paranoid_entry stored the GSBASE
--
2.42.0.869.gea05f2083d-goog
On Mon, Jan 8, 2024 at 3:39 AM Brendan Jackman <[email protected]> wrote:
>
> From: Lai Jiangshan <[email protected]>
>
> This path gets used called from:
>
> 1. #NMI return.
> 2. paranoid_exit (i.e. #MCE, #VC, #DB and #DF return)
>
> Contrary to the implication in commit 21e94459110252 ("x86/mm: Optimize
> RESTORE_CR3"), the kernel never modifies CR3 in any of these exceptions,
> except for switching from user to kernel pagetables under PTI. That
> means that most of the time when returning from an exception that
> interrupted the kernel no CR3 restore is necessary. Writing CR3 is
> expensive on some machines, so this commit avoids redundant writes.
>
> I said "most of the time" because the interrupt might have come during
> kernel entry before the user->kernel CR3 switch or the during exit after
> the kernel->user switch. In the former case skipping the restore might
> actually be be fine, but definitely not the latter. So we do still need
> to check the saved CR3 and restore it if it's a user CR3.
>
> Note this code is ONLY used for returning _to kernel code_. So the only
> times where the CR3 write is necessary are in those rather special cases
> mentioned above where we are in kernel _code_ but a userspace CR3.
>
> While changing this logic the macro is given a new name to clarify its
> usage, and a comment that was describing its behaviour at the call site
> is removed. We can also simplify the code around the SET_NOFLUSH_BIT
> invocation as we no longer need to branch to it from above.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> [Rewrote commit message; responded to review comments]
> Signed-off-by: Brendan Jackman <[email protected]>
> Change-Id: I6e56978c4753fb943a7897ff101f519514fa0827
The Change-Id line here needs to be deleted. Otherwise, it seems like
this patch keeps falling through the cracks :)
Is there anything that needs to be done here?
> ---
>
> Notes:
> v1: https://lore.kernel.org/lkml/[email protected]/
>
> v1->v2: Rewrote some comments, added a proper commit message, cleaned up
> the code per tglx's suggestion.
>
> I've kept Lai as the Author. If you prefer for the blame to
> record the last person that touched it then that's also fine
> though, I can credit Lai as Co-developed-by.
>
> v2: https://lore.kernel.org/lkml/[email protected]/
>
> v2->v3: Clarified the commit message per Dave's suggestion and renamed the
> macro. I did not carry PeterZ's ack since I have made some changes.
>
> original v3 (no responses):
> https://lore.kernel.org/lkml/[email protected]/
The following commit has been merged into the x86/entry branch of tip:
Commit-ID: bb998361999e79bc87dae1ebe0f5bf317f632585
Gitweb: https://git.kernel.org/tip/bb998361999e79bc87dae1ebe0f5bf317f632585
Author: Lai Jiangshan <[email protected]>
AuthorDate: Mon, 08 Jan 2024 11:39:50
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 24 Jan 2024 13:57:59 +01:00
x86/entry: Avoid redundant CR3 write on paranoid returns
The CR3 restore happens in:
1. #NMI return.
2. paranoid_exit() (i.e. #MCE, #VC, #DB and #DF return)
Contrary to the implication in commit 21e94459110252 ("x86/mm: Optimize
RESTORE_CR3"), the kernel never modifies CR3 in any of these exceptions,
except for switching from user to kernel pagetables under PTI. That
means that most of the time when returning from an exception that
interrupted the kernel no CR3 restore is necessary. Writing CR3 is
expensive on some machines.
Most of the time because the interrupt might have come during kernel entry
before the user to kernel CR3 switch or during exit after the kernel to
user switch. In the former case skipping the restore would be correct, but
definitely not in the latter.
So check the saved CR3 value and restore it only if it is a user CR3.
Give the macro a new name to clarify its usage, and remove a comment that
was describing the original behaviour along with the no longer needed jump
label.
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Brendan Jackman <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[Rewrote commit message; responded to review comments]
Change-Id: I6e56978c4753fb943a7897ff101f519514fa0827
---
arch/x86/entry/calling.h | 26 ++++++++++----------------
arch/x86/entry/entry_64.S | 7 +++----
2 files changed, 13 insertions(+), 20 deletions(-)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9f1d947..92dca4a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -239,17 +239,19 @@ For 32-bit we have the following conventions - kernel is built with
.Ldone_\@:
.endm
-.macro RESTORE_CR3 scratch_reg:req save_reg:req
+/* Restore CR3 from a kernel context. May restore a user CR3 value. */
+.macro PARANOID_RESTORE_CR3 scratch_reg:req save_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
- ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
-
/*
- * KERNEL pages can always resume with NOFLUSH as we do
- * explicit flushes.
+ * If CR3 contained the kernel page tables at the paranoid exception
+ * entry, then there is nothing to restore as CR3 is not modified while
+ * handling the exception.
*/
bt $PTI_USER_PGTABLE_BIT, \save_reg
- jnc .Lnoflush_\@
+ jnc .Lend_\@
+
+ ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
/*
* Check if there's a pending flush for the user ASID we're
@@ -257,20 +259,12 @@ For 32-bit we have the following conventions - kernel is built with
*/
movq \save_reg, \scratch_reg
andq $(0x7FF), \scratch_reg
- bt \scratch_reg, THIS_CPU_user_pcid_flush_mask
- jnc .Lnoflush_\@
-
btr \scratch_reg, THIS_CPU_user_pcid_flush_mask
- jmp .Lwrcr3_\@
+ jc .Lwrcr3_\@
-.Lnoflush_\@:
SET_NOFLUSH_BIT \save_reg
.Lwrcr3_\@:
- /*
- * The CR3 write could be avoided when not changing its value,
- * but would require a CR3 read *and* a scratch register.
- */
movq \save_reg, %cr3
.Lend_\@:
.endm
@@ -285,7 +279,7 @@ For 32-bit we have the following conventions - kernel is built with
.endm
.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
.endm
-.macro RESTORE_CR3 scratch_reg:req save_reg:req
+.macro PARANOID_RESTORE_CR3 scratch_reg:req save_reg:req
.endm
#endif
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c40f89a..aedd169 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -968,14 +968,14 @@ SYM_CODE_START_LOCAL(paranoid_exit)
IBRS_EXIT save_reg=%r15
/*
- * The order of operations is important. RESTORE_CR3 requires
+ * The order of operations is important. PARANOID_RESTORE_CR3 requires
* kernel GSBASE.
*
* NB to anyone to try to optimize this code: this code does
* not execute at all for exceptions from user mode. Those
* exceptions go through error_return instead.
*/
- RESTORE_CR3 scratch_reg=%rax save_reg=%r14
+ PARANOID_RESTORE_CR3 scratch_reg=%rax save_reg=%r14
/* Handle the three GSBASE cases */
ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE
@@ -1404,8 +1404,7 @@ end_repeat_nmi:
/* Always restore stashed SPEC_CTRL value (see paranoid_entry) */
IBRS_EXIT save_reg=%r15
- /* Always restore stashed CR3 value (see paranoid_entry) */
- RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
+ PARANOID_RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
/*
* The above invocation of paranoid_entry stored the GSBASE
[Apologies if you see this as a duplicate, accidentally sent the
original in HTML, please disregard the other one]
Hi Thomas,
I have just noticed that the commit has disappeared from
tip/x86/entry. Is that deliberate?
Thanks,
Brendan
On Wed, 24 Jan 2024 at 19:36, tip-bot2 for Lai Jiangshan
<[email protected]> wrote:
>
> The following commit has been merged into the x86/entry branch of tip:
>
> Commit-ID: bb998361999e79bc87dae1ebe0f5bf317f632585
> Gitweb: https://git.kernel.org/tip/bb998361999e79bc87dae1ebe0f5bf317f632585
> Author: Lai Jiangshan <[email protected]>
> AuthorDate: Mon, 08 Jan 2024 11:39:50
> Committer: Thomas Gleixner <[email protected]>
> CommitterDate: Wed, 24 Jan 2024 13:57:59 +01:00
>
> x86/entry: Avoid redundant CR3 write on paranoid returns
>
> The CR3 restore happens in:
>
> 1. #NMI return.
> 2. paranoid_exit() (i.e. #MCE, #VC, #DB and #DF return)
>
> Contrary to the implication in commit 21e94459110252 ("x86/mm: Optimize
> RESTORE_CR3"), the kernel never modifies CR3 in any of these exceptions,
> except for switching from user to kernel pagetables under PTI. That
> means that most of the time when returning from an exception that
> interrupted the kernel no CR3 restore is necessary. Writing CR3 is
> expensive on some machines.
>
> Most of the time because the interrupt might have come during kernel entry
> before the user to kernel CR3 switch or the during exit after the kernel to
> user switch. In the former case skipping the restore would be correct, but
> definitely not for the latter.
>
> So check the saved CR3 value and restore it only, if it is a user CR3.
>
> Give the macro a new name to clarify its usage, and remove a comment that
> was describing the original behaviour along with the not longer needed jump
> label.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> Signed-off-by: Brendan Jackman <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
>
> [Rewrote commit message; responded to review comments]
> Change-Id: I6e56978c4753fb943a7897ff101f519514fa0827
> ---
> arch/x86/entry/calling.h | 26 ++++++++++----------------
> arch/x86/entry/entry_64.S | 7 +++----
> 2 files changed, 13 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 9f1d947..92dca4a 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -239,17 +239,19 @@ For 32-bit we have the following conventions - kernel is built with
> .Ldone_\@:
> .endm
>
> -.macro RESTORE_CR3 scratch_reg:req save_reg:req
> +/* Restore CR3 from a kernel context. May restore a user CR3 value. */
> +.macro PARANOID_RESTORE_CR3 scratch_reg:req save_reg:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>
> - ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
> -
> /*
> - * KERNEL pages can always resume with NOFLUSH as we do
> - * explicit flushes.
> + * If CR3 contained the kernel page tables at the paranoid exception
> + * entry, then there is nothing to restore as CR3 is not modified while
> + * handling the exception.
> */
> bt $PTI_USER_PGTABLE_BIT, \save_reg
> - jnc .Lnoflush_\@
> + jnc .Lend_\@
> +
> + ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
>
> /*
> * Check if there's a pending flush for the user ASID we're
> @@ -257,20 +259,12 @@ For 32-bit we have the following conventions - kernel is built with
> */
> movq \save_reg, \scratch_reg
> andq $(0x7FF), \scratch_reg
> - bt \scratch_reg, THIS_CPU_user_pcid_flush_mask
> - jnc .Lnoflush_\@
> -
> btr \scratch_reg, THIS_CPU_user_pcid_flush_mask
> - jmp .Lwrcr3_\@
> + jc .Lwrcr3_\@
>
> -.Lnoflush_\@:
> SET_NOFLUSH_BIT \save_reg
>
> .Lwrcr3_\@:
> - /*
> - * The CR3 write could be avoided when not changing its value,
> - * but would require a CR3 read *and* a scratch register.
> - */
> movq \save_reg, %cr3
> .Lend_\@:
> .endm
> @@ -285,7 +279,7 @@ For 32-bit we have the following conventions - kernel is built with
> .endm
> .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> .endm
> -.macro RESTORE_CR3 scratch_reg:req save_reg:req
> +.macro PARANOID_RESTORE_CR3 scratch_reg:req save_reg:req
> .endm
>
> #endif
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index c40f89a..aedd169 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -968,14 +968,14 @@ SYM_CODE_START_LOCAL(paranoid_exit)
> IBRS_EXIT save_reg=%r15
>
> /*
> - * The order of operations is important. RESTORE_CR3 requires
> + * The order of operations is important. PARANOID_RESTORE_CR3 requires
> * kernel GSBASE.
> *
> * NB to anyone to try to optimize this code: this code does
> * not execute at all for exceptions from user mode. Those
> * exceptions go through error_return instead.
> */
> - RESTORE_CR3 scratch_reg=%rax save_reg=%r14
> + PARANOID_RESTORE_CR3 scratch_reg=%rax save_reg=%r14
>
> /* Handle the three GSBASE cases */
> ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE
> @@ -1404,8 +1404,7 @@ end_repeat_nmi:
> /* Always restore stashed SPEC_CTRL value (see paranoid_entry) */
> IBRS_EXIT save_reg=%r15
>
> - /* Always restore stashed CR3 value (see paranoid_entry) */
> - RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
> + PARANOID_RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
>
> /*
> * The above invocation of paranoid_entry stored the GSBASE
Ah yep, it's there. Checked my history, I was looking in the wrong
branch. Thanks for the correction.
On Mon, 19 Feb 2024 at 15:42, Borislav Petkov <[email protected]> wrote:
>
> On Mon, Feb 19, 2024 at 11:49:46AM +0100, Brendan Jackman wrote:
> > [Apologies if you see this as a duplicate, accidentally sent the
> > original in HTML, please disregard the other one]
> >
> > Hi Thomas,
> >
> > I have just noticed that the commit has disappeared from
> > tip/x86/entry. Is that deliberate?
>
> $ git fetch tip
> $ git log -1 --oneline tip/x86/entry
> bb998361999e (refs/remotes/tip/x86/entry) x86/entry: Avoid redundant CR3 write on paranoid returns
>
> Looks there to me. :)
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Feb 19, 2024 at 11:49:46AM +0100, Brendan Jackman wrote:
> [Apologies if you see this as a duplicate, accidentally sent the
> original in HTML, please disregard the other one]
>
> Hi Thomas,
>
> I have just noticed that the commit has disappeared from
> tip/x86/entry. Is that deliberate?
$ git fetch tip
$ git log -1 --oneline tip/x86/entry
bb998361999e (refs/remotes/tip/x86/entry) x86/entry: Avoid redundant CR3 write on paranoid returns
Looks there to me. :)
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette