From: Lai Jiangshan <[email protected]>
error_entry() calls sync_regs() to settle/copy the pt_regs and switches
the stack directly after sync_regs(). But error_entry() itself is also
a function call, the switching has to handle the return address of it
together, which causes the work complicated and tangly.
Switching to the stack after error_entry() makes the code simpler and
intuitive.
The behavior/logic is unchanged:
1) (opt) feed fixup_bad_iret() with the pt_regs pushed by ASM code
2) (opt) fixup_bad_iret() moves the partial pt_regs up
3) feed sync_regs() with the pt_regs pushed by ASM code or returned
by fixup_bad_iret()
4) sync_regs() copies the whole pt_regs to kernel stack if needed
5) after error_entry() and switching %rsp, it is in kernel stack with
the pt_regs
Changes only in calling:
Old code switches to copied pt_regs immediately twice in
error_entry() while new code switches to the copied pt_regs only
once after error_entry() returns.
It is correct since sync_regs() doesn't need to be called close
to the pt_regs it handles.
Old code stashes the return-address of error_entry() in a scratch
register and new code doesn't stash it.
It relies on the fact that fixup_bad_iret() and sync_regs() don't
corrupt the return-address of error_entry() on the stack. But even
the old code also relies on the fact that fixup_bad_iret() and
sync_regs() don't corrupt the return-address of themselves.
They are the same reliances and are assured.
After this change, error_entry() will not do fancy things with the stack
except when in the prolog which will be fixed in the next patch ("move
PUSH_AND_CLEAR_REGS out of error_entry"). This patch and the next patch
can't be swapped because the next patch relies on this patch's stopping
fiddling with the return-address of error_entry(), otherwise the objtool
would complain.
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 37505331b7f1..7768cdd0c7ed 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -326,6 +326,8 @@ SYM_CODE_END(ret_from_fork)
.macro idtentry_body cfunc has_error_code:req
call error_entry
+ movq %rax, %rsp /* switch stack settled by sync_regs() */
+ ENCODE_FRAME_POINTER
UNWIND_HINT_REGS
movq %rsp, %rdi /* pt_regs pointer into 1st argument*/
@@ -1014,14 +1016,10 @@ SYM_CODE_START_LOCAL(error_entry)
/* We have user CR3. Change to kernel CR3. */
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+ leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
.Lerror_entry_from_usermode_after_swapgs:
/* Put us onto the real thread stack. */
- popq %r12 /* save return addr in %12 */
- movq %rsp, %rdi /* arg0 = pt_regs pointer */
call sync_regs
- movq %rax, %rsp /* switch stack */
- ENCODE_FRAME_POINTER
- pushq %r12
RET
/*
@@ -1053,6 +1051,7 @@ SYM_CODE_START_LOCAL(error_entry)
*/
.Lerror_entry_done_lfence:
FENCE_SWAPGS_KERNEL_ENTRY
+ leaq 8(%rsp), %rax /* return pt_regs pointer */
RET
.Lbstep_iret:
@@ -1073,12 +1072,9 @@ SYM_CODE_START_LOCAL(error_entry)
* Pretend that the exception came from user mode: set up pt_regs
* as if we faulted immediately after IRET.
*/
- popq %r12 /* save return addr in %12 */
- movq %rsp, %rdi /* arg0 = pt_regs pointer */
+ leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
call fixup_bad_iret
- mov %rax, %rsp
- ENCODE_FRAME_POINTER
- pushq %r12
+ mov %rax, %rdi
jmp .Lerror_entry_from_usermode_after_swapgs
SYM_CODE_END(error_entry)
--
2.19.1.6.gb485710b
On Tue, Mar 15, 2022 at 03:39:45PM +0800, Lai Jiangshan wrote:
> .Lerror_entry_from_usermode_after_swapgs:
> /* Put us onto the real thread stack. */
This comment is no longer correct.
--
Josh
On Tue, Mar 15, 2022 at 03:39:45PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> error_entry() calls sync_regs() to settle/copy the pt_regs and switches
> the stack directly after sync_regs(). But error_entry() itself is also
> a function call, the switching has to handle the return address of it
> together, which causes the work complicated and tangly.
>
> Switching to the stack after error_entry() makes the code simpler and
> intuitive.
>
> The behavior/logic is unchanged:
> 1) (opt) feed fixup_bad_iret() with the pt_regs pushed by ASM code
> 2) (opt) fixup_bad_iret() moves the partial pt_regs up
> 3) feed sync_regs() with the pt_regs pushed by ASM code or returned
> by fixup_bad_iret()
> 4) sync_regs() copies the whole pt_regs to kernel stack if needed
> 5) after error_entry() and switching %rsp, it is in kernel stack with
> the pt_regs
>
> Changes only in calling:
> Old code switches to copied pt_regs immediately twice in
> error_entry() while new code switches to the copied pt_regs only
> once after error_entry() returns.
> It is correct since sync_regs() doesn't need to be called close
> to the pt_regs it handles.
>
> Old code stashes the return-address of error_entry() in a scratch
> register and new code doesn't stash it.
> It relies on the fact that fixup_bad_iret() and sync_regs() don't
> corrupt the return-address of error_entry() on the stack. But even
> the old code also relies on the fact that fixup_bad_iret() and
> sync_regs() don't corrupt the return-address of themselves.
> They are the same reliances and are assured.
>
> After this change, error_entry() will not do fancy things with the stack
> except when in the prolog which will be fixed in the next patch ("move
> PUSH_AND_CLEAR_REGS out of error_entry"). This patch and the next patch
> can't be swapped because the next patch relies on this patch's stopping
> fiddling with the return-address of error_entry(), otherwise the objtool
> would complain.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/entry/entry_64.S | 16 ++++++----------
> 1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 37505331b7f1..7768cdd0c7ed 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -326,6 +326,8 @@ SYM_CODE_END(ret_from_fork)
> .macro idtentry_body cfunc has_error_code:req
>
> call error_entry
> + movq %rax, %rsp /* switch stack settled by sync_regs() */
> + ENCODE_FRAME_POINTER
> UNWIND_HINT_REGS
>
> movq %rsp, %rdi /* pt_regs pointer into 1st argument*/
> @@ -1014,14 +1016,10 @@ SYM_CODE_START_LOCAL(error_entry)
> /* We have user CR3. Change to kernel CR3. */
> SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
>
> + leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
> .Lerror_entry_from_usermode_after_swapgs:
> /* Put us onto the real thread stack. */
> - popq %r12 /* save return addr in %12 */
> - movq %rsp, %rdi /* arg0 = pt_regs pointer */
> call sync_regs
> - movq %rax, %rsp /* switch stack */
> - ENCODE_FRAME_POINTER
> - pushq %r12
> RET
>
> /*
> @@ -1053,6 +1051,7 @@ SYM_CODE_START_LOCAL(error_entry)
> */
> .Lerror_entry_done_lfence:
> FENCE_SWAPGS_KERNEL_ENTRY
> + leaq 8(%rsp), %rax /* return pt_regs pointer */
> RET
>
> .Lbstep_iret:
> @@ -1073,12 +1072,9 @@ SYM_CODE_START_LOCAL(error_entry)
> * Pretend that the exception came from user mode: set up pt_regs
> * as if we faulted immediately after IRET.
> */
> - popq %r12 /* save return addr in %12 */
> - movq %rsp, %rdi /* arg0 = pt_regs pointer */
> + leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
> call fixup_bad_iret
> - mov %rax, %rsp
> - ENCODE_FRAME_POINTER
> - pushq %r12
> + mov %rax, %rdi
> jmp .Lerror_entry_from_usermode_after_swapgs
> SYM_CODE_END(error_entry)
So the new Changelog doesn't seem to help me much. But looking at both
fixup_bad_iret() and sync_regs(), they both have:
__this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1
as hard-coded destination. Now, fixup_bad_iret() sets up a complete
ptregs there and then returns a pointer to this stack.
sync_regs otoh, does a straight up pt_regs sized copy from arg0 to this
new stack.
Therefore it appears to me that doing sync_regs() after fixup_bad_iret()
is a complete NO-OP and only confuses things further.
Would not something like the below clarify things?
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1004,6 +1004,7 @@ SYM_CODE_START_LOCAL(error_entry)
.Lerror_entry_from_usermode_after_swapgs:
/* Put us onto the real thread stack. */
call sync_regs
+.Lerror_entry_from_usermode_after_sync_regs:
RET
/*
@@ -1058,8 +1059,12 @@ SYM_CODE_START_LOCAL(error_entry)
*/
leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
call fixup_bad_iret
- mov %rax, %rdi
- jmp .Lerror_entry_from_usermode_after_swapgs
+ /*
+ * fixup_bad_iret() will have setup pt_regs on the thread stack, and
+ * returns a pointer to that stack exactly like sync_regs() would've
+ * done. As such, calling sync_regs again makes no sense.
+ */
+ jmp .Lerror_entry_from_usermode_after_sync_regs
SYM_CODE_END(error_entry)
SYM_CODE_START_LOCAL(error_return)
On Wed, Mar 16, 2022 at 10:37 AM Josh Poimboeuf <[email protected]> wrote:
>
> On Tue, Mar 15, 2022 at 03:39:45PM +0800, Lai Jiangshan wrote:
> > .Lerror_entry_from_usermode_after_swapgs:
> > /* Put us onto the real thread stack. */
>
> This comment is no longer correct.
Thanks for the review.
I thought "us" is the pt_regs. But it can be everything including
the "running environment" and "running stack".
I will change "us" to "the pt_regs"
>
> --
> Josh
>
On Wed, Mar 16, 2022 at 11:05 PM Peter Zijlstra <[email protected]> wrote:
>
> > SYM_CODE_END(error_entry)
>
> So the new Changelog doesn't seem to help me much. But looking at both
> fixup_bad_iret() and sync_regs(), they both have:
>
> __this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1
For a long time in old days (before KPTI), tss.sp0 does be the kernel
thread stack, but now is not the kernel thread stack, rather it is the
trampoline stack (or entry stack) in the cpu entry area.
And bad IRET can happen when doing IRET to return to userspace and it
is also the trampoline stack. So fixup_bad_iret() is really just moving
the lower partial pt_regs up to concat with user IRET frame head
in the entry stack.
So fixup_bad_iret() will NOT have setup pt_regs on the thread stack.
And sync_regs() is needed after fixup_bad_iret()
The patch
https://lore.kernel.org/lkml/[email protected]/
tried changing fixup_bad_iret() to copy the pt_regs directly to
the kernel stack.
And in the V1 of the patchset that converting ASM to code also
tried it:
https://lore.kernel.org/lkml/[email protected]/
And in the current patchset, it focuses on ASM code only. I don't think
we need to change it. It would be much simpler to change the behavior
of fixup_bad_iret() when error_entry() is converted to C.
>
> as hard-coded destination. Now, fixup_bad_iret() sets up a complete
> ptregs there and then returns a pointer to this stack.
>
> sync_regs otoh, does a straight up pt_regs sized copy from arg0 to this
> new stack.
>
> Therefore it appears to me that doing sync_regs() after fixup_bad_iret()
> is a complete NO-OP and only confuses things further.
>
> Would not something like the below clarify things?
>
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1004,6 +1004,7 @@ SYM_CODE_START_LOCAL(error_entry)
> .Lerror_entry_from_usermode_after_swapgs:
> /* Put us onto the real thread stack. */
> call sync_regs
> +.Lerror_entry_from_usermode_after_sync_regs:
> RET
>
> /*
> @@ -1058,8 +1059,12 @@ SYM_CODE_START_LOCAL(error_entry)
> */
> leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
> call fixup_bad_iret
> - mov %rax, %rdi
> - jmp .Lerror_entry_from_usermode_after_swapgs
> + /*
> + * fixup_bad_iret() will have setup pt_regs on the thread stack, and
> + * returns a pointer to that stack exactly like sync_regs() would've
> + * done. As such, calling sync_regs again makes no sense.
> + */
> + jmp .Lerror_entry_from_usermode_after_sync_regs
> SYM_CODE_END(error_entry)
>
> SYM_CODE_START_LOCAL(error_return)
On Thu, Mar 17, 2022 at 12:43:02AM +0800, Lai Jiangshan wrote:
> On Wed, Mar 16, 2022 at 11:05 PM Peter Zijlstra <[email protected]> wrote:
> >
> > > SYM_CODE_END(error_entry)
> >
> > So the new Changelog doesn't seem to help me much. But looking at both
> > fixup_bad_iret() and sync_regs(), they both have:
> >
> > __this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1
>
Yes, I can't read :/ Functions don't both use that.
Your patch is correct, not switching the stack in between runs sync_regs
from another stack, but the end result should be the same.