Hi-
In entry_64.S, we have:
ENTRY(entry_SYSCALL_64)
/*
* Interrupts are off on entry.
* We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
* it is too small to ever cause noticeable irq latency.
*/
SWAPGS_UNSAFE_STACK
/*
* A hypervisor implementation might want to use a label
* after the swapgs, so that it can do the swapgs
* for the guest and jump here on syscall.
*/
GLOBAL(entry_SYSCALL_64_after_swapgs)
movq %rsp, PER_CPU_VAR(rsp_scratch)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
It would be really, really nice if Xen entered the SYSCALL path
*after* the mov to %rsp.
Similarly, we have:
movq RSP(%rsp), %rsp
/* big comment */
USERGS_SYSRET64
It would be really nice if we could just mov to %rsp, swapgs, and
sysret, without worrying that the sysret is actually a jump on Xen.
I suspect that making Xen stop using these code paths would actually
be a simplification. On SYSCALL entry, Xen lands in
xen_syscall_target (AFAICT) and clearly has rsp pointing somewhere
valid. Xen obligingly shoves the user RSP into the hardware RSP
register and jumps into the entry code.
Is that stuff running on the normal kernel stack? If so, can we just
enter later on:
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
<-- Xen enters here
pushq %rax /* pt_regs->orig_ax */
pushq %rdi /* pt_regs->di */
pushq %rsi /* pt_regs->si */
pushq %rdx /* pt_regs->dx */
For SYSRET, I think the way to go is to force Xen to always use the
syscall slow path. Instead, Xen could hook into
syscall_return_via_sysret or even right before the opportunistic
sysret stuff. Then we could remove the USERGS_SYSRET hooks entirely.
Would this work?
--Andy
On 23/07/2015 17:49, Andy Lutomirski wrote:
> Hi-
Hi. Apologies for the delay. I have been out of the office for a few days.
>
> In entry_64.S, we have:
>
> ENTRY(entry_SYSCALL_64)
> /*
> * Interrupts are off on entry.
> * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
> * it is too small to ever cause noticeable irq latency.
> */
> SWAPGS_UNSAFE_STACK
> /*
> * A hypervisor implementation might want to use a label
> * after the swapgs, so that it can do the swapgs
> * for the guest and jump here on syscall.
> */
> GLOBAL(entry_SYSCALL_64_after_swapgs)
>
> movq %rsp, PER_CPU_VAR(rsp_scratch)
> movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>
> It would be really, really nice if Xen entered the SYSCALL path
> *after* the mov to %rsp.
>
> Similarly, we have:
>
> movq RSP(%rsp), %rsp
> /* big comment */
> USERGS_SYSRET64
>
> It would be really nice if we could just mov to %rsp, swapgs, and
> sysret, without worrying that the sysret is actually a jump on Xen.
>
> I suspect that making Xen stop using these code paths would actually
> be a simplification. On SYSCALL entry, Xen lands in
> xen_syscall_target (AFAICT) and clearly has rsp pointing somewhere
> valid. Xen obligingly shoves the user RSP into the hardware RSP
> register and jumps into the entry code.
>
> Is that stuff running on the normal kernel stack?
Yes. The Xen ABI takes what is essentially tss->esp0 and uses that stack
for all "switch to kernel" actions, including syscall and sysenter.
> If so, can we just
> enter later on:
>
> pushq %r11 /* pt_regs->flags */
> pushq $__USER_CS /* pt_regs->cs */
> pushq %rcx /* pt_regs->ip */
>
> <-- Xen enters here
>
> pushq %rax /* pt_regs->orig_ax */
> pushq %rdi /* pt_regs->di */
> pushq %rsi /* pt_regs->si */
> pushq %rdx /* pt_regs->dx */
This looks plausible, and indeed preferable to the current doublestep
with undo_xen_syscall.
One slight complication is that xen_enable_syscall() will want to
special case register_callback() to not set CALLBACKF_mask_events, as
the entry point is now after re-enabling interrupts.
>
> For SYSRET, I think the way to go is to force Xen to always use the
> syscall slow path. Instead, Xen could hook into
> syscall_return_via_sysret or even right before the opportunistic
> sysret stuff. Then we could remove the USERGS_SYSRET hooks entirely.
>
> Would this work?
None of the opportunistic sysret stuff makes sense under Xen. The path
will inevitably end up in xen_iret making a hypercall. Short circuiting
all of this seems like a good idea, especially if it allows for the
removal of the UERGS_SYSRET.
~Andrew
On Sun, Jul 26, 2015 at 12:34 PM, Andrew Cooper
<[email protected]> wrote:
> On 23/07/2015 17:49, Andy Lutomirski wrote:
>> Hi-
>
> Hi. Apologies for the delay. I have been out of the office for a few days.
>
>>
>> In entry_64.S, we have:
>>
>> ENTRY(entry_SYSCALL_64)
>> /*
>> * Interrupts are off on entry.
>> * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
>> * it is too small to ever cause noticeable irq latency.
>> */
>> SWAPGS_UNSAFE_STACK
>> /*
>> * A hypervisor implementation might want to use a label
>> * after the swapgs, so that it can do the swapgs
>> * for the guest and jump here on syscall.
>> */
>> GLOBAL(entry_SYSCALL_64_after_swapgs)
>>
>> movq %rsp, PER_CPU_VAR(rsp_scratch)
>> movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>>
>> It would be really, really nice if Xen entered the SYSCALL path
>> *after* the mov to %rsp.
>>
>> Similarly, we have:
>>
>> movq RSP(%rsp), %rsp
>> /* big comment */
>> USERGS_SYSRET64
>>
>> It would be really nice if we could just mov to %rsp, swapgs, and
>> sysret, without worrying that the sysret is actually a jump on Xen.
>>
>> I suspect that making Xen stop using these code paths would actually
>> be a simplification. On SYSCALL entry, Xen lands in
>> xen_syscall_target (AFAICT) and clearly has rsp pointing somewhere
>> valid. Xen obligingly shoves the user RSP into the hardware RSP
>> register and jumps into the entry code.
>>
>> Is that stuff running on the normal kernel stack?
>
> Yes. The Xen ABI takes what is essentially tss->esp0 and uses that stack
> for all "switch to kernel" actions, including syscall and sysenter.
>
>> If so, can we just
>> enter later on:
>>
>> pushq %r11 /* pt_regs->flags */
>> pushq $__USER_CS /* pt_regs->cs */
>> pushq %rcx /* pt_regs->ip */
>>
>> <-- Xen enters here
>>
>> pushq %rax /* pt_regs->orig_ax */
>> pushq %rdi /* pt_regs->di */
>> pushq %rsi /* pt_regs->si */
>> pushq %rdx /* pt_regs->dx */
>
> This looks plausible, and indeed preferable to the current doublestep
> with undo_xen_syscall.
>
> One slight complication is that xen_enable_syscall() will want to
> special case register_callback() to not set CALLBACKF_mask_events, as
> the entry point is now after re-enabling interrupts.
I wouldn't do that. Let's just move the ENABLE_INTERRUPTS a few
instructions later even on native -- I want to do that anyway.
>
>>
>> For SYSRET, I think the way to go is to force Xen to always use the
>> syscall slow path. Instead, Xen could hook into
>> syscall_return_via_sysret or even right before the opportunistic
>> sysret stuff. Then we could remove the USERGS_SYSRET hooks entirely.
>>
>> Would this work?
>
> None of the opportunistic sysret stuff makes sense under Xen. The path
> will inevitably end up in xen_iret making a hypercall. Short circuiting
> all of this seems like a good idea, especially if it allows for the
> removal of the UERGS_SYSRET.
Doesn't Xen decide what to do based on VGCF_IN_SYSCALL? Maybe Xen
should have its own opportunistic VGCF_IN_SYSCALL logic.
Hmm, maybe some of this would be easier to think about if, rather than
having a paravirt op, we could have:
ALTERNATIVE "", "jmp xen_pop_things_and_iret", X86_FEATURE_XEN
Or just IF_XEN("jmp ...");
As a practical matter, x86_64 has native and Xen -- I don't think
there's any other paravirt platform that needs the asm hooks.
--Andy
On 26/07/2015 23:08, Andy Lutomirski wrote:
>
>>> If so, can we just
>>> enter later on:
>>>
>>> pushq %r11 /* pt_regs->flags */
>>> pushq $__USER_CS /* pt_regs->cs */
>>> pushq %rcx /* pt_regs->ip */
>>>
>>> <-- Xen enters here
>>>
>>> pushq %rax /* pt_regs->orig_ax */
>>> pushq %rdi /* pt_regs->di */
>>> pushq %rsi /* pt_regs->si */
>>> pushq %rdx /* pt_regs->dx */
>> This looks plausible, and indeed preferable to the current doublestep
>> with undo_xen_syscall.
>>
>> One slight complication is that xen_enable_syscall() will want to
>> special case register_callback() to not set CALLBACKF_mask_events, as
>> the entry point is now after re-enabling interrupts.
> I wouldn't do that. Let's just move the ENABLE_INTERRUPTS a few
> instructions later even on native -- I want to do that anyway.
That would also work.
>
>>> For SYSRET, I think the way to go is to force Xen to always use the
>>> syscall slow path. Instead, Xen could hook into
>>> syscall_return_via_sysret or even right before the opportunistic
>>> sysret stuff. Then we could remove the USERGS_SYSRET hooks entirely.
>>>
>>> Would this work?
>> None of the opportunistic sysret stuff makes sense under Xen. The path
>> will inevitably end up in xen_iret making a hypercall. Short circuiting
>> all of this seems like a good idea, especially if it allows for the
>> removal of the UERGS_SYSRET.
> Doesn't Xen decide what to do based on VGCF_IN_SYSCALL? Maybe Xen
> should have its own opportunistic VGCF_IN_SYSCALL logic.
VGCF_in_syscall affects whether the extra r11/rcx get restored or not,
as the hypercall itself is implemented using syscall. As the extra
r11/rcx (and rax for that matter) are unconditionally saved in the
hypercall stub, I can't see anything Linux could usefully do,
opportunistically speaking.
>
> Hmm, maybe some of this would be easier to think about if, rather than
> having a paravirt op, we could have:
>
> ALTERNATIVE "", "jmp xen_pop_things_and_iret", X86_FEATURE_XEN
>
> Or just IF_XEN("jmp ...");
>
> As a practical matter, x86_64 has native and Xen -- I don't think
> there's any other paravirt platform that needs the asm hooks.
It would certainly seem so. A careful use of IF_XEN() or two would make
the code far clearer to read, and to drop the hooks.
~Andrew
On Sun, Jul 26, 2015 at 4:05 PM, Andrew Cooper
<[email protected]> wrote:
> On 26/07/2015 23:08, Andy Lutomirski wrote:
>>
>>>> If so, can we just
>>>> enter later on:
>>>>
>>>> pushq %r11 /* pt_regs->flags */
>>>> pushq $__USER_CS /* pt_regs->cs */
>>>> pushq %rcx /* pt_regs->ip */
>>>>
>>>> <-- Xen enters here
>>>>
>>>> pushq %rax /* pt_regs->orig_ax */
>>>> pushq %rdi /* pt_regs->di */
>>>> pushq %rsi /* pt_regs->si */
>>>> pushq %rdx /* pt_regs->dx */
>>> This looks plausible, and indeed preferable to the current doublestep
>>> with undo_xen_syscall.
>>>
>>> One slight complication is that xen_enable_syscall() will want to
>>> special case register_callback() to not set CALLBACKF_mask_events, as
>>> the entry point is now after re-enabling interrupts.
>> I wouldn't do that. Let's just move the ENABLE_INTERRUPTS a few
>> instructions later even on native -- I want to do that anyway.
>
> That would also work.
>
>>
>>>> For SYSRET, I think the way to go is to force Xen to always use the
>>>> syscall slow path. Instead, Xen could hook into
>>>> syscall_return_via_sysret or even right before the opportunistic
>>>> sysret stuff. Then we could remove the USERGS_SYSRET hooks entirely.
>>>>
>>>> Would this work?
>>> None of the opportunistic sysret stuff makes sense under Xen. The path
>>> will inevitably end up in xen_iret making a hypercall. Short circuiting
>>> all of this seems like a good idea, especially if it allows for the
>>> removal of the UERGS_SYSRET.
>> Doesn't Xen decide what to do based on VGCF_IN_SYSCALL? Maybe Xen
>> should have its own opportunistic VGCF_IN_SYSCALL logic.
>
> VGCF_in_syscall affects whether the extra r11/rcx get restored or not,
> as the hypercall itself is implemented using syscall. As the extra
> r11/rcx (and rax for that matter) are unconditionally saved in the
> hypercall stub, I can't see anything Linux could usefully do,
> opportunistically speaking.
Xen does:
/* %rbx: struct vcpu, interrupts disabled */
restore_all_guest:
ASSERT_INTERRUPTS_DISABLED
RESTORE_ALL
testw $TRAP_syscall,4(%rsp)
jz iret_exit_to_guest
/* Don't use SYSRET path if the return address is not canonical. */
movq 8(%rsp),%rcx
sarq $47,%rcx
incl %ecx
cmpl $1,%ecx
ja .Lforce_iret
cmpw $FLAT_USER_CS32,16(%rsp)# CS
movq 8(%rsp),%rcx # RIP
movq 24(%rsp),%r11 # RFLAGS
movq 32(%rsp),%rsp # RSP
je 1f
sysretq
1: sysretl
That's essentially the same thing as opportunistic sysret. If Linux
stops setting VGCF_in_syscall, though, I think we'll bypass that code,
which will hurt performance. Whether this should be fixed in the
hypervisor or in the guest kernel hooks, I don't know, but it would be
easy to have a very simple xen_opportunistic_sysret path that checks
rcx==rip and r11==rflags and, if so, sets VGCF_in_syscall.
>
>>
>> Hmm, maybe some of this would be easier to think about if, rather than
>> having a paravirt op, we could have:
>>
>> ALTERNATIVE "", "jmp xen_pop_things_and_iret", X86_FEATURE_XEN
>>
>> Or just IF_XEN("jmp ...");
>>
>> As a practical matter, x86_64 has native and Xen -- I don't think
>> there's any other paravirt platform that needs the asm hooks.
>
> It would certainly seem so. A careful use of IF_XEN() or two would make
> the code far clearer to read, and to drop the hooks.
>
Want to add an IF_XEN macro?
I'm about to send patches for the SYSCALL bit.
--Andy
> ~Andrew
>
--
Andy Lutomirski
AMA Capital Management, LLC
On 27/07/15 00:27, Andy Lutomirski wrote:
>
>>>>> For SYSRET, I think the way to go is to force Xen to always use the
>>>>> syscall slow path. Instead, Xen could hook into
>>>>> syscall_return_via_sysret or even right before the opportunistic
>>>>> sysret stuff. Then we could remove the USERGS_SYSRET hooks entirely.
>>>>>
>>>>> Would this work?
>>>> None of the opportunistic sysret stuff makes sense under Xen. The path
>>>> will inevitably end up in xen_iret making a hypercall. Short circuiting
>>>> all of this seems like a good idea, especially if it allows for the
>>>> removal of the UERGS_SYSRET.
>>> Doesn't Xen decide what to do based on VGCF_IN_SYSCALL? Maybe Xen
>>> should have its own opportunistic VGCF_IN_SYSCALL logic.
>> VGCF_in_syscall affects whether the extra r11/rcx get restored or not,
>> as the hypercall itself is implemented using syscall. As the extra
>> r11/rcx (and rax for that matter) are unconditionally saved in the
>> hypercall stub, I can't see anything Linux could usefully do,
>> opportunistically speaking.
> Xen does:
>
> /* %rbx: struct vcpu, interrupts disabled */
> restore_all_guest:
> ASSERT_INTERRUPTS_DISABLED
> RESTORE_ALL
> testw $TRAP_syscall,4(%rsp)
> jz iret_exit_to_guest
>
> /* Don't use SYSRET path if the return address is not canonical. */
> movq 8(%rsp),%rcx
> sarq $47,%rcx
> incl %ecx
> cmpl $1,%ecx
> ja .Lforce_iret
>
> cmpw $FLAT_USER_CS32,16(%rsp)# CS
> movq 8(%rsp),%rcx # RIP
> movq 24(%rsp),%r11 # RFLAGS
> movq 32(%rsp),%rsp # RSP
> je 1f
> sysretq
> 1: sysretl
>
> That's essentially the same thing as opportunistic sysret. If Linux
> stops setting VGCF_in_syscall, though, I think we'll bypass that code,
> which will hurt performance. Whether this should be fixed in the
> hypervisor or in the guest kernel hooks, I don't know, but it would be
> easy to have a very simple xen_opportunistic_sysret path that checks
> rcx==rip and r11==rflags and, if so, sets VGCF_in_syscall.
I see your point. I didn't intend to suggest that Linux should stop
setting VGCF_in_syscall, as it is the only entity which knows whether it
is safe to clobber rcx/r11 in user context.
Having said this, Xen could certainly do its own opportunistic sysret
calculations as well. There are a number of issues in the Xen sysret
code which I plan to fix in due course, and I will see about making this
adjustment.
>
>>> Hmm, maybe some of this would be easier to think about if, rather than
>>> having a paravirt op, we could have:
>>>
>>> ALTERNATIVE "", "jmp xen_pop_things_and_iret", X86_FEATURE_XEN
>>>
>>> Or just IF_XEN("jmp ...");
>>>
>>> As a practical matter, x86_64 has native and Xen -- I don't think
>>> there's any other paravirt platform that needs the asm hooks.
>> It would certainly seem so. A careful use of IF_XEN() or two would make
>> the code far clearer to read, and to drop the hooks.
>>
> Want to add an IF_XEN macro?
I currently have a blocker bug against the impending Xen 4.6 release
which is higher on my todo list, but I will look into this as soon as I can.
~Andrew