An Armv8.8 FEAT_MOPS main or epilogue instruction will take an exception
if executed on a CPU with a different MOPS implementation option (A or
B) than the CPU where the preceding prologue instruction ran. In this
case the OS exception handler is expected to reset the registers and
restart execution from the prologue instruction.
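For illustration, a MOPS-based memcpy is a fixed three-instruction sequence; below is a minimal sketch (not part of this patch; the helper name is made up and it assumes a toolchain that accepts the Armv8.8 CPY* mnemonics):

/*
 * Illustrative only: copy n bytes with the FEAT_MOPS sequence. The
 * prologue (CPYP) performs an IMPLEMENTATION DEFINED part of the copy
 * and leaves dst/src/n in the Option A or Option B register format,
 * which the main (CPYM) and epilogue (CPYE) instructions then consume.
 */
static inline void *mops_memcpy(void *dst, const void *src, unsigned long n)
{
	void *ret = dst;

	asm volatile(
	"	cpyp	[%0]!, [%1]!, %2!\n"	/* prologue */
	"	cpym	[%0]!, [%1]!, %2!\n"	/* main */
	"	cpye	[%0]!, [%1]!, %2!\n"	/* epilogue */
	: "+r" (dst), "+r" (src), "+r" (n) : : "cc", "memory");

	return ret;
}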
A KVM guest may use the instructions at EL1 at times when the guest is
not able to handle the exception, expecting that the instructions will
only run on one CPU (e.g. when running UEFI boot services in the guest).
As KVM may reschedule the guest between different types of CPUs at any
time (on an asymmetric system), it needs to also handle the resulting
exception itself in case the guest is not able to. A similar situation
will also occur in the future when live migrating a guest from one type
of CPU to another.
Add handling for the MOPS exception to KVM. The handling can be shared
with the EL0 exception handler, as the logic and register layouts are
the same. The exception can be handled right after exiting a guest,
which avoids the cost of returning to the host exit handler.
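To show where the new handler slots in: exits from the guest are already dispatched through a per-exception-class handler table while still at hyp, roughly as in the simplified sketch below (modelled on the existing kvm_hyp_handle_exit() fixup path; details vary between kernel versions):

/*
 * Sketch: look up a handler for the exception class taken from the
 * guest and run it at hyp, before any return to the host exit handler.
 * Returning true means the exit was fixed up and the guest can be
 * re-entered directly.
 */
static inline bool handle_exit_at_hyp(struct kvm_vcpu *vcpu, u64 *exit_code)
{
	const exit_handler_fn *handlers = kvm_get_exit_handler_array(vcpu);
	exit_handler_fn fn = handlers[kvm_vcpu_trap_get_class(vcpu)];

	if (fn)
		return fn(vcpu, exit_code);

	return false;
}

Registering kvm_hyp_handle_mops for ESR_ELx_EC_MOPS in the handler arrays below is what makes the fixup run on this path.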
Similarly to the EL0 exception handler, in case the main or epilogue
instruction is being single stepped, it makes sense to finish the step
before executing the prologue instruction, so advance the single step
state machine.
Signed-off-by: Kristina Martsenko <[email protected]>
---
arch/arm64/include/asm/traps.h | 54 ++++++++++++++++++++++++-
arch/arm64/kernel/traps.c | 48 +---------------------
arch/arm64/kvm/hyp/include/hyp/switch.h | 17 ++++++++
arch/arm64/kvm/hyp/nvhe/switch.c | 2 +
arch/arm64/kvm/hyp/vhe/switch.c | 1 +
5 files changed, 73 insertions(+), 49 deletions(-)
diff --git a/arch/arm64/include/asm/traps.h b/arch/arm64/include/asm/traps.h
index d66dfb3a72dd..eefe766d6161 100644
--- a/arch/arm64/include/asm/traps.h
+++ b/arch/arm64/include/asm/traps.h
@@ -9,10 +9,9 @@
#include <linux/list.h>
#include <asm/esr.h>
+#include <asm/ptrace.h>
#include <asm/sections.h>
-struct pt_regs;
-
#ifdef CONFIG_ARMV8_DEPRECATED
bool try_emulate_armv8_deprecated(struct pt_regs *regs, u32 insn);
#else
@@ -101,4 +100,55 @@ static inline unsigned long arm64_ras_serror_get_severity(unsigned long esr)
bool arm64_is_fatal_ras_serror(struct pt_regs *regs, unsigned long esr);
void __noreturn arm64_serror_panic(struct pt_regs *regs, unsigned long esr);
+
+static inline void arm64_mops_reset_regs(struct user_pt_regs *regs, unsigned long esr)
+{
+ bool wrong_option = esr & ESR_ELx_MOPS_ISS_WRONG_OPTION;
+ bool option_a = esr & ESR_ELx_MOPS_ISS_OPTION_A;
+ int dstreg = ESR_ELx_MOPS_ISS_DESTREG(esr);
+ int srcreg = ESR_ELx_MOPS_ISS_SRCREG(esr);
+ int sizereg = ESR_ELx_MOPS_ISS_SIZEREG(esr);
+ unsigned long dst, src, size;
+
+ dst = regs->regs[dstreg];
+ src = regs->regs[srcreg];
+ size = regs->regs[sizereg];
+
+ /*
+ * Put the registers back in the original format suitable for a
+ * prologue instruction, using the generic return routine from the
+ * Arm ARM (DDI 0487I.a) rules CNTMJ and MWFQH.
+ */
+ if (esr & ESR_ELx_MOPS_ISS_MEM_INST) {
+ /* SET* instruction */
+ if (option_a ^ wrong_option) {
+ /* Format is from Option A; forward set */
+ regs->regs[dstreg] = dst + size;
+ regs->regs[sizereg] = -size;
+ }
+ } else {
+ /* CPY* instruction */
+ if (!(option_a ^ wrong_option)) {
+ /* Format is from Option B */
+ if (regs->pstate & PSR_N_BIT) {
+ /* Backward copy */
+ regs->regs[dstreg] = dst - size;
+ regs->regs[srcreg] = src - size;
+ }
+ } else {
+ /* Format is from Option A */
+ if (size & BIT(63)) {
+ /* Forward copy */
+ regs->regs[dstreg] = dst + size;
+ regs->regs[srcreg] = src + size;
+ regs->regs[sizereg] = -size;
+ }
+ }
+ }
+
+ if (esr & ESR_ELx_MOPS_ISS_FROM_EPILOGUE)
+ regs->pc -= 8;
+ else
+ regs->pc -= 4;
+}
#endif
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 8b70759cdbb9..ede65a20e7dc 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -516,53 +516,7 @@ void do_el1_fpac(struct pt_regs *regs, unsigned long esr)
void do_el0_mops(struct pt_regs *regs, unsigned long esr)
{
- bool wrong_option = esr & ESR_ELx_MOPS_ISS_WRONG_OPTION;
- bool option_a = esr & ESR_ELx_MOPS_ISS_OPTION_A;
- int dstreg = ESR_ELx_MOPS_ISS_DESTREG(esr);
- int srcreg = ESR_ELx_MOPS_ISS_SRCREG(esr);
- int sizereg = ESR_ELx_MOPS_ISS_SIZEREG(esr);
- unsigned long dst, src, size;
-
- dst = pt_regs_read_reg(regs, dstreg);
- src = pt_regs_read_reg(regs, srcreg);
- size = pt_regs_read_reg(regs, sizereg);
-
- /*
- * Put the registers back in the original format suitable for a
- * prologue instruction, using the generic return routine from the
- * Arm ARM (DDI 0487I.a) rules CNTMJ and MWFQH.
- */
- if (esr & ESR_ELx_MOPS_ISS_MEM_INST) {
- /* SET* instruction */
- if (option_a ^ wrong_option) {
- /* Format is from Option A; forward set */
- pt_regs_write_reg(regs, dstreg, dst + size);
- pt_regs_write_reg(regs, sizereg, -size);
- }
- } else {
- /* CPY* instruction */
- if (!(option_a ^ wrong_option)) {
- /* Format is from Option B */
- if (regs->pstate & PSR_N_BIT) {
- /* Backward copy */
- pt_regs_write_reg(regs, dstreg, dst - size);
- pt_regs_write_reg(regs, srcreg, src - size);
- }
- } else {
- /* Format is from Option A */
- if (size & BIT(63)) {
- /* Forward copy */
- pt_regs_write_reg(regs, dstreg, dst + size);
- pt_regs_write_reg(regs, srcreg, src + size);
- pt_regs_write_reg(regs, sizereg, -size);
- }
- }
- }
-
- if (esr & ESR_ELx_MOPS_ISS_FROM_EPILOGUE)
- regs->pc -= 8;
- else
- regs->pc -= 4;
+ arm64_mops_reset_regs(&regs->user_regs, esr);
/*
* If single stepping then finish the step before executing the
diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index 9cfe6bd1dbe4..f99d8af0b9af 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -30,6 +30,7 @@
#include <asm/fpsimd.h>
#include <asm/debug-monitors.h>
#include <asm/processor.h>
+#include <asm/traps.h>
struct kvm_exception_table_entry {
int insn, fixup;
@@ -265,6 +266,22 @@ static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
return __get_fault_info(vcpu->arch.fault.esr_el2, &vcpu->arch.fault);
}
+static bool kvm_hyp_handle_mops(struct kvm_vcpu *vcpu, u64 *exit_code)
+{
+ *vcpu_pc(vcpu) = read_sysreg_el2(SYS_ELR);
+ arm64_mops_reset_regs(vcpu_gp_regs(vcpu), vcpu->arch.fault.esr_el2);
+ write_sysreg_el2(*vcpu_pc(vcpu), SYS_ELR);
+
+ /*
+ * Finish potential single step before executing the prologue
+ * instruction.
+ */
+ *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
+ write_sysreg_el2(*vcpu_cpsr(vcpu), SYS_SPSR);
+
+ return true;
+}
+
static inline void __hyp_sve_restore_guest(struct kvm_vcpu *vcpu)
{
sve_cond_update_zcr_vq(vcpu_sve_max_vq(vcpu) - 1, SYS_ZCR_EL2);
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index c353a06ee7e6..c50f8459e4fc 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -192,6 +192,7 @@ static const exit_handler_fn hyp_exit_handlers[] = {
[ESR_ELx_EC_DABT_LOW] = kvm_hyp_handle_dabt_low,
[ESR_ELx_EC_WATCHPT_LOW] = kvm_hyp_handle_watchpt_low,
[ESR_ELx_EC_PAC] = kvm_hyp_handle_ptrauth,
+ [ESR_ELx_EC_MOPS] = kvm_hyp_handle_mops,
};
static const exit_handler_fn pvm_exit_handlers[] = {
@@ -203,6 +204,7 @@ static const exit_handler_fn pvm_exit_handlers[] = {
[ESR_ELx_EC_DABT_LOW] = kvm_hyp_handle_dabt_low,
[ESR_ELx_EC_WATCHPT_LOW] = kvm_hyp_handle_watchpt_low,
[ESR_ELx_EC_PAC] = kvm_hyp_handle_ptrauth,
+ [ESR_ELx_EC_MOPS] = kvm_hyp_handle_mops,
};
static const exit_handler_fn *kvm_get_exit_handler_array(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 6537f58b1a8c..796202f2e08f 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -126,6 +126,7 @@ static const exit_handler_fn hyp_exit_handlers[] = {
[ESR_ELx_EC_DABT_LOW] = kvm_hyp_handle_dabt_low,
[ESR_ELx_EC_WATCHPT_LOW] = kvm_hyp_handle_watchpt_low,
[ESR_ELx_EC_PAC] = kvm_hyp_handle_ptrauth,
+ [ESR_ELx_EC_MOPS] = kvm_hyp_handle_mops,
};
static const exit_handler_fn *kvm_get_exit_handler_array(struct kvm_vcpu *vcpu)
--
2.25.1
Hi Kristina,
On Fri, 22 Sep 2023 12:25:07 +0100,
Kristina Martsenko <[email protected]> wrote:
>
> An Armv8.8 FEAT_MOPS main or epilogue instruction will take an exception
> if executed on a CPU with a different MOPS implementation option (A or
> B) than the CPU where the preceding prologue instruction ran. In this
> case the OS exception handler is expected to reset the registers and
> restart execution from the prologue instruction.
>
> A KVM guest may use the instructions at EL1 at times when the guest is
> not able to handle the exception, expecting that the instructions will
> only run on one CPU (e.g. when running UEFI boot services in the guest).
> As KVM may reschedule the guest between different types of CPUs at any
> time (on an asymmetric system), it needs to also handle the resulting
> exception itself in case the guest is not able to. A similar situation
> will also occur in the future when live migrating a guest from one type
> of CPU to another.
>
> Add handling for the MOPS exception to KVM. The handling can be shared
> with the EL0 exception handler, as the logic and register layouts are
> the same. The exception can be handled right after exiting a guest,
> which avoids the cost of returning to the host exit handler.
>
> Similarly to the EL0 exception handler, in case the main or epilogue
> instruction is being single stepped, it makes sense to finish the step
> before executing the prologue instruction, so advance the single step
> state machine.
What is the rationale for advancing the state machine? Shouldn't we
instead return to the guest and immediately get the SS exception,
which in turn gets reported to userspace? Is it because we rollback
the PC to a previous instruction?
In the latter case, won't userspace see multiple SS exceptions for the
middle instruction if trapping from the epilogue? This would be a bit
surprising, to say the least.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On 24/09/2023 15:48, Marc Zyngier wrote:
> Hi Kristina,
Hi Marc,
> On Fri, 22 Sep 2023 12:25:07 +0100,
> Kristina Martsenko <[email protected]> wrote:
>>
>> An Armv8.8 FEAT_MOPS main or epilogue instruction will take an exception
>> if executed on a CPU with a different MOPS implementation option (A or
>> B) than the CPU where the preceding prologue instruction ran. In this
>> case the OS exception handler is expected to reset the registers and
>> restart execution from the prologue instruction.
>>
>> A KVM guest may use the instructions at EL1 at times when the guest is
>> not able to handle the exception, expecting that the instructions will
>> only run on one CPU (e.g. when running UEFI boot services in the guest).
>> As KVM may reschedule the guest between different types of CPUs at any
>> time (on an asymmetric system), it needs to also handle the resulting
>> exception itself in case the guest is not able to. A similar situation
>> will also occur in the future when live migrating a guest from one type
>> of CPU to another.
>>
>> Add handling for the MOPS exception to KVM. The handling can be shared
>> with the EL0 exception handler, as the logic and register layouts are
>> the same. The exception can be handled right after exiting a guest,
>> which avoids the cost of returning to the host exit handler.
>>
>> Similarly to the EL0 exception handler, in case the main or epilogue
>> instruction is being single stepped, it makes sense to finish the step
>> before executing the prologue instruction, so advance the single step
>> state machine.
>
> What is the rationale for advancing the state machine? Shouldn't we
> instead return to the guest and immediately get the SS exception,
> which in turn gets reported to userspace? Is it because we rollback
> the PC to a previous instruction?
Yes, because we rollback the PC to the prologue instruction. We advance the
state machine so that the SS exception is taken immediately upon returning to
the guest at the prologue instruction. If we didn't advance it then we would
return to the guest, execute the prologue instruction, and then take the SS
exception on the middle instruction. Which would be surprising as userspace
would see the middle and epilogue instructions executed multiple times but not
the prologue.
> In the latter case, won't userspace see multiple SS exceptions for the
> middle instruction if trapping from the epilogue? This would be a bit
> surprising, to say the least.
Not sure I follow. Do you mean multiple in a row or multiple in total? Not in a
row (we step the prologue instruction in between), but yes in total (every time
we start executing the middle instruction). And this happens when trapping from
the middle instruction too, not just the epilogue. Do you see a better way of
handling it?
Here is an example of what GDB sees when single stepping a guest while the
guest executes these instructions ("mops ex" are debug prints in kvm; I've
added prologue/main/epilogue comments):
Breakpoint 2, 0xffff80008051b6a4 in ?? ()
0xffff80008051b6a8 in ?? () # prologue
0xffff80008051b6ac in ?? () # main
[ 33.615305] mops ex: memcpy: B->A: fwd: main
0xffff80008051b6a8 in ?? () # prologue
0xffff80008051b6ac in ?? () # main
0xffff80008051b6b0 in ?? () # epilogue
[ 34.141251] mops ex: memcpy: A->B: fwd: epi
0xffff80008051b6a8 in ?? () # prologue
0xffff80008051b6ac in ?? () # main
0xffff80008051b6b0 in ?? () # epilogue
[ 34.209822] mops ex: memcpy: B->A: fwd: epi
0xffff80008051b6a8 in ?? () # prologue
0xffff80008051b6ac in ?? () # main
0xffff80008051b6b0 in ?? () # epilogue
0xffff80008051b6b4 in ?? ()
[...]
Thanks,
Kristina
On Mon, Sep 25, 2023 at 04:16:06PM +0100, Kristina Martsenko wrote:
[...]
> > What is the rationale for advancing the state machine? Shouldn't we
> > instead return to the guest and immediately get the SS exception,
> > which in turn gets reported to userspace? Is it because we rollback
> > the PC to a previous instruction?
>
> Yes, because we rollback the PC to the prologue instruction. We advance the
> state machine so that the SS exception is taken immediately upon returning to
> the guest at the prologue instruction. If we didn't advance it then we would
> return to the guest, execute the prologue instruction, and then take the SS
> exception on the middle instruction. Which would be surprising as userspace
> would see the middle and epilogue instructions executed multiple times but not
> the prologue.
I agree with Kristina that taking the SS exception on the prologue is
likely the best course of action. Especially since it matches the
behavior of single-stepping an EL0 MOPS sequence with an intervening CPU
migration.
This behavior might throw an EL1 that single-steps itself for a loop,
but I think it is impossible for a hypervisor to hide the consequences
of vCPU migration with MOPS in the first place.
Marc, I'm guessing you were most concerned about the former case where
the VMM was debugging the guest. Is there something you're concerned
about I missed?
--
Thanks,
Oliver
On Wed, 27 Sep 2023 09:28:20 +0100,
Oliver Upton <[email protected]> wrote:
>
> On Mon, Sep 25, 2023 at 04:16:06PM +0100, Kristina Martsenko wrote:
>
> [...]
>
> > > What is the rationale for advancing the state machine? Shouldn't we
> > > instead return to the guest and immediately get the SS exception,
> > > which in turn gets reported to userspace? Is it because we rollback
> > > the PC to a previous instruction?
> >
> > Yes, because we rollback the PC to the prologue instruction. We advance the
> > state machine so that the SS exception is taken immediately upon returning to
> > the guest at the prologue instruction. If we didn't advance it then we would
> > return to the guest, execute the prologue instruction, and then take the SS
> > exception on the middle instruction. Which would be surprising as userspace
> > would see the middle and epilogue instructions executed multiple times but not
> > the prologue.
>
> I agree with Kristina that taking the SS exception on the prologue is
> likely the best course of action. Especially since it matches the
> behavior of single-stepping an EL0 MOPS sequence with an intervening CPU
> migration.
>
> This behavior might throw an EL1 that single-steps itself for a loop,
> but I think it is impossible for a hypervisor to hide the consequences
> of vCPU migration with MOPS in the first place.
>
> Marc, I'm guessing you were most concerned about the former case where
> the VMM was debugging the guest. Is there something you're concerned
> about I missed?
My concern is not only the VMM, but any userspace that performs
single-stepping. Imagine the debugger tracks PC by itself, and simply
increments it by 4 on a non-branch, non-fault instruction.
Move the vcpu or the userspace around, rewind PC, and now the debugger
is out of whack with what is executing. While I agree that there is
not much a hypervisor can do about that, I'm a bit worried that we are
going to break existing SW with this.
Now the obvious solution is "don't do that"...
M.
--
Without deviation from the norm, progress is not possible.
On 29/09/2023 10:23, Marc Zyngier wrote:
> On Wed, 27 Sep 2023 09:28:20 +0100,
> Oliver Upton <[email protected]> wrote:
>>
>> On Mon, Sep 25, 2023 at 04:16:06PM +0100, Kristina Martsenko wrote:
>>
>> [...]
>>
>>>> What is the rationale for advancing the state machine? Shouldn't we
>>>> instead return to the guest and immediately get the SS exception,
>>>> which in turn gets reported to userspace? Is it because we rollback
>>>> the PC to a previous instruction?
>>>
>>> Yes, because we rollback the PC to the prologue instruction. We advance the
>>> state machine so that the SS exception is taken immediately upon returning to
>>> the guest at the prologue instruction. If we didn't advance it then we would
>>> return to the guest, execute the prologue instruction, and then take the SS
>>> exception on the middle instruction. Which would be surprising as userspace
>>> would see the middle and epilogue instructions executed multiple times but not
>>> the prologue.
>>
>> I agree with Kristina that taking the SS exception on the prologue is
>> likely the best course of action. Especially since it matches the
>> behavior of single-stepping an EL0 MOPS sequence with an intervening CPU
>> migration.
>>
>> This behavior might throw an EL1 that single-steps itself for a loop,
>> but I think it is impossible for a hypervisor to hide the consequences
>> of vCPU migration with MOPS in the first place.
>>
>> Marc, I'm guessing you were most concerned about the former case where
>> the VMM was debugging the guest. Is there something you're concerned
>> about I missed?
>
> My concern is not only the VMM, but any userspace that performs
> single-stepping. Imagine the debugger tracks PC by itself, and simply
> increments it by 4 on a non-branch, non-fault instruction.
>
> Move the vcpu or the userspace around, rewind PC, and now the debugger
> is out of whack with what is executing. While I agree that there is
> not much a hypervisor can do about that, I'm a bit worried that we are
> going to break existing SW with this.
>
> Now the obvious solution is "don't do that"...
If the debugger can handle the PC changing on branching or faulting
instructions, then why can't it handle it on MOPS instructions? Wouldn't
such a debugger need to be updated any time the architecture adds new
branching or faulting instructions? What's different here?
Confused,
Kristina
On Mon, 02 Oct 2023 15:06:33 +0100,
Kristina Martsenko <[email protected]> wrote:
>
> On 29/09/2023 10:23, Marc Zyngier wrote:
> > On Wed, 27 Sep 2023 09:28:20 +0100,
> > Oliver Upton <[email protected]> wrote:
> >>
> >> On Mon, Sep 25, 2023 at 04:16:06PM +0100, Kristina Martsenko wrote:
> >>
> >> [...]
> >>
> >>>> What is the rationale for advancing the state machine? Shouldn't we
> >>>> instead return to the guest and immediately get the SS exception,
> >>>> which in turn gets reported to userspace? Is it because we rollback
> >>>> the PC to a previous instruction?
> >>>
> >>> Yes, because we rollback the PC to the prologue instruction. We advance the
> >>> state machine so that the SS exception is taken immediately upon returning to
> >>> the guest at the prologue instruction. If we didn't advance it then we would
> >>> return to the guest, execute the prologue instruction, and then take the SS
> >>> exception on the middle instruction. Which would be surprising as userspace
> >>> would see the middle and epilogue instructions executed multiple times but not
> >>> the prologue.
> >>
> >> I agree with Kristina that taking the SS exception on the prologue is
> >> likely the best course of action. Especially since it matches the
> >> behavior of single-stepping an EL0 MOPS sequence with an intervening CPU
> >> migration.
> >>
> >> This behavior might throw an EL1 that single-steps itself for a loop,
> >> but I think it is impossible for a hypervisor to hide the consequences
> >> of vCPU migration with MOPS in the first place.
> >>
> >> Marc, I'm guessing you were most concerned about the former case where
> >> the VMM was debugging the guest. Is there something you're concerned
> >> about I missed?
> >
> > My concern is not only the VMM, but any userspace that performs
> > single-stepping. Imagine the debugger tracks PC by itself, and simply
> > increments it by 4 on a non-branch, non-fault instruction.
> >
> > Move the vcpu or the userspace around, rewind PC, and now the debugger
> > is out of whack with what is executing. While I agree that there is
> > not much a hypervisor can do about that, I'm a bit worried that we are
> > going to break existing SW with this.
> >
> > Now the obvious solution is "don't do that"...
>
> If the debugger can handle the PC changing on branching or faulting
> instructions, then why can't it handle it on MOPS instructions? Wouldn't
> such a debugger need to be updated any time the architecture adds new
> branching or faulting instructions? What's different here?
What is different is that we *go back* in the instruction stream,
which is a first. I'm not saying that the debugger I describe above
would be a very clever piece of SW, quite the opposite. But the way
the architecture works results in some interesting side-effects, and
I'm willing to bet that some SW will break (rr?).
But again, asymmetric systems are such a bad idea that I can't say I
care.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Mon, Oct 02, 2023 at 03:55:33PM +0100, Marc Zyngier wrote:
> On Mon, 02 Oct 2023 15:06:33 +0100,
> Kristina Martsenko <[email protected]> wrote:
> > On 29/09/2023 10:23, Marc Zyngier wrote:
> > > On Wed, 27 Sep 2023 09:28:20 +0100,
> > > Oliver Upton <[email protected]> wrote:
> > >> On Mon, Sep 25, 2023 at 04:16:06PM +0100, Kristina Martsenko wrote:
> > >>>> What is the rationale for advancing the state machine? Shouldn't we
> > >>>> instead return to the guest and immediately get the SS exception,
> > >>>> which in turn gets reported to userspace? Is it because we rollback
> > >>>> the PC to a previous instruction?
> > >>>
> > >>> Yes, because we rollback the PC to the prologue instruction. We advance the
> > >>> state machine so that the SS exception is taken immediately upon returning to
> > >>> the guest at the prologue instruction. If we didn't advance it then we would
> > >>> return to the guest, execute the prologue instruction, and then take the SS
> > >>> exception on the middle instruction. Which would be surprising as userspace
> > >>> would see the middle and epilogue instructions executed multiple times but not
> > >>> the prologue.
> > >>
> > >> I agree with Kristina that taking the SS exception on the prologue is
> > >> likely the best course of action. Especially since it matches the
> > >> behavior of single-stepping an EL0 MOPS sequence with an intervening CPU
> > >> migration.
> > >>
> > >> This behavior might throw an EL1 that single-steps itself for a loop,
> > >> but I think it is impossible for a hypervisor to hide the consequences
> > >> of vCPU migration with MOPS in the first place.
> > >>
> > >> Marc, I'm guessing you were most concerned about the former case where
> > >> the VMM was debugging the guest. Is there something you're concerned
> > >> about I missed?
> > >
> > > My concern is not only the VMM, but any userspace that performs
> > > single-stepping. Imagine the debugger tracks PC by itself, and simply
> > > increments it by 4 on a non-branch, non-fault instruction.
> > >
> > > Move the vcpu or the userspace around, rewind PC, and now the debugger
> > > is out of whack with what is executing. While I agree that there is
> > > not much a hypervisor can do about that, I'm a bit worried that we are
> > > going to break existing SW with this.
> > >
> > > Now the obvious solution is "don't do that"...
> >
> > If the debugger can handle the PC changing on branching or faulting
> > instructions, then why can't it handle it on MOPS instructions? Wouldn't
> > such a debugger need to be updated any time the architecture adds new
> > branching or faulting instructions? What's different here?
>
> What is different is that we *go back* in the instruction stream,
> which is a first. I'm not saying that the debugger I describe above
> would be a very clever piece of SW, quite the opposite. But the way
> the architecture works results in some interesting side-effects, and
> I'm willing to bet that some SW will break (rr?).
The way the architecture works, either with or without Kristina's
single-step change, a debugger would get confused. At least for EL0, I
find the proposed (well, upstreamed) approach more predictable - it
always restarts from the prologue in case of migration between CPUs with
different MOPS implementation (which is not just theoretical AFAIK).
It's more like these three instructions are a bigger CISC one ;) (though
the CPU can step through its parts).
A more transparent approach would have been to fully emulate the
instructions in the kernel and advance the PC as expected but I don't
think that's even possible. An implementation may decide to leave some
bytes to be copied by the epilogue but we can't know that in software,
it's a microarchitecture thing.
There is the case of EL1 debugging itself (kgdb) and it triggers a MOPS
exception to EL2. It would look weird for the guest but I guess the only
other option is to disable MCE2 and let EL1 handle the mismatched MOPS
option itself (assuming it knows how to; it should be fine for Linux). I
think I still prefer Kristina's proposal for KVM as more generic, with
the downside of breaking less usual cases like the kernel
single-stepping itself.
--
Catalin
On Tue, 03 Oct 2023 15:29:42 +0100,
Catalin Marinas <[email protected]> wrote:
>
> The way the architecture works, either with or without Kristina's
> single-step change, a debugger would get confused. At least for EL0, I
> find the proposed (well, upstreamed) approach more predictable - it
> always restarts from the prologue in case of migration between CPUs with
> different MOPS implementation (which is not just theoretical AFAIK).
> It's more like these three instructions are a bigger CISC one ;) (though
> the CPU can step through its parts).
>
> A more transparent approach would have been to fully emulate the
> instructions in the kernel and advance the PC as expected but I don't
> think that's even possible. An implementation may decide to leave some
> bytes to be copied by the epilogue but we can't know that in software,
> it's a microarchitecture thing.
>
> There is the case of EL1 debugging itself (kgdb) and it triggers a MOPS
> exception to EL2. It would look weird for the guest but I guess the only
> other option is to disable MCE2 and let EL1 handle the mismatched MOPS
> option itself (assuming it knows how to; it should be fine for Linux). I
> think I still prefer Kristina's proposal for KVM as more generic, with
> the downside of breaking less usual cases like the kernel
> single-stepping itself.
I don't disagree at all.
My issue isn't with Kristina's patches, which are absolutely fine. It
has more to do with the shape of the FEAT_MOPS extension itself, which
exposes uarch details to SW instead of abstracting them.
But I've now ranted about it for close to two weeks, and it is time
for me to move on... ;-)
M.
--
Without deviation from the norm, progress is not possible.