After receiving an INIT signal (either via the local APIC, or through
KVM_SET_MP_STATE), the bootstrap processor should reset immediately
and start execution at 0xfffffff0. Also, SIPIs have no effect on the
bootstrap processor. However, KVM currently does not differentiate
between the BSP and APs.
Implement this so that userspace can correctly implement CPU soft resets
even when the in-kernel APIC is in use. Another small change is needed,
because INITs sent to the bootstrap processor do not go through a halt
state; it is incorrect to go through kvm_vcpu_block. I think this also
fixes a race before between sending the INIT and SIPI interrupts; if the
two were close enough, the receiving VCPU could have received the SIPI
before entering kvm_vcpu_block. It would them stay in kvm_vcpu_block
until the next kvm_vcpu_kick. In practice this was not a problem,
because the Intel SDM suggests to send two SIPIs with some time passing
between them; the second SIPI would unblock the VCPU.
The tests in vcpu_needs_reset are organized so that the hypervisor
will go through the same number of compare-and-jump sequences as
before in the common case.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/lapic.c | 3 ++-
arch/x86/kvm/x86.c | 23 +++++++++++++++++++----
2 files changed, 21 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9392f52..0c515ac 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -710,7 +710,8 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
case APIC_DM_STARTUP:
apic_debug("SIPI to vcpu %d vector 0x%02x\n",
vcpu->vcpu_id, vector);
- if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
+ if (!kvm_vcpu_is_bsp(apic->vcpu) &&
+ vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
result = 1;
vcpu->arch.sipi_vector = vector;
vcpu->arch.mp_state = KVM_MP_STATE_SIPI_RECEIVED;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c243b81..603e6ff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5784,15 +5784,29 @@ out:
return r;
}
+static inline int vcpu_needs_reset(struct kvm_vcpu *vcpu)
+{
+ /* Shortcut the test in the common case. */
+ if (likely(vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE))
+ return 0;
+
+ if (kvm_vcpu_is_bsp(vcpu))
+ return vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED;
+ else
+ return vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED;
+}
static int __vcpu_run(struct kvm_vcpu *vcpu)
{
int r;
struct kvm *kvm = vcpu->kvm;
- if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED)) {
- pr_debug("vcpu %d received sipi with vector # %x\n",
- vcpu->vcpu_id, vcpu->arch.sipi_vector);
+ if (unlikely(vcpu_needs_reset(vcpu))) {
+ if (kvm_vcpu_is_bsp(vcpu))
+ pr_debug("vcpu %d received init\n", vcpu->vcpu_id);
+ else
+ pr_debug("vcpu %d received sipi with vector # %x\n",
+ vcpu->vcpu_id, vcpu->arch.sipi_vector);
kvm_lapic_reset(vcpu);
r = kvm_vcpu_reset(vcpu);
if (r)
@@ -5812,6 +5826,8 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
!vcpu->arch.apf.halted)
r = vcpu_enter_guest(vcpu);
+ else if (unlikely(vcpu_needs_reset(vcpu)))
+ r = -EINTR;
else {
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
kvm_vcpu_block(vcpu);
@@ -5825,7 +5841,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
case KVM_MP_STATE_RUNNABLE:
vcpu->arch.apf.halted = false;
break;
- case KVM_MP_STATE_SIPI_RECEIVED:
default:
r = -EINTR;
break;
--
1.8.1.4
On Sat, Mar 09, 2013 at 07:48:33AM +0100, Paolo Bonzini wrote:
> After receiving an INIT signal (either via the local APIC, or through
> KVM_SET_MP_STATE), the bootstrap processor should reset immediately
> and start execution at 0xfffffff0. Also, SIPIs have no effect on the
> bootstrap processor. However, KVM currently does not differentiate
> between the BSP and APs.
>
Userspace is capable of resetting vcpu by itself, so adding code to
handle KVM_SET_MP_STATE(INIT) looks unnecessary to me. I think the
simple patch below (not tested) should handle INIT for in-kernel irq chip
and userspace does not need special handling for cpu reset. It already
knows how to reset cpu on system_reset, so reseting only cpus should not
be different.
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 02b51dd..6cb3c21 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -731,7 +731,9 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
case APIC_DM_INIT:
if (!trig_mode || level) {
result = 1;
- vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+ vcpu->arch.mp_state = kvm_vcpu_is_bsp(vcpu) ?
+ KVM_MP_STATE_SIPI_RECEIVED :
+ KVM_MP_STATE_INIT_RECEIVED;
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_vcpu_kick(vcpu);
} else {
--
Gleb.
Il 10/03/2013 12:46, Gleb Natapov ha scritto:
> On Sat, Mar 09, 2013 at 07:48:33AM +0100, Paolo Bonzini wrote:
>> After receiving an INIT signal (either via the local APIC, or through
>> KVM_SET_MP_STATE), the bootstrap processor should reset immediately
>> and start execution at 0xfffffff0. Also, SIPIs have no effect on the
>> bootstrap processor. However, KVM currently does not differentiate
>> between the BSP and APs.
>>
> Userspace is capable of resetting vcpu by itself, so adding code to
> handle KVM_SET_MP_STATE(INIT) looks unnecessary to me. I think the
> simple patch below (not tested) should handle INIT for in-kernel irq chip
> and userspace does not need special handling for cpu reset. It already
> knows how to reset cpu on system_reset, so reseting only cpus should not
> be different.
At least you'll need the last two hunks, moving the check for
KVM_MP_STATE_SIPI_RECEIVED before kvm_vcpu_block, but yes---something
like this could work.
However, it would effectively redefine the meaning of
KVM_MP_STATE_INIT_RECEIVED and KVM_MP_STATE_SIPI_RECEIVED, respectively
to KVM_MP_STATE_WAIT_FOR_SIPI and KVM_MP_STATE_RESETTING. I wasn't sure
if this is considered an API change (personally, I would treat it as one).
Paolo
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 02b51dd..6cb3c21 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -731,7 +731,9 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
> case APIC_DM_INIT:
> if (!trig_mode || level) {
> result = 1;
> - vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> + vcpu->arch.mp_state = kvm_vcpu_is_bsp(vcpu) ?
> + KVM_MP_STATE_SIPI_RECEIVED :
> + KVM_MP_STATE_INIT_RECEIVED;
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> kvm_vcpu_kick(vcpu);
> } else {
> --
> Gleb.
>
On Sun, Mar 10, 2013 at 03:53:54PM +0100, Paolo Bonzini wrote:
> Il 10/03/2013 12:46, Gleb Natapov ha scritto:
> > On Sat, Mar 09, 2013 at 07:48:33AM +0100, Paolo Bonzini wrote:
> >> After receiving an INIT signal (either via the local APIC, or through
> >> KVM_SET_MP_STATE), the bootstrap processor should reset immediately
> >> and start execution at 0xfffffff0. Also, SIPIs have no effect on the
> >> bootstrap processor. However, KVM currently does not differentiate
> >> between the BSP and APs.
> >>
> > Userspace is capable of resetting vcpu by itself, so adding code to
> > handle KVM_SET_MP_STATE(INIT) looks unnecessary to me. I think the
> > simple patch below (not tested) should handle INIT for in-kernel irq chip
> > and userspace does not need special handling for cpu reset. It already
> > knows how to reset cpu on system_reset, so reseting only cpus should not
> > be different.
>
> At least you'll need the last two hunks, moving the check for
> KVM_MP_STATE_SIPI_RECEIVED before kvm_vcpu_block, but yes---something
> like this could work.
>
I do not see why. kvm_vcpu_block() exits if vpu is in
KVM_MP_STATE_SIPI_RECEIVED state.
> However, it would effectively redefine the meaning of
> KVM_MP_STATE_INIT_RECEIVED and KVM_MP_STATE_SIPI_RECEIVED, respectively
> to KVM_MP_STATE_WAIT_FOR_SIPI and KVM_MP_STATE_RESETTING. I wasn't sure
> if this is considered an API change (personally, I would treat it as one).
>
If it is kernel module internal it definitely is not API change.
INIT/SIPI handling is a bit ad-hoc right now anyway as Jan noticed. For
instance INIT does not really resets VCPU. Only after SIPI it is really
reset, so KVM_MP_STATE_SIPI_RECEIVED is really KVM_MP_STATE_RESET_ME_RIGHT_NOW
state.
--
Gleb.
Il 10/03/2013 16:35, Gleb Natapov ha scritto:
>> > However, it would effectively redefine the meaning of
>> > KVM_MP_STATE_INIT_RECEIVED and KVM_MP_STATE_SIPI_RECEIVED, respectively
>> > to KVM_MP_STATE_WAIT_FOR_SIPI and KVM_MP_STATE_RESETTING. I wasn't sure
>> > if this is considered an API change (personally, I would treat it as one).
>> >
> If it is kernel module internal it definitely is not API change.
> INIT/SIPI handling is a bit ad-hoc right now anyway as Jan noticed. For
> instance INIT does not really resets VCPU. Only after SIPI it is really
> reset, so KVM_MP_STATE_SIPI_RECEIVED is really KVM_MP_STATE_RESET_ME_RIGHT_NOW
> state.
Yeah, and the current definition is ambiguous (without hypervisor
patches, there's no way to use it as the names would suggest), so
perhaps the right thing to do is to rename the states (old names kept
for backwards compatibility only) and work from there.
Paolo
On Sun, Mar 10, 2013 at 06:19:07PM +0100, Paolo Bonzini wrote:
> Il 10/03/2013 16:35, Gleb Natapov ha scritto:
> >> > However, it would effectively redefine the meaning of
> >> > KVM_MP_STATE_INIT_RECEIVED and KVM_MP_STATE_SIPI_RECEIVED, respectively
> >> > to KVM_MP_STATE_WAIT_FOR_SIPI and KVM_MP_STATE_RESETTING. I wasn't sure
> >> > if this is considered an API change (personally, I would treat it as one).
> >> >
> > If it is kernel module internal it definitely is not API change.
> > INIT/SIPI handling is a bit ad-hoc right now anyway as Jan noticed. For
> > instance INIT does not really resets VCPU. Only after SIPI it is really
> > reset, so KVM_MP_STATE_SIPI_RECEIVED is really KVM_MP_STATE_RESET_ME_RIGHT_NOW
> > state.
>
> Yeah, and the current definition is ambiguous (without hypervisor
> patches, there's no way to use it as the names would suggest), so
> perhaps the right thing to do is to rename the states (old names kept
> for backwards compatibility only) and work from there.
>
I do not see how renaming clarify things. From userspace point of view
KVM_MP_STATE_SIPI_RECEIVED does not exists. If AP is hard reset
userspase makes it UNINIT, if soft reset it makes it INIT_RECEIVED, if
BSP it makes it running no matter what type of reset.
--
Gleb.
Il 10/03/2013 19:10, Gleb Natapov ha scritto:
> On Sun, Mar 10, 2013 at 06:19:07PM +0100, Paolo Bonzini wrote:
>> Il 10/03/2013 16:35, Gleb Natapov ha scritto:
>>>>> However, it would effectively redefine the meaning of
>>>>> KVM_MP_STATE_INIT_RECEIVED and KVM_MP_STATE_SIPI_RECEIVED, respectively
>>>>> to KVM_MP_STATE_WAIT_FOR_SIPI and KVM_MP_STATE_RESETTING. I wasn't sure
>>>>> if this is considered an API change (personally, I would treat it as one).
>>>>>
>>> If it is kernel module internal it definitely is not API change.
>>> INIT/SIPI handling is a bit ad-hoc right now anyway as Jan noticed. For
>>> instance INIT does not really resets VCPU. Only after SIPI it is really
>>> reset, so KVM_MP_STATE_SIPI_RECEIVED is really KVM_MP_STATE_RESET_ME_RIGHT_NOW
>>> state.
>>
>> Yeah, and the current definition is ambiguous (without hypervisor
>> patches, there's no way to use it as the names would suggest), so
>> perhaps the right thing to do is to rename the states (old names kept
>> for backwards compatibility only) and work from there.
>>
> I do not see how renaming clarify things. From userspace point of view
> KVM_MP_STATE_SIPI_RECEIVED does not exists.
Not really true---we do exit with that state and EINTR when we get a
SIPI. Perhaps that can be changed.
> If AP is hard reset
> userspase makes it UNINIT, if soft reset it makes it INIT_RECEIVED, if
> BSP it makes it running no matter what type of reset.
The current name just suggests .
And when getting an INIT in the in-kernel LAPIC, this:
- vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+ vcpu->arch.mp_state = kvm_vcpu_is_bsp(vcpu) ?
+ KVM_MP_STATE_SIPI_RECEIVED :
+ KVM_MP_STATE_INIT_RECEIVED;
makes much less sense than this:
- vcpu->arch.mp_state = KVM_MP_STATE_WAIT_FOR_SIPI;
+ vcpu->arch.mp_state = kvm_vcpu_is_bsp(vcpu) ?
+ KVM_MP_STATE_RESET_NOW :
+ KVM_MP_STATE_WAIT_FOR_SIPI;
However, there's also Jan's plans for nVMX. Peeking at his queue (see
http://git.kiszka.org/?p=linux-kvm.git;a=commitdiff;h=037fb24ec) I think
it's better to always reflect INITs to the hypervisor like I did in these
patches.
Paolo
On Mon, Mar 11, 2013 at 11:14:39AM +0100, Paolo Bonzini wrote:
> Il 10/03/2013 19:10, Gleb Natapov ha scritto:
> > On Sun, Mar 10, 2013 at 06:19:07PM +0100, Paolo Bonzini wrote:
> >> Il 10/03/2013 16:35, Gleb Natapov ha scritto:
> >>>>> However, it would effectively redefine the meaning of
> >>>>> KVM_MP_STATE_INIT_RECEIVED and KVM_MP_STATE_SIPI_RECEIVED, respectively
> >>>>> to KVM_MP_STATE_WAIT_FOR_SIPI and KVM_MP_STATE_RESETTING. I wasn't sure
> >>>>> if this is considered an API change (personally, I would treat it as one).
> >>>>>
> >>> If it is kernel module internal it definitely is not API change.
> >>> INIT/SIPI handling is a bit ad-hoc right now anyway as Jan noticed. For
> >>> instance INIT does not really resets VCPU. Only after SIPI it is really
> >>> reset, so KVM_MP_STATE_SIPI_RECEIVED is really KVM_MP_STATE_RESET_ME_RIGHT_NOW
> >>> state.
> >>
> >> Yeah, and the current definition is ambiguous (without hypervisor
> >> patches, there's no way to use it as the names would suggest), so
> >> perhaps the right thing to do is to rename the states (old names kept
> >> for backwards compatibility only) and work from there.
> >>
> > I do not see how renaming clarify things. From userspace point of view
> > KVM_MP_STATE_SIPI_RECEIVED does not exists.
>
> Not really true---we do exit with that state and EINTR when we get a
> SIPI. Perhaps that can be changed.
>
That's implementation detail. We can jump to the beginning of the
function instead. Nowhere we document that entering
KVM_MP_STATE_SIPI_RECEIVED state cause KVM_RUN return with EINTR.
> > If AP is hard reset
> > userspase makes it UNINIT, if soft reset it makes it INIT_RECEIVED, if
> > BSP it makes it running no matter what type of reset.
>
> The current name just suggests .
> And when getting an INIT in the in-kernel LAPIC, this:
>
> - vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> + vcpu->arch.mp_state = kvm_vcpu_is_bsp(vcpu) ?
> + KVM_MP_STATE_SIPI_RECEIVED :
> + KVM_MP_STATE_INIT_RECEIVED;
>
> makes much less sense than this:
>
> - vcpu->arch.mp_state = KVM_MP_STATE_WAIT_FOR_SIPI;
> + vcpu->arch.mp_state = kvm_vcpu_is_bsp(vcpu) ?
> + KVM_MP_STATE_RESET_NOW :
> + KVM_MP_STATE_WAIT_FOR_SIPI;
>
Both of them are equally incorrect. INIT should cause reset, and only if
vmx is off. An userspace reset is also completely broken in that regard.
Renaming things gives us nothing, only bring unneeded churn. If the
names were internal I wouldn't mind, but they are APIs.
> However, there's also Jan's plans for nVMX. Peeking at his queue (see
> http://git.kiszka.org/?p=linux-kvm.git;a=commitdiff;h=037fb24ec) I think
> it's better to always reflect INITs to the hypervisor like I did in these
> patches.
>
The commit was before we decided that we should not abuse mp_state for
signaling.
--
Gleb.
Il 11/03/2013 11:28, Gleb Natapov ha scritto:
>> Not really true---we do exit with that state and EINTR when we get a
>> SIPI. Perhaps that can be changed.
>
> That's implementation detail. We can jump to the beginning of the
> function instead. Nowhere we document that entering
> KVM_MP_STATE_SIPI_RECEIVED state cause KVM_RUN return with EINTR.
Yes, that would be nice.
>>> If AP is hard reset
>>> userspase makes it UNINIT, if soft reset it makes it INIT_RECEIVED, if
>>> BSP it makes it running no matter what type of reset.
>>
>> The current name just suggests .
>> And when getting an INIT in the in-kernel LAPIC, this:
>>
>> - vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>> + vcpu->arch.mp_state = kvm_vcpu_is_bsp(vcpu) ?
>> + KVM_MP_STATE_SIPI_RECEIVED :
>> + KVM_MP_STATE_INIT_RECEIVED;
>>
>> makes much less sense than this:
>>
>> - vcpu->arch.mp_state = KVM_MP_STATE_WAIT_FOR_SIPI;
>> + vcpu->arch.mp_state = kvm_vcpu_is_bsp(vcpu) ?
>> + KVM_MP_STATE_RESET_NOW :
>> + KVM_MP_STATE_WAIT_FOR_SIPI;
>>
> Both of them are equally incorrect. INIT should cause reset, and only if
> vmx is off. An userspace reset is also completely broken in that regard.
> Renaming things gives us nothing, only bring unneeded churn. If the
> names were internal I wouldn't mind, but they are APIs.
>
>> However, there's also Jan's plans for nVMX. Peeking at his queue (see
>> http://git.kiszka.org/?p=linux-kvm.git;a=commitdiff;h=037fb24ec) I think
>> it's better to always reflect INITs to the hypervisor like I did in these
>> patches.
>>
> The commit was before we decided that we should not abuse mp_state for
> signaling.
Agreed, but we still have the problem of how to signal from userspace.
For that do you have any other suggestion than mp_state? And if we keep
mp_state to signal from userspace, giving INIT_RECEIVED the
"wait-for-SIPI" semantics would be wrong.
Paolo
On Mon, Mar 11, 2013 at 12:25:57PM +0100, Paolo Bonzini wrote:
> Il 11/03/2013 11:28, Gleb Natapov ha scritto:
> >> Not really true---we do exit with that state and EINTR when we get a
> >> SIPI. Perhaps that can be changed.
> >
> > That's implementation detail. We can jump to the beginning of the
> > function instead. Nowhere we document that entering
> > KVM_MP_STATE_SIPI_RECEIVED state cause KVM_RUN return with EINTR.
>
> Yes, that would be nice.
>
That's not performance critical path, so I guess no one bothered.
> >>> If AP is hard reset
> >>> userspase makes it UNINIT, if soft reset it makes it INIT_RECEIVED, if
> >>> BSP it makes it running no matter what type of reset.
> >>
> >> The current name just suggests .
> >> And when getting an INIT in the in-kernel LAPIC, this:
> >>
> >> - vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> >> + vcpu->arch.mp_state = kvm_vcpu_is_bsp(vcpu) ?
> >> + KVM_MP_STATE_SIPI_RECEIVED :
> >> + KVM_MP_STATE_INIT_RECEIVED;
> >>
> >> makes much less sense than this:
> >>
> >> - vcpu->arch.mp_state = KVM_MP_STATE_WAIT_FOR_SIPI;
> >> + vcpu->arch.mp_state = kvm_vcpu_is_bsp(vcpu) ?
> >> + KVM_MP_STATE_RESET_NOW :
> >> + KVM_MP_STATE_WAIT_FOR_SIPI;
> >>
> > Both of them are equally incorrect. INIT should cause reset, and only if
> > vmx is off. An userspace reset is also completely broken in that regard.
> > Renaming things gives us nothing, only bring unneeded churn. If the
> > names were internal I wouldn't mind, but they are APIs.
> >
> >> However, there's also Jan's plans for nVMX. Peeking at his queue (see
> >> http://git.kiszka.org/?p=linux-kvm.git;a=commitdiff;h=037fb24ec) I think
> >> it's better to always reflect INITs to the hypervisor like I did in these
> >> patches.
> >>
> > The commit was before we decided that we should not abuse mp_state for
> > signaling.
>
> Agreed, but we still have the problem of how to signal from userspace.
> For that do you have any other suggestion than mp_state? And if we keep
> mp_state to signal from userspace, giving INIT_RECEIVED the
> "wait-for-SIPI" semantics would be wrong.
>
I don't see how can we use mp_state for signaling from userspace either.
Currently soft reset always reset vcpus, so it is OK for userspace to
generate reset vcpu state and put it into kernel, mp_state is just one
of the updated states, but when INIT will be just another signal that
may or may not reset cpu or have other side effects like #vmexit this
will not longer work. We will have to have another interface for
injecting INIT from userspace and userspace soft-reset will use it
instead of doing reset by itself.
--
Gleb.
Il 11/03/2013 12:51, Gleb Natapov ha scritto:
>> >
>> > Agreed, but we still have the problem of how to signal from userspace.
>> > For that do you have any other suggestion than mp_state? And if we keep
>> > mp_state to signal from userspace, giving INIT_RECEIVED the
>> > "wait-for-SIPI" semantics would be wrong.
>> >
> I don't see how can we use mp_state for signaling from userspace either.
> Currently soft reset always reset vcpus, so it is OK for userspace to
> generate reset vcpu state and put it into kernel, mp_state is just one
> of the updated states, but when INIT will be just another signal that
> may or may not reset cpu or have other side effects like #vmexit this
> will not longer work. We will have to have another interface for
> injecting INIT from userspace and userspace soft-reset will use it
> instead of doing reset by itself.
Setting the mp_state to INIT_RECEIVED is that interface, and it already
works, for APs at least. This patch extends it to work for the BSP as well.
In the corresponding userspace patch, I don't need to touch the CPU
state at all. I can just signal the kernel. If I touch the CPU, I'll
break the nested case, no matter how it is implemented. So far, the
userspace did not have to worry about nested, and that's something that
should be kept that way.
If we move away from the INIT_RECEIVED and SIPI_RECEIVED states for
in-kernel APIC -> VCPU communication, then the KVM_SET_MP_STATE ioctl
will have to convert them to the right bits in the requests field or in
the APIC state. But I'm starting to see less benefit from moving away
from mp_state.
Paolo
On Mon, Mar 11, 2013 at 02:31:46PM +0100, Paolo Bonzini wrote:
> Il 11/03/2013 12:51, Gleb Natapov ha scritto:
> >> >
> >> > Agreed, but we still have the problem of how to signal from userspace.
> >> > For that do you have any other suggestion than mp_state? And if we keep
> >> > mp_state to signal from userspace, giving INIT_RECEIVED the
> >> > "wait-for-SIPI" semantics would be wrong.
> >> >
> > I don't see how can we use mp_state for signaling from userspace either.
> > Currently soft reset always reset vcpus, so it is OK for userspace to
> > generate reset vcpu state and put it into kernel, mp_state is just one
> > of the updated states, but when INIT will be just another signal that
> > may or may not reset cpu or have other side effects like #vmexit this
> > will not longer work. We will have to have another interface for
> > injecting INIT from userspace and userspace soft-reset will use it
> > instead of doing reset by itself.
>
> Setting the mp_state to INIT_RECEIVED is that interface, and it already
> works, for APs at least. This patch extends it to work for the BSP as well.
>
It does not for AP either. If AP has vmx on mp_sate should not be set to
INIT_RECEIVED. mp_sate is a state as you can see from its name and we
already had a discussion on the generic device API about importance of
separating sending commands from setting state. There is a difference
between setting mp_sate during migration and signaling INIT#.
> In the corresponding userspace patch, I don't need to touch the CPU
> state at all. I can just signal the kernel. If I touch the CPU, I'll
> break the nested case, no matter how it is implemented. So far, the
> userspace did not have to worry about nested, and that's something that
> should be kept that way.
We are discussing two different things here. I'll try to separate them.
1. BSP is broken WRT #INIT
2. nested is broken WRT #INIT
You are fixing 1 with your patches, for that I proposed much easier
solution (at last from kernel point of view): if BSP reset it in
userspace and make it runnable. Nested virt is still broken, but this is
not what you are fixing.
For 2 much more involved fix is needed. Jan fixes it and it will require
signaling INIT# from userspace by other means than mp_sate because
signaling INIT# does not automatically means that mp_sate changes to
INIT_RECEIVED.
>
> If we move away from the INIT_RECEIVED and SIPI_RECEIVED states for
> in-kernel APIC -> VCPU communication, then the KVM_SET_MP_STATE ioctl
> will have to convert them to the right bits in the requests field or in
> the APIC state. But I'm starting to see less benefit from moving away
> from mp_state.
>
We are not moving away from mp_state, we are moving away from using
mp_state for signaling because with nested virt INIT does not always
change mp_state, not only that it can change mp_state long after signal
is received after vmx off is done.
--
Gleb.
On 2013-03-11 14:54, Gleb Natapov wrote:
> On Mon, Mar 11, 2013 at 02:31:46PM +0100, Paolo Bonzini wrote:
>> Il 11/03/2013 12:51, Gleb Natapov ha scritto:
>>>>>
>>>>> Agreed, but we still have the problem of how to signal from userspace.
>>>>> For that do you have any other suggestion than mp_state? And if we keep
>>>>> mp_state to signal from userspace, giving INIT_RECEIVED the
>>>>> "wait-for-SIPI" semantics would be wrong.
>>>>>
>>> I don't see how can we use mp_state for signaling from userspace either.
>>> Currently soft reset always reset vcpus, so it is OK for userspace to
>>> generate reset vcpu state and put it into kernel, mp_state is just one
>>> of the updated states, but when INIT will be just another signal that
>>> may or may not reset cpu or have other side effects like #vmexit this
>>> will not longer work. We will have to have another interface for
>>> injecting INIT from userspace and userspace soft-reset will use it
>>> instead of doing reset by itself.
>>
>> Setting the mp_state to INIT_RECEIVED is that interface, and it already
>> works, for APs at least. This patch extends it to work for the BSP as well.
>>
> It does not for AP either. If AP has vmx on mp_sate should not be set to
> INIT_RECEIVED. mp_sate is a state as you can see from its name and we
> already had a discussion on the generic device API about importance of
> separating sending commands from setting state. There is a difference
> between setting mp_sate during migration and signaling INIT#.
>
>> In the corresponding userspace patch, I don't need to touch the CPU
>> state at all. I can just signal the kernel. If I touch the CPU, I'll
>> break the nested case, no matter how it is implemented. So far, the
>> userspace did not have to worry about nested, and that's something that
>> should be kept that way.
> We are discussing two different things here. I'll try to separate them.
> 1. BSP is broken WRT #INIT
> 2. nested is broken WRT #INIT
>
> You are fixing 1 with your patches, for that I proposed much easier
> solution (at last from kernel point of view): if BSP reset it in
> userspace and make it runnable. Nested virt is still broken, but this is
> not what you are fixing.
>
> For 2 much more involved fix is needed. Jan fixes it and it will require
> signaling INIT# from userspace by other means than mp_sate because
> signaling INIT# does not automatically means that mp_sate changes to
> INIT_RECEIVED.
>
>>
>> If we move away from the INIT_RECEIVED and SIPI_RECEIVED states for
>> in-kernel APIC -> VCPU communication, then the KVM_SET_MP_STATE ioctl
>> will have to convert them to the right bits in the requests field or in
>> the APIC state. But I'm starting to see less benefit from moving away
>> from mp_state.
>>
> We are not moving away from mp_state, we are moving away from using
> mp_state for signaling because with nested virt INIT does not always
> change mp_state, not only that it can change mp_state long after signal
> is received after vmx off is done.
Right.
BTW, for that to happen, we will also need to influence the INIT level.
Unless I misread the spec, INIT is blocked while in root mode, and if
you deassert INIT before leaving root (vmxoff, vmenter), nothing
actually happens. So what matters is the INIT signal level at the exit
of root mode.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
> > We are not moving away from mp_state, we are moving away from using
> > mp_state for signaling because with nested virt INIT does not always
> > change mp_state, not only that it can change mp_state long after signal
> > is received after vmx off is done.
>
> Right.
>
> BTW, for that to happen, we will also need to influence the INIT level.
> Unless I misread the spec, INIT is blocked while in root mode, and if
> you deassert INIT before leaving root (vmxoff, vmenter), nothing
> actually happens. So what matters is the INIT signal level at the exit
> of root mode.
>
You are talking about INIT# signal received via CPU pin, right? I think
INIT send by IPI cannot go away.
--
Gleb.
On 2013-03-11 15:05, Gleb Natapov wrote:
> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
>>> We are not moving away from mp_state, we are moving away from using
>>> mp_state for signaling because with nested virt INIT does not always
>>> change mp_state, not only that it can change mp_state long after signal
>>> is received after vmx off is done.
>>
>> Right.
>>
>> BTW, for that to happen, we will also need to influence the INIT level.
>> Unless I misread the spec, INIT is blocked while in root mode, and if
>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
>> actually happens. So what matters is the INIT signal level at the exit
>> of root mode.
>>
> You are talking about INIT# signal received via CPU pin, right? I think
> INIT send by IPI cannot go away.
Why shouldn't it? Besides edge, there is also level-triggered INIT.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
On Mon, Mar 11, 2013 at 03:06:18PM +0100, Jan Kiszka wrote:
> On 2013-03-11 15:05, Gleb Natapov wrote:
> > On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
> >>> We are not moving away from mp_state, we are moving away from using
> >>> mp_state for signaling because with nested virt INIT does not always
> >>> change mp_state, not only that it can change mp_state long after signal
> >>> is received after vmx off is done.
> >>
> >> Right.
> >>
> >> BTW, for that to happen, we will also need to influence the INIT level.
> >> Unless I misread the spec, INIT is blocked while in root mode, and if
> >> you deassert INIT before leaving root (vmxoff, vmenter), nothing
> >> actually happens. So what matters is the INIT signal level at the exit
> >> of root mode.
> >>
> > You are talking about INIT# signal received via CPU pin, right? I think
> > INIT send by IPI cannot go away.
>
> Why shouldn't it? Besides edge, there is also level-triggered INIT.
>
OK, so level-triggered INIT can be de-asserted what about edge triggered
one? :)
--
Gleb.
On 2013-03-11 15:09, Gleb Natapov wrote:
> On Mon, Mar 11, 2013 at 03:06:18PM +0100, Jan Kiszka wrote:
>> On 2013-03-11 15:05, Gleb Natapov wrote:
>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
>>>>> We are not moving away from mp_state, we are moving away from using
>>>>> mp_state for signaling because with nested virt INIT does not always
>>>>> change mp_state, not only that it can change mp_state long after signal
>>>>> is received after vmx off is done.
>>>>
>>>> Right.
>>>>
>>>> BTW, for that to happen, we will also need to influence the INIT level.
>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
>>>> actually happens. So what matters is the INIT signal level at the exit
>>>> of root mode.
>>>>
>>> You are talking about INIT# signal received via CPU pin, right? I think
>>> INIT send by IPI cannot go away.
>>
>> Why shouldn't it? Besides edge, there is also level-triggered INIT.
>>
> OK, so level-triggered INIT can be de-asserted what about edge triggered
> one? :)
It should be lost while in root mode.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
On Mon, Mar 11, 2013 at 03:10:45PM +0100, Jan Kiszka wrote:
> On 2013-03-11 15:09, Gleb Natapov wrote:
> > On Mon, Mar 11, 2013 at 03:06:18PM +0100, Jan Kiszka wrote:
> >> On 2013-03-11 15:05, Gleb Natapov wrote:
> >>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
> >>>>> We are not moving away from mp_state, we are moving away from using
> >>>>> mp_state for signaling because with nested virt INIT does not always
> >>>>> change mp_state, not only that it can change mp_state long after signal
> >>>>> is received after vmx off is done.
> >>>>
> >>>> Right.
> >>>>
> >>>> BTW, for that to happen, we will also need to influence the INIT level.
> >>>> Unless I misread the spec, INIT is blocked while in root mode, and if
> >>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
> >>>> actually happens. So what matters is the INIT signal level at the exit
> >>>> of root mode.
> >>>>
> >>> You are talking about INIT# signal received via CPU pin, right? I think
> >>> INIT send by IPI cannot go away.
> >>
> >> Why shouldn't it? Besides edge, there is also level-triggered INIT.
> >>
> > OK, so level-triggered INIT can be de-asserted what about edge triggered
> > one? :)
>
> It should be lost while in root mode.
>
Ah, that's great. Removes some potential complications.
--
Gleb.
On 2013-03-11 15:12, Gleb Natapov wrote:
> On Mon, Mar 11, 2013 at 03:10:45PM +0100, Jan Kiszka wrote:
>> On 2013-03-11 15:09, Gleb Natapov wrote:
>>> On Mon, Mar 11, 2013 at 03:06:18PM +0100, Jan Kiszka wrote:
>>>> On 2013-03-11 15:05, Gleb Natapov wrote:
>>>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
>>>>>>> We are not moving away from mp_state, we are moving away from using
>>>>>>> mp_state for signaling because with nested virt INIT does not always
>>>>>>> change mp_state, not only that it can change mp_state long after signal
>>>>>>> is received after vmx off is done.
>>>>>>
>>>>>> Right.
>>>>>>
>>>>>> BTW, for that to happen, we will also need to influence the INIT level.
>>>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
>>>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
>>>>>> actually happens. So what matters is the INIT signal level at the exit
>>>>>> of root mode.
>>>>>>
>>>>> You are talking about INIT# signal received via CPU pin, right? I think
>>>>> INIT send by IPI cannot go away.
>>>>
>>>> Why shouldn't it? Besides edge, there is also level-triggered INIT.
>>>>
>>> OK, so level-triggered INIT can be de-asserted what about edge triggered
>>> one? :)
>>
>> It should be lost while in root mode.
>>
> Ah, that's great. Removes some potential complications.
Again, that's my interpretation. I didn't check this against real HW
yet. It just makes most sense, given what I read.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
Il 11/03/2013 15:05, Gleb Natapov ha scritto:
> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
>>> We are not moving away from mp_state, we are moving away from using
>>> mp_state for signaling because with nested virt INIT does not always
>>> change mp_state, not only that it can change mp_state long after signal
>>> is received after vmx off is done.
>>
>> Right.
>>
>> BTW, for that to happen, we will also need to influence the INIT level.
>> Unless I misread the spec, INIT is blocked while in root mode, and if
>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
>> actually happens. So what matters is the INIT signal level at the exit
>> of root mode.
>>
> You are talking about INIT# signal received via CPU pin, right? I think
> INIT send by IPI cannot go away.
Neither can go away. For INIT sent by IPI, 10.4.7 says:
Only the Pentium and P6 family processors support the INIT-deassert IPI.
An INIT-disassert IPI has no affect on the state of the APIC, other than
to reload the arbitration ID register with the value in the APIC ID
register.
18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
deassert) are always treated as edge triggered interrupts".
For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
clocks" when a soft reset is requested. So we can guess that INIT# is
also edge-triggered.
Paolo
Il 11/03/2013 14:54, Gleb Natapov ha scritto:
>> Setting the mp_state to INIT_RECEIVED is that interface, and it already
>> works, for APs at least. This patch extends it to work for the BSP as well.
>
> It does not for AP either. If AP has vmx on mp_sate should not be set to
> INIT_RECEIVED. mp_sate is a state as you can see from its name and we
> already had a discussion on the generic device API about importance of
> separating sending commands from setting state. There is a difference
> between setting mp_state during migration and signaling INIT#.
What does migration have to do with this?
>> In the corresponding userspace patch, I don't need to touch the CPU
>> state at all. I can just signal the kernel. If I touch the CPU, I'll
>> break the nested case, no matter how it is implemented. So far, the
>> userspace did not have to worry about nested, and that's something that
>> should be kept that way.
> We are discussing two different things here. I'll try to separate them.
> 1. BSP is broken WRT #INIT
> 2. nested is broken WRT #INIT
>
> You are fixing 1 with your patches, for that I proposed much easier
> solution (at last from kernel point of view): if BSP reset it in
> userspace and make it runnable. Nested virt is still broken, but this is
> not what you are fixing.
It's not what I'm fixing, but I don't want to make the fix for nested
virt unnecessarily more complicated. Nested virt needs to know about
INIT and SIPI; redefining the meaning of INIT_RECEIVED and SIPI_RECEIVED
makes it more complicated to reflect these events to L1.
> For 2 much more involved fix is needed. Jan fixes it and it will require
> signaling INIT# from userspace by other means than mp_sate because
> signaling INIT# does not automatically means that mp_sate changes to
> INIT_RECEIVED.
In your interpretation of INIT_RECEIVED, no. In mine, yes...
>> If we move away from the INIT_RECEIVED and SIPI_RECEIVED states for
>> in-kernel APIC -> VCPU communication, then the KVM_SET_MP_STATE ioctl
>> will have to convert them to the right bits in the requests field or in
>> the APIC state. But I'm starting to see less benefit from moving away
>> from mp_state.
>>
> We are not moving away from mp_state, we are moving away from using
> mp_state for signaling
That's what I meant; sorry for the unclear abbreviation.
> because with nested virt INIT does not always
> change mp_state
Why not? It does change mp_state, it changes how you react to the
change. Which is why it's good to have the reset done in kernel space,
not in user space.
Paolo
> , not only that it can change mp_state long after signal
> is received after vmx off is done.
>
> --
> Gleb.
>
On 2013-03-11 15:23, Paolo Bonzini wrote:
> Il 11/03/2013 15:05, Gleb Natapov ha scritto:
>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
>>>> We are not moving away from mp_state, we are moving away from using
>>>> mp_state for signaling because with nested virt INIT does not always
>>>> change mp_state, not only that it can change mp_state long after signal
>>>> is received after vmx off is done.
>>>
>>> Right.
>>>
>>> BTW, for that to happen, we will also need to influence the INIT level.
>>> Unless I misread the spec, INIT is blocked while in root mode, and if
>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
>>> actually happens. So what matters is the INIT signal level at the exit
>>> of root mode.
>>>
>> You are talking about INIT# signal received via CPU pin, right? I think
>> INIT send by IPI cannot go away.
>
> Neither can go away. For INIT sent by IPI, 10.4.7 says:
>
> Only the Pentium and P6 family processors support the INIT-deassert IPI.
> An INIT-disassert IPI has no affect on the state of the APIC, other than
> to reload the arbitration ID register with the value in the APIC ID
> register.
>
> 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
> deassert) are always treated as edge triggered interrupts".
>
>
> For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
> clocks" when a soft reset is requested. So we can guess that INIT# is
> also edge-triggered.
Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
form of a reset or a vmexit.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
On Mon, Mar 11, 2013 at 03:28:03PM +0100, Paolo Bonzini wrote:
> Il 11/03/2013 14:54, Gleb Natapov ha scritto:
> >> Setting the mp_state to INIT_RECEIVED is that interface, and it already
> >> works, for APs at least. This patch extends it to work for the BSP as well.
> >
> > It does not for AP either. If AP has vmx on mp_sate should not be set to
> > INIT_RECEIVED. mp_sate is a state as you can see from its name and we
> > already had a discussion on the generic device API about importance of
> > separating sending commands from setting state. There is a difference
> > between setting mp_state during migration and signaling INIT#.
>
> What does migration have to do with this?
>
get|set_mpstate is used by migration. Actually this is primary reason
for this interface existence.
> >> In the corresponding userspace patch, I don't need to touch the CPU
> >> state at all. I can just signal the kernel. If I touch the CPU, I'll
> >> break the nested case, no matter how it is implemented. So far, the
> >> userspace did not have to worry about nested, and that's something that
> >> should be kept that way.
> > We are discussing two different things here. I'll try to separate them.
> > 1. BSP is broken WRT #INIT
> > 2. nested is broken WRT #INIT
> >
> > You are fixing 1 with your patches, for that I proposed much easier
> > solution (at last from kernel point of view): if BSP reset it in
> > userspace and make it runnable. Nested virt is still broken, but this is
> > not what you are fixing.
>
> It's not what I'm fixing, but I don't want to make the fix for nested
What are you fixing then?
> virt unnecessarily more complicated. Nested virt needs to know about
> INIT and SIPI; redefining the meaning of INIT_RECEIVED and SIPI_RECEIVED
> makes it more complicated to reflect these events to L1.
>
> > For 2 much more involved fix is needed. Jan fixes it and it will require
> > signaling INIT# from userspace by other means than mp_sate because
> > signaling INIT# does not automatically means that mp_sate changes to
> > INIT_RECEIVED.
>
> In your interpretation of INIT_RECEIVED, no. In mine, yes...
>
Your code shows different. With your patch setting mp_state to
INIT_RECEIVED makes vcpu non tunable. This is incorrect if INIT_RECEIVED
is "INIT# is triggered" interface.
> >> If we move away from the INIT_RECEIVED and SIPI_RECEIVED states for
> >> in-kernel APIC -> VCPU communication, then the KVM_SET_MP_STATE ioctl
> >> will have to convert them to the right bits in the requests field or in
> >> the APIC state. But I'm starting to see less benefit from moving away
> >> from mp_state.
> >>
> > We are not moving away from mp_state, we are moving away from using
> > mp_state for signaling
>
> That's what I meant; sorry for the unclear abbreviation.
Then we disagree.
>
> > because with nested virt INIT does not always
> > change mp_state
>
> Why not?
Because mp_state is the current state the vcpu is in. It can be
uninitialized, runnable, halted or wait for sipi. SDM says that
if nested virt is enabled vcpu does not enter wait for sipi state
on INIT#.
> It does change mp_state, it changes how you react to the
> change.
How does it? CPU was runnable before it is still runnable after.
> Which is why it's good to have the reset done in kernel space,
> not in user space.
>
Without nested virt it does not really matter and if it is does not
really matter you do not add code to the kernel just because it is good.
With nested virt INIT# processing needs to go to the kernel. In some
cases INIT will cause reset, but you do not "do reset in kernel space",
you do "INIT# handling in kernel space".
--
Gleb.
On Mon, Mar 11, 2013 at 04:36:33PM +0100, Jan Kiszka wrote:
> On 2013-03-11 15:23, Paolo Bonzini wrote:
> > Il 11/03/2013 15:05, Gleb Natapov ha scritto:
> >> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
> >>>> We are not moving away from mp_state, we are moving away from using
> >>>> mp_state for signaling because with nested virt INIT does not always
> >>>> change mp_state, not only that it can change mp_state long after signal
> >>>> is received after vmx off is done.
> >>>
> >>> Right.
> >>>
> >>> BTW, for that to happen, we will also need to influence the INIT level.
> >>> Unless I misread the spec, INIT is blocked while in root mode, and if
> >>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
> >>> actually happens. So what matters is the INIT signal level at the exit
> >>> of root mode.
> >>>
> >> You are talking about INIT# signal received via CPU pin, right? I think
> >> INIT send by IPI cannot go away.
> >
> > Neither can go away. For INIT sent by IPI, 10.4.7 says:
> >
> > Only the Pentium and P6 family processors support the INIT-deassert IPI.
> > An INIT-disassert IPI has no affect on the state of the APIC, other than
> > to reload the arbitration ID register with the value in the APIC ID
> > register.
> >
> > 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
> > deassert) are always treated as edge triggered interrupts".
> >
> >
> > For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
> > clocks" when a soft reset is requested. So we can guess that INIT# is
> > also edge-triggered.
>
> Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
> form of a reset or a vmexit.
>
vmexit clears it?
--
Gleb.
On 2013-03-11 18:23, Gleb Natapov wrote:
> On Mon, Mar 11, 2013 at 04:36:33PM +0100, Jan Kiszka wrote:
>> On 2013-03-11 15:23, Paolo Bonzini wrote:
>>> Il 11/03/2013 15:05, Gleb Natapov ha scritto:
>>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
>>>>>> We are not moving away from mp_state, we are moving away from using
>>>>>> mp_state for signaling because with nested virt INIT does not always
>>>>>> change mp_state, not only that it can change mp_state long after signal
>>>>>> is received after vmx off is done.
>>>>>
>>>>> Right.
>>>>>
>>>>> BTW, for that to happen, we will also need to influence the INIT level.
>>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
>>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
>>>>> actually happens. So what matters is the INIT signal level at the exit
>>>>> of root mode.
>>>>>
>>>> You are talking about INIT# signal received via CPU pin, right? I think
>>>> INIT send by IPI cannot go away.
>>>
>>> Neither can go away. For INIT sent by IPI, 10.4.7 says:
>>>
>>> Only the Pentium and P6 family processors support the INIT-deassert IPI.
>>> An INIT-disassert IPI has no affect on the state of the APIC, other than
>>> to reload the arbitration ID register with the value in the APIC ID
>>> register.
>>>
>>> 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
>>> deassert) are always treated as edge triggered interrupts".
>>>
>>>
>>> For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
>>> clocks" when a soft reset is requested. So we can guess that INIT# is
>>> also edge-triggered.
>>
>> Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
>> form of a reset or a vmexit.
>>
> vmexit clears it?
It has to. Otherwise, it would hit the host on vmxoff.
The spec says: "The INIT signal is blocked whenever a logical processor
is in VMX root operation. It is not blocked in VMX non-root operation.
Instead, INITs cause VM exits [...]."
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
On 2013-03-11 18:34, Jan Kiszka wrote:
> On 2013-03-11 18:23, Gleb Natapov wrote:
>> On Mon, Mar 11, 2013 at 04:36:33PM +0100, Jan Kiszka wrote:
>>> On 2013-03-11 15:23, Paolo Bonzini wrote:
>>>> Il 11/03/2013 15:05, Gleb Natapov ha scritto:
>>>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
>>>>>>> We are not moving away from mp_state, we are moving away from using
>>>>>>> mp_state for signaling because with nested virt INIT does not always
>>>>>>> change mp_state, not only that it can change mp_state long after signal
>>>>>>> is received after vmx off is done.
>>>>>>
>>>>>> Right.
>>>>>>
>>>>>> BTW, for that to happen, we will also need to influence the INIT level.
>>>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
>>>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
>>>>>> actually happens. So what matters is the INIT signal level at the exit
>>>>>> of root mode.
>>>>>>
>>>>> You are talking about INIT# signal received via CPU pin, right? I think
>>>>> INIT send by IPI cannot go away.
>>>>
>>>> Neither can go away. For INIT sent by IPI, 10.4.7 says:
>>>>
>>>> Only the Pentium and P6 family processors support the INIT-deassert IPI.
>>>> An INIT-disassert IPI has no affect on the state of the APIC, other than
>>>> to reload the arbitration ID register with the value in the APIC ID
>>>> register.
>>>>
>>>> 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
>>>> deassert) are always treated as edge triggered interrupts".
>>>>
>>>>
>>>> For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
>>>> clocks" when a soft reset is requested. So we can guess that INIT# is
>>>> also edge-triggered.
>>>
>>> Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
>>> form of a reset or a vmexit.
>>>
>> vmexit clears it?
>
> It has to. Otherwise, it would hit the host on vmxoff.
>
> The spec says: "The INIT signal is blocked whenever a logical processor
> is in VMX root operation. It is not blocked in VMX non-root operation.
> Instead, INITs cause VM exits [...]."
BTW, we have a similar behaviour with SVM, just that GIF and the INIT
interception flags controls the processing: When in guest mode and
interception is enabled, INIT causes vmexit. When in host mode and GIF
is 0, INIT is blocked and held pending until reaching guest mode again
or switching on GIF explicitly.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
Il 11/03/2013 18:20, Gleb Natapov ha scritto:
> On Mon, Mar 11, 2013 at 03:28:03PM +0100, Paolo Bonzini wrote:
>> Il 11/03/2013 14:54, Gleb Natapov ha scritto:
>>>> Setting the mp_state to INIT_RECEIVED is that interface, and it already
>>>> works, for APs at least. This patch extends it to work for the BSP as well.
>>>
>>> It does not for AP either. If AP has vmx on mp_sate should not be set to
>>> INIT_RECEIVED. mp_sate is a state as you can see from its name and we
>>> already had a discussion on the generic device API about importance of
>>> separating sending commands from setting state. There is a difference
>>> between setting mp_state during migration and signaling INIT#.
>>
>> What does migration have to do with this?
>
> get|set_mpstate is used by migration. Actually this is primary reason
> for this interface existence.
Does it have to be the only one?
>>>> In the corresponding userspace patch, I don't need to touch the CPU
>>>> state at all. I can just signal the kernel. If I touch the CPU, I'll
>>>> break the nested case, no matter how it is implemented. So far, the
>>>> userspace did not have to worry about nested, and that's something that
>>>> should be kept that way.
>>> We are discussing two different things here. I'll try to separate them.
>>> 1. BSP is broken WRT #INIT
>>> 2. nested is broken WRT #INIT
>>>
>>> You are fixing 1 with your patches, for that I proposed much easier
>>> solution (at last from kernel point of view): if BSP reset it in
>>> userspace and make it runnable. Nested virt is still broken, but this is
>>> not what you are fixing.
>>
>> It's not what I'm fixing, but I don't want to make the fix for nested
>
> What are you fixing then?
Nested virt is not what I am fixing, but I'm trying to keep an eye on
that (and the other INIT race) while doing these patches.
>> virt unnecessarily more complicated. Nested virt needs to know about
>> INIT and SIPI; redefining the meaning of INIT_RECEIVED and SIPI_RECEIVED
>> makes it more complicated to reflect these events to L1.
>>
>>> For 2 much more involved fix is needed. Jan fixes it and it will require
>>> signaling INIT# from userspace by other means than mp_sate because
>>> signaling INIT# does not automatically means that mp_sate changes to
>>> INIT_RECEIVED.
>>
>> In your interpretation of INIT_RECEIVED, no. In mine, yes...
>
> Your code shows different. With your patch setting mp_state to
> INIT_RECEIVED makes vcpu non tunable. This is incorrect if INIT_RECEIVED
> is "INIT# is triggered" interface.
What do you mean by "non tunable"? In non-nested mode, the VCPU will
reset immediately, as soon as it is re-entered. In nested mode, the
VCPU will eat the INIT_RECEIVED and turn it into a vmexit.
At least according to AMD's docs, the VMM has to reassert INIT if it
wants the processor to actually process it [15.20.8 INIT support].
Intel's does not say it explicitly, but it doesn't say the opposite
either. It seems to be the only that makes sense.
>>>> If we move away from the INIT_RECEIVED and SIPI_RECEIVED states for
>>>> in-kernel APIC -> VCPU communication, then the KVM_SET_MP_STATE ioctl
>>>> will have to convert them to the right bits in the requests field or in
>>>> the APIC state. But I'm starting to see less benefit from moving away
>>>> from mp_state.
>>>>
>>> We are not moving away from mp_state, we are moving away from using
>>> mp_state for signaling
>>
>> That's what I meant; sorry for the unclear abbreviation.
>
> Then we disagree.
We do. Let's see _where_ exactly we disagree.
>>> because with nested virt INIT does not always
>>> change mp_state
>>
>> Why not?
>
> Because mp_state is the current state the vcpu is in. It can be
> uninitialized, runnable, halted or wait for sipi. SDM says that
> if nested virt is enabled vcpu does not enter wait for sipi state
> on INIT#.
Yes, but it still has to do something (a vmexit) and go back to RUNNING.
So it needs signaling from userspace to the kernel.
>> Which is why it's good to have the reset done in kernel space,
>> not in user space.
>
> Without nested virt it does not really matter and if it is does not
> really matter you do not add code to the kernel just because it is good.
> With nested virt INIT# processing needs to go to the kernel. In some
> cases INIT will cause reset, but you do not "do reset in kernel space",
> you do "INIT# handling in kernel space".
We agree on this. What I add is: let's define the API so that it is
nested-friendly. This means having a signaling mechanism for userspace.
I think you do not want mp_state to be this signaling mechanism. Why
not? Can an existing ioctl be the alternative or do we need to invent a
new one?
Paolo
On Mon, Mar 11, 2013 at 06:34:03PM +0100, Jan Kiszka wrote:
> On 2013-03-11 18:23, Gleb Natapov wrote:
> > On Mon, Mar 11, 2013 at 04:36:33PM +0100, Jan Kiszka wrote:
> >> On 2013-03-11 15:23, Paolo Bonzini wrote:
> >>> Il 11/03/2013 15:05, Gleb Natapov ha scritto:
> >>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
> >>>>>> We are not moving away from mp_state, we are moving away from using
> >>>>>> mp_state for signaling because with nested virt INIT does not always
> >>>>>> change mp_state, not only that it can change mp_state long after signal
> >>>>>> is received after vmx off is done.
> >>>>>
> >>>>> Right.
> >>>>>
> >>>>> BTW, for that to happen, we will also need to influence the INIT level.
> >>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
> >>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
> >>>>> actually happens. So what matters is the INIT signal level at the exit
> >>>>> of root mode.
> >>>>>
> >>>> You are talking about INIT# signal received via CPU pin, right? I think
> >>>> INIT send by IPI cannot go away.
> >>>
> >>> Neither can go away. For INIT sent by IPI, 10.4.7 says:
> >>>
> >>> Only the Pentium and P6 family processors support the INIT-deassert IPI.
> >>> An INIT-disassert IPI has no affect on the state of the APIC, other than
> >>> to reload the arbitration ID register with the value in the APIC ID
> >>> register.
> >>>
> >>> 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
> >>> deassert) are always treated as edge triggered interrupts".
> >>>
> >>>
> >>> For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
> >>> clocks" when a soft reset is requested. So we can guess that INIT# is
> >>> also edge-triggered.
> >>
> >> Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
> >> form of a reset or a vmexit.
> >>
> > vmexit clears it?
>
> It has to. Otherwise, it would hit the host on vmxoff.
>
Why do you thing this is not happening?
Look at [1] page 10 "VMX and INIT blocking". Do you think they were
lucky to hit CPU while it was in a root mode?
[1] http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
--
Gleb.
On Mon, Mar 11, 2013 at 06:39:44PM +0100, Paolo Bonzini wrote:
> Il 11/03/2013 18:20, Gleb Natapov ha scritto:
> > On Mon, Mar 11, 2013 at 03:28:03PM +0100, Paolo Bonzini wrote:
> >> Il 11/03/2013 14:54, Gleb Natapov ha scritto:
> >>>> Setting the mp_state to INIT_RECEIVED is that interface, and it already
> >>>> works, for APs at least. This patch extends it to work for the BSP as well.
> >>>
> >>> It does not for AP either. If AP has vmx on mp_sate should not be set to
> >>> INIT_RECEIVED. mp_sate is a state as you can see from its name and we
> >>> already had a discussion on the generic device API about importance of
> >>> separating sending commands from setting state. There is a difference
> >>> between setting mp_state during migration and signaling INIT#.
> >>
> >> What does migration have to do with this?
> >
> > get|set_mpstate is used by migration. Actually this is primary reason
> > for this interface existence.
>
> Does it have to be the only one?
Nope, but it does prevent it from been used as even injection mechanism.
>
> >>>> In the corresponding userspace patch, I don't need to touch the CPU
> >>>> state at all. I can just signal the kernel. If I touch the CPU, I'll
> >>>> break the nested case, no matter how it is implemented. So far, the
> >>>> userspace did not have to worry about nested, and that's something that
> >>>> should be kept that way.
> >>> We are discussing two different things here. I'll try to separate them.
> >>> 1. BSP is broken WRT #INIT
> >>> 2. nested is broken WRT #INIT
> >>>
> >>> You are fixing 1 with your patches, for that I proposed much easier
> >>> solution (at last from kernel point of view): if BSP reset it in
> >>> userspace and make it runnable. Nested virt is still broken, but this is
> >>> not what you are fixing.
> >>
> >> It's not what I'm fixing, but I don't want to make the fix for nested
> >
> > What are you fixing then?
>
> Nested virt is not what I am fixing, but I'm trying to keep an eye on
> that (and the other INIT race) while doing these patches.
>
So you do fixing 1 then?
> >> virt unnecessarily more complicated. Nested virt needs to know about
> >> INIT and SIPI; redefining the meaning of INIT_RECEIVED and SIPI_RECEIVED
> >> makes it more complicated to reflect these events to L1.
> >>
> >>> For 2 much more involved fix is needed. Jan fixes it and it will require
> >>> signaling INIT# from userspace by other means than mp_sate because
> >>> signaling INIT# does not automatically means that mp_sate changes to
> >>> INIT_RECEIVED.
> >>
> >> In your interpretation of INIT_RECEIVED, no. In mine, yes...
> >
> > Your code shows different. With your patch setting mp_state to
> > INIT_RECEIVED makes vcpu non tunable. This is incorrect if INIT_RECEIVED
> > is "INIT# is triggered" interface.
>
> What do you mean by "non tunable"? In non-nested mode, the VCPU will
Sorry, it should have been "not runnable".
> reset immediately, as soon as it is re-entered. In nested mode, the
> VCPU will eat the INIT_RECEIVED and turn it into a vmexit.
>
It should not eat it. It should be pending until vmxoff. And processing
INIT_RECEIVED separately far away from other events that causes vmexit
is cumbersome. INIT# is just an event that can be masked, why treat it
differently from IRQs?
> At least according to AMD's docs, the VMM has to reassert INIT if it
> wants the processor to actually process it [15.20.8 INIT support].
> Intel's does not say it explicitly, but it doesn't say the opposite
> either. It seems to be the only that makes sense.
Well [1] claims differently. Page 10 "VMX and INIT blocking".
[1] http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
>
> >>>> If we move away from the INIT_RECEIVED and SIPI_RECEIVED states for
> >>>> in-kernel APIC -> VCPU communication, then the KVM_SET_MP_STATE ioctl
> >>>> will have to convert them to the right bits in the requests field or in
> >>>> the APIC state. But I'm starting to see less benefit from moving away
> >>>> from mp_state.
> >>>>
> >>> We are not moving away from mp_state, we are moving away from using
> >>> mp_state for signaling
> >>
> >> That's what I meant; sorry for the unclear abbreviation.
> >
> > Then we disagree.
>
> We do. Let's see _where_ exactly we disagree.
My guess would be this: for me mp_state is current vcpu state, for you
it is a way of passing events _and_ current vcpu state when you do not
pass events there. Correct me if I am wrong.
>
> >>> because with nested virt INIT does not always
> >>> change mp_state
> >>
> >> Why not?
> >
> > Because mp_state is the current state the vcpu is in. It can be
> > uninitialized, runnable, halted or wait for sipi. SDM says that
> > if nested virt is enabled vcpu does not enter wait for sipi state
> > on INIT#.
>
> Yes, but it still has to do something (a vmexit) and go back to RUNNING.
Or go back to HALT and when you overwrote current running state how
do you know which one is that?. And how is it different from IRQs
or NMI. We do not pass them in mp_state, not because we can't, just
introduce EVENT_PENDING state, because it does not make sense.
> So it needs signaling from userspace to the kernel.
>
And do I say it does not need it? I am saying mp_state is not it.
> >> Which is why it's good to have the reset done in kernel space,
> >> not in user space.
> >
> > Without nested virt it does not really matter and if it is does not
> > really matter you do not add code to the kernel just because it is good.
> > With nested virt INIT# processing needs to go to the kernel. In some
> > cases INIT will cause reset, but you do not "do reset in kernel space",
> > you do "INIT# handling in kernel space".
>
> We agree on this. What I add is: let's define the API so that it is
> nested-friendly. This means having a signaling mechanism for userspace.
> I think you do not want mp_state to be this signaling mechanism. Why
> not? Can an existing ioctl be the alternative or do we need to invent a
> new one?
>
I explained why not above and in my previous emails. For the second question
it is implementation detail. I hope we can extend existing interface,
but if we cannot we add another one.
--
Gleb.
On 2013-03-11 18:41, Gleb Natapov wrote:
> On Mon, Mar 11, 2013 at 06:34:03PM +0100, Jan Kiszka wrote:
>> On 2013-03-11 18:23, Gleb Natapov wrote:
>>> On Mon, Mar 11, 2013 at 04:36:33PM +0100, Jan Kiszka wrote:
>>>> On 2013-03-11 15:23, Paolo Bonzini wrote:
>>>>> Il 11/03/2013 15:05, Gleb Natapov ha scritto:
>>>>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
>>>>>>>> We are not moving away from mp_state, we are moving away from using
>>>>>>>> mp_state for signaling because with nested virt INIT does not always
>>>>>>>> change mp_state, not only that it can change mp_state long after signal
>>>>>>>> is received after vmx off is done.
>>>>>>>
>>>>>>> Right.
>>>>>>>
>>>>>>> BTW, for that to happen, we will also need to influence the INIT level.
>>>>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
>>>>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
>>>>>>> actually happens. So what matters is the INIT signal level at the exit
>>>>>>> of root mode.
>>>>>>>
>>>>>> You are talking about INIT# signal received via CPU pin, right? I think
>>>>>> INIT send by IPI cannot go away.
>>>>>
>>>>> Neither can go away. For INIT sent by IPI, 10.4.7 says:
>>>>>
>>>>> Only the Pentium and P6 family processors support the INIT-deassert IPI.
>>>>> An INIT-disassert IPI has no affect on the state of the APIC, other than
>>>>> to reload the arbitration ID register with the value in the APIC ID
>>>>> register.
>>>>>
>>>>> 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
>>>>> deassert) are always treated as edge triggered interrupts".
>>>>>
>>>>>
>>>>> For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
>>>>> clocks" when a soft reset is requested. So we can guess that INIT# is
>>>>> also edge-triggered.
>>>>
>>>> Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
>>>> form of a reset or a vmexit.
>>>>
>>> vmexit clears it?
>>
>> It has to. Otherwise, it would hit the host on vmxoff.
>>
> Why do you thing this is not happening?
>
> Look at [1] page 10 "VMX and INIT blocking". Do you think they were
> lucky to hit CPU while it was in a root mode?
>
> [1] http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
Interesting. And confusing. If a VMM cannot "consume" INIT events by
reentering the guest nor postpone those events up to that point if they
arrived in root mode, the whole vmexit-on-INIT thing is practically
useless. I wonder what use case Intel had in mind while designing this.
That article claims they tested it, though via MSI injection, so they
are probably right, at least for the used CPU type.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
On Mon, Mar 11, 2013 at 07:05:48PM +0100, Jan Kiszka wrote:
> On 2013-03-11 18:41, Gleb Natapov wrote:
> > On Mon, Mar 11, 2013 at 06:34:03PM +0100, Jan Kiszka wrote:
> >> On 2013-03-11 18:23, Gleb Natapov wrote:
> >>> On Mon, Mar 11, 2013 at 04:36:33PM +0100, Jan Kiszka wrote:
> >>>> On 2013-03-11 15:23, Paolo Bonzini wrote:
> >>>>> Il 11/03/2013 15:05, Gleb Natapov ha scritto:
> >>>>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
> >>>>>>>> We are not moving away from mp_state, we are moving away from using
> >>>>>>>> mp_state for signaling because with nested virt INIT does not always
> >>>>>>>> change mp_state, not only that it can change mp_state long after signal
> >>>>>>>> is received after vmx off is done.
> >>>>>>>
> >>>>>>> Right.
> >>>>>>>
> >>>>>>> BTW, for that to happen, we will also need to influence the INIT level.
> >>>>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
> >>>>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
> >>>>>>> actually happens. So what matters is the INIT signal level at the exit
> >>>>>>> of root mode.
> >>>>>>>
> >>>>>> You are talking about INIT# signal received via CPU pin, right? I think
> >>>>>> INIT send by IPI cannot go away.
> >>>>>
> >>>>> Neither can go away. For INIT sent by IPI, 10.4.7 says:
> >>>>>
> >>>>> Only the Pentium and P6 family processors support the INIT-deassert IPI.
> >>>>> An INIT-disassert IPI has no affect on the state of the APIC, other than
> >>>>> to reload the arbitration ID register with the value in the APIC ID
> >>>>> register.
> >>>>>
> >>>>> 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
> >>>>> deassert) are always treated as edge triggered interrupts".
> >>>>>
> >>>>>
> >>>>> For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
> >>>>> clocks" when a soft reset is requested. So we can guess that INIT# is
> >>>>> also edge-triggered.
> >>>>
> >>>> Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
> >>>> form of a reset or a vmexit.
> >>>>
> >>> vmexit clears it?
> >>
> >> It has to. Otherwise, it would hit the host on vmxoff.
> >>
> > Why do you thing this is not happening?
> >
> > Look at [1] page 10 "VMX and INIT blocking". Do you think they were
> > lucky to hit CPU while it was in a root mode?
> >
> > [1] http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
>
> Interesting. And confusing. If a VMM cannot "consume" INIT events by
> reentering the guest nor postpone those events up to that point if they
> arrived in root mode, the whole vmexit-on-INIT thing is practically
> useless. I wonder what use case Intel had in mind while designing this.
>
I actually find it very useful. On INIT vmexit hypervisor may call
vmxoff and do proper reset. I find it less useful on AMD where you need
to send self INIT IPI, but then how you can send self SIPI?
> That article claims they tested it, though via MSI injection, so they
> are probably right, at least for the used CPU type.
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
> Corporate Competence Center Embedded Linux
--
Gleb.
On 2013-03-11 19:13, Gleb Natapov wrote:
> On Mon, Mar 11, 2013 at 07:05:48PM +0100, Jan Kiszka wrote:
>> On 2013-03-11 18:41, Gleb Natapov wrote:
>>> On Mon, Mar 11, 2013 at 06:34:03PM +0100, Jan Kiszka wrote:
>>>> On 2013-03-11 18:23, Gleb Natapov wrote:
>>>>> On Mon, Mar 11, 2013 at 04:36:33PM +0100, Jan Kiszka wrote:
>>>>>> On 2013-03-11 15:23, Paolo Bonzini wrote:
>>>>>>> Il 11/03/2013 15:05, Gleb Natapov ha scritto:
>>>>>>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
>>>>>>>>>> We are not moving away from mp_state, we are moving away from using
>>>>>>>>>> mp_state for signaling because with nested virt INIT does not always
>>>>>>>>>> change mp_state, not only that it can change mp_state long after signal
>>>>>>>>>> is received after vmx off is done.
>>>>>>>>>
>>>>>>>>> Right.
>>>>>>>>>
>>>>>>>>> BTW, for that to happen, we will also need to influence the INIT level.
>>>>>>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
>>>>>>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
>>>>>>>>> actually happens. So what matters is the INIT signal level at the exit
>>>>>>>>> of root mode.
>>>>>>>>>
>>>>>>>> You are talking about INIT# signal received via CPU pin, right? I think
>>>>>>>> INIT send by IPI cannot go away.
>>>>>>>
>>>>>>> Neither can go away. For INIT sent by IPI, 10.4.7 says:
>>>>>>>
>>>>>>> Only the Pentium and P6 family processors support the INIT-deassert IPI.
>>>>>>> An INIT-disassert IPI has no affect on the state of the APIC, other than
>>>>>>> to reload the arbitration ID register with the value in the APIC ID
>>>>>>> register.
>>>>>>>
>>>>>>> 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
>>>>>>> deassert) are always treated as edge triggered interrupts".
>>>>>>>
>>>>>>>
>>>>>>> For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
>>>>>>> clocks" when a soft reset is requested. So we can guess that INIT# is
>>>>>>> also edge-triggered.
>>>>>>
>>>>>> Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
>>>>>> form of a reset or a vmexit.
>>>>>>
>>>>> vmexit clears it?
>>>>
>>>> It has to. Otherwise, it would hit the host on vmxoff.
>>>>
>>> Why do you thing this is not happening?
>>>
>>> Look at [1] page 10 "VMX and INIT blocking". Do you think they were
>>> lucky to hit CPU while it was in a root mode?
>>>
>>> [1] http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
>>
>> Interesting. And confusing. If a VMM cannot "consume" INIT events by
>> reentering the guest nor postpone those events up to that point if they
>> arrived in root mode, the whole vmexit-on-INIT thing is practically
>> useless. I wonder what use case Intel had in mind while designing this.
>>
> I actually find it very useful. On INIT vmexit hypervisor may call
> vmxoff and do proper reset. I find it less useful on AMD where you need
> to send self INIT IPI, but then how you can send self SIPI?
Where's the difference? On Intel, SIPI is also not deliverable until
after vmxoff. So that signal has to come from the INIT sender, just like
on AMD.
However, AMD allows you to NOT do a reset after leaving virtualization
mode. On Intel, INIT is obviously irreversible, thus of limited use.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
On Mon, Mar 11, 2013 at 07:27:44PM +0100, Jan Kiszka wrote:
> On 2013-03-11 19:13, Gleb Natapov wrote:
> > On Mon, Mar 11, 2013 at 07:05:48PM +0100, Jan Kiszka wrote:
> >> On 2013-03-11 18:41, Gleb Natapov wrote:
> >>> On Mon, Mar 11, 2013 at 06:34:03PM +0100, Jan Kiszka wrote:
> >>>> On 2013-03-11 18:23, Gleb Natapov wrote:
> >>>>> On Mon, Mar 11, 2013 at 04:36:33PM +0100, Jan Kiszka wrote:
> >>>>>> On 2013-03-11 15:23, Paolo Bonzini wrote:
> >>>>>>> Il 11/03/2013 15:05, Gleb Natapov ha scritto:
> >>>>>>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
> >>>>>>>>>> We are not moving away from mp_state, we are moving away from using
> >>>>>>>>>> mp_state for signaling because with nested virt INIT does not always
> >>>>>>>>>> change mp_state, not only that it can change mp_state long after signal
> >>>>>>>>>> is received after vmx off is done.
> >>>>>>>>>
> >>>>>>>>> Right.
> >>>>>>>>>
> >>>>>>>>> BTW, for that to happen, we will also need to influence the INIT level.
> >>>>>>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
> >>>>>>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
> >>>>>>>>> actually happens. So what matters is the INIT signal level at the exit
> >>>>>>>>> of root mode.
> >>>>>>>>>
> >>>>>>>> You are talking about INIT# signal received via CPU pin, right? I think
> >>>>>>>> INIT send by IPI cannot go away.
> >>>>>>>
> >>>>>>> Neither can go away. For INIT sent by IPI, 10.4.7 says:
> >>>>>>>
> >>>>>>> Only the Pentium and P6 family processors support the INIT-deassert IPI.
> >>>>>>> An INIT-disassert IPI has no affect on the state of the APIC, other than
> >>>>>>> to reload the arbitration ID register with the value in the APIC ID
> >>>>>>> register.
> >>>>>>>
> >>>>>>> 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
> >>>>>>> deassert) are always treated as edge triggered interrupts".
> >>>>>>>
> >>>>>>>
> >>>>>>> For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
> >>>>>>> clocks" when a soft reset is requested. So we can guess that INIT# is
> >>>>>>> also edge-triggered.
> >>>>>>
> >>>>>> Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
> >>>>>> form of a reset or a vmexit.
> >>>>>>
> >>>>> vmexit clears it?
> >>>>
> >>>> It has to. Otherwise, it would hit the host on vmxoff.
> >>>>
> >>> Why do you thing this is not happening?
> >>>
> >>> Look at [1] page 10 "VMX and INIT blocking". Do you think they were
> >>> lucky to hit CPU while it was in a root mode?
> >>>
> >>> [1] http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
> >>
> >> Interesting. And confusing. If a VMM cannot "consume" INIT events by
> >> reentering the guest nor postpone those events up to that point if they
> >> arrived in root mode, the whole vmexit-on-INIT thing is practically
> >> useless. I wonder what use case Intel had in mind while designing this.
> >>
> > I actually find it very useful. On INIT vmexit hypervisor may call
> > vmxoff and do proper reset. I find it less useful on AMD where you need
> > to send self INIT IPI, but then how you can send self SIPI?
>
> Where's the difference? On Intel, SIPI is also not deliverable until
> after vmxoff. So that signal has to come from the INIT sender, just like
> on AMD.
>
On Intel:
CPU 1 CPU 2 in a guest mode
send INIT
send SIPI
INIT vmexit
vmxoff
reset and start from SIPI vector
On AMD;
CPU 1 CPU 2 in a guest mode
send INIT
send SIPI
INIT vmexit
self INIT IPI
???
> However, AMD allows you to NOT do a reset after leaving virtualization
> mode. On Intel, INIT is obviously irreversible, thus of limited use.
>
Why would hypervisor send INIT to one of its CPUs if it does not want to
reset it? :)
--
Gleb.
On 2013-03-11 19:39, Gleb Natapov wrote:
> On Mon, Mar 11, 2013 at 07:27:44PM +0100, Jan Kiszka wrote:
>> On 2013-03-11 19:13, Gleb Natapov wrote:
>>> On Mon, Mar 11, 2013 at 07:05:48PM +0100, Jan Kiszka wrote:
>>>> On 2013-03-11 18:41, Gleb Natapov wrote:
>>>>> On Mon, Mar 11, 2013 at 06:34:03PM +0100, Jan Kiszka wrote:
>>>>>> On 2013-03-11 18:23, Gleb Natapov wrote:
>>>>>>> On Mon, Mar 11, 2013 at 04:36:33PM +0100, Jan Kiszka wrote:
>>>>>>>> On 2013-03-11 15:23, Paolo Bonzini wrote:
>>>>>>>>> Il 11/03/2013 15:05, Gleb Natapov ha scritto:
>>>>>>>>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
>>>>>>>>>>>> We are not moving away from mp_state, we are moving away from using
>>>>>>>>>>>> mp_state for signaling because with nested virt INIT does not always
>>>>>>>>>>>> change mp_state, not only that it can change mp_state long after signal
>>>>>>>>>>>> is received after vmx off is done.
>>>>>>>>>>>
>>>>>>>>>>> Right.
>>>>>>>>>>>
>>>>>>>>>>> BTW, for that to happen, we will also need to influence the INIT level.
>>>>>>>>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
>>>>>>>>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
>>>>>>>>>>> actually happens. So what matters is the INIT signal level at the exit
>>>>>>>>>>> of root mode.
>>>>>>>>>>>
>>>>>>>>>> You are talking about INIT# signal received via CPU pin, right? I think
>>>>>>>>>> INIT send by IPI cannot go away.
>>>>>>>>>
>>>>>>>>> Neither can go away. For INIT sent by IPI, 10.4.7 says:
>>>>>>>>>
>>>>>>>>> Only the Pentium and P6 family processors support the INIT-deassert IPI.
>>>>>>>>> An INIT-disassert IPI has no affect on the state of the APIC, other than
>>>>>>>>> to reload the arbitration ID register with the value in the APIC ID
>>>>>>>>> register.
>>>>>>>>>
>>>>>>>>> 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
>>>>>>>>> deassert) are always treated as edge triggered interrupts".
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
>>>>>>>>> clocks" when a soft reset is requested. So we can guess that INIT# is
>>>>>>>>> also edge-triggered.
>>>>>>>>
>>>>>>>> Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
>>>>>>>> form of a reset or a vmexit.
>>>>>>>>
>>>>>>> vmexit clears it?
>>>>>>
>>>>>> It has to. Otherwise, it would hit the host on vmxoff.
>>>>>>
>>>>> Why do you thing this is not happening?
>>>>>
>>>>> Look at [1] page 10 "VMX and INIT blocking". Do you think they were
>>>>> lucky to hit CPU while it was in a root mode?
>>>>>
>>>>> [1] http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
>>>>
>>>> Interesting. And confusing. If a VMM cannot "consume" INIT events by
>>>> reentering the guest nor postpone those events up to that point if they
>>>> arrived in root mode, the whole vmexit-on-INIT thing is practically
>>>> useless. I wonder what use case Intel had in mind while designing this.
>>>>
>>> I actually find it very useful. On INIT vmexit hypervisor may call
>>> vmxoff and do proper reset. I find it less useful on AMD where you need
>>> to send self INIT IPI, but then how you can send self SIPI?
>>
>> Where's the difference? On Intel, SIPI is also not deliverable until
>> after vmxoff. So that signal has to come from the INIT sender, just like
>> on AMD.
>>
> On Intel:
> CPU 1 CPU 2 in a guest mode
> send INIT
> send SIPI
> INIT vmexit
> vmxoff
> reset and start from SIPI vector
Is SIPI sticky as well, even if the CPU is not in the wait-for-SIPI
state (but runnable and in vmxon) while receiving it?
>
>
> On AMD;
> CPU 1 CPU 2 in a guest mode
> send INIT
> send SIPI
> INIT vmexit
> self INIT IPI
> ???
>
>
>> However, AMD allows you to NOT do a reset after leaving virtualization
>> mode. On Intel, INIT is obviously irreversible, thus of limited use.
>>
> Why would hypervisor send INIT to one of its CPUs if it does not want to
> reset it? :)
First of all, to trigger a vmexit. What will happen with this event
should be the hand of the hypervisor - ideally.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
On Mon, Mar 11, 2013 at 07:47:03PM +0100, Jan Kiszka wrote:
> On 2013-03-11 19:39, Gleb Natapov wrote:
> > On Mon, Mar 11, 2013 at 07:27:44PM +0100, Jan Kiszka wrote:
> >> On 2013-03-11 19:13, Gleb Natapov wrote:
> >>> On Mon, Mar 11, 2013 at 07:05:48PM +0100, Jan Kiszka wrote:
> >>>> On 2013-03-11 18:41, Gleb Natapov wrote:
> >>>>> On Mon, Mar 11, 2013 at 06:34:03PM +0100, Jan Kiszka wrote:
> >>>>>> On 2013-03-11 18:23, Gleb Natapov wrote:
> >>>>>>> On Mon, Mar 11, 2013 at 04:36:33PM +0100, Jan Kiszka wrote:
> >>>>>>>> On 2013-03-11 15:23, Paolo Bonzini wrote:
> >>>>>>>>> Il 11/03/2013 15:05, Gleb Natapov ha scritto:
> >>>>>>>>>> On Mon, Mar 11, 2013 at 03:01:40PM +0100, Jan Kiszka wrote:
> >>>>>>>>>>>> We are not moving away from mp_state, we are moving away from using
> >>>>>>>>>>>> mp_state for signaling because with nested virt INIT does not always
> >>>>>>>>>>>> change mp_state, not only that it can change mp_state long after signal
> >>>>>>>>>>>> is received after vmx off is done.
> >>>>>>>>>>>
> >>>>>>>>>>> Right.
> >>>>>>>>>>>
> >>>>>>>>>>> BTW, for that to happen, we will also need to influence the INIT level.
> >>>>>>>>>>> Unless I misread the spec, INIT is blocked while in root mode, and if
> >>>>>>>>>>> you deassert INIT before leaving root (vmxoff, vmenter), nothing
> >>>>>>>>>>> actually happens. So what matters is the INIT signal level at the exit
> >>>>>>>>>>> of root mode.
> >>>>>>>>>>>
> >>>>>>>>>> You are talking about INIT# signal received via CPU pin, right? I think
> >>>>>>>>>> INIT send by IPI cannot go away.
> >>>>>>>>>
> >>>>>>>>> Neither can go away. For INIT sent by IPI, 10.4.7 says:
> >>>>>>>>>
> >>>>>>>>> Only the Pentium and P6 family processors support the INIT-deassert IPI.
> >>>>>>>>> An INIT-disassert IPI has no affect on the state of the APIC, other than
> >>>>>>>>> to reload the arbitration ID register with the value in the APIC ID
> >>>>>>>>> register.
> >>>>>>>>>
> >>>>>>>>> 18.27.1 also says that "In the local APIC, NMI and INIT (except for INIT
> >>>>>>>>> deassert) are always treated as edge triggered interrupts".
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> For INIT#, the ICH9 chipset says that "INIT# is driven low for 16 PCI
> >>>>>>>>> clocks" when a soft reset is requested. So we can guess that INIT# is
> >>>>>>>>> also edge-triggered.
> >>>>>>>>
> >>>>>>>> Ah, ok. So, virtually, INIT stays asserted until it can be delivered in
> >>>>>>>> form of a reset or a vmexit.
> >>>>>>>>
> >>>>>>> vmexit clears it?
> >>>>>>
> >>>>>> It has to. Otherwise, it would hit the host on vmxoff.
> >>>>>>
> >>>>> Why do you thing this is not happening?
> >>>>>
> >>>>> Look at [1] page 10 "VMX and INIT blocking". Do you think they were
> >>>>> lucky to hit CPU while it was in a root mode?
> >>>>>
> >>>>> [1] http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
> >>>>
> >>>> Interesting. And confusing. If a VMM cannot "consume" INIT events by
> >>>> reentering the guest nor postpone those events up to that point if they
> >>>> arrived in root mode, the whole vmexit-on-INIT thing is practically
> >>>> useless. I wonder what use case Intel had in mind while designing this.
> >>>>
> >>> I actually find it very useful. On INIT vmexit hypervisor may call
> >>> vmxoff and do proper reset. I find it less useful on AMD where you need
> >>> to send self INIT IPI, but then how you can send self SIPI?
> >>
> >> Where's the difference? On Intel, SIPI is also not deliverable until
> >> after vmxoff. So that signal has to come from the INIT sender, just like
> >> on AMD.
> >>
> > On Intel:
> > CPU 1 CPU 2 in a guest mode
> > send INIT
> > send SIPI
> > INIT vmexit
> > vmxoff
> > reset and start from SIPI vector
>
> Is SIPI sticky as well, even if the CPU is not in the wait-for-SIPI
> state (but runnable and in vmxon) while receiving it?
>
That what they seams to be saying:
However, an INIT and SIPI interrupts sent to a CPU during time when
it is in a VMX mode are remembered and delivered, perhaps hours later,
when the CPU exits the VMX mode
Otherwise their exploit will not work.
--
Gleb.
On 2013-03-11 19:51, Gleb Natapov wrote:
>>> On Intel:
>>> CPU 1 CPU 2 in a guest mode
>>> send INIT
>>> send SIPI
>>> INIT vmexit
>>> vmxoff
>>> reset and start from SIPI vector
>>
>> Is SIPI sticky as well, even if the CPU is not in the wait-for-SIPI
>> state (but runnable and in vmxon) while receiving it?
>>
> That what they seams to be saying:
> However, an INIT and SIPI interrupts sent to a CPU during time when
> it is in a VMX mode are remembered and delivered, perhaps hours later,
> when the CPU exits the VMX mode
>
> Otherwise their exploit will not work.
Very weird, specifically as SIPI is not just a binary event but carries
payload. Will another SIPI event overwrite the previously "saved"
vector? We are deep into an underspecified area...
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
On Mon, Mar 11, 2013 at 08:01:30PM +0100, Jan Kiszka wrote:
> On 2013-03-11 19:51, Gleb Natapov wrote:
> >>> On Intel:
> >>> CPU 1 CPU 2 in a guest mode
> >>> send INIT
> >>> send SIPI
> >>> INIT vmexit
> >>> vmxoff
> >>> reset and start from SIPI vector
> >>
> >> Is SIPI sticky as well, even if the CPU is not in the wait-for-SIPI
> >> state (but runnable and in vmxon) while receiving it?
> >>
> > That what they seams to be saying:
> > However, an INIT and SIPI interrupts sent to a CPU during time when
> > it is in a VMX mode are remembered and delivered, perhaps hours later,
> > when the CPU exits the VMX mode
> >
> > Otherwise their exploit will not work.
>
> Very weird, specifically as SIPI is not just a binary event but carries
> payload. Will another SIPI event overwrite the previously "saved"
> vector? We are deep into an underspecified area...
My guess is that VMX INIT blocking is done by the same mechanism as
INIT blocking during SMM. Obviously after exit from SMM pending
INIT/SIPI have to be processed.
--
Gleb.
On 2013-03-11 20:30, Gleb Natapov wrote:
> On Mon, Mar 11, 2013 at 08:01:30PM +0100, Jan Kiszka wrote:
>> On 2013-03-11 19:51, Gleb Natapov wrote:
>>>>> On Intel:
>>>>> CPU 1 CPU 2 in a guest mode
>>>>> send INIT
>>>>> send SIPI
>>>>> INIT vmexit
>>>>> vmxoff
>>>>> reset and start from SIPI vector
>>>>
>>>> Is SIPI sticky as well, even if the CPU is not in the wait-for-SIPI
>>>> state (but runnable and in vmxon) while receiving it?
>>>>
>>> That what they seams to be saying:
>>> However, an INIT and SIPI interrupts sent to a CPU during time when
>>> it is in a VMX mode are remembered and delivered, perhaps hours later,
>>> when the CPU exits the VMX mode
>>>
>>> Otherwise their exploit will not work.
>>
>> Very weird, specifically as SIPI is not just a binary event but carries
>> payload. Will another SIPI event overwrite the previously "saved"
>> vector? We are deep into an underspecified area...
> My guess is that VMX INIT blocking is done by the same mechanism as
> INIT blocking during SMM. Obviously after exit from SMM pending
> INIT/SIPI have to be processed.
I think this should be further examined via a test case that can run on
real HW. Is kvm-unit-test ready for this? Then we "just" need to
implement what you were already asking for: minimalistic nVMX tests...
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
On Tue, Mar 12, 2013 at 10:25:35AM +0100, Jan Kiszka wrote:
> On 2013-03-11 20:30, Gleb Natapov wrote:
> > On Mon, Mar 11, 2013 at 08:01:30PM +0100, Jan Kiszka wrote:
> >> On 2013-03-11 19:51, Gleb Natapov wrote:
> >>>>> On Intel:
> >>>>> CPU 1 CPU 2 in a guest mode
> >>>>> send INIT
> >>>>> send SIPI
> >>>>> INIT vmexit
> >>>>> vmxoff
> >>>>> reset and start from SIPI vector
> >>>>
> >>>> Is SIPI sticky as well, even if the CPU is not in the wait-for-SIPI
> >>>> state (but runnable and in vmxon) while receiving it?
> >>>>
> >>> That what they seams to be saying:
> >>> However, an INIT and SIPI interrupts sent to a CPU during time when
> >>> it is in a VMX mode are remembered and delivered, perhaps hours later,
> >>> when the CPU exits the VMX mode
> >>>
> >>> Otherwise their exploit will not work.
> >>
> >> Very weird, specifically as SIPI is not just a binary event but carries
> >> payload. Will another SIPI event overwrite the previously "saved"
> >> vector? We are deep into an underspecified area...
> > My guess is that VMX INIT blocking is done by the same mechanism as
> > INIT blocking during SMM. Obviously after exit from SMM pending
> > INIT/SIPI have to be processed.
>
> I think this should be further examined via a test case that can run on
> real HW. Is kvm-unit-test ready for this? Then we "just" need to
> implement what you were already asking for: minimalistic nVMX tests...
>
I do not think kvm-unit-test will run on bare metal. I once implemented
framework for interrupt injection testing that ran on bare metal too. It
was very handy to compare KVM behaviour with real HW, but it was since
folded into kvm-unit-test.
--
Gleb.