Previously (see commit 72c139bacf) it was found that Hyper-V frequency
MSRs are required to make Hyper-V on KVM pass through TSC page as stable
clocksource to its guests. However, to make things work this is not
enough. Hyper-V tries to prove that TSC frequency underneath it won't
change (e.g. when it's migrated), there are two mechanisms:
1) Invariant TSC (CPUID.80000007H:EDX[8]). If Hyper-V sees this bit it will
treat TSC as stable. We, however, don't want to pass it as it makes
migration hard (e.g. Qemu adds a migration blocker when 'invtsc' flag is
passed. Genuine Hyper-V running in L0 doesn't pass it either.
2) Hyper-V Reenlightenment (CPUID.40000003H:EAX[13]).
This patch series add rudimentary support for Hyper-V reenlightenment
notifications to KVM ('producer' part; we already implemented
reenlightenment 'consumer' for KVM-on Hyper-V some time ago) and fixes
bugs I found during testing. Fully fledged reenlightenment implementation
will be added later when we learn to migrate nested workloads in KVM making
testing possible.
Qemu patches are also required, I'll post them separately.
Vitaly Kuznetsov (3):
x86/kvm/hyper-v: add reenlightenment MSRs support
x86/kvm/hyper-v: remove stale entries from vec_bitmap/auto_eoi_bitmap
on vector change
x86/kvm/hyper-v: inject #GP only when invalid SINTx vector is unmasked
arch/x86/include/asm/kvm_host.h | 4 +++
arch/x86/include/uapi/asm/hyperv.h | 3 ++
arch/x86/kvm/hyperv.c | 58 +++++++++++++++++++++++++++++++-------
arch/x86/kvm/x86.c | 12 +++++++-
4 files changed, 66 insertions(+), 11 deletions(-)
--
2.14.3
Hyper-V 2016 on KVM with SynIC enabled doesn't boot with the following
trace:
kvm_entry: vcpu 0
kvm_exit: reason MSR_WRITE rip 0xfffff8000131c1e5 info 0 0
kvm_hv_synic_set_msr: vcpu_id 0 msr 0x40000090 data 0x10000 host 0
kvm_msr: msr_write 40000090 = 0x10000 (#GP)
kvm_inj_exception: #GP (0x0)
KVM acts according to the following statement from TLFS:
"
11.8.4 SINTx Registers
...
Valid values for vector are 16-255 inclusive. Specifying an invalid
vector number results in #GP.
"
However, I checked and genuine Hyper-V doesn't #GP when we write 0x10000
to SINTx. I checked with Microsoft and they confirmed that if either the
Masked bit (bit 16) or the Polling bit (bit 18) is set to 1, then they
ignore the value of Vector. Make KVM act accordingly.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/include/uapi/asm/hyperv.h | 1 +
arch/x86/kvm/hyperv.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index 62c778a303a1..a492dc357bd7 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -326,6 +326,7 @@ typedef struct _HV_REFERENCE_TSC_PAGE {
#define HV_SYNIC_SIEFP_ENABLE (1ULL << 0)
#define HV_SYNIC_SINT_MASKED (1ULL << 16)
#define HV_SYNIC_SINT_AUTO_EOI (1ULL << 17)
+#define HV_SYNIC_SINT_POLLING (1ULL << 18)
#define HV_SYNIC_SINT_VECTOR_MASK (0xFF)
#define HV_SYNIC_STIMER_COUNT (4)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 6d14f808145d..d3d866c32976 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -95,9 +95,14 @@ static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
u64 data, bool host)
{
int vector, old_vector;
+ bool masked, polling;
vector = data & HV_SYNIC_SINT_VECTOR_MASK;
- if (vector < HV_SYNIC_FIRST_VALID_VECTOR && !host)
+ masked = data & HV_SYNIC_SINT_MASKED;
+ polling = data & HV_SYNIC_SINT_POLLING;
+
+ if (vector < HV_SYNIC_FIRST_VALID_VECTOR &&
+ !host && !masked && !polling)
return 1;
/*
* Guest may configure multiple SINTs to use the same vector, so
--
2.14.3
Nested Hyper-V/Windows guest running on top of KVM will use TSC page
clocksource in two cases:
- L0 exposes invariant TSC (CPUID.80000007H:EDX[8]).
- L0 provides Hyper-V Reenlightenment support (CPUID.40000003H:EAX[13]).
Exposing invariant TSC effectively blocks migration to hosts with different
TSC frequencies, providing reenlightenment support will be needed when we
start migrating nested workloads.
Implement rudimentary support for reenlightenment MSRs. For now, these are
just read/write MSRs with no effect.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++++
arch/x86/kvm/hyperv.c | 21 +++++++++++++++++++++
arch/x86/kvm/x86.c | 12 +++++++++++-
3 files changed, 36 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 516798431328..eff8aa880333 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -752,6 +752,10 @@ struct kvm_hv {
u64 hv_crash_ctl;
HV_REFERENCE_TSC_PAGE tsc_ref;
+
+ u64 hv_reenlightenment_control;
+ u64 hv_tsc_emulation_control;
+ u64 hv_tsc_emulation_status;
};
enum kvm_irqchip_mode {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index dc97f2544b6f..05f414525538 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -736,6 +736,9 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
case HV_X64_MSR_CRASH_CTL:
case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
case HV_X64_MSR_RESET:
+ case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
+ case HV_X64_MSR_TSC_EMULATION_CONTROL:
+ case HV_X64_MSR_TSC_EMULATION_STATUS:
r = true;
break;
}
@@ -981,6 +984,15 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
kvm_make_request(KVM_REQ_HV_RESET, vcpu);
}
break;
+ case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
+ hv->hv_reenlightenment_control = data;
+ break;
+ case HV_X64_MSR_TSC_EMULATION_CONTROL:
+ hv->hv_tsc_emulation_control = data;
+ break;
+ case HV_X64_MSR_TSC_EMULATION_STATUS:
+ hv->hv_tsc_emulation_status = data;
+ break;
default:
vcpu_unimpl(vcpu, "Hyper-V uhandled wrmsr: 0x%x data 0x%llx\n",
msr, data);
@@ -1105,6 +1117,15 @@ static int kvm_hv_get_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case HV_X64_MSR_RESET:
data = 0;
break;
+ case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
+ data = hv->hv_reenlightenment_control;
+ break;
+ case HV_X64_MSR_TSC_EMULATION_CONTROL:
+ data = hv->hv_tsc_emulation_control;
+ break;
+ case HV_X64_MSR_TSC_EMULATION_STATUS:
+ data = hv->hv_tsc_emulation_status;
+ break;
default:
vcpu_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr);
return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c53298dfbf50..d97ab134b2d9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1026,7 +1026,11 @@ static u32 emulated_msrs[] = {
HV_X64_MSR_VP_RUNTIME,
HV_X64_MSR_SCONTROL,
HV_X64_MSR_STIMER0_CONFIG,
- HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
+ HV_X64_MSR_APIC_ASSIST_PAGE,
+ HV_X64_MSR_REENLIGHTENMENT_CONTROL, HV_X64_MSR_TSC_EMULATION_CONTROL,
+ HV_X64_MSR_TSC_EMULATION_STATUS,
+
+ MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_KVM_PV_EOI_EN,
MSR_IA32_TSC_ADJUST,
@@ -2325,6 +2329,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
case HV_X64_MSR_CRASH_CTL:
case HV_X64_MSR_STIMER0_CONFIG ... HV_X64_MSR_STIMER3_COUNT:
+ case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
+ case HV_X64_MSR_TSC_EMULATION_CONTROL:
+ case HV_X64_MSR_TSC_EMULATION_STATUS:
return kvm_hv_set_msr_common(vcpu, msr, data,
msr_info->host_initiated);
case MSR_IA32_BBL_CR_CTL3:
@@ -2551,6 +2558,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
case HV_X64_MSR_CRASH_CTL:
case HV_X64_MSR_STIMER0_CONFIG ... HV_X64_MSR_STIMER3_COUNT:
+ case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
+ case HV_X64_MSR_TSC_EMULATION_CONTROL:
+ case HV_X64_MSR_TSC_EMULATION_STATUS:
return kvm_hv_get_msr_common(vcpu,
msr_info->index, &msr_info->data);
break;
--
2.14.3
When a new vector is written to SINx we update vec_bitmap/auto_eoi_bitmap
but we forget to remove old vector from these masks (in case it is not
present in some other SINTx).
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/include/uapi/asm/hyperv.h | 2 ++
arch/x86/kvm/hyperv.c | 32 ++++++++++++++++++++++----------
2 files changed, 24 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index 197c2e6c7376..62c778a303a1 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -318,6 +318,8 @@ typedef struct _HV_REFERENCE_TSC_PAGE {
#define HV_SYNIC_SINT_COUNT (16)
/* Define the expected SynIC version. */
#define HV_SYNIC_VERSION_1 (0x1)
+/* Valid SynIC vectors are 16-255. */
+#define HV_SYNIC_FIRST_VALID_VECTOR (16)
#define HV_SYNIC_CONTROL_ENABLE (1ULL << 0)
#define HV_SYNIC_SIMP_ENABLE (1ULL << 0)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 05f414525538..6d14f808145d 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -74,13 +74,30 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
return false;
}
+static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
+ int vector)
+{
+ if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
+ return;
+
+ if (synic_has_vector_connected(synic, vector))
+ __set_bit(vector, synic->vec_bitmap);
+ else
+ __clear_bit(vector, synic->vec_bitmap);
+
+ if (synic_has_vector_auto_eoi(synic, vector))
+ __set_bit(vector, synic->auto_eoi_bitmap);
+ else
+ __clear_bit(vector, synic->auto_eoi_bitmap);
+}
+
static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
u64 data, bool host)
{
- int vector;
+ int vector, old_vector;
vector = data & HV_SYNIC_SINT_VECTOR_MASK;
- if (vector < 16 && !host)
+ if (vector < HV_SYNIC_FIRST_VALID_VECTOR && !host)
return 1;
/*
* Guest may configure multiple SINTs to use the same vector, so
@@ -88,18 +105,13 @@ static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
* bitmap of vectors with auto-eoi behavior. The bitmaps are
* updated here, and atomically queried on fast paths.
*/
+ old_vector = synic_read_sint(synic, sint) & HV_SYNIC_SINT_VECTOR_MASK;
atomic64_set(&synic->sint[sint], data);
- if (synic_has_vector_connected(synic, vector))
- __set_bit(vector, synic->vec_bitmap);
- else
- __clear_bit(vector, synic->vec_bitmap);
+ synic_update_vector(synic, old_vector);
- if (synic_has_vector_auto_eoi(synic, vector))
- __set_bit(vector, synic->auto_eoi_bitmap);
- else
- __clear_bit(vector, synic->auto_eoi_bitmap);
+ synic_update_vector(synic, vector);
/* Load SynIC vectors into EOI exit bitmap */
kvm_make_request(KVM_REQ_SCAN_IOAPIC, synic_to_vcpu(synic));
--
2.14.3
On Wed, Feb 28, 2018 at 02:44:00PM +0100, Vitaly Kuznetsov wrote:
> When a new vector is written to SINx we update vec_bitmap/auto_eoi_bitmap
> but we forget to remove old vector from these masks (in case it is not
> present in some other SINTx).
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/include/uapi/asm/hyperv.h | 2 ++
> arch/x86/kvm/hyperv.c | 32 ++++++++++++++++++++++----------
> 2 files changed, 24 insertions(+), 10 deletions(-)
Reviewed-by: Roman Kagan <[email protected]>
On Wed, Feb 28, 2018 at 02:44:01PM +0100, Vitaly Kuznetsov wrote:
> Hyper-V 2016 on KVM with SynIC enabled doesn't boot with the following
> trace:
>
> kvm_entry: vcpu 0
> kvm_exit: reason MSR_WRITE rip 0xfffff8000131c1e5 info 0 0
> kvm_hv_synic_set_msr: vcpu_id 0 msr 0x40000090 data 0x10000 host 0
> kvm_msr: msr_write 40000090 = 0x10000 (#GP)
> kvm_inj_exception: #GP (0x0)
I don't remember having seen this... Does this happen with the mainline
QEMU, which doesn't set the SintPollingModeAvailable (17) bit in cpuid
0x40000003:edx?
>
> KVM acts according to the following statement from TLFS:
>
> "
> 11.8.4 SINTx Registers
> ...
> Valid values for vector are 16-255 inclusive. Specifying an invalid
> vector number results in #GP.
> "
>
> However, I checked and genuine Hyper-V doesn't #GP when we write 0x10000
> to SINTx. I checked with Microsoft and they confirmed that if either the
> Masked bit (bit 16) or the Polling bit (bit 18) is set to 1, then they
> ignore the value of Vector. Make KVM act accordingly.
I wonder if that cpuid setting affects this behavior? Also curious what
exactly the guest is trying to achieve writing this bogus value?
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/include/uapi/asm/hyperv.h | 1 +
> arch/x86/kvm/hyperv.c | 7 ++++++-
> 2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
> index 62c778a303a1..a492dc357bd7 100644
> --- a/arch/x86/include/uapi/asm/hyperv.h
> +++ b/arch/x86/include/uapi/asm/hyperv.h
> @@ -326,6 +326,7 @@ typedef struct _HV_REFERENCE_TSC_PAGE {
> #define HV_SYNIC_SIEFP_ENABLE (1ULL << 0)
> #define HV_SYNIC_SINT_MASKED (1ULL << 16)
> #define HV_SYNIC_SINT_AUTO_EOI (1ULL << 17)
> +#define HV_SYNIC_SINT_POLLING (1ULL << 18)
> #define HV_SYNIC_SINT_VECTOR_MASK (0xFF)
>
> #define HV_SYNIC_STIMER_COUNT (4)
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 6d14f808145d..d3d866c32976 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -95,9 +95,14 @@ static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
> u64 data, bool host)
> {
> int vector, old_vector;
> + bool masked, polling;
>
> vector = data & HV_SYNIC_SINT_VECTOR_MASK;
> - if (vector < HV_SYNIC_FIRST_VALID_VECTOR && !host)
> + masked = data & HV_SYNIC_SINT_MASKED;
> + polling = data & HV_SYNIC_SINT_POLLING;
> +
> + if (vector < HV_SYNIC_FIRST_VALID_VECTOR &&
> + !host && !masked && !polling)
> return 1;
> /*
> * Guest may configure multiple SINTs to use the same vector, so
I'm not sure this is enough to implement the polling mode: per spec,
> Setting the polling bit will have the effect of unmasking an interrupt
> source, except that an actual interrupt is not generated.
However, if the guest sets a valid vector and the masked bit cleared,
we'll consider it a usual SINT and add to masks and inject interrupts,
etc, regardless of the polling bit.
I must admit I'm confused by the above quote from the spec: is the
polling bit supposed to come together with the masked bit? If so, then
we probably should validate it here (but your logs indicate otherwise).
In general I'm missing the utility of this mode: why should an interrupt
controller be involved in polling at all?
Roman.
Roman Kagan <[email protected]> writes:
> On Wed, Feb 28, 2018 at 02:44:01PM +0100, Vitaly Kuznetsov wrote:
>> Hyper-V 2016 on KVM with SynIC enabled doesn't boot with the following
>> trace:
>>
>> kvm_entry: vcpu 0
>> kvm_exit: reason MSR_WRITE rip 0xfffff8000131c1e5 info 0 0
>> kvm_hv_synic_set_msr: vcpu_id 0 msr 0x40000090 data 0x10000 host 0
>> kvm_msr: msr_write 40000090 = 0x10000 (#GP)
>> kvm_inj_exception: #GP (0x0)
>
> I don't remember having seen this... Does this happen with the mainline
> QEMU, which doesn't set the SintPollingModeAvailable (17) bit in cpuid
> 0x40000003:edx?
Yes, you need to have Hyper-V role enabled, kvm-intel modules needs to
be loaded with 'nesting' support enabled.
>
>>
>> KVM acts according to the following statement from TLFS:
>>
>> "
>> 11.8.4 SINTx Registers
>> ...
>> Valid values for vector are 16-255 inclusive. Specifying an invalid
>> vector number results in #GP.
>> "
>>
>> However, I checked and genuine Hyper-V doesn't #GP when we write 0x10000
>> to SINTx. I checked with Microsoft and they confirmed that if either the
>> Masked bit (bit 16) or the Polling bit (bit 18) is set to 1, then they
>> ignore the value of Vector. Make KVM act accordingly.
>
> I wonder if that cpuid setting affects this behavior? Also curious what
> exactly the guest is trying to achieve writing this bogus value?
The value is actually the default value which is supposed to be there:
"At virtual processor creation time, the default value of all SINTx
(synthetic interrupt source) registers is 0x0000000000010000." so I
guess this is just an intialization procedure.
>
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/include/uapi/asm/hyperv.h | 1 +
>> arch/x86/kvm/hyperv.c | 7 ++++++-
>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
>> index 62c778a303a1..a492dc357bd7 100644
>> --- a/arch/x86/include/uapi/asm/hyperv.h
>> +++ b/arch/x86/include/uapi/asm/hyperv.h
>> @@ -326,6 +326,7 @@ typedef struct _HV_REFERENCE_TSC_PAGE {
>> #define HV_SYNIC_SIEFP_ENABLE (1ULL << 0)
>> #define HV_SYNIC_SINT_MASKED (1ULL << 16)
>> #define HV_SYNIC_SINT_AUTO_EOI (1ULL << 17)
>> +#define HV_SYNIC_SINT_POLLING (1ULL << 18)
>> #define HV_SYNIC_SINT_VECTOR_MASK (0xFF)
>>
>> #define HV_SYNIC_STIMER_COUNT (4)
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index 6d14f808145d..d3d866c32976 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -95,9 +95,14 @@ static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
>> u64 data, bool host)
>> {
>> int vector, old_vector;
>> + bool masked, polling;
>>
>> vector = data & HV_SYNIC_SINT_VECTOR_MASK;
>> - if (vector < HV_SYNIC_FIRST_VALID_VECTOR && !host)
>> + masked = data & HV_SYNIC_SINT_MASKED;
>> + polling = data & HV_SYNIC_SINT_POLLING;
>> +
>> + if (vector < HV_SYNIC_FIRST_VALID_VECTOR &&
>> + !host && !masked && !polling)
>> return 1;
>> /*
>> * Guest may configure multiple SINTs to use the same vector, so
>
> I'm not sure this is enough to implement the polling mode: per spec,
>
Oh, no, I wasn't trying to -- and by the way we don't currently announce
SintPollingModeAvailable so guests are not supposed to do that. This is
rather a future proof to 'not forget'.
>> Setting the polling bit will have the effect of unmasking an interrupt
>> source, except that an actual interrupt is not generated.
>
> However, if the guest sets a valid vector and the masked bit cleared,
> we'll consider it a usual SINT and add to masks and inject interrupts,
> etc, regardless of the polling bit.
>
> I must admit I'm confused by the above quote from the spec: is the
> polling bit supposed to come together with the masked bit? If so, then
> we probably should validate it here (but your logs indicate otherwise).
> In general I'm missing the utility of this mode: why should an interrupt
> controller be involved in polling at all?
"Setting the polling bit will have the effect of unmasking an interrupt
source, except that an actual interrupt is not generated."
So, as I understand it, setting polling bit makes Vector value
irrelevant - the interrupt is not generated so I *assume* we may see
writes with zero Vector and polling bit set. But again, we're not
implementing polling mode for now, I can just drop it from the patch if
you think it is confusing.
--
Vitaly
On Wed, Feb 28, 2018 at 04:35:59PM +0100, Vitaly Kuznetsov wrote:
> Roman Kagan <[email protected]> writes:
>
> > On Wed, Feb 28, 2018 at 02:44:01PM +0100, Vitaly Kuznetsov wrote:
> >> Hyper-V 2016 on KVM with SynIC enabled doesn't boot with the following
> >> trace:
> >>
> >> kvm_entry: vcpu 0
> >> kvm_exit: reason MSR_WRITE rip 0xfffff8000131c1e5 info 0 0
> >> kvm_hv_synic_set_msr: vcpu_id 0 msr 0x40000090 data 0x10000 host 0
> >> kvm_msr: msr_write 40000090 = 0x10000 (#GP)
> >> kvm_inj_exception: #GP (0x0)
> >
> > I don't remember having seen this... Does this happen with the mainline
> > QEMU, which doesn't set the SintPollingModeAvailable (17) bit in cpuid
> > 0x40000003:edx?
>
> Yes, you need to have Hyper-V role enabled, kvm-intel modules needs to
> be loaded with 'nesting' support enabled.
Thanks, reproduced now.
> >>
> >> KVM acts according to the following statement from TLFS:
> >>
> >> "
> >> 11.8.4 SINTx Registers
> >> ...
> >> Valid values for vector are 16-255 inclusive. Specifying an invalid
> >> vector number results in #GP.
> >> "
> >>
> >> However, I checked and genuine Hyper-V doesn't #GP when we write 0x10000
> >> to SINTx. I checked with Microsoft and they confirmed that if either the
> >> Masked bit (bit 16) or the Polling bit (bit 18) is set to 1, then they
> >> ignore the value of Vector. Make KVM act accordingly.
> >
> > I wonder if that cpuid setting affects this behavior? Also curious what
> > exactly the guest is trying to achieve writing this bogus value?
>
> The value is actually the default value which is supposed to be there:
>
> "At virtual processor creation time, the default value of all SINTx
> (synthetic interrupt source) registers is 0x0000000000010000." so I
> guess this is just an intialization procedure.
Oh, sorry, I trapped on that polling thing throughout the patch so I
missed that the value you observe has actually only the masked bit set,
which is indeed the initial value (no idea why the guest *writes* it
though, it's the hypervisor's responsibility to put it there on vCPU
reset).
> >>
> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> >> ---
> >> arch/x86/include/uapi/asm/hyperv.h | 1 +
> >> arch/x86/kvm/hyperv.c | 7 ++++++-
> >> 2 files changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
> >> index 62c778a303a1..a492dc357bd7 100644
> >> --- a/arch/x86/include/uapi/asm/hyperv.h
> >> +++ b/arch/x86/include/uapi/asm/hyperv.h
> >> @@ -326,6 +326,7 @@ typedef struct _HV_REFERENCE_TSC_PAGE {
> >> #define HV_SYNIC_SIEFP_ENABLE (1ULL << 0)
> >> #define HV_SYNIC_SINT_MASKED (1ULL << 16)
> >> #define HV_SYNIC_SINT_AUTO_EOI (1ULL << 17)
> >> +#define HV_SYNIC_SINT_POLLING (1ULL << 18)
> >> #define HV_SYNIC_SINT_VECTOR_MASK (0xFF)
> >>
> >> #define HV_SYNIC_STIMER_COUNT (4)
> >> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> >> index 6d14f808145d..d3d866c32976 100644
> >> --- a/arch/x86/kvm/hyperv.c
> >> +++ b/arch/x86/kvm/hyperv.c
> >> @@ -95,9 +95,14 @@ static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
> >> u64 data, bool host)
> >> {
> >> int vector, old_vector;
> >> + bool masked, polling;
> >>
> >> vector = data & HV_SYNIC_SINT_VECTOR_MASK;
> >> - if (vector < HV_SYNIC_FIRST_VALID_VECTOR && !host)
> >> + masked = data & HV_SYNIC_SINT_MASKED;
> >> + polling = data & HV_SYNIC_SINT_POLLING;
> >> +
> >> + if (vector < HV_SYNIC_FIRST_VALID_VECTOR &&
> >> + !host && !masked && !polling)
> >> return 1;
> >> /*
> >> * Guest may configure multiple SINTs to use the same vector, so
> >
> > I'm not sure this is enough to implement the polling mode: per spec,
> >
>
> Oh, no, I wasn't trying to -- and by the way we don't currently announce
> SintPollingModeAvailable so guests are not supposed to do that. This is
> rather a future proof to 'not forget'.
>
> >> Setting the polling bit will have the effect of unmasking an interrupt
> >> source, except that an actual interrupt is not generated.
> >
> > However, if the guest sets a valid vector and the masked bit cleared,
> > we'll consider it a usual SINT and add to masks and inject interrupts,
> > etc, regardless of the polling bit.
> >
> > I must admit I'm confused by the above quote from the spec: is the
> > polling bit supposed to come together with the masked bit? If so, then
> > we probably should validate it here (but your logs indicate otherwise).
> > In general I'm missing the utility of this mode: why should an interrupt
> > controller be involved in polling at all?
>
> "Setting the polling bit will have the effect of unmasking an interrupt
> source, except that an actual interrupt is not generated."
>
> So, as I understand it, setting polling bit makes Vector value
> irrelevant - the interrupt is not generated so I *assume* we may see
> writes with zero Vector and polling bit set. But again, we're not
> implementing polling mode for now, I can just drop it from the patch if
> you think it is confusing.
I'd suggest indeed to leave polling out for now, as it seems to be a
different matter from the one being resolved by this patch (unless, of
course, there *are* known cases with invalid vector && !masked &&
polling).
Thanks,
Roman.
On Wed, Feb 28, 2018 at 02:43:59PM +0100, Vitaly Kuznetsov wrote:
> Nested Hyper-V/Windows guest running on top of KVM will use TSC page
> clocksource in two cases:
> - L0 exposes invariant TSC (CPUID.80000007H:EDX[8]).
> - L0 provides Hyper-V Reenlightenment support (CPUID.40000003H:EAX[13]).
>
> Exposing invariant TSC effectively blocks migration to hosts with different
> TSC frequencies,
I wonder if TSC scaling on the destination host doesn't allow to relax
this requirement?
> providing reenlightenment support will be needed when we
> start migrating nested workloads.
>
> Implement rudimentary support for reenlightenment MSRs. For now, these are
> just read/write MSRs with no effect.
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 ++++
> arch/x86/kvm/hyperv.c | 21 +++++++++++++++++++++
> arch/x86/kvm/x86.c | 12 +++++++++++-
> 3 files changed, 36 insertions(+), 1 deletion(-)
Reviewed-by: Roman Kagan <[email protected]>
Roman Kagan <[email protected]> writes:
> On Wed, Feb 28, 2018 at 02:43:59PM +0100, Vitaly Kuznetsov wrote:
>> Nested Hyper-V/Windows guest running on top of KVM will use TSC page
>> clocksource in two cases:
>> - L0 exposes invariant TSC (CPUID.80000007H:EDX[8]).
>> - L0 provides Hyper-V Reenlightenment support (CPUID.40000003H:EAX[13]).
>>
>> Exposing invariant TSC effectively blocks migration to hosts with different
>> TSC frequencies,
>
> I wonder if TSC scaling on the destination host doesn't allow to relax
> this requirement?
I don't see why it wouldn't, Skylake+ should be fine (but this, of
course, will limit possible destination hosts to those supporting the
feature -- even if no other CPU features are required).
>> providing reenlightenment support will be needed when we
>> start migrating nested workloads.
>>
>> Implement rudimentary support for reenlightenment MSRs. For now, these are
>> just read/write MSRs with no effect.
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 4 ++++
>> arch/x86/kvm/hyperv.c | 21 +++++++++++++++++++++
>> arch/x86/kvm/x86.c | 12 +++++++++++-
>> 3 files changed, 36 insertions(+), 1 deletion(-)
>
> Reviewed-by: Roman Kagan <[email protected]>
Thanks!
--
Vitaly