2017-03-15 21:22:26

by Michael S. Tsirkin

Subject: [PATCH v5 untested] kvm: better MWAIT emulation for guests

Guests running Mac OS X 10.5, 10.6, and 10.7 (Leopard through Lion) have
a problem: unless explicitly given the kernel command line argument
"idlehalt=0", they implicitly assume MONITOR and MWAIT are available,
without checking CPUID.

We currently emulate that as a NOP but on VMX we can do better: let the
guest stop the CPU until a timer, IPI or memory change wakes it. The CPU
will look busy to the host, but that isn't any worse than NOP emulation.

Note that mwait within guests is not the same as on real hardware
because halt causes an exit while mwait doesn't. For this reason it
might not be a good idea to use the regular MWAIT flag in CPUID to
signal this capability. Add a flag in the hypervisor leaf instead.

Additionally, we add a capability for QEMU - e.g. if it knows there's an
isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
to improve guest behaviour.

Reported-by: "Gabriel L. Somlo" <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
---

This is for Gabriel's testing only. A bit rushed so untested.

Documentation/virtual/kvm/api.txt | 9 +++++++++
Documentation/virtual/kvm/cpuid.txt | 6 ++++++
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kvm/cpuid.c | 3 +++
arch/x86/kvm/svm.c | 2 --
arch/x86/kvm/vmx.c | 6 ++++--
arch/x86/kvm/x86.c | 3 +++
arch/x86/kvm/x86.h | 28 ++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 1 +
9 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3c248f7..6ee2e43 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4147,3 +4147,12 @@ This capability, if KVM_CHECK_EXTENSION indicates that it is
available, means that that the kernel can support guests using the
hashed page table MMU defined in Power ISA V3.00 (as implemented in
the POWER9 processor), including in-memory segment tables.
+
+8.5 KVM_CAP_X86_GUEST_MWAIT
+
+Architectures: x86
+
+This capability indicates that a guest using memory monitoring instructions
+(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit. As such, time
+spent while the virtual CPU is halted in this way will be accounted for as
+guest running time on the host (as opposed to e.g. HLT).
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 3c65feb..04c201c 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -54,6 +54,12 @@ KVM_FEATURE_PV_UNHALT || 7 || guest checks this feature bit
|| || before enabling paravirtualized
|| || spinlock support.
------------------------------------------------------------------------------
+KVM_FEATURE_MWAIT || 8 || guest can use monitor/mwait
+ || || to halt the VCPU without exits,
+ || || time spent while halted in this
+ || || way is accounted for on host as
+ || || VCPU run time.
+------------------------------------------------------------------------------
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
|| || per-cpu warps are expected in
|| || kvmclock.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index cff0bb6..9cc77a7 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -24,6 +24,7 @@
#define KVM_FEATURE_STEAL_TIME 5
#define KVM_FEATURE_PV_EOI 6
#define KVM_FEATURE_PV_UNHALT 7
+#define KVM_FEATURE_MWAIT 8

/* The last 8 bits are used to indicate how to interpret the flags field
* in pvclock structure. If no bits are set, all flags are ignored.
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index efde6cc..5638102 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -594,6 +594,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);

+ if (kvm_mwait_in_guest())
+ entry->eax |= (1 << KVM_FEATURE_MWAIT);
+
entry->ebx = 0;
entry->ecx = 0;
entry->edx = 0;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d1efe2c..18e53bc 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1198,8 +1198,6 @@ static void init_vmcb(struct vcpu_svm *svm)
set_intercept(svm, INTERCEPT_CLGI);
set_intercept(svm, INTERCEPT_SKINIT);
set_intercept(svm, INTERCEPT_WBINVD);
- set_intercept(svm, INTERCEPT_MONITOR);
- set_intercept(svm, INTERCEPT_MWAIT);
set_intercept(svm, INTERCEPT_XSETBV);

control->iopm_base_pa = iopm_base;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 98e82ee..ea0c96a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3547,11 +3547,13 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
CPU_BASED_USE_IO_BITMAPS |
CPU_BASED_MOV_DR_EXITING |
CPU_BASED_USE_TSC_OFFSETING |
- CPU_BASED_MWAIT_EXITING |
- CPU_BASED_MONITOR_EXITING |
CPU_BASED_INVLPG_EXITING |
CPU_BASED_RDPMC_EXITING;

+ if (!kvm_mwait_in_guest())
+ min |= CPU_BASED_MWAIT_EXITING |
+ CPU_BASED_MONITOR_EXITING;
+
opt = CPU_BASED_TPR_SHADOW |
CPU_BASED_USE_MSR_BITMAPS |
CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1faf620..8c74fff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2684,6 +2684,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ADJUST_CLOCK:
r = KVM_CLOCK_TSC_STABLE;
break;
+ case KVM_CAP_X86_GUEST_MWAIT:
+ r = kvm_mwait_in_guest();
+ break;
case KVM_CAP_X86_SMM:
/* SMBASE is usually relocated above 1M on modern chipsets,
* and SMM handlers might indeed rely on 4G segment limits,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index e8ff3e4..a2d8964 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -1,6 +1,8 @@
#ifndef ARCH_X86_KVM_X86_H
#define ARCH_X86_KVM_X86_H

+#include <asm/processor.h>
+#include <asm/mwait.h>
#include <linux/kvm_host.h>
#include <asm/pvclock.h>
#include "kvm_cache_regs.h"
@@ -212,4 +214,30 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
__rem; \
})

+static inline bool kvm_mwait_in_guest(void)
+{
+ unsigned int eax, ebx, ecx, edx;
+
+ if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
+ return false;
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ return false;
+
+ /*
+ * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
+ * they would allow guest to stop the CPU completely by disabling
+ * interrupts then invoking MWAIT.
+ */
+ if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+ return false;
+
+ cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
+
+ if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
+ return false;
+
+ return true;
+}
+
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f51d508..8b6bc06 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -883,6 +883,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_PPC_MMU_RADIX 134
#define KVM_CAP_PPC_MMU_HASH_V3 135
#define KVM_CAP_IMMEDIATE_EXIT 136
+#define KVM_CAP_X86_GUEST_MWAIT 137

#ifdef KVM_CAP_IRQ_ROUTING

--
MST
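
For illustration, a guest could test the new hypervisor-leaf bit roughly
as follows. This is a minimal sketch, not part of the patch: the helper
name is hypothetical, and it assumes the caller has already read EAX from
the KVM features leaf (normally CPUID 0x40000001). Only the bit number
(8) comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit number taken from the kvm_para.h hunk of the patch. */
#define KVM_FEATURE_MWAIT 8

/*
 * Hypothetical helper, not a kernel function.
 * 'features_eax' is assumed to be the EAX value the guest read from the
 * KVM features leaf. Returns true if the host advertises that
 * MONITOR/MWAIT run in the guest without causing VM exits.
 */
static inline bool guest_may_mwait_without_exit(uint32_t features_eax)
{
	return (features_eax >> KVM_FEATURE_MWAIT) & 1u;
}
```

A guest that sees this bit clear should keep using HLT for idle, since
the standard MWAIT flag in CPUID leaf 1 is deliberately left alone here.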


2017-03-15 23:36:59

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> unless explicitly provided with kernel command line argument
> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> without checking CPUID.
>
> We currently emulate that as a NOP but on VMX we can do better: let
> guest stop the CPU until timer, IPI or memory change. CPU will be busy
> but that isn't any worse than a NOP emulation.
>
> Note that mwait within guests is not the same as on real hardware
> because halt causes an exit while mwait doesn't. For this reason it
> might not be a good idea to use the regular MWAIT flag in CPUID to
> signal this capability. Add a flag in the hypervisor leaf instead.
>
> Additionally, we add a capability for QEMU - e.g. if it knows there's an
> isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> to improve guest behaviour.

Same behavior (on the mac pro 1,1 running F22 with custom-compiled
kernel from kvm git master, plus this patch on top).

The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
on boot and does not bring up the guest graphical interface within the
10 minutes that I waited for it. By contrast, with the default
nop-based emulation the guest comes up within 30 seconds.

I will run another round of tests on a newer Mac (4-year-old macbook
air) and report back tomorrow.

Going off on a tangent, why would encouraging otherwise well-behaved
guests (like linux ones, for example) to use MWAIT be desirable to
begin with ? Is it a matter of minimizing the overhead associated with
exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
running guest-mode MWAIT in a tight loop will actually waste the host
CPU without the opportunity to yield to some other L0 thread. Sorry if
I fell into the middle of an ongoing conversation on this and missed
most of the relevant context, in which case please feel free to ignore
me... :)

Thanks,
--G


2017-03-15 23:41:57

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
> On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> > unless explicitly provided with kernel command line argument
> > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> > without checking CPUID.
> >
> > We currently emulate that as a NOP but on VMX we can do better: let
> > guest stop the CPU until timer, IPI or memory change. CPU will be busy
> > but that isn't any worse than a NOP emulation.
> >
> > Note that mwait within guests is not the same as on real hardware
> > because halt causes an exit while mwait doesn't. For this reason it
> > might not be a good idea to use the regular MWAIT flag in CPUID to
> > signal this capability. Add a flag in the hypervisor leaf instead.
> >
> > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> > to improve guest behaviour.
>
> Same behavior (on the mac pro 1,1 running F22 with custom-compiled
> kernel from kvm git master, plus this patch on top).
>
> The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
> on boot, does not bring up guest graphical interface within the first
> 10 minutes that I waited for it. That, in contrast with the default
> nop-based emulation where the guest comes up within 30 seconds.


Thanks a lot, meanwhile I'll try to write a unit-test and experiment
with various behaviours.

> I will run another round of tests on a newer Mac (4-year-old macbook
> air) and report back tomorrow.
>
> Going off on a tangent, why would encouraging otherwise well-behaved
> guests (like linux ones, for example) to use MWAIT be desirable to
> begin with ? Is it a matter of minimizing the overhead associated with
> exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
> running guest-mode MWAIT in a tight loop will actually waste the host
> CPU without the opportunity to yield to some other L0 thread. Sorry if
> I fell into the middle of an ongoing conversation on this and missed
> most of the relevant context, in which case please feel free to ignore
> me... :)
>
> Thanks,
> --G

It's just some experiments I'm running, I'm not ready to describe it
yet. I thought this part might be useful to at least some guests, so
trying to upstream it right now.

--
MST

2017-03-16 13:24:43

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 01:41:28AM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
> > On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> > > unless explicitly provided with kernel command line argument
> > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> > > without checking CPUID.
> > >
> > > We currently emulate that as a NOP but on VMX we can do better: let
> > > guest stop the CPU until timer, IPI or memory change. CPU will be busy
> > > but that isn't any worse than a NOP emulation.
> > >
> > > Note that mwait within guests is not the same as on real hardware
> > > because halt causes an exit while mwait doesn't. For this reason it
> > > might not be a good idea to use the regular MWAIT flag in CPUID to
> > > signal this capability. Add a flag in the hypervisor leaf instead.
> > >
> > > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> > > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> > > to improve guest behaviour.
> >
> > Same behavior (on the mac pro 1,1 running F22 with custom-compiled
> > kernel from kvm git master, plus this patch on top).
> >
> > The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
> > on boot, does not bring up guest graphical interface within the first
> > 10 minutes that I waited for it. That, in contrast with the default
> > nop-based emulation where the guest comes up within 30 seconds.
>
>
> Thanks a lot, meanwhile I'll try to write a unit-test and experiment
> with various behaviours.
>
> > I will run another round of tests on a newer Mac (4-year-old macbook
> > air) and report back tomorrow.
> >
> > Going off on a tangent, why would encouraging otherwise well-behaved
> > guests (like linux ones, for example) to use MWAIT be desirable to
> > begin with ? Is it a matter of minimizing the overhead associated with
> > exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
> > running guest-mode MWAIT in a tight loop will actually waste the host
> > CPU without the opportunity to yield to some other L0 thread. Sorry if
> > I fell into the middle of an ongoing conversation on this and missed
> > most of the relevant context, in which case please feel free to ignore
> > me... :)
> >
> > Thanks,
> > --G
>
> It's just some experiments I'm running, I'm not ready to describe it
> yet. I thought this part might be useful to at least some guests, so
> trying to upstream it right now.

OK, so on a macbook air running F25 and the latest kvm git master plus
your v5 patch (4.11.0-rc2+), things appear to work.

host-side cpuid output:
eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x021120

guest-side cpuid output:
eax=00000000 ebx=00000000 ecx=0x000003 edx=00000000

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz
stepping : 7
microcode : 0x29
cpu MHz : 1157.849
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 3604.68
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
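
Assuming the four registers above are the output of CPUID leaf 5 (the
MONITOR/MWAIT leaf that kvm_mwait_in_guest() reads), they can be decoded
as sketched below. The struct and function names are illustrative only;
the two ECX masks match asm/mwait.h, and the field layout follows the
Intel SDM description of that leaf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same values as in the kernel's arch/x86/include/asm/mwait.h. */
#define CPUID5_ECX_EXTENSIONS_SUPPORTED 0x1
#define CPUID5_ECX_INTERRUPT_BREAK      0x2

/* Illustrative container for one CPUID leaf 5 result (not a kernel type). */
struct mwait_leaf {
	uint32_t eax, ebx, ecx, edx;
};

/* EAX[15:0]: smallest monitor-line size in bytes. */
static inline unsigned int smallest_monitor_line(const struct mwait_leaf *l)
{
	return l->eax & 0xffff;
}

/* EDX holds one 4-bit sub-state count per C-state, C0 in bits 3:0. */
static inline unsigned int mwait_substates(const struct mwait_leaf *l,
					   unsigned int cstate)
{
	if (cstate > 7)
		return 0;	/* only eight 4-bit fields fit in EDX */
	return (l->edx >> (4 * cstate)) & 0xf;
}

/* ECX[1]: interrupts can break MWAIT even when they are masked. */
static inline bool interrupt_break_supported(const struct mwait_leaf *l)
{
	return (l->ecx & CPUID5_ECX_INTERRUPT_BREAK) != 0;
}
```

With the host values (eax=0x40, ecx=0x3, edx=0x021120) this gives a
64-byte monitor line, interrupt-break support, and two C1 sub-states, so
the CPUID check in the patch passes on this machine.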

After studying your patch a bit more carefully (sorry, it's crazy
around here right now :) ) I realized you're simply trying to
(selectively) decide when to exit L1 and emulate as NOP vs. when to
just allow L1 to execute MONITOR & MWAIT natively.

Is that right ? Because if so, the issues I saw on my MacPro1,1 are
weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
natively was one of the options Alex Graf and Rene Rebe used back in
the very early days of OS X on QEMU, at the time I got involved with
that project. Here's part of an out of tree patch against 3.4 which did
just that, and worked as far as I remember on *any* MWAIT capable
intel chip I had access to back in 2010:

##############################################################################
# 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
##############################################################################
diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
--- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
+++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
@@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
/* cpuid 1.ecx */
const u32 kvm_supported_word4_x86_features =
- F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
+ F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
0 /* DS-CPL, VMX, SMX, EST */ |
0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
0 /* Reserved, DCA */ | F(XMM4_1) |
F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
--- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
+++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
@@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
set_intercept(svm, INTERCEPT_VMSAVE);
set_intercept(svm, INTERCEPT_STGI);
set_intercept(svm, INTERCEPT_CLGI);
set_intercept(svm, INTERCEPT_SKINIT);
set_intercept(svm, INTERCEPT_WBINVD);
- set_intercept(svm, INTERCEPT_MONITOR);
- set_intercept(svm, INTERCEPT_MWAIT);
set_intercept(svm, INTERCEPT_XSETBV);

control->iopm_base_pa = iopm_base;
control->msrpm_base_pa = __pa(svm->msrpm);
control->int_ctl = V_INTR_MASKING_MASK;
diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
--- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
+++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
@@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
nested_vmx_procbased_ctls_low = 0;
nested_vmx_procbased_ctls_high &=
CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
- CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+ CPU_BASED_CR3_LOAD_EXITING |
CPU_BASED_CR3_STORE_EXITING |
#ifdef CONFIG_X86_64
CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
#endif
CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
@@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
CPU_BASED_CR3_LOAD_EXITING |
CPU_BASED_CR3_STORE_EXITING |
CPU_BASED_USE_IO_BITMAPS |
CPU_BASED_MOV_DR_EXITING |
CPU_BASED_USE_TSC_OFFSETING |
- CPU_BASED_MWAIT_EXITING |
- CPU_BASED_MONITOR_EXITING |
CPU_BASED_INVLPG_EXITING |
CPU_BASED_RDPMC_EXITING;

opt = CPU_BASED_TPR_SHADOW |
CPU_BASED_USE_MSR_BITMAPS |

If all you're trying to do is (selectively) revert to this behavior,
that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
confused at this point :)

Back in 2010, running MWAIT in L>=1 behaved exactly like a NOP: it
didn't power down the physical CPU and just moved on immediately to the
next instruction. As such, there was no power saving and no
opportunity to yield to another L0 thread either, unlike with NOP
emulation at L0.

Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
doing something smarter than just acting as a guest-mode NOP) ?

Thanks,
--Gabriel

2017-03-16 14:06:49

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> After studying your patch a bit more carefully (sorry, it's crazy
> around here right now :) ) I realized you're simply trying to
> (selectively) decide when to exit L1 and emulate as NOP vs. when to
> just allow L1 to execute MONITOR & MWAIT natively.
>
> Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> natively was one of the options Alex Graf and Rene Rebe used back in
> the very early days of OS X on QEMU, at the time I got involved with
> that project. Here's part of an out of tree patch against 3.4 which did
> just that, and worked as far as I remember on *any* MWAIT capable
> intel chip I had access to back in 2010:
>
> ##############################################################################
> # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> ##############################################################################
> diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> /* cpuid 1.ecx */
> const u32 kvm_supported_word4_x86_features =
> - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> 0 /* DS-CPL, VMX, SMX, EST */ |
> 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> 0 /* Reserved, DCA */ | F(XMM4_1) |
> F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> set_intercept(svm, INTERCEPT_VMSAVE);
> set_intercept(svm, INTERCEPT_STGI);
> set_intercept(svm, INTERCEPT_CLGI);
> set_intercept(svm, INTERCEPT_SKINIT);
> set_intercept(svm, INTERCEPT_WBINVD);
> - set_intercept(svm, INTERCEPT_MONITOR);
> - set_intercept(svm, INTERCEPT_MWAIT);
> set_intercept(svm, INTERCEPT_XSETBV);
>
> control->iopm_base_pa = iopm_base;
> control->msrpm_base_pa = __pa(svm->msrpm);
> control->int_ctl = V_INTR_MASKING_MASK;
> diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> nested_vmx_procbased_ctls_low = 0;
> nested_vmx_procbased_ctls_high &=
> CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> + CPU_BASED_CR3_LOAD_EXITING |
> CPU_BASED_CR3_STORE_EXITING |
> #ifdef CONFIG_X86_64
> CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> #endif
> CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> CPU_BASED_CR3_LOAD_EXITING |
> CPU_BASED_CR3_STORE_EXITING |
> CPU_BASED_USE_IO_BITMAPS |
> CPU_BASED_MOV_DR_EXITING |
> CPU_BASED_USE_TSC_OFFSETING |
> - CPU_BASED_MWAIT_EXITING |
> - CPU_BASED_MONITOR_EXITING |
> CPU_BASED_INVLPG_EXITING |
> CPU_BASED_RDPMC_EXITING;
>
> opt = CPU_BASED_TPR_SHADOW |
> CPU_BASED_USE_MSR_BITMAPS |
>
> If all you're trying to do is (selectively) revert to this behavior,
> that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> confused at this point :)

Yes. Me too. Want to try that other patch and see what happens?

> Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
> didn't power down the physical CPU, just immediately moved on to the
> next instruction. As such, there was no power saving and no
> opportunity to yield to another L0 thread either, unlike with NOP
> emulation at L0.
>
> Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> doing something smarter than just acting as a guest-mode NOP) ?
>
> Thanks,
> --Gabriel

Interesting. What the Intel SDM seems to say is this:

MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-
opcode exception—#UD—if CPL > 0) is determined by the setting of the “MWAIT
exiting” VM-execution control:
— If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit
(see Section 22.1.3).
— If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if
any of the following is true: (1) the “interrupt-window exiting” VM-execution
control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1.
— If the “MWAIT exiting” VM-execution control is 0, the “interrupt-window
exiting” VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT
does not cause the processor to enter an implementation-dependent
optimized state; instead, control passes to the instruction following the
MWAIT instruction.


And since interrupt-window exiting is 0 most of the time for KVM,
I would expect MWAIT to behave normally.
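
To make the three cases concrete, here is a minimal sketch in C of the
decision the quoted text describes. The enum, struct and function names
are made up for illustration; they are not KVM or SDM identifiers. The
sketch assumes CPL 0 and so ignores the #UD case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; they only mirror the SDM text quoted above. */
enum mwait_outcome {
	MWAIT_VM_EXIT,       /* "MWAIT exiting" = 1: trap to the hypervisor */
	MWAIT_NORMAL,        /* the instruction "operates normally" */
	MWAIT_FALLS_THROUGH, /* no optimized state; next instruction runs */
};

struct mwait_ctx {
	bool mwait_exiting;       /* "MWAIT exiting" VM-execution control */
	bool intr_window_exiting; /* "interrupt-window exiting" control */
	bool ecx_bit0;            /* ECX[0] as passed to MWAIT */
	bool rflags_if;           /* RFLAGS.IF at the time of MWAIT */
};

static enum mwait_outcome mwait_in_guest(const struct mwait_ctx *c)
{
	assert(c != NULL);

	if (c->mwait_exiting)
		return MWAIT_VM_EXIT;

	/* Normal if ANY of: window exiting off, ECX[0] clear, IF set. */
	if (!c->intr_window_exiting || !c->ecx_bit0 || c->rflags_if)
		return MWAIT_NORMAL;

	/* Window exiting on, ECX[0] = 1 and IF = 0: acts like a NOP. */
	return MWAIT_FALLS_THROUGH;
}
```

With interrupt-window exiting clear, the second branch is taken
whatever the guest puts in ECX or RFLAGS.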


--
MST

2017-03-16 14:08:22

by Radim Krčmář

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-03-16 09:24-0400, Gabriel L. Somlo:
> On Thu, Mar 16, 2017 at 01:41:28AM +0200, Michael S. Tsirkin wrote:
> > On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
> > > On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> > > > unless explicitly provided with kernel command line argument
> > > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> > > > without checking CPUID.
> > > >
> > > > We currently emulate that as a NOP but on VMX we can do better: let
> > > > guest stop the CPU until timer, IPI or memory change. CPU will be busy
> > > > but that isn't any worse than a NOP emulation.
> > > >
> > > > Note that mwait within guests is not the same as on real hardware
> > > > because halt causes an exit while mwait doesn't. For this reason it
> > > > might not be a good idea to use the regular MWAIT flag in CPUID to
> > > > signal this capability. Add a flag in the hypervisor leaf instead.
> > > >
> > > > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> > > > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> > > > to improve guest behaviour.
> > >
> > > Same behavior (on the mac pro 1,1 running F22 with custom-compiled
> > > kernel from kvm git master, plus this patch on top).
> > >
> > > The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
> > > on boot, does not bring up guest graphical interface within the first
> > > 10 minutes that I waited for it. That, in contrast with the default
> > > nop-based emulation where the guest comes up within 30 seconds.
> >
> >
> > Thanks a lot, meanwhile I'll try to write a unit-test and experiment
> > with various behaviours.
> >
> > > I will run another round of tests on a newer Mac (4-year-old macbook
> > > air) and report back tomorrow.
> > >
> > > Going off on a tangent, why would encouraging otherwise well-behaved
> > > guests (like linux ones, for example) to use MWAIT be desirable to
> > > begin with ? Is it a matter of minimizing the overhead associated with
> > > exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
> > > running guest-mode MWAIT in a tight loop will actually waste the host
> > > CPU without the opportunity to yield to some other L0 thread. Sorry if
> > > I fell into the middle of an ongoing conversation on this and missed
> > > most of the relevant context, in which case please feel free to ignore
> > > me... :)
> > >
> > > Thanks,
> > > --G
> >
> > It's just some experiments I'm running, I'm not ready to describe it
> > yet. I thought this part might be useful to at least some guests, so
> > trying to upstream it right now.
>
> OK, so on a macbook air running F25 and the latest kvm git master plus
> your v5 patch (4.11.0-rc2+), things appear to work.
>
> host-side cpuid output:
> eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x021120
>
> guest-side cpuid output:
> eax=00000000 ebx=00000000 ecx=0x000003 edx=00000000
>
> processor : 3
> vendor_id : GenuineIntel
> cpu family : 6
> model : 42
> model name : Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz
> stepping : 7
> microcode : 0x29
> cpu MHz : 1157.849
> cache size : 4096 KB
> physical id : 0
> siblings : 4
> core id : 1
> cpu cores : 2
> apicid : 3
> initial apicid : 3
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
> bugs :
> bogomips : 3604.68
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
> After studying your patch a bit more carefully (sorry, it's crazy
> around here right now :) ) I realized you're simply trying to
> (selectively) decide when to exit L1 and emulate as NOP vs. when to
> just allow L1 to execute MONITOR & MWAIT natively.
>
> Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> natively was one of the options Alex Graf and Rene Rebe used back in
> the very early days of OS X on QEMU, at the time I got involved with
> that project. Here's part of an out of tree patch against 3.4 which did
> just that, and worked as far as I remember on *any* MWAIT capable
> intel chip I had access to back in 2010:
>
> ##############################################################################
> # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> ##############################################################################
> diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> /* cpuid 1.ecx */
> const u32 kvm_supported_word4_x86_features =
> - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> 0 /* DS-CPL, VMX, SMX, EST */ |
> 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> 0 /* Reserved, DCA */ | F(XMM4_1) |
> F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> set_intercept(svm, INTERCEPT_VMSAVE);
> set_intercept(svm, INTERCEPT_STGI);
> set_intercept(svm, INTERCEPT_CLGI);
> set_intercept(svm, INTERCEPT_SKINIT);
> set_intercept(svm, INTERCEPT_WBINVD);
> - set_intercept(svm, INTERCEPT_MONITOR);
> - set_intercept(svm, INTERCEPT_MWAIT);
> set_intercept(svm, INTERCEPT_XSETBV);
>
> control->iopm_base_pa = iopm_base;
> control->msrpm_base_pa = __pa(svm->msrpm);
> control->int_ctl = V_INTR_MASKING_MASK;
> diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> nested_vmx_procbased_ctls_low = 0;
> nested_vmx_procbased_ctls_high &=
> CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> + CPU_BASED_CR3_LOAD_EXITING |
> CPU_BASED_CR3_STORE_EXITING |
> #ifdef CONFIG_X86_64
> CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> #endif
> CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> CPU_BASED_CR3_LOAD_EXITING |
> CPU_BASED_CR3_STORE_EXITING |
> CPU_BASED_USE_IO_BITMAPS |
> CPU_BASED_MOV_DR_EXITING |
> CPU_BASED_USE_TSC_OFFSETING |
> - CPU_BASED_MWAIT_EXITING |
> - CPU_BASED_MONITOR_EXITING |
> CPU_BASED_INVLPG_EXITING |
> CPU_BASED_RDPMC_EXITING;
>
> opt = CPU_BASED_TPR_SHADOW |
> CPU_BASED_USE_MSR_BITMAPS |
>
> If all you're trying to do is (selectively) revert to this behavior,
> that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> confused at this point :)
>
> Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
> didn't power down the physical CPU, just immediately moved on to the
> next instruction. As such, there was no power saving and no
> opportunity to yield to another L0 thread either, unlike with NOP
> emulation at L0.
>
> Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> doing something smarter than just acting as a guest-mode NOP) ?

Probably. MWAIT on newer Intel chips enters power-saving mode normally.

If hardware-executed MWAIT acted as a NOP in your old chip, then that
shouldn't be a problem either ... Maybe OS X gets confused into doing
something really dumb because we do not expose the MONITOR/MWAIT feature
bit correctly.

Can you try this QEMU patch on the old hardware?

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 7aa762245a54..4b112e12188a 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -2764,10 +2764,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
break;
case 5:
/* mwait info: needed for Core compatibility */
- *eax = 0; /* Smallest monitor-line size in bytes */
- *ebx = 0; /* Largest monitor-line size in bytes */
- *ecx = CPUID_MWAIT_EMX | CPUID_MWAIT_IBE;
- *edx = 0;
+ host_cpuid(index, 0, eax, ebx, ecx, edx);
break;
case 6:
/* Thermal and Power Leaf */
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 55865dbee0aa..1eb78291b093 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -360,6 +360,7 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t function,
if (!kvm_irqchip_in_kernel()) {
ret &= ~CPUID_EXT_X2APIC;
}
+ ret |= CPUID_EXT_MONITOR;
} else if (function == 6 && reg == R_EAX) {
ret |= CPUID_6_EAX_ARAT; /* safe to allow because of emulated APIC */
} else if (function == 7 && index == 0 && reg == R_EBX) {
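
For reference when comparing host and guest output, here is a rough
sketch of how the four registers of CPUID leaf 5 break down. The struct
and function names are hypothetical; the field layout is the one I
believe the SDM gives for leaf 05H, so double-check it against the
manual before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields of CPUID leaf 5. */
struct mwait_leaf {
	uint16_t min_line;    /* EAX[15:0]: smallest monitor-line size, bytes */
	uint16_t max_line;    /* EBX[15:0]: largest monitor-line size, bytes */
	int emx;              /* ECX[0]: MONITOR/MWAIT extensions enumerated */
	int ibe;              /* ECX[1]: interrupts break MWAIT even if IF=0 */
	uint8_t substates[8]; /* EDX: 4 bits per C-state, C0 first */
};

static void decode_mwait_leaf(uint32_t eax, uint32_t ebx, uint32_t ecx,
			      uint32_t edx, struct mwait_leaf *out)
{
	assert(out != NULL);

	out->min_line = eax & 0xffff;
	out->max_line = ebx & 0xffff;
	out->emx = ecx & 1;
	out->ibe = (ecx >> 1) & 1;

	/* Each nibble of EDX counts the MWAIT sub-states of one C-state. */
	for (size_t i = 0; i < 8; i++)
		out->substates[i] = (edx >> (4 * i)) & 0xf;
}
```

Feeding it the host values reported earlier in the thread (eax=0x40,
ebx=0x40, ecx=0x3, edx=0x021120) gives 64-byte monitor lines with both
ECX bits set. The guest values (all zero except ecx=0x3) look like the
hardcoded leaf that the hunk above removes.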


Thanks.

2017-03-16 14:59:18

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > After studying your patch a bit more carefully (sorry, it's crazy
> > around here right now :) ) I realized you're simply trying to
> > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > just allow L1 to execute MONITOR & MWAIT natively.
> >
> > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > natively was one of the options Alex Graf and Rene Rebe used back in
> > the very early days of OS X on QEMU, at the time I got involved with
> > that project. Here's part of an out of tree patch against 3.4 which did
> > just that, and worked as far as I remember on *any* MWAIT capable
> > intel chip I had access to back in 2010:
> >
> > ##############################################################################
> > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> > ##############################################################################
> > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > /* cpuid 1.ecx */
> > const u32 kvm_supported_word4_x86_features =
> > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > 0 /* DS-CPL, VMX, SMX, EST */ |
> > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > set_intercept(svm, INTERCEPT_VMSAVE);
> > set_intercept(svm, INTERCEPT_STGI);
> > set_intercept(svm, INTERCEPT_CLGI);
> > set_intercept(svm, INTERCEPT_SKINIT);
> > set_intercept(svm, INTERCEPT_WBINVD);
> > - set_intercept(svm, INTERCEPT_MONITOR);
> > - set_intercept(svm, INTERCEPT_MWAIT);
> > set_intercept(svm, INTERCEPT_XSETBV);
> >
> > control->iopm_base_pa = iopm_base;
> > control->msrpm_base_pa = __pa(svm->msrpm);
> > control->int_ctl = V_INTR_MASKING_MASK;
> > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > nested_vmx_procbased_ctls_low = 0;
> > nested_vmx_procbased_ctls_high &=
> > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > + CPU_BASED_CR3_LOAD_EXITING |
> > CPU_BASED_CR3_STORE_EXITING |
> > #ifdef CONFIG_X86_64
> > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > #endif
> > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > CPU_BASED_CR3_LOAD_EXITING |
> > CPU_BASED_CR3_STORE_EXITING |
> > CPU_BASED_USE_IO_BITMAPS |
> > CPU_BASED_MOV_DR_EXITING |
> > CPU_BASED_USE_TSC_OFFSETING |
> > - CPU_BASED_MWAIT_EXITING |
> > - CPU_BASED_MONITOR_EXITING |
> > CPU_BASED_INVLPG_EXITING |
> > CPU_BASED_RDPMC_EXITING;
> >
> > opt = CPU_BASED_TPR_SHADOW |
> > CPU_BASED_USE_MSR_BITMAPS |
> >
> > If all you're trying to do is (selectively) revert to this behavior,
> > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > confused at this point :)
>
> Yes. Me too. Want to try that other patch and see what happens?

You mean the old 3.4 patch against current KVM ? I'll try to do that,
might take me a while :)

> > Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
> > didn't power down the physical CPU, just immediately moved on to the
> > next instruction. As such, there was no power saving and no
> > opportunity to yield to another L0 thread either, unlike with NOP
> > emulation at L0.
> >
> > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> > doing something smarter than just acting as a guest-mode NOP) ?
> >
> > Thanks,
> > --Gabriel
>
> Interesting. What it seems to say is this:
>
> MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-
> opcode exception—#UD—if CPL > 0) is determined by the setting of the “MWAIT
> exiting” VM-execution control:
> — If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit
> (see Section 22.1.3).
> — If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if
> any of the following is true: (1) the “interrupt-window exiting” VM-execution
> control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1.
> — If the “MWAIT exiting” VM-execution control is 0, the “interrupt-window
> exiting” VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT
> does not cause the processor to enter an implementation-dependent
> optimized state; instead, control passes to the instruction following the
> MWAIT instruction.
>
>
> And since interrupt-window exiting is 0 most of the time for KVM,
> I would expect MWAIT to behave normally.

The Intel manual said the same thing back in 2010 as well. However,
no matter how the flags were set, interrupt-window exiting or not,
"normal" L1 MWAIT behavior was that it woke up immediately.
Remember, never going to sleep is still correct ("normal" ?) behavior
per the ISA definition of MWAIT :)
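
To illustrate why an immediate return is harmless apart from the wasted
cycles, here is a minimal sketch of the usual waiter pattern. It is not
taken from any real kernel: the wait primitive is a plain C stand-in
(real MONITOR/MWAIT need CPL 0), and the stub fakes a wakeup event after
a few calls so the example terminates on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a MONITOR+MWAIT pair on the given address. */
typedef void (*wait_fn)(const volatile int *addr);

/*
 * Wait until *flag becomes non-zero. The wait primitive may return at
 * any time, so the flag is re-checked on every iteration. Returns how
 * many times the primitive was invoked.
 */
static unsigned long idle_until_set(const volatile int *flag, wait_fn wait)
{
	unsigned long waits = 0;

	assert(flag != NULL && wait != NULL);

	while (!*flag) {
		wait(flag);
		waits++;
	}
	return waits;
}

static volatile int wakeup_pending;          /* the monitored word */
static unsigned long calls_before_event = 3; /* made up for the demo */

/* A "wait" that never sleeps, i.e. behaves like a NOP. */
static void mwait_as_nop(const volatile int *addr)
{
	(void)addr;

	/* Pretend some other agent writes the flag on the third call. */
	if (calls_before_event && --calls_before_event == 0)
		wakeup_pending = 1;
}
```

Calling idle_until_set(&wakeup_pending, mwait_as_nop) still returns once
the flag is written; the loop just spins instead of sleeping, which is
the busy-CPU behavior described above.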

Also, when I tested your patch on the macbook air (where it worked),
not only was the host reporting 400% CPU for qemu (which is to be
expected), but the thermal fan/cooling thing also shifted up into high
gear, which means the physical CPU got hot, which it shouldn't have if
the guest-mode MWAIT actually did put the host CPU into low power.

So at least on this 4-year-old Core i7 chip, the story Intel tells in
its manual still doesn't check out. I could never get any
clarification on what they mean by "operates normally" :)

2017-03-16 15:24:04

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 10:58:20AM -0400, Gabriel L. Somlo wrote:
> On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > > After studying your patch a bit more carefully (sorry, it's crazy
> > > around here right now :) ) I realized you're simply trying to
> > > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > > just allow L1 to execute MONITOR & MWAIT natively.
> > >
> > > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > > natively was one of the options Alex Graf and Rene Rebe used back in
> > > the very early days of OS X on QEMU, at the time I got involved with
> > > that project. Here's part of an out of tree patch against 3.4 which did
> > > just that, and worked as far as I remember on *any* MWAIT capable
> > > intel chip I had access to back in 2010:
> > >
> > > ##############################################################################
> > > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> > > ##############################################################################
> > > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > > /* cpuid 1.ecx */
> > > const u32 kvm_supported_word4_x86_features =
> > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > > set_intercept(svm, INTERCEPT_VMSAVE);
> > > set_intercept(svm, INTERCEPT_STGI);
> > > set_intercept(svm, INTERCEPT_CLGI);
> > > set_intercept(svm, INTERCEPT_SKINIT);
> > > set_intercept(svm, INTERCEPT_WBINVD);
> > > - set_intercept(svm, INTERCEPT_MONITOR);
> > > - set_intercept(svm, INTERCEPT_MWAIT);
> > > set_intercept(svm, INTERCEPT_XSETBV);
> > >
> > > control->iopm_base_pa = iopm_base;
> > > control->msrpm_base_pa = __pa(svm->msrpm);
> > > control->int_ctl = V_INTR_MASKING_MASK;
> > > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > > nested_vmx_procbased_ctls_low = 0;
> > > nested_vmx_procbased_ctls_high &=
> > > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > > + CPU_BASED_CR3_LOAD_EXITING |
> > > CPU_BASED_CR3_STORE_EXITING |
> > > #ifdef CONFIG_X86_64
> > > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > > #endif
> > > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > > CPU_BASED_CR3_LOAD_EXITING |
> > > CPU_BASED_CR3_STORE_EXITING |
> > > CPU_BASED_USE_IO_BITMAPS |
> > > CPU_BASED_MOV_DR_EXITING |
> > > CPU_BASED_USE_TSC_OFFSETING |
> > > - CPU_BASED_MWAIT_EXITING |
> > > - CPU_BASED_MONITOR_EXITING |
> > > CPU_BASED_INVLPG_EXITING |
> > > CPU_BASED_RDPMC_EXITING;
> > >
> > > opt = CPU_BASED_TPR_SHADOW |
> > > CPU_BASED_USE_MSR_BITMAPS |
> > >
> > > If all you're trying to do is (selectively) revert to this behavior,
> > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > > confused at this point :)
> >
> > Yes. Me too. Want to try that other patch and see what happens?
>
> You mean the old 3.4 patch against current KVM ? I'll try to do that,
> might take me a while :)

I can rebase it for you if you send me a link.

> > > Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
> > > didn't power down the physical CPU, just immediately moved on to the
> > > next instruction. As such, there was no power saving and no
> > > opportunity to yield to another L0 thread either, unlike with NOP
> > > emulation at L0.
> > >
> > > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> > > doing something smarter than just acting as a guest-mode NOP) ?
> > >
> > > Thanks,
> > > --Gabriel
> >
> > Interesting. What it seems to say is this:
> >
> > MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-
> > opcode exception—#UD—if CPL > 0) is determined by the setting of the “MWAIT
> > exiting” VM-execution control:
> > — If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit
> > (see Section 22.1.3).
> > — If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if
> > any of the following is true: (1) the “interrupt-window exiting” VM-execution
> > control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1.
> > — If the “MWAIT exiting” VM-execution control is 0, the “interrupt-window
> > exiting” VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT
> > does not cause the processor to enter an implementation-dependent
> > optimized state; instead, control passes to the instruction following the
> > MWAIT instruction.
> >
> >
> > And since interrupt-window exiting is 0 most of the time for KVM,
> > I would expect MWAIT to behave normally.
>
> The intel manual said the same thing back in 2010 as well. However,
> regardless of how any flags were set, interrupt-window exiting or not,
> "normal" L1 MWAIT behavior was that it woke up immediately regardless.
> Remember, never going to sleep is still correct ("normal" ?) behavior
> per the ISA definition of MWAIT :)
>
> Also, when I tested your patch on the macbook air (where it worked),
> not only was the host reporting 400% CPU for qemu (which is to be
> expected), but the thermal fan/cooling thing also shifted up into high
> gear, which means the physical CPU got hot, which it shouldn't have if
> the guest-mode MWAIT actually did put the host CPU into low power.

Does the same happen with NOP, btw?

> So at least on this 4-year-old core-I7 chip, the story Intel tells in
> its manual still doesn't check out. I could never get any
> clarification on what they mean by "operates normally" :)

It could be that Mac OS sets ECX[0] = 1 and RFLAGS.IF = 0.

--
MST

2017-03-16 15:35:38

by Radim Krčmář

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-03-16 10:58-0400, Gabriel L. Somlo:
> On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > > After studying your patch a bit more carefully (sorry, it's crazy
> > > around here right now :) ) I realized you're simply trying to
> > > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > > just allow L1 to execute MONITOR & MWAIT natively.
> > >
> > > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > > natively was one of the options Alex Graf and Rene Rebe used back in
> > > the very early days of OS X on QEMU, at the time I got involved with
> > > that project. Here's part of an out of tree patch against 3.4 which did
> > > just that, and worked as far as I remember on *any* MWAIT capable
> > > intel chip I had access to back in 2010:
> > >
> > > ##############################################################################
> > > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> > > ##############################################################################
> > > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > > /* cpuid 1.ecx */
> > > const u32 kvm_supported_word4_x86_features =
> > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > > set_intercept(svm, INTERCEPT_VMSAVE);
> > > set_intercept(svm, INTERCEPT_STGI);
> > > set_intercept(svm, INTERCEPT_CLGI);
> > > set_intercept(svm, INTERCEPT_SKINIT);
> > > set_intercept(svm, INTERCEPT_WBINVD);
> > > - set_intercept(svm, INTERCEPT_MONITOR);
> > > - set_intercept(svm, INTERCEPT_MWAIT);
> > > set_intercept(svm, INTERCEPT_XSETBV);
> > >
> > > control->iopm_base_pa = iopm_base;
> > > control->msrpm_base_pa = __pa(svm->msrpm);
> > > control->int_ctl = V_INTR_MASKING_MASK;
> > > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > > nested_vmx_procbased_ctls_low = 0;
> > > nested_vmx_procbased_ctls_high &=
> > > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > > + CPU_BASED_CR3_LOAD_EXITING |
> > > CPU_BASED_CR3_STORE_EXITING |
> > > #ifdef CONFIG_X86_64
> > > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > > #endif
> > > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > > CPU_BASED_CR3_LOAD_EXITING |
> > > CPU_BASED_CR3_STORE_EXITING |
> > > CPU_BASED_USE_IO_BITMAPS |
> > > CPU_BASED_MOV_DR_EXITING |
> > > CPU_BASED_USE_TSC_OFFSETING |
> > > - CPU_BASED_MWAIT_EXITING |
> > > - CPU_BASED_MONITOR_EXITING |
> > > CPU_BASED_INVLPG_EXITING |
> > > CPU_BASED_RDPMC_EXITING;
> > >
> > > opt = CPU_BASED_TPR_SHADOW |
> > > CPU_BASED_USE_MSR_BITMAPS |
> > >
> > > If all you're trying to do is (selectively) revert to this behavior,
> > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > > confused at this point :)
> >
> > Yes. Me too. Want to try that other patch and see what happens?
>
> You mean the old 3.4 patch against current KVM ? I'll try to do that,
> might take me a while :)

Michael's patch already did most of that; you just need to add:

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index efde6cc50875..b12f07d4ce17 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -348,7 +348,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
const u32 kvm_cpuid_1_ecx_x86_features =
/* NOTE: MONITOR (and MWAIT) are emulated as NOP,
* but *not* advertised to guests via CPUID ! */
- F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
+ F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
0 /* DS-CPL, VMX, SMX, EST */ |
0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |

Note: this will never be upstream, because mwait isn't what we want by
default. :)

>> > Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
>> > didn't power down the physical CPU, just immediately moved on to the
>> > next instruction. As such, there was no power saving and no
>> > opportunity to yield to another L0 thread either, unlike with NOP
>> > emulation at L0.
>> >
>> > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
>> > doing something smarter than just acting as a guest-mode NOP) ?
>> >
>> > Thanks,
>> > --Gabriel
>>
>> Interesting. What it seems to say is this:
>>
>> MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-
>> opcode exception—#UD—if CPL > 0) is determined by the setting of the “MWAIT
>> exiting” VM-execution control:
>> — If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit
>> (see Section 22.1.3).
>> — If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if
>> any of the following is true: (1) the “interrupt-window exiting” VM-execution
>> control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1.
>> — If the “MWAIT exiting” VM-execution control is 0, the “interrupt-window
>> exiting” VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT
>> does not cause the processor to enter an implementation-dependent
>> optimized state; instead, control passes to the instruction following the
>> MWAIT instruction.
>>
>>
>> And since interrupt-window exiting is 0 most of the time for KVM,
>> I would expect MWAIT to behave normally.
>
> The intel manual said the same thing back in 2010 as well. However,
> regardless of how any flags were set, interrupt-window exiting or not,
> "normal" L1 MWAIT behavior was that it woke up immediately regardless.
> Remember, never going to sleep is still correct ("normal" ?) behavior
> per the ISA definition of MWAIT :)

I'll write a simple kvm-unit-test to better understand why it is broken
for you ...

> Also, when I tested your patch on the macbook air (where it worked),
> not only was the host reporting 400% CPU for qemu (which is to be
> expected), but the thermal fan/cooling thing also shifted up into high
> gear, which means the physical CPU got hot, which it shouldn't have if
> the guest-mode MWAIT actually did put the host CPU into low power.

I tested MWAIT with basically the same kernel patch and the qemu patch
with Linux guest on Haswell and Nehalem. Running the guest took 100% of
the host CPUs, but it still had the same temperature as when the host
was idle.

That reminds me that you need to pass '-cpu host' for QEMU reasons.

2017-03-16 15:54:14

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 03:08:07PM +0100, Radim Krčmář wrote:
> 2017-03-16 09:24-0400, Gabriel L. Somlo:
> > On Thu, Mar 16, 2017 at 01:41:28AM +0200, Michael S. Tsirkin wrote:
> > > On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
> > > > On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > > > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> > > > > unless explicitly provided with kernel command line argument
> > > > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> > > > > without checking CPUID.
> > > > >
> > > > > We currently emulate that as a NOP but on VMX we can do better: let
> > > > > guest stop the CPU until timer, IPI or memory change. CPU will be busy
> > > > > but that isn't any worse than a NOP emulation.
> > > > >
> > > > > Note that mwait within guests is not the same as on real hardware
> > > > > because halt causes an exit while mwait doesn't. For this reason it
> > > > > might not be a good idea to use the regular MWAIT flag in CPUID to
> > > > > signal this capability. Add a flag in the hypervisor leaf instead.
> > > > >
> > > > > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> > > > > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> > > > > to improve guest behaviour.
> > > >
> > > > Same behavior (on the mac pro 1,1 running F22 with custom-compiled
> > > > kernel from kvm git master, plus this patch on top).
> > > >
> > > > The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
> > > > on boot, does not bring up guest graphical interface within the first
> > > > 10 minutes that I waited for it. That, in contrast with the default
> > > > nop-based emulation where the guest comes up within 30 seconds.
> > >
> > >
> > > Thanks a lot, meanwhile I'll try to write a unit-test and experiment
> > > with various behaviours.
> > >
> > > > I will run another round of tests on a newer Mac (4-year-old macbook
> > > > air) and report back tomorrow.
> > > >
> > > > Going off on a tangent, why would encouraging otherwise well-behaved
> > > > guests (like linux ones, for example) to use MWAIT be desirable to
> > > > begin with ? Is it a matter of minimizing the overhead associated with
> > > > exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
> > > > running guest-mode MWAIT in a tight loop will actually waste the host
> > > > CPU without the opportunity to yield to some other L0 thread. Sorry if
> > > > I fell into the middle of an ongoing conversation on this and missed
> > > > most of the relevant context, in which case please feel free to ignore
> > > > me... :)
> > > >
> > > > Thanks,
> > > > --G
> > >
> > > It's just some experiments I'm running, I'm not ready to describe it
> > > yet. I thought this part might be useful to at least some guests, so
> > > trying to upstream it right now.
> >
> > OK, so on a macbook air running F25 and the latest kvm git master plus
> > your v5 patch (4.11.0-rc2+), things appear to work.
> >
> > host-side cpuid output:
> > eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x021120
> >
> > guest-side cpuid output:
> > eax=00000000 ebx=00000000 ecx=0x000003 edx=00000000
> >
> > processor : 3
> > vendor_id : GenuineIntel
> > cpu family : 6
> > model : 42
> > model name : Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz
> > stepping : 7
> > microcode : 0x29
> > cpu MHz : 1157.849
> > cache size : 4096 KB
> > physical id : 0
> > siblings : 4
> > core id : 1
> > cpu cores : 2
> > apicid : 3
> > initial apicid : 3
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 13
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
> > bugs :
> > bogomips : 3604.68
> > clflush size : 64
> > cache_alignment : 64
> > address sizes : 36 bits physical, 48 bits virtual
> > power management:
> >
> > After studying your patch a bit more carefully (sorry, it's crazy
> > around here right now :) ) I realized you're simply trying to
> > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > just allow L1 to execute MONITOR & MWAIT natively.
> >
> > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > natively was one of the options Alex Graf and Rene Rebe used back in
> > the very early days of OS X on QEMU, at the time I got involved with
> > that project. Here's part of an out of tree patch against 3.4 which did
> > just that, and worked as far as I remember on *any* MWAIT capable
> > intel chip I had access to back in 2010:
> >
> > ##############################################################################
> > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> > ##############################################################################
> > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > /* cpuid 1.ecx */
> > const u32 kvm_supported_word4_x86_features =
> > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > 0 /* DS-CPL, VMX, SMX, EST */ |
> > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > set_intercept(svm, INTERCEPT_VMSAVE);
> > set_intercept(svm, INTERCEPT_STGI);
> > set_intercept(svm, INTERCEPT_CLGI);
> > set_intercept(svm, INTERCEPT_SKINIT);
> > set_intercept(svm, INTERCEPT_WBINVD);
> > - set_intercept(svm, INTERCEPT_MONITOR);
> > - set_intercept(svm, INTERCEPT_MWAIT);
> > set_intercept(svm, INTERCEPT_XSETBV);
> >
> > control->iopm_base_pa = iopm_base;
> > control->msrpm_base_pa = __pa(svm->msrpm);
> > control->int_ctl = V_INTR_MASKING_MASK;
> > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > nested_vmx_procbased_ctls_low = 0;
> > nested_vmx_procbased_ctls_high &=
> > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > + CPU_BASED_CR3_LOAD_EXITING |
> > CPU_BASED_CR3_STORE_EXITING |
> > #ifdef CONFIG_X86_64
> > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > #endif
> > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > CPU_BASED_CR3_LOAD_EXITING |
> > CPU_BASED_CR3_STORE_EXITING |
> > CPU_BASED_USE_IO_BITMAPS |
> > CPU_BASED_MOV_DR_EXITING |
> > CPU_BASED_USE_TSC_OFFSETING |
> > - CPU_BASED_MWAIT_EXITING |
> > - CPU_BASED_MONITOR_EXITING |
> > CPU_BASED_INVLPG_EXITING |
> > CPU_BASED_RDPMC_EXITING;
> >
> > opt = CPU_BASED_TPR_SHADOW |
> > CPU_BASED_USE_MSR_BITMAPS |
> >
> > If all you're trying to do is (selectively) revert to this behavior,
> > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > confused at this point :)
> >
> > Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
> > didn't power down the physical CPU, just immediately moved on to the
> > next instruction. As such, there was no power saving and no
> > opportunity to yield to another L0 thread either, unlike with NOP
> > emulation at L0.
> >
> > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> > doing something smarter than just acting as a guest-mode NOP) ?
>
> Probably, MWAIT in new intel chips enters power saving mode normally.
>
> If hardware-executed MWAIT acted as a NOP in your old chip, then that
> shouldn't be a problem either ... Maybe OS X gets confused into doing
> something really dumb because we do not expose the MONITOR/MWAIT feature
> bit correctly.
>
> Can you try this QEMU patch on the old hardware?
>
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 7aa762245a54..4b112e12188a 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -2764,10 +2764,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> break;
> case 5:
> /* mwait info: needed for Core compatibility */
> - *eax = 0; /* Smallest monitor-line size in bytes */
> - *ebx = 0; /* Largest monitor-line size in bytes */
> - *ecx = CPUID_MWAIT_EMX | CPUID_MWAIT_IBE;
> - *edx = 0;
> + host_cpuid(index, 0, eax, ebx, ecx, edx);
> break;
> case 6:
> /* Thermal and Power Leaf */
> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
> index 55865dbee0aa..1eb78291b093 100644
> --- a/target/i386/kvm.c
> +++ b/target/i386/kvm.c
> @@ -360,6 +360,7 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t function,
> if (!kvm_irqchip_in_kernel()) {
> ret &= ~CPUID_EXT_X2APIC;
> }
> + ret |= CPUID_EXT_MONITOR;
> } else if (function == 6 && reg == R_EAX) {
> ret |= CPUID_6_EAX_ARAT; /* safe to allow because of emulated APIC */
> } else if (function == 7 && index == 0 && reg == R_EBX) {
>
>
> Thanks.

No change, still hangs on boot.

Thanks,
--G

2017-03-16 15:54:29

by Radim Krčmář

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-03-16 11:44-0400, Gabriel L. Somlo:
> On Thu, Mar 16, 2017 at 03:08:07PM +0100, Radim Krčmář wrote:
>> 2017-03-16 09:24-0400, Gabriel L. Somlo:
>> > On Thu, Mar 16, 2017 at 01:41:28AM +0200, Michael S. Tsirkin wrote:
>> > > On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
>> > > > On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
>> > > > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>> > > > > unless explicitly provided with kernel command line argument
>> > > > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>> > > > > without checking CPUID.
>> > > > >
>> > > > > We currently emulate that as a NOP but on VMX we can do better: let
>> > > > > guest stop the CPU until timer, IPI or memory change. CPU will be busy
>> > > > > but that isn't any worse than a NOP emulation.
>> > > > >
>> > > > > Note that mwait within guests is not the same as on real hardware
>> > > > > because halt causes an exit while mwait doesn't. For this reason it
>> > > > > might not be a good idea to use the regular MWAIT flag in CPUID to
>> > > > > signal this capability. Add a flag in the hypervisor leaf instead.
>> > > > >
>> > > > > Additionally, we add a capability for QEMU - e.g. if it knows there's an
>> > > > > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
>> > > > > to improve guest behaviour.
>> > > >
>> > > > Same behavior (on the mac pro 1,1 running F22 with custom-compiled
>> > > > kernel from kvm git master, plus this patch on top).
>> > > >
>> > > > The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
>> > > > on boot, does not bring up guest graphical interface within the first
>> > > > 10 minutes that I waited for it. That, in contrast with the default
>> > > > nop-based emulation where the guest comes up within 30 seconds.
>> > >
>> > >
>> > > Thanks a lot, meanwhile I'll try to write a unit-test and experiment
>> > > with various behaviours.
>> > >
>> > > > I will run another round of tests on a newer Mac (4-year-old macbook
>> > > > air) and report back tomorrow.
>> > > >
>> > > > Going off on a tangent, why would encouraging otherwise well-behaved
>> > > > guests (like linux ones, for example) to use MWAIT be desirable to
>> > > > begin with ? Is it a matter of minimizing the overhead associated with
>> > > > exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
>> > > > running guest-mode MWAIT in a tight loop will actually waste the host
>> > > > CPU without the opportunity to yield to some other L0 thread. Sorry if
>> > > > I fell into the middle of an ongoing conversation on this and missed
>> > > > most of the relevant context, in which case please feel free to ignore
>> > > > me... :)
>> > > >
>> > > > Thanks,
>> > > > --G
>> > >
>> > > It's just some experiments I'm running, I'm not ready to describe it
>> > > yet. I thought this part might be useful to at least some guests, so
>> > > trying to upstream it right now.
>> >
>> > OK, so on a macbook air running F25 and the latest kvm git master plus
>> > your v5 patch (4.11.0-rc2+), things appear to work.
>> >
>> > host-side cpuid output:
>> > eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x021120
>> >
>> > guest-side cpuid output:
>> > eax=00000000 ebx=00000000 ecx=0x000003 edx=00000000
>> >
>> > processor : 3
>> > vendor_id : GenuineIntel
>> > cpu family : 6
>> > model : 42
>> > model name : Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz
>> > stepping : 7
>> > microcode : 0x29
>> > cpu MHz : 1157.849
>> > cache size : 4096 KB
>> > physical id : 0
>> > siblings : 4
>> > core id : 1
>> > cpu cores : 2
>> > apicid : 3
>> > initial apicid : 3
>> > fpu : yes
>> > fpu_exception : yes
>> > cpuid level : 13
>> > wp : yes
>> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
>> > bugs :
>> > bogomips : 3604.68
>> > clflush size : 64
>> > cache_alignment : 64
>> > address sizes : 36 bits physical, 48 bits virtual
>> > power management:
>> >
>> > After studying your patch a bit more carefully (sorry, it's crazy
>> > around here right now :) ) I realized you're simply trying to
>> > (selectively) decide when to exit L1 and emulate as NOP vs. when to
>> > just allow L1 to execute MONITOR & MWAIT natively.
>> >
>> > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
>> > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
>> > natively was one of the options Alex Graf and Rene Rebe used back in
>> > the very early days of OS X on QEMU, at the time I got involved with
>> > that project. Here's part of an out of tree patch against 3.4 which did
>> > just that, and worked as far as I remember on *any* MWAIT capable
>> > intel chip I had access to back in 2010:
>> >
>> > ##############################################################################
>> > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
>> > ##############################################################################
>> > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
>> > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
>> > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
>> > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
>> > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
>> > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
>> > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
>> > /* cpuid 1.ecx */
>> > const u32 kvm_supported_word4_x86_features =
>> > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
>> > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
>> > 0 /* DS-CPL, VMX, SMX, EST */ |
>> > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
>> > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
>> > 0 /* Reserved, DCA */ | F(XMM4_1) |
>> > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
>> > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
>> > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
>> > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
>> > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
>> > set_intercept(svm, INTERCEPT_VMSAVE);
>> > set_intercept(svm, INTERCEPT_STGI);
>> > set_intercept(svm, INTERCEPT_CLGI);
>> > set_intercept(svm, INTERCEPT_SKINIT);
>> > set_intercept(svm, INTERCEPT_WBINVD);
>> > - set_intercept(svm, INTERCEPT_MONITOR);
>> > - set_intercept(svm, INTERCEPT_MWAIT);
>> > set_intercept(svm, INTERCEPT_XSETBV);
>> >
>> > control->iopm_base_pa = iopm_base;
>> > control->msrpm_base_pa = __pa(svm->msrpm);
>> > control->int_ctl = V_INTR_MASKING_MASK;
>> > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
>> > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
>> > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
>> > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
>> > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
>> > nested_vmx_procbased_ctls_low = 0;
>> > nested_vmx_procbased_ctls_high &=
>> > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
>> > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
>> > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
>> > + CPU_BASED_CR3_LOAD_EXITING |
>> > CPU_BASED_CR3_STORE_EXITING |
>> > #ifdef CONFIG_X86_64
>> > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
>> > #endif
>> > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
>> > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
>> > CPU_BASED_CR3_LOAD_EXITING |
>> > CPU_BASED_CR3_STORE_EXITING |
>> > CPU_BASED_USE_IO_BITMAPS |
>> > CPU_BASED_MOV_DR_EXITING |
>> > CPU_BASED_USE_TSC_OFFSETING |
>> > - CPU_BASED_MWAIT_EXITING |
>> > - CPU_BASED_MONITOR_EXITING |
>> > CPU_BASED_INVLPG_EXITING |
>> > CPU_BASED_RDPMC_EXITING;
>> >
>> > opt = CPU_BASED_TPR_SHADOW |
>> > CPU_BASED_USE_MSR_BITMAPS |
>> >
>> > If all you're trying to do is (selectively) revert to this behavior,
>> > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
>> > confused at this point :)
>> >
>> > Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
>> > didn't power down the physical CPU, just immediately moved on to the
>> > next instruction. As such, there was no power saving and no
>> > opportunity to yield to another L0 thread either, unlike with NOP
>> > emulation at L0.
>> >
>> > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
>> > doing something smarter than just acting as a guest-mode NOP) ?
>>
>> Probably, MWAIT in new intel chips enters power saving mode normally.
>>
>> If hardware-executed MWAIT acted as a NOP in your old chip, then that
>> shouldn't be a problem either ... Maybe OS X gets confused into doing
>> something really dumb because we do not expose the MONITOR/MWAIT feature
>> bit correctly.
>>
>> Can you try this QEMU patch on the old hardware?
>>
>> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
>> index 7aa762245a54..4b112e12188a 100644
>> --- a/target/i386/cpu.c
>> +++ b/target/i386/cpu.c
>> @@ -2764,10 +2764,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>> break;
>> case 5:
>> /* mwait info: needed for Core compatibility */
>> - *eax = 0; /* Smallest monitor-line size in bytes */
>> - *ebx = 0; /* Largest monitor-line size in bytes */
>> - *ecx = CPUID_MWAIT_EMX | CPUID_MWAIT_IBE;
>> - *edx = 0;
>> + host_cpuid(index, 0, eax, ebx, ecx, edx);
>> break;
>> case 6:
>> /* Thermal and Power Leaf */
>> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
>> index 55865dbee0aa..1eb78291b093 100644
>> --- a/target/i386/kvm.c
>> +++ b/target/i386/kvm.c
>> @@ -360,6 +360,7 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t function,
>> if (!kvm_irqchip_in_kernel()) {
>> ret &= ~CPUID_EXT_X2APIC;
>> }
>> + ret |= CPUID_EXT_MONITOR;
>> } else if (function == 6 && reg == R_EAX) {
>> ret |= CPUID_6_EAX_ARAT; /* safe to allow because of emulated APIC */
>> } else if (function == 7 && index == 0 && reg == R_EBX) {
>>
>>
>> Thanks.
>
> No change, still hangs on boot.

Hm, also with '-cpu host'?
(I forgot that the CPUID_EXT_MONITOR isn't visible in the guest
otherwise ...)
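
To see what the guest really gets, the relevant bits can be decoded from the
raw register values. A rough sketch follows, assuming the SDM layout of
CPUID.01H:ECX bit 3 (MONITOR/MWAIT) and of leaf 5; the struct, the macro names
and both helpers are hypothetical and are not QEMU or kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

#define LEAF1_ECX_MONITOR (1u << 3) /* CPUID.01H:ECX bit 3 */
#define LEAF5_ECX_EXT     (1u << 0) /* MONITOR/MWAIT extensions enumerated */
#define LEAF5_ECX_INTR    (1u << 1) /* interrupts break MWAIT even if IF=0 */

/* Hypothetical container for the decoded leaf 5 fields. */
struct mwait_leaf {
	unsigned smallest_line;  /* EAX[15:0], bytes */
	unsigned largest_line;   /* EBX[15:0], bytes */
	bool extensions;         /* ECX bit 0 */
	bool intr_break;         /* ECX bit 1 */
	unsigned sub_cstates[8]; /* EDX nibbles, C0 first */
};

static bool cpuid_has_monitor(uint32_t leaf1_ecx)
{
	return (leaf1_ecx & LEAF1_ECX_MONITOR) != 0;
}

static struct mwait_leaf decode_mwait_leaf(uint32_t eax, uint32_t ebx,
					   uint32_t ecx, uint32_t edx)
{
	struct mwait_leaf m;
	int i;

	m.smallest_line = eax & 0xffff;
	m.largest_line = ebx & 0xffff;
	m.extensions = (ecx & LEAF5_ECX_EXT) != 0;
	m.intr_break = (ecx & LEAF5_ECX_INTR) != 0;
	/* One 4-bit field per C-state, lowest nibble is C0. */
	for (i = 0; i < 8; i++)
		m.sub_cstates[i] = (edx >> (4 * i)) & 0xf;
	return m;
}
```

Feeding it the host-side leaf 5 values posted earlier in the thread
(eax=0x40 ebx=0x40 ecx=0x3 edx=0x021120) gives 64-byte monitor lines and a
non-zero sub C-state count for C1 through C4, while the guest-side values
decode to a line size of 0. On a live x86 system the inputs could come from
__get_cpuid() in GCC's <cpuid.h>.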

Thanks.

2017-03-16 16:02:32

by Radim Krčmář

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-03-16 16:35+0100, Radim Krčmář:
> 2017-03-16 10:58-0400, Gabriel L. Somlo:
>> The intel manual said the same thing back in 2010 as well. However,
>> regardless of how any flags were set, interrupt-window exiting or not,
>> "normal" L1 MWAIT behavior was that it woke up immediately regardless.
>> Remember, never going to sleep is still correct ("normal" ?) behavior
>> per the ISA definition of MWAIT :)
>
> I'll write a simple kvm-unit-test to better understand why it is broken
> for you ...

Please get git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git

and try this, thanks!

---8<---
x86/mwait: crappy test

`./configure && make` to build it, then follow the comment in the code to
try a few cases.

---
x86/Makefile.common | 1 +
x86/mwait.c | 41 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 42 insertions(+)
create mode 100644 x86/mwait.c

diff --git a/x86/Makefile.common b/x86/Makefile.common
index 1dad18ba26e1..1e708a6acd39 100644
--- a/x86/Makefile.common
+++ b/x86/Makefile.common
@@ -46,6 +46,7 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
$(TEST_DIR)/tsc_adjust.flat $(TEST_DIR)/asyncpf.flat \
$(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat \
$(TEST_DIR)/hyperv_synic.flat $(TEST_DIR)/hyperv_stimer.flat \
+ $(TEST_DIR)/mwait.flat \

ifdef API
tests-common += api/api-sample
diff --git a/x86/mwait.c b/x86/mwait.c
new file mode 100644
index 000000000000..c21dab5cc97d
--- /dev/null
+++ b/x86/mwait.c
@@ -0,0 +1,41 @@
+#include "vm.h"
+
+#define TARGET_RESUMES 10000
+volatile unsigned page[4096 / 4];
+
+/*
+ * Execute
+ * time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1'
+ * (first two arguments are eax and ecx for MWAIT, the third is FLAGS.IF bit)
+ * I assume you have 1000 Hz scheduler, so the test should take about 10
+ * seconds to run if mwait works (host timer interrupts will kick mwait).
+ *
+ * If you get far less, then mwait is just nop, as in the case of
+ *
+ * time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0'
+ *
+ * All other combinations of arguments should take 10 seconds.
+ * Getting killed by the TIMEOUT most likely means that you have different HZ,
+ * but could also be a bug ...
+ */
+int main(int argc, char **argv)
+{
+ uint32_t eax = atol(argv[1]);
+ uint32_t ecx = atol(argv[2]);
+ bool sti = atol(argv[3]);
+ unsigned resumes = 0;
+
+ if (sti)
+ asm volatile ("sti");
+ else
+ asm volatile ("cli");
+
+ while (resumes < TARGET_RESUMES) {
+ asm volatile("monitor" :: "a" (page), "c" (0), "d" (0));
+ asm volatile("mwait" :: "a" (eax), "c" (ecx));
+ resumes++;
+ }
+
+ report("resumed from mwait %u times", resumes == TARGET_RESUMES, resumes);
+ return report_summary();
+}
--
2.11.0
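
The timing estimate in the test comment is plain arithmetic, spelled out here
as a sketch. It assumes that every MWAIT is ended by exactly one host timer
tick and by nothing else, which is optimistic (any other interrupt wakes it
sooner); the helper names are hypothetical.

```c
#include <stdbool.h>

/* Seconds the loop should need if each resume costs one timer tick. */
static double expected_runtime_s(unsigned resumes, unsigned host_hz)
{
	return (double)resumes / (double)host_hz;
}

/* True if the run would be killed by TIMEOUT before it finishes. */
static bool hits_timeout(unsigned resumes, unsigned host_hz,
			 unsigned timeout_s)
{
	return expected_runtime_s(resumes, host_hz) > (double)timeout_s;
}
```

With TARGET_RESUMES = 10000 and HZ = 1000 this gives 10 seconds, well inside
TIMEOUT=20. With HZ = 250 the same loop would need 40 seconds and get killed,
which is the "different HZ" case mentioned in the comment.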

2017-03-16 16:18:22

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 04:35:18PM +0100, Radim Krčmář wrote:
> 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > > > After studying your patch a bit more carefully (sorry, it's crazy
> > > > around here right now :) ) I realized you're simply trying to
> > > > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > > > just allow L1 to execute MONITOR & MWAIT natively.
> > > >
> > > > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > > > natively was one of the options Alex Graf and Rene Rebe used back in
> > > > the very early days of OS X on QEMU, at the time I got involved with
> > > > that project. Here's part of an out of tree patch against 3.4 which did
> > > > just that, and worked as far as I remember on *any* MWAIT capable
> > > > intel chip I had access to back in 2010:
> > > >
> > > > ##############################################################################
> > > > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> > > > ##############################################################################
> > > > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > > > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > > > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > > > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > > > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > > > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > > > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > > > /* cpuid 1.ecx */
> > > > const u32 kvm_supported_word4_x86_features =
> > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > > > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > > > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > > > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > > > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > > > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > > > set_intercept(svm, INTERCEPT_VMSAVE);
> > > > set_intercept(svm, INTERCEPT_STGI);
> > > > set_intercept(svm, INTERCEPT_CLGI);
> > > > set_intercept(svm, INTERCEPT_SKINIT);
> > > > set_intercept(svm, INTERCEPT_WBINVD);
> > > > - set_intercept(svm, INTERCEPT_MONITOR);
> > > > - set_intercept(svm, INTERCEPT_MWAIT);
> > > > set_intercept(svm, INTERCEPT_XSETBV);
> > > >
> > > > control->iopm_base_pa = iopm_base;
> > > > control->msrpm_base_pa = __pa(svm->msrpm);
> > > > control->int_ctl = V_INTR_MASKING_MASK;
> > > > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > > > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > > > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > > > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > > > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > > > nested_vmx_procbased_ctls_low = 0;
> > > > nested_vmx_procbased_ctls_high &=
> > > > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > > > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > > > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > > > + CPU_BASED_CR3_LOAD_EXITING |
> > > > CPU_BASED_CR3_STORE_EXITING |
> > > > #ifdef CONFIG_X86_64
> > > > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > > > #endif
> > > > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > > > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > > > CPU_BASED_CR3_LOAD_EXITING |
> > > > CPU_BASED_CR3_STORE_EXITING |
> > > > CPU_BASED_USE_IO_BITMAPS |
> > > > CPU_BASED_MOV_DR_EXITING |
> > > > CPU_BASED_USE_TSC_OFFSETING |
> > > > - CPU_BASED_MWAIT_EXITING |
> > > > - CPU_BASED_MONITOR_EXITING |
> > > > CPU_BASED_INVLPG_EXITING |
> > > > CPU_BASED_RDPMC_EXITING;
> > > >
> > > > opt = CPU_BASED_TPR_SHADOW |
> > > > CPU_BASED_USE_MSR_BITMAPS |
> > > >
> > > > If all you're trying to do is (selectively) revert to this behavior,
> > > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > > > confused at this point :)
> > >
> > > Yes. Me too. Want to try that other patch and see what happens?
> >
> > You mean the old 3.4 patch against current KVM ? I'll try to do that,
> > might take me a while :)
>
> Michael's patch already did most of that, you just need to add
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index efde6cc50875..b12f07d4ce17 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -348,7 +348,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> const u32 kvm_cpuid_1_ecx_x86_features =
> /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> * but *not* advertised to guests via CPUID ! */
> - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> 0 /* DS-CPL, VMX, SMX, EST */ |
> 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
>
> Note: this will never be upstream, because mwait isn't what we want by
> default. :)

But since OS X doesn't check CPUID and simply runs MONITOR & MWAIT
assuming they're present, the above one-liner would make no
difference. If everything else in the old patch I quoted is identical
to what Michael does, then I don't know -- maybe the MacPro1,1 has
really broken L>=1 MWAIT, and it only ever worked with vmexit and
emulation on the host side.

> >> > Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
> >> > didn't power down the physical CPU, just immediately moved on to the
> >> > next instruction. As such, there was no power saving and no
> >> > opportunity to yield to another L0 thread either, unlike with NOP
> >> > emulation at L0.
> >> >
> >> > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> >> > doing something smarter than just acting as a guest-mode NOP) ?
> >> >
> >> > Thanks,
> >> > --Gabriel
> >>
> >> Interesting. What it seems to say is this:
> >>
> >> MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-
> >> opcode exception—#UD—if CPL > 0) is determined by the setting of the “MWAIT
> >> exiting” VM-execution control:
> >> — If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit
> >> (see Section 22.1.3).
> >> — If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if
> >> any of the following is true: (1) the “interrupt-window exiting” VM-execution
> >> control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1.
> >> — If the “MWAIT exiting” VM-execution control is 0, the “interrupt-window
> >> exiting” VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT
> >> does not cause the processor to enter an implementation-dependent
> >> optimized state; instead, control passes to the instruction following the
> >> MWAIT instruction.
> >>
> >>
> >> And since interrupt-window exiting is 0 most of the time for KVM,
> >> I would expect MWAIT to behave normally.
> >
> > The intel manual said the same thing back in 2010 as well. However,
> > regardless of how any flags were set, interrupt-window exiting or not,
> > "normal" L1 MWAIT behavior was that it woke up immediately regardless.
> > Remember, never going to sleep is still correct ("normal" ?) behavior
> > per the ISA definition of MWAIT :)
>
> I'll write a simple kvm-unit-test to better understand why it is broken
> for you ...
>
> > Also, when I tested your patch on the macbook air (where it worked),
> > not only was the host reporting 400% CPU for qemu (which is to be
> > expected), but the thermal fan/cooling thing also shifted up into high
> > gear, which means the physical CPU got hot, which it shouldn't have if
> > the guest-mode MWAIT actually did put the host CPU into low power.
>
> I tested MWAIT with basically the same kernel patch and the qemu patch
> with Linux guest on Haswell and Nehalem. Running the guest took 100% of
> the host CPUs, but it still had the same temperature as when the host
> was idle.
>
> That reminds me that you need to pass '-cpu host' for QEMU reasons.

For OS X to boot, one needs '-cpu core2duo' for <= 10.11, and
'-cpu Penryn' for 10.12. I never managed to get it working with any
other settings.

So I'm ready to write off the MacPro1,1 (unless you want me to run more
tests and report back for you, which I'm happy to do in any case).

But please please, so at least I walk away from this having learned
something :) help me understand the use case:

- By careful setting of vmx flags, and/or on newer, sanely
built Intel hardware, L1 MWAIT actually powers down the
physical host core (while I couldn't get it to stay cool
on my end, I totally believe you managed to pull it off)

- We never admit to supporting MWAIT to guests, but when they
do anyway (either because they're old/grumpy/careless OS X
versions, or some newfangled custom-built Linux kernel which
is hacked to ignore CPUID on purpose), we now allow the
guest to:
- keep its allotted time slice
- but "waste" it by powering down the host CPU
instead of
- vmexit to the host OS at L0
- yield the host core to another L0 runnable thread

Since newer OS X actually checks CPUID, I don't have a major stake in
one way vs. the other, but I'm really really curious:

Are we trying to save power assuming the host is unlikely to have
enough runnable L0 threads for when the L0-emulated NOP yields? So
we're better off letting the guest keep the CPU but also keep it cool
while at it (assuming the guest isn't totally hostile and didn't pick
a setting where L1 MWAIT actually works as L1 NOP, in which case we
don't even get to stay cool)?
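
The "setting where L1 MWAIT actually works as L1 NOP" is the third case of
the SDM text quoted above. As a rough sketch only (plain userspace C; the
names `classify_mwait` and `enum mwait_outcome` are made up for illustration
and are not KVM or SDM identifiers), the three quoted cases reduce to:

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum mwait_outcome {
	MWAIT_VM_EXIT,      /* "MWAIT exiting" is 1: trap to the hypervisor */
	MWAIT_NORMAL,       /* runs natively, may enter an optimized state */
	MWAIT_FALL_THROUGH, /* control passes to the next instruction */
};

/*
 * Encodes the three SDM cases quoted in this thread. The #UD case for
 * CPL > 0 is deliberately left out of this sketch.
 */
static enum mwait_outcome classify_mwait(bool mwait_exiting,
					 bool intr_window_exiting,
					 bool ecx_bit0, bool rflags_if)
{
	if (mwait_exiting)
		return MWAIT_VM_EXIT;
	if (intr_window_exiting && ecx_bit0 && !rflags_if)
		return MWAIT_FALL_THROUGH;
	return MWAIT_NORMAL;
}
```

So with "MWAIT exiting" cleared, the guest only gets the NOP-like behavior
when all three of the remaining conditions line up; any other combination
is supposed to let MWAIT operate normally.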

Man, I wish I had the cycles to resurrect my attempt at actually emulating
MWAIT with something like a condition queue (below, just for reference).
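
For anyone skimming, here is a toy userspace model of the page-granularity
idea in the patch below. It is not the kernel code: all `toy_*` names are
invented for this sketch, there is no locking, and the write-protect fault
is replaced by an explicit call to `toy_write()` on every store.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12

enum toy_state { TOY_RUNNABLE, TOY_MWAIT };

struct toy_vcpu {
	bool armed;           /* MONITOR executed and not yet triggered */
	uint64_t gfn;         /* guest frame number being monitored */
	enum toy_state state;
};

/* MONITOR: arm the VCPU on the page containing gpa. */
static void toy_monitor(struct toy_vcpu *v, uint64_t gpa)
{
	v->armed = true;
	v->gfn = gpa >> TOY_PAGE_SHIFT;
}

/* MWAIT: sleep only if still armed, otherwise fall through like a NOP. */
static void toy_mwait(struct toy_vcpu *v)
{
	if (v->armed)
		v->state = TOY_MWAIT;
}

/*
 * Store to gpa: disarm and wake every VCPU monitoring that page.
 * Returns the number of VCPUs that were disarmed.
 */
static int toy_write(struct toy_vcpu *vcpus, size_t n, uint64_t gpa)
{
	uint64_t gfn = gpa >> TOY_PAGE_SHIFT;
	int woken = 0;

	for (size_t i = 0; i < n; i++) {
		if (!vcpus[i].armed || vcpus[i].gfn != gfn)
			continue;
		vcpus[i].armed = false;
		vcpus[i].state = TOY_RUNNABLE;
		woken++;
	}
	return woken;
}
```

Because the monitor granularity is a whole page, a store anywhere in the
page wakes the waiter. That is a spurious wakeup from the guest's point of
view, which is still acceptable MWAIT behavior.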

Thanks much,
--Gabriel


##############################################################################
# kvm-mwait-emu.patch (Gabriel Somlo <[email protected]> 2014/02/05)
# -- based on an idea suggested by Alex Graf --
# GLS: emulate MONITOR and MWAIT at page-level granularity by write-protecting
# the page containing a monitored location and appropriately handling
# subsequent write faults.
# After debugging the SMP issue, we'll need a way to trigger a
# periodic cleanup that will switch write-protected monitored pages
# back to read-write, once they've stayed unused for "long enough"
##############################################################################
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fdf83af..7ca9b51 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -337,6 +337,16 @@ struct kvm_pmu {
u64 reprogram_pmi;
};

+/*
+ * mwait-monitored page list element type
+ */
+struct kvm_mwait_pg {
+ gpa_t gpa;
+ struct list_head vcpu_list; /* VCPUs monitoring (armed on) this page */
+ struct list_head link; /* links mwait-pages within a KVM */
+ unsigned accessed;
+};
+
struct kvm_vcpu_arch {
/*
* rip and regs accesses must go through
@@ -528,6 +538,10 @@ struct kvm_vcpu_arch {
struct {
bool pv_unhalted;
} pv;
+
+ /* MONITOR/MWAIT support */
+ struct kvm_mwait_pg *mwp; /* page monitored by this VCPU */
+ struct list_head mw_link; /* all VCPUs monitoring the same page */
};

struct kvm_lpage_info {
@@ -607,6 +621,10 @@ struct kvm_arch {
u64 hv_hypercall;
u64 hv_tsc_page;

+ /* MONITOR/MWAIT support */
+ struct mutex mwait_lock;
+ struct list_head mwait_pg_list; /* monitored pages within this KVM */
+
#ifdef CONFIG_KVM_MMU_AUDIT
int audit_point;
#endif
@@ -854,6 +872,8 @@ int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port);
void kvm_emulate_cpuid(struct kvm_vcpu *vcpu);
int kvm_emulate_halt(struct kvm_vcpu *vcpu);
int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu);
+int kvm_emulate_monitor(struct kvm_vcpu *vcpu);
+int kvm_emulate_mwait(struct kvm_vcpu *vcpu);

void kvm_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, int seg);
@@ -915,6 +935,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
const u8 *new, int bytes);
int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
+int kvm_mmu_protect_page(struct kvm *kvm, gfn_t gfn);
void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
int kvm_mmu_load(struct kvm_vcpu *vcpu);
void kvm_mmu_unload(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c697625..7d4f1ca 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -279,6 +279,14 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
/* cpuid 1.ecx */
const u32 kvm_supported_word4_x86_features =
+ /* OS X does not check CPUID before using MONITOR/MWAIT from its
+ * power-optimized idle loop (AppleIntelPowerManagement.kext).
+ * For now, we don't advertise MWAIT support below, but attempt
+ * to emulate them instead of issuing an invalid opcode fault
+ * if a misbehaving guest calls them anyway. Removing the above
+ * mentioned kext from OS X will cause it to fall back to a
+ * HLT-based idle loop, as an optional guest optimization step.
+ */
F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
0 /* DS-CPL, VMX, SMX, EST */ |
0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e50425d..bc02ebd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2283,6 +2283,20 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
}
EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page);

+int kvm_mmu_protect_page(struct kvm *kvm, gfn_t gfn)
+{
+ int r;
+
+ spin_lock(&kvm->mmu_lock);
+ r = rmap_write_protect(kvm, gfn);
+ if (r)
+ kvm_flush_remote_tlbs(kvm);
+ spin_unlock(&kvm->mmu_lock);
+
+ return r;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_protect_page);
+
/*
* The function is based on mtrr_type_lookup() in
* arch/x86/kernel/cpu/mtrr/generic.c
@@ -4146,12 +4160,68 @@ static bool is_mmio_page_fault(struct kvm_vcpu *vcpu, gva_t addr)
return vcpu_match_mmio_gva(vcpu, addr);
}

+// try to handle fault caused by write to monitored (mwait) page
+// FIXME: aim for better integration between this and FNAME(page_fault)() and
+// kvm_mmu_page_fault() below. For now, this is proof-of-concept code.
+static bool handle_mwait_write_fault(struct kvm_vcpu *vcpu, gva_t gva,
+ void *in, int in_len)
+{
+ gpa_t gpa;
+ struct kvm_mwait_pg *p, *mwp = NULL;
+ struct kvm_vcpu_arch *v, *u;
+ bool r = false;
+
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+ if (gpa == UNMAPPED_GVA)
+ goto ul_out;
+
+ mutex_lock(&vcpu->kvm->arch.mwait_lock);
+
+ /* is gpa matching a monitored (mwait) page? */
+ list_for_each_entry(p, &vcpu->kvm->arch.mwait_pg_list, link)
+ if (p->gpa == gpa) {
+ mwp = p;
+ break;
+ }
+ if (mwp == NULL)
+ goto out;
+
+ mwp->accessed = 1;
+
+ if (x86_emulate_instruction(vcpu, gva,
+ EMULTYPE_RETRY, in, in_len) != EMULATE_DONE)
+ goto out;
+
+ /* disarm all VCPUs monitoring this page, waking them if needed */
+ list_for_each_entry_safe(v, u, &mwp->vcpu_list, mw_link) {
+ list_del(&v->mw_link);
+ v->mwp = NULL;
+ if (v->mp_state == KVM_MP_STATE_MWAIT)
+ v->mp_state = KVM_MP_STATE_RUNNABLE;
+ }
+
+ // What if the mwait is woken up by an interrupt instead of a write ?
+ // It might remain "armed" on its old mwait page, but any subsequent
+ // MONITOR instruction would replace that, so I don't think we need
+ // to worry about it...
+
+ r = true;
+out:
+ mutex_unlock(&vcpu->kvm->arch.mwait_lock);
+ul_out:
+ return r;
+}
+
int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
void *insn, int insn_len)
{
int r, emulation_type = EMULTYPE_RETRY;
enum emulation_result er;

+ /* writing to MONITORed memory area ? */
+ if (handle_mwait_write_fault(vcpu, cr2, insn, insn_len))
+ return 1;
+
r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
if (r < 0)
goto out;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e81df8f..638704c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3262,6 +3262,18 @@ static int pause_interception(struct vcpu_svm *svm)
return 1;
}

+static int monitor_interception(struct vcpu_svm *svm)
+{
+ skip_emulated_instruction(&(svm->vcpu));
+ return kvm_emulate_monitor(&(svm->vcpu));
+}
+
+static int mwait_interception(struct vcpu_svm *svm)
+{
+ skip_emulated_instruction(&(svm->vcpu));
+ return kvm_emulate_mwait(&(svm->vcpu));
+}
+
static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
[SVM_EXIT_READ_CR0] = cr_interception,
[SVM_EXIT_READ_CR3] = cr_interception,
@@ -3319,8 +3331,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
[SVM_EXIT_CLGI] = clgi_interception,
[SVM_EXIT_SKINIT] = skinit_interception,
[SVM_EXIT_WBINVD] = emulate_on_interception,
- [SVM_EXIT_MONITOR] = invalid_op_interception,
- [SVM_EXIT_MWAIT] = invalid_op_interception,
+ [SVM_EXIT_MONITOR] = monitor_interception,
+ [SVM_EXIT_MWAIT] = mwait_interception,
[SVM_EXIT_XSETBV] = xsetbv_interception,
[SVM_EXIT_NPF] = pf_interception,
};
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a06f101..a7382e1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5603,6 +5603,18 @@ static int handle_invalid_op(struct kvm_vcpu *vcpu)
return 1;
}

+static int handle_monitor(struct kvm_vcpu *vcpu)
+{
+ skip_emulated_instruction(vcpu);
+ return kvm_emulate_monitor(vcpu);
+}
+
+static int handle_mwait(struct kvm_vcpu *vcpu)
+{
+ skip_emulated_instruction(vcpu);
+ return kvm_emulate_mwait(vcpu);
+}
+
/*
* To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
* We could reuse a single VMCS for all the L2 guests, but we also want the
@@ -6483,8 +6495,8 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_EPT_VIOLATION] = handle_ept_violation,
[EXIT_REASON_EPT_MISCONFIG] = handle_ept_misconfig,
[EXIT_REASON_PAUSE_INSTRUCTION] = handle_pause,
- [EXIT_REASON_MWAIT_INSTRUCTION] = handle_invalid_op,
- [EXIT_REASON_MONITOR_INSTRUCTION] = handle_invalid_op,
+ [EXIT_REASON_MWAIT_INSTRUCTION] = handle_mwait,
+ [EXIT_REASON_MONITOR_INSTRUCTION] = handle_monitor,
[EXIT_REASON_INVEPT] = handle_invept,
};

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 39c28f09..8edc1be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5592,6 +5592,70 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_emulate_halt);

+int kvm_emulate_monitor(struct kvm_vcpu *vcpu)
+{
+ gva_t gva;
+ gpa_t gpa;
+ struct kvm_mwait_pg *p;
+
+ /* emulate as NOP if no-kvm-irqchip */
+ if (!irqchip_in_kernel(vcpu->kvm))
+ return 1;
+
+ mutex_lock(&vcpu->kvm->arch.mwait_lock);
+
+ /* relinquish any previously monitored mwait page */
+ if (vcpu->arch.mwp != NULL) {
+ list_del(&vcpu->arch.mw_link);
+ vcpu->arch.mwp->accessed = 1;
+ vcpu->arch.mwp = NULL;
+ }
+
+ gva = kvm_register_read(vcpu, VCPU_REGS_RAX);
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+ if (gpa == UNMAPPED_GVA)
+ goto out; /* let some write op map the page first */
+
+ /* does the mwait page we're looking for already exist? */
+ list_for_each_entry(p, &vcpu->kvm->arch.mwait_pg_list, link)
+ if (p->gpa == gpa) {
+ vcpu->arch.mwp = p;
+ break;
+ }
+ if (vcpu->arch.mwp == NULL) { /* no, add new mwait page */
+ if (!kvm_mmu_protect_page(vcpu->kvm, gpa_to_gfn(gpa)))
+ goto out;
+ p = kmalloc(sizeof(struct kvm_mwait_pg), GFP_KERNEL);
+ p->gpa = gpa;
+ INIT_LIST_HEAD(&p->vcpu_list);
+ list_add(&p->link, &vcpu->kvm->arch.mwait_pg_list);
+
+ vcpu->arch.mwp = p;
+ }
+
+ /* link this VCPU into list of VCPUs monitoring this mwait page */
+ list_add(&vcpu->arch.mw_link, &vcpu->arch.mwp->vcpu_list);
+
+out:
+ mutex_unlock(&vcpu->kvm->arch.mwait_lock);
+ return 1;
+}
+EXPORT_SYMBOL_GPL(kvm_emulate_monitor);
+
+int kvm_emulate_mwait(struct kvm_vcpu *vcpu)
+{
+ /* emulate as NOP if no-kvm-irqchip */
+ if (!irqchip_in_kernel(vcpu->kvm))
+ return 1;
+
+ mutex_lock(&vcpu->kvm->arch.mwait_lock);
+ if (vcpu->arch.mwp != NULL)
+ vcpu->arch.mp_state = KVM_MP_STATE_MWAIT;
+ mutex_unlock(&vcpu->kvm->arch.mwait_lock);
+ return 1;
+}
+EXPORT_SYMBOL_GPL(kvm_emulate_mwait);
+
int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
{
u64 param, ingpa, outgpa, ret;
@@ -6077,6 +6141,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
if (kvm_check_request(KVM_REQ_UNHALT, vcpu)) {
kvm_apic_accept_events(vcpu);
switch(vcpu->arch.mp_state) {
+ case KVM_MP_STATE_MWAIT:
case KVM_MP_STATE_HALTED:
vcpu->arch.pv.pv_unhalted = false;
vcpu->arch.mp_state =
@@ -6961,6 +7026,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
kvm_async_pf_hash_reset(vcpu);
kvm_pmu_init(vcpu);

+ vcpu->arch.mwp = NULL;
+
return 0;
fail_free_wbinvd_dirty_mask:
free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
@@ -7013,6 +7080,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)

pvclock_update_vm_gtod_copy(kvm);

+ mutex_init(&kvm->arch.mwait_lock);
+ INIT_LIST_HEAD(&kvm->arch.mwait_pg_list);
+
return 0;
}

@@ -7254,8 +7324,10 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
|| kvm_apic_has_events(vcpu)
|| vcpu->arch.pv.pv_unhalted
|| atomic_read(&vcpu->arch.nmi_queued) ||
- (kvm_arch_interrupt_allowed(vcpu) &&
- kvm_cpu_has_interrupt(vcpu));
+ (kvm_cpu_has_interrupt(vcpu) &&
+ (kvm_arch_interrupt_allowed(vcpu) ||
+ (vcpu->arch.mp_state == KVM_MP_STATE_MWAIT &&
+ kvm_register_read(vcpu, VCPU_REGS_RCX) & 0x01)));
}

int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 932d7f2..a4925fc 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -398,6 +398,7 @@ struct kvm_vapic_addr {
#define KVM_MP_STATE_INIT_RECEIVED 2
#define KVM_MP_STATE_HALTED 3
#define KVM_MP_STATE_SIPI_RECEIVED 4
+#define KVM_MP_STATE_MWAIT 5

struct kvm_mp_state {
__u32 mp_state;

2017-03-16 16:26:29

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 04:54:06PM +0100, Radim Krčmář wrote:
> 2017-03-16 11:44-0400, Gabriel L. Somlo:
> > On Thu, Mar 16, 2017 at 03:08:07PM +0100, Radim Krčmář wrote:
> >> 2017-03-16 09:24-0400, Gabriel L. Somlo:
> >> > On Thu, Mar 16, 2017 at 01:41:28AM +0200, Michael S. Tsirkin wrote:
> >> > > On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
> >> > > > On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> >> > > > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> >> > > > > unless explicitly provided with kernel command line argument
> >> > > > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> >> > > > > without checking CPUID.
> >> > > > >
> >> > > > > We currently emulate that as a NOP but on VMX we can do better: let
> >> > > > > guest stop the CPU until timer, IPI or memory change. CPU will be busy
> >> > > > > but that isn't any worse than a NOP emulation.
> >> > > > >
> >> > > > > Note that mwait within guests is not the same as on real hardware
> >> > > > > because halt causes an exit while mwait doesn't. For this reason it
> >> > > > > might not be a good idea to use the regular MWAIT flag in CPUID to
> >> > > > > signal this capability. Add a flag in the hypervisor leaf instead.
> >> > > > >
> >> > > > > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> >> > > > > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> >> > > > > to improve guest behaviour.
> >> > > >
> >> > > > Same behavior (on the mac pro 1,1 running F22 with custom-compiled
> >> > > > kernel from kvm git master, plus this patch on top).
> >> > > >
> >> > > > The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
> >> > > > on boot, does not bring up guest graphical interface within the first
> >> > > > 10 minutes that I waited for it. That, in contrast with the default
> >> > > > nop-based emulation where the guest comes up within 30 seconds.
> >> > >
> >> > >
> >> > > Thanks a lot, meanwhile I'll try to write a unit-test and experiment
> >> > > with various behaviours.
> >> > >
> >> > > > I will run another round of tests on a newer Mac (4-year-old macbook
> >> > > > air) and report back tomorrow.
> >> > > >
> >> > > > Going off on a tangent, why would encouraging otherwise well-behaved
> >> > > > guests (like linux ones, for example) to use MWAIT be desirable to
> >> > > > begin with ? Is it a matter of minimizing the overhead associated with
> >> > > > exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
> >> > > > running guest-mode MWAIT in a tight loop will actually waste the host
> >> > > > CPU without the opportunity to yield to some other L0 thread. Sorry if
> >> > > > I fell into the middle of an ongoing conversation on this and missed
> >> > > > most of the relevant context, in which case please feel free to ignore
> >> > > > me... :)
> >> > > >
> >> > > > Thanks,
> >> > > > --G
> >> > >
> >> > > It's just some experiments I'm running, I'm not ready to describe it
> >> > > yet. I thought this part might be useful to at least some guests, so
> >> > > trying to upstream it right now.
> >> >
> >> > OK, so on a macbook air running F25 and the latest kvm git master plus
> >> > your v5 patch (4.11.0-rc2+), things appear to work.
> >> >
> >> > host-side cpuid output:
> >> > eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x021120
> >> >
> >> > guest-side cpuid output:
> >> > eax=00000000 ebx=00000000 ecx=0x000003 edx=00000000
> >> >
> >> > processor : 3
> >> > vendor_id : GenuineIntel
> >> > cpu family : 6
> >> > model : 42
> >> > model name : Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz
> >> > stepping : 7
> >> > microcode : 0x29
> >> > cpu MHz : 1157.849
> >> > cache size : 4096 KB
> >> > physical id : 0
> >> > siblings : 4
> >> > core id : 1
> >> > cpu cores : 2
> >> > apicid : 3
> >> > initial apicid : 3
> >> > fpu : yes
> >> > fpu_exception : yes
> >> > cpuid level : 13
> >> > wp : yes
> >> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
> >> > bugs :
> >> > bogomips : 3604.68
> >> > clflush size : 64
> >> > cache_alignment : 64
> >> > address sizes : 36 bits physical, 48 bits virtual
> >> > power management:
> >> >
> >> > After studying your patch a bit more carefully (sorry, it's crazy
> >> > around here right now :) ) I realized you're simply trying to
> >> > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> >> > just allow L1 to execute MONITOR & MWAIT natively.
> >> >
> >> > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> >> > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> >> > natively was one of the options Alex Graf and Rene Rebe used back in
> >> > the very early days of OS X on QEMU, at the time I got involved with
> >> > that project. Here's part of an out of tree patch against 3.4 which did
> >> > just that, and worked as far as I remember on *any* MWAIT capable
> >> > intel chip I had access to back in 2010:
> >> >
> >> > ##############################################################################
> >> > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> >> > ##############################################################################
> >> > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> >> > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> >> > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> >> > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> >> > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> >> > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> >> > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> >> > /* cpuid 1.ecx */
> >> > const u32 kvm_supported_word4_x86_features =
> >> > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> >> > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> >> > 0 /* DS-CPL, VMX, SMX, EST */ |
> >> > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> >> > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> >> > 0 /* Reserved, DCA */ | F(XMM4_1) |
> >> > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> >> > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> >> > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> >> > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> >> > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> >> > set_intercept(svm, INTERCEPT_VMSAVE);
> >> > set_intercept(svm, INTERCEPT_STGI);
> >> > set_intercept(svm, INTERCEPT_CLGI);
> >> > set_intercept(svm, INTERCEPT_SKINIT);
> >> > set_intercept(svm, INTERCEPT_WBINVD);
> >> > - set_intercept(svm, INTERCEPT_MONITOR);
> >> > - set_intercept(svm, INTERCEPT_MWAIT);
> >> > set_intercept(svm, INTERCEPT_XSETBV);
> >> >
> >> > control->iopm_base_pa = iopm_base;
> >> > control->msrpm_base_pa = __pa(svm->msrpm);
> >> > control->int_ctl = V_INTR_MASKING_MASK;
> >> > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> >> > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> >> > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> >> > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> >> > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> >> > nested_vmx_procbased_ctls_low = 0;
> >> > nested_vmx_procbased_ctls_high &=
> >> > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> >> > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> >> > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> >> > + CPU_BASED_CR3_LOAD_EXITING |
> >> > CPU_BASED_CR3_STORE_EXITING |
> >> > #ifdef CONFIG_X86_64
> >> > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> >> > #endif
> >> > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> >> > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> >> > CPU_BASED_CR3_LOAD_EXITING |
> >> > CPU_BASED_CR3_STORE_EXITING |
> >> > CPU_BASED_USE_IO_BITMAPS |
> >> > CPU_BASED_MOV_DR_EXITING |
> >> > CPU_BASED_USE_TSC_OFFSETING |
> >> > - CPU_BASED_MWAIT_EXITING |
> >> > - CPU_BASED_MONITOR_EXITING |
> >> > CPU_BASED_INVLPG_EXITING |
> >> > CPU_BASED_RDPMC_EXITING;
> >> >
> >> > opt = CPU_BASED_TPR_SHADOW |
> >> > CPU_BASED_USE_MSR_BITMAPS |
> >> >
> >> > If all you're trying to do is (selectively) revert to this behavior,
> >> > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> >> > confused at this point :)
> >> >
> >> > Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
> >> > didn't power down the physical CPU, just immediately moved on to the
> >> > next instruction. As such, there was no power saving and no
> >> > opportunity to yield to another L0 thread either, unlike with NOP
> >> > emulation at L0.
> >> >
> >> > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> >> > doing something smarter than just acting as a guest-mode NOP) ?
> >>
> >> Probably, MWAIT in new intel chips enters power saving mode normally.
> >>
> >> If hardware-executed MWAIT acted as a NOP in your old chip, then that
> >> shouldn't be a problem either ... Maybe OS X gets confused into doing
> >> something really dumb because we do not expose the MONITOR/MWAIT feature
> >> bit correctly.
> >>
> >> Can you try this QEMU patch on the old hardware?
> >>
> >> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> >> index 7aa762245a54..4b112e12188a 100644
> >> --- a/target/i386/cpu.c
> >> +++ b/target/i386/cpu.c
> >> @@ -2764,10 +2764,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> >> break;
> >> case 5:
> >> /* mwait info: needed for Core compatibility */
> >> - *eax = 0; /* Smallest monitor-line size in bytes */
> >> - *ebx = 0; /* Largest monitor-line size in bytes */
> >> - *ecx = CPUID_MWAIT_EMX | CPUID_MWAIT_IBE;
> >> - *edx = 0;
> >> + host_cpuid(index, 0, eax, ebx, ecx, edx);
> >> break;
> >> case 6:
> >> /* Thermal and Power Leaf */
> >> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
> >> index 55865dbee0aa..1eb78291b093 100644
> >> --- a/target/i386/kvm.c
> >> +++ b/target/i386/kvm.c
> >> @@ -360,6 +360,7 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t function,
> >> if (!kvm_irqchip_in_kernel()) {
> >> ret &= ~CPUID_EXT_X2APIC;
> >> }
> >> + ret |= CPUID_EXT_MONITOR;
> >> } else if (function == 6 && reg == R_EAX) {
> >> ret |= CPUID_6_EAX_ARAT; /* safe to allow because of emulated APIC */
> >> } else if (function == 7 && index == 0 && reg == R_EBX) {
> >>
> >>
> >> Thanks.
> >
> > No change, still hangs on boot.
>
> Hm, also with '-cpu host'?
> (I forgot that the CPUID_EXT_MONITOR isn't visible in the guest
> otherwise ...)

Yeah, managed to get it started with '-cpu host', but same behavior.
Maybe that version of Xeon really was braindamaged in some way, and
never would have worked with L1 MWAIT regardless.

I only ever used that machine after the emulate-as-nop patch made it
into KVM (commit 87c0057), so I honestly can't say whether it ever
worked with MWAIT run natively at L1...

Thanks,
--Gabriel

2017-03-16 16:45:08

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 12:16:13PM -0400, Gabriel L. Somlo wrote:
> On Thu, Mar 16, 2017 at 04:35:18PM +0100, Radim Krčmář wrote:
> > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > > On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > > > > After studying your patch a bit more carefully (sorry, it's crazy
> > > > > around here right now :) ) I realized you're simply trying to
> > > > > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > > > > just allow L1 to execute MONITOR & MWAIT natively.
> > > > >
> > > > > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > > > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > > > > natively was one of the options Alex Graf and Rene Rebe used back in
> > > > > the very early days of OS X on QEMU, at the time I got involved with
> > > > > that project. Here's part of an out of tree patch against 3.4 which did
> > > > > just that, and worked as far as I remember on *any* MWAIT capable
> > > > > intel chip I had access to back in 2010:
> > > > >
> > > > > ##############################################################################
> > > > > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> > > > > ##############################################################################
> > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > > > > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > > > > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > > > > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > > > > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > > > > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > > > > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > > > > /* cpuid 1.ecx */
> > > > > const u32 kvm_supported_word4_x86_features =
> > > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > > > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > > > > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > > > > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > > > > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > > > > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > > > > set_intercept(svm, INTERCEPT_VMSAVE);
> > > > > set_intercept(svm, INTERCEPT_STGI);
> > > > > set_intercept(svm, INTERCEPT_CLGI);
> > > > > set_intercept(svm, INTERCEPT_SKINIT);
> > > > > set_intercept(svm, INTERCEPT_WBINVD);
> > > > > - set_intercept(svm, INTERCEPT_MONITOR);
> > > > > - set_intercept(svm, INTERCEPT_MWAIT);
> > > > > set_intercept(svm, INTERCEPT_XSETBV);
> > > > >
> > > > > control->iopm_base_pa = iopm_base;
> > > > > control->msrpm_base_pa = __pa(svm->msrpm);
> > > > > control->int_ctl = V_INTR_MASKING_MASK;
> > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > > > > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > > > > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > > > > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > > > > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > > > > nested_vmx_procbased_ctls_low = 0;
> > > > > nested_vmx_procbased_ctls_high &=
> > > > > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > > > > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > > > > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > > > > + CPU_BASED_CR3_LOAD_EXITING |
> > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > #ifdef CONFIG_X86_64
> > > > > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > > > > #endif
> > > > > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > > > > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > > > > CPU_BASED_CR3_LOAD_EXITING |
> > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > CPU_BASED_USE_IO_BITMAPS |
> > > > > CPU_BASED_MOV_DR_EXITING |
> > > > > CPU_BASED_USE_TSC_OFFSETING |
> > > > > - CPU_BASED_MWAIT_EXITING |
> > > > > - CPU_BASED_MONITOR_EXITING |
> > > > > CPU_BASED_INVLPG_EXITING |
> > > > > CPU_BASED_RDPMC_EXITING;
> > > > >
> > > > > opt = CPU_BASED_TPR_SHADOW |
> > > > > CPU_BASED_USE_MSR_BITMAPS |
> > > > >
> > > > > If all you're trying to do is (selectively) revert to this behavior,
> > > > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > > > > confused at this point :)
> > > >
> > > > Yes. Me too. Want to try that other patch and see what happens?
> > >
> > > You mean the old 3.4 patch against current KVM ? I'll try to do that,
> > > might take me a while :)
> >
> > Michael's patch already did most of that, you just need to add
> >
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index efde6cc50875..b12f07d4ce17 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -348,7 +348,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> > const u32 kvm_cpuid_1_ecx_x86_features =
> > /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> > * but *not* advertised to guests via CPUID ! */
> > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > 0 /* DS-CPL, VMX, SMX, EST */ |
> > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> >
> > Note: this will never be upstream, because mwait isn't what we want by
> > default. :)
>
> But since OS X doesn't check CPUID and simply runs MONITOR & MWAIT
> assuming they're present, the above one-liner would make no
> difference. If everything else in the old patch I quoted is identical
> to what Michael does, then I don't know -- maybe the MacPro1,1 has
> really broken L>=1 MWAIT, and it only ever worked with vmexit and
> emulation on the host side.


I think I have an idea. It is probably one of the monitor bugs
on this host.

X86_BUG_CLFLUSH_MONITOR or X86_BUG_MONITOR.

If you tell the guest it has a CPU that does not need the workaround,
but the host CPU does need it, then mwait will not work.

	if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_CLFLUSH) &&
	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
		set_cpu_bug(c, X86_BUG_CLFLUSH_MONITOR);

	if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_MWAIT) &&
	    ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
		set_cpu_bug(c, X86_BUG_MONITOR);

what did you say your host model is?
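
In case it helps answer that, here is a minimal sketch (not kernel code;
the helper names are made up for illustration) of decoding family/model
from CPUID leaf 1 EAX and comparing against the models listed in the
X86_BUG_CLFLUSH_MONITOR check above. It mirrors only the family/model
part; the kernel check additionally requires CLFLUSH.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Display family from CPUID.01H:EAX: the extended family is added
 * only when the base family field is 0xf. */
unsigned cpuid_family(uint32_t eax)
{
	unsigned family = (eax >> 8) & 0xf;

	if (family == 0xf)
		family += (eax >> 20) & 0xff;
	return family;
}

/* Display model from CPUID.01H:EAX: the extended model supplies the
 * high nibble when the base family field is 6 or 0xf. */
unsigned cpuid_model(uint32_t eax)
{
	unsigned family = (eax >> 8) & 0xf;
	unsigned model = (eax >> 4) & 0xf;

	if (family == 0x6 || family == 0xf)
		model |= ((eax >> 16) & 0xf) << 4;
	assert(model <= 0xff);	/* 4 + 4 bits by construction */
	return model;
}

/* Hypothetical helper: family/model half of the CLFLUSH_MONITOR quirk
 * check quoted above. The CLFLUSH feature test is left out here. */
bool model_has_clflush_monitor_bug(uint32_t eax)
{
	unsigned model = cpuid_model(eax);

	return cpuid_family(eax) == 6 &&
	       (model == 29 || model == 46 || model == 47);
}
```

On the host, EAX can be read with __get_cpuid(1, ...) from the GCC/clang
<cpuid.h> header; /proc/cpuinfo also lists "cpu family" and "model"
directly.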



> > >> > Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
> > >> > didn't power down the physical CPU, just immediately moved on to the
> > >> > next instruction. As such, there was no power saving and no
> > >> > opportunity to yield to another L0 thread either, unlike with NOP
> > >> > emulation at L0.
> > >> >
> > >> > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> > >> > doing something smarter than just acting as a guest-mode NOP) ?
> > >> >
> > >> > Thanks,
> > >> > --Gabriel
> > >>
> > >> Interesting. What it seems to say is this:
> > >>
> > >> MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-
> > >> opcode exception—#UD—if CPL > 0) is determined by the setting of the “MWAIT
> > >> exiting” VM-execution control:
> > >> — If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit
> > >> (see Section 22.1.3).
> > >> — If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if
> > >> any of the following is true: (1) the “interrupt-window exiting” VM-execution
> > >> control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1.
> > >> — If the “MWAIT exiting” VM-execution control is 0, the “interrupt-window
> > >> exiting” VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT
> > >> does not cause the processor to enter an implementation-dependent
> > >> optimized state; instead, control passes to the instruction following the
> > >> MWAIT instruction.
> > >>
> > >>
> > >> And since interrupt-window exiting is 0 most of the time for KVM,
> > >> I would expect MWAIT to behave normally.
> > >
> > > The intel manual said the same thing back in 2010 as well. However,
> > > regardless of how any flags were set, interrupt-window exiting or not,
> > > "normal" L1 MWAIT behavior was that it woke up immediately regardless.
> > > Remember, never going to sleep is still correct ("normal" ?) behavior
> > > per the ISA definition of MWAIT :)
> >
> > I'll write a simple kvm-unit-test to better understand why it is broken
> > for you ...
> >
> > > Also, when I tested your patch on the macbook air (where it worked),
> > > not only was the host reporting 400% CPU for qemu (which is to be
> > > expected), but the thermal fan/cooling thing also shifted up into high
> > > gear, which means the physical CPU got hot, which it shouldn't have if
> > > the guest-mode MWAIT actually did put the host CPU into low power.
> >
> > I tested MWAIT with basically the same kernel patch and the qemu patch
> > with Linux guest on Haswell and Nehalem. Running the guest took 100% of
> > the host CPUs, but it still had the same temperature as when the host
> > was idle.
> >
> > That reminds me that you need to pass '-cpu host' for QEMU reasons.
>
> For OS X to boot, one needs '-cpu core2duo' for <= 10.11, and
> '-cpu Penryn' for 10.12. I never managed to get it working with any
> other settings.
>
> So I'm ready to write off the MacPro1,1 (unless you want me run more
> tests and report back for you, which I'm happy to do in any case).
>
> But please please, so at least I walk away from this having learned
> something :) help me understand the use case:
>
> - By careful setting of vmx flags, and/or on newer, sanely
> built Intel hardware, L1 MWAIT actually powers down the
> physical host core (while I couldn't get it to stay cool
> on my end, I totally believe you managed to pull it off)
>
> - We never admit to supporting MWAIT to guests, but when they
> do anyway (either because they're old/grumpy/careless OS X
> versions, or some newfangled custom-built Linux kernel which
> is hacked to ignore CPUID on purpose), we now allow the
> guest to:
> - keep its allotted time slice
> - but "waste" it by powering down the host CPU
> instead of
> - vmexit to the host OS at L0
> - yield the host core to another L0 runnable thread

NOP doesn't yield automatically, does it? CPU stays runnable,
it just makes it a bit cheaper to switch to another thread
as you don't need to exit.

> Since newer OS X actually checks CPUID, I don't have a major stake in
> one way vs. the other, but I'm really really curious:
>
> Are we trying to save power assuming the host is unlikely to have
> enough runnable L0 threads for when the L0-emulated NOP yields? So
> we're better off letting the guest keep the CPU but also keep it cool
> while at it (assuming the guest isn't totally hostile and didn't pick
> a setting where L1 MWAIT actually works as L1 NOP, in which case we
> don't even get to stay cool)?
>
> Man, I wish I had the cycles to resurrect my attempt at actually emulating
> MWAIT with something like a condition queue (below, just for reference).
>
> Thanks much,
> --Gabriel
>
>
> ##############################################################################
> # kvm-mwait-emu.patch (Gabriel Somlo <[email protected]> 2014/02/05)
> # -- based on an idea suggested by Alex Graf --
> # GLS: emulate MONITOR and MWAIT at page-level granularity by write-protecting
> # the page containing a monitored location and appropriately handling
> # subsequent write faults.
> # After debugging the SMP issue, we'll need a way to trigger a
> # periodic cleanup that will switch write-protected monitored pages
> # back to read-write, once they've stayed unused for "long enough"
> ##############################################################################
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fdf83af..7ca9b51 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -337,6 +337,16 @@ struct kvm_pmu {
> u64 reprogram_pmi;
> };
>
> +/*
> + * mwait-monitored page list element type
> + */
> +struct kvm_mwait_pg {
> + gpa_t gpa;
> + struct list_head vcpu_list; /* VCPUs monitoring (armed on) this page */
> + struct list_head link; /* links mwait-pages within a KVM */
> + unsigned accessed;
> +};
> +
> struct kvm_vcpu_arch {
> /*
> * rip and regs accesses must go through
> @@ -528,6 +538,10 @@ struct kvm_vcpu_arch {
> struct {
> bool pv_unhalted;
> } pv;
> +
> + /* MONITOR/MWAIT support */
> + struct kvm_mwait_pg *mwp; /* page monitored by this VCPU */
> + struct list_head mw_link; /* all VCPUs monitoring the same page */
> };
>
> struct kvm_lpage_info {
> @@ -607,6 +621,10 @@ struct kvm_arch {
> u64 hv_hypercall;
> u64 hv_tsc_page;
>
> + /* MONITOR/MWAIT support */
> + struct mutex mwait_lock;
> + struct list_head mwait_pg_list; /* monitored pages within this KVM */
> +
> #ifdef CONFIG_KVM_MMU_AUDIT
> int audit_point;
> #endif
> @@ -854,6 +872,8 @@ int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port);
> void kvm_emulate_cpuid(struct kvm_vcpu *vcpu);
> int kvm_emulate_halt(struct kvm_vcpu *vcpu);
> int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu);
> +int kvm_emulate_monitor(struct kvm_vcpu *vcpu);
> +int kvm_emulate_mwait(struct kvm_vcpu *vcpu);
>
> void kvm_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
> int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, int seg);
> @@ -915,6 +935,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> const u8 *new, int bytes);
> int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
> int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
> +int kvm_mmu_protect_page(struct kvm *kvm, gfn_t gfn);
> void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
> int kvm_mmu_load(struct kvm_vcpu *vcpu);
> void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c697625..7d4f1ca 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -279,6 +279,14 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> /* cpuid 1.ecx */
> const u32 kvm_supported_word4_x86_features =
> + /* OS X does not check CPUID before using MONITOR/MWAIT from its
> + * power-optimized idle loop (AppleIntelPowerManagement.kext).
> + * For now, we don't advertise MWAIT support below, but attempt
> + * to emulate them instead of issuing an invalid opcode fault
> + * if a misbehaving guest calls them anyway. Removing the above
> + * mentioned kext from OS X will cause it to fall back to a
> + * HLT-based idle loop, as an optional guest optimization step.
> + */
> F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> 0 /* DS-CPL, VMX, SMX, EST */ |
> 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e50425d..bc02ebd 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2283,6 +2283,20 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page);
>
> +int kvm_mmu_protect_page(struct kvm *kvm, gfn_t gfn)
> +{
> + int r;
> +
> + spin_lock(&kvm->mmu_lock);
> + r = rmap_write_protect(kvm, gfn);
> + if (r)
> + kvm_flush_remote_tlbs(kvm);
> + spin_unlock(&kvm->mmu_lock);
> +
> + return r;
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_protect_page);
> +
> /*
> * The function is based on mtrr_type_lookup() in
> * arch/x86/kernel/cpu/mtrr/generic.c
> @@ -4146,12 +4160,68 @@ static bool is_mmio_page_fault(struct kvm_vcpu *vcpu, gva_t addr)
> return vcpu_match_mmio_gva(vcpu, addr);
> }
>
> +// try to handle fault caused by write to monitored (mwait) page
> +// FIXME: aim for better integration between this and FNAME(page_fault)() and
> +// kvm_mmu_page_fault() below. For now, this is proof-of-concept code.
> +static bool handle_mwait_write_fault(struct kvm_vcpu *vcpu, gva_t gva,
> + void *in, int in_len)
> +{
> + gpa_t gpa;
> + struct kvm_mwait_pg *p, *mwp = NULL;
> + struct kvm_vcpu_arch *v, *u;
> + bool r = false;
> +
> + gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
> + if (gpa == UNMAPPED_GVA)
> + goto ul_out;
> +
> + mutex_lock(&vcpu->kvm->arch.mwait_lock);
> +
> + /* is gpa matching a monitored (mwait) page? */
> + list_for_each_entry(p, &vcpu->kvm->arch.mwait_pg_list, link)
> + if (p->gpa == gpa) {
> + mwp = p;
> + break;
> + }
> + if (mwp == NULL)
> + goto out;
> +
> + mwp->accessed = 1;
> +
> + if (x86_emulate_instruction(vcpu, gva,
> + EMULTYPE_RETRY, in, in_len) != EMULATE_DONE)
> + goto out;
> +
> + /* disarm all VCPUs monitoring this page, waking them if needed */
> + list_for_each_entry_safe(v, u, &mwp->vcpu_list, mw_link) {
> + list_del(&v->mw_link);
> + v->mwp = NULL;
> + if (v->mp_state == KVM_MP_STATE_MWAIT)
> + v->mp_state = KVM_MP_STATE_RUNNABLE;
> + }
> +
> + // What if the mwait is woken up by an interrupt instead of a write ?
> + // It might remain "armed" on its old mwait page, but any subsequent
> + // MONITOR instruction would replace that, so I don't think we need
> + // to worry about it...
> +
> + r = true;
> +out:
> + mutex_unlock(&vcpu->kvm->arch.mwait_lock);
> +ul_out:
> + return r;
> +}
> +
> int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
> void *insn, int insn_len)
> {
> int r, emulation_type = EMULTYPE_RETRY;
> enum emulation_result er;
>
> + /* writing to MONITORed memory area ? */
> + if (handle_mwait_write_fault(vcpu, cr2, insn, insn_len))
> + return 1;
> +
> r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
> if (r < 0)
> goto out;
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index e81df8f..638704c 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3262,6 +3262,18 @@ static int pause_interception(struct vcpu_svm *svm)
> return 1;
> }
>
> +static int monitor_interception(struct vcpu_svm *svm)
> +{
> + skip_emulated_instruction(&(svm->vcpu));
> + return kvm_emulate_monitor(&(svm->vcpu));
> +}
> +
> +static int mwait_interception(struct vcpu_svm *svm)
> +{
> + skip_emulated_instruction(&(svm->vcpu));
> + return kvm_emulate_mwait(&(svm->vcpu));
> +}
> +
> static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
> [SVM_EXIT_READ_CR0] = cr_interception,
> [SVM_EXIT_READ_CR3] = cr_interception,
> @@ -3319,8 +3331,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
> [SVM_EXIT_CLGI] = clgi_interception,
> [SVM_EXIT_SKINIT] = skinit_interception,
> [SVM_EXIT_WBINVD] = emulate_on_interception,
> - [SVM_EXIT_MONITOR] = invalid_op_interception,
> - [SVM_EXIT_MWAIT] = invalid_op_interception,
> + [SVM_EXIT_MONITOR] = monitor_interception,
> + [SVM_EXIT_MWAIT] = mwait_interception,
> [SVM_EXIT_XSETBV] = xsetbv_interception,
> [SVM_EXIT_NPF] = pf_interception,
> };
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a06f101..a7382e1 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5603,6 +5603,18 @@ static int handle_invalid_op(struct kvm_vcpu *vcpu)
> return 1;
> }
>
> +static int handle_monitor(struct kvm_vcpu *vcpu)
> +{
> + skip_emulated_instruction(vcpu);
> + return kvm_emulate_monitor(vcpu);
> +}
> +
> +static int handle_mwait(struct kvm_vcpu *vcpu)
> +{
> + skip_emulated_instruction(vcpu);
> + return kvm_emulate_mwait(vcpu);
> +}
> +
> /*
> * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
> * We could reuse a single VMCS for all the L2 guests, but we also want the
> @@ -6483,8 +6495,8 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> [EXIT_REASON_EPT_VIOLATION] = handle_ept_violation,
> [EXIT_REASON_EPT_MISCONFIG] = handle_ept_misconfig,
> [EXIT_REASON_PAUSE_INSTRUCTION] = handle_pause,
> - [EXIT_REASON_MWAIT_INSTRUCTION] = handle_invalid_op,
> - [EXIT_REASON_MONITOR_INSTRUCTION] = handle_invalid_op,
> + [EXIT_REASON_MWAIT_INSTRUCTION] = handle_mwait,
> + [EXIT_REASON_MONITOR_INSTRUCTION] = handle_monitor,
> [EXIT_REASON_INVEPT] = handle_invept,
> };
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 39c28f09..8edc1be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5592,6 +5592,70 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu)
> }
> EXPORT_SYMBOL_GPL(kvm_emulate_halt);
>
> +int kvm_emulate_monitor(struct kvm_vcpu *vcpu)
> +{
> + gva_t gva;
> + gpa_t gpa;
> + struct kvm_mwait_pg *p;
> +
> + /* emulate as NOP if no-kvm-irqchip */
> + if (!irqchip_in_kernel(vcpu->kvm))
> + return 1;
> +
> + mutex_lock(&vcpu->kvm->arch.mwait_lock);
> +
> + /* relinquish any previously monitored mwait page */
> + if (vcpu->arch.mwp != NULL) {
> + list_del(&vcpu->arch.mw_link);
> + vcpu->arch.mwp->accessed = 1;
> + vcpu->arch.mwp = NULL;
> + }
> +
> + gva = kvm_register_read(vcpu, VCPU_REGS_RAX);
> + gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
> + if (gpa == UNMAPPED_GVA)
> + goto out; /* let some write op map the page first */
> +
> + /* does the mwait page we're looking for already exist? */
> + list_for_each_entry(p, &vcpu->kvm->arch.mwait_pg_list, link)
> + if (p->gpa == gpa) {
> + vcpu->arch.mwp = p;
> + break;
> + }
> + if (vcpu->arch.mwp == NULL) { /* no, add new mwait page */
> + if (!kvm_mmu_protect_page(vcpu->kvm, gpa_to_gfn(gpa)))
> + goto out;
> + p = kmalloc(sizeof(struct kvm_mwait_pg), GFP_KERNEL);
> + p->gpa = gpa;
> + INIT_LIST_HEAD(&p->vcpu_list);
> + list_add(&p->link, &vcpu->kvm->arch.mwait_pg_list);
> +
> + vcpu->arch.mwp = p;
> + }
> +
> + /* link this VCPU into list of VCPUs monitoring this mwait page */
> + list_add(&vcpu->arch.mw_link, &vcpu->arch.mwp->vcpu_list);
> +
> +out:
> + mutex_unlock(&vcpu->kvm->arch.mwait_lock);
> + return 1;
> +}
> +EXPORT_SYMBOL_GPL(kvm_emulate_monitor);
> +
> +int kvm_emulate_mwait(struct kvm_vcpu *vcpu)
> +{
> + /* emulate as NOP if no-kvm-irqchip */
> + if (!irqchip_in_kernel(vcpu->kvm))
> + return 1;
> +
> + mutex_lock(&vcpu->kvm->arch.mwait_lock);
> + if (vcpu->arch.mwp != NULL)
> + vcpu->arch.mp_state = KVM_MP_STATE_MWAIT;
> + mutex_unlock(&vcpu->kvm->arch.mwait_lock);
> + return 1;
> +}
> +EXPORT_SYMBOL_GPL(kvm_emulate_mwait);
> +
> int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
> {
> u64 param, ingpa, outgpa, ret;
> @@ -6077,6 +6141,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
> if (kvm_check_request(KVM_REQ_UNHALT, vcpu)) {
> kvm_apic_accept_events(vcpu);
> switch(vcpu->arch.mp_state) {
> + case KVM_MP_STATE_MWAIT:
> case KVM_MP_STATE_HALTED:
> vcpu->arch.pv.pv_unhalted = false;
> vcpu->arch.mp_state =
> @@ -6961,6 +7026,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
> kvm_async_pf_hash_reset(vcpu);
> kvm_pmu_init(vcpu);
>
> + vcpu->arch.mwp = NULL;
> +
> return 0;
> fail_free_wbinvd_dirty_mask:
> free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
> @@ -7013,6 +7080,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>
> pvclock_update_vm_gtod_copy(kvm);
>
> + mutex_init(&kvm->arch.mwait_lock);
> + INIT_LIST_HEAD(&kvm->arch.mwait_pg_list);
> +
> return 0;
> }
>
> @@ -7254,8 +7324,10 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> || kvm_apic_has_events(vcpu)
> || vcpu->arch.pv.pv_unhalted
> || atomic_read(&vcpu->arch.nmi_queued) ||
> - (kvm_arch_interrupt_allowed(vcpu) &&
> - kvm_cpu_has_interrupt(vcpu));
> + (kvm_cpu_has_interrupt(vcpu) &&
> + (kvm_arch_interrupt_allowed(vcpu) ||
> + (vcpu->arch.mp_state == KVM_MP_STATE_MWAIT &&
> + kvm_register_read(vcpu, VCPU_REGS_RCX) & 0x01)));
> }
>
> int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 932d7f2..a4925fc 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -398,6 +398,7 @@ struct kvm_vapic_addr {
> #define KVM_MP_STATE_INIT_RECEIVED 2
> #define KVM_MP_STATE_HALTED 3
> #define KVM_MP_STATE_SIPI_RECEIVED 4
> +#define KVM_MP_STATE_MWAIT 5
>
> struct kvm_mp_state {
> __u32 mp_state;

2017-03-16 16:47:57

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 05:01:58PM +0100, Radim Krčmář wrote:
> 2017-03-16 16:35+0100, Radim Krčmář:
> > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> >> The intel manual said the same thing back in 2010 as well. However,
> >> regardless of how any flags were set, interrupt-window exiting or not,
> >> "normal" L1 MWAIT behavior was that it woke up immediately regardless.
> >> Remember, never going to sleep is still correct ("normal" ?) behavior
> >> per the ISA definition of MWAIT :)
> >
> > I'll write a simple kvm-unit-test to better understand why it is broken
> > for you ...
>
> Please get git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
>
> and try this, thanks!
>
> ---8<---
> x86/mwait: crappy test
>
> `./configure && make` to build it, then follow the comment in code to
> try few cases.

kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1'
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 1
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m10.564s
user 0m10.339s
sys 0m0.225s


and

kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0'
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 0
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m0.746s
user 0m0.555s
sys 0m0.200s

Both of these with Michael's v5 patch applied, on the MacPro1,1.

Similar behavior (0 1 1 takes 10 seconds, 0 1 0 returns immediately)
on the macbook air.
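
For reference, the arithmetic behind the "about 10 seconds" expectation
in the test comment (10000 resumes, assumed 1000 Hz host tick), as a
small sketch. The function names and the 0.5 cutoff are assumptions for
illustration, not part of the test.

```c
#include <assert.h>
#include <stdbool.h>

/* Expected wall-clock time if every MWAIT sleeps until the next host
 * timer tick wakes it up: one resume per tick. */
double expected_mwait_seconds(unsigned resumes, unsigned hz)
{
	assert(hz > 0);
	return (double)resumes / hz;
}

/* Hypothetical classifier: treat a run as "mwait really waited" when it
 * took at least half the expected time. The cutoff is arbitrary. */
bool mwait_looks_like_sleep(double measured_s, unsigned resumes, unsigned hz)
{
	return measured_s >= 0.5 * expected_mwait_seconds(resumes, hz);
}
```

With the numbers above, 10000 / 1000 gives 10 s expected, so the
10.564 s run counts as waiting and the 0.746 s run as NOP-like. A host
with a different HZ would shift the expected value accordingly.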

If I revert to the original (nop-emulated MWAIT) kvm source, I get
both versions to return immediately.
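
To line these results up with the SDM text quoted earlier in the
thread, here is a minimal sketch restating those three rules as a
decision function. The enum and function names are hypothetical; only
the conditions come from the quoted manual text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum mwait_outcome {
	MWAIT_VM_EXIT,		/* traps to the hypervisor */
	MWAIT_NORMAL,		/* operates normally, may enter the wait state */
	MWAIT_FALLS_THROUGH,	/* control passes to the next instruction */
};

/* Restates the quoted rules: the "MWAIT exiting" control wins; otherwise
 * MWAIT is normal unless interrupt-window exiting is 1, ECX[0] is 1 and
 * RFLAGS.IF is 0 all at once. */
enum mwait_outcome mwait_in_guest(bool mwait_exiting, bool intr_window_exiting,
				  uint32_t ecx, bool rflags_if)
{
	if (mwait_exiting)
		return MWAIT_VM_EXIT;
	if (!intr_window_exiting || !(ecx & 1) || rflags_if)
		return MWAIT_NORMAL;
	return MWAIT_FALLS_THROUGH;
}

/* Printable name for an outcome, for logging test results. */
const char *mwait_outcome_name(enum mwait_outcome o)
{
	switch (o) {
	case MWAIT_VM_EXIT:
		return "vm-exit";
	case MWAIT_NORMAL:
		return "normal";
	case MWAIT_FALLS_THROUGH:
		return "falls-through";
	}
	assert(!"unknown mwait outcome");
	return "?";
}
```

With the original NOP emulation the "MWAIT exiting" control is set, so
every MWAIT takes the first branch and the host skips it, which fits
both runs returning immediately. For the '0 1 0' case with the control
cleared (ECX[0] = 1, IF = 0), the outcome depends on interrupt-window
exiting, which the test does not observe.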

HTH,
--Gabriel



>
> ---
> x86/Makefile.common | 1 +
> x86/mwait.c | 41 +++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 42 insertions(+)
> create mode 100644 x86/mwait.c
>
> diff --git a/x86/Makefile.common b/x86/Makefile.common
> index 1dad18ba26e1..1e708a6acd39 100644
> --- a/x86/Makefile.common
> +++ b/x86/Makefile.common
> @@ -46,6 +46,7 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
> $(TEST_DIR)/tsc_adjust.flat $(TEST_DIR)/asyncpf.flat \
> $(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat \
> $(TEST_DIR)/hyperv_synic.flat $(TEST_DIR)/hyperv_stimer.flat \
> + $(TEST_DIR)/mwait.flat \
>
> ifdef API
> tests-common += api/api-sample
> diff --git a/x86/mwait.c b/x86/mwait.c
> new file mode 100644
> index 000000000000..c21dab5cc97d
> --- /dev/null
> +++ b/x86/mwait.c
> @@ -0,0 +1,41 @@
> +#include "vm.h"
> +
> +#define TARGET_RESUMES 10000
> +volatile unsigned page[4096 / 4];
> +
> +/*
> + * Execute
> + * time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1'
> + * (first two arguments are eax and ecx for MWAIT, the third is FLAGS.IF bit)
> + * I assume you have 1000 Hz scheduler, so the test should take about 10
> + * seconds to run if mwait works (host timer interrupts will kick mwait).
> + *
> + * If you get far less, then mwait is just nop, as in the case of
> + *
> + * time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0'
> + *
> + * All other combinations of arguments should take 10 seconds.
> + * Getting killed by the TIMEOUT most likely means that you have different HZ,
> + * but could also be a bug ...
> + */
> +int main(int argc, char **argv)
> +{
> + uint32_t eax = atol(argv[1]);
> + uint32_t ecx = atol(argv[2]);
> + bool sti = atol(argv[3]);
> + unsigned resumes = 0;
> +
> + if (sti)
> + asm volatile ("sti");
> + else
> + asm volatile ("cli");
> +
> + while (resumes < TARGET_RESUMES) {
> + asm volatile("monitor" :: "a" (page), "c" (0), "d" (0));
> + asm volatile("mwait" :: "a" (eax), "c" (ecx));
> + resumes++;
> + }
> +
> + report("resumed from mwait %u times", resumes == TARGET_RESUMES, resumes);
> + return report_summary();
> +}
> --
> 2.11.0
>

2017-03-16 16:53:31

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 06:45:02PM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 16, 2017 at 12:16:13PM -0400, Gabriel L. Somlo wrote:
> > On Thu, Mar 16, 2017 at 04:35:18PM +0100, Radim Krčmář wrote:
> > > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > > > On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> > > > > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > > > > > After studying your patch a bit more carefully (sorry, it's crazy
> > > > > > around here right now :) ) I realized you're simply trying to
> > > > > > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > > > > > just allow L1 to execute MONITOR & MWAIT natively.
> > > > > >
> > > > > > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > > > > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > > > > > natively was one of the options Alex Graf and Rene Rebe used back in
> > > > > > the very early days of OS X on QEMU, at the time I got involved with
> > > > > > that project. Here's part of an out of tree patch against 3.4 which did
> > > > > > just that, and worked as far as I remember on *any* MWAIT capable
> > > > > > intel chip I had access to back in 2010:
> > > > > >
> > > > > > ##############################################################################
> > > > > > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> > > > > > ##############################################################################
> > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > > > > > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > > > > > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > > > > > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > > > > > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > > > > > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > > > > > /* cpuid 1.ecx */
> > > > > > const u32 kvm_supported_word4_x86_features =
> > > > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > > > > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > > > > > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > > > > > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > > > > > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > > > > > set_intercept(svm, INTERCEPT_VMSAVE);
> > > > > > set_intercept(svm, INTERCEPT_STGI);
> > > > > > set_intercept(svm, INTERCEPT_CLGI);
> > > > > > set_intercept(svm, INTERCEPT_SKINIT);
> > > > > > set_intercept(svm, INTERCEPT_WBINVD);
> > > > > > - set_intercept(svm, INTERCEPT_MONITOR);
> > > > > > - set_intercept(svm, INTERCEPT_MWAIT);
> > > > > > set_intercept(svm, INTERCEPT_XSETBV);
> > > > > >
> > > > > > control->iopm_base_pa = iopm_base;
> > > > > > control->msrpm_base_pa = __pa(svm->msrpm);
> > > > > > control->int_ctl = V_INTR_MASKING_MASK;
> > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > > > > > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > > > > > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > > > > > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > > > > > nested_vmx_procbased_ctls_low = 0;
> > > > > > nested_vmx_procbased_ctls_high &=
> > > > > > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > > > > > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > > > > > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > > > > > + CPU_BASED_CR3_LOAD_EXITING |
> > > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > > #ifdef CONFIG_X86_64
> > > > > > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > > > > > #endif
> > > > > > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > > > > > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > > > > > CPU_BASED_CR3_LOAD_EXITING |
> > > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > > CPU_BASED_USE_IO_BITMAPS |
> > > > > > CPU_BASED_MOV_DR_EXITING |
> > > > > > CPU_BASED_USE_TSC_OFFSETING |
> > > > > > - CPU_BASED_MWAIT_EXITING |
> > > > > > - CPU_BASED_MONITOR_EXITING |
> > > > > > CPU_BASED_INVLPG_EXITING |
> > > > > > CPU_BASED_RDPMC_EXITING;
> > > > > >
> > > > > > opt = CPU_BASED_TPR_SHADOW |
> > > > > > CPU_BASED_USE_MSR_BITMAPS |
> > > > > >
> > > > > > If all you're trying to do is (selectively) revert to this behavior,
> > > > > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > > > > > confused at this point :)
> > > > >
> > > > > Yes. Me too. Want to try that other patch and see what happens?
> > > >
> > > > You mean the old 3.4 patch against current KVM ? I'll try to do that,
> > > > might take me a while :)
> > >
> > > Michael's patch already did most of that, you just need to add
> > >
> > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > > index efde6cc50875..b12f07d4ce17 100644
> > > --- a/arch/x86/kvm/cpuid.c
> > > +++ b/arch/x86/kvm/cpuid.c
> > > @@ -348,7 +348,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> > > const u32 kvm_cpuid_1_ecx_x86_features =
> > > /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> > > * but *not* advertised to guests via CPUID ! */
> > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > >
> > > Note: this will never be upstream, because mwait isn't what we want by
> > > default. :)
> >
> > But since OS X doesn't check CPUID and simply runs MONITOR & MWAIT
> > assuming they're present, the above one-liner would make no
> > difference. If everything else in the old patch I quoted is identical
> > to what Michael does, then I don't know -- maybe the MacPro1,1 has
> > really broken L>=1 MWAIT, and it only ever worked with vmexit and
> > emulation on the host side.
>
>
> I think I have an idea. It is probably one of the monitor bugs
> on this host.
>
> X86_BUG_CLFLUSH_MONITOR or X86_BUG_MONITOR.
>
> If you tell guest you have a CPU that does not need it
> but host does need it, then mwait will not work.
>
> if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_CLFLUSH) &&
> (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
> set_cpu_bug(c, X86_BUG_CLFLUSH_MONITOR);
>
>
> if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_MWAIT) &&
> ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
> set_cpu_bug(c, X86_BUG_MONITOR);
>
> what did you say your host model is?

# dmidecode -t1
# dmidecode 2.12
SMBIOS 2.4 present.

Handle 0x0021, DMI type 1, 27 bytes
System Information
Manufacturer: Apple Computer, Inc.
Product Name: MacPro1,1
Version: 1.0
Serial Number: G87030UEUPZ
UUID: 9CFE245E-D0C8-BD45-A79F-54EA5FBD3D97
Wake-up Type: Power Switch
SKU Number: System SKU#
Family: MacPro


Thx,
--G

2017-03-16 16:55:34

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 12:52:32PM -0400, Gabriel L. Somlo wrote:
> On Thu, Mar 16, 2017 at 06:45:02PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 16, 2017 at 12:16:13PM -0400, Gabriel L. Somlo wrote:
> > > On Thu, Mar 16, 2017 at 04:35:18PM +0100, Radim Krčmář wrote:
> > > > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > > > > On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> > > > > > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > > > > > > After studying your patch a bit more carefully (sorry, it's crazy
> > > > > > > around here right now :) ) I realized you're simply trying to
> > > > > > > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > > > > > > just allow L1 to execute MONITOR & MWAIT natively.
> > > > > > >
> > > > > > > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > > > > > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > > > > > > natively was one of the options Alex Graf and Rene Rebe used back in
> > > > > > > the very early days of OS X on QEMU, at the time I got involved with
> > > > > > > that project. Here's part of an out of tree patch against 3.4 which did
> > > > > > > just that, and worked as far as I remember on *any* MWAIT capable
> > > > > > > intel chip I had access to back in 2010:
> > > > > > >
> > > > > > > ##############################################################################
> > > > > > > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> > > > > > > ##############################################################################
> > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > > > > > > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > > > > > > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > > > > > > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > > > > > > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > > > > > > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > > > > > > /* cpuid 1.ecx */
> > > > > > > const u32 kvm_supported_word4_x86_features =
> > > > > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > > > > > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > > > > > > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > > > > > > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > > > > > > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > > > > > > set_intercept(svm, INTERCEPT_VMSAVE);
> > > > > > > set_intercept(svm, INTERCEPT_STGI);
> > > > > > > set_intercept(svm, INTERCEPT_CLGI);
> > > > > > > set_intercept(svm, INTERCEPT_SKINIT);
> > > > > > > set_intercept(svm, INTERCEPT_WBINVD);
> > > > > > > - set_intercept(svm, INTERCEPT_MONITOR);
> > > > > > > - set_intercept(svm, INTERCEPT_MWAIT);
> > > > > > > set_intercept(svm, INTERCEPT_XSETBV);
> > > > > > >
> > > > > > > control->iopm_base_pa = iopm_base;
> > > > > > > control->msrpm_base_pa = __pa(svm->msrpm);
> > > > > > > control->int_ctl = V_INTR_MASKING_MASK;
> > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > > > > > > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > > > > > > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > > > > > > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > > > > > > nested_vmx_procbased_ctls_low = 0;
> > > > > > > nested_vmx_procbased_ctls_high &=
> > > > > > > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > > > > > > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > > > > > > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > + CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > > > #ifdef CONFIG_X86_64
> > > > > > > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > > > > > > #endif
> > > > > > > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > > > > > > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > > > > > > CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > > > CPU_BASED_USE_IO_BITMAPS |
> > > > > > > CPU_BASED_MOV_DR_EXITING |
> > > > > > > CPU_BASED_USE_TSC_OFFSETING |
> > > > > > > - CPU_BASED_MWAIT_EXITING |
> > > > > > > - CPU_BASED_MONITOR_EXITING |
> > > > > > > CPU_BASED_INVLPG_EXITING |
> > > > > > > CPU_BASED_RDPMC_EXITING;
> > > > > > >
> > > > > > > opt = CPU_BASED_TPR_SHADOW |
> > > > > > > CPU_BASED_USE_MSR_BITMAPS |
> > > > > > >
> > > > > > > If all you're trying to do is (selectively) revert to this behavior,
> > > > > > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > > > > > > confused at this point :)
> > > > > >
> > > > > > Yes. Me too. Want to try that other patch and see what happens?
> > > > >
> > > > > You mean the old 3.4 patch against current KVM ? I'll try to do that,
> > > > > might take me a while :)
> > > >
> > > > Michael's patch already did most of that, you just need to add
> > > >
> > > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > > > index efde6cc50875..b12f07d4ce17 100644
> > > > --- a/arch/x86/kvm/cpuid.c
> > > > +++ b/arch/x86/kvm/cpuid.c
> > > > @@ -348,7 +348,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> > > > const u32 kvm_cpuid_1_ecx_x86_features =
> > > > /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> > > > * but *not* advertised to guests via CPUID ! */
> > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > >
> > > > Note: this will never be upstream, because mwait isn't what we want by
> > > > default. :)
> > >
> > > But since OS X doesn't check CPUID and simply runs MONITOR & MWAIT
> > > assuming they're present, the above one-liner would make no
> > > difference. If everything else in the old patch I quoted is identical
> > > to what Michael does, then I don't know -- maybe the MacPro1,1 has
> > > really broken L>=1 MWAIT, and it only ever worked with vmexit and
> > > emulation on the host side.
> >
> >
> > I think I have an idea. It is probably one of the monitor bugs
> > on this host.
> >
> > X86_BUG_CLFLUSH_MONITOR or X86_BUG_MONITOR.
> >
> > If you tell guest you have a CPU that does not need it
> > but host does need it, then mwait will not work.
> >
> > if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_CLFLUSH) &&
> > (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
> > set_cpu_bug(c, X86_BUG_CLFLUSH_MONITOR);
> >
> >
> > if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_MWAIT) &&
> > ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
> > set_cpu_bug(c, X86_BUG_MONITOR);
> >
> > what did you say your host model is?
>
> # dmidecode -t1
> # dmidecode 2.12
> SMBIOS 2.4 present.
>
> Handle 0x0021, DMI type 1, 27 bytes
> System Information
> Manufacturer: Apple Computer, Inc.
> Product Name: MacPro1,1
> Version: 1.0
> Serial Number: G87030UEUPZ
> UUID: 9CFE245E-D0C8-BD45-A79F-54EA5FBD3D97
> Wake-up Type: Power Switch
> SKU Number: System SKU#
> Family: MacPro

And, probably more usefully:

[... omitting 0,1,2 ...]

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
stepping : 6
microcode : 0xd2
cpu MHz : 2659.998
cache size : 4096 KB
physical id : 3
siblings : 2
core id : 0
cpu cores : 2
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow dtherm
bugs :
bogomips : 5320.05
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

2017-03-16 17:14:28

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 12:54:50PM -0400, Gabriel L. Somlo wrote:
> On Thu, Mar 16, 2017 at 12:52:32PM -0400, Gabriel L. Somlo wrote:
> > On Thu, Mar 16, 2017 at 06:45:02PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Mar 16, 2017 at 12:16:13PM -0400, Gabriel L. Somlo wrote:
> > > > On Thu, Mar 16, 2017 at 04:35:18PM +0100, Radim Krčmář wrote:
> > > > > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > > > > > On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > > > > > > > After studying your patch a bit more carefully (sorry, it's crazy
> > > > > > > > around here right now :) ) I realized you're simply trying to
> > > > > > > > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > > > > > > > just allow L1 to execute MONITOR & MWAIT natively.
> > > > > > > >
> > > > > > > > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > > > > > > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > > > > > > > natively was one of the options Alex Graf and Rene Rebe used back in
> > > > > > > > the very early days of OS X on QEMU, at the time I got involved with
> > > > > > > > that project. Here's part of an out of tree patch against 3.4 which did
> > > > > > > > just that, and worked as far as I remember on *any* MWAIT capable
> > > > > > > > intel chip I had access to back in 2010:
> > > > > > > >
> > > > > > > > ##############################################################################
> > > > > > > > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> > > > > > > > ##############################################################################
> > > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > > > > > > > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > > > > > > > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > > > > > > > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > > > > > > > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > > > > > > > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > > > > > > > /* cpuid 1.ecx */
> > > > > > > > const u32 kvm_supported_word4_x86_features =
> > > > > > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > > > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > > > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > > > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > > > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > > > > > > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > > > > > > > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > > > > > > > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > > > > > > > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > > > > > > > set_intercept(svm, INTERCEPT_VMSAVE);
> > > > > > > > set_intercept(svm, INTERCEPT_STGI);
> > > > > > > > set_intercept(svm, INTERCEPT_CLGI);
> > > > > > > > set_intercept(svm, INTERCEPT_SKINIT);
> > > > > > > > set_intercept(svm, INTERCEPT_WBINVD);
> > > > > > > > - set_intercept(svm, INTERCEPT_MONITOR);
> > > > > > > > - set_intercept(svm, INTERCEPT_MWAIT);
> > > > > > > > set_intercept(svm, INTERCEPT_XSETBV);
> > > > > > > >
> > > > > > > > control->iopm_base_pa = iopm_base;
> > > > > > > > control->msrpm_base_pa = __pa(svm->msrpm);
> > > > > > > > control->int_ctl = V_INTR_MASKING_MASK;
> > > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > > > > > > > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > > > > > > > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > > > > > > > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > > > > > > > nested_vmx_procbased_ctls_low = 0;
> > > > > > > > nested_vmx_procbased_ctls_high &=
> > > > > > > > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > > > > > > > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > > > > > > > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > > + CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > > > > #ifdef CONFIG_X86_64
> > > > > > > > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > > > > > > > #endif
> > > > > > > > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > > > > > > > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > > > > > > > CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > > > > CPU_BASED_USE_IO_BITMAPS |
> > > > > > > > CPU_BASED_MOV_DR_EXITING |
> > > > > > > > CPU_BASED_USE_TSC_OFFSETING |
> > > > > > > > - CPU_BASED_MWAIT_EXITING |
> > > > > > > > - CPU_BASED_MONITOR_EXITING |
> > > > > > > > CPU_BASED_INVLPG_EXITING |
> > > > > > > > CPU_BASED_RDPMC_EXITING;
> > > > > > > >
> > > > > > > > opt = CPU_BASED_TPR_SHADOW |
> > > > > > > > CPU_BASED_USE_MSR_BITMAPS |
> > > > > > > >
> > > > > > > > If all you're trying to do is (selectively) revert to this behavior,
> > > > > > > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > > > > > > > confused at this point :)
> > > > > > >
> > > > > > > Yes. Me too. Want to try that other patch and see what happens?
> > > > > >
> > > > > > You mean the old 3.4 patch against current KVM ? I'll try to do that,
> > > > > > might take me a while :)
> > > > >
> > > > > Michael's patch already did most of that, you just need to add
> > > > >
> > > > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > > > > index efde6cc50875..b12f07d4ce17 100644
> > > > > --- a/arch/x86/kvm/cpuid.c
> > > > > +++ b/arch/x86/kvm/cpuid.c
> > > > > @@ -348,7 +348,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> > > > > const u32 kvm_cpuid_1_ecx_x86_features =
> > > > > /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> > > > > * but *not* advertised to guests via CPUID ! */
> > > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > > >
> > > > > Note: this will never be upstream, because mwait isn't what we want by
> > > > > default. :)
> > > >
> > > > But since OS X doesn't check CPUID and simply runs MONITOR & MWAIT
> > > > assuming they're present, the above one-liner would make no
> > > > difference. If everything else in the old patch I quoted is identical
> > > > to what Michael does, then I don't know -- maybe the MacPro1,1 has
> > > > really broken L>=1 MWAIT, and it only ever worked with vmexit and
> > > > emulation on the host side.
> > >
> > >
> > > I think I have an idea. It is probably one of the monitor bugs
> > > on this host.
> > >
> > > X86_BUG_CLFLUSH_MONITOR or X86_BUG_MONITOR.
> > >
> > > If you tell guest you have a CPU that does not need it
> > > but host does need it, then mwait will not work.
> > >
> > > if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_CLFLUSH) &&
> > > (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
> > > set_cpu_bug(c, X86_BUG_CLFLUSH_MONITOR);
> > >
> > >
> > > if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_MWAIT) &&
> > > ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
> > > set_cpu_bug(c, X86_BUG_MONITOR);
> > >
> > > what did you say your host model is?
> >
> > # dmidecode -t1
> > # dmidecode 2.12
> > SMBIOS 2.4 present.
> >
> > Handle 0x0021, DMI type 1, 27 bytes
> > System Information
> > Manufacturer: Apple Computer, Inc.
> > Product Name: MacPro1,1
> > Version: 1.0
> > Serial Number: G87030UEUPZ
> > UUID: 9CFE245E-D0C8-BD45-A79F-54EA5FBD3D97
> > Wake-up Type: Power Switch
> > SKU Number: System SKU#
> > Family: MacPro
>
> And, probably more usefully:
>
> [... omitting 0,1,2 ...]
>
> processor : 3
> vendor_id : GenuineIntel
> cpu family : 6
> model : 15
> model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
> stepping : 6
> microcode : 0xd2
> cpu MHz : 2659.998
> cache size : 4096 KB
> physical id : 3
> siblings : 2
> core id : 0
> cpu cores : 2
> apicid : 6
> initial apicid : 6
> fpu : yes
> fpu_exception : yes
> cpuid level : 10
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow dtherm
> bugs :
> bogomips : 5320.05
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:


Hmm, nope, not one of these.
Need to poke at errata some more.
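
For reference, the model comparison behind that conclusion can be written
out as a small sketch. It only mirrors the family/model part of the two
kernel checks quoted above; the boot_cpu_has() feature tests are left out,
the helper names are hypothetical, and the numeric value used for
INTEL_FAM6_ATOM_GOLDMONT (0x5C) is an assumption not stated in this thread.

```c
#include <stdbool.h>

/* Assumed numeric value of INTEL_FAM6_ATOM_GOLDMONT (not given in the thread). */
#define MODEL_ATOM_GOLDMONT 0x5C

/* Family/model part of the X86_BUG_CLFLUSH_MONITOR check quoted above. */
static bool model_has_clflush_monitor_bug(unsigned family, unsigned model)
{
    return family == 6 && (model == 29 || model == 46 || model == 47);
}

/* Family/model part of the X86_BUG_MONITOR check quoted above. */
static bool model_has_monitor_bug(unsigned family, unsigned model)
{
    return family == 6 && model == MODEL_ATOM_GOLDMONT;
}
```

With the values from the /proc/cpuinfo dump (cpu family 6, model 15), both
helpers return false, which matches the empty "bugs" line in that dump.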

--
MST

2017-03-16 17:23:28

by Radim Krčmář

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-03-16 12:47-0400, Gabriel L. Somlo:
> On Thu, Mar 16, 2017 at 05:01:58PM +0100, Radim Krčmář wrote:
> > 2017-03-16 16:35+0100, Radim Krčmář:
> > > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > >> The intel manual said the same thing back in 2010 as well. However,
> > >> regardless of how any flags were set, interrupt-window exiting or not,
> > >> "normal" L1 MWAIT behavior was that it woke up immediately regardless.
> > >> Remember, never going to sleep is still correct ("normal" ?) behavior
> > >> per the ISA definition of MWAIT :)
> > >
> > > I'll write a simple kvm-unit-test to better understand why it is broken
> > > for you ...
> >
> > Please get git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
> >
> > and try this, thanks!
> >
> > ---8<---
> > x86/mwait: crappy test
> >
> > `./configure && make` to build it, then follow the comment in code to
> > try a few cases.
>
> kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1'
> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 1
> enabling apic
> PASS: resumed from mwait 10000 times
> SUMMARY: 1 tests
>
> real 0m10.564s
> user 0m10.339s
> sys 0m0.225s
>
>
> and
>
> kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0'
> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 0
> enabling apic
> PASS: resumed from mwait 10000 times
> SUMMARY: 1 tests
>
> real 0m0.746s
> user 0m0.555s
> sys 0m0.200s
>
> Both of these with Michael's v5 patch applied, on the MacPro1,1.
>
> Similar behavior (0 1 1 takes 10 seconds, 0 1 0 returns immediately)
> on the macbook air.
>
> If I revert to the original (nop-emulated MWAIT) kvm source, I get
> both versions to return immediately.

Those look normal ... maybe MWAIT just ignores writes to the monitored
area?
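
The arithmetic behind "look normal" can be sketched as follows, assuming
(as the comment in the test does) a 1000 Hz host timer and one MWAIT resume
per tick. The function names and the 50% threshold are hypothetical and
only serve the illustration.

```c
/* Expected wall-clock seconds if every MWAIT sleeps until the next host
 * timer tick, i.e. one resume per tick. */
static double expected_seconds(unsigned resumes, unsigned hz)
{
    return (double)resumes / (double)hz;
}

/* Rough heuristic with an arbitrary 50% threshold: treat MWAIT as really
 * sleeping when the run took at least half of the expected time. */
static int mwait_appears_to_sleep(double measured_seconds,
                                  unsigned resumes, unsigned hz)
{
    return measured_seconds >= 0.5 * expected_seconds(resumes, hz);
}
```

For TARGET_RESUMES = 10000 and 1000 Hz the expected time is 10 seconds, so
the 10.564s run of '0 1 1' is consistent with MWAIT sleeping until a timer
tick, and the 0.746s run of '0 1 0' with MWAIT returning immediately.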

Please apply the patch below and then try:

time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1' -smp 2
time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 0 1' -smp 2
time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 0 0' -smp 2

All of them should take roughly the same time as the NOP one:

time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0' -smp 2

Thanks.

---8<---
diff --git a/x86/mwait.c b/x86/mwait.c
index c21dab5cc97d..ca38e7223596 100644
--- a/x86/mwait.c
+++ b/x86/mwait.c
@@ -1,7 +1,9 @@
#include "vm.h"
+#include "smp.h"

#define TARGET_RESUMES 10000
volatile unsigned page[4096 / 4];
+volatile unsigned resumes;

/*
* Execute
@@ -18,19 +20,39 @@ volatile unsigned page[4096 / 4];
* Getting killed by the TIMEOUT most likely means that you have different HZ,
* but could also be a bug ...
*/
+void writer(void *null)
+{
+ int i;
+ unsigned old_resumes = 0, new_resumes;
+
+ for (i = 0; i < TARGET_RESUMES; i++) {
+ (*page)++;
+
+ while (old_resumes == (new_resumes = resumes))
+ pause();
+ old_resumes = new_resumes;
+ }
+}
+
int main(int argc, char **argv)
{
uint32_t eax = atol(argv[1]);
uint32_t ecx = atol(argv[2]);
bool sti = atol(argv[3]);
- unsigned resumes = 0;
+ bool smp;
+
+ smp_init();
+ smp = cpu_count() > 1;
+
+ if (smp)
+ on_cpu_async(1, writer, NULL);

if (sti)
asm volatile ("sti");
else
asm volatile ("cli");

- while (resumes < TARGET_RESUMES) {
+ while ((smp ? *page : resumes) < TARGET_RESUMES) {
asm volatile("monitor" :: "a" (page), "c" (0), "d" (0));
asm volatile("mwait" :: "a" (eax), "c" (ecx));
resumes++;
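
What the -smp 2 variant is meant to distinguish can be put into a toy,
single-threaded model. This is only a sketch under stated assumptions, not
the test itself: the function name is hypothetical, the writer and the
waiter are interleaved by hand instead of running on two CPUs, and an MWAIT
that ignores the store is assumed to sleep exactly until the next host timer
tick.

```c
#include <stdbool.h>

/* Toy model of the writer/waiter handshake from the patch above.
 * Returns how many host timer ticks the waiter sleeps through before the
 * monitored word reaches target. */
static unsigned ticks_until_done(unsigned target, bool wakes_on_write)
{
    unsigned page = 0;     /* monitored word, bumped by the writer */
    unsigned resumes = 0;  /* bumped by the waiter after each MWAIT return */
    unsigned ticks = 0;

    while (page < target) {
        page++;            /* writer: store to the monitored area */
        if (!wakes_on_write)
            ticks++;       /* waiter only wakes on the next timer tick */
        resumes++;         /* waiter: MWAIT returned, writer may continue */
    }
    (void)resumes;
    return ticks;
}
```

In this model, an MWAIT that is woken by the store needs no ticks at all, so
the run should finish about as fast as the NOP case; an MWAIT that ignores
the store needs 10000 ticks for TARGET_RESUMES = 10000, which is about 10
seconds at 1000 Hz.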

2017-03-16 17:27:40

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 12:47:50PM -0400, Gabriel L. Somlo wrote:
> On Thu, Mar 16, 2017 at 05:01:58PM +0100, Radim Krčmář wrote:
> > 2017-03-16 16:35+0100, Radim Krčmář:
> > > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > >> The intel manual said the same thing back in 2010 as well. However,
> > >> regardless of how any flags were set, interrupt-window exiting or not,
> > >> "normal" L1 MWAIT behavior was that it woke up immediately regardless.
> > >> Remember, never going to sleep is still correct ("normal" ?) behavior
> > >> per the ISA definition of MWAIT :)
> > >
> > > I'll write a simple kvm-unit-test to better understand why it is broken
> > > for you ...
> >
> > Please get git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
> >
> > and try this, thanks!
> >
> > ---8<---
> > x86/mwait: crappy test
> >
> > `./configure && make` to build it, then follow the comment in code to
> > try a few cases.
>
> kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1'
> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 1
> enabling apic
> PASS: resumed from mwait 10000 times
> SUMMARY: 1 tests
>
> real 0m10.564s
> user 0m10.339s
> sys 0m0.225s
>
>
> and
>
> kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0'
> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 0
> enabling apic
> PASS: resumed from mwait 10000 times
> SUMMARY: 1 tests
>
> real 0m0.746s
> user 0m0.555s
> sys 0m0.200s
>
> Both of these with Michael's v5 patch applied, on the MacPro1,1.

Would it make sense to try setting ECX to 0, i.e. '0 0 1' and '0 0 0'?


> Similar behavior (0 1 1 takes 10 seconds, 0 1 0 returns immediately)
> on the macbook air.
>
> If I revert to the original (nop-emulated MWAIT) kvm source, I get
> both versions to return immediately.
>
> HTH,
> --Gabriel
>
>
>
> >
> > ---
> > x86/Makefile.common | 1 +
> > x86/mwait.c | 41 +++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 42 insertions(+)
> > create mode 100644 x86/mwait.c
> >
> > diff --git a/x86/Makefile.common b/x86/Makefile.common
> > index 1dad18ba26e1..1e708a6acd39 100644
> > --- a/x86/Makefile.common
> > +++ b/x86/Makefile.common
> > @@ -46,6 +46,7 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
> > $(TEST_DIR)/tsc_adjust.flat $(TEST_DIR)/asyncpf.flat \
> > $(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat \
> > $(TEST_DIR)/hyperv_synic.flat $(TEST_DIR)/hyperv_stimer.flat \
> > + $(TEST_DIR)/mwait.flat \
> >
> > ifdef API
> > tests-common += api/api-sample
> > diff --git a/x86/mwait.c b/x86/mwait.c
> > new file mode 100644
> > index 000000000000..c21dab5cc97d
> > --- /dev/null
> > +++ b/x86/mwait.c
> > @@ -0,0 +1,41 @@
> > +#include "vm.h"
> > +
> > +#define TARGET_RESUMES 10000
> > +volatile unsigned page[4096 / 4];
> > +
> > +/*
> > + * Execute
> > + * time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1'
> > + * (first two arguments are eax and ecx for MWAIT, the third is FLAGS.IF bit)
> > + * I assume you have 1000 Hz scheduler, so the test should take about 10
> > + * seconds to run if mwait works (host timer interrupts will kick mwait).
> > + *
> > + * If you get far less, then mwait is just nop, as in the case of
> > + *
> > + * time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0'
> > + *
> > + * All other combinations of arguments should take 10 seconds.
> > + * Getting killed by the TIMEOUT most likely means that you have different HZ,
> > + * but could also be a bug ...
> > + */
> > +int main(int argc, char **argv)
> > +{
> > + uint32_t eax = atol(argv[1]);
> > + uint32_t ecx = atol(argv[2]);
> > + bool sti = atol(argv[3]);
> > + unsigned resumes = 0;
> > +
> > + if (sti)
> > + asm volatile ("sti");
> > + else
> > + asm volatile ("cli");
> > +
> > + while (resumes < TARGET_RESUMES) {
> > + asm volatile("monitor" :: "a" (page), "c" (0), "d" (0));
> > + asm volatile("mwait" :: "a" (eax), "c" (ecx));
> > + resumes++;
> > + }
> > +
> > + report("resumed from mwait %u times", resumes == TARGET_RESUMES, resumes);
> > + return report_summary();
> > +}
> > --
> > 2.11.0
> >

2017-03-16 17:39:01

by Radim Krčmář

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-03-16 19:14+0200, Michael S. Tsirkin:
> On Thu, Mar 16, 2017 at 12:54:50PM -0400, Gabriel L. Somlo wrote:
> > On Thu, Mar 16, 2017 at 12:52:32PM -0400, Gabriel L. Somlo wrote:
> > > On Thu, Mar 16, 2017 at 06:45:02PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 16, 2017 at 12:16:13PM -0400, Gabriel L. Somlo wrote:
> > > > > On Thu, Mar 16, 2017 at 04:35:18PM +0100, Radim Krčmář wrote:
> > > > > > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > > > > > > On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > > > > > > > > After studying your patch a bit more carefully (sorry, it's crazy
> > > > > > > > > around here right now :) ) I realized you're simply trying to
> > > > > > > > > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > > > > > > > > just allow L1 to execute MONITOR & MWAIT natively.
> > > > > > > > >
> > > > > > > > > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > > > > > > > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > > > > > > > > natively was one of the options Alex Graf and Rene Rebe used back in
> > > > > > > > > the very early days of OS X on QEMU, at the time I got involved with
> > > > > > > > > that project. Here's part of an out of tree patch against 3.4 which did
> > > > > > > > > just that, and worked as far as I remember on *any* MWAIT capable
> > > > > > > > > intel chip I had access to back in 2010:
> > > > > > > > >
> > > > > > > > > ##############################################################################
> > > > > > > > > # 99-mwait.patch.kvm-kmod (Rene Rebe <[email protected]>) 2010-04-27
> > > > > > > > > ##############################################################################
> > > > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > > > > > > > > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > > > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > > > > > > > > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > > > > > > > > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > > > > > > > > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > > > > > > > > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > > > > > > > > /* cpuid 1.ecx */
> > > > > > > > > const u32 kvm_supported_word4_x86_features =
> > > > > > > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > > > > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > > > > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > > > > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > > > > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > > > > > > > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > > > > > > > > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > > > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > > > > > > > > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > > > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > > > > > > > > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > > > > > > > > set_intercept(svm, INTERCEPT_VMSAVE);
> > > > > > > > > set_intercept(svm, INTERCEPT_STGI);
> > > > > > > > > set_intercept(svm, INTERCEPT_CLGI);
> > > > > > > > > set_intercept(svm, INTERCEPT_SKINIT);
> > > > > > > > > set_intercept(svm, INTERCEPT_WBINVD);
> > > > > > > > > - set_intercept(svm, INTERCEPT_MONITOR);
> > > > > > > > > - set_intercept(svm, INTERCEPT_MWAIT);
> > > > > > > > > set_intercept(svm, INTERCEPT_XSETBV);
> > > > > > > > >
> > > > > > > > > control->iopm_base_pa = iopm_base;
> > > > > > > > > control->msrpm_base_pa = __pa(svm->msrpm);
> > > > > > > > > control->int_ctl = V_INTR_MASKING_MASK;
> > > > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > > > > > > > > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > > > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > > > > > > > > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > > > > > > > > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > > > > > > > > nested_vmx_procbased_ctls_low = 0;
> > > > > > > > > nested_vmx_procbased_ctls_high &=
> > > > > > > > > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > > > > > > > > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > > > > > > > > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > > > + CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > > > > > #ifdef CONFIG_X86_64
> > > > > > > > > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > > > > > > > > #endif
> > > > > > > > > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > > > > > > > > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > > > > > > > > CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > > > > > CPU_BASED_USE_IO_BITMAPS |
> > > > > > > > > CPU_BASED_MOV_DR_EXITING |
> > > > > > > > > CPU_BASED_USE_TSC_OFFSETING |
> > > > > > > > > - CPU_BASED_MWAIT_EXITING |
> > > > > > > > > - CPU_BASED_MONITOR_EXITING |
> > > > > > > > > CPU_BASED_INVLPG_EXITING |
> > > > > > > > > CPU_BASED_RDPMC_EXITING;
> > > > > > > > >
> > > > > > > > > opt = CPU_BASED_TPR_SHADOW |
> > > > > > > > > CPU_BASED_USE_MSR_BITMAPS |
> > > > > > > > >
> > > > > > > > > If all you're trying to do is (selectively) revert to this behavior,
> > > > > > > > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > > > > > > > > confused at this point :)
> > > > > > > >
> > > > > > > > Yes. Me too. Want to try that other patch and see what happens?
> > > > > > >
> > > > > > > You mean the old 3.4 patch against current KVM ? I'll try to do that,
> > > > > > > might take me a while :)
> > > > > >
> > > > > > Michael's patch already did most of that, you just need to add
> > > > > >
> > > > > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > > > > > index efde6cc50875..b12f07d4ce17 100644
> > > > > > --- a/arch/x86/kvm/cpuid.c
> > > > > > +++ b/arch/x86/kvm/cpuid.c
> > > > > > @@ -348,7 +348,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> > > > > > const u32 kvm_cpuid_1_ecx_x86_features =
> > > > > > /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> > > > > > * but *not* advertised to guests via CPUID ! */
> > > > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > > > >
> > > > > > Note: this will never be upstream, because mwait isn't what we want by
> > > > > > default. :)
> > > > >
> > > > > But since OS X doesn't check CPUID and simply runs MONITOR & MWAIT
> > > > > assuming they're present, the above one-liner would make no
> > > > > difference. If everything else in the old patch I quoted is identical
> > > > > to what Michael does, then I don't know -- maybe the MacPro1,1 has
> > > > > really broken L>=1 MWAIT, and it only ever worked with vmexit and
> > > > > emulation on the host side.
> > > >
> > > >
> > > > I think I have an idea. It is probably one of the monitor bugs
> > > > on this host.
> > > >
> > > > X86_BUG_CLFLUSH_MONITOR or X86_BUG_MONITOR.
> > > >
> > > > If you tell guest you have a CPU that does not need it
> > > > but host does need it, then mwait will not work.
> > > >
> > > > if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_CLFLUSH) &&
> > > > (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
> > > > set_cpu_bug(c, X86_BUG_CLFLUSH_MONITOR);
> > > >
> > > >
> > > > if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_MWAIT) &&
> > > > ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
> > > > set_cpu_bug(c, X86_BUG_MONITOR);
> > > >
> > > > what did you say your host model is?
> > >
> > > # dmidecode -t1
> > > # dmidecode 2.12
> > > SMBIOS 2.4 present.
> > >
> > > Handle 0x0021, DMI type 1, 27 bytes
> > > System Information
> > > Manufacturer: Apple Computer, Inc.
> > > Product Name: MacPro1,1
> > > Version: 1.0
> > > Serial Number: G87030UEUPZ
> > > UUID: 9CFE245E-D0C8-BD45-A79F-54EA5FBD3D97
> > > Wake-up Type: Power Switch
> > > SKU Number: System SKU#
> > > Family: MacPro
> >
> > And, probably more usefully:
> >
> > [... omitting 0,1,2 ...]
> >
> > processor : 3
> > vendor_id : GenuineIntel
> > cpu family : 6
> > model : 15
> > model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
> > stepping : 6
> > microcode : 0xd2
> > cpu MHz : 2659.998
> > cache size : 4096 KB
> > physical id : 3
> > siblings : 2
> > core id : 0
> > cpu cores : 2
> > apicid : 6
> > initial apicid : 6
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 10
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow dtherm
> > bugs :
> > bogomips : 5320.05
> > clflush size : 64
> > cache_alignment : 64
> > address sizes : 36 bits physical, 48 bits virtual
> > power management:
>
>
> Hmm nope not one of these.
> Need to poke at errata some more.

Intel lists two bugs with MWAIT on that model:
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5100-spec-update.pdf

AG36. Split Locked Stores May not Trigger the Monitoring Hardware
AG106. A REP STOS/MOVS to a MONITOR/MWAIT Address Range May Prevent Triggering of the Monitoring Hardware

The latter can be dismissed as it should have been hit on bare metal as
well. The former looks pretty unlikely as well, but maybe the guest
maps w/b what bare metal would map differently?

2017-03-16 17:39:42

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 06:22:44PM +0100, Radim Krčmář wrote:
> 2017-03-16 12:47-0400, Gabriel L. Somlo:
> > On Thu, Mar 16, 2017 at 05:01:58PM +0100, Radim Krčmář wrote:
> > > 2017-03-16 16:35+0100, Radim Krčmář:
> > > > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > > >> The intel manual said the same thing back in 2010 as well. However,
> > > >> regardless of how any flags were set, interrupt-window exiting or not,
> > > >> "normal" L1 MWAIT behavior was that it woke up immediately regardless.
> > > >> Remember, never going to sleep is still correct ("normal" ?) behavior
> > > >> per the ISA definition of MWAIT :)
> > > >
> > > > I'll write a simple kvm-unit-test to better understand why it is broken
> > > > for you ...
> > >
> > > Please get git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
> > >
> > > and try this, thanks!
> > >
> > > ---8<---
> > > x86/mwait: crappy test
> > >
> > > `./configure && make` to build it, then follow the comment in code to
> > > try few cases.
> >
> > kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1'
> > timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 1
> > enabling apic
> > PASS: resumed from mwait 10000 times
> > SUMMARY: 1 tests
> >
> > real 0m10.564s
> > user 0m10.339s
> > sys 0m0.225s
> >
> >
> > and
> >
> > kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0'
> > timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 0
> > enabling apic
> > PASS: resumed from mwait 10000 times
> > SUMMARY: 1 tests
> >
> > real 0m0.746s
> > user 0m0.555s
> > sys 0m0.200s
> >
> > Both of these with Michael's v5 patch applied, on the MacPro1,1.
> >
> > Similar behavior (0 1 1 takes 10 seconds, 0 1 0 returns immediately)
> > on the macbook air.
> >
> > If I revert to the original (nop-emulated MWAIT) kvm source, I get
> > both versions to return immediately.
>
> Those look normal ... maybe MWAIT just ignores writes to the monitored
> area?
>
> Please apply the patch below and following and try:
>
> time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1' -smp 2

timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 1 -smp 2
enabling apic
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m0.758s
user 0m0.557s
sys 0m0.220s

> time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 0 1' -smp 2

timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 0 1 -smp 2
enabling apic
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m0.748s
user 0m0.550s
sys 0m0.210s

> time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 0 0' -smp 2

timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 0 0 -smp 2
enabling apic
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m0.745s
user 0m0.558s
sys 0m0.203s

>
> All of them should take roughly the same time as the NOP one,
>
> time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0' -smp 2

They all *did* return fast, as you expected.

> ---8<---
> diff --git a/x86/mwait.c b/x86/mwait.c
> index c21dab5cc97d..ca38e7223596 100644
> --- a/x86/mwait.c
> +++ b/x86/mwait.c
> @@ -1,7 +1,9 @@
> #include "vm.h"
> +#include "smp.h"
>
> #define TARGET_RESUMES 10000
> volatile unsigned page[4096 / 4];
> +volatile unsigned resumes;
>
> /*
> * Execute
> @@ -18,19 +20,39 @@ volatile unsigned page[4096 / 4];
> * Getting killed by the TIMEOUT most likely means that you have different HZ,
> * but could also be a bug ...
> */
> +void writer(void *null)
> +{
> + int i;
> + unsigned old_resumes = 0, new_resumes;
> +
> + for (i = 0; i < TARGET_RESUMES; i++) {
> + (*page)++;
> +
> + while (old_resumes == (new_resumes = resumes))
> + pause();
> + old_resumes = new_resumes;
> + }
> +}
> +
> int main(int argc, char **argv)
> {
> uint32_t eax = atol(argv[1]);
> uint32_t ecx = atol(argv[2]);
> bool sti = atol(argv[3]);
> - unsigned resumes = 0;
> + bool smp;
> +
> + smp_init();
> + smp = cpu_count() > 1;
> +
> + if (smp)
> + on_cpu_async(1, writer, NULL);
>
> if (sti)
> asm volatile ("sti");
> else
> asm volatile ("cli");
>
> - while (resumes < TARGET_RESUMES) {
> + while ((smp ? *page : resumes) < TARGET_RESUMES) {
> asm volatile("monitor" :: "a" (page), "c" (0), "d" (0));
> asm volatile("mwait" :: "a" (eax), "c" (ecx));
> resumes++;

2017-03-16 17:42:48

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 07:27:34PM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 16, 2017 at 12:47:50PM -0400, Gabriel L. Somlo wrote:
> > On Thu, Mar 16, 2017 at 05:01:58PM +0100, Radim Krčmář wrote:
> > > 2017-03-16 16:35+0100, Radim Krčmář:
> > > > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > > >> The intel manual said the same thing back in 2010 as well. However,
> > > >> regardless of how any flags were set, interrupt-window exiting or not,
> > > >> "normal" L1 MWAIT behavior was that it woke up immediately regardless.
> > > >> Remember, never going to sleep is still correct ("normal" ?) behavior
> > > >> per the ISA definition of MWAIT :)
> > > >
> > > > I'll write a simple kvm-unit-test to better understand why it is broken
> > > > for you ...
> > >
> > > Please get git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
> > >
> > > and try this, thanks!
> > >
> > > ---8<---
> > > x86/mwait: crappy test
> > >
> > > `./configure && make` to build it, then follow the comment in code to
> > > try few cases.
> >
> > kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1'
> > timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 1
> > enabling apic
> > PASS: resumed from mwait 10000 times
> > SUMMARY: 1 tests
> >
> > real 0m10.564s
> > user 0m10.339s
> > sys 0m0.225s
> >
> >
> > and
> >
> > kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0'
> > timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 1 0
> > enabling apic
> > PASS: resumed from mwait 10000 times
> > SUMMARY: 1 tests
> >
> > real 0m0.746s
> > user 0m0.555s
> > sys 0m0.200s
> >
> > Both of these with Michael's v5 patch applied, on the MacPro1,1.
>
> Would it make sense to try to set ECX to 0? 0 0 1 and 0 0 0.

$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 0 1'
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 0 1
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m10.567s
user 0m10.367s
sys 0m0.210s


$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 0 0'
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 0 0 0
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m10.549s
user 0m10.352s
sys 0m0.206s

Both took 10 seconds.

>
> > Similar behavior (0 1 1 takes 10 seconds, 0 1 0 returns immediately)
> > on the macbook air.
> >
> > If I revert to the original (nop-emulated MWAIT) kvm source, I get
> > both versions to return immediately.
> >
> > HTH,
> > --Gabriel
> >
> >
> >
> > >
> > > ---
> > > x86/Makefile.common | 1 +
> > > x86/mwait.c | 41 +++++++++++++++++++++++++++++++++++++++++
> > > 2 files changed, 42 insertions(+)
> > > create mode 100644 x86/mwait.c
> > >
> > > diff --git a/x86/Makefile.common b/x86/Makefile.common
> > > index 1dad18ba26e1..1e708a6acd39 100644
> > > --- a/x86/Makefile.common
> > > +++ b/x86/Makefile.common
> > > @@ -46,6 +46,7 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
> > > $(TEST_DIR)/tsc_adjust.flat $(TEST_DIR)/asyncpf.flat \
> > > $(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat \
> > > $(TEST_DIR)/hyperv_synic.flat $(TEST_DIR)/hyperv_stimer.flat \
> > > + $(TEST_DIR)/mwait.flat \
> > >
> > > ifdef API
> > > tests-common += api/api-sample
> > > diff --git a/x86/mwait.c b/x86/mwait.c
> > > new file mode 100644
> > > index 000000000000..c21dab5cc97d
> > > --- /dev/null
> > > +++ b/x86/mwait.c
> > > @@ -0,0 +1,41 @@
> > > +#include "vm.h"
> > > +
> > > +#define TARGET_RESUMES 10000
> > > +volatile unsigned page[4096 / 4];
> > > +
> > > +/*
> > > + * Execute
> > > + * time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 1'
> > > + * (first two arguments are eax and ecx for MWAIT, the third is FLAGS.IF bit)
> > > + * I assume you have 1000 Hz scheduler, so the test should take about 10
> > > + * seconds to run if mwait works (host timer interrupts will kick mwait).
> > > + *
> > > + * If you get far less, then mwait is just nop, as in the case of
> > > + *
> > > + * time TIMEOUT=20 ./x86-run x86/mwait.flat -append '0 1 0'
> > > + *
> > > + * All other combinations of arguments should take 10 seconds.
> > > + * Getting killed by the TIMEOUT most likely means that you have different HZ,
> > > + * but could also be a bug ...
> > > + */
> > > +int main(int argc, char **argv)
> > > +{
> > > + uint32_t eax = atol(argv[1]);
> > > + uint32_t ecx = atol(argv[2]);
> > > + bool sti = atol(argv[3]);
> > > + unsigned resumes = 0;
> > > +
> > > + if (sti)
> > > + asm volatile ("sti");
> > > + else
> > > + asm volatile ("cli");
> > > +
> > > + while (resumes < TARGET_RESUMES) {
> > > + asm volatile("monitor" :: "a" (page), "c" (0), "d" (0));
> > > + asm volatile("mwait" :: "a" (eax), "c" (ecx));
> > > + resumes++;
> > > + }
> > > +
> > > + report("resumed from mwait %u times", resumes == TARGET_RESUMES, resumes);
> > > + return report_summary();
> > > +}
> > > --
> > > 2.11.0
> > >

2017-03-16 18:29:58

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

Let's take a step back and try to figure out how is
mwait called. How about dumping code of VCPUs
around mwait? gdb disa command will do this.

--
MST

2017-03-16 19:25:59

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 08:29:32PM +0200, Michael S. Tsirkin wrote:
> Let's take a step back and try to figure out how is
> mwait called. How about dumping code of VCPUs
> around mwait? gdb disa command will do this.

Started guest with '-s', tried to attach from gdb with
"target remote localhost:1234", got
"remote 'g' packet reply is too long: <lengthy string of numbers>"

Tried typing 'cont' in the qemu monitor, got os x to crash:

panic (cpu 1 caller 0xffffff7f813ff488): pmLock: waited too long, held
by 0xffffff7f813eff65

Hmm, maybe that's where it keeps its monitor/mwait idle loop.
Restarted the guest, tried this from monitor:

dump-guest-memory foobar 0xffffff7f813e0000 0x20000

Got "'dump-guest-memory' has failed: integer is for 32-bit values"

Hmmm... I have no idea what I'm doing anymore at this point... :)

--G

2017-03-16 19:28:01

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 03:24:41PM -0400, Gabriel L. Somlo wrote:
> On Thu, Mar 16, 2017 at 08:29:32PM +0200, Michael S. Tsirkin wrote:
> > Let's take a step back and try to figure out how is
> > mwait called. How about dumping code of VCPUs
> > around mwait? gdb disa command will do this.
>
> Started guest with '-s', tried to attach from gdb with
> "target remote localhost:1234", got
> "remote 'g' packet reply is too long: <lengthy string of numbers>"

Try

set arch x86-64:x86-64


> Tried typing 'cont' in the qemu monitor, got os x to crash:
>
> panic (cpu 1 caller 0xffffff7f813ff488): pmLock: waited too long, held
> by 0xffffff7f813eff65
>
> Hmm, maybe that's where it keeps its monitor/mwait idle loop.
> Restarted the guest, tried this from monitor:
>
> dump-guest-memory foobar 0xffffff7f813e0000 0x20000
>
> Got "'dump-guest-memory' has failed: integer is for 32-bit values"
>
> Hmmm... I have no idea what I'm doing anymore at this point... :)
>
> --G

I think 0xffffff7f813ff488 is a PC.

--
MST

2017-03-16 20:17:17

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 09:27:56PM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 16, 2017 at 03:24:41PM -0400, Gabriel L. Somlo wrote:
> > On Thu, Mar 16, 2017 at 08:29:32PM +0200, Michael S. Tsirkin wrote:
> > > Let's take a step back and try to figure out how is
> > > mwait called. How about dumping code of VCPUs
> > > around mwait? gdb disa command will do this.
> >
> > Started guest with '-s', tried to attach from gdb with
> > "target remote localhost:1234", got
> > "remote 'g' packet reply is too long: <lengthy string of numbers>"
>
> Try
>
> set arch x86-64:x86-64

'set architecture i386:x86-64:intel' is what worked for me;

Been rooting around for a while, can't find mwait or monitor :(

Guess I'll have to recompile KVM to actually issue an invalid opcode,
so OS X will print a panic message with the exact address :)

Stay tuned...

>
> > Tried typing 'cont' in the qemu monitor, got os x to crash:
> >
> > panic (cpu 1 caller 0xffffff7f813ff488): pmLock: waited too long, held
> > by 0xffffff7f813eff65
> >
> > Hmm, maybe that's where it keeps its monitor/mwait idle loop.
> > Restarted the guest, tried this from monitor:
> >
> > dump-guest-memory foobar 0xffffff7f813e0000 0x20000
> >
> > Got "'dump-guest-memory' has failed: integer is for 32-bit values"
> >
> > Hmmm... I have no idea what I'm doing anymore at this point... :)
> >
> > --G
>
> I think 0xffffff7f813ff488 is a PC.
>
> --
> MST

2017-03-16 21:15:19

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 04:17:11PM -0400, Gabriel L. Somlo wrote:
> On Thu, Mar 16, 2017 at 09:27:56PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 16, 2017 at 03:24:41PM -0400, Gabriel L. Somlo wrote:
> > > On Thu, Mar 16, 2017 at 08:29:32PM +0200, Michael S. Tsirkin wrote:
> > > > Let's take a step back and try to figure out how is
> > > > mwait called. How about dumping code of VCPUs
> > > > around mwait? gdb disa command will do this.
> > >
> > > Started guest with '-s', tried to attach from gdb with
> > > "target remote localhost:1234", got
> > > "remote 'g' packet reply is too long: <lengthy string of numbers>"
> >
> > Try
> >
> > set arch x86-64:x86-64
>
> 'set architecture i386:x86-64:intel' is what worked for me;
>
> Been rooting around for a while, can't find mwait or monitor :(
>
> Guess I'll have to recompile KVM to actually issue an invalid opcode,
> so OS X will print a panic message with the exact address :)
>
> Stay tuned...

OK, so I found a few instances. The one closest to where a random
interrupt from gdb landed, was this one:

...
0xffffff7f813ff379: mov 0x90(%r15),%rax
0xffffff7f813ff380: mov 0x18(%rax),%rsi
0xffffff7f813ff384: xor %ecx,%ecx
0xffffff7f813ff386: mov %rsi,%rax
0xffffff7f813ff389: xor %edx,%edx
0xffffff7f813ff38b: monitor %rax,%rcx,%rdx
0xffffff7f813ff38e: test %r14,%r14
0xffffff7f813ff391: je 0xffffff7f813ff3ad
0xffffff7f813ff393: movq $0x0,0x8(%r14)
0xffffff7f813ff39b: movl $0x0,(%r14)
0xffffff7f813ff3a2: test %ebx,%ebx
0xffffff7f813ff3a4: je 0xffffff7f813ff3b2
0xffffff7f813ff3a6: mfence
0xffffff7f813ff3a9: wbinvd
0xffffff7f813ff3ab: jmp 0xffffff7f813ff3b2
0xffffff7f813ff3ad: cmpl $0x0,(%rsi)
0xffffff7f813ff3b0: jne 0xffffff7f813ff3d6
0xffffff7f813ff3b2: mov %r12d,%eax
0xffffff7f813ff3b5: imul $0x148,%rax,%rax
0xffffff7f813ff3bc: lea 0x153bd(%rip),%rcx # 0xffffff7f81414780
0xffffff7f813ff3c3: mov (%rcx),%rcx
0xffffff7f813ff3c6: mov 0x20(%rcx),%rcx
0xffffff7f813ff3ca: mov 0xc(%rcx,%rax,1),%eax
0xffffff7f813ff3ce: mov $0x1,%ecx
0xffffff7f813ff3d3: mwait %rax,%rcx
=> 0xffffff7f813ff3d6: lfence
0xffffff7f813ff3d9: rdtsc
0xffffff7f813ff3db: lfence
0xffffff7f813ff3de: mov %rax,%rbx
0xffffff7f813ff3e1: mov %rdx,%r15
...

Also, there were a few more within the range occupied by
AppleIntelCPUPowerManagement.kext (which provides the "smart"
idle loop used by OS X):


...
0xffffff7f813f799a: mov 0x90(%r15),%rax
0xffffff7f813f79a1: mov 0x18(%rax),%r15
0xffffff7f813f79a5: xor %ecx,%ecx
0xffffff7f813f79a7: mov %r15,%rax
0xffffff7f813f79aa: xor %edx,%edx
0xffffff7f813f79ac: monitor %rax,%rcx,%rdx
0xffffff7f813f79af: mov %r12d,%r12d
0xffffff7f813f79b2: imul $0x148,%r12,%r13
0xffffff7f813f79b9: lea 0x1cdc0(%rip),%rax # 0xffffff7f81414780
0xffffff7f813f79c0: mov (%rax),%rax
0xffffff7f813f79c3: mov 0x20(%rax),%rcx
0xffffff7f813f79c7: testb $0x10,0x2(%rcx,%r13,1)
0xffffff7f813f79cd: je 0xffffff7f813f79d5
0xffffff7f813f79cf: callq *0x80(%rax)
0xffffff7f813f79d5: test %r14,%r14
0xffffff7f813f79d8: je 0xffffff7f813f79f4
0xffffff7f813f79da: movq $0x0,0x8(%r14)
0xffffff7f813f79e2: movl $0x0,(%r14)
0xffffff7f813f79e9: test %ebx,%ebx
0xffffff7f813f79eb: je 0xffffff7f813f79fa
0xffffff7f813f79ed: mfence
0xffffff7f813f79f0: wbinvd
0xffffff7f813f79f2: jmp 0xffffff7f813f79fa
0xffffff7f813f79f4: cmpl $0x0,(%r15)
0xffffff7f813f79f8: jne 0xffffff7f813f7a15
0xffffff7f813f79fa: lea 0x1cd7f(%rip),%rax # 0xffffff7f81414780
0xffffff7f813f7a01: mov (%rax),%rax
0xffffff7f813f7a04: mov 0x20(%rax),%rax
0xffffff7f813f7a08: mov 0xc(%rax,%r13,1),%eax
0xffffff7f813f7a0d: mov $0x1,%ecx
0xffffff7f813f7a12: mwait %rax,%rcx
0xffffff7f813f7a15: lfence
0xffffff7f813f7a18: rdtsc
0xffffff7f813f7a1a: lfence
0xffffff7f813f7a1d: mov %rax,%rbx
0xffffff7f813f7a20: mov %rdx,%r15
...

...
0xffffff7f813f89c9: xor %ecx,%ecx
0xffffff7f813f89cb: mov %r13,%rax
0xffffff7f813f89ce: xor %edx,%edx
0xffffff7f813f89d0: monitor %rax,%rcx,%rdx
0xffffff7f813f89d3: mov %r12d,%r15d
0xffffff7f813f89d6: imul $0x148,%r15,%r12
0xffffff7f813f89dd: lea 0x1bd9c(%rip),%rax # 0xffffff7f81414780
0xffffff7f813f89e4: mov (%rax),%rax
0xffffff7f813f89e7: mov 0x20(%rax),%rcx
0xffffff7f813f89eb: testb $0x10,0x2(%rcx,%r12,1)
0xffffff7f813f89f1: je 0xffffff7f813f89f9
0xffffff7f813f89f3: callq *0x80(%rax)
0xffffff7f813f89f9: test %r14,%r14
0xffffff7f813f89fc: je 0xffffff7f813f8a18
0xffffff7f813f89fe: movq $0x0,0x8(%r14)
0xffffff7f813f8a06: movl $0x0,(%r14)
0xffffff7f813f8a0d: test %ebx,%ebx
0xffffff7f813f8a0f: je 0xffffff7f813f8a1f
0xffffff7f813f8a11: mfence
0xffffff7f813f8a14: wbinvd
0xffffff7f813f8a16: jmp 0xffffff7f813f8a1f
0xffffff7f813f8a18: cmpl $0x0,0x0(%r13)
0xffffff7f813f8a1d: jne 0xffffff7f813f8a3a
0xffffff7f813f8a1f: lea 0x1bd5a(%rip),%rax # 0xffffff7f81414780
0xffffff7f813f8a26: mov (%rax),%rax
0xffffff7f813f8a29: mov 0x20(%rax),%rax
0xffffff7f813f8a2d: mov 0xc(%rax,%r12,1),%eax
0xffffff7f813f8a32: mov $0x1,%ecx
0xffffff7f813f8a37: mwait %rax,%rcx
0xffffff7f813f8a3a: lfence
0xffffff7f813f8a3d: rdtsc
0xffffff7f813f8a3f: lfence
0xffffff7f813f8a42: mov %rax,%rbx
0xffffff7f813f8a45: mov %rdx,%r12
0xffffff7f813f8a48: shl $0x20,%r12
...

...
0xffffff7f81401c10: mov %r13,%rax
0xffffff7f81401c13: xor %edx,%edx
0xffffff7f81401c15: monitor %rax,%rcx,%rdx
0xffffff7f81401c18: mov %r12d,%r15d
0xffffff7f81401c1b: imul $0x148,%r15,%r12
0xffffff7f81401c22: lea 0x12b57(%rip),%rax # 0xffffff7f81414780
0xffffff7f81401c29: mov (%rax),%rax
0xffffff7f81401c2c: mov 0x20(%rax),%rcx
0xffffff7f81401c30: testb $0x10,0x2(%rcx,%r12,1)
0xffffff7f81401c36: je 0xffffff7f81401c3e
0xffffff7f81401c38: callq *0x80(%rax)
0xffffff7f81401c3e: test %r14,%r14
0xffffff7f81401c41: je 0xffffff7f81401c5d
0xffffff7f81401c43: movq $0x0,0x8(%r14)
0xffffff7f81401c4b: movl $0x0,(%r14)
0xffffff7f81401c52: test %ebx,%ebx
0xffffff7f81401c54: je 0xffffff7f81401c64
0xffffff7f81401c56: mfence
0xffffff7f81401c59: wbinvd
0xffffff7f81401c5b: jmp 0xffffff7f81401c64
0xffffff7f81401c5d: cmpl $0x0,0x0(%r13)
0xffffff7f81401c62: jne 0xffffff7f81401c7f
0xffffff7f81401c64: lea 0x12b15(%rip),%rax # 0xffffff7f81414780
0xffffff7f81401c6b: mov (%rax),%rax
0xffffff7f81401c6e: mov 0x20(%rax),%rax
0xffffff7f81401c72: mov 0xc(%rax,%r12,1),%eax
0xffffff7f81401c77: mov $0x1,%ecx
0xffffff7f81401c7c: mwait %rax,%rcx
0xffffff7f81401c7f: lfence
0xffffff7f81401c82: rdtsc
0xffffff7f81401c84: lfence
0xffffff7f81401c87: mov %rax,%rbx
0xffffff7f81401c8a: mov %rdx,%r12
0xffffff7f81401c8d: shl $0x20,%r12
0xffffff7f81401c91: lea 0xaf1c(%rip),%rax # 0xffffff7f8140cbb4
0xffffff7f81401c98: testb $0x1,(%rax)
...

If that's not enough context, I can email you the whole 'script'
output I collected...

HTH,
--Gabriel

2017-03-17 02:04:16

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Thu, Mar 16, 2017 at 05:14:15PM -0400, Gabriel L. Somlo wrote:
> On Thu, Mar 16, 2017 at 04:17:11PM -0400, Gabriel L. Somlo wrote:
> > On Thu, Mar 16, 2017 at 09:27:56PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Mar 16, 2017 at 03:24:41PM -0400, Gabriel L. Somlo wrote:
> > > > On Thu, Mar 16, 2017 at 08:29:32PM +0200, Michael S. Tsirkin wrote:
> > > > > Let's take a step back and try to figure out how is
> > > > > mwait called. How about dumping code of VCPUs
> > > > > around mwait? gdb disa command will do this.
> > > >
> > > > Started guest with '-s', tried to attach from gdb with
> > > > "target remote localhost:1234", got
> > > > "remote 'g' packet reply is too long: <lengthy string of numbers>"
> > >
> > > Try
> > >
> > > set arch x86-64:x86-64
> >
> > 'set architecture i386:x86-64:intel' is what worked for me;
> >
> > Been rooting around for a while, can't find mwait or monitor :(
> >
> > Guess I'll have to recompile KVM to actually issue an invalid opcode,
> > so OS X will print a panic message with the exact address :)
> >
> > Stay tuned...
>
> OK, so I found a few instances. The one closest to where a random
> interrupt from gdb landed, was this one:
>
> ...
> 0xffffff7f813ff379: mov 0x90(%r15),%rax
> 0xffffff7f813ff380: mov 0x18(%rax),%rsi
> 0xffffff7f813ff384: xor %ecx,%ecx
> 0xffffff7f813ff386: mov %rsi,%rax
> 0xffffff7f813ff389: xor %edx,%edx
> 0xffffff7f813ff38b: monitor %rax,%rcx,%rdx
> 0xffffff7f813ff38e: test %r14,%r14
> 0xffffff7f813ff391: je 0xffffff7f813ff3ad
> 0xffffff7f813ff393: movq $0x0,0x8(%r14)
> 0xffffff7f813ff39b: movl $0x0,(%r14)
> 0xffffff7f813ff3a2: test %ebx,%ebx
> 0xffffff7f813ff3a4: je 0xffffff7f813ff3b2
> 0xffffff7f813ff3a6: mfence
> 0xffffff7f813ff3a9: wbinvd
> 0xffffff7f813ff3ab: jmp 0xffffff7f813ff3b2
> 0xffffff7f813ff3ad: cmpl $0x0,(%rsi)

Seems to do cmpl - could indicate it uses different bytes
for signalling? Radim's test monitors and
modifies the same byte...

> 0xffffff7f813ff3b0: jne 0xffffff7f813ff3d6
> 0xffffff7f813ff3b2: mov %r12d,%eax
> 0xffffff7f813ff3b5: imul $0x148,%rax,%rax
> 0xffffff7f813ff3bc: lea 0x153bd(%rip),%rcx # 0xffffff7f81414780
> 0xffffff7f813ff3c3: mov (%rcx),%rcx
> 0xffffff7f813ff3c6: mov 0x20(%rcx),%rcx
> 0xffffff7f813ff3ca: mov 0xc(%rcx,%rax,1),%eax
> 0xffffff7f813ff3ce: mov $0x1,%ecx
> 0xffffff7f813ff3d3: mwait %rax,%rcx
> => 0xffffff7f813ff3d6: lfence
> 0xffffff7f813ff3d9: rdtsc
> 0xffffff7f813ff3db: lfence
> 0xffffff7f813ff3de: mov %rax,%rbx
> 0xffffff7f813ff3e1: mov %rdx,%r15
> ...

OK nice, so it's actually using 1 for ECX. Now what's rax?
Can you check that with gdb pls, then try that value with
Radim's test?

--
MST

2017-03-17 13:32:58

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Fri, Mar 17, 2017 at 04:03:59AM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 16, 2017 at 05:14:15PM -0400, Gabriel L. Somlo wrote:
> > On Thu, Mar 16, 2017 at 04:17:11PM -0400, Gabriel L. Somlo wrote:
> > > On Thu, Mar 16, 2017 at 09:27:56PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 16, 2017 at 03:24:41PM -0400, Gabriel L. Somlo wrote:
> > > > > On Thu, Mar 16, 2017 at 08:29:32PM +0200, Michael S. Tsirkin wrote:
> > > > > > Let's take a step back and try to figure out how is
> > > > > > mwait called. How about dumping code of VCPUs
> > > > > > around mwait? gdb disa command will do this.
> > > > >
> > > > > Started guest with '-s', tried to attach from gdb with
> > > > > "target remote localhost:1234", got
> > > > > "remote 'g' packet reply is too long: <lengthy string of numbers>"
> > > >
> > > > Try
> > > >
> > > > set arch x86-64:x86-64
> > >
> > > 'set architecture i386:x86-64:intel' is what worked for me;
> > >
> > > Been rooting around for a while, can't find mwait or monitor :(
> > >
> > > Guess I'll have to recompile KVM to actually issue an invalid opcode,
> > > so OS X will print a panic message with the exact address :)
> > >
> > > Stay tuned...
> >
> > OK, so I found a few instances. The one closest to where a random
> > interrupt from gdb landed, was this one:
> >
> > ...
> > 0xffffff7f813ff379: mov 0x90(%r15),%rax
> > 0xffffff7f813ff380: mov 0x18(%rax),%rsi
> > 0xffffff7f813ff384: xor %ecx,%ecx
> > 0xffffff7f813ff386: mov %rsi,%rax
> > 0xffffff7f813ff389: xor %edx,%edx
> > 0xffffff7f813ff38b: monitor %rax,%rcx,%rdx
> > 0xffffff7f813ff38e: test %r14,%r14
> > 0xffffff7f813ff391: je 0xffffff7f813ff3ad
> > 0xffffff7f813ff393: movq $0x0,0x8(%r14)
> > 0xffffff7f813ff39b: movl $0x0,(%r14)
> > 0xffffff7f813ff3a2: test %ebx,%ebx
> > 0xffffff7f813ff3a4: je 0xffffff7f813ff3b2
> > 0xffffff7f813ff3a6: mfence
> > 0xffffff7f813ff3a9: wbinvd
> > 0xffffff7f813ff3ab: jmp 0xffffff7f813ff3b2
> > 0xffffff7f813ff3ad: cmpl $0x0,(%rsi)
>
> Seems to do cmpl - could indicate it uses different bytes
> for signalling? Radim's test monitors and
> modifies the same byte...
>
> > 0xffffff7f813ff3b0: jne 0xffffff7f813ff3d6
> > 0xffffff7f813ff3b2: mov %r12d,%eax
> > 0xffffff7f813ff3b5: imul $0x148,%rax,%rax
> > 0xffffff7f813ff3bc: lea 0x153bd(%rip),%rcx # 0xffffff7f81414780
> > 0xffffff7f813ff3c3: mov (%rcx),%rcx
> > 0xffffff7f813ff3c6: mov 0x20(%rcx),%rcx
> > 0xffffff7f813ff3ca: mov 0xc(%rcx,%rax,1),%eax
> > 0xffffff7f813ff3ce: mov $0x1,%ecx
> > 0xffffff7f813ff3d3: mwait %rax,%rcx
> > => 0xffffff7f813ff3d6: lfence
> > 0xffffff7f813ff3d9: rdtsc
> > 0xffffff7f813ff3db: lfence
> > 0xffffff7f813ff3de: mov %rax,%rbx
> > 0xffffff7f813ff3e1: mov %rdx,%r15
> > ...
>
> OK nice, so it's actually using 1 for ECX. Now what's rax?
> Can you check that with gdb pls, then try that value with
> Radim's test?

Thread 1 received signal SIGINT, Interrupt.
0xffffff80002c8991 in ?? ()
(gdb) break *0xffffff7f813ff3ce
Breakpoint 1 at 0xffffff7f813ff3ce
(gdb) continue
Continuing.

Thread 3 hit Breakpoint 1, 0xffffff7f813ff3ce in ?? ()
(gdb) p $rax
$1 = 240
(gdb) cont
Continuing.
[Switching to Thread 1]

Thread 1 hit Breakpoint 1, 0xffffff7f813ff3ce in ?? ()
(gdb) p $rax
$2 = 240
(gdb) cont
Continuing.
[Switching to Thread 4]

Thread 4 hit Breakpoint 1, 0xffffff7f813ff3ce in ?? ()
(gdb) p $rax
$3 = 240
(gdb) cont
Continuing.

Thread 4 hit Breakpoint 1, 0xffffff7f813ff3ce in ?? ()
(gdb) p $rax
$4 = 240
(gdb)

So, 240 or 0xf0

OK, now on to Radim's test, on the MacPro1,1:

[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 1'
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 1
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m0.746s
user 0m0.542s
sys 0m0.215s
[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 0'
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 0
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m0.743s
user 0m0.528s
sys 0m0.226s
[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 1' -smp 2
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 1 -smp 2
enabling apic
enabling apic
FAIL: resumed from mwait 10150 times
SUMMARY: 1 tests, 1 unexpected failures

real 0m0.745s
user 0m0.545s
sys 0m0.214s
[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 0' -smp 2
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 0 -smp 2
enabling apic
enabling apic
FAIL: resumed from mwait 10116 times
SUMMARY: 1 tests, 1 unexpected failures

real 0m0.744s
user 0m0.541s
sys 0m0.217s

HTH,
--Gabriel

2017-03-21 03:22:27

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Fri, Mar 17, 2017 at 09:23:56AM -0400, Gabriel L. Somlo wrote:
> OK, now on to Radim's test, on the MacPro1,1:
>
> [kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 1'
> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 1
> enabling apic
> PASS: resumed from mwait 10000 times
> SUMMARY: 1 tests
>
> real 0m0.746s
> user 0m0.542s
> sys 0m0.215s
> [kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 0'
> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 0
> enabling apic
> PASS: resumed from mwait 10000 times
> SUMMARY: 1 tests
>
> real 0m0.743s
> user 0m0.528s
> sys 0m0.226s
> [kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 1' -smp 2
> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 1 -smp 2
> enabling apic
> enabling apic
> FAIL: resumed from mwait 10150 times
> SUMMARY: 1 tests, 1 unexpected failures
>
> real 0m0.745s
> user 0m0.545s
> sys 0m0.214s
> [kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 0' -smp 2
> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 0 -smp 2
> enabling apic
> enabling apic
> FAIL: resumed from mwait 10116 times
> SUMMARY: 1 tests, 1 unexpected failures
>
> real 0m0.744s
> user 0m0.541s
> sys 0m0.217s
>
> HTH,
> --Gabriel

Weird. How can it go above 10000? Radim - any idea?

--
MST

2017-03-21 16:17:00

by Joerg Roedel

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d1efe2c..18e53bc 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1198,8 +1198,6 @@ static void init_vmcb(struct vcpu_svm *svm)
> set_intercept(svm, INTERCEPT_CLGI);
> set_intercept(svm, INTERCEPT_SKINIT);
> set_intercept(svm, INTERCEPT_WBINVD);
> - set_intercept(svm, INTERCEPT_MONITOR);
> - set_intercept(svm, INTERCEPT_MWAIT);

Why do you remove the intercepts for AMD? The new kvm_mwait_in_guest()
function will always return false on AMD anyway, and on Intel you re-add
the intercepts for !kvm_mwait_in_guest().


Joerg

2017-03-21 17:00:18

by Radim Krčmář

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-03-21 05:22+0200, Michael S. Tsirkin:
> On Fri, Mar 17, 2017 at 09:23:56AM -0400, Gabriel L. Somlo wrote:
>> OK, now on to Radim's test, on the MacPro1,1:
>>
>> [kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 1'
>> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 1
>> enabling apic
>> PASS: resumed from mwait 10000 times
>> SUMMARY: 1 tests
>>
>> real 0m0.746s
>> user 0m0.542s
>> sys 0m0.215s
>> [kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 0'
>> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 0
>> enabling apic
>> PASS: resumed from mwait 10000 times
>> SUMMARY: 1 tests
>>
>> real 0m0.743s
>> user 0m0.528s
>> sys 0m0.226s
>> [kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 1' -smp 2
>> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 1 -smp 2
>> enabling apic
>> enabling apic
>> FAIL: resumed from mwait 10150 times
>> SUMMARY: 1 tests, 1 unexpected failures
>>
>> real 0m0.745s
>> user 0m0.545s
>> sys 0m0.214s
>> [kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 0' -smp 2
>> timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 0 -smp 2
>> enabling apic
>> enabling apic
>> FAIL: resumed from mwait 10116 times
>> SUMMARY: 1 tests, 1 unexpected failures
>>
>> real 0m0.744s
>> user 0m0.541s
>> sys 0m0.217s
>>
>> HTH,
>> --Gabriel
>
> Weird. How can it go above 10000? Radim - any idea?

In '-smp 2', the writing VCPU always does 10000 wakeups by writing into
monitored memory, but the mwaiting VCPU can also be woken up by host
interrupts, which might add a few exits depending on timing.

I didn't spend much time in making the PASS/FAIL mean much, or ensuring
that we only get 10000 wakeups ... it is nothing to be worried about.

Hint 240 behaves as nop even on my system, so I still don't find
anything insane on that machine (if OS X is excluded) ...

2017-03-21 17:30:09

by Nadav Amit

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests


> On Mar 21, 2017, at 9:58 AM, Radim Krčmář <[email protected]> wrote:

> In '-smp 2', the writing VCPU always does 10000 wakeups by writing into
> monitored memory, but the mwaiting VCPU can be also woken up by host
> interrupts, which might add a few exits depending on timing.
>
> I didn't spend much time in making the PASS/FAIL mean much, or ensuring
> that we only get 10000 wakeups ... it is nothing to be worried about.
>
> Hint 240 behaves as nop even on my system, so I still don't find
> anything insane on that machine (if OS X is excluded) ...

From my days at Intel (10 years ago), I can say that MWAIT wakes for many
microarchitectural events besides interrupts.

Out of curiosity, aren’t you worried that on OS X the wbinvd causes an exit
after the monitor and before the mwait?



2017-03-21 18:47:27

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Tue, Mar 21, 2017 at 05:16:32PM +0100, Joerg Roedel wrote:
> On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index d1efe2c..18e53bc 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -1198,8 +1198,6 @@ static void init_vmcb(struct vcpu_svm *svm)
> > set_intercept(svm, INTERCEPT_CLGI);
> > set_intercept(svm, INTERCEPT_SKINIT);
> > set_intercept(svm, INTERCEPT_WBINVD);
> > - set_intercept(svm, INTERCEPT_MONITOR);
> > - set_intercept(svm, INTERCEPT_MWAIT);
>
> Why do you remove the intercepts for AMD? The new kvm_mwait_in_guest()
> function will always return false on AMD anyway,

I think that's a bug and I should fix it to return true there.

> and on Intel you re-add
> the intercepts for !kvm_mwait_in_guest().
>
>
> Joerg

Does AMD need some work-around similar to CPUID5_ECX_INTERRUPT_BREAK?
That's why we have kvm_mwait_in_guest ...

--
MST

2017-03-21 19:27:10

by Radim Krčmář

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-03-21 10:29-0700, Nadav Amit:
>
> > On Mar 21, 2017, at 9:58 AM, Radim Krčmář <[email protected]> wrote:
>
> > In '-smp 2', the writing VCPU always does 10000 wakeups by writing into
> > monitored memory, but the mwaiting VCPU can be also woken up by host
> > interrupts, which might add a few exits depending on timing.
> >
> > I didn't spend much time in making the PASS/FAIL mean much, or ensuring
> > that we only get 10000 wakeups ... it is nothing to be worried about.
> >
> > Hint 240 behaves as nop even on my system, so I still don't find
> > anything insane on that machine (if OS X is excluded) ...
>
> From my days in Intel (10 years ago), I can say that MWAIT wakes for many
> microarchitectural events besides interrupts.
>
> Out of curiosity, aren’t you worried that on OS X the wbinvd causes an exit
> after the monitor and before the mwait?

VM entry clears the monitoring, so it should behave just like an MWAIT
without MONITOR, which is NOP according to the spec. It does so on
modern hardware, but it definitely is a good thing to try ...
(I am worried about disabling MWAIT exits by default and it's a no-go
until we understand why OS X doesn't work.)

Gabriel, how does testing with this change behave on the old machine?

Thanks.

---8<---
This should be the same as "wbinvd", because "wbinvd" does nothing
without non-coherent vfio.
Simply replacing "vmcall" with "wbinvd" is an option if the "vmcall"
version works as expected.
---
diff --git a/x86/mwait.c b/x86/mwait.c
index 20f4dcbff8ae..19f988b94541 100644
--- a/x86/mwait.c
+++ b/x86/mwait.c
@@ -54,6 +54,7 @@ int main(int argc, char **argv)

while ((smp ? *page : resumes) < TARGET_RESUMES) {
asm volatile("monitor" :: "a" (page), "c" (0), "d" (0));
+ asm volatile("vmcall" :: "a"(-1));
asm volatile("mwait" :: "a" (eax), "c" (ecx));
resumes++;
}

2017-03-21 22:51:32

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Tue, Mar 21, 2017 at 08:22:39PM +0100, Radim Krčmář wrote:
> 2017-03-21 10:29-0700, Nadav Amit:
> >
> > > On Mar 21, 2017, at 9:58 AM, Radim Krčmář <[email protected]> wrote:
> >
> > > In '-smp 2', the writing VCPU always does 10000 wakeups by writing into
> > > monitored memory, but the mwaiting VCPU can be also woken up by host
> > > interrupts, which might add a few exits depending on timing.
> > >
> > > I didn't spend much time in making the PASS/FAIL mean much, or ensuring
> > > that we only get 10000 wakeups ... it is nothing to be worried about.
> > >
> > > Hint 240 behaves as nop even on my system, so I still don't find
> > > anything insane on that machine (if OS X is excluded) ...

And I get the exact same results on the MacBookAir4,2 (which exhibits
no freezing or extreme sluggishness when running OS X 10.7 smp with
Michael's KVM MWAIT-in-L1 patch)...

> >
> > From my days in Intel (10 years ago), I can say that MWAIT wakes for many
> > microarchitectural events besides interrupts.
> >
> > Out of curiosity, aren’t you worried that on OS X the wbinvd causes an exit
> > after the monitor and before the mwait?
>
> VM entry clears the monitoring, so it should behave just like an MWAIT
> without MONITOR, which is NOP according to the spec. It does so on
> modern hardware, but it definitely is a good thing to try ...
> (I am worried about disabling MWAIT exits by default and it's a no-go
> until we understand why OS X doesn't work.)
>
> Gabriel, how does testing with this change behave on the old machine?
>
> Thanks.
>
> ---8<---
> This should be the same as "wbinvd", because "wbinvd" does nothing
> without non-coherent vfio.
> Simply replacing "vmcall" with "wbinvd" is an option if the "vmcall"
> version works as expected.
> ---
> diff --git a/x86/mwait.c b/x86/mwait.c
> index 20f4dcbff8ae..19f988b94541 100644
> --- a/x86/mwait.c
> +++ b/x86/mwait.c
> @@ -54,6 +54,7 @@ int main(int argc, char **argv)
>
> while ((smp ? *page : resumes) < TARGET_RESUMES) {
> asm volatile("monitor" :: "a" (page), "c" (0), "d" (0));
> + asm volatile("vmcall" :: "a"(-1));
> asm volatile("mwait" :: "a" (eax), "c" (ecx));
> resumes++;
> }

Sure thing, here's the MacPro1,1 results:

[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 0'
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 0
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m1.709s
user 0m0.547s
sys 0m0.243s
[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 1'
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 1
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m0.752s
user 0m0.545s
sys 0m0.218s
[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 0' -smp 2
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 0 -smp 2
enabling apic
enabling apic
FAIL: resumed from mwait 10004 times
SUMMARY: 1 tests, 1 unexpected failures

real 0m0.753s
user 0m0.554s
sys 0m0.227s
[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 1' -smp 2
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 1 -smp 2
enabling apic
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m0.755s
user 0m0.562s
sys 0m0.221s


For comparison, the results including 'vmcall' on the MacBookAir4,2
(interesting, the results for the last test, "-append '240 1 1' -smp 2",
are different):

[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 0'
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 0
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m0.622s
user 0m0.501s
sys 0m0.130s
[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 1'
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 1
enabling apic
PASS: resumed from mwait 10000 times
SUMMARY: 1 tests

real 0m0.624s
user 0m0.504s
sys 0m0.127s
[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 0' -smp 2
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 0 -smp 2
enabling apic
enabling apic
FAIL: resumed from mwait 10023 times
SUMMARY: 1 tests, 1 unexpected failures

real 0m0.623s
user 0m0.544s
sys 0m0.110s
[kvm-unit-tests]$ time TIMEOUT=20 ./x86-run x86/mwait.flat -append '240 1 1' -smp 2
timeout -k 1s --foreground 20 qemu-kvm -nodefaults -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -kernel x86/mwait.flat -append 240 1 1 -smp 2
enabling apic
enabling apic
FAIL: resumed from mwait 10006 times
SUMMARY: 1 tests, 1 unexpected failures

real 0m0.618s
user 0m0.527s
sys 0m0.121s

HTH,
--Gabriel

2017-03-22 00:02:37

by Nadav Amit

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests


> On Mar 21, 2017, at 3:51 PM, Gabriel Somlo <[email protected]> wrote:
>
> And I get the exact same results on the MacBookAir4,2 (which exhibits
> no freezing or extreme sluggishness when running OS X 10.7 smp with
> Michael's KVM MWAIT-in-L1 patch)...

Sorry for my confusion. I didn’t read the entire thread and thought that
the problem is spurious wake-ups.

Since that is not the case, I would just suggest two things that you can
freely ignore:

1. According to the SDM, when an interrupt is delivered, the interrupt
is only delivered on the following instruction, so you may consider
skipping the MWAIT first.

2. Perhaps the CPU changes GUEST_ACTIVITY_STATE for some reason (which
would not be according to the SDM).

That is it. No more BS from me.

Nadav

2017-03-22 13:37:33

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Tue, Mar 21, 2017 at 05:02:25PM -0700, Nadav Amit wrote:
>
> > On Mar 21, 2017, at 3:51 PM, Gabriel Somlo <[email protected]> wrote:
> >
> > And I get the exact same results on the MacBookAir4,2 (which exhibits
> > no freezing or extreme sluggishness when running OS X 10.7 smp with
> > Michael's KVM MWAIT-in-L1 patch)...
>
> Sorry for my confusion. I didn’t read the entire thread and thought that
> the problem is spurious wake-ups.
>
> Since that is not the case, I would just suggest two things that you can
> freely ignore:
>
> 1. According to the SDM, when an interrupt is delivered, the interrupt
> is only delivered on the following instruction, so you may consider
> skipping the MWAIT first.
>
> 2. Perhaps the CPU changes for some reason GUEST_ACTIVITY_STATE (which
> is not according to the SDM).
>
> That is it. No more BS from me.
>
> Nadav

Interesting. I found this erratum:
A REP STOS/MOVS to a MONITOR/MWAIT Address Range May Prevent Triggering of
the Monitoring Hardware

Could the macbook CPU be affected?

--
MST

2017-03-22 14:10:22

by Gabriel L. Somlo

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Wed, Mar 22, 2017 at 03:35:18PM +0200, Michael S. Tsirkin wrote:
> On Tue, Mar 21, 2017 at 05:02:25PM -0700, Nadav Amit wrote:
> >
> > > On Mar 21, 2017, at 3:51 PM, Gabriel Somlo <[email protected]> wrote:
> > >
> > > And I get the exact same results on the MacBookAir4,2 (which exhibits
> > > no freezing or extreme sluggishness when running OS X 10.7 smp with
> > > Michael's KVM MWAIT-in-L1 patch)...
> >
> > Sorry for my confusion. I didn’t read the entire thread and thought that
> > the problem is spurious wake-ups.
> >
> > Since that is not the case, I would just suggest two things that you can
> > freely ignore:
> >
> > 1. According to the SDM, when an interrupt is delivered, the interrupt
> > is only delivered on the following instruction, so you may consider
> > skipping the MWAIT first.
> >
> > 2. Perhaps the CPU changes for some reason GUEST_ACTIVITY_STATE (which
> > is not according to the SDM).
> >
> > That is it. No more BS from me.
> >
> > Nadav
>
> Interesting. I found this erratum:
> A REP STOS/MOVS to a MONITOR/MWAIT Address Range May Prevent Triggering of
> the Monitoring Hardware

Any way to tell if they mean that for L0, or L>=1, or all of them?

> Could the macbook CPU be affected?

I ran a grep on the log file I collected when disassembling
AppleIntelCPUPowerManagement.kext (where the MWAIT-based idle
thread lives) a few days ago, and didn't find any "rep stos" or
"rep movs" instances.


2017-03-22 14:15:55

by Michael S. Tsirkin

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Wed, Mar 22, 2017 at 10:10:05AM -0400, Gabriel L. Somlo wrote:
> On Wed, Mar 22, 2017 at 03:35:18PM +0200, Michael S. Tsirkin wrote:
> > On Tue, Mar 21, 2017 at 05:02:25PM -0700, Nadav Amit wrote:
> > >
> > > > On Mar 21, 2017, at 3:51 PM, Gabriel Somlo <[email protected]> wrote:
> > > >
> > > > And I get the exact same results on the MacBookAir4,2 (which exhibits
> > > > no freezing or extreme sluggishness when running OS X 10.7 smp with
> > > > Michael's KVM MWAIT-in-L1 patch)...
> > >
> > > Sorry for my confusion. I didn’t read the entire thread and thought that
> > > the problem is spurious wake-ups.
> > >
> > > Since that is not the case, I would just suggest two things that you can
> > > freely ignore:
> > >
> > > 1. According to the SDM, when an interrupt is delivered, the interrupt
> > > is only delivered on the following instruction, so you may consider
> > > skipping the MWAIT first.
> > >
> > > 2. Perhaps the CPU changes for some reason GUEST_ACTIVITY_STATE (which
> > > is not according to the SDM).
> > >
> > > That is it. No more BS from me.
> > >
> > > Nadav
> >
> > Interesting. I found this erratum:
> > A REP STOS/MOVS to a MONITOR/MWAIT Address Range May Prevent Triggering of
> > the Monitoring Hardware
>
> Any way to tell if they mean that for L0, or L>=1, or all of them?
>
> > Could the macbook CPU be affected?
>
> I ran a grep on the log file I collected when disassembling
> AppleIntelCPUPowerManagement.kext (where the MWAIT-based idle
> thread lives) a few days ago, and didn't find any "rep stos" or
> "rep movs" instances.
>

Right but that would be on the waking side, not the one
that does mwait.

--
MST

2017-03-27 13:34:46

by Alexander Graf

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests



On 15/03/2017 22:22, Michael S. Tsirkin wrote:
> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> unless explicitly provided with kernel command line argument
> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> without checking CPUID.
>
> We currently emulate that as a NOP but on VMX we can do better: let
> guest stop the CPU until timer, IPI or memory change. CPU will be busy
> but that isn't any worse than a NOP emulation.
>
> Note that mwait within guests is not the same as on real hardware
> because halt causes an exit while mwait doesn't. For this reason it
> might not be a good idea to use the regular MWAIT flag in CPUID to
> signal this capability. Add a flag in the hypervisor leaf instead.

So imagine we had proper MWAIT emulation capabilities based on page
faults. In that case, we could do something as fancy as

Treat MWAIT as pass-through by default

Have a per-vcpu monitor timer 10 times a second in the background that
checks which instruction we're in

If we're in mwait for the last - say - 1 second, switch to emulated
MWAIT, if $IP was in non-mwait within that time, reset counter.

Or instead maybe just reuse the adaptive hlt logic?

Either way, with that we should be able to get super low latency IPIs
running while still maintaining some sanity on systems which don't have
dedicated CPUs for workloads.

And we wouldn't need guest modifications, which is a great plus. So
older guests (and Windows?) could benefit from mwait as well.
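
The sampling idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: every name here is hypothetical, nothing like it exists in KVM, and the sketch leaves out how the sampler would learn whether the guest IP sits on an MWAIT. Switching back to pass-through is not covered by the proposal, so the sketch simply stays in emulated mode once it gets there.

```c
#include <stdbool.h>

/* Hypothetical constants taken from the numbers in the proposal. */
#define SAMPLES_PER_SEC   10 /* "10 times a second" */
#define MWAIT_SWITCH_SECS 1  /* "say - 1 second" */

enum mwait_mode { MWAIT_PASSTHROUGH, MWAIT_EMULATED };

/* Hypothetical per-vcpu sampling state. */
struct mwait_sampler {
    enum mwait_mode mode;
    unsigned consecutive_in_mwait; /* samples in a row that saw MWAIT */
};

static void mwait_sampler_init(struct mwait_sampler *s)
{
    s->mode = MWAIT_PASSTHROUGH; /* pass-through by default */
    s->consecutive_in_mwait = 0;
}

/*
 * Called on every sampling tick. in_mwait tells whether the guest IP was
 * found at an MWAIT instruction. Returns the mode to use from now on.
 */
static enum mwait_mode mwait_sampler_tick(struct mwait_sampler *s,
                                          bool in_mwait)
{
    if (!in_mwait) {
        /* Guest ran something else within the window: reset counter. */
        s->consecutive_in_mwait = 0;
        return s->mode;
    }
    if (s->mode == MWAIT_PASSTHROUGH &&
        ++s->consecutive_in_mwait >= SAMPLES_PER_SEC * MWAIT_SWITCH_SECS)
        s->mode = MWAIT_EMULATED;
    return s->mode;
}
```

With these numbers, ten consecutive samples that land on MWAIT flip the vcpu to emulated MWAIT, and any single sample elsewhere restarts the count.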


Alex

2017-03-28 14:29:03

by Radim Krčmář

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-03-27 15:34+0200, Alexander Graf:
> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>> unless explicitly provided with kernel command line argument
>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>> without checking CPUID.
>>
>> We currently emulate that as a NOP but on VMX we can do better: let
>> guest stop the CPU until timer, IPI or memory change. CPU will be busy
>> but that isn't any worse than a NOP emulation.
>>
>> Note that mwait within guests is not the same as on real hardware
>> because halt causes an exit while mwait doesn't. For this reason it
>> might not be a good idea to use the regular MWAIT flag in CPUID to
>> signal this capability. Add a flag in the hypervisor leaf instead.
>
> So imagine we had proper MWAIT emulation capabilities based on page faults.
> In that case, we could do something as fancy as
>
> Treat MWAIT as pass-through by default
>
> Have a per-vcpu monitor timer 10 times a second in the background that
> checks which instruction we're in
>
> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
> if $IP was in non-mwait within that time, reset counter.

Or we could reuse external interrupts for sampling. Exits triggered by
them would check for current instruction (probably would be best to
limit just to timer tick) and a sufficient ratio (> 0?) of other exits
would imply that MWAIT is not used.

> Or instead maybe just reuse the adapter hlt logic?

Emulated MWAIT is very similar to emulated HLT, so reusing the logic
makes sense. We would just add new wakeup methods.

> Either way, with that we should be able to get super low latency IPIs
> running while still maintaining some sanity on systems which don't have
> dedicated CPUs for workloads.
>
> And we wouldn't need guest modifications, which is a great plus. So older
> guests (and Windows?) could benefit from mwait as well.

There is no need for guest modifications -- it could be exposed as standard
MWAIT feature to the guest, with responsibilities for guest/host-impact
on the user.

I think that the page-fault based MWAIT would require paravirt if it
should be enabled by default, because of performance concerns:
Enabling write protection on a page needs a VM exit on all other VCPUs
when beginning monitoring (to reload page permissions and prevent missed
writes).
We'd want to keep trapping writes to the page all the time because
toggling is slow, but this could regress performance for an OS that has
other data accessed by other VCPUs in that page.
No current interface can tell the guest that it should reserve the whole
page instead of what CPUID[5] says and that writes to the monitored page
are not "cheap", but can trigger a VM exit ...

And before we disable MWAIT exiting by default, we also have to
understand the old OS X on core 2 bug from Gabriel.

2017-03-28 20:35:31

by Jim Mattson

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <[email protected]> wrote:
> 2017-03-27 15:34+0200, Alexander Graf:
>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>> unless explicitly provided with kernel command line argument
>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>> without checking CPUID.
>>>
>>> We currently emulate that as a NOP but on VMX we can do better: let
>>> guest stop the CPU until timer, IPI or memory change. CPU will be busy
>>> but that isn't any worse than a NOP emulation.
>>>
>>> Note that mwait within guests is not the same as on real hardware
>>> because halt causes an exit while mwait doesn't. For this reason it
>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>> signal this capability. Add a flag in the hypervisor leaf instead.
>>
>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>> In that case, we could do something as fancy as
>>
>> Treat MWAIT as pass-through by default
>>
>> Have a per-vcpu monitor timer 10 times a second in the background that
>> checks which instruction we're in
>>
>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>> if $IP was in non-mwait within that time, reset counter.
>
> Or we could reuse external interrupts for sampling. Exits triggered by
> them would check for current instruction (probably would be best to
> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
> would imply that MWAIT is not used.
>
>> Or instead maybe just reuse the adapter hlt logic?
>
> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
> makes sense. We would just add new wakeup methods.
>
>> Either way, with that we should be able to get super low latency IPIs
>> running while still maintaining some sanity on systems which don't have
>> dedicated CPUs for workloads.
>>
>> And we wouldn't need guest modifications, which is a great plus. So older
>> guests (and Windows?) could benefit from mwait as well.
>
> There is no need for guest modifications -- it could be exposed as standard
> MWAIT feature to the guest, with responsibilities for guest/host-impact
> on the user.
>
> I think that the page-fault based MWAIT would require paravirt if it
> should be enabled by default, because of performance concerns:
> Enabling write protection on a page needs a VM exit on all other VCPUs
> when beginning monitoring (to reload page permissions and prevent missed
> writes).
> We'd want to keep trapping writes to the page all the time because
> toggling is slow, but this could regress performance for an OS that has
> other data accessed by other VCPUs in that page.
> No current interface can tell the guest that it should reserve the whole
> page instead of what CPUID[5] says and that writes to the monitored page
> are not "cheap", but can trigger a VM exit ...

CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
when running Mac OS X guests. Per Intel's SDM volume 3, section
8.10.5, "To avoid false wake-ups; use the largest monitor line size to
pad the data structure used to monitor writes. Software must make sure
that beyond the data structure, no unrelated data variable exists in
the triggering area for MWAIT. A pad may be needed to avoid this
situation." Unfortunately, most operating systems do not follow this
advice.
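
As an illustration of the padding the SDM asks for, here is a minimal sketch, assuming the caller has already read the largest monitor-line size from CPUID.05H:EBX[15:0] and passes it in. The function names are hypothetical, and the sketch assumes a non-zero object size and a power-of-two line size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Round size up to a multiple of line; line must be a power of two. */
static size_t pad_to_monitor_line(size_t size, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    return (size + line - 1) & ~(line - 1);
}

/*
 * Allocate a zeroed object that is aligned to, and padded out to, the
 * largest monitor line, so that no unrelated data shares its trigger area.
 * Returns NULL on allocation failure; release with free().
 */
static void *alloc_monitored(size_t size, size_t largest_line)
{
    size_t padded = pad_to_monitor_line(size, largest_line);
    void *p = aligned_alloc(largest_line, padded);

    if (p)
        memset(p, 0, padded);
    return p;
}
```

With a reported largest line of 4096, a 4-byte wakeup flag allocated this way would occupy a whole page by itself, which is the behaviour a page-granular monitor would need from the guest.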

>
> And before we disable MWAIT exiting by default, we also have to
> understand the old OS X on core 2 bug from Gabriel.

2017-03-29 12:11:58

by Radim Krčmář

Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-03-28 13:35-0700, Jim Mattson:
> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <[email protected]> wrote:
>> 2017-03-27 15:34+0200, Alexander Graf:
>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>>> unless explicitly provided with kernel command line argument
>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>> without checking CPUID.
>>>>
>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>> guest stop the CPU until timer, IPI or memory change. CPU will be busy
>>>> but that isn't any worse than a NOP emulation.
>>>>
>>>> Note that mwait within guests is not the same as on real hardware
>>>> because halt causes an exit while mwait doesn't. For this reason it
>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>> signal this capability. Add a flag in the hypervisor leaf instead.
>>>
>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>> In that case, we could do something as fancy as
>>>
>>> Treat MWAIT as pass-through by default
>>>
>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>> checks which instruction we're in
>>>
>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>> if $IP was in non-mwait within that time, reset counter.
>>
>> Or we could reuse external interrupts for sampling. Exits trigerred by
>> them would check for current instruction (probably would be best to
>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>> would imply that MWAIT is not used.
>>
>>> Or instead maybe just reuse the adapter hlt logic?
>>
>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>> makes sense. We would just add new wakeup methods.
>>
>>> Either way, with that we should be able to get super low latency IPIs
>>> running while still maintaining some sanity on systems which don't have
>>> dedicated CPUs for workloads.
>>>
>>> And we wouldn't need guest modifications, which is a great plus. So older
>>> guests (and Windows?) could benefit from mwait as well.
>>
>> There is no need guest modifications -- it could be exposed as standard
>> MWAIT feature to the guest, with responsibilities for guest/host-impact
>> on the user.
>>
>> I think that the page-fault based MWAIT would require paravirt if it
>> should be enabled by default, because of performance concerns:
>> Enabling write protection on a page needs a VM exit on all other VCPUs
>> when beginning monitoring (to reload page permissions and prevent missed
>> writes).
>> We'd want to keep trapping writes to the page all the time because
>> toggling is slow, but this could regress performance for an OS that has
>> other data accessed by other VCPUs in that page.
>> No current interface can tell the guest that it should reserve the whole
>> page instead of what CPUID[5] says and that writes to the monitored page
>> are not "cheap", but can trigger a VM exit ...
>
> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
> when running Mac OS X guests. Per Intel's SDM volume 3, section
> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
> pad the data structure used to monitor writes. Software must make sure
> that beyond the data structure, no unrelated data variable exists in
> the triggering area for MWAIT. A pad may be needed to avoid this
> situation." Unfortunately, most operating systems do not follow this
> advice.

Right, EBX provides what we need to expose that the whole page is
monitored, thanks!

> Unfortunately, most operating systems do not follow this
> advice.

Yeah ... KVM could add yet another heuristic to drop MWAIT emulation and
use hardware if there were many traps while the target was not MWAITING,
but it's getting over-complicated :/
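
A heuristic of that kind might look roughly like the sketch below. Everything in it is an assumption made for illustration: the struct, the function names, the threshold of 1000 and the 10:1 ratio are hypothetical and do not come from KVM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: false traps tolerated before giving up. */
#define MWAIT_FALSE_TRAP_LIMIT 1000u

/* Hypothetical per-VM counters for write traps on the monitored page. */
struct mwait_heuristic {
    uint32_t traps_idle;    /* traps while the target was not in MWAIT */
    uint32_t traps_waiting; /* traps that actually woke a waiting VCPU */
};

/* Record one write trap on the monitored page. */
static void mwait_note_trap(struct mwait_heuristic *h, bool target_waiting)
{
    if (target_waiting)
        h->traps_waiting++;
    else
        h->traps_idle++;
}

/*
 * Return true when emulation looks like a net loss: many traps hit the
 * page while nobody was waiting, and they outnumber useful wakeups by
 * more than 10:1 (an arbitrary ratio chosen for this sketch).
 */
static bool mwait_should_use_hardware(const struct mwait_heuristic *h)
{
    return h->traps_idle >= MWAIT_FALSE_TRAP_LIMIT &&
           (uint64_t)h->traps_idle > 10u * (uint64_t)h->traps_waiting;
}
```

Even in this reduced form the policy needs two tunables and per-VM state, which is the over-complication mentioned above.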

2017-04-03 10:04:39

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On 03/29/2017 02:11 PM, Radim Krčmář wrote:
> 2017-03-28 13:35-0700, Jim Mattson:
>> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <[email protected]> wrote:
>>> 2017-03-27 15:34+0200, Alexander Graf:
>>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>>>> unless explicitly provided with kernel command line argument
>>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>>> without checking CPUID.
>>>>>
>>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>>> guest stop the CPU until timer, IPI or memory change. CPU will be busy
>>>>> but that isn't any worse than a NOP emulation.
>>>>>
>>>>> Note that mwait within guests is not the same as on real hardware
>>>>> because halt causes an exit while mwait doesn't. For this reason it
>>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>>> signal this capability. Add a flag in the hypervisor leaf instead.
>>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>>> In that case, we could do something as fancy as
>>>>
>>>> Treat MWAIT as pass-through by default
>>>>
>>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>>> checks which instruction we're in
>>>>
>>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>>> if $IP was in non-mwait within that time, reset counter.
>>> Or we could reuse external interrupts for sampling. Exits trigerred by
>>> them would check for current instruction (probably would be best to
>>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>>> would imply that MWAIT is not used.
>>>
>>>> Or instead maybe just reuse the adapter hlt logic?
>>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>>> makes sense. We would just add new wakeup methods.
>>>
>>>> Either way, with that we should be able to get super low latency IPIs
>>>> running while still maintaining some sanity on systems which don't have
>>>> dedicated CPUs for workloads.
>>>>
>>>> And we wouldn't need guest modifications, which is a great plus. So older
>>>> guests (and Windows?) could benefit from mwait as well.
>>> There is no need guest modifications -- it could be exposed as standard
>>> MWAIT feature to the guest, with responsibilities for guest/host-impact
>>> on the user.
>>>
>>> I think that the page-fault based MWAIT would require paravirt if it
>>> should be enabled by default, because of performance concerns:
>>> Enabling write protection on a page needs a VM exit on all other VCPUs
>>> when beginning monitoring (to reload page permissions and prevent missed
>>> writes).
>>> We'd want to keep trapping writes to the page all the time because
>>> toggling is slow, but this could regress performance for an OS that has
>>> other data accessed by other VCPUs in that page.
>>> No current interface can tell the guest that it should reserve the whole
>>> page instead of what CPUID[5] says and that writes to the monitored page
>>> are not "cheap", but can trigger a VM exit ...
>> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>> when running Mac OS X guests. Per Intel's SDM volume 3, section
>> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>> pad the data structure used to monitor writes. Software must make sure
>> that beyond the data structure, no unrelated data variable exists in
>> the triggering area for MWAIT. A pad may be needed to avoid this
>> situation." Unfortunately, most operating systems do not follow this
>> advice.
> Right, EBX provides what we need to expose that the whole page is
> monitored, thanks!

So coming back to the original patch, is there anything that should keep
us from exposing MWAIT straight into the guest at all times?


Alex

2017-04-04 12:39:31

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-04-03 12:04+0200, Alexander Graf:
> On 03/29/2017 02:11 PM, Radim Krčmář wrote:
>> 2017-03-28 13:35-0700, Jim Mattson:
>> > On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <[email protected]> wrote:
>> > > 2017-03-27 15:34+0200, Alexander Graf:
>> > > > On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>> > > > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>> > > > > unless explicitly provided with kernel command line argument
>> > > > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>> > > > > without checking CPUID.
>> > > > >
>> > > > > We currently emulate that as a NOP but on VMX we can do better: let
>> > > > > guest stop the CPU until timer, IPI or memory change. CPU will be busy
>> > > > > but that isn't any worse than a NOP emulation.
>> > > > >
>> > > > > Note that mwait within guests is not the same as on real hardware
>> > > > > because halt causes an exit while mwait doesn't. For this reason it
>> > > > > might not be a good idea to use the regular MWAIT flag in CPUID to
>> > > > > signal this capability. Add a flag in the hypervisor leaf instead.
>> > > > So imagine we had proper MWAIT emulation capabilities based on page faults.
>> > > > In that case, we could do something as fancy as
>> > > >
>> > > > Treat MWAIT as pass-through by default
>> > > >
>> > > > Have a per-vcpu monitor timer 10 times a second in the background that
>> > > > checks which instruction we're in
>> > > >
>> > > > If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>> > > > if $IP was in non-mwait within that time, reset counter.
>> > > Or we could reuse external interrupts for sampling. Exits trigerred by
>> > > them would check for current instruction (probably would be best to
>> > > limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>> > > would imply that MWAIT is not used.
>> > >
>> > > > Or instead maybe just reuse the adapter hlt logic?
>> > > Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>> > > makes sense. We would just add new wakeup methods.
>> > >
>> > > > Either way, with that we should be able to get super low latency IPIs
>> > > > running while still maintaining some sanity on systems which don't have
>> > > > dedicated CPUs for workloads.
>> > > >
>> > > > And we wouldn't need guest modifications, which is a great plus. So older
>> > > > guests (and Windows?) could benefit from mwait as well.
>> > > There is no need guest modifications -- it could be exposed as standard
>> > > MWAIT feature to the guest, with responsibilities for guest/host-impact
>> > > on the user.
>> > >
>> > > I think that the page-fault based MWAIT would require paravirt if it
>> > > should be enabled by default, because of performance concerns:
>> > > Enabling write protection on a page needs a VM exit on all other VCPUs
>> > > when beginning monitoring (to reload page permissions and prevent missed
>> > > writes).
>> > > We'd want to keep trapping writes to the page all the time because
>> > > toggling is slow, but this could regress performance for an OS that has
>> > > other data accessed by other VCPUs in that page.
>> > > No current interface can tell the guest that it should reserve the whole
>> > > page instead of what CPUID[5] says and that writes to the monitored page
>> > > are not "cheap", but can trigger a VM exit ...
>> > CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>> > VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>> > when running Mac OS X guests. Per Intel's SDM volume 3, section
>> > 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>> > pad the data structure used to monitor writes. Software must make sure
>> > that beyond the data structure, no unrelated data variable exists in
>> > the triggering area for MWAIT. A pad may be needed to avoid this
>> > situation." Unfortunately, most operating systems do not follow this
>> > advice.
>> Right, EBX provides what we need to expose that the whole page is
>> monitored, thanks!
>
> So coming back to the original patch, is there anything that should keep us
> from exposing MWAIT straight into the guest at all times?

Just minor issues:
 * OS X on Core 2 fails for an unknown reason if we disable the instruction
   trapping, which is an argument against doing it by default
 * idling guests would consume host CPU, which is a significant change
   in behavior and shouldn't be done without userspace's involvement

I think the best compromise is to add a capability for the MWAIT VM-exit
controls and let userspace expose MWAIT if it wishes to.
Will send a patch.
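
The userspace side of that compromise could be sketched as follows. This is only an illustration under stated assumptions: the function and both boolean inputs are hypothetical (one standing for "the kernel reports the capability", the other for "the user asked for MWAIT"). The only hardware fact used is that MONITOR/MWAIT support is advertised in CPUID.01H:ECX bit 3.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX bit 3 advertises MONITOR/MWAIT. */
#define CPUID_01H_ECX_MONITOR (1u << 3)

/*
 * Hypothetical policy helper: given the ECX value userspace is about to
 * hand to the guest, set the MWAIT flag only if the kernel can disable
 * MWAIT exiting AND the user explicitly opted in; otherwise clear it.
 */
static uint32_t guest_cpuid_01h_ecx(uint32_t ecx,
                                    bool kernel_can_disable_mwait_exits,
                                    bool user_wants_mwait)
{
    if (kernel_can_disable_mwait_exits && user_wants_mwait)
        return ecx | CPUID_01H_ECX_MONITOR;
    return ecx & ~CPUID_01H_ECX_MONITOR;
}
```

Keeping the flag cleared unless both conditions hold preserves today's behavior for users who do nothing.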

2017-04-04 12:51:11

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On 04/04/2017 02:39 PM, Radim Krčmář wrote:
> 2017-04-03 12:04+0200, Alexander Graf:
>> On 03/29/2017 02:11 PM, Radim Krčmář wrote:
>>> 2017-03-28 13:35-0700, Jim Mattson:
>>>> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <[email protected]> wrote:
>>>>> 2017-03-27 15:34+0200, Alexander Graf:
>>>>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>>>>>> unless explicitly provided with kernel command line argument
>>>>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>>>>> without checking CPUID.
>>>>>>>
>>>>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>>>>> guest stop the CPU until timer, IPI or memory change. CPU will be busy
>>>>>>> but that isn't any worse than a NOP emulation.
>>>>>>>
>>>>>>> Note that mwait within guests is not the same as on real hardware
>>>>>>> because halt causes an exit while mwait doesn't. For this reason it
>>>>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>>>>> signal this capability. Add a flag in the hypervisor leaf instead.
>>>>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>>>>> In that case, we could do something as fancy as
>>>>>>
>>>>>> Treat MWAIT as pass-through by default
>>>>>>
>>>>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>>>>> checks which instruction we're in
>>>>>>
>>>>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>>>>> if $IP was in non-mwait within that time, reset counter.
>>>>> Or we could reuse external interrupts for sampling. Exits trigerred by
>>>>> them would check for current instruction (probably would be best to
>>>>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>>>>> would imply that MWAIT is not used.
>>>>>
>>>>>> Or instead maybe just reuse the adapter hlt logic?
>>>>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>>>>> makes sense. We would just add new wakeup methods.
>>>>>
>>>>>> Either way, with that we should be able to get super low latency IPIs
>>>>>> running while still maintaining some sanity on systems which don't have
>>>>>> dedicated CPUs for workloads.
>>>>>>
>>>>>> And we wouldn't need guest modifications, which is a great plus. So older
>>>>>> guests (and Windows?) could benefit from mwait as well.
>>>>> There is no need guest modifications -- it could be exposed as standard
>>>>> MWAIT feature to the guest, with responsibilities for guest/host-impact
>>>>> on the user.
>>>>>
>>>>> I think that the page-fault based MWAIT would require paravirt if it
>>>>> should be enabled by default, because of performance concerns:
>>>>> Enabling write protection on a page needs a VM exit on all other VCPUs
>>>>> when beginning monitoring (to reload page permissions and prevent missed
>>>>> writes).
>>>>> We'd want to keep trapping writes to the page all the time because
>>>>> toggling is slow, but this could regress performance for an OS that has
>>>>> other data accessed by other VCPUs in that page.
>>>>> No current interface can tell the guest that it should reserve the whole
>>>>> page instead of what CPUID[5] says and that writes to the monitored page
>>>>> are not "cheap", but can trigger a VM exit ...
>>>> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>>>> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>>>> when running Mac OS X guests. Per Intel's SDM volume 3, section
>>>> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>>>> pad the data structure used to monitor writes. Software must make sure
>>>> that beyond the data structure, no unrelated data variable exists in
>>>> the triggering area for MWAIT. A pad may be needed to avoid this
>>>> situation." Unfortunately, most operating systems do not follow this
>>>> advice.
>>> Right, EBX provides what we need to expose that the whole page is
>>> monitored, thanks!
>> So coming back to the original patch, is there anything that should keep us
>> from exposing MWAIT straight into the guest at all times?
> Just minor issues:
> * OS X on Core 2 fails for unknown reason if we disable the instruction
> trapping, which is an argument against doing it by default

So for that we should try and see if changing the exposed CPUID MWAIT
leaf helps. Currently we return 0/0 which is pretty bogus and might be
the reason OSX fails.

> * idling guests would consume host CPU, which is a significant change
> in behavior and shouldn't be done without userspace's involvement

That's the same as today: idling guests that use MWAIT already end up in
a NOP-emulated loop.

Please bear in mind that I am not advocating exposing the MWAIT CPUID
flag. This is only about the instruction trap.

> I think the best compromise is to add a capability for the MWAIT VM-exit
> controls and let userspace expose MWAIT if it wishes to.
> Will send a patch.


Please see my patch to force enable CPUID bits ;).



Alex


2017-04-04 13:13:22

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

2017-04-04 14:51+0200, Alexander Graf:
> On 04/04/2017 02:39 PM, Radim Krčmář wrote:
>> 2017-04-03 12:04+0200, Alexander Graf:
>> > So coming back to the original patch, is there anything that should keep us
>> > from exposing MWAIT straight into the guest at all times?
>> Just minor issues:
>> * OS X on Core 2 fails for unknown reason if we disable the instruction
>> trapping, which is an argument against doing it by default
>
> So for that we should try and see if changing the exposed CPUID MWAIT leaf
> helps. Currently we return 0/0 which is pretty bogus and might be the reason
> OSX fails.

We have tried to pass the host's CPUID MWAIT leaf and it still failed:
https://www.spinics.net/lists/kvm/msg146686.html

I wouldn't mind breaking that particular combination of OS X and
hardware, but I'm wary of doing so because we don't understand why it
broke, so there could be more ...

>> * idling guests would consume host CPU, which is a significant change
>> in behavior and shouldn't be done without userspace's involvement
>
> That's the same as today, as idling guests with MWAIT would also today end
> up in a NOP emulated loop.
>
> Please bear in mind that I do not advocate to expose the MWAIT CPUID flag.
> This is only for the instruction trap.

Ah, makes sense.

>> I think the best compromise is to add a capability for the MWAIT VM-exit
>> controls and let userspace expose MWAIT if it wishes to.
>> Will send a patch.
>
> Please see my patch to force enable CPUID bits ;).

Nice. MWAIT could also use setting of arbitrary values for its leaf,
but a generic interface for that would probably look clunky on the
command line ...

2017-04-04 13:15:52

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

On 04/04/2017 03:13 PM, Radim Krčmář wrote:
> 2017-04-04 14:51+0200, Alexander Graf:
>> On 04/04/2017 02:39 PM, Radim Krčmář wrote:
>>> 2017-04-03 12:04+0200, Alexander Graf:
>>>> So coming back to the original patch, is there anything that should keep us
>>>> from exposing MWAIT straight into the guest at all times?
>>> Just minor issues:
>>> * OS X on Core 2 fails for unknown reason if we disable the instruction
>>> trapping, which is an argument against doing it by default
>> So for that we should try and see if changing the exposed CPUID MWAIT leaf
>> helps. Currently we return 0/0 which is pretty bogus and might be the reason
>> OSX fails.
> We have tried to pass host's CPUID MWAIT leaf and it still failed:
> https://www.spinics.net/lists/kvm/msg146686.html
>
> I wouldn't mind breaking that particular combination of OS X and
> hardware, but I'm worried to do it because we don't understand why it
> broke, so there could be more ...
>
>>> * idling guests would consume host CPU, which is a significant change
>>> in behavior and shouldn't be done without userspace's involvement
>> That's the same as today, as idling guests with MWAIT would also today end
>> up in a NOP emulated loop.
>>
>> Please bear in mind that I do not advocate to expose the MWAIT CPUID flag.
>> This is only for the instruction trap.
> Ah, makes sense.
>
>>> I think the best compromise is to add a capability for the MWAIT VM-exit
>>> controls and let userspace expose MWAIT if it wishes to.
>>> Will send a patch.
>> Please see my patch to force enable CPUID bits ;).
> Nice. MWAIT could also use setting of arbitrary values for its leaf,
> but a generic interface for that would probably look clunky on the
> command line ...


I think we should have an interface similar to smbios for that
eventually. Something where you can explicitly set arbitrary CPUID leaf
information using leaf-specific syntax. There are more leaves where it
would make sense - cache topology for example.


Alex


2017-04-04 13:44:58

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

[Cc qemu-devel as we've gone off-topic]

2017-04-04 15:15+0200, Alexander Graf:
> On 04/04/2017 03:13 PM, Radim Krčmář wrote:
>> 2017-04-04 14:51+0200, Alexander Graf:
>> > Please see my patch to force enable CPUID bits ;).
>> Nice. MWAIT could also use setting of arbitrary values for its leaf,
>> but a generic interface for that would probably look clunky on the
>> command line ...
>
>
> I think we should have an interface similar to smbios for that eventually.
> Something where you can explicitly set arbitrary CPUID leaf information
> using leaf specific syntax. There are more leafs where it would make sense -
> cache topology for example.

Right, separating cpuid from -cpu makes it bearable, like

-cpuid leaf=%x[,subleaf=%x][,eax=%x][,ebx=%x][,ecx=%x][,edx=%x]

Having multiple interfaces for the same thing would result in some
corner case decisions, though ...
I think QEMU should check that feature flags specified by -cpu are not
cleared by -cpuid.
I'm not sure if setters like "|=" and "&=~" would be beneficial in some
cases.
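
To make the proposal concrete, here is a minimal sketch of parsing such an option string and of the suggested consistency check. It is not QEMU code: no `-cpuid` option exists, and `struct cpuid_override`, `parse_cpuid_opt()` and `override_clears_requested()` are hypothetical names. Values are read as hexadecimal, following the `%x` in the syntax above.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of "leaf=..,subleaf=..,eax=..,ebx=..". */
struct cpuid_override {
    uint32_t leaf, subleaf;
    uint32_t reg[4];     /* eax, ebx, ecx, edx */
    bool     reg_set[4]; /* which registers were given explicitly */
    bool     leaf_set;
};

/* Parse a comma-separated key=hex list; returns false on any error. */
static bool parse_cpuid_opt(const char *opt, struct cpuid_override *o)
{
    static const char *const reg_names[4] = { "eax", "ebx", "ecx", "edx" };
    char buf[128];
    char *tok = buf;

    if (strlen(opt) >= sizeof(buf))
        return false;
    strcpy(buf, opt);
    memset(o, 0, sizeof(*o));

    while (tok) {
        char *next = strchr(tok, ',');
        char *eq, *end;
        unsigned long v;
        int i;

        if (next)
            *next++ = '\0';
        eq = strchr(tok, '=');
        if (!eq || !isxdigit((unsigned char)eq[1]))
            return false;
        *eq = '\0';
        v = strtoul(eq + 1, &end, 16);
        if (*end != '\0' || v > 0xffffffffUL)
            return false;

        if (strcmp(tok, "leaf") == 0) {
            o->leaf = (uint32_t)v;
            o->leaf_set = true;
        } else if (strcmp(tok, "subleaf") == 0) {
            o->subleaf = (uint32_t)v;
        } else {
            for (i = 0; i < 4 && strcmp(tok, reg_names[i]) != 0; i++)
                ;
            if (i == 4)
                return false; /* unknown key */
            o->reg[i] = (uint32_t)v;
            o->reg_set[i] = true;
        }
        tok = next;
    }
    return o->leaf_set; /* leaf is mandatory in the proposed syntax */
}

/* True if the override would clear a bit that -cpu asked for. */
static bool override_clears_requested(uint32_t requested_bits,
                                      uint32_t override_value)
{
    return (requested_bits & ~override_value) != 0;
}
```

Under these assumptions, `leaf=5,eax=40,ebx=1000` would set the MWAIT leaf to the 64/4096 monitor-line sizes discussed earlier, and the check would reject an override whose value drops a feature bit requested through -cpu.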