Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755051AbdCPPfi (ORCPT ); Thu, 16 Mar 2017 11:35:38 -0400 Received: from mx1.redhat.com ([209.132.183.28]:37126 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753070AbdCPPff (ORCPT ); Thu, 16 Mar 2017 11:35:35 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 35258804E4 Authentication-Results: ext-mx03.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com Authentication-Results: ext-mx03.extmail.prod.ext.phx2.redhat.com; spf=pass smtp.mailfrom=rkrcmar@redhat.com DKIM-Filter: OpenDKIM Filter v2.11.0 mx1.redhat.com 35258804E4 Date: Thu, 16 Mar 2017 16:35:18 +0100 From: Radim =?utf-8?B?S3LEjW3DocWZ?= To: "Gabriel L. Somlo" Cc: "Michael S. Tsirkin" , linux-kernel@vger.kernel.org, Paolo Bonzini , Jonathan Corbet , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , x86@kernel.org, Joerg Roedel , kvm@vger.kernel.org, linux-doc@vger.kernel.org Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests Message-ID: <20170316153517.GL14081@potion> References: <1489612895-12799-1-git-send-email-mst@redhat.com> <20170315233534.GG2239@HEDWIG.INI.CMU.EDU> <20170316013903-mutt-send-email-mst@kernel.org> <20170316132426.GB4085@HEDWIG.INI.CMU.EDU> <20170316155550-mutt-send-email-mst@kernel.org> <20170316145819.GC4085@HEDWIG.INI.CMU.EDU> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20170316145819.GC4085@HEDWIG.INI.CMU.EDU> X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.27]); Thu, 16 Mar 2017 15:35:25 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 8199 Lines: 162 2017-03-16 10:58-0400, Gabriel L. Somlo: > On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote: > > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote: > > > After studying your patch a bit more carefully (sorry, it's crazy > > > around here right now :) ) I realized you're simply trying to > > > (selectively) decide when to exit L1 and emulate as NOP vs. when to > > > just allow L1 to execute MONITOR & MWAIT natively. > > > > > > Is that right ? Because if so, the issues I saw on my MacPro1,1 are > > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT > > > natively was one of the options Alex Graf and Rene Rebe used back in > > > the very early days of OS X on QEMU, at the time I got involved with > > > that project. Here's part of an out of tree patch against 3.4 which did > > > just that, and worked as far as I remember on *any* MWAIT capable > > > intel chip I had access to back in 2010: > > > > > > ############################################################################## > > > # 99-mwait.patch.kvm-kmod (Rene Rebe ) 2010-04-27 > > > ############################################################################## > > > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c > > > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400 > > > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400 > > > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid > > > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) | > > > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp | > > > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); > > > /* cpuid 1.ecx */ > > > const u32 kvm_supported_word4_x86_features = > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ | > > > 0 /* DS-CPL, VMX, SMX, EST */ | > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ | > > > 0 /* Reserved, DCA */ | F(XMM4_1) | > > > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) | > > > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c > > > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400 > > > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400 > > > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s > > > set_intercept(svm, INTERCEPT_VMSAVE); > > > set_intercept(svm, INTERCEPT_STGI); > > > set_intercept(svm, INTERCEPT_CLGI); > > > set_intercept(svm, INTERCEPT_SKINIT); > > > set_intercept(svm, INTERCEPT_WBINVD); > > > - set_intercept(svm, INTERCEPT_MONITOR); > > > - set_intercept(svm, INTERCEPT_MWAIT); > > > set_intercept(svm, INTERCEPT_XSETBV); > > > > > > control->iopm_base_pa = iopm_base; > > > control->msrpm_base_pa = __pa(svm->msrpm); > > > control->int_ctl = V_INTR_MASKING_MASK; > > > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c > > > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400 > > > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400 > > > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls > > > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high); > > > nested_vmx_procbased_ctls_low = 0; > > > nested_vmx_procbased_ctls_high &= > > > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING | > > > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING | > > > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING | > > > + CPU_BASED_CR3_LOAD_EXITING | > > > CPU_BASED_CR3_STORE_EXITING | > > > #ifdef CONFIG_X86_64 > > > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING | > > > #endif > > > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING | > > > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru > > > CPU_BASED_CR3_LOAD_EXITING | > > > CPU_BASED_CR3_STORE_EXITING | > > > CPU_BASED_USE_IO_BITMAPS | > > > CPU_BASED_MOV_DR_EXITING | > > > CPU_BASED_USE_TSC_OFFSETING | > > > - CPU_BASED_MWAIT_EXITING | > > > - CPU_BASED_MONITOR_EXITING | > > > CPU_BASED_INVLPG_EXITING | > > > CPU_BASED_RDPMC_EXITING; > > > > > > opt = CPU_BASED_TPR_SHADOW | > > > CPU_BASED_USE_MSR_BITMAPS | > > > > > > If all you're trying to do is (selectively) revert to this behavior, > > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly > > > confused at this point :) > > > > Yes. Me too. Want to try that other patch and see what happens? > > You mean the old 3.4 patch against current KVM ? I'll try to do that, > might take me a while :) Michael's patch already did most of that, you just need to add diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index efde6cc50875..b12f07d4ce17 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -348,7 +348,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, const u32 kvm_cpuid_1_ecx_x86_features = /* NOTE: MONITOR (and MWAIT) are emulated as NOP, * but *not* advertised to guests via CPUID ! */ - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ | 0 /* DS-CPL, VMX, SMX, EST */ | 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ | Note: this will never be upstream, because mwait isn't what we want by default. :) >> > Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP, >> > didn't power down the physical CPU, just immediately moved on to the >> > next instruction. As such, there was no power saving and no >> > opportunity to yield to another L0 thread either, unlike with NOP >> > emulation at L0. >> > >> > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now >> > doing something smarter than just acting as a guest-mode NOP) ? >> > >> > Thanks, >> > --Gabriel >> >> Interesting. What it seems to say is this: >> >> MWAIT. Behavior of the MWAIT instruction (which always causes an invalid- >> opcode exception—#UD—if CPL > 0) is determined by the setting of the “MWAIT >> exiting” VM-execution control: >> — If the “MWAIT exiting” VM-execution control is 1, MWAIT causes a VM exit >> (see Section 22.1.3). >> — If the “MWAIT exiting” VM-execution control is 0, MWAIT operates normally if >> any of the following is true: (1) the “interrupt-window exiting” VM-execution >> control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1. >> — If the “MWAIT exiting” VM-execution control is 0, the “interrupt-window >> exiting” VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT >> does not cause the processor to enter an implementation-dependent >> optimized state; instead, control passes to the instruction following the >> MWAIT instruction. >> >> >> And since interrupt-window exiting is 0 most of the time for KVM, >> I would expect MWAIT to behave normally. > > The intel manual said the same thing back in 2010 as well. However, > regardless of how any flags were set, interrupt-window exiting or not, > "normal" L1 MWAIT behavior was that it woke up immediately regardless. > Remember, never going to sleep is still correct ("normal" ?) behavior > per the ISA definition of MWAIT :) I'll write a simple kvm-unit-test to better understand why it is broken for you ... > Also, when I tested your patch on the macbook air (where it worked), > not only was the host reporting 400% CPU for qemu (which is to be > expected), but the thermal fan/cooling thing also shifted up into high > gear, which means the physical CPU got hot, which it shouldn't have if > the guest-mode MWAIT actually did put the host CPU into low power. I tested MWAIT with basically the same kernel patch and the qemu patch with Linux guest on Haswell and Nehalem. Running the guest took 100% of the host CPUs, but it still had the same temperature as when the host was idle. That reminds me that you to pass '-cpu host' for QEMU reasons.