Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests
To: Radim Krčmář
Cc: Jim Mattson, "Michael S. Tsirkin", LKML, "Gabriel L. Somlo",
    Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
    "H. Peter Anvin", the arch/x86 maintainers, Joerg Roedel, kvm list,
    linux-doc@vger.kernel.org
From: Alexander Graf
Date: Tue, 4 Apr 2017 14:51:06 +0200
References: <1489612895-12799-1-git-send-email-mst@redhat.com>
    <87f187de-64ef-22a2-7714-a811883bce02@suse.de>
    <20170328142837.GA21738@potion>
    <20170329121147.GA5129@potion>
    <20170404123915.GA9525@potion>
In-Reply-To: <20170404123915.GA9525@potion>
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/04/2017 02:39 PM, Radim Krčmář wrote:
> 2017-04-03 12:04+0200, Alexander Graf:
>> On 03/29/2017 02:11 PM, Radim Krčmář wrote:
>>> 2017-03-28 13:35-0700, Jim Mattson:
>>>> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář wrote:
>>>>> 2017-03-27 15:34+0200, Alexander Graf:
>>>>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>>>>>> unless explicitly provided with kernel command line argument
>>>>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>>>>> without checking CPUID.
>>>>>>>
>>>>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>>>>> guest stop the CPU until timer, IPI or memory change. CPU will be busy
>>>>>>> but that isn't any worse than a NOP emulation.
>>>>>>>
>>>>>>> Note that mwait within guests is not the same as on real hardware
>>>>>>> because halt causes an exit while mwait doesn't. For this reason it
>>>>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>>>>> signal this capability. Add a flag in the hypervisor leaf instead.
>>>>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>>>>> In that case, we could do something as fancy as
>>>>>>
>>>>>> Treat MWAIT as pass-through by default
>>>>>>
>>>>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>>>>> checks which instruction we're in
>>>>>>
>>>>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>>>>> if $IP was in non-mwait within that time, reset counter.
>>>>> Or we could reuse external interrupts for sampling. Exits triggered by
>>>>> them would check for current instruction (probably would be best to
>>>>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>>>>> would imply that MWAIT is not used.
>>>>>
>>>>>> Or instead maybe just reuse the adapter hlt logic?
>>>>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>>>>> makes sense. We would just add new wakeup methods.
>>>>>
>>>>>> Either way, with that we should be able to get super low latency IPIs
>>>>>> running while still maintaining some sanity on systems which don't have
>>>>>> dedicated CPUs for workloads.
>>>>>>
>>>>>> And we wouldn't need guest modifications, which is a great plus. So older
>>>>>> guests (and Windows?) could benefit from mwait as well.
>>>>> There is no need for guest modifications -- it could be exposed as standard
>>>>> MWAIT feature to the guest, with responsibilities for guest/host-impact
>>>>> on the user.
>>>>>
>>>>> I think that the page-fault based MWAIT would require paravirt if it
>>>>> should be enabled by default, because of performance concerns:
>>>>> Enabling write protection on a page needs a VM exit on all other VCPUs
>>>>> when beginning monitoring (to reload page permissions and prevent missed
>>>>> writes).
>>>>> We'd want to keep trapping writes to the page all the time because
>>>>> toggling is slow, but this could regress performance for an OS that has
>>>>> other data accessed by other VCPUs in that page.
>>>>> No current interface can tell the guest that it should reserve the whole
>>>>> page instead of what CPUID[5] says and that writes to the monitored page
>>>>> are not "cheap", but can trigger a VM exit ...
>>>> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>>>> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>>>> when running Mac OS X guests. Per Intel's SDM volume 3, section
>>>> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>>>> pad the data structure used to monitor writes. Software must make sure
>>>> that beyond the data structure, no unrelated data variable exists in
>>>> the triggering area for MWAIT. A pad may be needed to avoid this
>>>> situation." Unfortunately, most operating systems do not follow this
>>>> advice.
>>> Right, EBX provides what we need to expose that the whole page is
>>> monitored, thanks!
>> So coming back to the original patch, is there anything that should keep us
>> from exposing MWAIT straight into the guest at all times?
> Just minor issues:
> * OS X on Core 2 fails for unknown reason if we disable the instruction
>   trapping, which is an argument against doing it by default

So for that we should try and see if changing the exposed CPUID MWAIT leaf
helps. Currently we return 0/0 which is pretty bogus and might be the reason
OSX fails.

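
To make the CPUID.05H discussion concrete, here is a minimal C sketch of how
a guest could decode the leaf and size its padding. It assumes the layout
from the Intel SDM (smallest monitor-line size in EAX[15:0], largest in
EBX[15:0], both in bytes). The struct and function names are hypothetical
and are not taken from KVM or any guest kernel, and the 64/4096 values in
the usage note are the ones recalled for VMware Fusion earlier in the
thread, not something measured here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two CPUID.05H fields discussed above. */
struct monitor_leaf {
    uint32_t smallest_line; /* CPUID.05H:EAX[15:0], in bytes */
    uint32_t largest_line;  /* CPUID.05H:EBX[15:0], in bytes */
};

/* Decode raw EAX/EBX register values as returned for CPUID leaf 5. */
struct monitor_leaf decode_monitor_leaf(uint32_t eax, uint32_t ebx)
{
    struct monitor_leaf leaf = {
        .smallest_line = eax & 0xffff,
        .largest_line  = ebx & 0xffff,
    };
    return leaf;
}

/* A 0/0 leaf gives the guest nothing to size its padding with. */
int monitor_leaf_is_usable(struct monitor_leaf leaf)
{
    return leaf.smallest_line != 0 &&
           leaf.largest_line >= leaf.smallest_line;
}

/*
 * Round a structure size up to a multiple of the largest monitor line,
 * so that no unrelated data shares the triggering area with it.
 */
size_t padded_size(size_t size, struct monitor_leaf leaf)
{
    assert(monitor_leaf_is_usable(leaf));
    size_t line = leaf.largest_line;
    return (size + line - 1) / line * line;
}
```

With the 64/4096 values, decode_monitor_leaf(64, 4096) yields a largest line
of 4096 bytes, so padded_size() rounds an 8-byte wakeup flag up to a full
page. A 0/0 leaf fails monitor_leaf_is_usable(), which illustrates why a
guest might not cope with it.
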
> * idling guests would consume host CPU, which is a significant change
>   in behavior and shouldn't be done without userspace's involvement

That's the same as today: idling guests that use MWAIT already end up in a
NOP-emulated loop.

Please bear in mind that I do not advocate exposing the MWAIT CPUID flag.
This is only for the instruction trap.

> I think the best compromise is to add a capability for the MWAIT VM-exit
> controls and let userspace expose MWAIT if it wishes to.
> Will send a patch.

Please see my patch to force enable CPUID bits ;).

Alex