Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests
To: Radim Krčmář, Jim Mattson
References: <1489612895-12799-1-git-send-email-mst@redhat.com> <87f187de-64ef-22a2-7714-a811883bce02@suse.de> <20170328142837.GA21738@potion> <20170329121147.GA5129@potion>
Cc: "Michael S. Tsirkin", LKML, "Gabriel L. Somlo", Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", the arch/x86 maintainers, Joerg Roedel, kvm list, linux-doc@vger.kernel.org
From: Alexander Graf
Date: Mon, 3 Apr 2017 12:04:34 +0200
In-Reply-To: <20170329121147.GA5129@potion>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/29/2017 02:11 PM, Radim Krčmář wrote:
> 2017-03-28 13:35-0700, Jim Mattson:
>> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář wrote:
>>> 2017-03-27 15:34+0200, Alexander Graf:
>>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>>>> unless explicitly provided with kernel command line argument
>>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>>> without checking CPUID.
>>>>>
>>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>>> guest stop the CPU until timer, IPI or memory change. CPU will be busy
>>>>> but that isn't any worse than a NOP emulation.
>>>>>
>>>>> Note that mwait within guests is not the same as on real hardware
>>>>> because halt causes an exit while mwait doesn't. For this reason it
>>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>>> signal this capability. Add a flag in the hypervisor leaf instead.
>>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>>> In that case, we could do something as fancy as
>>>>
>>>> Treat MWAIT as pass-through by default
>>>>
>>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>>> checks which instruction we're in
>>>>
>>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>>> if $IP was in non-mwait within that time, reset counter.
>>> Or we could reuse external interrupts for sampling. Exits triggered by
>>> them would check for current instruction (probably would be best to
>>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>>> would imply that MWAIT is not used.
>>>
>>>> Or instead maybe just reuse the adapter hlt logic?
>>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>>> makes sense. We would just add new wakeup methods.
>>>
>>>> Either way, with that we should be able to get super low latency IPIs
>>>> running while still maintaining some sanity on systems which don't have
>>>> dedicated CPUs for workloads.
>>>>
>>>> And we wouldn't need guest modifications, which is a great plus. So older
>>>> guests (and Windows?) could benefit from mwait as well.
>>> There is no need for guest modifications -- it could be exposed as standard
>>> MWAIT feature to the guest, with responsibilities for guest/host-impact
>>> on the user.
>>>
>>> I think that the page-fault based MWAIT would require paravirt if it
>>> should be enabled by default, because of performance concerns:
>>> Enabling write protection on a page needs a VM exit on all other VCPUs
>>> when beginning monitoring (to reload page permissions and prevent missed
>>> writes).
>>> We'd want to keep trapping writes to the page all the time because
>>> toggling is slow, but this could regress performance for an OS that has
>>> other data accessed by other VCPUs in that page.
>>> No current interface can tell the guest that it should reserve the whole
>>> page instead of what CPUID[5] says and that writes to the monitored page
>>> are not "cheap", but can trigger a VM exit ...
>> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>> when running Mac OS X guests. Per Intel's SDM volume 3, section
>> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>> pad the data structure used to monitor writes. Software must make sure
>> that beyond the data structure, no unrelated data variable exists in
>> the triggering area for MWAIT. A pad may be needed to avoid this
>> situation." Unfortunately, most operating systems do not follow this
>> advice.
> Right, EBX provides what we need to expose that the whole page is
> monitored, thanks!

So coming back to the original patch, is there anything that should keep
us from exposing MWAIT straight into the guest at all times?

Alex
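
For reference, the padding rule quoted from the SDM above can be sketched in a few lines of C. This is only an illustration and is not taken from the patch or from any guest kernel: the struct and both helper names are hypothetical, the line sizes are passed in rather than read with CPUID, and the monitored range is assumed to be naturally aligned to the largest monitor line size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the two CPUID.05H values discussed in the thread. */
struct monitor_line_info {
    uint32_t smallest; /* CPUID.05H:EAX, smallest monitor line size in bytes */
    uint32_t largest;  /* CPUID.05H:EBX, largest monitor line size in bytes */
};

/*
 * Round a structure size up to a whole number of largest monitor lines,
 * so that no unrelated data lands in the triggering area.
 */
static size_t padded_size(size_t size, const struct monitor_line_info *info)
{
    assert(info->largest != 0);
    return (size + info->largest - 1) / info->largest * info->largest;
}

/*
 * Return non-zero if a write to `other` falls into the same largest monitor
 * line as `monitored`, i.e. it could cause a false wake-up. Assumes the
 * triggering area is aligned to the largest line size.
 */
static int shares_monitor_line(uintptr_t monitored, uintptr_t other,
                               const struct monitor_line_info *info)
{
    assert(info->largest != 0);
    return monitored / info->largest == other / info->largest;
}
```

With the values Jim reports for VMware Fusion (64 in EAX, 4096 in EBX), an 8-byte flag would be padded to a full 4096-byte page, which is the "reserve the whole page" hint Radim was asking for.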