Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests
To: Radim Krčmář <rkrcmar@redhat.com>,
        Jim Mattson <jmattson@google.com>
References: <1489612895-12799-1-git-send-email-mst@redhat.com>
 <87f187de-64ef-22a2-7714-a811883bce02@suse.de>
 <20170328142837.GA21738@potion>
 <CALMp9eRs+WMt+8+wgWk7+H8Kd5zU0_U+On0O_G4cp_7xEffrGQ@mail.gmail.com>
 <20170329121147.GA5129@potion>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
        LKML <linux-kernel@vger.kernel.org>,
        "Gabriel L. Somlo" <gsomlo@gmail.com>,
        Paolo Bonzini <pbonzini@redhat.com>, Jonathan Corbet <corbet@lwn.net>,
        Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
        "H. Peter Anvin" <hpa@zytor.com>,
        the arch/x86 maintainers <x86@kernel.org>,
        Joerg Roedel <joro@8bytes.org>, kvm list <kvm@vger.kernel.org>,
        linux-doc@vger.kernel.org
From: Alexander Graf <agraf@suse.de>
Message-ID: <f6607513-4cbd-3fa0-1663-5477e855e783@suse.de>
Date: Mon, 3 Apr 2017 12:04:34 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.6.0
MIME-Version: 1.0
In-Reply-To: <20170329121147.GA5129@potion>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3956
Lines: 75

On 03/29/2017 02:11 PM, Radim Krčmář wrote:
> 2017-03-28 13:35-0700, Jim Mattson:
>> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>>> 2017-03-27 15:34+0200, Alexander Graf:
>>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>>> Guests running Mac OS X 10.5, 10.6, and 10.7 (Leopard through Lion) have a problem:
>>>>> unless explicitly provided with kernel command line argument
>>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>>> without checking CPUID.
>>>>>
>>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>>> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>>>>> but that isn't any worse than a NOP emulation.
>>>>>
>>>>> Note that mwait within guests is not the same as on real hardware
>>>>> because halt causes an exit while mwait doesn't.  For this reason it
>>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>>> signal this capability.  Add a flag in the hypervisor leaf instead.
>>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>>> In that case, we could do something as fancy as
>>>>
>>>> Treat MWAIT as pass-through by default
>>>>
>>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>>> checks which instruction we're in
>>>>
>>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>>> if $IP was in non-mwait within that time, reset counter.
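
To make the sampling idea above a bit more concrete, here is a rough sketch
in plain C, not kernel code. All names (mwait_sampler, mwait_sampler_tick,
the in_mwait flag) are made up for illustration, and how the timer would
learn that the vcpu is currently sitting in MWAIT is left open. Dropping
back to pass-through as soon as a sample lands outside MWAIT is an
assumption; the text above only says to reset the counter.

```c
#include <stdbool.h>

/* Proposal: sample each vcpu 10 times a second. */
#define SAMPLES_PER_SECOND 10
/* Roughly one second of back-to-back MWAIT samples before switching. */
#define IDLE_THRESHOLD     SAMPLES_PER_SECOND

enum mwait_mode { MWAIT_PASSTHROUGH, MWAIT_EMULATED };

/* Hypothetical per-vcpu state, not an existing KVM structure. */
struct mwait_sampler {
	unsigned int consecutive;	/* consecutive samples seen in MWAIT */
	enum mwait_mode mode;
};

static void mwait_sampler_init(struct mwait_sampler *s)
{
	s->consecutive = 0;
	s->mode = MWAIT_PASSTHROUGH;	/* pass-through by default */
}

/* Called from the per-vcpu timer; in_mwait says where $IP was sampled. */
static enum mwait_mode mwait_sampler_tick(struct mwait_sampler *s, bool in_mwait)
{
	if (!in_mwait) {
		/* Guest did real work: reset and (assumption) go back. */
		s->consecutive = 0;
		s->mode = MWAIT_PASSTHROUGH;
		return s->mode;
	}

	if (s->consecutive < IDLE_THRESHOLD)
		s->consecutive++;
	if (s->consecutive >= IDLE_THRESHOLD)
		s->mode = MWAIT_EMULATED;

	return s->mode;
}
```

With these constants, ten consecutive MWAIT samples flip the vcpu to
emulated MWAIT, and a single non-MWAIT sample flips it back.
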
>>> Or we could reuse external interrupts for sampling.  Exits triggered by
>>> them would check for current instruction (probably would be best to
>>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>>> would imply that MWAIT is not used.
>>>
>>>> Or instead maybe just reuse the adaptive hlt logic?
>>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>>> makes sense.  We would just add new wakeup methods.
>>>
>>>> Either way, with that we should be able to get super low latency IPIs
>>>> running while still maintaining some sanity on systems which don't have
>>>> dedicated CPUs for workloads.
>>>>
>>>> And we wouldn't need guest modifications, which is a great plus. So older
>>>> guests (and Windows?) could benefit from mwait as well.
>>> There is no need for guest modifications -- it could be exposed as standard
>>> MWAIT feature to the guest, with responsibilities for guest/host-impact
>>> on the user.
>>>
>>> I think that the page-fault based MWAIT would require paravirt if it
>>> should be enabled by default, because of performance concerns:
>>> Enabling write protection on a page needs a VM exit on all other VCPUs
>>> when beginning monitoring (to reload page permissions and prevent missed
>>> writes).
>>> We'd want to keep trapping writes to the page all the time because
>>> toggling is slow, but this could regress performance for an OS that has
>>> other data accessed by other VCPUs in that page.
>>> No current interface can tell the guest that it should reserve the whole
>>> page instead of what CPUID[5] says and that writes to the monitored page
>>> are not "cheap", but can trigger a VM exit ...
>> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
>> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
>> when running Mac OS X guests. Per Intel's SDM volume 3, section
>> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
>> pad the data structure used to monitor writes. Software must make sure
>> that beyond the data structure, no unrelated data variable exists in
>> the triggering area for MWAIT. A pad may be needed to avoid this
>> situation." Unfortunately, most operating systems do not follow this
>> advice.
> Right, EBX provides what we need to expose that the whole page is
> monitored, thanks!
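
For reference, a small sketch of how a guest could consume those values.
CPUID.05H reports the smallest monitor-line size in EAX[15:0] and the
largest in EBX[15:0], both in bytes. The helper names are hypothetical, and
the register values are passed in rather than read, so the snippet is not
x86-only (with GCC or Clang, __get_cpuid() from <cpuid.h> can fetch them).
It also assumes monitor lines are aligned to their own size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Monitor-line size in bytes from a raw CPUID.05H EAX or EBX value. */
static size_t monitor_line_size(uint32_t reg)
{
	return reg & 0xffff;	/* bits 31:16 are reserved */
}

/* Round a structure size up so it fills whole monitor lines. */
static size_t pad_to_monitor_line(size_t size, size_t largest_line)
{
	if (largest_line == 0)
		return size;	/* no size reported, nothing to pad to */
	return (size + largest_line - 1) / largest_line * largest_line;
}

/*
 * True if a write to 'other' lands in the same monitor line as 'monitored',
 * i.e. it could cause a false wake-up (or, with page-fault based emulation,
 * a VM exit).
 */
static bool shares_monitor_line(uintptr_t monitored, uintptr_t other,
				size_t largest_line)
{
	if (largest_line == 0)
		return false;
	return monitored / largest_line == other / largest_line;
}
```

With EBX = 4096 as in the VMware Fusion example, a 24-byte wakeup
structure would be padded to a full page, and any unrelated variable in
that page would count as sharing the monitored line.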

So coming back to the original patch, is there anything that should keep 
us from exposing MWAIT straight into the guest at all times?


Alex