DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 67EC2624BE
Date: Wed, 29 Mar 2017 14:11:47 +0200
From: Radim Krčmář <rkrcmar@redhat.com>
To: Jim Mattson <jmattson@google.com>
Cc: Alexander Graf <agraf@suse.de>, "Michael S. Tsirkin" <mst@redhat.com>,
        LKML <linux-kernel@vger.kernel.org>,
        "Gabriel L. Somlo" <gsomlo@gmail.com>,
        Paolo Bonzini <pbonzini@redhat.com>, Jonathan Corbet <corbet@lwn.net>,
        Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
        "H. Peter Anvin" <hpa@zytor.com>,
        the arch/x86 maintainers <x86@kernel.org>,
        Joerg Roedel <joro@8bytes.org>, kvm list <kvm@vger.kernel.org>,
        linux-doc@vger.kernel.org
Subject: Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests
Message-ID: <20170329121147.GA5129@potion>
References: <1489612895-12799-1-git-send-email-mst@redhat.com>
 <87f187de-64ef-22a2-7714-a811883bce02@suse.de>
 <20170328142837.GA21738@potion>
 <CALMp9eRs+WMt+8+wgWk7+H8Kd5zU0_U+On0O_G4cp_7xEffrGQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CALMp9eRs+WMt+8+wgWk7+H8Kd5zU0_U+On0O_G4cp_7xEffrGQ@mail.gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3982
Lines: 81

2017-03-28 13:35-0700, Jim Mattson:
> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>> 2017-03-27 15:34+0200, Alexander Graf:
>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
>>>> unless explicitly provided with kernel command line argument
>>>> "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
>>>> without checking CPUID.
>>>>
>>>> We currently emulate that as a NOP but on VMX we can do better: let
>>>> guest stop the CPU until timer, IPI or memory change.  CPU will be busy
>>>> but that isn't any worse than a NOP emulation.
>>>>
>>>> Note that mwait within guests is not the same as on real hardware
>>>> because halt causes an exit while mwait doesn't.  For this reason it
>>>> might not be a good idea to use the regular MWAIT flag in CPUID to
>>>> signal this capability.  Add a flag in the hypervisor leaf instead.
>>>
>>> So imagine we had proper MWAIT emulation capabilities based on page faults.
>>> In that case, we could do something as fancy as
>>>
>>> Treat MWAIT as pass-through by default
>>>
>>> Have a per-vcpu monitor timer 10 times a second in the background that
>>> checks which instruction we're in
>>>
>>> If we're in mwait for the last - say - 1 second, switch to emulated MWAIT,
>>> if $IP was in non-mwait within that time, reset counter.
>>
>> Or we could reuse external interrupts for sampling.  Exits triggered by
>> them would check for current instruction (probably would be best to
>> limit just to timer tick) and a sufficient ratio (> 0?) of other exits
>> would imply that MWAIT is not used.
>>
>>> Or instead maybe just reuse the adaptive hlt logic?
>>
>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic
>> makes sense.  We would just add new wakeup methods.
>>
>>> Either way, with that we should be able to get super low latency IPIs
>>> running while still maintaining some sanity on systems which don't have
>>> dedicated CPUs for workloads.
>>>
>>> And we wouldn't need guest modifications, which is a great plus. So older
>>> guests (and Windows?) could benefit from mwait as well.
>>
>> There is no need for guest modifications -- it could be exposed as the
>> standard MWAIT feature to the guest, with responsibility for the
>> guest/host impact left to the user.
>>
>> I think that the page-fault based MWAIT would require paravirt if it
>> should be enabled by default, because of performance concerns:
>> Enabling write protection on a page needs a VM exit on all other VCPUs
>> when beginning monitoring (to reload page permissions and prevent missed
>> writes).
>> We'd want to keep trapping writes to the page all the time because
>> toggling is slow, but this could regress performance for an OS that has
>> other data accessed by other VCPUs in that page.
>> No current interface can tell the guest that it should reserve the whole
>> page instead of what CPUID[5] says and that writes to the monitored page
>> are not "cheap", but can trigger a VM exit ...
> 
> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC,
> VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX
> when running Mac OS X guests. Per Intel's SDM volume 3, section
> 8.10.5, "To avoid false wake-ups; use the largest monitor line size to
> pad the data structure used to monitor writes. Software must make sure
> that beyond the data structure, no unrelated data variable exists in
> the triggering area for MWAIT. A pad may be needed to avoid this
> situation."

Right, EBX provides what we need to expose that the whole page is
monitored, thanks!
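
To make the padding rule concrete, here is a minimal sketch of how a guest
could size a monitored structure from CPUID.05H.  The helper names are made
up for illustration and are not existing kernel functions; the only facts
assumed are that the line sizes live in bits 15:0 of EAX (smallest) and
EBX (largest).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * CPUID.05H reports the monitor-line sizes in bits 15:0 of EAX (smallest)
 * and EBX (largest); the upper bits are reserved.
 * Hypothetical helper, not an existing kernel function.
 */
static inline uint32_t monitor_line_size(uint32_t reg)
{
	return reg & 0xffff;
}

/*
 * Round data_size up to a whole number of largest monitor lines, so that
 * no unrelated variable shares the triggering area.
 * Hypothetical helper; the caller is assumed to have checked that
 * MONITOR/MWAIT is available and to pass the raw CPUID.05H:EBX value.
 */
static inline size_t monitor_padded_size(size_t data_size, uint32_t cpuid5_ebx)
{
	size_t line = monitor_line_size(cpuid5_ebx);

	assert(line != 0);
	return (data_size + line - 1) / line * line;
}
```

With EBX reporting 4096, an 8-byte wakeup flag would be padded to a full
page.  The structure would also have to be aligned to the same size, which
the sketch leaves to the allocator.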

>             Unfortunately, most operating systems do not follow this
> advice.

Yeah ... KVM could add yet another heuristic that drops MWAIT emulation and
falls back to hardware MWAIT if there were many traps while the target was
not MWAITING.  It's getting over-complicated, though :/
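
For illustration only, a rough sketch of what such a heuristic could look
like.  Nothing here exists in KVM: the structure, the function names and
both thresholds are assumptions picked to show the idea of comparing
"wasted" write traps against useful ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VCPU bookkeeping; none of these names exist in KVM. */
struct mwait_heuristic {
	uint32_t useful_traps;	/* write traps that hit an MWAITing target */
	uint32_t wasted_traps;	/* write traps while the target was not in MWAIT */
	bool emulate;		/* true: trap-based emulation, false: hardware MWAIT */
};

/* Assumed thresholds, chosen arbitrarily for the sketch. */
#define MWAIT_WASTED_MIN	64	/* do not decide before this many wasted traps */
#define MWAIT_WASTED_RATIO	4	/* give up when wasted > 4 * useful */

static inline void mwait_heuristic_init(struct mwait_heuristic *h)
{
	h->useful_traps = 0;
	h->wasted_traps = 0;
	h->emulate = true;
}

/*
 * Account one write trap on the monitored page and return whether
 * trap-based emulation should stay enabled.  Once emulation has been
 * dropped, the function keeps returning false.
 */
static inline bool mwait_note_write_trap(struct mwait_heuristic *h,
					 bool target_in_mwait)
{
	if (!h->emulate)
		return false;

	if (target_in_mwait)
		h->useful_traps++;
	else
		h->wasted_traps++;

	if (h->wasted_traps >= MWAIT_WASTED_MIN &&
	    h->wasted_traps > (uint64_t)h->useful_traps * MWAIT_WASTED_RATIO)
		h->emulate = false;

	return h->emulate;
}
```

The sketch never re-enables emulation and never ages the counters, so a
real version would need some decay on top, which only adds to the
complexity.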