DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 7561C7EABC
Date: Thu, 17 Aug 2017 01:31:27 +0300
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim =?utf-8?B?S3LEjW3DocWZ?= <rkrcmar@redhat.com>,
        linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
        stable@vger.kernel.org
Subject: Re: [PATCH] kvm: x86: disable KVM_FAST_MMIO_BUS
Message-ID: <20170817011815-mutt-send-email-mst@kernel.org>
References: <20170816112249.28939-1-pbonzini@redhat.com>
 <20170816155132-mutt-send-email-mst@kernel.org>
 <9de5ebf5-457d-2a34-0314-c6c612ddb2e9@redhat.com>
 <20170816161301-mutt-send-email-mst@kernel.org>
 <fb384aa5-6b8a-4d70-98bb-2ce2adbcaab4@redhat.com>
 <20170816194342-mutt-send-email-mst@kernel.org>
 <81dabc78-edfd-32fc-024c-c57330386a51@redhat.com>
 <20170816190316.GA2566@flask>
 <20170816224815-mutt-send-email-mst@kernel.org>
 <ac5c6c7e-1738-b1c6-96eb-6bdf198b67d2@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <ac5c6c7e-1738-b1c6-96eb-6bdf198b67d2@redhat.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4317
Lines: 92

On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote:
> On 16/08/2017 21:59, Michael S. Tsirkin wrote:
> > On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote:
> >> 2017-08-16 19:19+0200, Paolo Bonzini:
> >>> On 16/08/2017 18:50, Michael S. Tsirkin wrote:
> >>>> On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote:
> >>>>> While you can filter out instruction fetches, that's not enough.  A data
> >>>>> read could happen because someone pointed the IDT to MMIO area, and who
> >>>>> knows what the VM-exit instruction length points to in that case.
> >>>>
> >>>> Thinking more about it, I don't really see how anything
> >>>> legal guest might be doing with virtio would trigger anything
> >>>> but a fault after decoding the instruction. How does
> >>>> skipping instruction even make sense in the example you give?
> >>>
> >>> There's no such thing as a legal guest.  Anything that the hypervisor
> >>> does, that differs from real hardware, is a possible escalation path.
> >>>
> >>> This in fact makes me doubt the EMULTYPE_SKIP patch too.
> >>
> >> The main hack is that we expect EPT misconfig within a given range to be
> >> a MMIO NULL write.  I think it is fine -- EMULTYPE_SKIP is a common path
> >> that should have well tested error paths and, IIUC, virtio doesn't allow
> >> any other access, so it is a problem of the guest if a buggy/malicious
> >> application can access virtio memory.
> 
> Yes, I agree.  EMULTYPE_SKIP is fine because failed decoding still
> causes an exception to be injected.  Maybe it's better to gate the
> EMULTYPE_SKIP emulation on the exit qualification saying this is a write

I thought it's already limited to writes. I agree that's a reasonable
limitation in any case.

> and also not a page table walk---just in case.

I still don't get it, sorry. Let's assume for the sake of argument
that it's a PT walk causing the MMIO access. Just why do you think
that it makes sense to skip the instruction that caused the walk?

> >>>> how about we blacklist nested virt for this optimization?
> >>
> >> Not every hypervisor can be easily detected ...
> > 
> > Hypervisors that don't set a hypervisor bit in CPUID are violating the
> > spec themselves, aren't they?  Anyway, we can add a management option
> > for use in a nested scenario.
> 
> No, the hypervisor bit only says that CPUID leaf 0x40000000 is defined.
> See for example
> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458:
> "Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF
> for software use. Hypervisors can use these leaves to provide an
> interface to pass information from the hypervisor to the guest operating
> system running inside a virtual machine. The hypervisor bit indicates
> the presence of a hypervisor and that it is safe to test these
> additional software leaves".

Looks like it's not a bug then. Still, most hypervisors do have this
leaf so it's a reasonable way that will catch most issues.  We can
always blacklist more as they are found. Additionally let's go ahead
and add ability for userspace to disable fast MMIO for these
hypervisors we failed to detect.

> >> KVM uses standard features and SDM clearly says that the
> >> instruction length field is undefined.
> > 
> > True. Let's see whether intel can commit to a stronger definition.
> > I don't think there's any rush to make this change.
> 
> I disagree.  Relying on undefined processor features is a bad idea.

Maybe it was a bad idea 3 years ago, yes. In 2012 I posted "kvm_para:
add mmio word store hypercall" as an alternative.  Was nacked as MMIO
was seen as safer and better. By now many people rely on mmio being
fast.  Let's talk to hardware guys to define the feature before we give
up and spend years designing an alternative.

> > It's just that this has been there for 3 years and people have built a
> > product around this.
> 
> Around 700 clock cycles?
> 
> Paolo

About 30% the cost of exit, isn't it?  There are definitely workloads
where cost of exit gates performance. We didn't work on fast mmio based
on theoretical assumptions. But maybe I am wrong. We'll see. Jason here
volunteered to test your patch and we'll see what comes out of it. If
I'm wrong and it's about 1%, I won't split hairs.

-- 
MST