Date: Mon, 18 Apr 2016 19:27:42 +0300
From: "Michael S. Tsirkin" <mst@redhat.com>
To: David Woodhouse <dwmw2@infradead.org>
Cc: qemu-devel@nongnu.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
        peterx@redhat.com, cornelia.huck@de.ibm.com,
        Stefan Hajnoczi <stefanha@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
        Amit Shah <amit.shah@redhat.com>, qemu-block@nongnu.org,
        Jason Wang <jasowang@redhat.com>,
        Alex Williamson <alex.williamson@redhat.com>,
        Andy Lutomirski <luto@kernel.org>,
        Christian Borntraeger <borntraeger@de.ibm.com>,
        Wei Liu <wei.liu2@citrix.com>,
        virtualization@lists.linux-foundation.org, kvm@vger.kernel.org
Subject: Re: [PATCH RFC] fixup! virtio: convert to use DMA api
Message-ID: <20160418190203-mutt-send-email-mst@redhat.com>
References: <1460979793-6621-1-git-send-email-mst@redhat.com>
 <1460980717.12793.43.camel@infradead.org>
 <20160418160731-mutt-send-email-mst@redhat.com>
 <1460988232.22654.7.camel@infradead.org>
 <20160418170534-mutt-send-email-mst@redhat.com>
 <1460992923.3765.8.camel@infradead.org>
 <20160418182320-mutt-send-email-mst@redhat.com>
 <1460994701.3765.23.camel@infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1460994701.3765.23.camel@infradead.org>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4091
Lines: 94

On Mon, Apr 18, 2016 at 11:51:41AM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 18:30 +0300, Michael S. Tsirkin wrote:
> > 
> > > Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
> > > its own operating system's IOMMU code is expected to be broken, and
> > > that the virtio driver should eschew the DMA API?
> > 
> > No - it tells guest that e.g. the ACPI tables (or whatever the
> > equivalent is) do not match reality with respect to this device
> > since IOMMU is ignored by hypervisor.
> > Hypervisor has no idea what does guest IOMMU code do - hopefully
> > it is not actually broken.
> 
> OK, that makes sense — thanks.
> 
> So where the platform *does* have a way to coherently tell the guest
> that some devices are behind and IOMMU and some aren't, we should never
> see VIRTIO_F_IOMMU_PASSTHROUGH && !VIRTIO_F_IOMMU_PLATFORM. (Except
> perhaps temporarily on x86 until we *do* fix the DMAR tables to tell
> the truth; qv.)
> 
> This should *only* be a crutch for platforms which cannot properly
> convey that information from the hypervisor to the guest. It should be
> clearly documented "thou shalt not use this unless you've first
> attempted to fix the broken platform to get it right for itself".
> 
> And if we look at it as such... does it make more sense for this to be
> a more *generic* qemu←→guest interface? That way the software hacks can
> live in the OS IOMMU code where they belong, and prevent assignment to
> nested guests for example. And can cover cases like assigned PCI
> devices in existing qemu/x86 which need the same treatment.
>
> Put another way: if we're going to add code to the guest OS to look at
> this information, why can't we add that code in the guest's IOMMU
> support instead, to look at an out-of-band qemu-specific "ignore IOMMU
> for these devices" list instead?

I balk at adding more hacks to a broken system. My goals are
merely to
- make things work correctly with an IOMMU and new guests,
  so people can use userspace drivers with virtio devices
- prevent security risks when guest kernel mistakenly thinks
  it's protected by an IOMMU, but in fact isn't
- avoid breaking any working configurations

Looking at guest code, it looks like virtio was always
bypassing the IOMMU even if configured, but no other
guest driver did.

This makes me think the problem where guest drivers
ignore the IOMMU is virtio specific
and so a virtio specific solution seems cleaner.

The problem for assigned devices is IMHO different: they bypass
the guest IOMMU too but no guest driver knows about this,
so guests do not work. Seems cleaner to fix QEMU to make
existing guests work.


> > The status quo is that that the IOMMU might well be bypassed
> > and then you need to program physical addresses into the device,
> > but maybe not. If DMA API does not give you physical addresses, you
> > need to bypass it, but hypervisor does not know or care.
> 
> Right. The status quo is that qemu doesn't provide correct information
> about IOMMU topology to guests, and they have to have heuristics to
> work out whether to eschew the IOMMU for a given device or not. This is
> true for virtio and assigned PCI devices alike.

True but I think we should fix QEMU to shadow IOMMU
page tables for assigned devices. This seems rather
possible with VT-D, and there are patches already on list.

It looks like this will fix all legacy guests which is
much nicer than what you suggest which will only help new guests.

> Furthermore, some platforms don't *have* a standard way for qemu to
> 'tell the truth' to the guests, and that's where the real fun comes in.
> But still, I'd like to see a generic solution for that lack instead of
> a virtio-specific hack.

But the issue is not just these holes.  E.g. with VT-D it is only easy
to emulate because there's a "caching mode" hook. It is fundamentally
paravirtualization.  So a completely generic solution would be a
paravirtualized IOMMU interface, replacing VT-D for VMs. It might be
justified if many platforms have hard to emulate interfaces.


> -- 
> dwmw2
> 
>