Date: Wed, 26 Oct 2016 09:16:15 -0600
From: Alex Williamson <alex.williamson@redhat.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>,
        "pbonzini@redhat.com" <pbonzini@redhat.com>,
        "kraxel@redhat.com" <kraxel@redhat.com>,
        "cjia@nvidia.com" <cjia@nvidia.com>,
        "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
        "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
        "Song, Jike" <jike.song@intel.com>,
        "bjsdjshi@linux.vnet.ibm.com" <bjsdjshi@linux.vnet.ibm.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
Message-ID: <20161026091615.42f0b385@t450s.home>
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D18DFE3465@SHSMSX101.ccr.corp.intel.com>
References: <1476739332-4911-1-git-send-email-kwankhede@nvidia.com>
        <1476739332-4911-5-git-send-email-kwankhede@nvidia.com>
        <20161019150231.19e32b84@t450s.home>
        <AADFC41AFE54684AB9EE6CBC0274A5D18DFE3465@SHSMSX101.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3630
Lines: 77

On Wed, 26 Oct 2016 07:53:43 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, October 20, 2016 5:03 AM  
> > > @@ -83,6 +92,21 @@ struct vfio_group {
> > >  };
> > >
> > >  /*
> > > + * Guest RAM pinning working set or DMA target
> > > + */
> > > +struct vfio_pfn {
> > > +	struct rb_node		node;
> > > +	unsigned long		vaddr;		/* virtual addr */
> > > +	dma_addr_t		iova;		/* IOVA */
> > > +	unsigned long		pfn;		/* Host pfn */
> > > +	int			prot;
> > > +	atomic_t		ref_count;
> > > +};  
> > 
> > Somehow we're going to need to fit an invalidation callback here too.
> > How would we handle a case where there are multiple mdev devices, from
> > different vendor drivers, that all have the same pfn pinned?  I'm
> > already concerned about the per pfn overhead we're introducing here so
> > clearly we cannot store an invalidation callback per pinned page, per
> > vendor driver.  Perhaps invalidations should be done using a notifier
> > chain per vfio_iommu, the vendor drivers are required to register on
> > that chain (fail pinning with empty notifier list) user unmapping
> > will be broadcast to the notifier chain, the vendor driver will be
> > responsible for deciding if each unmap is relevant to them (potentially
> > it's for a pinning from another driver).
> > 
> > I expect we also need to enforce that vendors perform a synchronous
> > unmap such that after returning from the notifier list call, the
> > vfio_pfn should no longer exist.  If it does we might need to BUG_ON.
> > Also be careful to pay attention to the locking of the notifier vs
> > unpin callbacks to avoid deadlocks.
> >   
> 
> What about just requesting vendor driver to provide a callback in parent 
> device ops?

How does the iommu driver get to the mdev vendor driver callback?  We
can also have pages pinned by multiple vendor drivers, I don't think
we want the additional overhead of a per page list of invalidation
callbacks.
 
> Curious in which scenario the user application (say Qemu here) may 
> unmap memory pages which are still pinned by vendor driver... Is it 
> purely about a corner case which we want to handle elegantly? 

The vfio type1 iommu API provides a MAP and UNMAP interface.  The unmap
call is expected to work regardless of how it might inhibit the device
from working.  This is currently true of iommu protected devices today,
a user can unmap pages which might be DMA targets for the device and
the iommu prevents further access to those pages, possibly at the
expense of device operation.  We cannot support an interface where a
user can unmap a set of pages and map in new pages to replace them when
the vendor driver might be caching stale mappings.

In normal VM operation perhaps this is a corner case, but the API is
not defined only for the normal and expected behavior of a VM.
 
> If yes, possibly a simpler way is to force destroying mdev instead of 
> asking vendor driver to take care of each invalidation request under
> such situation. Since anyway the mdev device won't be in an usable
> state anymore... (sorry if I missed the key problem here.)

That's a pretty harsh response for an operation which is completely
valid from an API perspective.  What if the VM does an unmap of all
memory around reset?  We cannot guarantee that the guest driver will
have a chance to do cleanup, the guest may have crashed or a
system_reset invoked.  Would you have the mdev destroyed in this case?
How could QEMU, which has no device specific driver to know that vendor
pinnings are present, recover from this?  Thanks,

Alex