Subject: Re: [PATCH v12 10/22] vfio iommu type1: Add support for mediated devices
From: Kirti Wankhede
To: Alex Williamson
Date: Tue, 15 Nov 2016 20:39:25 +0530
In-Reply-To: <20161114162547.734a2a16@t450s.home>

On 11/15/2016 4:55 AM, Alex Williamson wrote:
> On Mon, 14 Nov 2016 21:12:24 +0530
> Kirti Wankhede wrote:
>
>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>> Mediated device only uses IOMMU APIs, the underlying hardware can be
>> managed by an IOMMU domain.
>>
>> Aim of this change is:
>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>> - To support direct assigned device and mediated device in single module
>>
>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>> backend module. More details:
>> - Domain for external user is tracked separately in vfio_iommu structure.
>>   It is allocated when group for first mdev device is attached.
>> - Pages pinned for external domain are tracked in each vfio_dma structure
>>   for that iova range.
>> - Page tracking rb-tree in vfio_dma keeps <iova, pfn, ref_count>. Key of
>>   rb-tree is iova, but it actually aims to track pfns.
>> - On external pin request for an iova, page is pinned only once, if iova
>>   is already pinned and tracked, ref_count is incremented.
>
> This is referring only to the external (ie. pfn_list) tracking only,
> correct?  In general, a page can be pinned up to twice per iova
> referencing it, once for an iommu mapped domain and again in the
> pfn_list, right?
>

That's right.

...

>>              }
>>
>> -            if (!rsvd && !lock_cap && mm->locked_vm + i + 1 > limit) {
>> -                    put_pfn(pfn, prot);
>> +            iova = vaddr - dma->vaddr + dma->iova;
>
> nit, this could be incremented in the for loop just like vaddr, we
> don't need to create it from scratch (iova += PAGE_SIZE).
>

Ok.

>
>> +            vpfn = vfio_find_vpfn(dma, iova);
>> +            if (!vpfn)
>> +                    lock_acct++;
>> +
>> +            if (!rsvd && !lock_cap &&
>> +                mm->locked_vm + lock_acct + 1 > limit) {
>
> lock_acct was incremented just above, why is this lock_acct + 1?  I
> think we're off by one here (ie. remove the +1)?
>

Moved this vfio_find_vpfn() check after this check.

...
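To be more concrete about the reordering: what I have in mind is roughly the
below (just a sketch, locals named as in this patch; the exact structure may
change in v13). The limit check then runs before lock_acct is bumped for the
current page, so the + 1 stands for the page being pinned right now:

        /* Only account pages that aren't already tracked in the pfn_list */
        if (!rsvd && !vfio_find_vpfn(dma, iova)) {
                if (!lock_cap &&
                    mm->locked_vm + lock_acct + 1 > limit) {
                        put_pfn(pfn, dma->prot);
                        /* warn about exceeding RLIMIT_MEMLOCK and stop, as before */
                        break;
                }
                lock_acct++;
        }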
>> -static long vfio_unpin_pages_remote(struct vfio_dma *dma, unsigned long pfn,
>> -                                   long npage, int prot, bool do_accounting)
>> +static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>> +                                   unsigned long pfn, long npage,
>> +                                   bool do_accounting)
>>  {
>> -    unsigned long unlocked = 0;
>> +    long unlocked = 0, locked = 0;
>>      long i;
>>
>> -    for (i = 0; i < npage; i++)
>> -            unlocked += put_pfn(pfn++, prot);
>> +    for (i = 0; i < npage; i++) {
>> +            struct vfio_pfn *vpfn;
>> +
>> +            unlocked += put_pfn(pfn++, dma->prot);
>> +
>> +            vpfn = vfio_find_vpfn(dma, iova + (i << PAGE_SHIFT));
>> +            if (vpfn)
>> +                    locked++;
>
> This isn't taking into account reserved pages (ex. device mmaps).  In
> the pinning path above we skip accounting of reserved pages, put_pfn
> also only increments for non-reserved pages, but vfio_find_vpfn()
> doesn't care, so it's possible that (locked - unlocked) could result in
> a positive value.  Maybe something like:
>
> if (put_pfn(...)) {
>         unlocked++;
>         if (vfio_find_vpfn(...))
>                 locked++;
> }
>

Updating.

...

>>
>> -static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>> +static int vfio_iommu_type1_pin_pages(void *iommu_data,
>> +                                     unsigned long *user_pfn,
>> +                                     int npage, int prot,
>> +                                     unsigned long *phys_pfn)
>> +{
>> +    struct vfio_iommu *iommu = iommu_data;
>> +    int i, j, ret;
>> +    unsigned long remote_vaddr;
>> +    unsigned long *pfn = phys_pfn;
>
> nit, why do we have this variable rather than using phys_pfn directly?
> Maybe before we had unwind support we incremented this rather than
> indexing it?
>

Updating.

...

>> +            iova = user_pfn[i] << PAGE_SHIFT;
>> +            dma = vfio_find_dma(iommu, iova, 0);
>> +            if (!dma)
>> +                    goto unpin_exit;
>> +            unlocked += vfio_unpin_page_external(dma, iova, do_accounting);
>> +    }
>> +
>> +unpin_exit:
>> +    mutex_unlock(&iommu->lock);
>> +    return unlocked;
>
> This is not returning what it's supposed to.  unlocked is going to be
> less than or equal to the number of pages unpinned.  We don't need to
> track unlocked, I think we're just tracking where we are in the unpin
> array, whether it was partial or complete.  I think we want:
>
> return i > npage ? npage : i;
>
> Or maybe we can make it more obvious if there's an error on the first
> unpin entry:
>
> return i > npage ? npage : (i > 0 ? i : -EINVAL);
>

Thanks for pointing that out. Updating.

...

>> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
>> +{
>> +    struct rb_node *n, *p;
>> +
>> +    n = rb_first(&iommu->dma_list);
>> +    for (; n; n = rb_next(n)) {
>> +            struct vfio_dma *dma;
>> +            long locked = 0, unlocked = 0;
>> +
>> +            dma = rb_entry(n, struct vfio_dma, node);
>> +            unlocked += vfio_unmap_unpin(iommu, dma, false);
>> +            p = rb_first(&dma->pfn_list);
>> +            for (; p; p = rb_next(p))
>> +                    locked++;
>
> We don't know that these weren't reserved pages.  If the vendor driver
> was pinning peer-to-peer ranges vfio_unmap_unpin() might have returned
> 0 yet we're assuming each is locked RAM, so our accounting can go
> upside down.
>

Adding an if (rsvd) check here.
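Something along these lines, reusing the existing is_invalid_reserved_pfn()
helper so that only regular RAM pages get counted (just a sketch of what I'm
planning for v13):

        p = rb_first(&dma->pfn_list);
        for (; p; p = rb_next(p)) {
                struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn,
                                                 node);

                /* skip reserved pfns (e.g. peer-to-peer/device mmaps) */
                if (!is_invalid_reserved_pfn(vpfn->pfn))
                        locked++;
        }

That keeps the re-accounting consistent with the pin and unpin paths, which
also skip reserved pages.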
>> +            vfio_lock_acct(dma->task, locked - unlocked);
>> +    }
>> +}
>> +
>> +static void vfio_external_unpin_all(struct vfio_iommu *iommu,
>> +                                   bool do_accounting)
>> +{
>> +    struct rb_node *n, *p;
>> +
>> +    n = rb_first(&iommu->dma_list);
>> +    for (; n; n = rb_next(n)) {
>> +            struct vfio_dma *dma;
>> +
>> +            dma = rb_entry(n, struct vfio_dma, node);
>> +            while ((p = rb_first(&dma->pfn_list))) {
>> +                    int unlocked;
>> +                    struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn,
>> +                                                     node);
>> +
>> +                    unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
>> +                    if (do_accounting)
>> +                            vfio_lock_acct(dma->task, -unlocked);
>
> nit, we could batch these further, only updating accounting once per
> vfio_dma, or once entirely.
>

Ok.

Thanks,
Kirti