Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933872AbcKHV2S (ORCPT ); Tue, 8 Nov 2016 16:28:18 -0500 Received: from mx1.redhat.com ([209.132.183.28]:60946 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933791AbcKHV2N (ORCPT ); Tue, 8 Nov 2016 16:28:13 -0500 Date: Tue, 8 Nov 2016 14:28:10 -0700 From: Alex Williamson To: Kirti Wankhede Cc: , , , , , , , , Subject: Re: [PATCH v11 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP Message-ID: <20161108142810.44c97dd2@t450s.home> In-Reply-To: <675f0d46-aa2b-7404-701e-d9ce17b64549@nvidia.com> References: <1478293856-8191-1-git-send-email-kwankhede@nvidia.com> <1478293856-8191-12-git-send-email-kwankhede@nvidia.com> <20161107164544.57ab1a92@t450s.home> <20161108104619.6f76b918@t450s.home> <675f0d46-aa2b-7404-701e-d9ce17b64549@nvidia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.38]); Tue, 08 Nov 2016 21:28:11 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6397 Lines: 143 On Wed, 9 Nov 2016 01:29:19 +0530 Kirti Wankhede wrote: > On 11/8/2016 11:16 PM, Alex Williamson wrote: > > On Tue, 8 Nov 2016 21:56:29 +0530 > > Kirti Wankhede wrote: > > > >> On 11/8/2016 5:15 AM, Alex Williamson wrote: > >>> On Sat, 5 Nov 2016 02:40:45 +0530 > >>> Kirti Wankhede wrote: > >>> > >> ... > >>>> > >>>> +int vfio_register_notifier(struct device *dev, struct notifier_block *nb) > >>> > >>> Is the expectation here that this is a generic notifier for all > >>> vfio->mdev signaling? That should probably be made clear in the mdev > >>> API to avoid vendor drivers assuming their notifier callback only > >>> occurs for unmaps, even if that's currently the case. > >>> > >> > >> Ok. Adding comment about notifier callback in mdev_device which is part > >> of next patch. > >> > >> ... > >> > >>>> mutex_lock(&iommu->lock); > >>>> > >>>> - if (!iommu->external_domain) { > >>>> + /* Fail if notifier list is empty */ > >>>> + if ((!iommu->external_domain) || (!iommu->notifier.head)) { > >>>> ret = -EINVAL; > >>>> goto pin_done; > >>>> } > >>>> @@ -867,6 +870,11 @@ unlock: > >>>> /* Report how much was unmapped */ > >>>> unmap->size = unmapped; > >>>> > >>>> + if (unmapped && iommu->external_domain) > >>>> + blocking_notifier_call_chain(&iommu->notifier, > >>>> + VFIO_IOMMU_NOTIFY_DMA_UNMAP, > >>>> + unmap); > >>> > >>> This is after the fact, there's already a gap here where pages are > >>> unpinned and the mdev device is still running. > >> > >> Oh, there is a bug here, now unpin_pages() take user_pfn as argument and > >> find vfio_dma. If its not found, it doesn't unpin pages. We have to call > >> this notifier before vfio_remove_dma(). But if we call this before > >> vfio_remove_dma() there will be deadlock since iommu->lock is already > >> held here and vfio_iommu_type1_unpin_pages() will also try to hold > >> iommu->lock. > >> If we want to call blocking_notifier_call_chain() before > >> vfio_remove_dma(), sequence should be: > >> > >> unmapped += dma->size; > >> mutex_unlock(&iommu->lock); > >> if (iommu->external_domain)) { > >> struct vfio_iommu_type1_dma_unmap nb_unmap; > >> > >> nb_unmap.iova = dma->iova; > >> nb_unmap.size = dma->size; > >> blocking_notifier_call_chain(&iommu->notifier, > >> VFIO_IOMMU_NOTIFY_DMA_UNMAP, > >> &nb_unmap); > >> } > >> mutex_lock(&iommu->lock); > >> vfio_remove_dma(iommu, dma); > > > > It seems like it would be worthwhile to have the rb-tree rooted in the > > vfio-dma, then we only need to call the notifier if there are pages > > pinned within that vfio-dma (ie. the rb-tree is not empty). We can > > then release the lock call the notifier, re-acquire the lock, and > > BUG_ON if the rb-tree still is not empty. We might get duplicate pfns > > between separate vfio_dma structs, but as I mentioned in other replies, > > that seems like an exception that we don't need to optimize for. > > > > If we don't optimize for the case where iova from different vfio_dma are > mapped to same pfn and we would not consider this case for page > accounting then: Just to clarify, the current code (not handling mdevs) will pin and do page accounting per iova, regardless of whether the iova translates to a unique pfn. As long as we do no worse than that, I'm ok. > - have rb tree of pinned iova, where key would be iova, in each vfio_dma > structure. > - iova tracking structure would have iova and ref_count only. > - page accounting would only count number of iova's in rb_tree, case > where different iova could map to same pfn would not be considered in > this implementation for now. > - vfio_unpin_pages() would have user_pfn and pfn as input, we would > validate that iova exist in rb tree and trust vendor driver that > corresponding pfn is correct, there is no validation of pfn. If want > validate pfn, call GUP, verify pfn and call put_pfn(). > - In .release() or .detach_group() path, if there are entries in this rb > tree, call GUP again using that iova, get pfn and then call > put_pfn(pfn) for ref_count+1 times. This is because we are not keeping > pfn in our tracking logic. Wait a sec, if we detach a group from the container and it's not the last group in the container (which would trigger a release), we can't assume anything about which vfio_dma entries were associated with that device. The vendor driver, through the release of the device(s) within that group, needs to unpin. In a container release, we need to send a notifier to the vendor driver(s) to cause an unpin. This is the only mechanism we have to ensure that vendor drivers are not leaking references. If during the release, after the notifier, if any vfio_pfns remain, we need to BUG_ON, just like we need to do for any other DMA_UNMAP. Also, I'll say it again, I also don't like this API of passing around potentially giant arrays, and especially the API of relying on the vendor driver to tell us an arbitrary pfn to unpin. If we make the assumption that vendor drivers do not pin lots and lots of memory, perhaps we could use a struct vfio_pfn as: struct vfio_pfn { struct rb_node node; dma_addr_t iova; /* key */ unsigned long pfn; atomic_t ref_count; }; This puts us at 44-bytes per pfn, which isn't great, but I think it puts us in a better position with the API that we could make use of a page-table or sparse array in the future that would eliminate the rb_node and make the iova implicit in the location of the data structure. That would leave only the pfn and ref_count, which could potentially be combined into a single 8-byte field if we had per vfio_dma (or higher) locking to avoid the atomic_t (and we're happy that the reference count is always less than PAGE_SIZE, ie. we could fail pinning if we get to that point). That would allow for the unpin call to not provide the pfn, so could we then look at whether we need the batching provided by the iova array at all? I don't have a feel for the size of memory that gets pinned by the vendor driver, the frequency of pinning, or whether usage of hugepages for the guest is likely to translate into contiguous memory requests through this API. What's your feeling? Thanks, Alex