Subject: Re: [PATCH v11 10/22] vfio iommu type1: Add support for mediated devices
From: Kirti Wankhede
To: Alex Williamson
Date: Tue, 8 Nov 2016 20:36:34 +0530
In-Reply-To: <20161107161619.66e03d8f@t450s.home>
List-ID: linux-kernel@vger.kernel.org

On 11/8/2016 4:46 AM, Alex Williamson wrote:
> On Sat, 5 Nov 2016 02:40:44 +0530
> Kirti Wankhede wrote:
> ...
>> -static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>> +static int __vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>> +				    int prot, unsigned long *pfn_base,
>> +				    bool do_accounting)
>> +{
>> +	struct task_struct *task = dma->task;
>> +	unsigned long limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> +	bool lock_cap = dma->mlock_cap;
>> +	struct mm_struct *mm = dma->addr_space->mm;
>> +	int ret;
>> +	bool rsvd;
>> +
>> +	ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
>> +	if (ret)
>> +		return ret;
>> +
>> +	rsvd = is_invalid_reserved_pfn(*pfn_base);
>> +
>> +	if (!rsvd && !lock_cap && mm->locked_vm + 1 > limit) {
>> +		put_pfn(*pfn_base, prot);
>> +		pr_warn("%s: Task %s (%d) RLIMIT_MEMLOCK (%ld) exceeded\n",
>> +			__func__, task->comm, task_pid_nr(task),
>> +			limit << PAGE_SHIFT);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	if (!rsvd && do_accounting)
>> +		vfio_lock_acct(mm, 1);
>> +
>> +	return 1;
>> +}
>> +
>> +static void __vfio_unpin_page_external(struct vfio_addr_space *addr_space,
>> +				       unsigned long pfn, int prot,
>> +				       bool do_accounting)
>> +{
>> +	put_pfn(pfn, prot);
>> +
>> +	if (do_accounting)
>> +		vfio_lock_acct(addr_space->mm, -1);
>
> Can't we batch this like we do elsewhere?  Intel folks, AIUI you intend
> to pin all VM memory through this side channel, have you tested the
> scalability and performance of this with larger VMs?  Our vfio_pfn
> data structure alone is 40 bytes per pinned page, which means for
> each 1GB of VM memory, we have 10MBs worth of struct vfio_pfn!
> Additionally, unmapping each 1GB of VM memory will result in 256k
> separate vfio_lock_acct() callbacks.  I'm concerned that we're not
> being efficient enough in either space or time.
>
> One thought might be whether we really need to save the pfn, we better
> always get the same result if we pin it again, or maybe we can just do
> a lookup through the mm at that point without re-pinning.
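To make the batching point above concrete, here is a small userspace sketch (not the kernel code; all names are hypothetical stand-ins for mm->locked_vm, vfio_lock_acct() and put_pfn()) of unpinning a range with a single accounting callback instead of one per page:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel state touched by the accounting path. */
static long locked_vm;   /* models mm->locked_vm */
static long acct_calls;  /* counts accounting callbacks made */

/* Models vfio_lock_acct(mm, npage): adjust the locked-page count. */
static void lock_acct(long npage)
{
	locked_vm += npage;
	acct_calls++;
}

/* Per-page unpin without the accounting side effect; the kernel
 * version would call put_pfn(pfn, prot) here. */
static void unpin_one(unsigned long pfn)
{
	(void)pfn;
}

/* Unpin a whole range, then account once for the batch, so that
 * tearing down N pages costs one callback rather than N. */
static long unpin_pages_batched(const unsigned long *pfns, long npage)
{
	long unlocked = 0;

	for (long i = 0; i < npage; i++) {
		unpin_one(pfns[i]);
		unlocked++;
	}
	if (unlocked)
		lock_acct(-unlocked);
	return unlocked;
}
```

For 1GB of 4KB pages this turns 256k vfio_lock_acct() calls into one per unmapped range.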
> Could we get
> to the point where we only need an atomic_t ref count per page in a
> linear array relative to the IOVA?

Ok.

Is System RAM hot-plug supported? How is system RAM hot-plug handled?
Are there DMA_MAP calls for the additional range on such hot-plug? If
we used a linear array, we would have to realloc it on memory hot-plug,
right?

> That would give us 1MB per 1GB
> overhead. The semantics of the pin and unpin would make more sense then
> too, both would take an IOVA range, only pinning would need a return
> mechanism. For instance:
>
>	int pin_pages(void *iommu_data, dma_addr_t iova_base,
>		      int npage, unsigned long *pfn_base);
>
> This would pin physically contiguous pages up to npage, returning the
> base pfn and returning the number of pages pinned (<= npage). The
> vendor driver would make multiple calls to fill the necessary range.

With the current patch, the input is a user_pfn[] array and npages:

	int vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
			   int npage, int prot, unsigned long *phys_pfn)

When the guest allocates memory with malloc(), the gfns would not be
contiguous, right? Those gfns (user_pfns) are what get passed as
arguments here. Is there any case where we could get a pin/unpin
request for contiguous pages?

> Unpin would then simply be:
>
>	void unpin_pages(void *iommu_data, dma_addr_t iova_base, int npage);
>
> Hugepage usage would really make such an interface shine (ie. 2MB+
> contiguous ranges). A downside would be the overhead of getting the
> group and container reference in vfio for each callback, perhaps we'd
> need to figure out how the vendor driver could hold that reference.

In very early phases of this proposal, I had suggested keeping a
pointer to container->iommu_data in struct mdev_device, but that was
discarded.

> The current API of passing around pfn arrays, further increasing the
> overhead of the whole ecosystem just makes me cringe though.
>
...
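For reference, a userspace model of the "linear refcount array relative to the IOVA" idea being discussed above might look like the sketch below (hypothetical names throughout; the kernel would use atomic_t and handle allocation failure). With one 4-byte counter per 4KB page, 1GB of IOVA (262144 pages) costs 1MB, versus 10MB of 40-byte struct vfio_pfn entries:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define PAGE_SHIFT 12

/* One counter per 4KB page of a DMA_MAP'd IOVA range. */
struct iova_refcounts {
	unsigned long iova_base;  /* start of the mapped range */
	long npage;               /* pages covered by the range */
	atomic_int *ref;          /* one refcount per page */
};

static struct iova_refcounts *refs_alloc(unsigned long iova_base, long npage)
{
	struct iova_refcounts *r = malloc(sizeof(*r));

	r->iova_base = iova_base;
	r->npage = npage;
	r->ref = calloc((size_t)npage, sizeof(*r->ref));
	return r;
}

/* Pin and unpin both take an IOVA range; only pin would need a
 * return mechanism (the base pfn) in the real interface. */
static void pin_range(struct iova_refcounts *r, unsigned long iova, long npage)
{
	long first = (long)((iova - r->iova_base) >> PAGE_SHIFT);

	for (long i = 0; i < npage; i++)
		atomic_fetch_add(&r->ref[first + i], 1);
}

static void unpin_range(struct iova_refcounts *r, unsigned long iova, long npage)
{
	long first = (long)((iova - r->iova_base) >> PAGE_SHIFT);

	for (long i = 0; i < npage; i++)
		atomic_fetch_sub(&r->ref[first + i], 1);
}
```

The hot-plug question raised above corresponds to growing this array (or keeping one array per vfio_dma range) when a new DMA_MAP arrives.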
>> +		if (ret <= 0) {
>> +			WARN_ON(!ret);
>> +			goto pin_unwind;
>> +		}
>> +
>> +		mutex_lock(&dma->addr_space->pfn_list_lock);
>> +
>> +		/* search if pfn exist */
>> +		p = vfio_find_pfn(dma->addr_space, pfn[i]);
>> +		if (p) {
>> +			atomic_inc(&p->ref_count);
>
> We never test whether (p->prot == prot), shouldn't we be doing
> something in that case?  In fact, why do we allow the side-channel
> through the .{un}pin_pages to specify page protection flags that might
> be different than the user specified for the DMA_MAP?  If the user
> specified read-only, the vendor driver should not be allowed to
> override with read-write access.
>

The user-specified protection flags for DMA_MAP could be
prot = IOMMU_WRITE | IOMMU_READ, while the vendor driver might request
to pin a page read-only, i.e. with IOMMU_READ only. In that case the
pin should still be allowed, right? So the check should be
if (p->prot & prot).

Thanks,
Kirti
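[Editorial note: the subset semantics being debated above can be modeled as follows. This is an illustrative sketch, not the patch's code; pin_prot_allowed() is a hypothetical helper, and it uses a strict subset test, slightly stronger than the `p->prot & prot` check proposed in the reply. The flag values mirror the kernel's IOMMU_READ/IOMMU_WRITE.]

```c
#include <assert.h>
#include <stdbool.h>

#define IOMMU_READ  (1 << 0)
#define IOMMU_WRITE (1 << 1)

/* A pin request is allowed only if every flag the vendor driver asks
 * for was already granted by the user at DMA_MAP time; a read-only
 * mapping can never be pinned for write. */
static bool pin_prot_allowed(int dma_map_prot, int requested_prot)
{
	return (dma_map_prot & requested_prot) == requested_prot;
}
```

Under this test, a read-only pin against a read-write DMA_MAP is permitted, while a read-write pin against a read-only DMA_MAP is refused, which matches the constraint Alex describes.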