Subject: Re: [PATCH] vfio/type1: Restore mapping performance with mdev support
To: Alex Williamson
From: Kirti Wankhede
Date: Thu, 15 Dec 2016 23:27:54 +0530
Message-ID: <02707161-145f-25f3-ab47-c63d1de81e02@nvidia.com>
In-Reply-To: <20161215010347.3942360a@t450s.home>
References: <20161213205810.25950.32323.stgit@gimli.home> <2e2a2593-46ec-547b-e4a7-e78be446757a@nvidia.com> <20161215010347.3942360a@t450s.home>
X-Mailing-List: linux-kernel@vger.kernel.org

On 12/15/2016 1:33 PM, Alex Williamson wrote:
> On Thu, 15 Dec 2016 12:05:35 +0530
> Kirti Wankhede wrote:
>
>> On 12/14/2016 2:28 AM, Alex Williamson wrote:
>>> As part of the mdev support, type1 now gets a task reference per
>>> vfio_dma and uses that to get an mm reference for the task while
>>> working on accounting. That's the correct thing to do for paths
>>> where we can't rely on using current, but there are still hot paths
>>> where we can optimize because we know we're invoked by the user.
>>>
>>> Specifically, vfio_pin_pages_remote() is only called when the user
>>> does DMA mapping (vfio_dma_do_map) or if an IOMMU group is added to
>>> a container with existing mappings (vfio_iommu_replay).
>>> We can
>>> therefore use current->mm as well as rlimit() and capable() directly
>>> rather than going through the high overhead path via the stored
>>> task_struct. We also know that vfio_dma_do_unmap() is only called
>>> via user ioctl, so we can also tune that path to be more lightweight.
>>>
>>> In a synthetic guest mapping test emulating a 1TB VM backed by a
>>> single 4GB range remapped multiple times across the address space,
>>> the mdev changes to the type1 backend introduced a roughly 25% hit
>>> in runtime of this test. These changes restore it to nearly the
>>> previous performance for the interfaces exercised here,
>>> VFIO_IOMMU_MAP_DMA and release on close.
>>>
>>> Signed-off-by: Alex Williamson
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c | 145 +++++++++++++++++++++------------------
>>>  1 file changed, 79 insertions(+), 66 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 9815e45..8dfeafb 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -103,6 +103,10 @@ struct vfio_pfn {
>>>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>>>  					(!list_empty(&iommu->domain_list))
>>>
>>> +/* Make function bool options readable */
>>> +#define IS_CURRENT	(true)
>>> +#define DO_ACCOUNTING	(true)
>>> +
>>>  static int put_pfn(unsigned long pfn, int prot);
>>>
>>>  /*
>>> @@ -264,7 +268,8 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>>>  	kfree(vwork);
>>>  }
>>>
>>> -static void vfio_lock_acct(struct task_struct *task, long npage)
>>> +static void vfio_lock_acct(struct task_struct *task,
>>> +			   long npage, bool is_current)
>>>  {
>>>  	struct vwork *vwork;
>>>  	struct mm_struct *mm;
>>> @@ -272,24 +277,31 @@ static void vfio_lock_acct(struct task_struct *task, long npage)
>>>  	if (!npage)
>>>  		return;
>>>
>>> -	mm = get_task_mm(task);
>>> +	mm = is_current ? task->mm : get_task_mm(task);
>>>  	if (!mm)
>>> -		return; /* process exited or nothing to do */
>>> +		return; /* process exited */
>>>
>>>  	if (down_write_trylock(&mm->mmap_sem)) {
>>>  		mm->locked_vm += npage;
>>>  		up_write(&mm->mmap_sem);
>>> -		mmput(mm);
>>> +		if (!is_current)
>>> +			mmput(mm);
>>>  		return;
>>>  	}
>>>
>>> +	if (is_current) {
>>> +		mm = get_task_mm(task);
>>> +		if (!mm)
>>> +			return;
>>> +	}
>>> +
>>>  	/*
>>>  	 * Couldn't get mmap_sem lock, so must setup to update
>>>  	 * mm->locked_vm later. If locked_vm were atomic, we
>>>  	 * wouldn't need this silliness
>>>  	 */
>>>  	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>>> -	if (!vwork) {
>>> +	if (WARN_ON(!vwork)) {
>>>  		mmput(mm);
>>>  		return;
>>>  	}
>>> @@ -345,13 +357,13 @@ static int put_pfn(unsigned long pfn, int prot)
>>>  }
>>>
>>>  static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>> -			 int prot, unsigned long *pfn)
>>> +			 int prot, unsigned long *pfn, bool is_current)
>>>  {
>>>  	struct page *page[1];
>>>  	struct vm_area_struct *vma;
>>>  	int ret;
>>>
>>> -	if (mm == current->mm) {
>>> +	if (is_current) {
>>
>> With this change, if vfio_pin_page_external() gets called from QEMU
>> process context, for example in response to some BAR0 register access,
>> it will still fallback to slow path, get_user_pages_remote(). We don't
>> have to change this function. This path already takes care of taking
>> best possible path.
>>
>> That also makes me think, vfio_pin_page_external() uses task structure
>> to get mlock limit and capability. Expectation is mdev vendor driver
>> shouldn't pin all system memory, but if any mdev driver does that, then
>> that driver might see such performance impact. Should we optimize this
>> path if (dma->task == current)?
>
> Hi Kirti,
>
> I was actually trying to avoid the (task == current) test with this
> change because I wasn't sure how reliable it is.
> Is there a
> possibility that this test generates a false positive if current
> coincidentally matches our task and does that allow us the same
> opportunities for making use of current that we have when we know in a
> process context execution path? The above change makes this a more
> direct association. Can you show that inferring the process context is
> correct? Thanks,

We do take a reference on the task structure: get_task_struct(current)
is called before the pointer is saved in dma->task, and the reference is
released with put_task_struct() from vfio_remove_dma(). That ensures we
have a valid reference to the task structure until we remove/free that
dma structure. Why would the check (dma->task == current) give a false
positive? A vendor driver can call vfio_pin_pages() on access to some
emulated register from the same task that mapped the DMA range; in that
case this check would be true.

Thanks,
Kirti
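
As a rough illustration of the lifetime argument above, here is a minimal userspace sketch in C. It is not kernel code and not part of the patch: struct task, struct dma, task_get(), task_put(), dma_map(), dma_remove() and dma_owner_is_current() are hypothetical stand-ins for task_struct, vfio_dma, get_task_struct(), put_task_struct(), the map and remove paths, and the (dma->task == current) test. It models only the pointer-lifetime half of the question, assuming a simple non-atomic counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for task_struct: only the reference count. */
struct task {
	int usage;
};

/* Hypothetical stand-in for vfio_dma: only the saved owner pointer. */
struct dma {
	struct task *task;
};

/* Models get_task_struct(): take one more reference. */
static struct task *task_get(struct task *t)
{
	t->usage++;
	return t;
}

/* Models put_task_struct(): free the object when the last reference goes. */
static void task_put(struct task *t)
{
	if (--t->usage == 0)
		free(t);
}

/* Models the map path: save a counted reference to the mapping task. */
static void dma_map(struct dma *dma, struct task *current_task)
{
	dma->task = task_get(current_task);
}

/* Models vfio_remove_dma(): drop the reference taken at map time. */
static void dma_remove(struct dma *dma)
{
	task_put(dma->task);
	dma->task = NULL;
}

/* The check under discussion: did the caller map this range? */
static bool dma_owner_is_current(const struct dma *dma,
				 const struct task *current_task)
{
	return dma->task == current_task;
}
```

In this model the task object stays allocated for as long as the mapping holds its reference, so its address cannot be handed out to a different task while dma->task still points at it. Whether a pointer match also gives the caller the same guarantees as a known process-context path is a separate question that this sketch does not address.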