Date: Thu, 18 Jan 2018 10:34:12 -0700
From: Alex Williamson
To: Suravee Suthikulpanit
Cc: linux-kernel@vger.kernel.org, iommu@lists.linux-foundation.org, joro@8bytes.org, jroedel@suse.de
Subject: Re: [RFC PATCH v2 1/2] vfio/type1: Adopt fast IOTLB flush interface when unmap IOVAs
Message-ID: <20180118103412.26b1a581@w520.home>
In-Reply-To: <0cdcfade-481a-96de-cadb-291968392318@amd.com>
References: <1514366435-12723-1-git-send-email-suravee.suthikulpanit@amd.com>
 <1514366435-12723-2-git-send-email-suravee.suthikulpanit@amd.com>
 <20180108135329.3c1e2c88@t450s.home>
 <0cdcfade-481a-96de-cadb-291968392318@amd.com>
On Thu, 18 Jan 2018 10:25:12 +0700
Suravee Suthikulpanit wrote:

> Hi Alex,
>
> On 1/9/18 3:53 AM, Alex Williamson wrote:
> > On Wed, 27 Dec 2017 04:20:34 -0500
> > Suravee Suthikulpanit wrote:
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index e30e29a..f000844 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>
> >> ...
> >>
> >> @@ -479,6 +486,40 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> >>  	return unlocked;
> >>  }
> >>
> >> +/*
> >> + * Generally, VFIO needs to unpin remote pages after each IOTLB flush.
> >> + * Therefore, when using IOTLB flush sync interface, VFIO need to keep track
> >> + * of these regions (currently using a list).
> >> + *
> >> + * This value specifies maximum number of regions for each IOTLB flush sync.
> >> + */
> >> +#define VFIO_IOMMU_TLB_SYNC_MAX		512
> >
> > Is this an arbitrary value or are there non-obvious considerations for
> > this value should we want to further tune it in the future?
>
> This is just an arbitrary value for now, which we could try further tuning.
> On some dGPUs that I have been using, I have seen a max of ~1500 regions
> within an unmap call. In most cases, I see fewer than 100 regions in an
> unmap call. The structure is currently 40 bytes, so capping the list at
> 512 entries is 20KB, which seems reasonable. Let me know what you think.

Seems like a reasonable starting point.  For a VM user, the maximum
size of an unmap will largely depend on the size of the VM,
predominantly the size of memory above the 4G boundary in the VM.
That could be 100s of GB.  In the best case, iommu_unmap could return
1GB ranges, in the worst case, 4K.  So I'm not sure there's really any
upper bound, thus the cond_resched(), but we will opportunistically
coalesce pages unless disabled via module option.
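For illustration, roughly the shape I have in mind for the deferred-flush
path (untested sketch, not the actual patch: the struct vfio_regions layout,
the unmap_unpin_fast() helper name, and the unpin placeholder are made up
here; only iommu_unmap_fast(), iommu_tlb_range_add(), iommu_tlb_sync() and
iommu_iova_to_phys() are the existing IOMMU API, and VFIO_IOMMU_TLB_SYNC_MAX
is the 512 cap from the hunk above):

/*
 * Illustrative sketch only: batch unmaps, defer the IOTLB flush, and cap
 * the number of regions awaiting unpin at VFIO_IOMMU_TLB_SYNC_MAX
 * (~20KB of tracking data at 512 entries of 40 bytes each).
 */
#include <linux/iommu.h>
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct vfio_regions {
	struct list_head list;
	dma_addr_t iova;
	phys_addr_t phys;
	size_t len;
};

static size_t unmap_unpin_fast(struct iommu_domain *domain, dma_addr_t iova,
			       size_t len, struct list_head *regions,
			       int *nr_regions)
{
	struct vfio_regions *entry, *next;
	size_t unmapped;

	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
	if (!entry)
		return 0;

	/* Record phys before the mapping disappears so we can unpin later. */
	entry->phys = iommu_iova_to_phys(domain, iova);

	unmapped = iommu_unmap_fast(domain, iova, len);
	if (!unmapped) {
		kfree(entry);
		return 0;
	}

	/* Queue the range for a deferred flush instead of syncing per range. */
	iommu_tlb_range_add(domain, iova, unmapped);
	entry->iova = iova;
	entry->len = unmapped;
	list_add_tail(&entry->list, regions);

	if (++(*nr_regions) < VFIO_IOMMU_TLB_SYNC_MAX)
		return unmapped;

	/* Cap hit: flush once, then unpin and free everything batched so far. */
	iommu_tlb_sync(domain);
	list_for_each_entry_safe(entry, next, regions, list) {
		/*
		 * Unpin entry->phys >> PAGE_SHIFT for entry->len >> PAGE_SHIFT
		 * pages here (placeholder for the real unpin/accounting call).
		 */
		list_del(&entry->list);
		kfree(entry);
	}
	*nr_regions = 0;
	cond_resched();

	return unmapped;
}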
> >> ....
> >>
> >> @@ -887,8 +946,14 @@ static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
> >>  			break;
> >>  	}
> >>
> >> -	for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
> >> -		iommu_unmap(domain->domain, iova, PAGE_SIZE);
> >> +	for (; i < npage && i > 0; i--, iova -= PAGE_SIZE) {
> >> +		unmapped = iommu_unmap_fast(domain->domain, iova, PAGE_SIZE);
> >> +		if (WARN_ON(!unmapped))
> >> +			break;
> >> +		iommu_tlb_range_add(domain->domain, iova, unmapped);
> >> +	}
> >> +	if (unmapped)
> >> +		iommu_tlb_sync(domain->domain);
> >
> > Using unmapped here seems a little sketchy, for instance if we got back
> > zero on the last call to iommu_unmap_fast() but had other ranges queued
> > for flush.  Do we even need a WARN_ON and break here, are we just
> > trying to skip adding a zero range?  The intent is that we either leave
> > this function with everything mapped or nothing mapped, so perhaps we
> > should warn and continue.  Assuming a spurious sync is ok, we could
> > check (i < npage) for the sync condition, the only risk being we had no
> > mappings at all and therefore no unmaps.
> >
> > TBH, I wonder if this function is even needed anymore or if the mapping
> > problem in amd_iommu has since been fixed.
>
> Actually, I never hit this execution path in my test runs. I could just
> leave this unchanged and use the slow unmap path to simplify the logic.
> I'm not aware of the history of why this logic is needed for AMD IOMMU.
> Is this a bug in the driver or the hardware?

It was a bug in amd_iommu code, and unfortunately I've lost the details
beyond what we see in the comments there: iommu ptes previously mapped
as 4k can't be unmapped and remapped using large page sizes, presumably
pmds still in place or something along those lines.  If you're testing
on AMD and not triggering this, let's not try to optimize this path,
it's just a crusty old safety net.

> > Also, I'm not sure why you're gating adding fast flushing to amd_iommu
> > on vfio making use of it.  These can be done independently.  Thanks,
>
> Currently, the fast unmap interface is mainly called by VFIO. So, I
> thought I would submit the patches together for review. If you would
> prefer, I can submit the IOMMU part separately.

I think it makes more sense and is most efficient to decouple them.
Joerg probably isn't paying attention to the amd_iommu changes because
I'm commenting on the vfio parts, and I'm not looking at the amd_iommu
changes.  The code doesn't depend on them going in together, so we can
do them in parallel if they're split.  Thanks,

Alex