Subject: Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range
To: Ganapatrao Kulkarni
Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu@lists.linux-foundation.org,
    LKML, tomasz.nowicki@cavium.com, jnair@caviumnetworks.com,
    Robert Richter, Vadim.Lomovtsev@cavium.com, Jan.Glauber@cavium.com
References: <20180807085437.15965-1-ganapatrao.kulkarni@cavium.com>
 <318f9118-df78-e78f-1ae2-72a33cbee28e@arm.com>
From: Robin Murphy
Date: Thu, 9 Aug 2018 21:43:59 +0100
On 2018-08-09 6:49 PM, Ganapatrao Kulkarni wrote:
> Hi Robin,
>
> On Thu, Aug 9, 2018 at 9:54 PM, Robin Murphy wrote:
>> On 07/08/18 09:54, Ganapatrao Kulkarni wrote:
>>>
>>> As an optimisation for PCI devices, the first allocation attempt is
>>> always made from the SAC address range. This leads to unnecessary
>>> attempts/function calls when there are no free ranges available.
>>>
>>> This patch optimises that by adding a flag to track previous failed
>>> attempts, and avoids further attempts until free ranges are
>>> replenished.
>>
>> Agh, what I overlooked is that this still suffers from the original
>> problem, wherein a large allocation which fails due to fragmentation
>> then blocks all subsequent smaller allocations, even if they might
>> have succeeded.
>>
>> For a minimal change, though, what I think we could do is, instead of
>> just having a flag, track the size of the last 32-bit allocation which
>> failed. If we're happy to assume that nobody's likely to mix aligned
>> and unaligned allocations within the same domain, then that should be
>> sufficiently robust whilst being no more complicated than this
>> version, i.e. (modulo thinking up a better name for it):
>
> I agree, it would be better to track the size and attempt to allocate
> smaller chunks, if not the bigger one.
>
>>
>>>
>>> Signed-off-by: Ganapatrao Kulkarni
>>> ---
>>> This patch is based on comments from Robin Murphy
>>> for patch [1]
>>>
>>> [1] https://lkml.org/lkml/2018/4/19/780
>>>
>>>   drivers/iommu/iova.c | 11 ++++++++++-
>>>   include/linux/iova.h |  1 +
>>>   2 files changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index 83fe262..d97bb5a 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>>>        iovad->granule = granule;
>>>        iovad->start_pfn = start_pfn;
>>>        iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
>>> +     iovad->free_32bit_pfns = true;
>>
>> iovad->max_32bit_free = iovad->dma_32bit_pfn;
>>
>>>        iovad->flush_cb = NULL;
>>>        iovad->fq = NULL;
>>>        iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
>>> @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
>>>        cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
>>>        if (free->pfn_hi < iovad->dma_32bit_pfn &&
>>> -         free->pfn_lo >= cached_iova->pfn_lo)
>>> +         free->pfn_lo >= cached_iova->pfn_lo) {
>>>                iovad->cached32_node = rb_next(&free->node);
>>> +             iovad->free_32bit_pfns = true;
>>
>> iovad->max_32bit_free = iovad->dma_32bit_pfn;
>
> I think you intended to say
> iovad->max_32bit_free += (free->pfn_hi - free->pfn_lo);

Nope, that's why I said it needed a better name ;)

(I nearly called it last_failed_32bit_alloc_size, but that's a bit much)

The point of this value (whatever it's called) is that at any given time
it holds an upper bound on the size of the largest contiguous free area.
It doesn't have to be the *smallest* upper bound, which is why we can
keep things simple and avoid arithmetic - in realistic use-cases like
yours, where the allocations are a pretty constant size, this should
work out directly equivalent to the boolean, only with values of "size"
and "dma_32bit_pfn" instead of 0 and 1, so we don't do any more work
than necessary. In the edge cases where allocations are all different
sizes, it does mean that we will probably end up performing more failing
allocations than if we actually tried to track all of the free space
exactly, but I think that's reasonable - just because I want to make
sure we handle such cases fairly gracefully doesn't mean that we need
to do extra work on the typical fast path to try and actually optimise
for them (which is why I didn't really like the accounting
implementation I came up with).
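To spell out the intended behaviour, here is a minimal standalone model
of the bookkeeping - purely a sketch of the semantics, not the kernel
code, and the struct/function names are made up for illustration:

#include <stdbool.h>

/* Illustrative model only; the real state lives in struct iova_domain. */
struct bound_model {
        unsigned long dma_32bit_pfn;   /* top of the 32-bit (SAC) space */
        unsigned long max_32bit_free;  /* upper bound on the largest free run */
};

static void model_init(struct bound_model *m, unsigned long dma_32bit_pfn)
{
        m->dma_32bit_pfn = dma_32bit_pfn;
        /* nothing has failed yet, so start with the trivial bound */
        m->max_32bit_free = dma_32bit_pfn;
}

/* checked before walking the rbtree for a 32-bit-limited request */
static bool model_worth_trying(struct bound_model *m,
                               unsigned long size, unsigned long limit_pfn)
{
        return !(limit_pfn <= m->dma_32bit_pfn && size >= m->max_32bit_free);
}

/* a request below dma_32bit_pfn just failed: tighten the bound */
static void model_note_failure(struct bound_model *m, unsigned long size)
{
        /* we only get here when size < max_32bit_free, so this only shrinks */
        m->max_32bit_free = size;
}

/* something below dma_32bit_pfn was freed: the old bound is stale, reset */
static void model_note_free(struct bound_model *m)
{
        m->max_32bit_free = m->dma_32bit_pfn;
}

With a constant allocation size that collapses to exactly the boolean,
while mixed sizes still make forward progress because only requests at
least as big as the last failure get skipped.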
>>
>>> +     }
>>>
>>>        cached_iova = rb_entry(iovad->cached_node, struct iova, node);
>>>        if (free->pfn_lo >= cached_iova->pfn_lo)
>>> @@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
>>>        struct iova *new_iova;
>>>        int ret;
>>> +     if (limit_pfn <= iovad->dma_32bit_pfn &&
>>> +             !iovad->free_32bit_pfns)
>>
>> size >= iovad->max_32bit_free)
>>
>>> +             return NULL;
>>> +
>>>        new_iova = alloc_iova_mem();
>>>        if (!new_iova)
>>>                return NULL;
>>> @@ -299,6 +306,8 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
>>>        if (ret) {
>>>                free_iova_mem(new_iova);
>>> +             if (limit_pfn <= iovad->dma_32bit_pfn)
>>> +                     iovad->free_32bit_pfns = false;
>>
>> iovad->max_32bit_free = size;
>
> Same here, we should decrease the available free range after a
> successful allocation:
> iovad->max_32bit_free -= size;

Equivalently, the simple assignment is already strictly decreasing the
upper bound, since we can only get here if size < max_32bit_free in the
first place.

One more thing I've realised is that this is all potentially a bit racy,
as we're outside the lock here, so it might need to be pulled into
__alloc_and_insert_iova_range(), something like the rough diff below
(name changed again for the sake of it; it also occurs to me that we
don't really need to re-check limit_pfn in the failure path either,
because even a 64-bit allocation still has to walk down through the
32-bit space in order to fail completely).

>>
>> What do you think?
>
> Most likely this should work; I will try this and confirm at the
> earliest.

Thanks for sticking with this.

Robin.
----->8-----
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 83fe2621effe..7cbc58885877 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -190,6 +190,10 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
 	/* Walk the tree backwards */
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	if (limit_pfn <= iovad->dma_32bit_pfn &&
+			size >= iovad->failed_alloc_size)
+		goto out_err;
+
 	curr = __get_cached_rbnode(iovad, limit_pfn);
 	curr_iova = rb_entry(curr, struct iova, node);
 	do {
@@ -200,10 +204,8 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 		curr_iova = rb_entry(curr, struct iova, node);
 	} while (curr && new_pfn <= curr_iova->pfn_hi);
 
-	if (limit_pfn < size || new_pfn < iovad->start_pfn) {
-		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
-		return -ENOMEM;
-	}
+	if (limit_pfn < size || new_pfn < iovad->start_pfn)
+		goto out_err;
 
 	/* pfn_lo will point to size aligned address if size_aligned is set */
 	new->pfn_lo = new_pfn;
@@ -214,9 +216,12 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	__cached_rbnode_insert_update(iovad, new);
 
 	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
-
-
 	return 0;
+
+out_err:
+	iovad->failed_alloc_size = size;
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	return -ENOMEM;
 }
 
 static struct kmem_cache *iova_cache;
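For completeness, the bits the rough diff above doesn't show would just
mirror the earlier inline suggestions - one new field in struct
iova_domain plus the trivial init and reset - roughly like this
(untested sketch, field name as in the diff above, placement purely
illustrative):

/* include/linux/iova.h */
struct iova_domain {
	/* ... existing fields ... */
	unsigned long	dma_32bit_pfn;
	unsigned long	failed_alloc_size;	/* size of the last failed
						   32-bit allocation, i.e. an
						   upper bound on the largest
						   free run below dma_32bit_pfn */
	/* ... existing fields ... */
};

/* init_iova_domain(): nothing has failed yet, so nothing is ruled out */
	iovad->failed_alloc_size = iovad->dma_32bit_pfn;

/* __cached_rbnode_delete_update(): space below 4GB was just freed, so
 * the old bound no longer tells us anything - reset it */
	iovad->failed_alloc_size = iovad->dma_32bit_pfn;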