From: Ganapatrao Kulkarni
Date: Fri, 10 Aug 2018 15:31:15 +0530
Subject: Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range
To: Robin Murphy
Cc: Ganapatrao Kulkarni, Joerg Roedel, iommu@lists.linux-foundation.org,
    LKML, tomasz.nowicki@cavium.com, jnair@caviumnetworks.com,
    Robert Richter, Vadim.Lomovtsev@cavium.com, Jan.Glauber@cavium.com
References: <20180807085437.15965-1-ganapatrao.kulkarni@cavium.com>
    <318f9118-df78-e78f-1ae2-72a33cbee28e@arm.com>

On Fri, Aug 10, 2018 at 3:19 PM, Robin Murphy wrote:
> On 10/08/18 10:24, Ganapatrao Kulkarni wrote:
>>
>> Hi Robin,
>>
>> On Fri, Aug 10, 2018 at 2:13 AM, Robin Murphy wrote:
>>>
>>> On 2018-08-09 6:49 PM, Ganapatrao Kulkarni wrote:
>>>>
>>>> Hi Robin,
>>>>
>>>> On Thu, Aug 9, 2018 at 9:54 PM, Robin Murphy wrote:
>>>>>
>>>>> On 07/08/18 09:54, Ganapatrao Kulkarni wrote:
>>>>>>
>>>>>> As an optimisation for PCI devices, a first attempt is always made
>>>>>> to allocate the IOVA from the SAC address range. This leads to
>>>>>> unnecessary attempts/function calls when there are no free ranges
>>>>>> available.
>>>>>>
>>>>>> This patch optimises this by adding a flag to track previous failed
>>>>>> attempts, and avoids further attempts until a replenish happens.
>>>>>
>>>>> Agh, what I overlooked is that this still suffers from the original
>>>>> problem, wherein a large allocation which fails due to fragmentation
>>>>> then blocks all subsequent smaller allocations, even if they may have
>>>>> succeeded.
>>>>>
>>>>> For a minimal change, though, what I think we could do is instead of
>>>>> just having a flag, track the size of the last 32-bit allocation
>>>>> which failed. If we're happy to assume that nobody's likely to mix
>>>>> aligned and unaligned allocations within the same domain, then that
>>>>> should be sufficiently robust whilst being no more complicated than
>>>>> this version, i.e. (modulo thinking up a better name for it):
>>>>
>>>> I agree, it would be better to track the size and attempt allocation
>>>> of smaller chunks, if not of the bigger one.
>>>>
>>>>>
>>>>>> Signed-off-by: Ganapatrao Kulkarni
>>>>>> ---
>>>>>> This patch is based on comments from Robin Murphy for patch [1]
>>>>>>
>>>>>> [1] https://lkml.org/lkml/2018/4/19/780
>>>>>>
>>>>>>   drivers/iommu/iova.c | 11 ++++++++++-
>>>>>>   include/linux/iova.h |  1 +
>>>>>>   2 files changed, 11 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>>>>> index 83fe262..d97bb5a 100644
>>>>>> --- a/drivers/iommu/iova.c
>>>>>> +++ b/drivers/iommu/iova.c
>>>>>> @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>>>>>>          iovad->granule = granule;
>>>>>>          iovad->start_pfn = start_pfn;
>>>>>>          iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
>>>>>> +        iovad->free_32bit_pfns = true;
>>>>>
>>>>>           iovad->max_32bit_free = iovad->dma_32bit_pfn;
>>>>>
>>>>>>          iovad->flush_cb = NULL;
>>>>>>          iovad->fq = NULL;
>>>>>>          iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
>>>>>> @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
>>>>>>          cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
>>>>>>          if (free->pfn_hi < iovad->dma_32bit_pfn &&
>>>>>> -            free->pfn_lo >= cached_iova->pfn_lo)
>>>>>> +            free->pfn_lo >= cached_iova->pfn_lo) {
>>>>>>                  iovad->cached32_node = rb_next(&free->node);
>>>>>> +                iovad->free_32bit_pfns = true;
>>>>>
>>>>>                    iovad->max_32bit_free = iovad->dma_32bit_pfn;
>>>>
>>>> I think you intended to say:
>>>> iovad->max_32bit_free += (free->pfn_hi - free->pfn_lo);
>>>
>>> Nope, that's why I said it needed a better name ;)
>>>
>>> (I nearly called it last_failed_32bit_alloc_size, but that's a bit much)
>>
>> Maybe we can name it "max32_alloc_size"?
>>>
>>> The point of this value (whatever it's called) is that at any given
>>> time it holds an upper bound on the size of the largest contiguous
>>> free area. It doesn't have to be the *smallest* upper bound, which is
>>> why we can keep things simple and avoid arithmetic - in realistic
>>> use-cases like yours, where the allocations are a pretty constant
>>> size, this should work out directly equivalent to the boolean, only
>>> with values of "size" and "dma_32bit_pfn" instead of 0 and 1, so we
>>> don't do any more work than necessary. In the edge cases where
>>> allocations are all different sizes, it does mean that we will
>>> probably end up performing more failing allocations than if we
>>> actually tried to track all of the free space exactly, but I think
>>> that's reasonable - just because I want to make sure we handle such
>>> cases fairly gracefully doesn't mean that we need to do extra work on
>>> the typical fast path to try and actually optimise for them (which is
>>> why I didn't really like the accounting implementation I came up
>>> with).
>>
>> OK, got it - thanks for the explanation.
>>>>>
>>>>>> +        }
>>>>>>          cached_iova = rb_entry(iovad->cached_node, struct iova, node);
>>>>>>          if (free->pfn_lo >= cached_iova->pfn_lo)
>>>>>> @@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
>>>>>>          struct iova *new_iova;
>>>>>>          int ret;
>>>>>> +        if (limit_pfn <= iovad->dma_32bit_pfn &&
>>>>>> +            !iovad->free_32bit_pfns)
>>>>>
>>>>>              size >= iovad->max_32bit_free)
>>>>>
>>>>>> +                return NULL;
>>>>>> +
>>>>>>          new_iova = alloc_iova_mem();
>>>>>>          if (!new_iova)
>>>>>>                  return NULL;
>>>>>> @@ -299,6 +306,8 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
>>>>>>          if (ret) {
>>>>>>                  free_iova_mem(new_iova);
>>>>>> +                if (limit_pfn <= iovad->dma_32bit_pfn)
>>>>>> +                        iovad->free_32bit_pfns = false;
>>>>>
>>>>>                  iovad->max_32bit_free = size;
>>>>
>>>> Same here, we should decrease the available free range after a
>>>> successful allocation:
>>>> iovad->max_32bit_free -= size;
>>>
>>> Equivalently, the simple assignment is strictly decreasing the upper
>>> bound already, since we can only get here if size < max_32bit_free in
>>> the first place. One more thing I've realised is that this is all
>>> potentially a bit racy as we're outside the lock here, so it might
>>> need to be pulled into __alloc_and_insert_iova_range(), something like
>>> the rough diff below (name changed again for the sake of it; it also
>>> occurs to me that we don't really need to re-check limit_pfn in the
>>> failure path either, because even a 64-bit allocation still has to
>>> walk down through the 32-bit space in order to fail completely)
>>>
>>>>> What do you think?
>>>>
>>>> Most likely this should work; I will try this and confirm at the
>>>> earliest.
>>>
>>> Thanks for sticking with this.
>>>
>>> Robin.
>>>
>>> ----->8-----
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index 83fe2621effe..7cbc58885877 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -190,6 +190,10 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>>
>>>          /* Walk the tree backwards */
>>>          spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>>> +        if (limit_pfn <= iovad->dma_32bit_pfn &&
>>> +            size >= iovad->failed_alloc_size)
>>> +                goto out_err;
>>> +
>>>          curr = __get_cached_rbnode(iovad, limit_pfn);
>>>          curr_iova = rb_entry(curr, struct iova, node);
>>>          do {
>>> @@ -200,10 +204,8 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>>                  curr_iova = rb_entry(curr, struct iova, node);
>>>          } while (curr && new_pfn <= curr_iova->pfn_hi);
>>>
>>> -        if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>>> -                spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>> -                return -ENOMEM;
>>> -        }
>>> +        if (limit_pfn < size || new_pfn < iovad->start_pfn)
>>> +                goto out_err;
>>>
>>>          /* pfn_lo will point to size aligned address if size_aligned is set */
>>>          new->pfn_lo = new_pfn;
>>> @@ -214,9 +216,12 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>>          __cached_rbnode_insert_update(iovad, new);
>>>
>>>          spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>> -
>>> -
>>>          return 0;
>>> +
>>> +out_err:
>>> +        iovad->failed_alloc_size = size;
>>> +        spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>> +        return -ENOMEM;
>>>  }
>>>
>>>  static struct kmem_cache *iova_cache;
>>
>> Can't we bump up the size when ranges are freed? Otherwise we will
>> never be able to attempt allocation in the 32-bit range, even when
>> there is enough replenishment.
>
> Oh, I just left that part out of the example for clarity, since it's
> already under the lock - I didn't mean to suggest that we should remove
> it!

ok, thanks

> (I was just too lazy to actually apply your patch to generate a real
> diff on top of it)

no problem, I will post the next version at the earliest.

> Robin.
>
>>
>> @@ -139,8 +139,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
>>
>>          cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
>>          if (free->pfn_hi < iovad->dma_32bit_pfn &&
>> -            free->pfn_lo >= cached_iova->pfn_lo)
>> +            free->pfn_lo >= cached_iova->pfn_lo) {
>>                  iovad->cached32_node = rb_next(&free->node);
>> +                iovad->failed_alloc_size += (free->pfn_hi - free->pfn_lo);
>> +        }
>>

thanks
Ganapat
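
For reference, the scheme being converged on above can be modelled in a
few lines of userspace C. This is only an illustrative sketch: toy_domain,
toy_alloc, toy_free, the DMA_32BIT_PFN value and the single contiguous
free area are hypothetical stand-ins for the real rbtree-based allocator
in drivers/iommu/iova.c, and max32_alloc_size is merely the field name
proposed earlier in the thread. It shows the three moving parts under
discussion: the short-circuit check and the bound update both happen
under the lock, a failed large request no longer blocks smaller ones, and
freeing simply resets the bound rather than accounting for free space
exactly.

/* Toy model of the "track the last failed 32-bit allocation size" idea.
 * All names here are hypothetical stand-ins for the real IOVA allocator. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define DMA_32BIT_PFN	(1UL << 20)	/* arbitrary model boundary */

struct toy_domain {
	pthread_mutex_t lock;		/* stands in for iova_rbtree_lock */
	unsigned long free_below_32bit;	/* model: one contiguous free area */
	unsigned long max32_alloc_size;	/* upper bound on what can succeed */
};

static void toy_init(struct toy_domain *d)
{
	pthread_mutex_init(&d->lock, NULL);
	d->free_below_32bit = 1024;
	/* "Anything might be free": never short-circuit a fresh domain */
	d->max32_alloc_size = DMA_32BIT_PFN;
}

static bool toy_alloc(struct toy_domain *d, unsigned long size,
		      unsigned long limit_pfn)
{
	bool ok = false;

	pthread_mutex_lock(&d->lock);
	/* Short-circuit under the lock, as in the diff above: a request at
	 * least as big as the last failed one cannot succeed until
	 * something has been freed below the boundary. */
	if (limit_pfn <= DMA_32BIT_PFN && size >= d->max32_alloc_size)
		goto out_err;

	if (size <= d->free_below_32bit) {	/* stands in for the tree walk */
		d->free_below_32bit -= size;
		ok = true;
		goto out;
	}
out_err:
	/* Record the failed size, as in the out_err path above; on a
	 * genuine failed walk this strictly lowers the bound, since the
	 * walk is only reached when size < max32_alloc_size. */
	d->max32_alloc_size = size;
out:
	pthread_mutex_unlock(&d->lock);
	return ok;
}

static void toy_free(struct toy_domain *d, unsigned long size)
{
	pthread_mutex_lock(&d->lock);
	d->free_below_32bit += size;
	/* Replenish: reset the bound instead of doing exact accounting */
	d->max32_alloc_size = DMA_32BIT_PFN;
	pthread_mutex_unlock(&d->lock);
}

int main(void)
{
	struct toy_domain d;

	toy_init(&d);
	printf("big:   %d\n", toy_alloc(&d, 2048, DMA_32BIT_PFN)); /* 0: fails, bound = 2048 */
	printf("big:   %d\n", toy_alloc(&d, 2048, DMA_32BIT_PFN)); /* 0: short-circuited */
	printf("small: %d\n", toy_alloc(&d, 16, DMA_32BIT_PFN));   /* 1: still succeeds */
	toy_free(&d, 16);
	printf("big:   %d\n", toy_alloc(&d, 2048, DMA_32BIT_PFN)); /* 0: attempted again */
	return 0;
}

Resetting the bound wholesale on free keeps the fast path cheap; as
argued above, a conservative upper bound that occasionally permits a
doomed walk is preferable to exact accounting of the free space.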