Subject: Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching
From: "Leizhen (ThunderTown)"
To: Robin Murphy, Nate Watterson
Date: Tue, 19 Sep 2017 17:42:44 +0800
Message-ID: <59C0E694.6080508@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2017/7/31 19:42, Robin Murphy wrote:
> Hi Nate,
>
> On 29/07/17 04:57, Nate Watterson wrote:
>> Hi Robin,
>> I am seeing a crash when performing very basic testing on this series
>> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
>> patch is the culprit, but this rcache business is still mostly
>> witchcraft to me.
>>
>> # ifconfig eth5 up
>> # ifconfig eth5 down
>> Unable to handle kernel NULL pointer dereference at virtual address
>> 00000020
>> user pgtable: 64k pages, 48-bit VAs, pgd = ffff8007dbf47c00
>> [0000000000000020] *pgd=00000006efab0003, *pud=00000006efab0003,
>> *pmd=00000007d8720003, *pte=0000000000000000
>> Internal error: Oops: 96000007 [#1] SMP
>> Modules linked in:
>> CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
>> task: ffff8007da1e5780 task.stack: ffff8007ddcb8000
>> PC is at __cached_rbnode_delete_update+0x2c/0x58
>> LR is at private_free_iova+0x2c/0x60
>> pc : [] lr : [] pstate: 204001c5
>> sp : ffff8007ddcbba00
>> x29: ffff8007ddcbba00 x28: ffff8007c8350210
>> x27: ffff8007d1a80000 x26: ffff8007dcc20800
>> x25: 0000000000000140 x24: ffff8007c98f0008
>> x23: 00000000fffffe4e x22: 0000000000000140
>> x21: ffff8007c98f0008 x20: ffff8007c9adb240
>> x19: ffff8007c98f0018 x18: 0000000000000010
>> x17: 0000000000000000 x16: 0000000000000000
>> x15: 4000000000000000 x14: 0000000000000000
>> x13: 00000000ffffffff x12: 0000000000000001
>> x11: dead000000000200 x10: 00000000ffffffff
>> x9 : 0000000000000000 x8 : ffff8007c9adb1c0
>> x7 : 0000000040002000 x6 : 0000000000210d00
>> x5 : 0000000000000000 x4 : 000000000000c57e
>> x3 : 00000000ffffffcf x2 : 00000000ffffffcf
>> x1 : ffff8007c9adb240 x0 : 0000000000000000
>> [...]
>> [] __cached_rbnode_delete_update+0x2c/0x58
>> [] private_free_iova+0x2c/0x60
>> [] iova_magazine_free_pfns+0x4c/0xa0
>> [] free_iova_fast+0x1b0/0x230
>> [] iommu_dma_free_iova+0x5c/0x80
>> [] __iommu_dma_unmap+0x5c/0x98
>> [] iommu_dma_unmap_resource+0x24/0x30
>> [] iommu_dma_unmap_page+0xc/0x18
>> [] __iommu_unmap_page+0x40/0x60
>> [] mlx5e_page_release+0xbc/0x128
>> [] mlx5e_dealloc_rx_wqe+0x30/0x40
>> [] mlx5e_close_channel+0x70/0x1f8
>> [] mlx5e_close_channels+0x2c/0x50
>> [] mlx5e_close_locked+0x54/0x68
>> [] mlx5e_close+0x30/0x58
>> [...]
>>
>> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>> 92|	if (free->pfn_hi < iovad->dma_32bit_pfn)
>> FFFF00000852C6C4|	ldr	x3,[x1,#0x18]	; x3,[free,#24]
>> FFFF00000852C6C8|	ldr	x2,[x0,#0x30]	; x2,[iovad,#48]
>> FFFF00000852C6CC|	cmp	x3,x2
>> FFFF00000852C6D0|	b.cs	0xFFFF00000852C708
>>   |		curr = &iovad->cached32_node;
>> 94|	if (!curr)
>> FFFF00000852C6D4|	adds	x19,x0,#0x18	; x19,iovad,#24
>> FFFF00000852C6D8|	b.eq	0xFFFF00000852C708
>>   |
>>   |	cached_iova = rb_entry(*curr, struct iova, node);
>>   |
>> 99|	if (free->pfn_lo >= cached_iova->pfn_lo)
>> FFFF00000852C6DC|	ldr	x0,[x19]	; xiovad,[curr]
>> FFFF00000852C6E0|	ldr	x2,[x1,#0x20]	; x2,[free,#32]
>> FFFF00000852C6E4|	ldr	x0,[x0,#0x20]	; x0,[x0,#32]
>> Apparently cached_iova was NULL so the pfn_lo access faulted.
>>
>> FFFF00000852C6E8|	cmp	x2,x0
>> FFFF00000852C6EC|	b.cc	0xFFFF00000852C6FC
>> FFFF00000852C6F0|	mov	x0,x1	; x0,free
>> 100|		*curr = rb_next(&free->node);
>> After instrumenting the code a bit, this seems to be the culprit. In the
>> previous call, free->pfn_lo was 0xFFFF_FFFF, which is actually the
>> dma_limit for the domain, so rb_next() returns NULL.
>>
>> Let me know if you have any questions or would like additional tests
>> run. I also applied your "DMA domain debug info" patches and dumped the
>> contents of the domain at each of the steps above in case that would be
>> useful. If nothing else, they reinforce how thirsty the CX4 NIC is,
>> especially when using 64k pages and many CPUs.
>
> Thanks for the report - I somehow managed to reason myself out of
> keeping the "no cached node" check in __cached_rbnode_delete_update() on
> the assumption that it must always be set by a previous allocation.
> However, there is indeed just one case for which that fails: when you
> free any IOVA immediately after freeing the very topmost one. Which is
> something that freeing an entire magazine's worth of IOVAs back to the
> tree all at once has a very real chance of doing...
>
> The obvious straightforward fix is inline below, but I'm now starting to
> understand the appeal of reserving a sentinel node to ensure the tree
> can never be empty, so I might have a quick go at that to see if it
> results in nicer code overall.
>
> Robin.
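
Just to check that I follow: putting the inline fix you show further down
together with this patch, __cached_rbnode_delete_update() would end up
roughly like the below (my own untested reconstruction, only to make sure
I'm reading it right):

static void
__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
{
	struct iova *cached_iova;
	struct rb_node **curr = NULL;

	/* pick the cached node for the range the freed IOVA falls in */
	if (free->pfn_hi < iovad->dma_32bit_pfn)
		curr = &iovad->cached32_node;
	if (!curr)
		curr = &iovad->cached_node;

	/*
	 * Your fix: nothing is cached here, e.g. because the topmost node
	 * was just freed and rb_next() of the last node returned NULL, so
	 * there is nothing to update.
	 */
	if (!*curr)
		return;

	cached_iova = rb_entry(*curr, struct iova, node);

	if (free->pfn_lo >= cached_iova->pfn_lo)
		*curr = rb_next(&free->node);
}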
>> -Nate
>>
>> On 7/21/2017 7:42 AM, Robin Murphy wrote:
>>> The cached node mechanism provides a significant performance benefit for
>>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>>> or where the 32-bit space is full, the loss of this benefit can be
>>> significant - on large systems there can be many thousands of entries in
>>> the tree, such that traversing to the end then walking all the way down
>>> to find free space every time becomes increasingly awful.
>>>
>>> Maintain a similar cached node for the whole IOVA space as a superset of
>>> the 32-bit space so that performance can remain much more consistent.
>>>
>>> Inspired by work by Zhen Lei.
>>>
>>> Tested-by: Ard Biesheuvel
>>> Tested-by: Zhen Lei
>>> Signed-off-by: Robin Murphy
>>> ---
>>>
>>> v2: No change
>>>
>>>  drivers/iommu/iova.c | 59 +++++++++++++++++++++++++---------------------------
>>>  include/linux/iova.h |  3 ++-
>>>  2 files changed, 30 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index d094d1ca8f23..f5809a2ee6c2 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>>>  	spin_lock_init(&iovad->iova_rbtree_lock);
>>>  	iovad->rbroot = RB_ROOT;
>>> +	iovad->cached_node = NULL;
>>>  	iovad->cached32_node = NULL;
>>>  	iovad->granule = granule;
>>>  	iovad->start_pfn = start_pfn;
>>> @@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
>>>  static struct rb_node *
>>>  __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
>>>  {
>>> -	if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>>> -		(iovad->cached32_node == NULL))
>>> +	struct rb_node *cached_node = NULL;
>>> +	struct iova *curr_iova;
>>> +
>>> +	if (*limit_pfn <= iovad->dma_32bit_pfn)
>>> +		cached_node = iovad->cached32_node;
>>> +	if (!cached_node)
>>> +		cached_node = iovad->cached_node;
>>> +	if (!cached_node)
>>>  		return rb_last(&iovad->rbroot);
>>> -	else {
>>> -		struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>>> -		struct iova *curr_iova =
>>> -			rb_entry(iovad->cached32_node, struct iova, node);
>>> -		*limit_pfn = curr_iova->pfn_lo;
>>> -		return prev_node;
>>> -	}
>>> +
>>> +	curr_iova = rb_entry(cached_node, struct iova, node);
>>> +	*limit_pfn = curr_iova->pfn_lo;
>>> +
>>> +	return rb_prev(cached_node);
>>>  }
>>>
>>>  static void
>>> -__cached_rbnode_insert_update(struct iova_domain *iovad,
>>> -	unsigned long limit_pfn, struct iova *new)
>>> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
>>>  {
>>> -	if (limit_pfn != iovad->dma_32bit_pfn)
>>> -		return;
>>> -	iovad->cached32_node = &new->node;
>>> +	if (new->pfn_lo > iovad->dma_32bit_pfn)
>>> +		iovad->cached_node = &new->node;
>>> +	else
>>> +		iovad->cached32_node = &new->node;
>>>  }
>>>
>>>  static void
>>>  __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
>>>  {
>>>  	struct iova *cached_iova;
>>> -	struct rb_node *curr;
>>> +	struct rb_node **curr = NULL;
>>>
>>> -	if (!iovad->cached32_node)
>>> -		return;
>>> -	curr = iovad->cached32_node;
>>> -	cached_iova = rb_entry(curr, struct iova, node);
>>> +	if (free->pfn_hi < iovad->dma_32bit_pfn)
>>> +		curr = &iovad->cached32_node;
>>> +	if (!curr)
>>> +		curr = &iovad->cached_node;
>
> +	if (!*curr)
> +		return;

Would it be worth trying the following adjustment instead?

+	if (free->pfn_hi < iovad->dma_32bit_pfn)
+		curr = &iovad->cached32_node;
+	else
+		curr = &iovad->cached_node;
+
+	if (!*curr) {
+		*curr = rb_next(&free->node);
+		return;
+	}
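
That is, always pick the cached pointer that matches the range of the
freed IOVA, and if that cache happens to be empty, re-seed it from the
freed node's successor instead of leaving it unset. Folded into the
function, I imagine it would read something like this (untested, just to
show the intent):

static void
__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
{
	struct iova *cached_iova;
	struct rb_node **curr;

	/* always operate on the cache matching the freed range */
	if (free->pfn_hi < iovad->dma_32bit_pfn)
		curr = &iovad->cached32_node;
	else
		curr = &iovad->cached_node;

	/*
	 * Empty cache: re-seed it from the freed node's successor (which
	 * may itself be NULL if the topmost node was just freed).
	 */
	if (!*curr) {
		*curr = rb_next(&free->node);
		return;
	}

	cached_iova = rb_entry(*curr, struct iova, node);

	if (free->pfn_lo >= cached_iova->pfn_lo)
		*curr = rb_next(&free->node);
}

The hope is that the next allocation in that range could then start from
somewhere near the freed node rather than falling back to the other
cached node or rb_last(); I have not measured whether that actually
helps, though.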
>
>>> -	if (free->pfn_lo >= cached_iova->pfn_lo) {
>>> -		struct rb_node *node = rb_next(&free->node);
>>> -		struct iova *iova = rb_entry(node, struct iova, node);
>>> +	cached_iova = rb_entry(*curr, struct iova, node);
>>>
>>> -		/* only cache if it's below 32bit pfn */
>>> -		if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
>>> -			iovad->cached32_node = node;
>>> -		else
>>> -			iovad->cached32_node = NULL;
>>> -	}
>>> +	if (free->pfn_lo >= cached_iova->pfn_lo)
>>> +		*curr = rb_next(&free->node);
>>>  }
>>>
>>>  /* Insert the iova into domain rbtree by holding writer lock */
>>> @@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>>  {
>>>  	struct rb_node *prev, *curr = NULL;
>>>  	unsigned long flags;
>>> -	unsigned long saved_pfn;
>>>  	unsigned long align_mask = ~0UL;
>>>
>>>  	if (size_aligned)
>>> @@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>>  	/* Walk the tree backwards */
>>>  	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>>> -	saved_pfn = limit_pfn;
>>>  	curr = __get_cached_rbnode(iovad, &limit_pfn);
>>>  	prev = curr;
>>>  	while (curr) {
>>> @@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>>  	/* If we have 'prev', it's a valid place to start the insertion. */
>>>  	iova_insert_rbtree(&iovad->rbroot, new, prev);
>>> -	__cached_rbnode_insert_update(iovad, saved_pfn, new);
>>> +	__cached_rbnode_insert_update(iovad, new);
>>>
>>>  	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>>
>>> diff --git a/include/linux/iova.h b/include/linux/iova.h
>>> index e0a892ae45c0..0bb8df43b393 100644
>>> --- a/include/linux/iova.h
>>> +++ b/include/linux/iova.h
>>> @@ -40,7 +40,8 @@ struct iova_rcache {
>>>  struct iova_domain {
>>>  	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
>>>  	struct rb_root	rbroot;		/* iova domain rbtree root */
>>> -	struct rb_node	*cached32_node; /* Save last alloced node */
>>> +	struct rb_node	*cached_node;	/* Save last alloced node */
>>> +	struct rb_node	*cached32_node; /* Save last 32-bit alloced node */
>>>  	unsigned long	granule;	/* pfn granularity for this domain */
>>>  	unsigned long	start_pfn;	/* Lower limit for this domain */
>>>  	unsigned long	dma_32bit_pfn;

-- 
Thanks!
Best Regards