Received: by 2002:a05:6a10:206:0:0:0:0 with SMTP id 6csp362716pxj; Thu, 10 Jun 2021 02:37:55 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxEmuCreJ8KmeDcdcbo4QWP4AxevqXTsWHAQ9lTgKEM6rEMvSkHWw8yyFjO5wYxXjmDjktD X-Received: by 2002:a17:906:d85:: with SMTP id m5mr3618598eji.55.1623317875701; Thu, 10 Jun 2021 02:37:55 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1623317875; cv=none; d=google.com; s=arc-20160816; b=j51xR8rhya1hiMby4lBXbIPmeA4gvYC1FCUFViTgyIKG9AQq+bFRv2zN/NPS+NPhOR YlXXjGZCow589JXBFT9P4jPDgf8n14BPlwUWE8N/lEqKXyw5iMud+RYe31yufrrlnEBU JaNrUTkTUj4diO0m0eR2u3nX2G52gvXFOgL2eKASN5zyJMeKUcdxxpAFDUwTNKHNBXSd rLbGG5Q7dP+v68Yh4joOKnFAu9HJ0zYJkfmC0YWxd2YIKGfS6iYOUDRReLhmAoKkm9xD Q83zpeD2+9p9/Weu8d9dnD/uhx0frKCXYk6CyGf1JvtrCf0tXWGt9UDt8iNm73/zOlHU rWNw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:user-agent:message-id:references:in-reply-to :subject:cc:to:from:date:content-transfer-encoding:mime-version :sender:dkim-signature; bh=3Nr5XeD9lkZbSovGodMNy9bg536mnJtqaK+uPHyUp30=; b=so1weSadLPuYw98wYAtVJJON1G3ujmj66IiLpm5U7JiDTiFvxepcyGKFrSkY+zhGyo kCAenY8T04c/OHIyksPWMyOrahEGqwf2rRSG/ilFa2rZMFwgm+Vi0X7Q9Y2sPvw1hQA4 XW/qS2b8fB5jK4D/FB3XyDvcMlbLdLeiv5v3UwuPbrCbohxk4VM5WnRTOFJXNm+MscRl wnP2I+CrwHMqmtNHDsr98Qv0j3hufxKwpQmX1uCi/YpmjMdtOvQV7677K/u+qmhHGe8e XTjCr9PkpgawXtv7ebu30fecy7wFPBDmc7/vs8UZqfj8pxwNOtOIyjDFjM642qZY00d8 1qcg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@mg.codeaurora.org header.s=smtp header.b=bEzSN0h7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id u8si1796205edv.293.2021.06.10.02.37.32; Thu, 10 Jun 2021 02:37:55 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@mg.codeaurora.org header.s=smtp header.b=bEzSN0h7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229966AbhFJJiY (ORCPT + 99 others); Thu, 10 Jun 2021 05:38:24 -0400 Received: from m43-7.mailgun.net ([69.72.43.7]:46617 "EHLO m43-7.mailgun.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229778AbhFJJiX (ORCPT ); Thu, 10 Jun 2021 05:38:23 -0400 DKIM-Signature: a=rsa-sha256; v=1; c=relaxed/relaxed; d=mg.codeaurora.org; q=dns/txt; s=smtp; t=1623317787; h=Message-ID: References: In-Reply-To: Subject: Cc: To: From: Date: Content-Transfer-Encoding: Content-Type: MIME-Version: Sender; bh=3Nr5XeD9lkZbSovGodMNy9bg536mnJtqaK+uPHyUp30=; b=bEzSN0h7rIiWN5TqEtKVt2InM63VGk5pFUhUhUCyAfkvHKzocVyWOSizQHHqvLS1kxNY5Xzg bOxUUpygKt+0ZAZzqdDst38jaOEPtZcpYN05hNjF5yRZapuye68NscnK2rDPBJvr5PUAXTaT W128T/aDWxZwHsE4ovfMsU6E76g= X-Mailgun-Sending-Ip: 69.72.43.7 X-Mailgun-Sid: WyI0MWYwYSIsICJsaW51eC1rZXJuZWxAdmdlci5rZXJuZWwub3JnIiwgImJlOWU0YSJd Received: from smtp.codeaurora.org (ec2-35-166-182-171.us-west-2.compute.amazonaws.com [35.166.182.171]) by smtp-out-n07.prod.us-east-1.postgun.com with SMTP id 60c1dd1be27c0cc77f048150 (version=TLS1.2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256); Thu, 10 Jun 2021 09:36:27 GMT Sender: saiprakash.ranjan=codeaurora.org@mg.codeaurora.org Received: by smtp.codeaurora.org (Postfix, from userid 1001) id 532DAC433D3; Thu, 10 Jun 2021 09:36:26 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-caf-mail-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=ALL_TRUSTED,BAYES_00, URIBL_BLOCKED autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.codeaurora.org (localhost.localdomain [127.0.0.1]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) (Authenticated sender: saiprakash.ranjan) by smtp.codeaurora.org (Postfix) with ESMTPSA id E64D4C433F1; Thu, 10 Jun 2021 09:36:23 +0000 (UTC) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Date: Thu, 10 Jun 2021 15:06:23 +0530 From: Sai Prakash Ranjan To: Robin Murphy Cc: linux-arm-msm@vger.kernel.org, linux-kernel@vger.kernel.org, iommu@lists.linux-foundation.org, Will Deacon , linux-arm-kernel@lists.infradead.org Subject: Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list In-Reply-To: <35bfd245-45e2-8083-b620-330d6dbd7bd7@arm.com> References: <20210609145315.25750-1-saiprakash.ranjan@codeaurora.org> <35bfd245-45e2-8083-b620-330d6dbd7bd7@arm.com> Message-ID: <12067ffb8243b220cf03e83aaac3e823@codeaurora.org> X-Sender: saiprakash.ranjan@codeaurora.org User-Agent: Roundcube Webmail/1.3.9 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Robin, On 2021-06-10 14:38, Robin Murphy wrote: > On 2021-06-10 06:24, Sai Prakash Ranjan wrote: >> Hi Robin, >> >> On 2021-06-10 00:14, Robin Murphy wrote: >>> On 2021-06-09 15:53, Sai Prakash Ranjan wrote: >>>> Currently for iommu_unmap() of large scatter-gather list with page >>>> size >>>> elements, the majority of time is spent in flushing of partial walks >>>> in >>>> __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for >>>> arm-smmu). >>>> >>>> For example: to unmap a 32MB scatter-gather list with page size >>>> elements >>>> (8192 entries), there are 16->2MB buffer unmaps based on the pgsize >>>> (2MB >>>> for 4K granule) and each of 2MB will further result in 512 TLBIVAs >>>> (2MB/4K) >>>> resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB causing a >>>> huge >>>> overhead. >>>> >>>> So instead use io_pgtable_tlb_flush_all() to invalidate the entire >>>> context >>>> if size (pgsize) is greater than the granule size (4K, 16K, 64K). >>>> For this >>>> example of 32MB scatter-gather list unmap, this results in just 16 >>>> ASID >>>> based TLB invalidations or tlb_flush_all() callback (TLBIASID in >>>> case of >>>> arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the >>>> performance of >>>> unmaps drastically. >>>> >>>> Condition (size > granule size) is chosen for >>>> io_pgtable_tlb_flush_all() >>>> because for any granule with supported pgsizes, we will have at >>>> least 512 >>>> TLB invalidations for which tlb_flush_all() is already recommended. >>>> For >>>> example, take 4K granule with 2MB pgsize, this will result in 512 >>>> TLBIVA >>>> in partial walk flush. >>>> >>>> Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap: >>>> (average over 10 iterations) >>>> >>>> Before this optimization: >>>> >>>>      size        iommu_map_sg      iommu_unmap >>>>        4K            2.067 us         1.854 us >>>>       64K            9.598 us         8.802 us >>>>        1M          148.890 us       130.718 us >>>>        2M          305.864 us        67.291 us >>>>       12M         1793.604 us       390.838 us >>>>       16M         2386.848 us       518.187 us >>>>       24M         3563.296 us       775.989 us >>>>       32M         4747.171 us      1033.364 us >>>> >>>> After this optimization: >>>> >>>>      size        iommu_map_sg      iommu_unmap >>>>        4K            1.723 us         1.765 us >>>>       64K            9.880 us         8.869 us >>>>        1M          155.364 us       135.223 us >>>>        2M          303.906 us         5.385 us >>>>       12M         1786.557 us        21.250 us >>>>       16M         2391.890 us        27.437 us >>>>       24M         3570.895 us        39.937 us >>>>       32M         4755.234 us        51.797 us >>>> >>>> This is further reduced once the map/unmap_pages() support gets in >>>> which >>>> will result in just 1 tlb_flush_all() as opposed to 16 >>>> tlb_flush_all(). >>>> >>>> Signed-off-by: Sai Prakash Ranjan >>>> --- >>>>   drivers/iommu/io-pgtable-arm.c | 7 +++++-- >>>>   1 file changed, 5 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/iommu/io-pgtable-arm.c >>>> b/drivers/iommu/io-pgtable-arm.c >>>> index 87def58e79b5..c3cb9add3179 100644 >>>> --- a/drivers/iommu/io-pgtable-arm.c >>>> +++ b/drivers/iommu/io-pgtable-arm.c >>>> @@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct >>>> arm_lpae_io_pgtable *data, >>>>             if (!iopte_leaf(pte, lvl, iop->fmt)) { >>>>               /* Also flush any partial walks */ >>>> -            io_pgtable_tlb_flush_walk(iop, iova, size, >>>> -                          ARM_LPAE_GRANULE(data)); >>>> +            if (size > ARM_LPAE_GRANULE(data)) >>>> +                io_pgtable_tlb_flush_all(iop); >>>> +            else >>> >>> Erm, when will the above condition ever not be true? ;) >>> >> >> Ah right, silly me :) >> >>> Taking a step back, though, what about the impact to drivers other >>> than SMMUv2? >> >> Other drivers would be msm_iommu.c, qcom_iommu.c which does the same >> thing as arm-smmu-v2 (page based invalidations), then there is >> ipmmu-vmsa.c >> which does tlb_flush_all() for flush walk. >> >>> In particular I'm thinking of SMMUv3.2 where the whole >>> range can be invalidated by VA in a single command anyway, so the >>> additional penalties of TLBIALL are undesirable. >>> >> >> Right, so I am thinking we can have a new generic quirk >> IO_PGTABLE_QUIRK_RANGE_INV >> to choose between range based invalidations(tlb_flush_walk) and >> tlb_flush_all(). >> In this case of arm-smmu-v3.2, we can tie up ARM_SMMU_FEAT_RANGE_INV >> with this quirk >> and have something like below, thoughts? >> >> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV) >>         io_pgtable_tlb_flush_walk(iop, iova, size, >>                                   ARM_LPAE_GRANULE(data)); >> else >>         io_pgtable_tlb_flush_all(iop); > > The design here has always been that io-pgtable says *what* needs > invalidating, and we left it up to the drivers to decide exactly > *how*. Even though things have evolved a bit I don't think that has > fundamentally changed - tlb_flush_walk is now only used in this one > place (technically I suppose it could be renamed tlb_flush_table but > it's not worth the churn), so drivers can implement their own > preferred table-invalidating behaviour even more easily than choosing > whether to bounce a quirk through the common code or not. Consider > what you've already seen for the Renesas IPMMU, or SMMUv1 stage 2... > Thanks for the explanation, makes sense. If I am not mistaken, I see that you are suggesting to move this logic based on size and granule-size to arm-smmu-v2 driver and one more thing below.. > I'm instinctively a little twitchy about making this a blanket > optimisation for SMMUv2 since I still remember the palaver with our > display and MMU-500 integrations, where it had to implement the dodgy > "prefetch" register to trigger translations before scanning out a > frame since it couldn't ever afford a TLB miss, thus TLBIALL when > freeing an old buffer would be a dangerous hammer to swing. However > IIRC it also had to ensure everything was mapped as 2MB blocks to > guarantee fitting everything in the TLBs in the first place, so I > guess it would still work out OK due to never realistically unmapping > a whole table at once anyway. > You are also hinting to not do this for all SMMUv2 implementations and make it QCOM specific? If I am wrong in my assumptions here, please let me know otherwise I will prepare the patch :) Thanks, Sai -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation