Subject: Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction
From: John Garry
To: "Leizhen (ThunderTown)", Will Deacon
CC: Joerg Roedel, linux-arm-kernel, iommu, Robin Murphy, linux-kernel,
    Zefan Li, Xinwei Hu, Tianhong Ding, Hanjun Guo
Date: Mon, 17 Jul 2017 14:06:42 +0100

+

On 29/06/2017 03:08, Leizhen (ThunderTown) wrote:
>
>
> On 2017/6/28 17:32, Will Deacon wrote:
>> Hi Zhen Lei,
>>
>> Nate (CC'd), Robin and I have been working on something very similar to
>> this series, but this patch is different to what we had planned. More below.
>>
>> On Mon, Jun 26, 2017 at 09:38:46PM +0800, Zhen Lei wrote:
>>> Because every TLBI command should be followed by a SYNC command to make
>>> sure it has completely finished, we can just add the TLBI commands to the
>>> queue and put off their execution until a SYNC or other command arrives.
>>> To prevent the following SYNC command from waiting too long because too
>>> many commands have been delayed, restrict the maximum number of delayed
>>> commands.
>>>
>>> According to my test, I got the same performance data as when I replaced
>>> writel with writel_relaxed in queue_inc_prod.
>>>
>>> Signed-off-by: Zhen Lei
>>> ---
>>>  drivers/iommu/arm-smmu-v3.c | 42 +++++++++++++++++++++++++++++++++++++-----
>>>  1 file changed, 37 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>>> index 291da5f..4481123 100644
>>> --- a/drivers/iommu/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm-smmu-v3.c
>>> @@ -337,6 +337,7 @@
>>>  /* Command queue */
>>>  #define CMDQ_ENT_DWORDS			2
>>>  #define CMDQ_MAX_SZ_SHIFT		8
>>> +#define CMDQ_MAX_DELAYED		32
>>>
>>>  #define CMDQ_ERR_SHIFT			24
>>>  #define CMDQ_ERR_MASK			0x7f
>>> @@ -472,6 +473,7 @@ struct arm_smmu_cmdq_ent {
>>>  		};
>>>  	} cfgi;
>>>
>>> +	#define CMDQ_OP_TLBI_NH_ALL	0x10
>>>  	#define CMDQ_OP_TLBI_NH_ASID	0x11
>>>  	#define CMDQ_OP_TLBI_NH_VA	0x12
>>>  	#define CMDQ_OP_TLBI_EL2_ALL	0x20
>>> @@ -499,6 +501,7 @@ struct arm_smmu_cmdq_ent {
>>>
>>>  struct arm_smmu_queue {
>>>  	int				irq; /* Wired interrupt */
>>> +	u32				nr_delay;
>>>
>>>  	__le64				*base;
>>>  	dma_addr_t			base_dma;
>>> @@ -722,11 +725,16 @@ static int queue_sync_prod(struct arm_smmu_queue *q)
>>>  	return ret;
>>>  }
>>>
>>> -static void queue_inc_prod(struct arm_smmu_queue *q)
>>> +static void queue_inc_swprod(struct arm_smmu_queue *q)
>>>  {
>>> -	u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1;
>>> +	u32 prod = q->prod + 1;
>>>
>>>  	q->prod = Q_OVF(q, q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
>>> +}
>>> +
>>> +static void queue_inc_prod(struct arm_smmu_queue *q)
>>> +{
>>> +	queue_inc_swprod(q);
>>>  	writel(q->prod, q->prod_reg);
>>>  }
>>>
>>> @@ -761,13 +769,24 @@ static void queue_write(__le64 *dst, u64 *src, size_t n_dwords)
>>>  		*dst++ = cpu_to_le64(*src++);
>>>  }
>>>
>>> -static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent)
>>> +static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent, int optimize)
>>>  {
>>>  	if (queue_full(q))
>>>  		return -ENOSPC;
>>>
>>>  	queue_write(Q_ENT(q, q->prod), ent, q->ent_dwords);
>>> -	queue_inc_prod(q);
>>> +
>>> +	/*
>>> +	 * We don't want too many commands to be delayed, this may lead the
>>> +	 * followed sync command to wait for a long time.
>>> +	 */
>>> +	if (optimize && (++q->nr_delay < CMDQ_MAX_DELAYED)) {
>>> +		queue_inc_swprod(q);
>>> +	} else {
>>> +		queue_inc_prod(q);
>>> +		q->nr_delay = 0;
>>> +	}
>>> +
>>
>> So here, you're effectively putting invalidation commands into the command
>> queue without updating PROD. Do you actually see a performance advantage
>> from doing so? Another side of the argument would be that we should be
> Yes, my SAS SSD performance test showed that it can improve performance by
> about 100-150K/s (the same as when I directly replaced writel with
> writel_relaxed). And the average execution time of iommu_unmap (which is
> called by iommu_dma_unmap_sg) dropped from 10us to 5us.
>
>> moving PROD as soon as we can, so that the SMMU can process invalidation
>> commands in the background and reduce the cost of the final SYNC operation
>> when the high-level unmap operation is complete.
> It may be that __iowmb() is more expensive than waiting for the TLBI to
> complete. Besides the time of __iowmb() itself, it is also protected by the
> spinlock, so lock contention rises rapidly under stress. __iowmb() costs
> 300-500ns on average (sorry, I forget the exact value).
>
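To spell out where that cost comes from: on arm64, writel() is roughly
__iowmb() followed by writel_relaxed(), so every PROD doorbell update made
under the cmdq spinlock pays for an ordering barrier as well as the MMIO
store, whereas a deferred update pays it once per batch. Below is a
simplified, self-contained model of the two write flavours (illustrative
only, not the kernel's actual definitions):

#include <stdint.h>

/* Stand-in for __iowmb(): order prior (device-visible) stores. */
static inline void io_write_barrier(void)
{
	__atomic_thread_fence(__ATOMIC_RELEASE);
}

/* Model of writel_relaxed(): just the MMIO store, no ordering guarantee. */
static inline void model_writel_relaxed(uint32_t val, volatile uint32_t *reg)
{
	*reg = val;
}

/* Model of writel(): barrier first, then the store -- the expensive part. */
static inline void model_writel(uint32_t val, volatile uint32_t *reg)
{
	io_write_barrier();
	model_writel_relaxed(val, reg);
}

With the delayed-PROD scheme, N batched TLBIs plus the final SYNC pay for
roughly one barrier-plus-doorbell write instead of N+1.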
> In addition, after applying this patchset, Robin's v2, and my earlier dma64
> IOVA optimization patchset, our net performance test got the same data as
> global bypass. But SAS SSD still shows a drop of more than 20%. Maybe we
> should still focus on map/unmap, because the average execution time of IOVA
> alloc/free is only about 400ns.
>
> By the way, patches 2-5 are more effective than this one: they can improve
> performance by more than 350K/s. And with them, we got about a 100-150K/s
> improvement from Robin's v2; otherwise, I saw no effect from Robin's v2.
> Sorry, I have not tested this patch without patches 2-5. Furthermore, I got
> the same performance data as global bypass for a traditional mechanical hard
> disk with only patches 2-5 (without this patch and Robin's).
>
>>
>> Will
>>
>
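For reference, the policy in patch 1 quoted above can be summarised as: a
TLBI entry only advances the software copy of PROD, and the hardware-visible
PROD (with its barrier) is written once a SYNC or other command arrives, or
once CMDQ_MAX_DELAYED entries have accumulated. A self-contained toy model of
that policy (names are illustrative, not the driver code):

#include <stdio.h>
#include <stdbool.h>

#define CMDQ_MAX_DELAYED	32

struct toy_queue {
	unsigned int sw_prod;		/* software copy of PROD              */
	unsigned int hw_prod;		/* last value published to the device */
	unsigned int nr_delay;		/* entries queued but not published   */
	unsigned int doorbell_writes;	/* barrier + MMIO doorbell updates    */
};

/* Publish everything queued so far: one barrier + one doorbell write. */
static void toy_publish(struct toy_queue *q)
{
	q->hw_prod = q->sw_prod;
	q->nr_delay = 0;
	q->doorbell_writes++;
}

/* Queue one command; TLBIs may be delayed, SYNC and others publish. */
static void toy_issue(struct toy_queue *q, bool delayable)
{
	q->sw_prod++;				/* entry written into the ring */
	if (delayable && ++q->nr_delay < CMDQ_MAX_DELAYED)
		return;				/* defer the doorbell          */
	toy_publish(q);
}

int main(void)
{
	struct toy_queue q = { 0 };
	int i;

	for (i = 0; i < 16; i++)
		toy_issue(&q, true);	/* 16 TLBIs: doorbell untouched       */
	toy_issue(&q, false);		/* CMD_SYNC: publishes all 17 at once */

	printf("doorbell writes: %u (vs %u without batching)\n",
	       q.doorbell_writes, q.sw_prod);
	return 0;
}

Running this prints one doorbell write for 16 TLBIs plus a SYNC, versus 17 if
every insertion published PROD immediately.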