From: zhangfei <zhang_fei_0403@163.com>
To: ajones@ventanamicro.com
Cc: aou@eecs.berkeley.edu, conor.dooley@microchip.com,
    linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org,
    palmer@dabbelt.com, paul.walmsley@sifive.com,
    zhang_fei_0403@163.com, zhangfei@nj.iscas.ac.cn
Subject: Re:
[PATCH v2 2/2] RISC-V: lib: Optimize memset performance
Date: Fri, 12 May 2023 16:51:24 +0800
Message-Id: <20230512085124.3204-1-zhang_fei_0403@163.com>
In-Reply-To: <20230511-0b91da227b91eee76f98c6b0@orel>
References: <20230511-0b91da227b91eee76f98c6b0@orel>
List-ID: linux-kernel@vger.kernel.org

On Thu, May 11, 2023 at 15:43:26PM +0200, Andrew Jones wrote:
> On Thu, May 11, 2023 at 09:34:53AM +0800, zhangfei wrote:
> > From: zhangfei
> >
> > Optimized performance when the data size is less than 16 bytes.
> > Compared to byte-by-byte stores, a significant performance
> > improvement has been achieved: it allows store instructions to
> > execute in parallel and reduces the number of branches.
>
> Please wrap commit message lines at 74 chars.
>
> > Additional checks can avoid redundant stores.
> >
> > Signed-off-by: Fei Zhang
> > ---
> >  arch/riscv/lib/memset.S | 40 +++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 37 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
> > index e613c5c27998..452764bc9900 100644
> > --- a/arch/riscv/lib/memset.S
> > +++ b/arch/riscv/lib/memset.S
> > @@ -106,9 +106,43 @@ WEAK(memset)
> >  	beqz a2, 6f
> >  	add a3, t0, a2
> >  5:
> > -	sb a1, 0(t0)
> > -	addi t0, t0, 1
> > -	bltu t0, a3, 5b
> > +	/* fill head and tail with minimal branching */
> > +	sb a1, 0(t0)
> > +	sb a1, -1(a3)
> > +	li a4, 2
> > +	bgeu a4, a2, 6f
> > +
> > +	sb a1, 1(t0)
> > +	sb a1, 2(t0)
> > +	sb a1, -2(a3)
> > +	sb a1, -3(a3)
> > +	li a4, 6
> > +	bgeu a4, a2, 6f
> > +
> > +	/*
> > +	 * Adding additional detection to avoid
> > +	 * redundant stores can lead
> > +	 * to better performance
> > +	 */
> > +	sb a1, 3(t0)
> > +	sb a1, -4(a3)
> > +	li a4, 8
> > +	bgeu a4, a2, 6f
> > +
> > +	sb a1, 4(t0)
> > +	sb a1, -5(a3)
> > +	li a4, 10
> > +	bgeu a4, a2, 6f
>
> These extra checks feel ad hoc to me. Naturally you'll get better results
> for 8 byte memsets when there's a branch to the ret after 8 bytes. But
> what about 9? I'd think you'd want benchmarks from 1 to 15 bytes to show
> how it performs better or worse than byte by byte for each of those sizes.
> Also, while 8 bytes might be worth special casing, I'm not sure why 10
> would be. What makes 10 worth optimizing more than 11?
>
> Finally, microbenchmarking is quite hardware-specific and energy
> consumption should probably also be considered. What energy cost is
> there from making redundant stores? Is it worth it?
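[Editor's note: the head/tail overlapping-store pattern used by the patch
above can be sketched in C. This is a hypothetical illustration of the
technique only, not the kernel code; the real implementation is the
RISC-V assembly quoted above.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical C sketch of the head/tail store pattern for small
 * memsets (n in [1, 15]): bytes are written from both ends of the
 * buffer, and later stores may overlap earlier ones. Redundant
 * stores are traded for fewer branches and more store-level
 * parallelism, mirroring the sb/li/bgeu stages in the patch.
 */
static void memset_small(unsigned char *p, unsigned char c, size_t n)
{
	unsigned char *end = p + n;

	p[0] = c; end[-1] = c;          /* covers every n <= 2 */
	if (n <= 2)
		return;
	p[1] = c; p[2] = c;
	end[-2] = c; end[-3] = c;       /* covers every n <= 6 */
	if (n <= 6)
		return;
	p[3] = c; end[-4] = c;          /* covers every n <= 8 */
	if (n <= 8)
		return;
	p[4] = c; end[-5] = c;          /* covers every n <= 10 */
	if (n <= 10)
		return;
	for (size_t i = 5; i + 5 < n; i++)  /* middle bytes, n in [11, 15] */
		p[i] = c;
}
```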
Hi,

I added a test from 1 to 15 bytes to the benchmarks. The test results
are as follows:

Before optimization (bytes/ns):
 1B: 0.06   2B: 0.10   3B: 0.12   4B: 0.14   5B: 0.15
 6B: 0.17   7B: 0.17   8B: 0.18   9B: 0.19  10B: 0.19
11B: 0.20  12B: 0.20  13B: 0.20  14B: 0.21  15B: 0.21

After optimization (bytes/ns):
 1B: 0.05   2B: 0.10   3B: 0.11   4B: 0.15   5B: 0.19
 6B: 0.23   7B: 0.23   8B: 0.26   9B: 0.24  10B: 0.27
11B: 0.25  12B: 0.27  13B: 0.28  14B: 0.30  15B: 0.31

From the above results, performance for 1-4 bytes is similar, with a
significant improvement for 5-15 bytes. They also show that redundant
stores do indeed cause performance degradation, for example at 9 and
11 bytes.

Next, I modified the code to check at 2, 6, 8, 11 and 14, as shown
below:

	sb a1, 4(t0)
	sb a1, 5(t0)
	sb a1, -5(a3)
	li a4, 11
	bgeu a4, a2, 6f

	sb a1, 6(t0)
	sb a1, -6(a3)
	sb a1, -7(a3)
	li a4, 14
	bgeu a4, a2, 6f

The results obtained this way are as follows:

After optimization (bytes/ns):
 1B: 0.05   2B: 0.10   3B: 0.11   4B: 0.15   5B: 0.19
 6B: 0.23   7B: 0.23   8B: 0.27   9B: 0.23  10B: 0.26
11B: 0.29  12B: 0.26  13B: 0.28  14B: 0.29  15B: 0.31

From the above results, when the check is moved to 11, performance at
11 bytes improves from 0.25 bytes/ns to 0.29 bytes/ns. Is it possible
to minimize redundant stores while keeping stores parallel, to achieve
optimal performance? To find out, I modified the code to check at 2,
4, 6, 8, 10, 12 and 14, as shown below:

	sb a1, 4(t0)
	sb a1, -5(a3)
	li a4, 10
	bgeu a4, a2, 6f

	sb a1, 5(t0)
	sb a1, -6(a3)
	li a4, 12
	bgeu a4, a2, 6f

	sb a1, 6(t0)
	sb a1, -7(a3)
	li a4, 14
	bgeu a4, a2, 6f

The results obtained this way are as follows:

After optimization (bytes/ns):
 1B: 0.05   2B: 0.10   3B: 0.12   4B: 0.17   5B: 0.18
 6B: 0.21   7B: 0.22   8B: 0.25   9B: 0.24  10B: 0.26
11B: 0.25  12B: 0.27  13B: 0.26  14B: 0.27  15B: 0.29

From the above results, this approach did not achieve the best
performance.
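[Editor's note: the redundant-store trade-off can be made concrete by
counting the stores each stage issues. The helper below is a
hypothetical accounting for the 2/6/8/11/14 checkpoint variant quoted
above, with per-stage store counts taken from the snippets: the exact
checkpoint sizes incur zero redundant stores, while in-between sizes
such as 9 pay for extra stores, matching the dip seen at 9B.]

```c
#include <assert.h>

/*
 * Hypothetical accounting for the 2/6/8/11/14 checkpoint variant:
 * each stage issues a fixed number of sb instructions, then branches
 * out if the requested length is already covered.
 */
static const int checkpoint[]   = {2, 6, 8, 11, 14};
static const int stage_stores[] = {2, 4, 2, 3, 3};

/* total stores issued for an n-byte memset, n in [1, 14] */
static int stores_issued(int n)
{
	int total = 0;

	for (int s = 0; s < 5; s++) {
		total += stage_stores[s];
		if (n <= checkpoint[s])
			return total;
	}
	return total;
}
```

Redundancy is stores_issued(n) - n: it is zero at the checkpoints
(2, 6, 8, 11, 14) and positive in between, e.g. an n of 9 issues 11
stores, two of them redundant.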
Based on the above experiments, my conclusions are:

1. This optimization method inevitably results in some duplicate
stores. I cannot reach the optimal state for every byte count: for
example, when I add a check at 9, performance at 9 naturally improves,
but 10 and 11 may become worse due to redundant stores. Therefore, I
need to make a trade-off between redundant stores and parallelism,
such as checking at 9 or 10, or elsewhere.

2. Store parallelism and fewer branches compensate for the cost of the
redundant stores. Across all test runs so far, regardless of which
byte counts I check, performance is better than byte-by-byte stores.

3. Of the variants above, checking at 2, 6, 8, 11 and 14 gives the
best overall performance.

Because I am not a chip designer, I find it difficult to quantify the
energy cost. Do you have any suggestions, and how should testing be
conducted in this regard? My view is that although the number of
stores has increased, there is a corresponding reduction in branches
and better use of the pipeline.

Thanks,
Fei Zhang