Received: by 2002:a25:e74b:0:0:0:0:0 with SMTP id e72csp972304ybh; Wed, 22 Jul 2020 19:02:30 -0700 (PDT) X-Google-Smtp-Source: ABdhPJymEE+jpjZ+qeJVIL7FwGgrdqacDScMqkuIFTO1gnBmXQIwnoQtPFc5V0HKfKryKa7qn5Vy X-Received: by 2002:a17:906:248a:: with SMTP id e10mr2250995ejb.454.1595469750319; Wed, 22 Jul 2020 19:02:30 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1595469750; cv=none; d=google.com; s=arc-20160816; b=ZgzfVPT4Nk4AHNsKBkgCSuHGOL/nUEY4Blrf0oY9/4qFAyb65O41kXIuVYp6pwCaNr 25oLHsBaAoH4c4J/jpp1V0uX1cPvuMyGfcdrmlwjXQqPNd0L0edFZJ8QJ7DhawkR0EJN b6E6TKizJRQhDzxBdx7eFiN5wx9iPwkdpO2wnE86gwaF1UxIEk/0Q6f0tUe2wfa9x//z R3fEIoNusGv98j6PYfglmNrOaEFzh0rXecnNS8K9kv9K7w/NnsEZuXHmeqh2QCmoRVX/ wWqymti2Mfn+ZHtDBgJS72u7otpy9WjFgF8ea151+mXq2k5AUrksU9Jq2Z3+5lm+gc0s SkDg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-language :content-transfer-encoding:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=7BJIr1sqBspRGmtuFNDMEDVlLFhCq8eDAIpf8G0DJHY=; b=PD2j2CEcVcFhpIhFV9mI4VYd81ICaAVDbxOq8hZabG4G+dU/Ilqmou2kfP8mirv2dM B6flBrhFfLfvzFquTTmLkou8wMzvPmsvUs9GdmVuTPZ8cEW5+hXKSXiMbhQSJ+nsDa7s W85zFpPL6FukHX8+O9uGGJXtOt6MuNlgxrSWJwKzUpF2hBJZmDcKh+7dbEK5osO709T+ k0ulnJFLHhM+zrAuyOSVmg8aTCvnBf3DEdwE4VXmhNLSCHmFKGmh0Wkqpu+MlQhKRBIn C/rncti/d1JWmWDcQdIDKo0mX2kLxSv8/PvCPJq/7uYgJdLIwP3GQMT6+gIso8Q1JL19 IAuA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id dg22si556003edb.25.2020.07.22.19.02.07; Wed, 22 Jul 2020 19:02:30 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387546AbgGWB7T (ORCPT + 99 others); Wed, 22 Jul 2020 21:59:19 -0400 Received: from out28-3.mail.aliyun.com ([115.124.28.3]:53700 "EHLO out28-3.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731394AbgGWB7S (ORCPT ); Wed, 22 Jul 2020 21:59:18 -0400 X-Alimail-AntiSpam: AC=CONTINUE;BC=0.07436282|-1;CH=green;DM=|CONTINUE|false|;DS=CONTINUE|ham_regular_dialog|0.0898497-0.00457316-0.905577;FP=0|0|0|0|0|-1|-1|-1;HT=e02c03298;MF=maochenxi@eswin.com;NM=1;PH=DS;RN=8;RT=8;SR=0;TI=SMTPD_---.I68w5aq_1595469552; Received: from 10.1.56.44(mailfrom:maochenxi@eswin.com fp:SMTPD_---.I68w5aq_1595469552) by smtp.aliyun-inc.com(10.194.99.21); Thu, 23 Jul 2020 09:59:13 +0800 Subject: Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I To: Palmer Dabbelt , kernel@esmil.dk Cc: Paul Walmsley , aou@eecs.berkeley.edu, linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, chenxi.mao2013@gmail.com, wangqiang1@eswin.com References: From: Chenxi Mao Message-ID: Date: Thu, 23 Jul 2020 09:59:12 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Content-Language: en-US Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Palmer and Emil: As Emil mentioned in previous E-mail loop, I did the same test on my kernel as well. My kernel is based on Linux 5.8-RC6 with GCC-10.1. (ISA C extension enabled) The disassembly code as below: CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled: 0000000000000000 <__sw_hweight32>:    0:    555557b7              lui    a5,0x55555    4:    0015571b              srliw    a4,a0,0x1    8:    55578793              addi    a5,a5,1365 # 55555555 <.LASF5+0x5555509d>    c:    8ff9                    and    a5,a5,a4    e:    9d1d                    subw    a0,a0,a5 0000000000000010 <.LVL1>:   10:    333337b7              lui    a5,0x33333   14:    33378793              addi    a5,a5,819 # 33333333 <.LASF5+0x33332e7b>   18:    0025571b              srliw    a4,a0,0x2   1c:    8d7d                    and    a0,a0,a5   1e:    8ff9                    and    a5,a5,a4   20:    9fa9                    addw    a5,a5,a0   22:    0047d51b              srliw    a0,a5,0x4   26:    9fa9                    addw    a5,a5,a0   28:    0f0f1537              lui    a0,0xf0f1   2c:    1141                    addi    sp,sp,-16   2e:    f0f50513              addi    a0,a0,-241 # f0f0f0f <.LASF5+0xf0f0a57>   32:    e422                    sd    s0,8(sp)   34:    8fe9                    and    a5,a5,a0   36:    0800                    addi    s0,sp,16   38:    0087951b              slliw    a0,a5,0x8   3c:    6422                    ld    s0,8(sp)   3e:    9d3d                    addw    a0,a0,a5   40:    0105179b              slliw    a5,a0,0x10   44:    9d3d                    addw    a0,a0,a5   46:    0185551b              srliw    a0,a0,0x18   4a:    0141                    addi    sp,sp,16   4c:    8082                    ret CONFIG_ARCH_HAS_FAST_MULTIPLIER disabled: 000000000000004e <__sw_hweight32_default>:   4e:    55555737              lui    a4,0x55555   52:    0015579b              srliw    a5,a0,0x1   56:    55570713              addi    a4,a4,1365 # 55555555 <.LASF5+0x5555509d>   5a:    8ff9                    and    a5,a5,a4   5c:    9d1d                    subw    a0,a0,a5 000000000000005e <.LVL3>:   5e:    333337b7              lui    a5,0x33333   62:    33378793              addi    a5,a5,819 # 33333333 <.LASF5+0x33332e7b>   66:    0025571b              srliw    a4,a0,0x2   6a:    8d7d                    and    a0,a0,a5   6c:    8ff9                    and    a5,a5,a4   6e:    9fa9                    addw    a5,a5,a0   70:    0047d51b              srliw    a0,a5,0x4   74:    9d3d                    addw    a0,a0,a5   76:    0f0f17b7              lui    a5,0xf0f1   7a:    1141                    addi    sp,sp,-16   7c:    f0f78793              addi    a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0a57>   80:    e422                    sd    s0,8(sp)   82:    8fe9                    and    a5,a5,a0   84:    0800                    addi    s0,sp,16   86:    0087d51b              srliw    a0,a5,0x8   8a:    6422                    ld    s0,8(sp)   8c:    9fa9                    addw    a5,a5,a0   8e:    0107d51b              srliw    a0,a5,0x10   92:    9d3d                    addw    a0,a0,a5   94:    0ff57513              andi    a0,a0,255   98:    0141                    addi    sp,sp,16   9a:    8082                    ret This 2 implementations is almost same but small differences. Especially in CONFIG_ARCH_HAS_FAST_MULTIPLIER condition,  below code didn't use "mul" instructions.     " return (w * 0x01010101) >> 24; " So I am trying to translate this code with inline assembly as below: //return (w * 0x01010101) >> 24; __asm__ ( " mul %0, %0, %1\n" : "+r" (w) : "r" (w), "r"(0x01010101) :); return w >> 24; After above change, the disassambly as below: 0000000000000000 <__sw_hweight32>:    0:    555557b7              lui    a5,0x55555    4:    0015571b              srliw    a4,a0,0x1    8:    55578793              addi    a5,a5,1365 # 55555555 <.LASF5+0x55555119>    c:    8ff9                    and    a5,a5,a4    e:    9d1d                    subw    a0,a0,a5 0000000000000010 <.LVL1>:   10:    333337b7              lui    a5,0x33333   14:    0025571b              srliw    a4,a0,0x2   18:    33378793              addi    a5,a5,819 # 33333333 <.LASF5+0x33332ef7>   1c:    8d7d                    and    a0,a0,a5   1e:    8ff9                    and    a5,a5,a4   20:    9fa9                    addw    a5,a5,a0   22:    0047d71b              srliw    a4,a5,0x4   26:    9f3d                    addw    a4,a4,a5   28:    0f0f17b7              lui    a5,0xf0f1   2c:    1141                    addi    sp,sp,-16   2e:    f0f78793              addi    a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0ad3>   32:    e422                    sd    s0,8(sp)   34:    8ff9                    and    a5,a5,a4   36:    0800                    addi    s0,sp,16   38:    01010737              lui    a4,0x1010   3c:    853e                    mv    a0,a5 000000000000003e <.LVL2>:   3e:    1017071b              addiw    a4,a4,257   42:    02f50533              mul    a0,a0,a5   46:    6422                    ld    s0,8(sp)   48:    0185551b              srliw    a0,a0,0x18 "mul" instruction is leveraged as expectation, but 0x01010101 load waste several instructions. Based on this test, force to leverage "mul" instruction might be not faster than current compiler implementations. I am not sure above assembly is the best way to load 0x01010101? I checked the ISA manual, "lui" only load 20 bits per time, is this the best way to load instants? On the other hand, I try to compare ARM64 disassembly code: .....    4:    3200c3e2     mov    w2, #0x1010101                 // #16843009 ......    w =  (w + (w >> 4)) & 0x0f0f0f0f;   20:    0b401000     add    w0, w0, w0, lsr #4   24:    1200cc00     and    w0, w0, #0xf0f0f0f     return (w * 0x01010101) >> 24;   28:    1b027c00     mul    w0, w0, w2 Only one "mov" instructions to load 0x1010101 and one "mul" instruction for multiply. Let me summary as below: 1.  GCC 10.1 cannot generate "mul" instruction when CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled. 2. force to generate "mul" didn't get better because instants load waste instructions. 3. If GCC compiler behavior is best solution for this case, we could have below work around on Riscv.  unsigned int __sw_hweight32(unsigned int w)  { -#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER +/* + * Risc-V could not generate mul(w) instruction in this case + */ +#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && !defined(CONFIG_RISCV)         w -= (w >> 1) & 0x55555555;         w =  (w & 0x33333333) + ((w >> 2) & 0x33333333);         w =  (w + (w >> 4)) & 0x0f0f0f0f; Chenxi On 2020/7/21 上午9:17, Palmer Dabbelt wrote: > On Wed, 08 Jul 2020 22:19:22 PDT (-0700), maochenxi@eswin.com wrote: >> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I >> which works fine on GCC-9.3 and GCC-10.1 >> >> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled. >> >> Signed-off-by: Chenxi Mao >> --- >>  arch/riscv/Kconfig | 1 + >>  1 file changed, 1 insertion(+) >> >> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig >> index 128192e14ff2..84e6777fecad 100644 >> --- a/arch/riscv/Kconfig >> +++ b/arch/riscv/Kconfig >> @@ -202,6 +202,7 @@ config ARCH_RV64I >>      bool "RV64I" >>      select 64BIT >>      select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000 >> +    select ARCH_HAS_FAST_MULTIPLIER >>      select HAVE_DYNAMIC_FTRACE if MMU >>      select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE >>      select HAVE_FTRACE_MCOUNT_RECORD > > Ah, thanks -- this one didn't show up when I was looking at the last one.  I > think we can put the fast multiplier on rv32 and rv64, there shouldn't be any > difference there.  I guess in theory we should be sticking this all in some > sort of "platform type" optimization flags, but that's probably bit much for > now.