Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
which works fine on GCC-9.3 and GCC-10.1
PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
Signed-off-by: Chenxi Mao <[email protected]>
---
arch/riscv/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 128192e14ff2..84e6777fecad 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -202,6 +202,7 @@ config ARCH_RV64I
bool "RV64I"
select 64BIT
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
+ select ARCH_HAS_FAST_MULTIPLIER
select HAVE_DYNAMIC_FTRACE if MMU
select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
select HAVE_FTRACE_MCOUNT_RECORD
--
2.25.1
On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
> which works fine on GCC-9.3 and GCC-10.1
>
> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>
> Signed-off-by: Chenxi Mao <[email protected]>
> ---
> arch/riscv/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 128192e14ff2..84e6777fecad 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -202,6 +202,7 @@ config ARCH_RV64I
> bool "RV64I"
> select 64BIT
> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
> + select ARCH_HAS_FAST_MULTIPLIER
> select HAVE_DYNAMIC_FTRACE if MMU
> select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
> select HAVE_FTRACE_MCOUNT_RECORD
Ah, thanks -- this one didn't show up when I was looking at the last one. I
think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
difference there. I guess in theory we should be sticking this all in some
sort of "platform type" optimization flags, but that's probably bit much for
now.
Hi Palmer:
Thanks for your reply.
Frankly, I didn't test ARCH_HAS_FAST_MULTIPLIER on RV32,
so I cannot put it in RISCV platform.
I am trying to comparing ARM64 with Riscv to find out more optimization
configurations.
I suggest to enable ARCH_HAS_FAST_MULTIPLIER on RV64 first.
If someone else evaluate this on RV32, we could move it to RISCV platform.
Chenxi
On 2020/7/21 上午9:17, Palmer Dabbelt wrote:
> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>> which works fine on GCC-9.3 and GCC-10.1
>>
>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>
>> Signed-off-by: Chenxi Mao <[email protected]>
>> ---
>> arch/riscv/Kconfig | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 128192e14ff2..84e6777fecad 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>> bool "RV64I"
>> select 64BIT
>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>> + select ARCH_HAS_FAST_MULTIPLIER
>> select HAVE_DYNAMIC_FTRACE if MMU
>> select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>> select HAVE_FTRACE_MCOUNT_RECORD
>
> Ah, thanks -- this one didn't show up when I was looking at the last one. I
> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
> difference there. I guess in theory we should be sticking this all in some
> sort of "platform type" optimization flags, but that's probably bit much for
> now.
Hi Palmer:
Move to RISCV platform is ok for me, but I cannot evaluate RV32 condition.
Chenxi
On 2020/7/21 上午9:47, Chenxi Mao wrote:
> Hi Palmer:
>
> Thanks for your reply.
>
> Frankly, I didn't test ARCH_HAS_FAST_MULTIPLIER on RV32,
>
> so I cannot put it in RISCV platform.
>
> I am trying to comparing ARM64 with Riscv to find out more optimization
>
> configurations.
>
> I suggest to enable ARCH_HAS_FAST_MULTIPLIER on RV64 first.
>
> If someone else evaluate this on RV32, we could move it to RISCV platform.
>
>
> Chenxi
>
>
> On 2020/7/21 上午9:17, Palmer Dabbelt wrote:
>> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
>>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>>> which works fine on GCC-9.3 and GCC-10.1
>>>
>>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>>
>>> Signed-off-by: Chenxi Mao <[email protected]>
>>> ---
>>> arch/riscv/Kconfig | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>> index 128192e14ff2..84e6777fecad 100644
>>> --- a/arch/riscv/Kconfig
>>> +++ b/arch/riscv/Kconfig
>>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>>> bool "RV64I"
>>> select 64BIT
>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>>> + select ARCH_HAS_FAST_MULTIPLIER
>>> select HAVE_DYNAMIC_FTRACE if MMU
>>> select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>>> select HAVE_FTRACE_MCOUNT_RECORD
>> Ah, thanks -- this one didn't show up when I was looking at the last one. I
>> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
>> difference there. I guess in theory we should be sticking this all in some
>> sort of "platform type" optimization flags, but that's probably bit much for
>> now.
Hi Palmer and Emil:
As Emil mentioned in previous E-mail loop, I did the same test on my kernel as well.
My kernel is based on Linux 5.8-RC6 with GCC-10.1. (ISA C extension enabled)
The disassembly code as below:
CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled:
0000000000000000 <__sw_hweight32>:
0: 555557b7 lui a5,0x55555
4: 0015571b srliw a4,a0,0x1
8: 55578793 addi a5,a5,1365 # 55555555 <.LASF5+0x5555509d>
c: 8ff9 and a5,a5,a4
e: 9d1d subw a0,a0,a5
0000000000000010 <.LVL1>:
10: 333337b7 lui a5,0x33333
14: 33378793 addi a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
18: 0025571b srliw a4,a0,0x2
1c: 8d7d and a0,a0,a5
1e: 8ff9 and a5,a5,a4
20: 9fa9 addw a5,a5,a0
22: 0047d51b srliw a0,a5,0x4
26: 9fa9 addw a5,a5,a0
28: 0f0f1537 lui a0,0xf0f1
2c: 1141 addi sp,sp,-16
2e: f0f50513 addi a0,a0,-241 # f0f0f0f <.LASF5+0xf0f0a57>
32: e422 sd s0,8(sp)
34: 8fe9 and a5,a5,a0
36: 0800 addi s0,sp,16
38: 0087951b slliw a0,a5,0x8
3c: 6422 ld s0,8(sp)
3e: 9d3d addw a0,a0,a5
40: 0105179b slliw a5,a0,0x10
44: 9d3d addw a0,a0,a5
46: 0185551b srliw a0,a0,0x18
4a: 0141 addi sp,sp,16
4c: 8082 ret
CONFIG_ARCH_HAS_FAST_MULTIPLIER disabled:
000000000000004e <__sw_hweight32_default>:
4e: 55555737 lui a4,0x55555
52: 0015579b srliw a5,a0,0x1
56: 55570713 addi a4,a4,1365 # 55555555 <.LASF5+0x5555509d>
5a: 8ff9 and a5,a5,a4
5c: 9d1d subw a0,a0,a5
000000000000005e <.LVL3>:
5e: 333337b7 lui a5,0x33333
62: 33378793 addi a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
66: 0025571b srliw a4,a0,0x2
6a: 8d7d and a0,a0,a5
6c: 8ff9 and a5,a5,a4
6e: 9fa9 addw a5,a5,a0
70: 0047d51b srliw a0,a5,0x4
74: 9d3d addw a0,a0,a5
76: 0f0f17b7 lui a5,0xf0f1
7a: 1141 addi sp,sp,-16
7c: f0f78793 addi a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0a57>
80: e422 sd s0,8(sp)
82: 8fe9 and a5,a5,a0
84: 0800 addi s0,sp,16
86: 0087d51b srliw a0,a5,0x8
8a: 6422 ld s0,8(sp)
8c: 9fa9 addw a5,a5,a0
8e: 0107d51b srliw a0,a5,0x10
92: 9d3d addw a0,a0,a5
94: 0ff57513 andi a0,a0,255
98: 0141 addi sp,sp,16
9a: 8082 ret
This 2 implementations is almost same but small differences.
Especially in CONFIG_ARCH_HAS_FAST_MULTIPLIER condition, below code didn't use "mul" instructions.
" return (w * 0x01010101) >> 24; "
So I am trying to translate this code with inline assembly as below:
//return (w * 0x01010101) >> 24;
__asm__ (
" mul %0, %0, %1\n"
: "+r" (w)
: "r" (w), "r"(0x01010101)
:);
return w >> 24;
After above change, the disassambly as below:
0000000000000000 <__sw_hweight32>:
0: 555557b7 lui a5,0x55555
4: 0015571b srliw a4,a0,0x1
8: 55578793 addi a5,a5,1365 # 55555555 <.LASF5+0x55555119>
c: 8ff9 and a5,a5,a4
e: 9d1d subw a0,a0,a5
0000000000000010 <.LVL1>:
10: 333337b7 lui a5,0x33333
14: 0025571b srliw a4,a0,0x2
18: 33378793 addi a5,a5,819 # 33333333 <.LASF5+0x33332ef7>
1c: 8d7d and a0,a0,a5
1e: 8ff9 and a5,a5,a4
20: 9fa9 addw a5,a5,a0
22: 0047d71b srliw a4,a5,0x4
26: 9f3d addw a4,a4,a5
28: 0f0f17b7 lui a5,0xf0f1
2c: 1141 addi sp,sp,-16
2e: f0f78793 addi a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0ad3>
32: e422 sd s0,8(sp)
34: 8ff9 and a5,a5,a4
36: 0800 addi s0,sp,16
38: 01010737 lui a4,0x1010
3c: 853e mv a0,a5
000000000000003e <.LVL2>:
3e: 1017071b addiw a4,a4,257
42: 02f50533 mul a0,a0,a5
46: 6422 ld s0,8(sp)
48: 0185551b srliw a0,a0,0x18
"mul" instruction is leveraged as expectation, but 0x01010101 load waste several instructions.
Based on this test, force to leverage "mul" instruction might be not faster than current compiler implementations.
I am not sure above assembly is the best way to load 0x01010101? I checked the ISA manual, "lui" only
load 20 bits per time, is this the best way to load instants?
On the other hand, I try to compare ARM64 disassembly code:
.....
4: 3200c3e2 mov w2, #0x1010101 // #16843009
......
w = (w + (w >> 4)) & 0x0f0f0f0f;
20: 0b401000 add w0, w0, w0, lsr #4
24: 1200cc00 and w0, w0, #0xf0f0f0f
return (w * 0x01010101) >> 24;
28: 1b027c00 mul w0, w0, w2
Only one "mov" instructions to load 0x1010101 and one "mul" instruction for multiply.
Let me summary as below:
1. GCC 10.1 cannot generate "mul" instruction when CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled.
2. force to generate "mul" didn't get better because instants load waste instructions.
3. If GCC compiler behavior is best solution for this case, we could have below work around on Riscv.
unsigned int __sw_hweight32(unsigned int w)
{
-#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
+/*
+ * Risc-V could not generate mul(w) instruction in this case
+ */
+#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && !defined(CONFIG_RISCV)
w -= (w >> 1) & 0x55555555;
w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
w = (w + (w >> 4)) & 0x0f0f0f0f;
Chenxi
On 2020/7/21 上午9:17, Palmer Dabbelt wrote:
> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>> which works fine on GCC-9.3 and GCC-10.1
>>
>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>
>> Signed-off-by: Chenxi Mao <[email protected]>
>> ---
>> arch/riscv/Kconfig | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 128192e14ff2..84e6777fecad 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>> bool "RV64I"
>> select 64BIT
>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>> + select ARCH_HAS_FAST_MULTIPLIER
>> select HAVE_DYNAMIC_FTRACE if MMU
>> select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>> select HAVE_FTRACE_MCOUNT_RECORD
>
> Ah, thanks -- this one didn't show up when I was looking at the last one. I
> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
> difference there. I guess in theory we should be sticking this all in some
> sort of "platform type" optimization flags, but that's probably bit much for
> now.
On Wed, 22 Jul 2020 18:59:12 PDT (-0700), [email protected] wrote:
> Hi Palmer and Emil:
>
> As Emil mentioned in previous E-mail loop, I did the same test on my kernel as well.
Sorry, I guess I crossed up my emails. I think it's best to just drop this for
now, as it doesn't actually seem to generate better code for our current
target.
>
> My kernel is based on Linux 5.8-RC6 with GCC-10.1. (ISA C extension enabled)
>
> The disassembly code as below:
>
> CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled:
>
> 0000000000000000 <__sw_hweight32>:
> 0: 555557b7 lui a5,0x55555
> 4: 0015571b srliw a4,a0,0x1
> 8: 55578793 addi a5,a5,1365 # 55555555 <.LASF5+0x5555509d>
> c: 8ff9 and a5,a5,a4
> e: 9d1d subw a0,a0,a5
>
> 0000000000000010 <.LVL1>:
> 10: 333337b7 lui a5,0x33333
> 14: 33378793 addi a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
> 18: 0025571b srliw a4,a0,0x2
> 1c: 8d7d and a0,a0,a5
> 1e: 8ff9 and a5,a5,a4
> 20: 9fa9 addw a5,a5,a0
> 22: 0047d51b srliw a0,a5,0x4
> 26: 9fa9 addw a5,a5,a0
> 28: 0f0f1537 lui a0,0xf0f1
> 2c: 1141 addi sp,sp,-16
> 2e: f0f50513 addi a0,a0,-241 # f0f0f0f <.LASF5+0xf0f0a57>
> 32: e422 sd s0,8(sp)
> 34: 8fe9 and a5,a5,a0
> 36: 0800 addi s0,sp,16
> 38: 0087951b slliw a0,a5,0x8
> 3c: 6422 ld s0,8(sp)
> 3e: 9d3d addw a0,a0,a5
> 40: 0105179b slliw a5,a0,0x10
> 44: 9d3d addw a0,a0,a5
> 46: 0185551b srliw a0,a0,0x18
> 4a: 0141 addi sp,sp,16
> 4c: 8082 ret
>
> CONFIG_ARCH_HAS_FAST_MULTIPLIER disabled:
>
> 000000000000004e <__sw_hweight32_default>:
> 4e: 55555737 lui a4,0x55555
> 52: 0015579b srliw a5,a0,0x1
> 56: 55570713 addi a4,a4,1365 # 55555555 <.LASF5+0x5555509d>
> 5a: 8ff9 and a5,a5,a4
> 5c: 9d1d subw a0,a0,a5
>
> 000000000000005e <.LVL3>:
> 5e: 333337b7 lui a5,0x33333
> 62: 33378793 addi a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
> 66: 0025571b srliw a4,a0,0x2
> 6a: 8d7d and a0,a0,a5
> 6c: 8ff9 and a5,a5,a4
> 6e: 9fa9 addw a5,a5,a0
> 70: 0047d51b srliw a0,a5,0x4
> 74: 9d3d addw a0,a0,a5
> 76: 0f0f17b7 lui a5,0xf0f1
> 7a: 1141 addi sp,sp,-16
> 7c: f0f78793 addi a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0a57>
> 80: e422 sd s0,8(sp)
> 82: 8fe9 and a5,a5,a0
> 84: 0800 addi s0,sp,16
> 86: 0087d51b srliw a0,a5,0x8
> 8a: 6422 ld s0,8(sp)
> 8c: 9fa9 addw a5,a5,a0
> 8e: 0107d51b srliw a0,a5,0x10
> 92: 9d3d addw a0,a0,a5
> 94: 0ff57513 andi a0,a0,255
> 98: 0141 addi sp,sp,16
> 9a: 8082 ret
>
> This 2 implementations is almost same but small differences.
>
> Especially in CONFIG_ARCH_HAS_FAST_MULTIPLIER condition, below code didn't use "mul" instructions.
>
> " return (w * 0x01010101) >> 24; "
>
> So I am trying to translate this code with inline assembly as below:
>
> //return (w * 0x01010101) >> 24;
> __asm__ (
> " mul %0, %0, %1\n"
> : "+r" (w)
> : "r" (w), "r"(0x01010101)
> :);
> return w >> 24;
>
> After above change, the disassambly as below:
> 0000000000000000 <__sw_hweight32>:
> 0: 555557b7 lui a5,0x55555
> 4: 0015571b srliw a4,a0,0x1
> 8: 55578793 addi a5,a5,1365 # 55555555 <.LASF5+0x55555119>
> c: 8ff9 and a5,a5,a4
> e: 9d1d subw a0,a0,a5
>
> 0000000000000010 <.LVL1>:
> 10: 333337b7 lui a5,0x33333
> 14: 0025571b srliw a4,a0,0x2
> 18: 33378793 addi a5,a5,819 # 33333333 <.LASF5+0x33332ef7>
> 1c: 8d7d and a0,a0,a5
> 1e: 8ff9 and a5,a5,a4
> 20: 9fa9 addw a5,a5,a0
> 22: 0047d71b srliw a4,a5,0x4
> 26: 9f3d addw a4,a4,a5
> 28: 0f0f17b7 lui a5,0xf0f1
> 2c: 1141 addi sp,sp,-16
> 2e: f0f78793 addi a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0ad3>
> 32: e422 sd s0,8(sp)
> 34: 8ff9 and a5,a5,a4
> 36: 0800 addi s0,sp,16
> 38: 01010737 lui a4,0x1010
> 3c: 853e mv a0,a5
>
> 000000000000003e <.LVL2>:
> 3e: 1017071b addiw a4,a4,257
> 42: 02f50533 mul a0,a0,a5
> 46: 6422 ld s0,8(sp)
> 48: 0185551b srliw a0,a0,0x18
>
> "mul" instruction is leveraged as expectation, but 0x01010101 load waste several instructions.
>
> Based on this test, force to leverage "mul" instruction might be not faster than current compiler implementations.
>
> I am not sure above assembly is the best way to load 0x01010101? I checked the ISA manual, "lui" only
>
> load 20 bits per time, is this the best way to load instants?
>
>
> On the other hand, I try to compare ARM64 disassembly code:
>
> .....
>
> 4: 3200c3e2 mov w2, #0x1010101 // #16843009
>
> ......
>
> w = (w + (w >> 4)) & 0x0f0f0f0f;
> 20: 0b401000 add w0, w0, w0, lsr #4
> 24: 1200cc00 and w0, w0, #0xf0f0f0f
> return (w * 0x01010101) >> 24;
> 28: 1b027c00 mul w0, w0, w2
>
> Only one "mov" instructions to load 0x1010101 and one "mul" instruction for multiply.
>
>
> Let me summary as below:
>
> 1. GCC 10.1 cannot generate "mul" instruction when CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled.
>
> 2. force to generate "mul" didn't get better because instants load waste instructions.
>
> 3. If GCC compiler behavior is best solution for this case, we could have below work around on Riscv.
>
> unsigned int __sw_hweight32(unsigned int w)
> {
> -#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
> +/*
> + * Risc-V could not generate mul(w) instruction in this case
> + */
> +#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && !defined(CONFIG_RISCV)
> w -= (w >> 1) & 0x55555555;
> w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
> w = (w + (w >> 4)) & 0x0f0f0f0f;
>
>
> Chenxi
>
>
> On 2020/7/21 上午9:17, Palmer Dabbelt wrote:
>> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
>>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>>> which works fine on GCC-9.3 and GCC-10.1
>>>
>>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>>
>>> Signed-off-by: Chenxi Mao <[email protected]>
>>> ---
>>> arch/riscv/Kconfig | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>> index 128192e14ff2..84e6777fecad 100644
>>> --- a/arch/riscv/Kconfig
>>> +++ b/arch/riscv/Kconfig
>>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>>> bool "RV64I"
>>> select 64BIT
>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>>> + select ARCH_HAS_FAST_MULTIPLIER
>>> select HAVE_DYNAMIC_FTRACE if MMU
>>> select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>>> select HAVE_FTRACE_MCOUNT_RECORD
>>
>> Ah, thanks -- this one didn't show up when I was looking at the last one. I
>> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
>> difference there. I guess in theory we should be sticking this all in some
>> sort of "platform type" optimization flags, but that's probably bit much for
>> now.
Hi Palmer:
Did you mean we drop this totally or drop this for __sw_hweight32 only?
Chenxi
On 2020/7/23 上午10:13, Palmer Dabbelt wrote:
> On Wed, 22 Jul 2020 18:59:12 PDT (-0700), [email protected] wrote:
>> Hi Palmer and Emil:
>>
>> As Emil mentioned in previous E-mail loop, I did the same test on my kernel as well.
>
> Sorry, I guess I crossed up my emails. I think it's best to just drop this for
> now, as it doesn't actually seem to generate better code for our current
> target.
>
>>
>> My kernel is based on Linux 5.8-RC6 with GCC-10.1. (ISA C extension enabled)
>>
>> The disassembly code as below:
>>
>> CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled:
>>
>> 0000000000000000 <__sw_hweight32>:
>> 0: 555557b7 lui a5,0x55555
>> 4: 0015571b srliw a4,a0,0x1
>> 8: 55578793 addi a5,a5,1365 # 55555555 <.LASF5+0x5555509d>
>> c: 8ff9 and a5,a5,a4
>> e: 9d1d subw a0,a0,a5
>>
>> 0000000000000010 <.LVL1>:
>> 10: 333337b7 lui a5,0x33333
>> 14: 33378793 addi a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
>> 18: 0025571b srliw a4,a0,0x2
>> 1c: 8d7d and a0,a0,a5
>> 1e: 8ff9 and a5,a5,a4
>> 20: 9fa9 addw a5,a5,a0
>> 22: 0047d51b srliw a0,a5,0x4
>> 26: 9fa9 addw a5,a5,a0
>> 28: 0f0f1537 lui a0,0xf0f1
>> 2c: 1141 addi sp,sp,-16
>> 2e: f0f50513 addi a0,a0,-241 # f0f0f0f <.LASF5+0xf0f0a57>
>> 32: e422 sd s0,8(sp)
>> 34: 8fe9 and a5,a5,a0
>> 36: 0800 addi s0,sp,16
>> 38: 0087951b slliw a0,a5,0x8
>> 3c: 6422 ld s0,8(sp)
>> 3e: 9d3d addw a0,a0,a5
>> 40: 0105179b slliw a5,a0,0x10
>> 44: 9d3d addw a0,a0,a5
>> 46: 0185551b srliw a0,a0,0x18
>> 4a: 0141 addi sp,sp,16
>> 4c: 8082 ret
>>
>> CONFIG_ARCH_HAS_FAST_MULTIPLIER disabled:
>>
>> 000000000000004e <__sw_hweight32_default>:
>> 4e: 55555737 lui a4,0x55555
>> 52: 0015579b srliw a5,a0,0x1
>> 56: 55570713 addi a4,a4,1365 # 55555555 <.LASF5+0x5555509d>
>> 5a: 8ff9 and a5,a5,a4
>> 5c: 9d1d subw a0,a0,a5
>>
>> 000000000000005e <.LVL3>:
>> 5e: 333337b7 lui a5,0x33333
>> 62: 33378793 addi a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
>> 66: 0025571b srliw a4,a0,0x2
>> 6a: 8d7d and a0,a0,a5
>> 6c: 8ff9 and a5,a5,a4
>> 6e: 9fa9 addw a5,a5,a0
>> 70: 0047d51b srliw a0,a5,0x4
>> 74: 9d3d addw a0,a0,a5
>> 76: 0f0f17b7 lui a5,0xf0f1
>> 7a: 1141 addi sp,sp,-16
>> 7c: f0f78793 addi a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0a57>
>> 80: e422 sd s0,8(sp)
>> 82: 8fe9 and a5,a5,a0
>> 84: 0800 addi s0,sp,16
>> 86: 0087d51b srliw a0,a5,0x8
>> 8a: 6422 ld s0,8(sp)
>> 8c: 9fa9 addw a5,a5,a0
>> 8e: 0107d51b srliw a0,a5,0x10
>> 92: 9d3d addw a0,a0,a5
>> 94: 0ff57513 andi a0,a0,255
>> 98: 0141 addi sp,sp,16
>> 9a: 8082 ret
>>
>> This 2 implementations is almost same but small differences.
>>
>> Especially in CONFIG_ARCH_HAS_FAST_MULTIPLIER condition, below code didn't use "mul" instructions.
>>
>> " return (w * 0x01010101) >> 24; "
>>
>> So I am trying to translate this code with inline assembly as below:
>>
>> //return (w * 0x01010101) >> 24;
>> __asm__ (
>> " mul %0, %0, %1\n"
>> : "+r" (w)
>> : "r" (w), "r"(0x01010101)
>> :);
>> return w >> 24;
>>
>> After above change, the disassambly as below:
>> 0000000000000000 <__sw_hweight32>:
>> 0: 555557b7 lui a5,0x55555
>> 4: 0015571b srliw a4,a0,0x1
>> 8: 55578793 addi a5,a5,1365 # 55555555 <.LASF5+0x55555119>
>> c: 8ff9 and a5,a5,a4
>> e: 9d1d subw a0,a0,a5
>>
>> 0000000000000010 <.LVL1>:
>> 10: 333337b7 lui a5,0x33333
>> 14: 0025571b srliw a4,a0,0x2
>> 18: 33378793 addi a5,a5,819 # 33333333 <.LASF5+0x33332ef7>
>> 1c: 8d7d and a0,a0,a5
>> 1e: 8ff9 and a5,a5,a4
>> 20: 9fa9 addw a5,a5,a0
>> 22: 0047d71b srliw a4,a5,0x4
>> 26: 9f3d addw a4,a4,a5
>> 28: 0f0f17b7 lui a5,0xf0f1
>> 2c: 1141 addi sp,sp,-16
>> 2e: f0f78793 addi a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0ad3>
>> 32: e422 sd s0,8(sp)
>> 34: 8ff9 and a5,a5,a4
>> 36: 0800 addi s0,sp,16
>> 38: 01010737 lui a4,0x1010
>> 3c: 853e mv a0,a5
>>
>> 000000000000003e <.LVL2>:
>> 3e: 1017071b addiw a4,a4,257
>> 42: 02f50533 mul a0,a0,a5
>> 46: 6422 ld s0,8(sp)
>> 48: 0185551b srliw a0,a0,0x18
>>
>> "mul" instruction is leveraged as expectation, but 0x01010101 load waste several instructions.
>>
>> Based on this test, force to leverage "mul" instruction might be not faster than current compiler implementations.
>>
>> I am not sure above assembly is the best way to load 0x01010101? I checked the ISA manual, "lui" only
>>
>> load 20 bits per time, is this the best way to load instants?
>>
>>
>> On the other hand, I try to compare ARM64 disassembly code:
>>
>> .....
>>
>> 4: 3200c3e2 mov w2, #0x1010101 // #16843009
>>
>> ......
>>
>> w = (w + (w >> 4)) & 0x0f0f0f0f;
>> 20: 0b401000 add w0, w0, w0, lsr #4
>> 24: 1200cc00 and w0, w0, #0xf0f0f0f
>> return (w * 0x01010101) >> 24;
>> 28: 1b027c00 mul w0, w0, w2
>>
>> Only one "mov" instructions to load 0x1010101 and one "mul" instruction for multiply.
>>
>>
>> Let me summary as below:
>>
>> 1. GCC 10.1 cannot generate "mul" instruction when CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled.
>>
>> 2. force to generate "mul" didn't get better because instants load waste instructions.
>>
>> 3. If GCC compiler behavior is best solution for this case, we could have below work around on Riscv.
>>
>> unsigned int __sw_hweight32(unsigned int w)
>> {
>> -#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
>> +/*
>> + * Risc-V could not generate mul(w) instruction in this case
>> + */
>> +#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && !defined(CONFIG_RISCV)
>> w -= (w >> 1) & 0x55555555;
>> w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
>> w = (w + (w >> 4)) & 0x0f0f0f0f;
>>
>>
>> Chenxi
>>
>>
>> On 2020/7/21 上午9:17, Palmer Dabbelt wrote:
>>> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
>>>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>>>> which works fine on GCC-9.3 and GCC-10.1
>>>>
>>>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>>>
>>>> Signed-off-by: Chenxi Mao <[email protected]>
>>>> ---
>>>> arch/riscv/Kconfig | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>>> index 128192e14ff2..84e6777fecad 100644
>>>> --- a/arch/riscv/Kconfig
>>>> +++ b/arch/riscv/Kconfig
>>>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>>>> bool "RV64I"
>>>> select 64BIT
>>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>>>> + select ARCH_HAS_FAST_MULTIPLIER
>>>> select HAVE_DYNAMIC_FTRACE if MMU
>>>> select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>>>> select HAVE_FTRACE_MCOUNT_RECORD
>>>
>>> Ah, thanks -- this one didn't show up when I was looking at the last one. I
>>> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
>>> difference there. I guess in theory we should be sticking this all in some
>>> sort of "platform type" optimization flags, but that's probably bit much for
>>> now.