LinuxLists.cc - [PATCH 1/1] riscv: Enable ARCH_HAS_FAST

2020-07-09 05:20:25

Subject: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I

Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
which works fine on GCC-9.3 and GCC-10.1

PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.

Signed-off-by: Chenxi Mao <[email protected]>
---
arch/riscv/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 128192e14ff2..84e6777fecad 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -202,6 +202,7 @@ config ARCH_RV64I
bool "RV64I"
select 64BIT
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
+ select ARCH_HAS_FAST_MULTIPLIER
select HAVE_DYNAMIC_FTRACE if MMU
select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
select HAVE_FTRACE_MCOUNT_RECORD
--
2.25.1

2020-07-21 01:20:35

by Palmer Dabbelt

[permalink] [raw]

Subject: Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I

On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
> which works fine on GCC-9.3 and GCC-10.1
>
> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>
> Signed-off-by: Chenxi Mao <[email protected]>
> ---
> arch/riscv/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 128192e14ff2..84e6777fecad 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -202,6 +202,7 @@ config ARCH_RV64I
> bool "RV64I"
> select 64BIT
> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
> + select ARCH_HAS_FAST_MULTIPLIER
> select HAVE_DYNAMIC_FTRACE if MMU
> select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
> select HAVE_FTRACE_MCOUNT_RECORD

Ah, thanks -- this one didn't show up when I was looking at the last one. I
think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
difference there. I guess in theory we should be sticking this all in some
sort of "platform type" optimization flags, but that's probably bit much for
now.

2020-07-21 01:49:49

by Chenxi Mao

[permalink] [raw]

Subject: Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I

Hi Palmer:

Thanks for your reply.

Frankly, I didn't test ARCH_HAS_FAST_MULTIPLIER on RV32,

so I cannot put it in RISCV platform.

I am trying to comparing ARM64 with Riscv to find out more optimization

configurations.

I suggest to enable ARCH_HAS_FAST_MULTIPLIER on RV64 first.

If someone else evaluate this on RV32, we could move it to RISCV platform.

Chenxi

On 2020/7/21 上午9:17, Palmer Dabbelt wrote:
> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>> which works fine on GCC-9.3 and GCC-10.1
>>
>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>
>> Signed-off-by: Chenxi Mao <[email protected]>
>> ---
>> arch/riscv/Kconfig | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 128192e14ff2..84e6777fecad 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>>      bool "RV64I"
>>      select 64BIT
>>      select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>> +    select ARCH_HAS_FAST_MULTIPLIER
>>      select HAVE_DYNAMIC_FTRACE if MMU
>>      select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>>      select HAVE_FTRACE_MCOUNT_RECORD
>
> Ah, thanks -- this one didn't show up when I was looking at the last one. I
> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
> difference there. I guess in theory we should be sticking this all in some
> sort of "platform type" optimization flags, but that's probably bit much for
> now.

2020-07-21 02:41:55

by Chenxi Mao

[permalink] [raw]

Subject: Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I

Hi Palmer:

Move to RISCV platform is ok for me, but I cannot evaluate RV32 condition.

Chenxi

On 2020/7/21 上午9:47, Chenxi Mao wrote:
> Hi Palmer:
>
> Thanks for your reply.
>
> Frankly, I didn't test ARCH_HAS_FAST_MULTIPLIER on RV32,
>
> so I cannot put it in RISCV platform.
>
> I am trying to comparing ARM64 with Riscv to find out more optimization
>
> configurations.
>
> I suggest to enable ARCH_HAS_FAST_MULTIPLIER on RV64 first.
>
> If someone else evaluate this on RV32, we could move it to RISCV platform.
>
>
> Chenxi
>
>
> On 2020/7/21 上午9:17, Palmer Dabbelt wrote:
>> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
>>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>>> which works fine on GCC-9.3 and GCC-10.1
>>>
>>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>>
>>> Signed-off-by: Chenxi Mao <[email protected]>
>>> ---
>>> arch/riscv/Kconfig | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>> index 128192e14ff2..84e6777fecad 100644
>>> --- a/arch/riscv/Kconfig
>>> +++ b/arch/riscv/Kconfig
>>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>>>      bool "RV64I"
>>>      select 64BIT
>>>      select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>>> +    select ARCH_HAS_FAST_MULTIPLIER
>>>      select HAVE_DYNAMIC_FTRACE if MMU
>>>      select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>>>      select HAVE_FTRACE_MCOUNT_RECORD
>> Ah, thanks -- this one didn't show up when I was looking at the last one. I
>> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
>> difference there. I guess in theory we should be sticking this all in some
>> sort of "platform type" optimization flags, but that's probably bit much for
>> now.

2020-07-23 02:02:30

by Chenxi Mao

[permalink] [raw]

Subject: Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I

Hi Palmer and Emil:

As Emil mentioned in previous E-mail loop, I did the same test on my kernel as well.

My kernel is based on Linux 5.8-RC6 with GCC-10.1. (ISA C extension enabled)

The disassembly code as below:

CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled:

0000000000000000 <__sw_hweight32>:
   0:    555557b7              lui    a5,0x55555
   4:    0015571b              srliw    a4,a0,0x1
   8:    55578793              addi    a5,a5,1365 # 55555555 <.LASF5+0x5555509d>
   c:    8ff9                    and    a5,a5,a4
   e:    9d1d                    subw    a0,a0,a5

0000000000000010 <.LVL1>:
10:    333337b7              lui    a5,0x33333
14:    33378793              addi    a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
18:    0025571b              srliw    a4,a0,0x2
1c:    8d7d                    and    a0,a0,a5
1e:    8ff9                    and    a5,a5,a4
20:    9fa9                    addw    a5,a5,a0
22:    0047d51b              srliw    a0,a5,0x4
26:    9fa9                    addw    a5,a5,a0
28:    0f0f1537              lui    a0,0xf0f1
2c:    1141                    addi    sp,sp,-16
2e:    f0f50513              addi    a0,a0,-241 # f0f0f0f <.LASF5+0xf0f0a57>
32:    e422                    sd    s0,8(sp)
34:    8fe9                    and    a5,a5,a0
36:    0800                    addi    s0,sp,16
38:    0087951b              slliw    a0,a5,0x8
3c:    6422                    ld    s0,8(sp)
3e:    9d3d                    addw    a0,a0,a5
40:    0105179b              slliw    a5,a0,0x10
44:    9d3d                    addw    a0,a0,a5
46:    0185551b              srliw    a0,a0,0x18
4a:    0141                    addi    sp,sp,16
4c:    8082                    ret

CONFIG_ARCH_HAS_FAST_MULTIPLIER disabled:

000000000000004e <__sw_hweight32_default>:
4e:    55555737              lui    a4,0x55555
52:    0015579b              srliw    a5,a0,0x1
56:    55570713              addi    a4,a4,1365 # 55555555 <.LASF5+0x5555509d>
5a:    8ff9                    and    a5,a5,a4
5c:    9d1d                    subw    a0,a0,a5

000000000000005e <.LVL3>:
5e:    333337b7              lui    a5,0x33333
62:    33378793              addi    a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
66:    0025571b              srliw    a4,a0,0x2
6a:    8d7d                    and    a0,a0,a5
6c:    8ff9                    and    a5,a5,a4
6e:    9fa9                    addw    a5,a5,a0
70:    0047d51b              srliw    a0,a5,0x4
74:    9d3d                    addw    a0,a0,a5
76:    0f0f17b7              lui    a5,0xf0f1
7a:    1141                    addi    sp,sp,-16
7c:    f0f78793              addi    a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0a57>
80:    e422                    sd    s0,8(sp)
82:    8fe9                    and    a5,a5,a0
84:    0800                    addi    s0,sp,16
86:    0087d51b              srliw    a0,a5,0x8
8a:    6422                    ld    s0,8(sp)
8c:    9fa9                    addw    a5,a5,a0
8e:    0107d51b              srliw    a0,a5,0x10
92:    9d3d                    addw    a0,a0,a5
94:    0ff57513              andi    a0,a0,255
98:    0141                    addi    sp,sp,16
9a:    8082                    ret

This 2 implementations is almost same but small differences.

Especially in CONFIG_ARCH_HAS_FAST_MULTIPLIER condition, below code didn't use "mul" instructions.

    " return (w * 0x01010101) >> 24; "

So I am trying to translate this code with inline assembly as below:

//return (w * 0x01010101) >> 24;
__asm__ (
" mul %0, %0, %1\n"
: "+r" (w)
: "r" (w), "r"(0x01010101)
:);
return w >> 24;

After above change, the disassambly as below:
0000000000000000 <__sw_hweight32>:
   0:   555557b7             lui   a5,0x55555
   4:   0015571b             srliw   a4,a0,0x1
   8:   55578793             addi   a5,a5,1365 # 55555555 <.LASF5+0x55555119>
   c:   8ff9                   and   a5,a5,a4
   e:   9d1d                   subw   a0,a0,a5

0000000000000010 <.LVL1>:
10:   333337b7             lui   a5,0x33333
14:   0025571b             srliw   a4,a0,0x2
18:   33378793             addi   a5,a5,819 # 33333333 <.LASF5+0x33332ef7>
1c:   8d7d                   and   a0,a0,a5
1e:   8ff9                   and   a5,a5,a4
20:   9fa9                   addw   a5,a5,a0
22:   0047d71b             srliw   a4,a5,0x4
26:   9f3d                   addw   a4,a4,a5
28:   0f0f17b7             lui   a5,0xf0f1
2c:   1141                   addi   sp,sp,-16
2e:   f0f78793             addi   a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0ad3>
32:   e422                   sd   s0,8(sp)
34:   8ff9                   and   a5,a5,a4
36:   0800                   addi   s0,sp,16
38:   01010737             lui   a4,0x1010
3c:   853e                   mv   a0,a5

000000000000003e <.LVL2>:
3e:   1017071b             addiw   a4,a4,257
42:   02f50533             mul   a0,a0,a5
46:   6422                   ld   s0,8(sp)
48:   0185551b             srliw   a0,a0,0x18

"mul" instruction is leveraged as expectation, but 0x01010101 load waste several instructions.

Based on this test, force to leverage "mul" instruction might be not faster than current compiler implementations.

I am not sure above assembly is the best way to load 0x01010101? I checked the ISA manual, "lui" only

load 20 bits per time, is this the best way to load instants?

On the other hand, I try to compare ARM64 disassembly code:

.....

   4:    3200c3e2     mov    w2, #0x1010101                 // #16843009

......

   w = (w + (w >> 4)) & 0x0f0f0f0f;
20:    0b401000     add    w0, w0, w0, lsr #4
24:    1200cc00     and    w0, w0, #0xf0f0f0f
    return (w * 0x01010101) >> 24;
28:    1b027c00     mul    w0, w0, w2

Only one "mov" instructions to load 0x1010101 and one "mul" instruction for multiply.

Let me summary as below:

1. GCC 10.1 cannot generate "mul" instruction when CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled.

2. force to generate "mul" didn't get better because instants load waste instructions.

3. If GCC compiler behavior is best solution for this case, we could have below work around on Riscv.

unsigned int __sw_hweight32(unsigned int w)
{
-#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
+/*
+ * Risc-V could not generate mul(w) instruction in this case
+ */
+#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && !defined(CONFIG_RISCV)
        w -= (w >> 1) & 0x55555555;
        w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
        w = (w + (w >> 4)) & 0x0f0f0f0f;

Chenxi

On 2020/7/21 上午9:17, Palmer Dabbelt wrote:
> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>> which works fine on GCC-9.3 and GCC-10.1
>>
>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>
>> Signed-off-by: Chenxi Mao <[email protected]>
>> ---
>> arch/riscv/Kconfig | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 128192e14ff2..84e6777fecad 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>>      bool "RV64I"
>>      select 64BIT
>>      select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>> +    select ARCH_HAS_FAST_MULTIPLIER
>>      select HAVE_DYNAMIC_FTRACE if MMU
>>      select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>>      select HAVE_FTRACE_MCOUNT_RECORD
>
> Ah, thanks -- this one didn't show up when I was looking at the last one. I
> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
> difference there. I guess in theory we should be sticking this all in some
> sort of "platform type" optimization flags, but that's probably bit much for
> now.

2020-07-23 02:13:43

by Palmer Dabbelt

[permalink] [raw]

Subject: Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I

On Wed, 22 Jul 2020 18:59:12 PDT (-0700), [email protected] wrote:
> Hi Palmer and Emil:
>
> As Emil mentioned in previous E-mail loop, I did the same test on my kernel as well.

Sorry, I guess I crossed up my emails. I think it's best to just drop this for
now, as it doesn't actually seem to generate better code for our current
target.

>
> My kernel is based on Linux 5.8-RC6 with GCC-10.1. (ISA C extension enabled)
>
> The disassembly code as below:
>
> CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled:
>
> 0000000000000000 <__sw_hweight32>:
>    0:    555557b7              lui    a5,0x55555
>    4:    0015571b              srliw    a4,a0,0x1
>    8:    55578793              addi    a5,a5,1365 # 55555555 <.LASF5+0x5555509d>
>    c:    8ff9                    and    a5,a5,a4
>    e:    9d1d                    subw    a0,a0,a5
>
> 0000000000000010 <.LVL1>:
> 10:    333337b7              lui    a5,0x33333
> 14:    33378793              addi    a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
> 18:    0025571b              srliw    a4,a0,0x2
> 1c:    8d7d                    and    a0,a0,a5
> 1e:    8ff9                    and    a5,a5,a4
> 20:    9fa9                    addw    a5,a5,a0
> 22:    0047d51b              srliw    a0,a5,0x4
> 26:    9fa9                    addw    a5,a5,a0
> 28:    0f0f1537              lui    a0,0xf0f1
> 2c:    1141                    addi    sp,sp,-16
> 2e:    f0f50513              addi    a0,a0,-241 # f0f0f0f <.LASF5+0xf0f0a57>
> 32:    e422                    sd    s0,8(sp)
> 34:    8fe9                    and    a5,a5,a0
> 36:    0800                    addi    s0,sp,16
> 38:    0087951b              slliw    a0,a5,0x8
> 3c:    6422                    ld    s0,8(sp)
> 3e:    9d3d                    addw    a0,a0,a5
> 40:    0105179b              slliw    a5,a0,0x10
> 44:    9d3d                    addw    a0,a0,a5
> 46:    0185551b              srliw    a0,a0,0x18
> 4a:    0141                    addi    sp,sp,16
> 4c:    8082                    ret
>
> CONFIG_ARCH_HAS_FAST_MULTIPLIER disabled:
>
> 000000000000004e <__sw_hweight32_default>:
> 4e:    55555737              lui    a4,0x55555
> 52:    0015579b              srliw    a5,a0,0x1
> 56:    55570713              addi    a4,a4,1365 # 55555555 <.LASF5+0x5555509d>
> 5a:    8ff9                    and    a5,a5,a4
> 5c:    9d1d                    subw    a0,a0,a5
>
> 000000000000005e <.LVL3>:
> 5e:    333337b7              lui    a5,0x33333
> 62:    33378793              addi    a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
> 66:    0025571b              srliw    a4,a0,0x2
> 6a:    8d7d                    and    a0,a0,a5
> 6c:    8ff9                    and    a5,a5,a4
> 6e:    9fa9                    addw    a5,a5,a0
> 70:    0047d51b              srliw    a0,a5,0x4
> 74:    9d3d                    addw    a0,a0,a5
> 76:    0f0f17b7              lui    a5,0xf0f1
> 7a:    1141                    addi    sp,sp,-16
> 7c:    f0f78793              addi    a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0a57>
> 80:    e422                    sd    s0,8(sp)
> 82:    8fe9                    and    a5,a5,a0
> 84:    0800                    addi    s0,sp,16
> 86:    0087d51b              srliw    a0,a5,0x8
> 8a:    6422                    ld    s0,8(sp)
> 8c:    9fa9                    addw    a5,a5,a0
> 8e:    0107d51b              srliw    a0,a5,0x10
> 92:    9d3d                    addw    a0,a0,a5
> 94:    0ff57513              andi    a0,a0,255
> 98:    0141                    addi    sp,sp,16
> 9a:    8082                    ret
>
> This 2 implementations is almost same but small differences.
>
> Especially in CONFIG_ARCH_HAS_FAST_MULTIPLIER condition, below code didn't use "mul" instructions.
>
>     " return (w * 0x01010101) >> 24; "
>
> So I am trying to translate this code with inline assembly as below:
>
> //return (w * 0x01010101) >> 24;
> __asm__ (
> " mul %0, %0, %1\n"
> : "+r" (w)
> : "r" (w), "r"(0x01010101)
> :);
> return w >> 24;
>
> After above change, the disassambly as below:
> 0000000000000000 <__sw_hweight32>:
>    0:   555557b7             lui   a5,0x55555
>    4:   0015571b             srliw   a4,a0,0x1
>    8:   55578793             addi   a5,a5,1365 # 55555555 <.LASF5+0x55555119>
>    c:   8ff9                   and   a5,a5,a4
>    e:   9d1d                   subw   a0,a0,a5
>
> 0000000000000010 <.LVL1>:
> 10:   333337b7             lui   a5,0x33333
> 14:   0025571b             srliw   a4,a0,0x2
> 18:   33378793             addi   a5,a5,819 # 33333333 <.LASF5+0x33332ef7>
> 1c:   8d7d                   and   a0,a0,a5
> 1e:   8ff9                   and   a5,a5,a4
> 20:   9fa9                   addw   a5,a5,a0
> 22:   0047d71b             srliw   a4,a5,0x4
> 26:   9f3d                   addw   a4,a4,a5
> 28:   0f0f17b7             lui   a5,0xf0f1
> 2c:   1141                   addi   sp,sp,-16
> 2e:   f0f78793             addi   a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0ad3>
> 32:   e422                   sd   s0,8(sp)
> 34:   8ff9                   and   a5,a5,a4
> 36:   0800                   addi   s0,sp,16
> 38:   01010737             lui   a4,0x1010
> 3c:   853e                   mv   a0,a5
>
> 000000000000003e <.LVL2>:
> 3e:   1017071b             addiw   a4,a4,257
> 42:   02f50533             mul   a0,a0,a5
> 46:   6422                   ld   s0,8(sp)
> 48:   0185551b             srliw   a0,a0,0x18
>
> "mul" instruction is leveraged as expectation, but 0x01010101 load waste several instructions.
>
> Based on this test, force to leverage "mul" instruction might be not faster than current compiler implementations.
>
> I am not sure above assembly is the best way to load 0x01010101? I checked the ISA manual, "lui" only
>
> load 20 bits per time, is this the best way to load instants?
>
>
> On the other hand, I try to compare ARM64 disassembly code:
>
> .....
>
>    4:    3200c3e2     mov    w2, #0x1010101                 // #16843009
>
> ......
>
>    w = (w + (w >> 4)) & 0x0f0f0f0f;
> 20:    0b401000     add    w0, w0, w0, lsr #4
> 24:    1200cc00     and    w0, w0, #0xf0f0f0f
>     return (w * 0x01010101) >> 24;
> 28:    1b027c00     mul    w0, w0, w2
>
> Only one "mov" instructions to load 0x1010101 and one "mul" instruction for multiply.
>
>
> Let me summary as below:
>
> 1. GCC 10.1 cannot generate "mul" instruction when CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled.
>
> 2. force to generate "mul" didn't get better because instants load waste instructions.
>
> 3. If GCC compiler behavior is best solution for this case, we could have below work around on Riscv.
>
> unsigned int __sw_hweight32(unsigned int w)
> {
> -#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
> +/*
> + * Risc-V could not generate mul(w) instruction in this case
> + */
> +#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && !defined(CONFIG_RISCV)
>         w -= (w >> 1) & 0x55555555;
>         w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
>         w = (w + (w >> 4)) & 0x0f0f0f0f;
>
>
> Chenxi
>
>
> On 2020/7/21 上午9:17, Palmer Dabbelt wrote:
>> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
>>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>>> which works fine on GCC-9.3 and GCC-10.1
>>>
>>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>>
>>> Signed-off-by: Chenxi Mao <[email protected]>
>>> ---
>>> arch/riscv/Kconfig | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>> index 128192e14ff2..84e6777fecad 100644
>>> --- a/arch/riscv/Kconfig
>>> +++ b/arch/riscv/Kconfig
>>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>>>      bool "RV64I"
>>>      select 64BIT
>>>      select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>>> +    select ARCH_HAS_FAST_MULTIPLIER
>>>      select HAVE_DYNAMIC_FTRACE if MMU
>>>      select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>>>      select HAVE_FTRACE_MCOUNT_RECORD
>>
>> Ah, thanks -- this one didn't show up when I was looking at the last one. I
>> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
>> difference there. I guess in theory we should be sticking this all in some
>> sort of "platform type" optimization flags, but that's probably bit much for
>> now.

2020-07-23 02:58:11

by Chenxi Mao

[permalink] [raw]

Subject: Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I

Hi Palmer:

Did you mean we drop this totally or drop this for __sw_hweight32 only?

Chenxi

On 2020/7/23 上午10:13, Palmer Dabbelt wrote:
> On Wed, 22 Jul 2020 18:59:12 PDT (-0700), [email protected] wrote:
>> Hi Palmer and Emil:
>>
>> As Emil mentioned in previous E-mail loop, I did the same test on my kernel as well.
>
> Sorry, I guess I crossed up my emails. I think it's best to just drop this for
> now, as it doesn't actually seem to generate better code for our current
> target.
>
>>
>> My kernel is based on Linux 5.8-RC6 with GCC-10.1. (ISA C extension enabled)
>>
>> The disassembly code as below:
>>
>> CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled:
>>
>> 0000000000000000 <__sw_hweight32>:
>>    0:    555557b7              lui    a5,0x55555
>>    4:    0015571b              srliw    a4,a0,0x1
>>    8:    55578793              addi    a5,a5,1365 # 55555555 <.LASF5+0x5555509d>
>>    c:    8ff9                    and    a5,a5,a4
>>    e:    9d1d                    subw    a0,a0,a5
>>
>> 0000000000000010 <.LVL1>:
>> 10:    333337b7              lui    a5,0x33333
>> 14:    33378793              addi    a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
>> 18:    0025571b              srliw    a4,a0,0x2
>> 1c:    8d7d                    and    a0,a0,a5
>> 1e:    8ff9                    and    a5,a5,a4
>> 20:    9fa9                    addw    a5,a5,a0
>> 22:    0047d51b              srliw    a0,a5,0x4
>> 26:    9fa9                    addw    a5,a5,a0
>> 28:    0f0f1537              lui    a0,0xf0f1
>> 2c:    1141                    addi    sp,sp,-16
>> 2e:    f0f50513              addi    a0,a0,-241 # f0f0f0f <.LASF5+0xf0f0a57>
>> 32:    e422                    sd    s0,8(sp)
>> 34:    8fe9                    and    a5,a5,a0
>> 36:    0800                    addi    s0,sp,16
>> 38:    0087951b              slliw    a0,a5,0x8
>> 3c:    6422                    ld    s0,8(sp)
>> 3e:    9d3d                    addw    a0,a0,a5
>> 40:    0105179b              slliw    a5,a0,0x10
>> 44:    9d3d                    addw    a0,a0,a5
>> 46:    0185551b              srliw    a0,a0,0x18
>> 4a:    0141                    addi    sp,sp,16
>> 4c:    8082                    ret
>>
>> CONFIG_ARCH_HAS_FAST_MULTIPLIER disabled:
>>
>> 000000000000004e <__sw_hweight32_default>:
>> 4e:    55555737              lui    a4,0x55555
>> 52:    0015579b              srliw    a5,a0,0x1
>> 56:    55570713              addi    a4,a4,1365 # 55555555 <.LASF5+0x5555509d>
>> 5a:    8ff9                    and    a5,a5,a4
>> 5c:    9d1d                    subw    a0,a0,a5
>>
>> 000000000000005e <.LVL3>:
>> 5e:    333337b7              lui    a5,0x33333
>> 62:    33378793              addi    a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
>> 66:    0025571b              srliw    a4,a0,0x2
>> 6a:    8d7d                    and    a0,a0,a5
>> 6c:    8ff9                    and    a5,a5,a4
>> 6e:    9fa9                    addw    a5,a5,a0
>> 70:    0047d51b              srliw    a0,a5,0x4
>> 74:    9d3d                    addw    a0,a0,a5
>> 76:    0f0f17b7              lui    a5,0xf0f1
>> 7a:    1141                    addi    sp,sp,-16
>> 7c:    f0f78793              addi    a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0a57>
>> 80:    e422                    sd    s0,8(sp)
>> 82:    8fe9                    and    a5,a5,a0
>> 84:    0800                    addi    s0,sp,16
>> 86:    0087d51b              srliw    a0,a5,0x8
>> 8a:    6422                    ld    s0,8(sp)
>> 8c:    9fa9                    addw    a5,a5,a0
>> 8e:    0107d51b              srliw    a0,a5,0x10
>> 92:    9d3d                    addw    a0,a0,a5
>> 94:    0ff57513              andi    a0,a0,255
>> 98:    0141                    addi    sp,sp,16
>> 9a:    8082                    ret
>>
>> This 2 implementations is almost same but small differences.
>>
>> Especially in CONFIG_ARCH_HAS_FAST_MULTIPLIER condition, below code didn't use "mul" instructions.
>>
>>     " return (w * 0x01010101) >> 24; "
>>
>> So I am trying to translate this code with inline assembly as below:
>>
>> //return (w * 0x01010101) >> 24;
>> __asm__ (
>> " mul %0, %0, %1\n"
>> : "+r" (w)
>> : "r" (w), "r"(0x01010101)
>> :);
>> return w >> 24;
>>
>> After above change, the disassambly as below:
>> 0000000000000000 <__sw_hweight32>:
>>    0:   555557b7             lui   a5,0x55555
>>    4:   0015571b             srliw   a4,a0,0x1
>>    8:   55578793             addi   a5,a5,1365 # 55555555 <.LASF5+0x55555119>
>>    c:   8ff9                   and   a5,a5,a4
>>    e:   9d1d                   subw   a0,a0,a5
>>
>> 0000000000000010 <.LVL1>:
>> 10:   333337b7             lui   a5,0x33333
>> 14:   0025571b             srliw   a4,a0,0x2
>> 18:   33378793             addi   a5,a5,819 # 33333333 <.LASF5+0x33332ef7>
>> 1c:   8d7d                   and   a0,a0,a5
>> 1e:   8ff9                   and   a5,a5,a4
>> 20:   9fa9                   addw   a5,a5,a0
>> 22:   0047d71b             srliw   a4,a5,0x4
>> 26:   9f3d                   addw   a4,a4,a5
>> 28:   0f0f17b7             lui   a5,0xf0f1
>> 2c:   1141                   addi   sp,sp,-16
>> 2e:   f0f78793             addi   a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0ad3>
>> 32:   e422                   sd   s0,8(sp)
>> 34:   8ff9                   and   a5,a5,a4
>> 36:   0800                   addi   s0,sp,16
>> 38:   01010737             lui   a4,0x1010
>> 3c:   853e                   mv   a0,a5
>>
>> 000000000000003e <.LVL2>:
>> 3e:   1017071b             addiw   a4,a4,257
>> 42:   02f50533             mul   a0,a0,a5
>> 46:   6422                   ld   s0,8(sp)
>> 48:   0185551b             srliw   a0,a0,0x18
>>
>> "mul" instruction is leveraged as expectation, but 0x01010101 load waste several instructions.
>>
>> Based on this test, force to leverage "mul" instruction might be not faster than current compiler implementations.
>>
>> I am not sure above assembly is the best way to load 0x01010101? I checked the ISA manual, "lui" only
>>
>> load 20 bits per time, is this the best way to load instants?
>>
>>
>> On the other hand, I try to compare ARM64 disassembly code:
>>
>> .....
>>
>>    4:    3200c3e2     mov    w2, #0x1010101                 // #16843009
>>
>> ......
>>
>>    w = (w + (w >> 4)) & 0x0f0f0f0f;
>> 20:    0b401000     add    w0, w0, w0, lsr #4
>> 24:    1200cc00     and    w0, w0, #0xf0f0f0f
>>     return (w * 0x01010101) >> 24;
>> 28:    1b027c00     mul    w0, w0, w2
>>
>> Only one "mov" instructions to load 0x1010101 and one "mul" instruction for multiply.
>>
>>
>> Let me summary as below:
>>
>> 1. GCC 10.1 cannot generate "mul" instruction when CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled.
>>
>> 2. force to generate "mul" didn't get better because instants load waste instructions.
>>
>> 3. If GCC compiler behavior is best solution for this case, we could have below work around on Riscv.
>>
>> unsigned int __sw_hweight32(unsigned int w)
>> {
>> -#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
>> +/*
>> + * Risc-V could not generate mul(w) instruction in this case
>> + */
>> +#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && !defined(CONFIG_RISCV)
>>         w -= (w >> 1) & 0x55555555;
>>         w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
>>         w = (w + (w >> 4)) & 0x0f0f0f0f;
>>
>>
>> Chenxi
>>
>>
>> On 2020/7/21 上午9:17, Palmer Dabbelt wrote:
>>> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), [email protected] wrote:
>>>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>>>> which works fine on GCC-9.3 and GCC-10.1
>>>>
>>>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>>>
>>>> Signed-off-by: Chenxi Mao <[email protected]>
>>>> ---
>>>> arch/riscv/Kconfig | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>>> index 128192e14ff2..84e6777fecad 100644
>>>> --- a/arch/riscv/Kconfig
>>>> +++ b/arch/riscv/Kconfig
>>>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>>>>      bool "RV64I"
>>>>      select 64BIT
>>>>      select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>>>> +    select ARCH_HAS_FAST_MULTIPLIER
>>>>      select HAVE_DYNAMIC_FTRACE if MMU
>>>>      select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>>>>      select HAVE_FTRACE_MCOUNT_RECORD
>>>
>>> Ah, thanks -- this one didn't show up when I was looking at the last one. I
>>> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
>>> difference there. I guess in theory we should be sticking this all in some
>>> sort of "platform type" optimization flags, but that's probably bit much for
>>> now.