2024-03-27 15:02:28

by Palmer Dabbelt

Subject: Re: [PATCH] riscv: lib: Implement optimized memchr function

On Mon, 11 Dec 2023 07:25:15 PST (-0800), [email protected] wrote:
> On 11/12/2023 15:08, Andrew Jones wrote:
>>> As you can see, the new function shows much better results even for
>>> the small arrays of 256 elements, therefore I believe it could be a
>>> useful addition to the existing riscv-specific string functions.
>>
>> Looks good, but do we want to maintain both this version and a zbb
>> version? I'd expect a zbb version to be even better.
>>
>
> Hi Andrew,
>
> Yes, a Zbb analog would be much better, and if we use Zbb operations we
> could avoid most of the bit magic happening there.
>
>>> + add t1, x0, a2
>>
>> mv t1, a2
>>
>> and for the remainder of the function s/x0/zero/
>>
>
> Alright, will be fixed in the next version.
>>> + sltiu t2, a2, MIN_BORDER
>>> + bnez t2, 6f
>>> +
>>> + // get the number of bytes we should iterate before alignment
>>
>> I'm not sure, but I think even in assembly we prefer the /* */ comment
>> format.
>>
>>> + andi t0, a0, SZREG - 1
>>> + beqz t0, 4f
>>> +
>>> + # get the SZREG - t0
>>
>> I'm 99% sure we don't want to use the # comment syntax.
>>
>>> + xor t0, t0, SZREG - 1
>>
>> xori?
>>
>
> Hmm, I'm surprised that it actually compiles... Yeah, it should be fixed
>>> + addi t0, t0, 1
>>> +
>>> + sub a2, a2, t0
>>
>> nit: Looks a bit odd to put a blank line above the sub line above,
>> instead of above the below comment.
>>
>>> + // iterate before alignment
>>> +1:
>>> + beq t0, x0, 4f
>>> + lbu t2, 0(a0)
>>> + beq t2, a1, 3f
>>> + addi t0, t0, -1
>>
>> This addi t0... isn't necessary if we do
>>
>
> Yeah, sounds reasonable, we can make it faster
>> add t0, a0, t0
>> 1:
>> beq a0, t0, 4f
>> ...
>> ...
>> addi a0, a0, 1
>> j 1b
>>
>>> + addi a0, a0, 1
>>> + j 1b
>>> +
>>> +2:
>>> + // found a word. Iterate it until we find the target byte
>>> + li t1, SZREG
>>> + j 6f
>>
>> These instructions seem oddly placed among the rest.
>>
>>> +3:
>>> + ret
>>
>> And this is an odd place to put this ret (after unconditional jump and
>> in the middle of the function). We can just put a label at the bottom ret.
>>
>
> I agree, thanks!
>>> +
>>> +4:
>>> + // get the count remainder
>>> + andi t1, a2, SZREG - 1
>>> +
>>> + // align the count
>>> + sub a2, a2, t1
>>> +
>>> + // if we have no words to iterate, iterate the remainder
>>> + beqz a2, 6f
>>> +
>>> + // from 0xBA we will get 0xBABABABABABABABA
>>> + li t3, REP_01
>>> + mul t3, t3, a1
>>
>> I don't think we want to implement an optimized assembly function with
>> mul. We can just use a few shifts and ors.
>>
>> slli t3, a1, 8
>> or t3, t3, a1
>> slli t4, t3, 16
>> or t3, t4, t3
>> #if __riscv_xlen == 64
>> slli t4, t3, 32
>> or t3, t4, t3
>> #endif
>>
>
> Nice point, thanks! Will be optimized :)
>>> +
>>> + add a2, a2, a0
>>> +
>>> + li t4, REP_01
>>> + li t5, REP_80
>>> +
>>> +5:
>>> + REG_L t2, 0(a0)
>>> +
>>> + // after this xor we will get one zero byte in the word if it contains the target byte
>>> + xor t2, t2, t3
>>> +
>>> + // word v contains the target byte if (v - 0x01010101) & (~v) & 0x80808080 is positive
>>
>> s/is positive/is not zero/
>>
>>> + sub t0, t2, t4
>>> +
>>> + not t2, t2
>>> +
>>> + and t0, t0, t2
>>> + and t0, t0, t5
>>> +
>>> + bnez t0, 2b
>>> + addi a0, a0, SZREG
>>> + bne a0, a2, 5b
>>> +
>>> +6:
>>> + // iterate the remainder
>>> + beq t1, x0, 7f
>>> + lbu t4, 0(a0)
>>> + beq t4, a1, 3b
>>> + addi a0, a0, 1
>>> + addi t1, t1, -1
>>
>> Same comment as above about being able to drop the addi t1...
>>
>>> + j 6b
>>> +
>>> +7:
>>> + addi a0, x0, 0
>>
>> li a0, 0
>>
>>> + ret
>>> +SYM_FUNC_END(memchr)
>>> +SYM_FUNC_ALIAS(__pi_memchr, memchr)
>>> --
>>> 2.34.1
>>>
>>
>> Thanks,
>> drew
>>
>
> Thanks a lot for the review!

Do you have a v2? Sorry if I lost it.


2024-04-02 13:13:30

by Ivan Orlov

Subject: Re: [PATCH] riscv: lib: Implement optimized memchr function

On 27/03/2024 14:21, Palmer Dabbelt wrote:
> On Mon, 11 Dec 2023 07:25:15 PST (-0800), [email protected] wrote:
>> On 11/12/2023 15:08, Andrew Jones wrote:
>>>> As you can see, the new function shows much better results even for
>>>> the small arrays of 256 elements, therefore I believe it could be a
>>>> useful addition to the existing riscv-specific string functions.
>>>
>>> Looks good, but do we want to maintain both this version and a zbb
>>> version? I'd expect a zbb version to be even better.
>>>
>>
>> Hi Andrew,
>>
>> Yes, a Zbb analog would be much better, and if we use Zbb operations we
>> could avoid most of the bit magic happening there.
>>
>>>> +    add t1, x0, a2
>>>
>>> mv t1, a2
>>>
>>> and for the remainder of the function s/x0/zero/
>>>
>>
>> Alright, will be fixed in the next version.
>>>> +    sltiu t2, a2, MIN_BORDER
>>>> +    bnez t2, 6f
>>>> +
>>>> +    // get the number of bytes we should iterate before alignment
>>>
>>> I'm not sure, but I think even in assembly we prefer the /* */ comment
>>> format.
>>>
>>>> +    andi t0, a0, SZREG - 1
>>>> +    beqz t0, 4f
>>>> +
>>>> +    # get the SZREG - t0
>>>
>>> I'm 99% sure we don't want to use the # comment syntax.
>>>
>>>> +    xor t0, t0, SZREG - 1
>>>
>>> xori?
>>>
>>
>> Hmm, I'm surprised that it actually compiles... Yeah, it should be
>> fixed
>>>> +    addi t0, t0, 1
>>>> +
>>>> +    sub a2, a2, t0
>>>
>>> nit: Looks a bit odd to put a blank line above the sub line above,
>>> instead of above the below comment.
>>>
>>>> +    // iterate before alignment
>>>> +1:
>>>> +    beq t0, x0, 4f
>>>> +    lbu t2, 0(a0)
>>>> +    beq t2, a1, 3f
>>>> +    addi t0, t0, -1
>>>
>>> This addi t0... isn't necessary if we do
>>>
>>
>> Yeah, sounds reasonable, we can make it faster
>>>     add t0, a0, t0
>>> 1:
>>>     beq a0, t0, 4f
>>>     ...
>>>     ...
>>>     addi a0, a0, 1
>>>     j 1b
>>>
>>>> +    addi a0, a0, 1
>>>> +    j 1b
>>>> +
>>>> +2:
>>>> +    // found a word. Iterate it until we find the target byte
>>>> +    li t1, SZREG
>>>> +    j 6f
>>>
>>> These instructions seem oddly placed among the rest.
>>>
>>>> +3:
>>>> +    ret
>>>
>>> And this is an odd place to put this ret (after unconditional jump and
>>> in the middle of the function). We can just put a label at the bottom
>>> ret.
>>>
>>
>> I agree, thanks!
>>>> +
>>>> +4:
>>>> +    // get the count remainder
>>>> +    andi t1, a2, SZREG - 1
>>>> +
>>>> +    // align the count
>>>> +    sub a2, a2, t1
>>>> +
>>>> +    // if we have no words to iterate, iterate the remainder
>>>> +    beqz a2, 6f
>>>> +
>>>> +    // from 0xBA we will get 0xBABABABABABABABA
>>>> +    li t3, REP_01
>>>> +    mul t3, t3, a1
>>>
>>> I don't think we want to implement an optimized assembly function with
>>> mul. We can just use a few shifts and ors.
>>>
>>>     slli    t3, a1, 8
>>>     or    t3, t3, a1
>>>     slli    t4, t3, 16
>>>     or    t3, t4, t3
>>> #if __riscv_xlen == 64
>>>     slli    t4, t3, 32
>>>     or    t3, t4, t3
>>> #endif
>>>
>>
>> Nice point, thanks! Will be optimized :)
>>>> +
>>>> +    add a2, a2, a0
>>>> +
>>>> +    li t4, REP_01
>>>> +    li t5, REP_80
>>>> +
>>>> +5:
>>>> +    REG_L t2, 0(a0)
>>>> +
>>>> +    // after this xor we will get one zero byte in the word if it
>>>> contains the target byte
>>>> +    xor t2, t2, t3
>>>> +
>>>> +    // word v contains the target byte if (v - 0x01010101) & (~v) &
>>>> 0x80808080 is positive
>>>
>>> s/is positive/is not zero/
>>>
>>>> +    sub t0, t2, t4
>>>> +
>>>> +    not t2, t2
>>>> +
>>>> +    and t0, t0, t2
>>>> +    and t0, t0, t5
>>>> +
>>>> +    bnez t0, 2b
>>>> +    addi a0, a0, SZREG
>>>> +    bne a0, a2, 5b
>>>> +
>>>> +6:
>>>> +    // iterate the remainder
>>>> +    beq t1, x0, 7f
>>>> +    lbu t4, 0(a0)
>>>> +    beq t4, a1, 3b
>>>> +    addi a0, a0, 1
>>>> +    addi t1, t1, -1
>>>
>>> Same comment as above about being able to drop the addi t1...
>>>
>>>> +    j 6b
>>>> +
>>>> +7:
>>>> +    addi a0, x0, 0
>>>
>>> li a0, 0
>>>
>>>> +    ret
>>>> +SYM_FUNC_END(memchr)
>>>> +SYM_FUNC_ALIAS(__pi_memchr, memchr)
>>>> --
>>>> 2.34.1
>>>>
>>>
>>> Thanks,
>>> drew
>>>
>>
>> Thanks a lot for the review!
>
> Do you have a v2?  Sorry if I lost it.
>

Hi Palmer,

Sorry for the late reply.

After a few experiments it became clear that we won't get such a large
performance gain for xlen=32. Also, I collected some usage statistics on
the system, and they showed that `memchr` rarely has to iterate over
more than 128 bytes.

Considering this information, it seems to me that such an
overcomplication of the `memchr` function simply isn't worth it. So
there was no v2 for this patch :(

Sorry, I should've written it earlier.

--
Kind regards,
Ivan Orlov