2023-12-13 15:47:04

by Ivan Orlov

Subject: [PATCH] riscv: lib: Optimize 'strlen' function

The current non-ZBB implementation of the 'strlen' function iterates
over memory bytewise, looking for a zero byte. It can be optimized to
use wordwise iteration instead, so that 4/8 bytes of memory are
processed at a time.

A word can be tested for containing a zero byte in just 4 operations
(see the C sketch after the list):

1. Subtract 0x0101..01 from the word. If the word contained a zero byte,
the most significant bit of that byte (previously 0) turns to 1.
2. Calculate ~word.
3. AND the subtraction result with ~word. That way we know which bits
were 0 and then turned to 1.
4. AND the result of step 3 with 0x8080..80. 0x8080..80 keeps only the
most significant bit of every byte in the word, so we know whether one
of them turned from 0 to 1.
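
For reference, the test above can be written in C roughly as follows (a
minimal sketch for the 64-bit case; REP_01/REP_80 mirror the constants
introduced further down in this patch):

#include <stdbool.h>
#include <stdint.h>

#define REP_01 0x0101010101010101ULL
#define REP_80 0x8080808080808080ULL

static bool word_has_zero_byte(uint64_t v)
{
        /*
         * (v - REP_01) borrows into the top bit of any zero byte,
         * & ~v keeps only the bits that flipped from 0 to 1,
         * & REP_80 keeps just the most significant bit of each byte.
         */
        return ((v - REP_01) & ~v & REP_80) != 0;
}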

This patch modifies the riscv-specific strlen function to work in the
following way (a rough C equivalent is sketched after the list):

1. If the address is unaligned, iterate SZREG - (address % SZREG) bytes
to align it.
2. After the address is aligned, iterate over words of memory and check
them for a zero byte using the algorithm described above.
3. When we find a word with a zero byte, iterate through the word
bytewise to get the exact position of the 0 and return the corresponding
string length.
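
For illustration, a rough C equivalent of the resulting assembly (sketch
only, assuming a 64-bit system and the word_has_zero_byte() helper
sketched above):

#include <stddef.h>
#include <stdint.h>

static size_t strlen_wordwise(const char *s)
{
        const char *p = s;

        /* 1. Step bytewise until the pointer is word-aligned. */
        while ((uintptr_t)p % sizeof(uint64_t)) {
                if (*p == '\0')
                        return p - s;
                p++;
        }

        /* 2. Step wordwise until a word contains a zero byte. */
        while (!word_has_zero_byte(*(const uint64_t *)p))
                p += sizeof(uint64_t);

        /* 3. Step bytewise to find the exact position of the NUL. */
        while (*p != '\0')
                p++;

        return p - s;
}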

Here you can find the benchmarking results for the VisionFive2 board
comparing the old and new implementations of the strlen function.

Size: 1 (+-0), mean_old: 673, mean_new: 666
Size: 2 (+-0), mean_old: 672, mean_new: 676
Size: 4 (+-0), mean_old: 685, mean_new: 659
Size: 8 (+-0), mean_old: 682, mean_new: 673
Size: 16 (+-0), mean_old: 718, mean_new: 694
Size: 32 (+-0), mean_old: 753, mean_new: 726
Size: 64 (+-3), mean_old: 835, mean_new: 773
Size: 128 (+-8), mean_old: 994, mean_new: 821
Size: 256 (+-4), mean_old: 1314, mean_new: 924
Size: 512 (+-38), mean_old: 1978, mean_new: 1105
Size: 1024 (+-40), mean_old: 3263, mean_new: 1542
Size: 2048 (+-54), mean_old: 5871, mean_new: 2352
Size: 4096 (+-155), mean_old: 11061, mean_new: 3972
Size: 8192 (+-542), mean_old: 21431, mean_new: 7214
Size: 16384 (+-43), mean_old: 42239, mean_new: 13722
Size: 32768 (+-2712), mean_old: 85674, mean_new: 28471
Size: 65536 (+-907), mean_old: 189219, mean_new: 74236
Size: 131072 (+-1343), mean_old: 377023, mean_new: 147130
Size: 262144 (+-411), mean_old: 752993, mean_new: 292799
Size: 524288 (+-12242), mean_old: 1504279, mean_new: 583952

The function was tested on different string lengths and random offsets
(to check how it works with unaligned addresses). The clock counts were
measured with ktime_get() and the mean values are calculated from 1000
test runs for every string length.
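
The measurement itself boils down to something of the following shape
(an illustrative kernel-side sketch with hypothetical names, not the
actual benchmark module):

#include <linux/ktime.h>
#include <linux/string.h>
#include <linux/types.h>

/* Time a single strlen() call on 'buf', in nanoseconds. */
static u64 time_strlen_once(const char *buf)
{
        ktime_t start, end;
        volatile size_t len;

        start = ktime_get();
        len = strlen(buf);
        end = ktime_get();

        (void)len;
        return ktime_to_ns(ktime_sub(end, start));
}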

As you can see, the new function is slightly faster for small string
lengths and 2-3 times faster for longer strings, so I believe this
change could be really useful.

Signed-off-by: Ivan Orlov <[email protected]>
---
arch/riscv/lib/strlen.S | 72 +++++++++++++++++++++++++++++++----------
1 file changed, 55 insertions(+), 17 deletions(-)

diff --git a/arch/riscv/lib/strlen.S b/arch/riscv/lib/strlen.S
index 8ae3064e45ff..76486dd3c07d 100644
--- a/arch/riscv/lib/strlen.S
+++ b/arch/riscv/lib/strlen.S
@@ -5,29 +5,67 @@
#include <asm/alternative-macros.h>
#include <asm/hwcap.h>

-/* int strlen(const char *s) */
-SYM_FUNC_START(strlen)
+#define REP_80 __REG_SEL(0x8080808080808080, 0x80808080)
+#define REP_01 __REG_SEL(0x0101010101010101, 0x01010101)

+/* size_t strlen(const char *s) */
+SYM_FUNC_START(strlen)
ALTERNATIVE("nop", "j strlen_zbb", 0, RISCV_ISA_EXT_ZBB, CONFIG_RISCV_ISA_ZBB)

/*
- * Returns
- * a0 - string length
- *
- * Parameters
- * a0 - String to measure
- *
- * Clobbers:
- * t0, t1
- */
- mv t1, a0
+ * Returns
+ * a0 - string length
+ *
+ * Parameters
+ * a0 - String to measure
+ *
+ * Clobbers:
+ * t0, t1, t2, t3, t4
+ */
+
+
+ mv t4, a0
+
+ /* Check the address memory alignment */
+ andi t2, a0, SZREG-1
+ beqz t2, 2f
+
+ /* Get SZREG - (address remainder) */
+ xori t2, t2, SZREG-1
+ addi t2, t2, 1
+
+ /* align the address */
+ add t2, a0, t2
1:
- lbu t0, 0(t1)
- beqz t0, 2f
- addi t1, t1, 1
- j 1b
+ beq a0, t2, 2f
+ lbu t1, 0(a0)
+ beqz t1, 5f
+ addi a0, a0, 1
+ j 1b
2:
- sub a0, t1, a0
+ li t2, REP_80
+ li t3, REP_01
+3:
+ /* v contains 0 byte if (v - 0x0101..01) & ~v & 0x8080..80 != 0 */
+ REG_L t0, 0(a0)
+ sub t1, t0, t3
+ not t0, t0
+ and t1, t1, t0
+ and t1, t1, t2
+ bnez t1, 4f
+ addi a0, a0, SZREG
+ j 3b
+4:
+ /*
+ * We found the word with 0, iterate it byte-wise to find the actual
+ * string length.
+ */
+ lbu t0, 0(a0)
+ beqz t0, 5f
+ addi a0, a0, 1
+ j 4b
+5:
+ sub a0, a0, t4
ret

/*
--
2.34.1


2023-12-17 17:01:29

by David Laight

Subject: RE: [PATCH] riscv: lib: Optimize 'strlen' function

From: Ivan Orlov
> Sent: 13 December 2023 15:46
>
> The current non-ZBB implementation of the 'strlen' function iterates
> over memory bytewise, looking for a zero byte. It can be optimized to
> use wordwise iteration instead, so that 4/8 bytes of memory are
> processed at a time.
...
> 1. If the address is unaligned, iterate SZREG - (address % SZREG) bytes
> to align it.

An alternative is to mask the address and 'or' in non-zero bytes
into the first word - might be faster.
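
In C terms that alternative looks roughly like this (a sketch assuming
64-bit little-endian, as on RISC-V Linux; not part of the patch):

#include <stdint.h>

/*
 * Load the word containing the (possibly unaligned) string start and
 * force the bytes that precede the string to be non-zero, so the
 * zero-byte test cannot trigger on them.
 */
static uint64_t load_first_word(const char *s)
{
        uintptr_t addr = (uintptr_t)s;
        uintptr_t skip = addr % sizeof(uint64_t);
        const uint64_t *wp = (const uint64_t *)(addr - skip);
        uint64_t mask = skip ? (((uint64_t)1 << (8 * skip)) - 1) : 0;

        return *wp | mask;      /* the low 'skip' bytes become 0xff */
}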

...
> Here you can find the benchmarking results for the VisionFive2 board
> comparing the old and new implementations of the strlen function.
>
> Size: 1 (+-0), mean_old: 673, mean_new: 666
> Size: 2 (+-0), mean_old: 672, mean_new: 676
> Size: 4 (+-0), mean_old: 685, mean_new: 659
> Size: 8 (+-0), mean_old: 682, mean_new: 673
> Size: 16 (+-0), mean_old: 718, mean_new: 694
...

Is that 32bit or 64bit?
The word-at-a-time strlen() is typically not worth it for 32bit.

I'd also guess that pretty much all the calls in-kernel are short.
You might try counting as: histogram[ilog2(strlen_result)]++
and seeing what it shows for some workload.
I bet you (a beer if I see you!) that you won't see many over 1k.
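
Something along these lines, for instance (a userspace-flavoured sketch
with made-up names, just to show the bookkeeping):

#include <stddef.h>
#include <string.h>

static unsigned long strlen_hist[64];

static unsigned int ilog2_size(size_t v)
{
        unsigned int l = 0;

        while (v >>= 1)
                l++;
        return l;       /* lengths 0 and 1 both land in bucket 0 */
}

static size_t counted_strlen(const char *s)
{
        size_t len = strlen(s);

        strlen_hist[ilog2_size(len)]++; /* histogram[ilog2(strlen_result)]++ */
        return len;
}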

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2023-12-17 18:11:55

by David Laight

Subject: RE: [PATCH] riscv: lib: Optimize 'strlen' function

From: Ivan Orlov
> Sent: 13 December 2023 15:46

Looking at the old code...

> 1:
> - lbu t0, 0(t1)
> - beqz t0, 2f
> - addi t1, t1, 1
> - j 1b

I suspect there is (at least) a two clock stall between
the 'lbu' and 'beqz'.
Allowing for one clock for the 'predicted taken' branch
that is 7 clocks/byte.

Try this one - especially on 32bit:

mov t0, a0
and t1, t0, 1
sub t0, t0, t1
bnez t1, 2f
1:
ldb t1, 0(t0)
2: ldb t2, 1(t0)
add t0, t0, 2
beqz t1, 3f
bnez t2, 1b
add t0, t0, 1
3: sub t0, t0, 2
sub a0, t0, a0
ret

Might be 6 clocks for 2 bytes.
The much smaller cache footprint will also help.
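
In C the idea is roughly this (not a literal translation of the assembly
above, and note it can read one byte past the terminating NUL):

#include <stddef.h>

static size_t strlen_unrolled2(const char *s)
{
        const char *p = s;

        for (;;) {
                /* Both bytes are read before either one is tested. */
                unsigned char c0 = p[0];
                unsigned char c1 = p[1];

                if (c0 == 0)
                        return p - s;
                if (c1 == 0)
                        return p + 1 - s;
                p += 2;
        }
}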

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2023-12-17 22:54:59

by Ivan Orlov

Subject: Re: [PATCH] riscv: lib: Optimize 'strlen' function

On 12/17/23 17:00, David Laight wrote:
> From: Ivan Orlov
>> Sent: 13 December 2023 15:46
>>
>> The current non-ZBB implementation of the 'strlen' function iterates
>> over memory bytewise, looking for a zero byte. It can be optimized to
>> use wordwise iteration instead, so that 4/8 bytes of memory are
>> processed at a time.
> ...
>> 1. If the address is unaligned, iterate SZREG - (address % SZREG) bytes
>> to align it.
>
> An alternative is to mask the address and 'or' in non-zero bytes
> into the first word - might be faster.
>
Hi David,

Yeah, it might be an option, I'll test it. Thanks!

> ...
>> Here you can find the benchmarking results for the VisionFive2 board
>> comparing the old and new implementations of the strlen function.
>>
>> Size: 1 (+-0), mean_old: 673, mean_new: 666
>> Size: 2 (+-0), mean_old: 672, mean_new: 676
>> Size: 4 (+-0), mean_old: 685, mean_new: 659
>> Size: 8 (+-0), mean_old: 682, mean_new: 673
>> Size: 16 (+-0), mean_old: 718, mean_new: 694
> ...
>
> Is that 32bit or 64bit?
> The word-at-a-time strlen() is typically not worth it for 32bit.
>

I tested it on a 64-bit board only, as that is the only board I have...

I assume the performance gain would be less noticeable on 32bit; the
word-oriented function could even be slower than the byte-oriented one
for shorter strings.

However, I'm not sure whether any physical 32-bit RISC-V boards with
Linux support actually exist at the moment... So the only way to test
the solution on a 32-bit system would be QEMU, and that probably
wouldn't be very representative, right?

But it is definitely worth a try, and I could include a separate
implementation for 32-bit RISC-V which simply iterates over the bytes,
in case the 32-bit QEMU test shows significant overhead for the
word-oriented function.

> I'd also guess that pretty much all the calls in-kernel are short.

I'm 99% sure they are! However, I believe that if the word-oriented
solution doesn't introduce performance overhead for shorter strings but
works much faster for longer strings, it is still worth implementing! :)

> You might try counting as: histogram[ilog2(strlen_result)]++
> and seeing what it shows for some workload.
> I bet you (a beer if I see you!) that you won't see many over 1k.

Sounds like a fun experiment, and I accept the bet! Beer is more than
doable as I'm also located in the UK (Manchester).

--
Kind regards,
Ivan Orlov


2023-12-17 23:23:34

by Ivan Orlov

Subject: Re: [PATCH] riscv: lib: Optimize 'strlen' function

On 12/17/23 18:10, David Laight wrote:
> From: Ivan Orlov
>> Sent: 13 December 2023 15:46
>
> Looking at the old code...
>
>> 1:
>> - lbu t0, 0(t1)
>> - beqz t0, 2f
>> - addi t1, t1, 1
>> - j 1b
>
> I suspect there is (at least) a two clock stall between
> the 'ldu' and 'beqz'.

Hmm, the stall exists due to memory access? Why do two subsequent
accesses to memory (as in the example you provided) do the trick? Is it
because the two "ldb"s could be parallelized?

> Allowing for one clock for the 'predicted taken' branch
> that is 7 clocks/byte.
>
> Try this one - especially on 32bit:
>
> mov t0, a0
> and t1, t0, 1
> sub t0, t0, t1
> bnez t1, 2f
> 1:
> ldb t1, 0(t0)
> 2: ldb t2, 1(t0)
> add t0, t0, 2
> beqz t1, 3f
> bnez t2, 1b
> add t0, t0, 1
> 3: sub t0, t0, 2
> sub a0, t0, a0
> ret
>
I tested it on my 64bit board, and this variant is definitely faster
than the original implementation! Here are the results of the benchmark
comparing this variant with the word-oriented one:

Test count per size: 1000

Size: 1 (+-0), mean_old: 711, mean_new: 708
Size: 2 (+-0), mean_old: 649, mean_new: 713
Size: 4 (+-0), mean_old: 499, mean_new: 506
Size: 8 (+-0), mean_old: 344, mean_new: 350
Size: 16 (+-0), mean_old: 342, mean_new: 362
Size: 32 (+-0), mean_old: 369, mean_new: 387
Size: 64 (+-0), mean_old: 393, mean_new: 401
Size: 128 (+-4), mean_old: 457, mean_new: 424
Size: 256 (+-13), mean_old: 578, mean_new: 476
Size: 512 (+-31), mean_old: 842, mean_new: 573
Size: 1024 (+-19), mean_old: 1305, mean_new: 777
Size: 2048 (+-97), mean_old: 2280, mean_new: 1193
Size: 4096 (+-149), mean_old: 4226, mean_new: 2002
Size: 8192 (+-439), mean_old: 8131, mean_new: 3634
Size: 16384 (+-615), mean_old: 16353, mean_new: 6905
Size: 32768 (+-2566), mean_old: 37075, mean_new: 14232
Size: 65536 (+-6047), mean_old: 73797, mean_new: 37090
Size: 131072 (+-10071), mean_old: 146802, mean_new: 73402
Size: 262144 (+-18150), mean_old: 293003, mean_new: 146118
Size: 524288 (+-21247), mean_old: 585057, mean_new: 291324

Benchmark code:

https://github.com/ivanorlov2206/strlen-benchmark/blob/main/strlentest.c

It looks like the variant you suggested could be faster for shorter
strings even on the 64bit platform.

Maybe we could enhance it even more by loading 4 consecutive bytes into
different registers so the memory loads would still be parallelized?

--
Kind regards,
Ivan Orlov


2023-12-18 01:41:45

by Ivan Orlov

Subject: Re: [PATCH] riscv: lib: Optimize 'strlen' function

On 12/17/23 17:00, David Laight wrote:
> I'd also guess that pretty much all the calls in-kernel are short.
> You might try counting as: histogram[ilog2(strlen_result)]++
> and seeing what it shows for some workload.
> I bet you (a beer if I see you!) that you won't see many over 1k.

Hi David,

Here are the statistics for the strlen result:

[ 223.169575] Calls count for 2^0: 6150
[ 223.173293] Calls count for 2^1: 184852
[ 223.177142] Calls count for 2^2: 313896
[ 223.180990] Calls count for 2^3: 185844
[ 223.184881] Calls count for 2^4: 87868
[ 223.188660] Calls count for 2^5: 9916
[ 223.192368] Calls count for 2^6: 1865
[ 223.196062] Calls count for 2^7: 0
[ 223.199483] Calls count for 2^8: 0
[ 223.202952] Calls count for 2^9: 0
...

Looks like I've just lost a beer :)

Considering these statistics, I'd say implementing the word-oriented
strlen is an overcomplication - we wouldn't get any performance gain and
it just isn't worth it.

I simplified your code a little bit; it looks like the alignment there
is unnecessary: the QEMU test shows the same performance regardless of
alignment. Tests on the board gave the same result (perhaps because the
CPU on the board has 2 DDR channels?)

mv t0, a0
1:
lbu t1, 0(a0)
lbu t2, 1(a0)
addi a0, a0, 2
beqz t1, 2f
bnez t2, 1b
addi a0, a0, 1
2:
addi a0, a0, -2
sub a0, a0, t0
ret

If it looks good to you, would you mind if I sent a patch with it?
Can I add a Suggested-by tag for you?

--
Kind regards,
Ivan Orlov


2023-12-18 09:14:25

by David Laight

Subject: RE: [PATCH] riscv: lib: Optimize 'strlen' function

From: Ivan Orlov
> Sent: 17 December 2023 23:23
>
> On 12/17/23 18:10, David Laight wrote:
> > From: Ivan Orlov
> >> Sent: 13 December 2023 15:46
> >
> > Looking at the old code...
> >
> >> 1:
> >> - lbu t0, 0(t1)
> >> - beqz t0, 2f
> >> - addi t1, t1, 1
> >> - j 1b
> >
> > I suspect there is (at least) a two clock stall between
> > the 'lbu' and 'beqz'.
>
> Hmm, the stall exists due to memory access? Why do two subsequent
> accesses to memory (as in the example you provided) do the trick? Is it
> because the two "ldb"s could be parallelized?

On the fpga RISCV (and probably other 'small' implementations)
there is a two clock result delay from memory loads while the
result is written to the 'register file'.
ALU results get short-circuited so can be used in the next instruction.

The memory loads themselves are pipelined and can be issued
every clock.
(On the fpga version actual memory delays stall the pipeline.)

...
> Maybe we could enhance it even more by loading 4 consecutive bytes into
> different registers so the memory loads would still be parallelized?

You need the loads to be pipelined, not parallelized.
That will help if there are longer delays accessing cache memory.

I'd expect any cpu with (say) a 4 clock cache read latency to
pipeline reads so that one can be issued every clock provided
the results aren't needed.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-12-18 09:29:16

by David Laight

Subject: RE: [PATCH] riscv: lib: Optimize 'strlen' function

From: Ivan Orlov
> Sent: 18 December 2023 01:42
>
> On 12/17/23 17:00, David Laight wrote:
> > I'd also guess that pretty much all the calls in-kernel are short.
> > You might try counting as: histogram[ilog2(strlen_result)]++
> > and seeing what it shows for some workload.
> > I bet you (a beer if I see you!) that you won't see many over 1k.
>
> Hi David,
>
> Here are the statistics for the strlen result:
>
> [ 223.169575] Calls count for 2^0: 6150
> [ 223.173293] Calls count for 2^1: 184852
> [ 223.177142] Calls count for 2^2: 313896
> [ 223.180990] Calls count for 2^3: 185844
> [ 223.184881] Calls count for 2^4: 87868
> [ 223.188660] Calls count for 2^5: 9916
> [ 223.192368] Calls count for 2^6: 1865
> [ 223.196062] Calls count for 2^7: 0
> [ 223.199483] Calls count for 2^8: 0
> [ 223.202952] Calls count for 2^9: 0
> ...
>
> Looks like I've just lost a beer :)
>
> Considering these statistics, I'd say implementing the word-oriented
> strlen is an overcomplication - we wouldn't get any performance gain and
> it just isn't worth it.

And the 32bit version is about half the speed of the 64bit one.

Of course, the fast way to do strlen is add a custom instruction!

> I simplified your code a little bit; it looks like the alignment there
> is unnecessary: the QEMU test shows the same performance regardless of
> alignment. Tests on the board gave the same result (perhaps because the
> CPU on the board has 2 DDR channels?)

The alignment is there because it can overread the string end
by one byte - and that mustn't cross a page boundary.
So you either have to mark the second load as 'may fault return
zero' or just not do it.

If the data isn't in cache the cache load will dominate.
The DDR channels only affect cache load times.
Get a TLB miss and add a few thousand more clocks!

>
> mv t0, a0
> 1:
> lbu t1, 0(a0)
> lbu t2, 1(a0)
> addi a0, a0, 2
> beqz t1, 2f
> bnez t2, 1b
> addi a0, a0, 1
> 2:
> addi a0, a0, -2
> sub a0, a0, t0
> ret
>
> If it looks good to you, would you mind if I sent a patch with it?
> Can I add a Suggested-by tag for you?

Yep..

David

>
> --
> Kind regards,
> Ivan Orlov

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-12-18 10:03:34

by Ivan Orlov

Subject: Re: [PATCH] riscv: lib: Optimize 'strlen' function

On 12/18/23 09:20, David Laight wrote:
> From: Ivan Orlov
>> Sent: 18 December 2023 01:42
>>
>> On 12/17/23 17:00, David Laight wrote:
>>> I'd also guess that pretty much all the calls in-kernel are short.
>>> You might try counting as: histogram[ilog2(strlen_result)]++
>>> and seeing what it shows for some workload.
>>> I bet you (a beer if I see you!) that you won't see many over 1k.
>>
>> Hi David,
>>
>> Here are the statistics for the strlen result:
>>
>> [ 223.169575] Calls count for 2^0: 6150
>> [ 223.173293] Calls count for 2^1: 184852
>> [ 223.177142] Calls count for 2^2: 313896
>> [ 223.180990] Calls count for 2^3: 185844
>> [ 223.184881] Calls count for 2^4: 87868
>> [ 223.188660] Calls count for 2^5: 9916
>> [ 223.192368] Calls count for 2^6: 1865
>> [ 223.196062] Calls count for 2^7: 0
>> [ 223.199483] Calls count for 2^8: 0
>> [ 223.202952] Calls count for 2^9: 0
>> ...
>>
>> Looks like I've just lost a beer :)
>>
>> Considering these statistics, I'd say implementing the word-oriented
>> strlen is an overcomplication - we wouldn't get any performance gain and
>> it just isn't worth it.
>
> And the 32bit version is about half the speed of the 64bit one.
>
> Of course, the fast way to do strlen is add a custom instruction!
>
>> I simplified your code a little bit; it looks like the alignment there
>> is unnecessary: the QEMU test shows the same performance regardless of
>> alignment. Tests on the board gave the same result (perhaps because the
>> CPU on the board has 2 DDR channels?)
>
> The alignment is there because it can overread the string end
> by one byte - and that mustn't cross a page boundary.
> So you either have to mark the second load as 'may fault return
> zero' or just not do it.
>
> If the data isn't in cache the cache load will dominate.
> The DDR channels only affect cache load times.
> Get a TLB miss and add a few thousand more clocks!
>

Ah, right, sounds reasonable...

Overall, I believe your solution is better and it would be fairer if
you sent it as a patch :) Here are the benchmark results for your
version vs the original (old) one on the StarFive VisionFive2 RISC-V board:

Size: 1 (+-0), mean_old: 350, mean_new: 340
Size: 2 (+-0), mean_old: 337, mean_new: 347
Size: 4 (+-0), mean_old: 322, mean_new: 355
Size: 8 (+-0), mean_old: 345, mean_new: 335
Size: 16 (+-0), mean_old: 352, mean_new: 367
Size: 32 (+-0), mean_old: 425, mean_new: 362
Size: 64 (+-4), mean_old: 507, mean_new: 407
Size: 128 (+-10), mean_old: 730, mean_new: 442
Size: 256 (+-19), mean_old: 1142, mean_new: 592
Size: 512 (+-6), mean_old: 1945, mean_new: 812
Size: 1024 (+-21), mean_old: 3565, mean_new: 1312
Size: 2048 (+-108), mean_old: 6812, mean_new: 2280
Size: 4096 (+-362), mean_old: 13302, mean_new: 4242
Size: 8192 (+-385), mean_old: 26393, mean_new: 8160
Size: 16384 (+-1115), mean_old: 52689, mean_new: 15953
Size: 32768 (+-2515), mean_old: 107293, mean_new: 32391
Size: 65536 (+-6041), mean_old: 213789, mean_new: 74354
Size: 131072 (+-12352), mean_old: 426619, mean_new: 146972
Size: 262144 (+-2635), mean_old: 848115, mean_new: 291309
Size: 524288 (+-3336), mean_old: 1712847, mean_new: 589654



>>
>> --
>> Kind regards,
>> Ivan Orlov
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

--
Kind regards,
Ivan Orlov


2023-12-18 10:12:45

by David Laight

Subject: RE: [PATCH] riscv: lib: Optimize 'strlen' function

From: Ivan Orlov
> Sent: 18 December 2023 10:03
>
> On 12/18/23 09:20, David Laight wrote:
> > From: Ivan Orlov
> >> Sent: 18 December 2023 01:42
> >>
> >> On 12/17/23 17:00, David Laight wrote:
> >>> I'd also guess that pretty much all the calls in-kernel are short.
> >>> You might try counting as: histogram[ilog2(strlen_result)]++
> >>> and seeing what it shows for some workload.
> >>> I bet you (a beer if I see you!) that you won't see many over 1k.
> >>
> >> Hi David,
> >>
> >> Here are the statistics for the strlen result:
> >>
> >> [ 223.169575] Calls count for 2^0: 6150
> >> [ 223.173293] Calls count for 2^1: 184852
> >> [ 223.177142] Calls count for 2^2: 313896
> >> [ 223.180990] Calls count for 2^3: 185844
> >> [ 223.184881] Calls count for 2^4: 87868
> >> [ 223.188660] Calls count for 2^5: 9916
> >> [ 223.192368] Calls count for 2^6: 1865
> >> [ 223.196062] Calls count for 2^7: 0
> >> [ 223.199483] Calls count for 2^8: 0
> >> [ 223.202952] Calls count for 2^9: 0
> >> ...
> >>
> >> Looks like I've just lost a beer :)
> >>
> >> Considering these statistics, I'd say implementing the word-oriented
> >> strlen is an overcomplication - we wouldn't get any performance gain and
> >> it just isn't worth it.
> >
> > And the 32bit version is about half the speed of the 64bit one.
> >
> > Of course, the fast way to do strlen is add a custom instruction!
> >
> >> I simplified your code a little bit; it looks like the alignment there
> >> is unnecessary: the QEMU test shows the same performance regardless of
> >> alignment. Tests on the board gave the same result (perhaps because the
> >> CPU on the board has 2 DDR channels?)
> >
> > The alignment is there because it can overread the string end
> > by one byte - and that mustn't cross a page boundary.
> > So you either have to mark the second load as 'may fault return
> > zero' or just not do it.
> >
> > If the data isn't in cache the cache load will dominate.
> > The DDR channels only affect cache load times.
> > Get a TLB miss and add a few thousand more clocks!
> >
>
> Ah, right, sounds reasonable...
>
> Overall, I believe your solution is better and it would be fairer if
> you sent it as a patch :) Here are the benchmark results for your
> version vs the original (old) one on the StarFive VisionFive2 RISC-V board:

You might want to try reading 4 bytes before checking any.
It might be quicker on your cpu.
It is hard to guess what is best across multiple implementations.
(For testing I'd not worry about falling off the page.)
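
I.e. something like this (sketch only; all four bytes are read before
any is tested, and it can read up to three bytes past the NUL):

#include <stddef.h>

static size_t strlen_unrolled4(const char *s)
{
        const char *p = s;

        for (;;) {
                unsigned char c0 = p[0];
                unsigned char c1 = p[1];
                unsigned char c2 = p[2];
                unsigned char c3 = p[3];

                if (c0 == 0)
                        return p - s;
                if (c1 == 0)
                        return p + 1 - s;
                if (c2 == 0)
                        return p + 2 - s;
                if (c3 == 0)
                        return p + 3 - s;
                p += 4;
        }
}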

I'll let you do the patch; I don't even have a toolchain, never mind
anything to test it on.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)