2023-05-11 01:28:44

by zhangfei

Subject: [PATCH v2 0/2] RISC-V: Optimize memset for data sizes less than 16 bytes

From: zhangfei <[email protected]>

At present, the memset implementation stores byte by byte when
processing tail data or when the initial data size is less than 16
bytes. This approach is not efficient. Therefore, I fill the head and
tail with minimal branching. Each conditional ensures that all the
subsequently used offsets are well-defined and within the dest region.
Although this approach may result in redundant stores, compared to
byte-by-byte stores it allows the store instructions to execute in
parallel, reduces the number of jumps, and ultimately improves
performance.
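
As a rough illustration, here is the idea in C. This is only a sketch,
not the patch itself (which is RISC-V assembly); the function name is
mine, but it mirrors the patch's checks at 2, 6, 8, 10, and 14:

'''
#include <stddef.h>

/*
 * Sketch: fill [s, s + n) for 1 <= n < 16 with overlapping stores
 * from both ends. Some bytes may be written twice, but every store
 * stays inside the dest region.
 */
static void memset_small(unsigned char *s, unsigned char c, size_t n)
{
	unsigned char *e = s + n;	/* one past the end */

	s[0] = c; e[-1] = c;		/* enough for n <= 2 */
	if (n <= 2)
		return;
	s[1] = c; s[2] = c;
	e[-2] = c; e[-3] = c;		/* enough for n <= 6 */
	if (n <= 6)
		return;
	s[3] = c; e[-4] = c;		/* enough for n <= 8 */
	if (n <= 8)
		return;
	s[4] = c; e[-5] = c;		/* enough for n <= 10 */
	if (n <= 10)
		return;
	s[5] = c; s[6] = c;
	e[-6] = c; e[-7] = c;		/* enough for n <= 14 */
	if (n <= 14)
		return;
	s[7] = c;			/* n == 15 */
}
'''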

I used the code linked below for performance testing, commenting out
the Arm-specific memset calls so that it runs properly on the RISC-V
platform.

[1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53
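
For the small sizes, the measurement amounts to timing repeated memset
calls of a fixed size and reporting bytes/ns. A minimal sketch of such
a timing loop (my own simplification, not the benchmark code itself):

'''
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Rough bytes/ns measurement for a single memset size. */
static double bench_one(size_t size, long iters)
{
	static unsigned char buf[4096];
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++) {
		memset(buf, 0xa5, size);
		/* Barrier so the compiler cannot elide the call */
		__asm__ volatile("" : : "r"(buf) : "memory");
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
		    (t1.tv_nsec - t0.tv_nsec);
	return (double)size * (double)iters / ns;
}

int main(void)
{
	for (size_t size = 1; size <= 15; size++)
		printf("%zuB: %.2f bytes/ns\n",
		       size, bench_one(size, 10000000L));
	return 0;
}
'''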

The testing platform is a RISC-V SiFive U74. The test data is as follows:

Before optimization
---------------------
Random memset (bytes/ns):
memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30

Medium memset (bytes/ns):
memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67
Large memset (bytes/ns):
memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40

After optimization
---------------------
Random memset (bytes/ns):
memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31
Medium memset (bytes/ns):
memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67
Large memset (bytes/ns):
memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40

From the results, it can be seen that memset performance improves
significantly around a data size of 8B, from 0.18 bytes/ns to 0.27
bytes/ns.

The previous work was as follows:
1. "[PATCH] riscv: Optimize memset"
[email protected]

Thanks,
Fei Zhang

Andrew Jones (1):
RISC-V: lib: Improve memset assembler formatting

arch/riscv/lib/memset.S | 143 ++++++++++++++++++++--------------------
1 file changed, 72 insertions(+), 71 deletions(-)

zhangfei (1):
RISC-V: lib: Optimize memset performance

arch/riscv/lib/memset.S | 40 +++++++++++++++++++++++++++++++++++++---
1 file changed, 37 insertions(+), 3 deletions(-)



2023-05-11 01:49:42

by zhangfei

Subject: [PATCH v2 1/2] RISC-V: lib: Improve memset assembler formatting

From: Andrew Jones <[email protected]>

Aligning the first operand of each instruction with a tab is a
typical style which improves readability. Apply it to memset.S.
While there, we also make a small grammar change to a comment.

No functional change intended.

Signed-off-by: Andrew Jones <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
Signed-off-by: Fei Zhang <[email protected]>
---
arch/riscv/lib/memset.S | 143 ++++++++++++++++++++--------------------
1 file changed, 72 insertions(+), 71 deletions(-)

diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
index 34c5360c6705..e613c5c27998 100644
--- a/arch/riscv/lib/memset.S
+++ b/arch/riscv/lib/memset.S
@@ -3,111 +3,112 @@
* Copyright (C) 2013 Regents of the University of California
*/

-
#include <linux/linkage.h>
#include <asm/asm.h>

/* void *memset(void *, int, size_t) */
ENTRY(__memset)
WEAK(memset)
- move t0, a0 /* Preserve return value */
+ move t0, a0 /* Preserve return value */

/* Defer to byte-oriented fill for small sizes */
- sltiu a3, a2, 16
- bnez a3, 4f
+ sltiu a3, a2, 16
+ bnez a3, 4f

/*
* Round to nearest XLEN-aligned address
- * greater than or equal to start address
+ * greater than or equal to the start address.
*/
- addi a3, t0, SZREG-1
- andi a3, a3, ~(SZREG-1)
- beq a3, t0, 2f /* Skip if already aligned */
+ addi a3, t0, SZREG-1
+ andi a3, a3, ~(SZREG-1)
+ beq a3, t0, 2f /* Skip if already aligned */
+
/* Handle initial misalignment */
- sub a4, a3, t0
+ sub a4, a3, t0
1:
- sb a1, 0(t0)
- addi t0, t0, 1
- bltu t0, a3, 1b
- sub a2, a2, a4 /* Update count */
+ sb a1, 0(t0)
+ addi t0, t0, 1
+ bltu t0, a3, 1b
+ sub a2, a2, a4 /* Update count */

2: /* Duff's device with 32 XLEN stores per iteration */
/* Broadcast value into all bytes */
- andi a1, a1, 0xff
- slli a3, a1, 8
- or a1, a3, a1
- slli a3, a1, 16
- or a1, a3, a1
+ andi a1, a1, 0xff
+ slli a3, a1, 8
+ or a1, a3, a1
+ slli a3, a1, 16
+ or a1, a3, a1
#ifdef CONFIG_64BIT
- slli a3, a1, 32
- or a1, a3, a1
+ slli a3, a1, 32
+ or a1, a3, a1
#endif

/* Calculate end address */
- andi a4, a2, ~(SZREG-1)
- add a3, t0, a4
+ andi a4, a2, ~(SZREG-1)
+ add a3, t0, a4

- andi a4, a4, 31*SZREG /* Calculate remainder */
- beqz a4, 3f /* Shortcut if no remainder */
- neg a4, a4
- addi a4, a4, 32*SZREG /* Calculate initial offset */
+ andi a4, a4, 31*SZREG /* Calculate remainder */
+ beqz a4, 3f /* Shortcut if no remainder */
+ neg a4, a4
+ addi a4, a4, 32*SZREG /* Calculate initial offset */

/* Adjust start address with offset */
- sub t0, t0, a4
+ sub t0, t0, a4

/* Jump into loop body */
/* Assumes 32-bit instruction lengths */
- la a5, 3f
+ la a5, 3f
#ifdef CONFIG_64BIT
- srli a4, a4, 1
+ srli a4, a4, 1
#endif
- add a5, a5, a4
- jr a5
+ add a5, a5, a4
+ jr a5
3:
- REG_S a1, 0(t0)
- REG_S a1, SZREG(t0)
- REG_S a1, 2*SZREG(t0)
- REG_S a1, 3*SZREG(t0)
- REG_S a1, 4*SZREG(t0)
- REG_S a1, 5*SZREG(t0)
- REG_S a1, 6*SZREG(t0)
- REG_S a1, 7*SZREG(t0)
- REG_S a1, 8*SZREG(t0)
- REG_S a1, 9*SZREG(t0)
- REG_S a1, 10*SZREG(t0)
- REG_S a1, 11*SZREG(t0)
- REG_S a1, 12*SZREG(t0)
- REG_S a1, 13*SZREG(t0)
- REG_S a1, 14*SZREG(t0)
- REG_S a1, 15*SZREG(t0)
- REG_S a1, 16*SZREG(t0)
- REG_S a1, 17*SZREG(t0)
- REG_S a1, 18*SZREG(t0)
- REG_S a1, 19*SZREG(t0)
- REG_S a1, 20*SZREG(t0)
- REG_S a1, 21*SZREG(t0)
- REG_S a1, 22*SZREG(t0)
- REG_S a1, 23*SZREG(t0)
- REG_S a1, 24*SZREG(t0)
- REG_S a1, 25*SZREG(t0)
- REG_S a1, 26*SZREG(t0)
- REG_S a1, 27*SZREG(t0)
- REG_S a1, 28*SZREG(t0)
- REG_S a1, 29*SZREG(t0)
- REG_S a1, 30*SZREG(t0)
- REG_S a1, 31*SZREG(t0)
- addi t0, t0, 32*SZREG
- bltu t0, a3, 3b
- andi a2, a2, SZREG-1 /* Update count */
+ REG_S a1, 0(t0)
+ REG_S a1, SZREG(t0)
+ REG_S a1, 2*SZREG(t0)
+ REG_S a1, 3*SZREG(t0)
+ REG_S a1, 4*SZREG(t0)
+ REG_S a1, 5*SZREG(t0)
+ REG_S a1, 6*SZREG(t0)
+ REG_S a1, 7*SZREG(t0)
+ REG_S a1, 8*SZREG(t0)
+ REG_S a1, 9*SZREG(t0)
+ REG_S a1, 10*SZREG(t0)
+ REG_S a1, 11*SZREG(t0)
+ REG_S a1, 12*SZREG(t0)
+ REG_S a1, 13*SZREG(t0)
+ REG_S a1, 14*SZREG(t0)
+ REG_S a1, 15*SZREG(t0)
+ REG_S a1, 16*SZREG(t0)
+ REG_S a1, 17*SZREG(t0)
+ REG_S a1, 18*SZREG(t0)
+ REG_S a1, 19*SZREG(t0)
+ REG_S a1, 20*SZREG(t0)
+ REG_S a1, 21*SZREG(t0)
+ REG_S a1, 22*SZREG(t0)
+ REG_S a1, 23*SZREG(t0)
+ REG_S a1, 24*SZREG(t0)
+ REG_S a1, 25*SZREG(t0)
+ REG_S a1, 26*SZREG(t0)
+ REG_S a1, 27*SZREG(t0)
+ REG_S a1, 28*SZREG(t0)
+ REG_S a1, 29*SZREG(t0)
+ REG_S a1, 30*SZREG(t0)
+ REG_S a1, 31*SZREG(t0)
+
+ addi t0, t0, 32*SZREG
+ bltu t0, a3, 3b
+ andi a2, a2, SZREG-1 /* Update count */

4:
/* Handle trailing misalignment */
- beqz a2, 6f
- add a3, t0, a2
+ beqz a2, 6f
+ add a3, t0, a2
5:
- sb a1, 0(t0)
- addi t0, t0, 1
- bltu t0, a3, 5b
+ sb a1, 0(t0)
+ addi t0, t0, 1
+ bltu t0, a3, 5b
6:
ret
END(__memset)
--
2.33.0


2023-05-11 01:56:19

by zhangfei

Subject: [PATCH v2 2/2] RISC-V: lib: Optimize memset performance

From: zhangfei <[email protected]>

Optimize performance when the data size is less than 16 bytes.
Compared to byte-by-byte stores, this achieves a significant
performance improvement: it allows the store instructions to execute
in parallel and reduces the number of jumps. Additional checks avoid
redundant stores.

Signed-off-by: Fei Zhang <[email protected]>
---
arch/riscv/lib/memset.S | 40 +++++++++++++++++++++++++++++++++++++---
1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
index e613c5c27998..452764bc9900 100644
--- a/arch/riscv/lib/memset.S
+++ b/arch/riscv/lib/memset.S
@@ -106,9 +106,43 @@ WEAK(memset)
beqz a2, 6f
add a3, t0, a2
5:
- sb a1, 0(t0)
- addi t0, t0, 1
- bltu t0, a3, 5b
+ /* fill head and tail with minimal branching */
+ sb a1, 0(t0)
+ sb a1, -1(a3)
+ li a4, 2
+ bgeu a4, a2, 6f
+
+ sb a1, 1(t0)
+ sb a1, 2(t0)
+ sb a1, -2(a3)
+ sb a1, -3(a3)
+ li a4, 6
+ bgeu a4, a2, 6f
+
+ /*
+ * Adding additional detection to avoid
+ * redundant stores can lead
+ * to better performance
+ */
+ sb a1, 3(t0)
+ sb a1, -4(a3)
+ li a4, 8
+ bgeu a4, a2, 6f
+
+ sb a1, 4(t0)
+ sb a1, -5(a3)
+ li a4, 10
+ bgeu a4, a2, 6f
+
+ sb a1, 5(t0)
+ sb a1, 6(t0)
+ sb a1, -6(a3)
+ sb a1, -7(a3)
+ li a4, 14
+ bgeu a4, a2, 6f
+
+ /* store the last byte */
+ sb a1, 7(t0)
6:
ret
END(__memset)
--
2.33.0


2023-05-11 08:04:49

by Andrew Jones

Subject: Re: [PATCH v2 2/2] RISC-V: lib: Optimize memset performance

On Thu, May 11, 2023 at 09:34:53AM +0800, zhangfei wrote:
> From: zhangfei <[email protected]>
>
> Optimized performance when the data size is less than 16 bytes.
> Compared to byte by byte storage, significant performance improvement has been achieved.
> It allows storage instructions to be executed in parallel and reduces the number of jumps.

Please wrap commit message lines at 74 chars.

> Additional checks can avoid redundant stores.
>
> Signed-off-by: Fei Zhang <[email protected]>
> ---
> arch/riscv/lib/memset.S | 40 +++++++++++++++++++++++++++++++++++++---
> 1 file changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
> index e613c5c27998..452764bc9900 100644
> --- a/arch/riscv/lib/memset.S
> +++ b/arch/riscv/lib/memset.S
> @@ -106,9 +106,43 @@ WEAK(memset)
> beqz a2, 6f
> add a3, t0, a2
> 5:
> - sb a1, 0(t0)
> - addi t0, t0, 1
> - bltu t0, a3, 5b
> + /* fill head and tail with minimal branching */
> + sb a1, 0(t0)
> + sb a1, -1(a3)
> + li a4, 2
> + bgeu a4, a2, 6f
> +
> + sb a1, 1(t0)
> + sb a1, 2(t0)
> + sb a1, -2(a3)
> + sb a1, -3(a3)
> + li a4, 6
> + bgeu a4, a2, 6f
> +
> + /*
> + * Adding additional detection to avoid
> + * redundant stores can lead
> + * to better performance
> + */
> + sb a1, 3(t0)
> + sb a1, -4(a3)
> + li a4, 8
> + bgeu a4, a2, 6f
> +
> + sb a1, 4(t0)
> + sb a1, -5(a3)
> + li a4, 10
> + bgeu a4, a2, 6f

These extra checks feel ad hoc to me. Naturally you'll get better results
for 8 byte memsets when there's a branch to the ret after 8 bytes. But
what about 9? I'd think you'd want benchmarks from 1 to 15 bytes to show
how it performs better or worse than byte by byte for each of those sizes.
Also, while 8 bytes might be worth special casing, I'm not sure why 10
would be. What makes 10 worth optimizing more than 11?

Finally, microbenchmarking is quite hardware-specific and energy
consumption should probably also be considered. What energy cost is
there from making redundant stores? Is it worth it?

Thanks for cleaning up the patch series, but I'm still not 100%
convinced we want it.

> +
> + sb a1, 5(t0)
> + sb a1, 6(t0)
> + sb a1, -6(a3)
> + sb a1, -7(a3)
> + li a4, 14
> + bgeu a4, a2, 6f
> +
> + /* store the last byte */
> + sb a1, 7(t0)
> 6:
> ret
> END(__memset)
> --
> 2.33.0
>

Thanks,
drew

2023-05-11 08:06:32

by Andrew Jones

Subject: Re: [PATCH v2 0/2] RISC-V: Optimize memset for data sizes less than 16 bytes

On Thu, May 11, 2023 at 09:26:04AM +0800, zhangfei wrote:
> From: zhangfei <[email protected]>
>
> At present, the implementation of the memset function uses byte by byte storage
> when processing tail data or when the initial data size is less than 16 bytes.
> This approach is not efficient. Therefore, I filled head and tail with minimal
> branching. Each conditional ensures that all the subsequently used offsets are
> well-defined and in the dest region. Although this approach may result in
> redundant storage, compared to byte by byte storage, it allows storage instructions
> to be executed in parallel, reduces the number of jumps, and ultimately achieves
> performance improvement.
>
> I used the code linked below for performance testing and commented on the memset
> that calls the arm architecture in the code to ensure it runs properly on the
> risc-v platform.
>
> [1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53
>
> The testing platform selected RISC-V SiFive U74.The test data is as follows:
>
> Before optimization
> ---------------------
> Random memset (bytes/ns):
> memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30
>
> Medium memset (bytes/ns):
> memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67
> Large memset (bytes/ns):
> memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40
>
> After optimization
> ---------------------
> Random memset bytes/ns):
> memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31
> Medium memset (bytes/ns )
> memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67
> Large memset (bytes/ns):
> memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40
>
> From the results, it can be seen that memset has significantly improved its performance with
> a data volume of around 8B, from 0.18 bytes/ns to 0.27 bytes/ns.
>
> The previous work was as follows:
> 1. "[PATCH] riscv: Optimize memset"
> [email protected]

Cover letters should have a changelog, in this case a couple phrases
stating what's different in v2 vs. v1.

Thanks,
drew

>
> Thanks,
> Fei Zhang
>
> Andrew Jones (1):
> RISC-V: lib: Improve memset assembler formatting
>
> arch/riscv/lib/memset.S | 143 ++++++++++++++++++++--------------------
> 1 file changed, 72 insertions(+), 71 deletions(-)
>
> zhangfei (1):
> RISC-V: lib: Optimize memset performance
>
> arch/riscv/lib/memset.S | 40 +++++++++++++++++++++++++++++++++++++---
> 1 file changed, 37 insertions(+), 3 deletions(-)
>

2023-05-12 08:56:18

by zhangfei

Subject: Re: [PATCH v2 2/2] RISC-V: lib: Optimize memset performance

From: zhangfei <[email protected]>

On Thu, May 11, 2023 at 15:43:26PM +0200, Andrew Jones wrote:
> On Thu, May 11, 2023 at 09:34:53AM +0800, zhangfei wrote:
> > From: zhangfei <[email protected]>
> >
> > Optimized performance when the data size is less than 16 bytes.
> > Compared to byte by byte storage, significant performance improvement has been achieved.
> > It allows storage instructions to be executed in parallel and reduces the number of jumps.
>
> Please wrap commit message lines at 74 chars.
>
> > Additional checks can avoid redundant stores.
> >
> > Signed-off-by: Fei Zhang <[email protected]>
> > ---
> > arch/riscv/lib/memset.S | 40 +++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 37 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
> > index e613c5c27998..452764bc9900 100644
> > --- a/arch/riscv/lib/memset.S
> > +++ b/arch/riscv/lib/memset.S
> > @@ -106,9 +106,43 @@ WEAK(memset)
> > beqz a2, 6f
> > add a3, t0, a2
> > 5:
> > - sb a1, 0(t0)
> > - addi t0, t0, 1
> > - bltu t0, a3, 5b
> > + /* fill head and tail with minimal branching */
> > + sb a1, 0(t0)
> > + sb a1, -1(a3)
> > + li a4, 2
> > + bgeu a4, a2, 6f
> > +
> > + sb a1, 1(t0)
> > + sb a1, 2(t0)
> > + sb a1, -2(a3)
> > + sb a1, -3(a3)
> > + li a4, 6
> > + bgeu a4, a2, 6f
> > +
> > + /*
> > + * Adding additional detection to avoid
> > + * redundant stores can lead
> > + * to better performance
> > + */
> > + sb a1, 3(t0)
> > + sb a1, -4(a3)
> > + li a4, 8
> > + bgeu a4, a2, 6f
> > +
> > + sb a1, 4(t0)
> > + sb a1, -5(a3)
> > + li a4, 10
> > + bgeu a4, a2, 6f
>
> These extra checks feel ad hoc to me. Naturally you'll get better results
> for 8 byte memsets when there's a branch to the ret after 8 bytes. But
> what about 9? I'd think you'd want benchmarks from 1 to 15 bytes to show
> how it performs better or worse than byte by byte for each of those sizes.
> Also, while 8 bytes might be worth special casing, I'm not sure why 10
> would be. What makes 10 worth optimizing more than 11?
>
> Finally, microbenchmarking is quite hardware-specific and energy
> consumption should probably also be considered. What energy cost is
> there from making redundant stores? Is it worth it?

Hi,

I added tests from 1 to 15 bytes to the benchmarks. The test results
are as follows:
Before optimization(bytes/ns):
1B: 0.06 2B: 0.10 3B: 0.12 4B: 0.14 5B: 0.15 6B: 0.17 7B: 0.17 8B: 0.18
9B: 0.19 10B: 0.19 11B: 0.20 12B: 0.20 13B: 0.20 14B: 0.21 15B: 0.21
After optimization(bytes/ns):
1B: 0.05 2B: 0.10 3B: 0.11 4B: 0.15 5B: 0.19 6B: 0.23 7B: 0.23 8B: 0.26
9B: 0.24 10B: 0.27 11B: 0.25 12B: 0.27 13B: 0.28 14B: 0.30 15B: 0.31

From the above results, it can be seen that the performance for 1-4
bytes is similar, with a significant improvement for 5-15 bytes. At the
same time, it can be seen that redundant stores do indeed lead to
performance degradation, for example at 9 bytes and 11 bytes (with the
posted checks at 2, 6, 8, 10, and 14, a 9-byte memset executes ten
stores, so at least one byte is written twice).

Next, I modified the code to check at 2, 6, 8, 11, and 14, as shown below:
'''
sb a1, 4(t0)
sb a1, 5(t0)
sb a1, -5(a3)
li a4, 11
bgeu a4, a2, 6f

sb a1, 6(t0)
sb a1, -6(a3)
sb a1, -7(a3)
li a4, 14
bgeu a4, a2, 6f
'''
The results obtained in this way are as follows:
After optimization(bytes/ns):
1B: 0.05 2B: 0.10 3B: 0.11 4B: 0.15 5B: 0.19 6B: 0.23 7B: 0.23 8B: 0.27
9B: 0.23 10B: 0.26 11B: 0.29 12B: 0.26 13B: 0.28 14B: 0.29 15B: 0.31

From the above results, it can be seen that when the code checks at 11,
the performance at 11 bytes improved from 0.25 bytes/ns to 0.29
bytes/ns. Is it possible to minimize redundant stores while still
keeping the stores parallel, to achieve optimal performance?

Therefore, I modified the code to check at 2, 4, 6, 8, 10, 12, and 14, as shown below:
'''
sb a1, 4(t0)
sb a1, -5(a3)
li a4, 10
bgeu a4, a2, 6f

sb a1, 5(t0)
sb a1, -6(a3)
li a4, 12
bgeu a4, a2, 6f

sb a1, 6(t0)
sb a1, -7(a3)
li a4, 14
bgeu a4, a2, 6f
'''
The results obtained in this way are as follows:
After optimization(bytes/ns):
1B: 0.05 2B: 0.10 3B: 0.12 4B: 0.17 5B: 0.18 6B: 0.21 7B: 0.22 8B: 0.25
9B: 0.24 10B: 0.26 11B: 0.25 12B: 0.27 13B: 0.26 14B: 0.27 15B: 0.29

From the above results, it can be seen that this approach did not achieve the best
performance.

Through the above experiments, here are my conclusions:
1. This optimization method will inevitably result in duplicate
stores. I cannot achieve the optimal state for every byte count; for
example, when I add a check at 9, the performance at 9 naturally
improves, but 10 and 11 may become worse due to redundant stores.
Therefore, I need to make a trade-off between redundant stores and
parallelism, such as checking at 9 or 10, or something else.

2. Store parallelism and fewer jumps compensate for the cost of
redundant stores. Based on the current test results, regardless of
which byte counts I check, performance is better than byte-by-byte
stores.

3. From the above experiments, checking at 2, 6, 8, 11, and 14 gives
the best overall performance.

Because I am not a chip designer, I find it difficult to give specific
energy consumption costs. Do you have any suggestions on how to test
this? I think that although the number of stores has increased, there
is a corresponding reduction in jumps and better use of the pipeline.

Thanks,
Fei Zhang


2023-05-12 10:05:57

by Andrew Jones

Subject: Re: [PATCH v2 2/2] RISC-V: lib: Optimize memset performance

On Fri, May 12, 2023 at 04:51:24PM +0800, zhangfei wrote:
> From: zhangfei <[email protected]>
>
> On Thu, May 11, 2023 at 15:43:26PM +0200, Andrew Jones wrote:
> > On Thu, May 11, 2023 at 09:34:53AM +0800, zhangfei wrote:
> > > From: zhangfei <[email protected]>
> > >
> > > Optimized performance when the data size is less than 16 bytes.
> > > Compared to byte by byte storage, significant performance improvement has been achieved.
> > > It allows storage instructions to be executed in parallel and reduces the number of jumps.
> >
> > Please wrap commit message lines at 74 chars.
> >
> > > Additional checks can avoid redundant stores.
> > >
> > > Signed-off-by: Fei Zhang <[email protected]>
> > > ---
> > > arch/riscv/lib/memset.S | 40 +++++++++++++++++++++++++++++++++++++---
> > > 1 file changed, 37 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
> > > index e613c5c27998..452764bc9900 100644
> > > --- a/arch/riscv/lib/memset.S
> > > +++ b/arch/riscv/lib/memset.S
> > > @@ -106,9 +106,43 @@ WEAK(memset)
> > > beqz a2, 6f
> > > add a3, t0, a2
> > > 5:
> > > - sb a1, 0(t0)
> > > - addi t0, t0, 1
> > > - bltu t0, a3, 5b
> > > + /* fill head and tail with minimal branching */
> > > + sb a1, 0(t0)
> > > + sb a1, -1(a3)
> > > + li a4, 2
> > > + bgeu a4, a2, 6f
> > > +
> > > + sb a1, 1(t0)
> > > + sb a1, 2(t0)
> > > + sb a1, -2(a3)
> > > + sb a1, -3(a3)
> > > + li a4, 6
> > > + bgeu a4, a2, 6f
> > > +
> > > + /*
> > > + * Adding additional detection to avoid
> > > + * redundant stores can lead
> > > + * to better performance
> > > + */
> > > + sb a1, 3(t0)
> > > + sb a1, -4(a3)
> > > + li a4, 8
> > > + bgeu a4, a2, 6f
> > > +
> > > + sb a1, 4(t0)
> > > + sb a1, -5(a3)
> > > + li a4, 10
> > > + bgeu a4, a2, 6f
> >
> > These extra checks feel ad hoc to me. Naturally you'll get better results
> > for 8 byte memsets when there's a branch to the ret after 8 bytes. But
> > what about 9? I'd think you'd want benchmarks from 1 to 15 bytes to show
> > how it performs better or worse than byte by byte for each of those sizes.
> > Also, while 8 bytes might be worth special casing, I'm not sure why 10
> > would be. What makes 10 worth optimizing more than 11?
> >
> > Finally, microbenchmarking is quite hardware-specific and energy
> > consumption should probably also be considered. What energy cost is
> > there from making redundant stores? Is it worth it?
>
> Hi,
>
> I added a test from 1 to 15 bytes in the benchmarks.The test results are as
> follows:
> Before optimization(bytes/ns):
> 1B: 0.06 2B: 0.10 3B: 0.12 4B: 0.14 5B: 0.15 6B: 0.17 7B: 0.17 8B: 0.18
> 9B: 0.19 10B: 0.19 11B: 0.20 12B: 0.20 13B: 0.20 14B: 0.21 15B: 0.21
> After optimization(bytes/ns):
> 1B: 0.05 2B: 0.10 3B: 0.11 4B: 0.15 5B: 0.19 6B: 0.23 7B: 0.23 8B: 0.26
> 9B: 0.24 10B: 0.27 11B: 0.25 12B: 0.27 13B: 0.28 14B: 0.30 15B: 0.31
>
> From the above results, it can be seen that the performance of 1-4 bytes is
> similar, with a significant improvement in 5-15 bytes.At the same time, it can
> be seen that redundant stores does indeed lead to performance degradation,
> such as at 9 bytes and 11 bytes.
>
> Next, I modified the code to check 2, 6, 8, 11, 14, as shown below:
> '''
> sb a1, 4(t0)
> sb a1, 5(t0)
> sb a1, -5(a3)
> li a4, 11
> bgeu a4, a2, 6f
>
> sb a1, 6(t0)
> sb a1, -6(a3)
> sb a1, -7(a3)
> li a4, 14
> bgeu a4, a2, 6f
> '''
> The results obtained in this way are as follows:
> After optimization(bytes/ns):
> 1B: 0.05 2B: 0.10 3B: 0.11 4B: 0.15 5B: 0.19 6B: 0.23 7B: 0.23 8B: 0.27
> 9B: 0.23 10B: 0.26 11B: 0.29 12B: 0.26 13B: 0.28 14B: 0.29 15B: 0.31
>
> From the above results, it can be seen that when we modified it to check at 11,
> the performance improved from 0.25 bytes/ns to 0.29 bytes/ns.Is it possible to
> minimize redundant stores while ensuring parallel stores to achieve optimal
> performance?
>
> Therefore, I modified the code to detect 2, 4, 6, 8, 10, 12, 14, as shown below:
> '''
> sb a1, 4(t0)
> sb a1, -5(a3)
> li a4, 10
> bgeu a4, a2, 6f
>
> sb a1, 5(t0)
> sb a1, -6(a3)
> li a4, 12
> bgeu a4, a2, 6f
>
> sb a1, 6(t0)
> sb a1, -7(a3)
> li a4, 14
> bgeu a4, a2, 6f
> '''
> The results obtained in this way are as follows:
> After optimization(bytes/ns):
> 1B: 0.05 2B: 0.10 3B: 0.12 4B: 0.17 5B: 0.18 6B: 0.21 7B: 0.22 8B: 0.25
> 9B: 0.24 10B: 0.26 11B: 0.25 12B: 0.27 13B: 0.26 14B: 0.27 15B: 0.29
>
> From the above results, it can be seen that this approach did not achieve the best
> performance.
>
> Through the above experiment, here is my conclusion:
> 1.This optimization method will inevitably result in duplicate storage. I cannot
> achieve the optimal state of each byte, for example, when I add checks on 9,
> the performance of 9 will naturally improve, but 10 and 11 may become worse due
> to redundant stores.Therefore, I need to make a trade-off between redundant stores
> and parallelism, such as checking 9 or 10, or something else.
>
> 2.Storing parallelism and reducing jumps will compensate for the cost of redundant
> stores. Based on the current multiple test results, regardless of which bytes I
> modify to check, its performance is better than byte by byte storage.
>
> 3.From the above experiment, for the detection of 2, 6, 8, 11, and 14, its overall
> performance is the best.
>
> Because I am not a chip designer, I find it difficult to answer specific energy
> consumption costs. Do you have any suggestions and how to conduct testing in this
> regard? I think although storage has increased, there has been a corresponding
> reduction in jumps and the use of pipelines.

That's my point. There are so many variables that it's hard to know if
it makes sense to do this micro-optimization at all, and certainly hard
to decide if adding the unnecessary branches at 8, 11, and 14 makes
sense. I see some value in an 8 branch and maybe a 12, since it seems
somewhat likely memset would get used for data with sizes that are
4-byte aligned. But even that is just speculation. I think that without
comprehensive benchmarks showing an overall gain, this isn't worth
merging.
That said, if you post again with stronger justification based on these
experiments, then I'll give it a tentative r-b and let the maintainers
decide.

Thanks,
drew

2023-05-12 11:24:04

by David Laight

Subject: RE: [PATCH v2 2/2] RISC-V: lib: Optimize memset performance

From: zhangfei
> Sent: 12 May 2023 09:51
...
> 2.Storing parallelism and reducing jumps will compensate for the cost of redundant
> stores. Based on the current multiple test results, regardless of which bytes I
> modify to check, its performance is better than byte by byte storage.
>
> 3.From the above experiment, for the detection of 2, 6, 8, 11, and 14, its overall
> performance is the best.

I'm surprised the RISC-V cpu supports parallel stores.
A typical x86 desktop cpu can only do a single store (and two loads)
every clock.
Clearly doing writes offset from both ends of the buffer does
reduce the number of control instructions relative to the stores.

Since memory writes can easily be queued I'd expect that your
'aim' would be one write every clock.
Achieving that requires knowledge of which instructions can execute in
parallel and the delays associated with correctly predicted branches.
That will very much depend on which RISC-V cpu you have.
Since any loop is at least two instructions (addi+blt) you almost
certainly need at least two writes per iteration.

I do think you are missing a trick though.
IIRC some RISC-V cpus properly support misaligned writes.
In that case, for a long enough memset you can do something like:
end = start + length;
*(u64 *)start = 0;
start = (start + 24) & ~15;
do {
	*(u64 *)(start - 16) = 0;
	*(u64 *)(start - 8) = 0;
	start += 16;
} while (start < end);
*(u64 *)(end - 16) = 0;
*(u64 *)(end - 8) = 0;
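
(For this to work the length needs to be at least 16 or so: the
(start + 24) & ~15 rounding puts the first loop store no later than
start + 8, so it joins up with the initial 8-byte store, and the two
trailing stores always cover the last 16 bytes.)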

> Because I am not a chip designer, I find it difficult to answer specific energy
> consumption costs. Do you have any suggestions and how to conduct testing in this
> regard? I think although storage has increased, there has been a corresponding
> reduction in jumps and the use of pipelines.

Energy use will pretty much depend on the number of clocks.
Anything else will be 2nd order noise.

What does make a difference is that increasing the code size
evicts other code from the I-cache.
This has a knock-on effect on overall system performance.
So while massively unrolling a loop will improve a benchmark
(especially if it is run 'hot-cache') there can be negative
effects on overall system performance.
The code size here probably won't have a measurable effect but
unroll to many kb and the effect can get pronounced.

David
