Please don't post patches as attachments.
> From: zhangfei <[email protected]>
> Date: Fri, 5 May 2023 14:58:35 +0800
> Subject: [PATCH] riscv: Optimize memset
>
> This patch optimizes memset for data sizes of less than 16 bytes.
> Compared to byte-by-byte stores, it achieves a significant performance
> improvement.
>
> Signed-off-by: Fei Zhang <[email protected]>
> ---
> arch/riscv/lib/memset.S | 33 ++++++++++++++++++++++++++++++---
> 1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
> index 34c5360c6705..0967bdf86bd6 100644
> --- a/arch/riscv/lib/memset.S
> +++ b/arch/riscv/lib/memset.S
> @@ -105,9 +105,36 @@ WEAK(memset)
> beqz a2, 6f
> add a3, t0, a2
> 5:
> - sb a1, 0(t0)
> - addi t0, t0, 1
> - bltu t0, a3, 5b
> + sb a1, 0(t0)
> + sb a1, -1(a3)
> + li a4, 2
> + bgeu a4, a2, 6f
> +
> + sb a1, 1(t0)
> + sb a1, 2(t0)
> + sb a1, -2(a3)
> + sb a1, -3(a3)
> + li a4, 6
> + bgeu a4, a2, 6f
> +
> + sb a1, 3(t0)
> + sb a1, -4(a3)
> + li a4, 8
> + bgeu a4, a2, 6f
Why is this check here?
> +
> + sb a1, 4(t0)
> + sb a1, -5(a3)
> + li a4, 10
> + bgeu a4, a2, 6f
And this one?
After the check of a2 against 6 above we know that offsets 6(t0)
and -7(a3) are safe. Are we trying to avoid too many redundant
stores with these additional checks?
> +
> + sb a1, 5(t0)
> + sb a1, 6(t0)
> + sb a1, -6(a3)
> + sb a1, -7(a3)
> + li a4, 14
> + bgeu a4, a2, 6f
> +
> + sb a1, 7(t0)
> 6:
> ret
> END(__memset)
> --
> 2.33.0
The indent of the new code doesn't match the old. I'd prefer we clean up
the old first, though. Please repost [1] as the first patch of a two-patch
series, where yours is the second and matches the new formatting that [1]
uses.
[1] https://lore.kernel.org/all/[email protected]/
Thanks,
Drew
From: zhangfei <[email protected]>
> > 5:
> > - sb a1, 0(t0)
> > - addi t0, t0, 1
> > - bltu t0, a3, 5b
> > + sb a1, 0(t0)
> > + sb a1, -1(a3)
> > + li a4, 2
> > + bgeu a4, a2, 6f
> > +
> > + sb a1, 1(t0)
> > + sb a1, 2(t0)
> > + sb a1, -2(a3)
> > + sb a1, -3(a3)
> > + li a4, 6
> > + bgeu a4, a2, 6f
> > +
> > + sb a1, 3(t0)
> > + sb a1, -4(a3)
> > + li a4, 8
> > + bgeu a4, a2, 6f
>
> Why is this check here?
Hi,
I filled head and tail with minimal branching. Each conditional ensures that
all the subsequently used offsets are well-defined and in the dest region.
Although this approach may result in redundant stores, it allows the store
instructions to execute in parallel and reduces the number of branches
compared to byte-by-byte stores.
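To make the branching scheme easier to follow, here is a rough C equivalent
of that head/tail fill for 1..15 bytes (an illustrative sketch only, with a
made-up helper name; the actual code is the RISC-V assembly in the patch):

#include <stddef.h>

/*
 * Illustrative C equivalent of the patch's head/tail fill, assuming
 * 1 <= n <= 15. Not taken from the patch itself.
 */
void memset_small(unsigned char *p, unsigned char c, size_t n)
{
	unsigned char *e = p + n;		/* one past the last byte */

	p[0] = c;  e[-1] = c;			/* enough for n <= 2 */
	if (n <= 2)
		return;
	p[1] = c;  p[2] = c;
	e[-2] = c; e[-3] = c;			/* enough for n <= 6 */
	if (n <= 6)
		return;
	p[3] = c;  e[-4] = c;			/* enough for n <= 8 */
	if (n <= 8)
		return;
	p[4] = c;  e[-5] = c;			/* enough for n <= 10 */
	if (n <= 10)
		return;
	p[5] = c;  p[6] = c;
	e[-6] = c; e[-7] = c;			/* enough for n <= 14 */
	if (n <= 14)
		return;
	p[7] = c;				/* n == 15 */
}

Each group stores from both ends of the region, so the stores for smaller
sizes overlap instead of requiring a separate branch per size.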
I used the code linked below for performance testing, commenting out the
Arm-specific memset variants so that the benchmark runs properly on the
RISC-V platform.
[1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53
The test platform is a RISC-V SiFive U74. The test data is as follows:
Before optimization
---------------------
Random memset (bytes/ns):
memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30
Medium memset (bytes/ns):
memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67
Large memset (bytes/ns):
memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40
After optimization
---------------------
Random memset (bytes/ns):
memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31
Medium memset (bytes/ns):
memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67
Large memset (bytes/ns):
memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40
The results show that memset performance improves significantly for sizes
around 8 bytes, from 0.18 bytes/ns to 0.27 bytes/ns.
Thanks,
Fei Zhang
From: Andrew Jones <[email protected]>
Aligning the first operand of each instruction with a tab is a
typical style which improves readability. Apply it to memset.S.
While there, we also make a small grammar change to a comment.
No functional change intended.
Signed-off-by: Andrew Jones <[email protected]>
---
arch/riscv/lib/memset.S | 143 ++++++++++++++++++++--------------------
1 file changed, 72 insertions(+), 71 deletions(-)
diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
index 34c5360c6705..e613c5c27998 100644
--- a/arch/riscv/lib/memset.S
+++ b/arch/riscv/lib/memset.S
@@ -3,111 +3,112 @@
* Copyright (C) 2013 Regents of the University of California
*/
-
#include <linux/linkage.h>
#include <asm/asm.h>
/* void *memset(void *, int, size_t) */
ENTRY(__memset)
WEAK(memset)
- move t0, a0 /* Preserve return value */
+ move t0, a0 /* Preserve return value */
/* Defer to byte-oriented fill for small sizes */
- sltiu a3, a2, 16
- bnez a3, 4f
+ sltiu a3, a2, 16
+ bnez a3, 4f
/*
* Round to nearest XLEN-aligned address
- * greater than or equal to start address
+ * greater than or equal to the start address.
*/
- addi a3, t0, SZREG-1
- andi a3, a3, ~(SZREG-1)
- beq a3, t0, 2f /* Skip if already aligned */
+ addi a3, t0, SZREG-1
+ andi a3, a3, ~(SZREG-1)
+ beq a3, t0, 2f /* Skip if already aligned */
+
/* Handle initial misalignment */
- sub a4, a3, t0
+ sub a4, a3, t0
1:
- sb a1, 0(t0)
- addi t0, t0, 1
- bltu t0, a3, 1b
- sub a2, a2, a4 /* Update count */
+ sb a1, 0(t0)
+ addi t0, t0, 1
+ bltu t0, a3, 1b
+ sub a2, a2, a4 /* Update count */
2: /* Duff's device with 32 XLEN stores per iteration */
/* Broadcast value into all bytes */
- andi a1, a1, 0xff
- slli a3, a1, 8
- or a1, a3, a1
- slli a3, a1, 16
- or a1, a3, a1
+ andi a1, a1, 0xff
+ slli a3, a1, 8
+ or a1, a3, a1
+ slli a3, a1, 16
+ or a1, a3, a1
#ifdef CONFIG_64BIT
- slli a3, a1, 32
- or a1, a3, a1
+ slli a3, a1, 32
+ or a1, a3, a1
#endif
/* Calculate end address */
- andi a4, a2, ~(SZREG-1)
- add a3, t0, a4
+ andi a4, a2, ~(SZREG-1)
+ add a3, t0, a4
- andi a4, a4, 31*SZREG /* Calculate remainder */
- beqz a4, 3f /* Shortcut if no remainder */
- neg a4, a4
- addi a4, a4, 32*SZREG /* Calculate initial offset */
+ andi a4, a4, 31*SZREG /* Calculate remainder */
+ beqz a4, 3f /* Shortcut if no remainder */
+ neg a4, a4
+ addi a4, a4, 32*SZREG /* Calculate initial offset */
/* Adjust start address with offset */
- sub t0, t0, a4
+ sub t0, t0, a4
/* Jump into loop body */
/* Assumes 32-bit instruction lengths */
- la a5, 3f
+ la a5, 3f
#ifdef CONFIG_64BIT
- srli a4, a4, 1
+ srli a4, a4, 1
#endif
- add a5, a5, a4
- jr a5
+ add a5, a5, a4
+ jr a5
3:
- REG_S a1, 0(t0)
- REG_S a1, SZREG(t0)
- REG_S a1, 2*SZREG(t0)
- REG_S a1, 3*SZREG(t0)
- REG_S a1, 4*SZREG(t0)
- REG_S a1, 5*SZREG(t0)
- REG_S a1, 6*SZREG(t0)
- REG_S a1, 7*SZREG(t0)
- REG_S a1, 8*SZREG(t0)
- REG_S a1, 9*SZREG(t0)
- REG_S a1, 10*SZREG(t0)
- REG_S a1, 11*SZREG(t0)
- REG_S a1, 12*SZREG(t0)
- REG_S a1, 13*SZREG(t0)
- REG_S a1, 14*SZREG(t0)
- REG_S a1, 15*SZREG(t0)
- REG_S a1, 16*SZREG(t0)
- REG_S a1, 17*SZREG(t0)
- REG_S a1, 18*SZREG(t0)
- REG_S a1, 19*SZREG(t0)
- REG_S a1, 20*SZREG(t0)
- REG_S a1, 21*SZREG(t0)
- REG_S a1, 22*SZREG(t0)
- REG_S a1, 23*SZREG(t0)
- REG_S a1, 24*SZREG(t0)
- REG_S a1, 25*SZREG(t0)
- REG_S a1, 26*SZREG(t0)
- REG_S a1, 27*SZREG(t0)
- REG_S a1, 28*SZREG(t0)
- REG_S a1, 29*SZREG(t0)
- REG_S a1, 30*SZREG(t0)
- REG_S a1, 31*SZREG(t0)
- addi t0, t0, 32*SZREG
- bltu t0, a3, 3b
- andi a2, a2, SZREG-1 /* Update count */
+ REG_S a1, 0(t0)
+ REG_S a1, SZREG(t0)
+ REG_S a1, 2*SZREG(t0)
+ REG_S a1, 3*SZREG(t0)
+ REG_S a1, 4*SZREG(t0)
+ REG_S a1, 5*SZREG(t0)
+ REG_S a1, 6*SZREG(t0)
+ REG_S a1, 7*SZREG(t0)
+ REG_S a1, 8*SZREG(t0)
+ REG_S a1, 9*SZREG(t0)
+ REG_S a1, 10*SZREG(t0)
+ REG_S a1, 11*SZREG(t0)
+ REG_S a1, 12*SZREG(t0)
+ REG_S a1, 13*SZREG(t0)
+ REG_S a1, 14*SZREG(t0)
+ REG_S a1, 15*SZREG(t0)
+ REG_S a1, 16*SZREG(t0)
+ REG_S a1, 17*SZREG(t0)
+ REG_S a1, 18*SZREG(t0)
+ REG_S a1, 19*SZREG(t0)
+ REG_S a1, 20*SZREG(t0)
+ REG_S a1, 21*SZREG(t0)
+ REG_S a1, 22*SZREG(t0)
+ REG_S a1, 23*SZREG(t0)
+ REG_S a1, 24*SZREG(t0)
+ REG_S a1, 25*SZREG(t0)
+ REG_S a1, 26*SZREG(t0)
+ REG_S a1, 27*SZREG(t0)
+ REG_S a1, 28*SZREG(t0)
+ REG_S a1, 29*SZREG(t0)
+ REG_S a1, 30*SZREG(t0)
+ REG_S a1, 31*SZREG(t0)
+
+ addi t0, t0, 32*SZREG
+ bltu t0, a3, 3b
+ andi a2, a2, SZREG-1 /* Update count */
4:
/* Handle trailing misalignment */
- beqz a2, 6f
- add a3, t0, a2
+ beqz a2, 6f
+ add a3, t0, a2
5:
- sb a1, 0(t0)
- addi t0, t0, 1
- bltu t0, a3, 5b
+ sb a1, 0(t0)
+ addi t0, t0, 1
+ bltu t0, a3, 5b
6:
ret
END(__memset)
--
2.33.0
On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote:
> From: zhangfei <[email protected]>
>
> > > 5:
> > > - sb a1, 0(t0)
> > > - addi t0, t0, 1
> > > - bltu t0, a3, 5b
> > > + sb a1, 0(t0)
> > > + sb a1, -1(a3)
> > > + li a4, 2
> > > + bgeu a4, a2, 6f
> > > +
> > > + sb a1, 1(t0)
> > > + sb a1, 2(t0)
> > > + sb a1, -2(a3)
> > > + sb a1, -3(a3)
> > > + li a4, 6
> > > + bgeu a4, a2, 6f
> > > +
> > > + sb a1, 3(t0)
> > > + sb a1, -4(a3)
> > > + li a4, 8
> > > + bgeu a4, a2, 6f
> >
> > Why is this check here?
>
> Hi,
>
> I filled head and tail with minimal branching. Each conditional ensures that
> all the subsequently used offsets are well-defined and in the dest region.
I know. You trimmed my comment, so I'll quote myself here:
"""
After the check of a2 against 6 above we know that offsets 6(t0)
and -7(a3) are safe. Are we trying to avoid too many redundant
stores with these additional checks?
"""
So, again: why the additional check against 8 above and, in the part you
trimmed, the one against 10?
>
> Although this approach may result in redundant stores, it allows the store
> instructions to execute in parallel and reduces the number of branches
> compared to byte-by-byte stores.
I understood that when I read the code, but text like this should go in
the commit message to avoid people having to think their way through
stuff.
>
> I used the code linked below for performance testing, commenting out the
> Arm-specific memset variants so that the benchmark runs properly on the
> RISC-V platform.
>
> [1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53
>
> The test platform is a RISC-V SiFive U74. The test data is as follows:
>
> Before optimization
> ---------------------
> Random memset (bytes/ns):
> memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30
>
> Medium memset (bytes/ns):
> memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67
> Large memset (bytes/ns):
> memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40
>
> After optimization
> ---------------------
> Random memset (bytes/ns):
> memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31
> Medium memset (bytes/ns):
> memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67
> Large memset (bytes/ns):
> memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40
>
> The results show that memset performance improves significantly for sizes
> around 8 bytes, from 0.18 bytes/ns to 0.27 bytes/ns.
And these benchmark results belong in the cover letter, which this series
is missing.
Thanks,
drew
On Tue, May 09, 2023 at 10:22:05AM +0800, zhangfei wrote:
> From: Andrew Jones <[email protected]>
>
> Aligning the first operand of each instruction with a tab is a
> typical style which improves readability. Apply it to memset.S.
> While there, we also make a small grammar change to a comment.
>
> No functional change intended.
>
> Signed-off-by: Andrew Jones <[email protected]>
Please pick up Conor's r-b on this reposting.
Thanks,
drew
> ---
> arch/riscv/lib/memset.S | 143 ++++++++++++++++++++--------------------
> 1 file changed, 72 insertions(+), 71 deletions(-)
>
> diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
> index 34c5360c6705..e613c5c27998 100644
> --- a/arch/riscv/lib/memset.S
> +++ b/arch/riscv/lib/memset.S
> @@ -3,111 +3,112 @@
> * Copyright (C) 2013 Regents of the University of California
> */
>
> -
> #include <linux/linkage.h>
> #include <asm/asm.h>
>
> /* void *memset(void *, int, size_t) */
> ENTRY(__memset)
> WEAK(memset)
> - move t0, a0 /* Preserve return value */
> + move t0, a0 /* Preserve return value */
>
> /* Defer to byte-oriented fill for small sizes */
> - sltiu a3, a2, 16
> - bnez a3, 4f
> + sltiu a3, a2, 16
> + bnez a3, 4f
>
> /*
> * Round to nearest XLEN-aligned address
> - * greater than or equal to start address
> + * greater than or equal to the start address.
> */
> - addi a3, t0, SZREG-1
> - andi a3, a3, ~(SZREG-1)
> - beq a3, t0, 2f /* Skip if already aligned */
> + addi a3, t0, SZREG-1
> + andi a3, a3, ~(SZREG-1)
> + beq a3, t0, 2f /* Skip if already aligned */
> +
> /* Handle initial misalignment */
> - sub a4, a3, t0
> + sub a4, a3, t0
> 1:
> - sb a1, 0(t0)
> - addi t0, t0, 1
> - bltu t0, a3, 1b
> - sub a2, a2, a4 /* Update count */
> + sb a1, 0(t0)
> + addi t0, t0, 1
> + bltu t0, a3, 1b
> + sub a2, a2, a4 /* Update count */
>
> 2: /* Duff's device with 32 XLEN stores per iteration */
> /* Broadcast value into all bytes */
> - andi a1, a1, 0xff
> - slli a3, a1, 8
> - or a1, a3, a1
> - slli a3, a1, 16
> - or a1, a3, a1
> + andi a1, a1, 0xff
> + slli a3, a1, 8
> + or a1, a3, a1
> + slli a3, a1, 16
> + or a1, a3, a1
> #ifdef CONFIG_64BIT
> - slli a3, a1, 32
> - or a1, a3, a1
> + slli a3, a1, 32
> + or a1, a3, a1
> #endif
>
> /* Calculate end address */
> - andi a4, a2, ~(SZREG-1)
> - add a3, t0, a4
> + andi a4, a2, ~(SZREG-1)
> + add a3, t0, a4
>
> - andi a4, a4, 31*SZREG /* Calculate remainder */
> - beqz a4, 3f /* Shortcut if no remainder */
> - neg a4, a4
> - addi a4, a4, 32*SZREG /* Calculate initial offset */
> + andi a4, a4, 31*SZREG /* Calculate remainder */
> + beqz a4, 3f /* Shortcut if no remainder */
> + neg a4, a4
> + addi a4, a4, 32*SZREG /* Calculate initial offset */
>
> /* Adjust start address with offset */
> - sub t0, t0, a4
> + sub t0, t0, a4
>
> /* Jump into loop body */
> /* Assumes 32-bit instruction lengths */
> - la a5, 3f
> + la a5, 3f
> #ifdef CONFIG_64BIT
> - srli a4, a4, 1
> + srli a4, a4, 1
> #endif
> - add a5, a5, a4
> - jr a5
> + add a5, a5, a4
> + jr a5
> 3:
> - REG_S a1, 0(t0)
> - REG_S a1, SZREG(t0)
> - REG_S a1, 2*SZREG(t0)
> - REG_S a1, 3*SZREG(t0)
> - REG_S a1, 4*SZREG(t0)
> - REG_S a1, 5*SZREG(t0)
> - REG_S a1, 6*SZREG(t0)
> - REG_S a1, 7*SZREG(t0)
> - REG_S a1, 8*SZREG(t0)
> - REG_S a1, 9*SZREG(t0)
> - REG_S a1, 10*SZREG(t0)
> - REG_S a1, 11*SZREG(t0)
> - REG_S a1, 12*SZREG(t0)
> - REG_S a1, 13*SZREG(t0)
> - REG_S a1, 14*SZREG(t0)
> - REG_S a1, 15*SZREG(t0)
> - REG_S a1, 16*SZREG(t0)
> - REG_S a1, 17*SZREG(t0)
> - REG_S a1, 18*SZREG(t0)
> - REG_S a1, 19*SZREG(t0)
> - REG_S a1, 20*SZREG(t0)
> - REG_S a1, 21*SZREG(t0)
> - REG_S a1, 22*SZREG(t0)
> - REG_S a1, 23*SZREG(t0)
> - REG_S a1, 24*SZREG(t0)
> - REG_S a1, 25*SZREG(t0)
> - REG_S a1, 26*SZREG(t0)
> - REG_S a1, 27*SZREG(t0)
> - REG_S a1, 28*SZREG(t0)
> - REG_S a1, 29*SZREG(t0)
> - REG_S a1, 30*SZREG(t0)
> - REG_S a1, 31*SZREG(t0)
> - addi t0, t0, 32*SZREG
> - bltu t0, a3, 3b
> - andi a2, a2, SZREG-1 /* Update count */
> + REG_S a1, 0(t0)
> + REG_S a1, SZREG(t0)
> + REG_S a1, 2*SZREG(t0)
> + REG_S a1, 3*SZREG(t0)
> + REG_S a1, 4*SZREG(t0)
> + REG_S a1, 5*SZREG(t0)
> + REG_S a1, 6*SZREG(t0)
> + REG_S a1, 7*SZREG(t0)
> + REG_S a1, 8*SZREG(t0)
> + REG_S a1, 9*SZREG(t0)
> + REG_S a1, 10*SZREG(t0)
> + REG_S a1, 11*SZREG(t0)
> + REG_S a1, 12*SZREG(t0)
> + REG_S a1, 13*SZREG(t0)
> + REG_S a1, 14*SZREG(t0)
> + REG_S a1, 15*SZREG(t0)
> + REG_S a1, 16*SZREG(t0)
> + REG_S a1, 17*SZREG(t0)
> + REG_S a1, 18*SZREG(t0)
> + REG_S a1, 19*SZREG(t0)
> + REG_S a1, 20*SZREG(t0)
> + REG_S a1, 21*SZREG(t0)
> + REG_S a1, 22*SZREG(t0)
> + REG_S a1, 23*SZREG(t0)
> + REG_S a1, 24*SZREG(t0)
> + REG_S a1, 25*SZREG(t0)
> + REG_S a1, 26*SZREG(t0)
> + REG_S a1, 27*SZREG(t0)
> + REG_S a1, 28*SZREG(t0)
> + REG_S a1, 29*SZREG(t0)
> + REG_S a1, 30*SZREG(t0)
> + REG_S a1, 31*SZREG(t0)
> +
> + addi t0, t0, 32*SZREG
> + bltu t0, a3, 3b
> + andi a2, a2, SZREG-1 /* Update count */
>
> 4:
> /* Handle trailing misalignment */
> - beqz a2, 6f
> - add a3, t0, a2
> + beqz a2, 6f
> + add a3, t0, a2
> 5:
> - sb a1, 0(t0)
> - addi t0, t0, 1
> - bltu t0, a3, 5b
> + sb a1, 0(t0)
> + addi t0, t0, 1
> + bltu t0, a3, 5b
> 6:
> ret
> END(__memset)
> --
> 2.33.0
>
From: zhangfei <[email protected]>
On Tue, May 09, 2023 at 11:16:33AM +0200, Andrew Jones wrote:
> On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote:
> >
> > Hi,
> >
> > I filled head and tail with minimal branching. Each conditional ensures that
> > all the subsequently used offsets are well-defined and in the dest region.
>
> I know. You trimmed my comment, so I'll quote myself here:
>
> """
> After the check of a2 against 6 above we know that offsets 6(t0)
> and -7(a3) are safe. Are we trying to avoid too many redundant
> stores with these additional checks?
> """
>
> So, again: why the additional check against 8 above and, in the part you
> trimmed, the one against 10?
Hi,
These additional checks are there to avoid too many redundant stores.
The check against 8 is there because when we fall through from the store
loop at label '3', fewer than 8 bytes remain, so stopping at that point
also avoids redundant stores.
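To make the effect concrete, a rough store count derived from the branch
structure above (an illustration, not measured data): for an 8-byte fill,
exiting at the check against 8 issues 8 byte stores, whereas falling
through to the 14-byte group would issue 14; likewise the check against 10
caps a 9- or 10-byte fill at 10 stores instead of 14.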
Thanks,
Fei Zhang
From: zhangfei <[email protected]>
At present, the memset implementation stores byte by byte when processing
tail data or when the initial size is less than 16 bytes, which is not
efficient. Therefore, I filled the head and tail with minimal branching.
Each conditional ensures that all the subsequently used offsets are
well-defined and within the dest region. Although this approach may result
in redundant stores, it allows the store instructions to execute in
parallel, reduces the number of branches, and ultimately improves
performance compared to byte-by-byte stores.
I used the code linked below for performance testing, commenting out the
Arm-specific memset variants so that the benchmark runs properly on the
RISC-V platform.
[1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53
The test platform is a RISC-V SiFive U74. The test data is as follows:
Before optimization
---------------------
Random memset (bytes/ns):
memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30
Medium memset (bytes/ns):
memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67
Large memset (bytes/ns):
memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40
After optimization
---------------------
Random memset (bytes/ns):
memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31
Medium memset (bytes/ns):
memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67
Large memset (bytes/ns):
memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40
The results show that memset performance improves significantly for sizes
around 8 bytes, from 0.18 bytes/ns to 0.27 bytes/ns.
Thanks,
Fei Zhang
Andrew Jones (1):
RISC-V: lib: Improve memset assembler formatting
arch/riscv/lib/memset.S | 143 ++++++++++++++++++++--------------------
1 file changed, 72 insertions(+), 71 deletions(-)
zhangfei (1):
riscv: Optimize memset
arch/riscv/lib/memset.S | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
From: Andrew Jones <[email protected]>
Aligning the first operand of each instruction with a tab is a
typical style which improves readability. Apply it to memset.S.
While there, we also make a small grammar change to a comment.
No functional change intended.
Signed-off-by: Andrew Jones <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
---
arch/riscv/lib/memset.S | 143 ++++++++++++++++++++--------------------
1 file changed, 72 insertions(+), 71 deletions(-)
diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
index 34c5360c6705..e613c5c27998 100644
--- a/arch/riscv/lib/memset.S
+++ b/arch/riscv/lib/memset.S
@@ -3,111 +3,112 @@
* Copyright (C) 2013 Regents of the University of California
*/
-
#include <linux/linkage.h>
#include <asm/asm.h>
/* void *memset(void *, int, size_t) */
ENTRY(__memset)
WEAK(memset)
- move t0, a0 /* Preserve return value */
+ move t0, a0 /* Preserve return value */
/* Defer to byte-oriented fill for small sizes */
- sltiu a3, a2, 16
- bnez a3, 4f
+ sltiu a3, a2, 16
+ bnez a3, 4f
/*
* Round to nearest XLEN-aligned address
- * greater than or equal to start address
+ * greater than or equal to the start address.
*/
- addi a3, t0, SZREG-1
- andi a3, a3, ~(SZREG-1)
- beq a3, t0, 2f /* Skip if already aligned */
+ addi a3, t0, SZREG-1
+ andi a3, a3, ~(SZREG-1)
+ beq a3, t0, 2f /* Skip if already aligned */
+
/* Handle initial misalignment */
- sub a4, a3, t0
+ sub a4, a3, t0
1:
- sb a1, 0(t0)
- addi t0, t0, 1
- bltu t0, a3, 1b
- sub a2, a2, a4 /* Update count */
+ sb a1, 0(t0)
+ addi t0, t0, 1
+ bltu t0, a3, 1b
+ sub a2, a2, a4 /* Update count */
2: /* Duff's device with 32 XLEN stores per iteration */
/* Broadcast value into all bytes */
- andi a1, a1, 0xff
- slli a3, a1, 8
- or a1, a3, a1
- slli a3, a1, 16
- or a1, a3, a1
+ andi a1, a1, 0xff
+ slli a3, a1, 8
+ or a1, a3, a1
+ slli a3, a1, 16
+ or a1, a3, a1
#ifdef CONFIG_64BIT
- slli a3, a1, 32
- or a1, a3, a1
+ slli a3, a1, 32
+ or a1, a3, a1
#endif
/* Calculate end address */
- andi a4, a2, ~(SZREG-1)
- add a3, t0, a4
+ andi a4, a2, ~(SZREG-1)
+ add a3, t0, a4
- andi a4, a4, 31*SZREG /* Calculate remainder */
- beqz a4, 3f /* Shortcut if no remainder */
- neg a4, a4
- addi a4, a4, 32*SZREG /* Calculate initial offset */
+ andi a4, a4, 31*SZREG /* Calculate remainder */
+ beqz a4, 3f /* Shortcut if no remainder */
+ neg a4, a4
+ addi a4, a4, 32*SZREG /* Calculate initial offset */
/* Adjust start address with offset */
- sub t0, t0, a4
+ sub t0, t0, a4
/* Jump into loop body */
/* Assumes 32-bit instruction lengths */
- la a5, 3f
+ la a5, 3f
#ifdef CONFIG_64BIT
- srli a4, a4, 1
+ srli a4, a4, 1
#endif
- add a5, a5, a4
- jr a5
+ add a5, a5, a4
+ jr a5
3:
- REG_S a1, 0(t0)
- REG_S a1, SZREG(t0)
- REG_S a1, 2*SZREG(t0)
- REG_S a1, 3*SZREG(t0)
- REG_S a1, 4*SZREG(t0)
- REG_S a1, 5*SZREG(t0)
- REG_S a1, 6*SZREG(t0)
- REG_S a1, 7*SZREG(t0)
- REG_S a1, 8*SZREG(t0)
- REG_S a1, 9*SZREG(t0)
- REG_S a1, 10*SZREG(t0)
- REG_S a1, 11*SZREG(t0)
- REG_S a1, 12*SZREG(t0)
- REG_S a1, 13*SZREG(t0)
- REG_S a1, 14*SZREG(t0)
- REG_S a1, 15*SZREG(t0)
- REG_S a1, 16*SZREG(t0)
- REG_S a1, 17*SZREG(t0)
- REG_S a1, 18*SZREG(t0)
- REG_S a1, 19*SZREG(t0)
- REG_S a1, 20*SZREG(t0)
- REG_S a1, 21*SZREG(t0)
- REG_S a1, 22*SZREG(t0)
- REG_S a1, 23*SZREG(t0)
- REG_S a1, 24*SZREG(t0)
- REG_S a1, 25*SZREG(t0)
- REG_S a1, 26*SZREG(t0)
- REG_S a1, 27*SZREG(t0)
- REG_S a1, 28*SZREG(t0)
- REG_S a1, 29*SZREG(t0)
- REG_S a1, 30*SZREG(t0)
- REG_S a1, 31*SZREG(t0)
- addi t0, t0, 32*SZREG
- bltu t0, a3, 3b
- andi a2, a2, SZREG-1 /* Update count */
+ REG_S a1, 0(t0)
+ REG_S a1, SZREG(t0)
+ REG_S a1, 2*SZREG(t0)
+ REG_S a1, 3*SZREG(t0)
+ REG_S a1, 4*SZREG(t0)
+ REG_S a1, 5*SZREG(t0)
+ REG_S a1, 6*SZREG(t0)
+ REG_S a1, 7*SZREG(t0)
+ REG_S a1, 8*SZREG(t0)
+ REG_S a1, 9*SZREG(t0)
+ REG_S a1, 10*SZREG(t0)
+ REG_S a1, 11*SZREG(t0)
+ REG_S a1, 12*SZREG(t0)
+ REG_S a1, 13*SZREG(t0)
+ REG_S a1, 14*SZREG(t0)
+ REG_S a1, 15*SZREG(t0)
+ REG_S a1, 16*SZREG(t0)
+ REG_S a1, 17*SZREG(t0)
+ REG_S a1, 18*SZREG(t0)
+ REG_S a1, 19*SZREG(t0)
+ REG_S a1, 20*SZREG(t0)
+ REG_S a1, 21*SZREG(t0)
+ REG_S a1, 22*SZREG(t0)
+ REG_S a1, 23*SZREG(t0)
+ REG_S a1, 24*SZREG(t0)
+ REG_S a1, 25*SZREG(t0)
+ REG_S a1, 26*SZREG(t0)
+ REG_S a1, 27*SZREG(t0)
+ REG_S a1, 28*SZREG(t0)
+ REG_S a1, 29*SZREG(t0)
+ REG_S a1, 30*SZREG(t0)
+ REG_S a1, 31*SZREG(t0)
+
+ addi t0, t0, 32*SZREG
+ bltu t0, a3, 3b
+ andi a2, a2, SZREG-1 /* Update count */
4:
/* Handle trailing misalignment */
- beqz a2, 6f
- add a3, t0, a2
+ beqz a2, 6f
+ add a3, t0, a2
5:
- sb a1, 0(t0)
- addi t0, t0, 1
- bltu t0, a3, 5b
+ sb a1, 0(t0)
+ addi t0, t0, 1
+ bltu t0, a3, 5b
6:
ret
END(__memset)
--
2.33.0
On Wed, May 10, 2023 at 11:52:43AM +0800, zhangfei wrote:
> From: zhangfei <[email protected]>
>
> On Tue, May 09, 2023 at 11:16:33AM +0200, Andrew Jones wrote:
> > On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote:
> > >
> > > Hi,
> > >
> > > I filled head and tail with minimal branching. Each conditional ensures that
> > > all the subsequently used offsets are well-defined and in the dest region.
> >
> > I know. You trimmed my comment, so I'll quote myself here:
> >
> > """
> > After the check of a2 against 6 above we know that offsets 6(t0)
> > and -7(a3) are safe. Are we trying to avoid too many redundant
> > stores with these additional checks?
> > """
> >
> > So, again: why the additional check against 8 above and, in the part you
> > trimmed, the one against 10?
>
> Hi,
>
> These additional checks are there to avoid too many redundant stores.
>
> The check against 8 is there because when we fall through from the store
> loop at label '3', fewer than 8 bytes remain, so stopping at that point
> also avoids redundant stores.
So the benchmarks showed these additional checks were necessary to avoid
making memset worse? Please add comments to the code explaining the
purpose of the checks.
Thanks,
drew
Hey Zhangfei,
On Wed, May 10, 2023 at 11:52:41AM +0800, zhangfei wrote:
> From: Andrew Jones <[email protected]>
>
> Aligning the first operand of each instructions with a tab is a
> typical style which improves readability. Apply it to memset.S.
> While there, we also make a small grammar change to a comment.
>
> No functional change intended.
>
> Signed-off-by: Andrew Jones <[email protected]>
> Reviewed-by: Conor Dooley <[email protected]>
Three things for you here...
Firstly, since you sent Andrew's work, you need to add your own
Signed-off-by to the patch.
Secondly & thirdly, please version your patchsets & do not send later
versions as a reply to the v1. In my mail client I see:
May 10 zhangfei ( 58) ┌─>[PATCH 2/2] riscv: Optimize memset
May 10 Andrew Jones ( 36) │ ┌─>
May 10 zhangfei ( 32) ├─>Re: [PATCH] riscv: Optimize memset
May 10 zhangfei ( 205) ├─>[PATCH 1/2] RISC-V: lib: Improve memset assembler formatting
May 10 zhangfei ( 56) ┌─>[PATCH 0/2] riscv: Optimize memset for data sizes less than 16 bytes
May 09 Andrew Jones ( 87) ┌─>Re: [PATCH] riscv: Optimize memset
May 09 zhangfei ( 67) ┌─>Re: Re: [PATCH] riscv: Optimize memset
May 09 Andrew Jones ( 211) ├─>Re: [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting
May 09 zhangfei ( 57) ├─>[PATCH 2/2] riscv: Optimize memset
May 09 zhangfei ( 204) ┌─>[PATCH 1/2] RISC-V: lib: Improve memset assembler formatting
May 05 Andrew Jones ( 91) ┌─>
May 05 张飞 ( 38) [PATCH] riscv: Optimize memset
How am I supposed to know what is what there?
Thanks,
Conor.
From: zhangfei <[email protected]>
On Wed, May 10, 2023 at 02:58:22PM +0200, Andrew Jones wrote:
> On Wed, May 10, 2023 at 11:52:43AM +0800, zhangfei wrote:
> > From: zhangfei <[email protected]>
> >
> > On Tue, May 09, 2023 at 11:16:33AM +0200, Andrew Jones wrote:
> > > On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote:
> > > >
> > > > Hi,
> > > >
> > > > I filled head and tail with minimal branching. Each conditional ensures that
> > > > all the subsequently used offsets are well-defined and in the dest region.
> > >
> > > I know. You trimmed my comment, so I'll quote myself here:
> > >
> > > """
> > > After the check of a2 against 6 above we know that offsets 6(t0)
> > > and -7(a3) are safe. Are we trying to avoid too many redundant
> > > stores with these additional checks?
> > > """
> > >
> > > So, again: why the additional check against 8 above and, in the part you
> > > trimmed, the one against 10?
> >
> > Hi,
> >
> > These additional checks are there to avoid too many redundant stores.
> >
> > The check against 8 is there because when we fall through from the store
> > loop at label '3', fewer than 8 bytes remain, so stopping at that point
> > also avoids redundant stores.
>
> So the benchmarks showed these additional checks were necessary to avoid
> making memset worse? Please add comments to the code explaining the
> purpose of the checks.
Hi,
As you suspected, the benchmarks show that omitting these additional checks
makes memset worse. When I removed the checks against 8 and 10 above, the
result dropped to 0.21 bytes/ns at 8B. Although this is still better than
storing byte by byte, the additional checks give a better improvement, to
0.27 bytes/ns.
I am sorry for the messy replies in my previous emails. I have reorganized
the series as a v2 and sent it to you. Please reply to the latest patches.
Thanks,
Fei Zhang
From: zhangfei <[email protected]>
Hi,
I am sorry for the messy replies in my previous emails. I have reorganized
the series as a v2 and sent it to you. Please reply to the latest patches.
Thanks,
Fei Zhang