2021-02-16 23:02:27

by Gary Guo

Subject: [PATCH] riscv: fix memmove and optimise memcpy when misalign

04091d6 introduces an assembly version of memmove but
it does not take misalignment into account (it checks
whether the length is a multiple of the machine word size,
but the pointers also need to be aligned). As a result it
generates misaligned loads/stores for the majority of cases
and causes a significant performance regression on hardware
that traps misaligned loads/stores and emulates them using
firmware.

The current behaviour of memcpy is that it checks whether
src and dest pointers are co-aligned (i.e. congruent modulo
SZREG). If they are, it copies data word-by-word after
first aligning the pointers to a word boundary. If src and
dest are not co-aligned, however, a byte-wise copy is
performed.

This patch fixes memmove and optimises memcpy for the
misaligned case. It first aligns the destination pointer to
a word boundary regardless of whether src and dest are
co-aligned. If they are co-aligned, a word-wise copy is then
performed. If they are not, it loads two adjacent words from
src and uses shifts to assemble a full machine word. Some
additional assembly-level micro-optimisation is also
performed to ensure more instructions can be compressed
(e.g. prefer a0 to t6).

In my testing this speeds up memcpy 4~5x when src and dest
are not co-aligned (which is quite common in networking),
and speeds up memmove 1000+x by avoiding trapping to firmware.

Signed-off-by: Gary Guo <[email protected]>
---
arch/riscv/lib/memcpy.S | 223 ++++++++++++++++++++++++---------------
arch/riscv/lib/memmove.S | 176 ++++++++++++++++++++----------
2 files changed, 257 insertions(+), 142 deletions(-)
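
For reference, the misaligned path is roughly equivalent to the C sketch
below (illustrative only, not part of the patch; the function name is
made up). The destination is already word-aligned, the source is off by
0 < ofs < SZREG, and each output word is assembled from two adjacent
aligned source words with a pair of shifts. Like the assembly, it may
read a few bytes beyond the last source byte, but never past the word
containing it.

#include <stddef.h>
#include <stdint.h>

/* Little-endian assumed, as on RISC-V. */
static void copy_words_misaligned(unsigned long *dest,
                                  const unsigned char *src,
                                  size_t nwords)
{
        size_t ofs = (uintptr_t)src & (sizeof(long) - 1);
        const unsigned long *s = (const unsigned long *)(src - ofs);
        unsigned int rshift = 8 * ofs;
        unsigned int lshift = 8 * (sizeof(long) - ofs);
        unsigned long cur = *s;         /* first aligned source word */

        while (nwords--) {
                unsigned long next = *++s;

                *dest++ = (cur >> rshift) | (next << lshift);
                cur = next;
        }
}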

diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
index 51ab716253fa..00672c19ad9b 100644
--- a/arch/riscv/lib/memcpy.S
+++ b/arch/riscv/lib/memcpy.S
@@ -9,100 +9,151 @@
/* void *memcpy(void *, const void *, size_t) */
ENTRY(__memcpy)
WEAK(memcpy)
- move t6, a0 /* Preserve return value */
+ /* Save for return value */
+ mv t6, a0

- /* Defer to byte-oriented copy for small sizes */
- sltiu a3, a2, 128
- bnez a3, 4f
- /* Use word-oriented copy only if low-order bits match */
- andi a3, t6, SZREG-1
- andi a4, a1, SZREG-1
- bne a3, a4, 4f
+ /*
+ * Register allocation for code below:
+ * a0 - start of uncopied dst
+ * a1 - start of uncopied src
+ * t0 - end of uncopied dst
+ */
+ add t0, a0, a2

- beqz a3, 2f /* Skip if already aligned */
/*
- * Round to nearest double word-aligned address
- * greater than or equal to start address
+ * Use bytewise copy if too small.
+ *
+ * This threshold must be at least 2*SZREG to ensure at least one
+ * wordwise copy is performed. It is chosen to be 16 because it will
+ * save at least 7 iterations of bytewise copy, which pays off the
+ * fixed overhead.
*/
- andi a3, a1, ~(SZREG-1)
- addi a3, a3, SZREG
- /* Handle initial misalignment */
- sub a4, a3, a1
+ li a3, 16
+ bltu a2, a3, .Lbyte_copy_tail
+
+ /*
+ * Bytewise copy first to align a0 to word boundary.
+ */
+ addi a2, a0, SZREG-1
+ andi a2, a2, ~(SZREG-1)
+ beq a0, a2, 2f
1:
- lb a5, 0(a1)
- addi a1, a1, 1
- sb a5, 0(t6)
- addi t6, t6, 1
- bltu a1, a3, 1b
- sub a2, a2, a4 /* Update count */
+ lb a5, 0(a1)
+ addi a1, a1, 1
+ sb a5, 0(a0)
+ addi a0, a0, 1
+ bne a0, a2, 1b
+2:
+
+ /*
+ * Now a0 is word-aligned. If a1 is also word aligned, we could perform
+ * aligned word-wise copy. Otherwise we need to perform misaligned
+ * word-wise copy.
+ */
+ andi a3, a1, SZREG-1
+ bnez a3, .Lmisaligned_word_copy

+ /* Unrolled wordwise copy */
+ addi t0, t0, -(16*SZREG-1)
+ bgeu a0, t0, 2f
+1:
+ REG_L a2, 0(a1)
+ REG_L a3, SZREG(a1)
+ REG_L a4, 2*SZREG(a1)
+ REG_L a5, 3*SZREG(a1)
+ REG_L a6, 4*SZREG(a1)
+ REG_L a7, 5*SZREG(a1)
+ REG_L t1, 6*SZREG(a1)
+ REG_L t2, 7*SZREG(a1)
+ REG_L t3, 8*SZREG(a1)
+ REG_L t4, 9*SZREG(a1)
+ REG_L t5, 10*SZREG(a1)
+ REG_S a2, 0(a0)
+ REG_S a3, SZREG(a0)
+ REG_S a4, 2*SZREG(a0)
+ REG_S a5, 3*SZREG(a0)
+ REG_S a6, 4*SZREG(a0)
+ REG_S a7, 5*SZREG(a0)
+ REG_S t1, 6*SZREG(a0)
+ REG_S t2, 7*SZREG(a0)
+ REG_S t3, 8*SZREG(a0)
+ REG_S t4, 9*SZREG(a0)
+ REG_S t5, 10*SZREG(a0)
+ REG_L a2, 11*SZREG(a1)
+ REG_L a3, 12*SZREG(a1)
+ REG_L a4, 13*SZREG(a1)
+ REG_L a5, 14*SZREG(a1)
+ REG_L a6, 15*SZREG(a1)
+ addi a1, a1, 16*SZREG
+ REG_S a2, 11*SZREG(a0)
+ REG_S a3, 12*SZREG(a0)
+ REG_S a4, 13*SZREG(a0)
+ REG_S a5, 14*SZREG(a0)
+ REG_S a6, 15*SZREG(a0)
+ addi a0, a0, 16*SZREG
+ bltu a0, t0, 1b
2:
- andi a4, a2, ~((16*SZREG)-1)
- beqz a4, 4f
- add a3, a1, a4
-3:
- REG_L a4, 0(a1)
- REG_L a5, SZREG(a1)
- REG_L a6, 2*SZREG(a1)
- REG_L a7, 3*SZREG(a1)
- REG_L t0, 4*SZREG(a1)
- REG_L t1, 5*SZREG(a1)
- REG_L t2, 6*SZREG(a1)
- REG_L t3, 7*SZREG(a1)
- REG_L t4, 8*SZREG(a1)
- REG_L t5, 9*SZREG(a1)
- REG_S a4, 0(t6)
- REG_S a5, SZREG(t6)
- REG_S a6, 2*SZREG(t6)
- REG_S a7, 3*SZREG(t6)
- REG_S t0, 4*SZREG(t6)
- REG_S t1, 5*SZREG(t6)
- REG_S t2, 6*SZREG(t6)
- REG_S t3, 7*SZREG(t6)
- REG_S t4, 8*SZREG(t6)
- REG_S t5, 9*SZREG(t6)
- REG_L a4, 10*SZREG(a1)
- REG_L a5, 11*SZREG(a1)
- REG_L a6, 12*SZREG(a1)
- REG_L a7, 13*SZREG(a1)
- REG_L t0, 14*SZREG(a1)
- REG_L t1, 15*SZREG(a1)
- addi a1, a1, 16*SZREG
- REG_S a4, 10*SZREG(t6)
- REG_S a5, 11*SZREG(t6)
- REG_S a6, 12*SZREG(t6)
- REG_S a7, 13*SZREG(t6)
- REG_S t0, 14*SZREG(t6)
- REG_S t1, 15*SZREG(t6)
- addi t6, t6, 16*SZREG
- bltu a1, a3, 3b
- andi a2, a2, (16*SZREG)-1 /* Update count */
-
-4:
- /* Handle trailing misalignment */
- beqz a2, 6f
- add a3, a1, a2
-
- /* Use word-oriented copy if co-aligned to word boundary */
- or a5, a1, t6
- or a5, a5, a3
- andi a5, a5, 3
- bnez a5, 5f
-7:
- lw a4, 0(a1)
- addi a1, a1, 4
- sw a4, 0(t6)
- addi t6, t6, 4
- bltu a1, a3, 7b
+ /* Post-loop increment by 16*SZREG-1 and pre-loop decrement by SZREG-1 */
+ addi t0, t0, 15*SZREG

- ret
+ /* Wordwise copy */
+ bgeu a0, t0, 2f
+1:
+ REG_L a5, 0(a1)
+ addi a1, a1, SZREG
+ REG_S a5, 0(a0)
+ addi a0, a0, SZREG
+ bltu a0, t0, 1b
+2:
+ addi t0, t0, SZREG-1

-5:
- lb a4, 0(a1)
- addi a1, a1, 1
- sb a4, 0(t6)
- addi t6, t6, 1
- bltu a1, a3, 5b
-6:
+.Lbyte_copy_tail:
+ /*
+ * Bytewise copy anything left.
+ */
+ beq a0, t0, 2f
+1:
+ lb a5, 0(a1)
+ addi a1, a1, 1
+ sb a5, 0(a0)
+ addi a0, a0, 1
+ bne a0, t0, 1b
+2:
+
+ mv a0, t6
ret
+
+.Lmisaligned_word_copy:
+ /*
+ * Misaligned word-wise copy.
+ * For misaligned copy we still perform word-wise copy, but we need to
+ * use the value fetched from the previous iteration and do some shifts.
+ * This is safe because we wouldn't access more words than necessary.
+ */
+
+ /* Calculate shifts */
+ slli t3, a3, 3
+ sub t4, x0, t3 /* negate is okay as shift will only look at LSBs */
+
+ /* Load the initial value and align a1 */
+ andi a1, a1, ~(SZREG-1)
+ REG_L a5, 0(a1)
+
+ addi t0, t0, -(SZREG-1)
+ /* At least one iteration will be executed here, no check */
+1:
+ srl a4, a5, t3
+ REG_L a5, SZREG(a1)
+ addi a1, a1, SZREG
+ sll a2, a5, t4
+ or a2, a2, a4
+ REG_S a2, 0(a0)
+ addi a0, a0, SZREG
+ bltu a0, t0, 1b
+
+ /* Update pointers to correct value */
+ addi t0, t0, SZREG-1
+ add a1, a1, a3
+
+ j .Lbyte_copy_tail
END(__memcpy)
diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S
index 07d1d2152ba5..fbe6701dbe4a 100644
--- a/arch/riscv/lib/memmove.S
+++ b/arch/riscv/lib/memmove.S
@@ -5,60 +5,124 @@

ENTRY(__memmove)
WEAK(memmove)
- move t0, a0
- move t1, a1
-
- beq a0, a1, exit_memcpy
- beqz a2, exit_memcpy
- srli t2, a2, 0x2
-
- slt t3, a0, a1
- beqz t3, do_reverse
-
- andi a2, a2, 0x3
- li t4, 1
- beqz t2, byte_copy
-
-word_copy:
- lw t3, 0(a1)
- addi t2, t2, -1
- addi a1, a1, 4
- sw t3, 0(a0)
- addi a0, a0, 4
- bnez t2, word_copy
- beqz a2, exit_memcpy
- j byte_copy
-
-do_reverse:
- add a0, a0, a2
- add a1, a1, a2
- andi a2, a2, 0x3
- li t4, -1
- beqz t2, reverse_byte_copy
-
-reverse_word_copy:
- addi a1, a1, -4
- addi t2, t2, -1
- lw t3, 0(a1)
- addi a0, a0, -4
- sw t3, 0(a0)
- bnez t2, reverse_word_copy
- beqz a2, exit_memcpy
-
-reverse_byte_copy:
- addi a0, a0, -1
- addi a1, a1, -1
-
-byte_copy:
- lb t3, 0(a1)
- addi a2, a2, -1
- sb t3, 0(a0)
- add a1, a1, t4
- add a0, a0, t4
- bnez a2, byte_copy
-
-exit_memcpy:
- move a0, t0
- move a1, t1
- ret
+ /*
+ * Here we determine if forward copy is possible. Forward copy is
+ * preferred to backward copy as it is more cache friendly.
+ *
+ * If a0 >= a1, t0 gives their distance, if t0 >= a2 then we can
+ * copy forward.
+ * If a0 < a1, we can always copy forward. This will make t0 negative,
+ * so a *unsigned* comparison will always have t0 >= a2.
+ *
+ * For forward copy we just delegate the task to memcpy.
+ */
+ sub t0, a0, a1
+ bltu t0, a2, 1f
+ tail __memcpy
+1:
+
+ /*
+ * Register allocation for code below:
+ * a0 - end of uncopied dst
+ * a1 - end of uncopied src
+ * t0 - start of uncopied dst
+ */
+ mv t0, a0
+ add a0, a0, a2
+ add a1, a1, a2
+
+ /*
+ * Use bytewise copy if too small.
+ *
+ * This threshold must be at least 2*SZREG to ensure at least one
+ * wordwise copy is performed. It is chosen to be 16 because it will
+ * save at least 7 iterations of bytewise copy, which pays off the
+ * fixed overhead.
+ */
+ li a3, 16
+ bltu a2, a3, .Lbyte_copy_tail
+
+ /*
+ * Bytewise copy first to align t0 to word boundary.
+ */
+ andi a2, a0, ~(SZREG-1)
+ beq a0, a2, 2f
+1:
+ addi a1, a1, -1
+ lb a5, 0(a1)
+ addi a0, a0, -1
+ sb a5, 0(a0)
+ bne a0, a2, 1b
+2:
+
+ /*
+ * Now a0 is word-aligned. If a1 is also word aligned, we could perform
+ * aligned word-wise copy. Otherwise we need to perform misaligned
+ * word-wise copy.
+ */
+ andi a3, a1, SZREG-1
+ bnez a3, .Lmisaligned_word_copy
+
+ /* Wordwise copy */
+ addi t0, t0, SZREG-1
+ bleu a0, t0, 2f
+1:
+ addi a1, a1, -SZREG
+ REG_L a5, 0(a1)
+ addi a0, a0, -SZREG
+ REG_S a5, 0(a0)
+ bgtu a0, t0, 1b
+2:
+ addi t0, t0, -(SZREG-1)
+
+.Lbyte_copy_tail:
+ /*
+ * Bytewise copy anything left.
+ */
+ beq a0, t0, 2f
+1:
+ addi a1, a1, -1
+ lb a5, 0(a1)
+ addi a0, a0, -1
+ sb a5, 0(a0)
+ bne a0, t0, 1b
+2:
+
+ mv a0, t0
+ ret
+
+.Lmisaligned_word_copy:
+ /*
+ * Misaligned word-wise copy.
+ * For misaligned copy we still perform word-wise copy, but we need to
+ * use the value fetched from the previous iteration and do some shifts.
+ * This is safe because we wouldn't access more words than necessary.
+ */
+
+ /* Calculate shifts */
+ slli t3, a3, 3
+ sub t4, x0, t3 /* negate is okay as shift will only look at LSBs */
+
+ /* Load the initial value and align a1 */
+ andi a1, a1, ~(SZREG-1)
+ REG_L a5, 0(a1)
+
+ addi t0, t0, SZREG-1
+ /* At least one iteration will be executed here, no check */
+1:
+ sll a4, a5, t4
+ addi a1, a1, -SZREG
+ REG_L a5, 0(a1)
+ srl a2, a5, t3
+ or a2, a2, a4
+ addi a0, a0, -SZREG
+ REG_S a2, 0(a0)
+ bgtu a0, t0, 1b
+
+ /* Update pointers to correct value */
+ addi t0, t0, -(SZREG-1)
+ add a1, a1, a3
+
+ j .Lbyte_copy_tail
+
END(__memmove)
--
2.20.1


2021-05-13 08:59:39

by Bin Meng

Subject: Re: [PATCH] riscv: fix memmove and optimise memcpy when misalign

On Wed, Feb 17, 2021 at 7:00 AM Gary Guo <[email protected]> wrote:

Looks like this patch remains unapplied.

This patch fixed a boot failure of U-Boot SPL on the SiFive Unleashed
board, built from the latest U-Boot sources, which recently imported
the assembly version of the mem* routines from the Linux kernel.
The misaligned load happens in the original memmove() implementation,
which does not handle alignment correctly. With this patch, the U-Boot
SPL boots again.

Tested-by: Bin Meng <[email protected]>

Regards,
Bin

2021-05-22 22:26:48

by Gary Guo

Subject: Re: [PATCH] riscv: fix memmove and optimise memcpy when misalign

On Tue, 16 Feb 2021 22:55:51 +0000
Gary Guo <[email protected]> wrote:

Ping. It's been 3 months since submission and I would really like to
see this applied, or at least get some feedback.

Link to the original patch in archive:
https://lore.kernel.org/linux-riscv/[email protected]/

- Gary

2021-05-23 01:55:10

by Palmer Dabbelt

Subject: Re: [PATCH] riscv: fix memmove and optimise memcpy when misalign

On Sat, 22 May 2021 15:22:56 PDT (-0700), [email protected] wrote:
> Ping. It's been 3 months since submission and I would really like to
> see this applied, or at least get some feedback.
>
> Link to the original patch in archive:
> https://lore.kernel.org/linux-riscv/[email protected]/

Sorry, I thought I replied to this one but it must have gotten lost
somewhere.

IMO the right way to go here is to just move to C-based string routines,
at least until we get to the point where we're seriously optimizing for
specific processors. We went with the C-based string routines in glibc
as part of the upstreaming process and found only some small performance
differences when compared to the hand-written assembly, and they're way
easier to maintain.

IIRC Linux only has trivial C string routines in lib; I think the best
way to go about this would be to add higher-performance versions in
there. That will allow other ports to use them.
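
For reference, the generic fallback is roughly a trivial byte loop along
these lines (from memory, so treat it as a sketch); a higher-performance
generic version would mostly mean adding a word-at-a-time path for the
co-aligned case:

#include <stddef.h>

void *memcpy(void *dest, const void *src, size_t count)
{
        char *tmp = dest;
        const char *s = src;

        while (count--)
                *tmp++ = *s++;
        return dest;
}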

2021-05-23 17:14:37

by David Laight

Subject: RE: [PATCH] riscv: fix memmove and optimise memcpy when misalign

From: Palmer Dabbelt
> Sent: 23 May 2021 02:47
...
> IMO the right way to go here is to just move to C-based string routines,
> at least until we get to the point where we're seriously optimizing for
> specific processors. We went with the C-based string routines in glibc
> as part of the upstreaming process and found only some small performance
> differences when compared to the hand-written assembly, and they're way
> easier to maintain.
>
> IIRC Linux only has trivial C string routines in lib; I think the best
> way to go about this would be to add higher-performance versions in
> there. That will allow other ports to use them.

I certainly wonder how much benefit these massively unrolled
loops have on modern superscalar processors - especially those
with any form of 'out of order' execution.

It is often easy to write assembler where all the loop
control instructions happen in parallel with the memory
accesses - which cannot be avoided.
Loop unrolling is so 1970s.

Sometimes you need to unroll once.
And maybe interleave the loads and stores.
But after that you can just be trashing the i-cache.
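
Something like this (untested sketch, aligned case only) is usually as
far as it is worth going:

#include <stddef.h>

/* Unroll once and interleave: issue both loads before the stores. */
static void copy_unroll2(unsigned long *d, const unsigned long *s, size_t n)
{
        while (n >= 2 * sizeof(long)) {
                unsigned long w0 = s[0];
                unsigned long w1 = s[1];

                d[0] = w0;
                d[1] = w1;
                s += 2;
                d += 2;
                n -= 2 * sizeof(long);
        }
        /* odd word and tail bytes handled elsewhere */
}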

David


2021-05-25 16:33:20

by Gary Guo

Subject: Re: [PATCH] riscv: fix memmove and optimise memcpy when misalign

On Sun, 23 May 2021 17:12:23 +0000
David Laight <[email protected]> wrote:

> From: Palmer Dabbelt
> > Sent: 23 May 2021 02:47
> ...
> > IMO the right way to go here is to just move to C-based string
> > routines, at least until we get to the point where we're seriously
> > optimizing for specific processors. We went with the C-based
> > string routines in glibc as part of the upstreaming process and
> > found only some small performance differences when compared to the
> > hand-written assembly, and they're way easier to maintain.

I prefer C versions as well, and indeed before commit 04091d6 we were
using the generic C version. The issue is that 04091d6 introduced an
assembly version that's very broken. It does not offer any performance
improvement over the C version, and it breaks all processors without
hardware misalignment support (yes, firmware is expected to trap and
emulate these, but that is painfully slow).
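
To be concrete: the guard for taking the word-wise fast path has to look
at both pointers as well as the length, roughly like the sketch below
(the function name is just for illustration):

#include <stddef.h>
#include <stdint.h>

/* The aligned fast path is only safe when dest, src and the length are
 * all multiples of the register size. */
static int can_copy_wordwise(const void *dest, const void *src, size_t count)
{
        return (((uintptr_t)dest | (uintptr_t)src | count)
                & (sizeof(long) - 1)) == 0;
}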

I noticed the issue because I ran Linux on my own firmware and found
that the kernel couldn't boot. I hadn't implemented misalignment
emulation at that time (the firmware just forwarded the trap to the
supervisor).

Because 04091d6 was accepted, my assumption is that we need an assembly
version. So I spent some time writing, testing and optimising the
assembly.

> >
> > IIRC Linux only has trivial C string routines in lib; I think the
> > best way to go about this would be to add higher-performance versions
> > in there. That will allow other ports to use them.
>
> I certainly wonder how much benefit these massively unrolled
> loops have on modern superscalar processors - especially those
> with any form of 'out of order' execution.
>
> It is often easy to write assembler where all the loop
> control instructions happen in parallel with the memory
> accesses - which cannot be avoided.
> Loop unrolling is so 1970s.
>
> Sometimes you need to unroll once.
> And maybe interleave the loads and stores.
> But after that you can just be trashing the i-cache.

I didn't introduce the loop unrolling though. The loop-unrolled
assembly was there before this patch, and I didn't even change the
unroll factor. I only added a path to handle the misaligned case.

The diff is large because I did make some changes to the register
allocation so that the code is more optimal. I also made a few
cleanups and added a few comments. It might be easier to review if you
apply the patch locally and just look at the file.

- Gary

2021-06-15 13:43:33

by Bin Meng

Subject: Re: [PATCH] riscv: fix memmove and optimise memcpy when misalign

On Tue, May 25, 2021 at 10:37 PM Gary Guo <[email protected]> wrote:
>
> On Sun, 23 May 2021 17:12:23 +0000
> David Laight <[email protected]> wrote:
>
> > From: Palmer Dabbelt
> > > Sent: 23 May 2021 02:47
> > ...
> > > IMO the right way to go here is to just move to C-based string
> > > routines, at least until we get to the point where we're seriously
> > > optimizing for specific processors. We went with the C-based
> > > string routines in glibc as part of the upstreaming process and
> > > found only some small performance differences when compared to the
> > > hand-written assembly, and they're way easier to maintain.
>
> I prefer C versions as well, and indeed before commit 04091d6 we were
> using the generic C version. The issue is that 04091d6 introduced an
> assembly version that's very broken. It does not offer any performance
> improvement over the C version, and it breaks all processors without
> hardware misalignment support (yes, firmware is expected to trap and
> emulate these, but that is painfully slow).
>
> I noticed the issue because I ran Linux on my own firmware and found
> that the kernel couldn't boot. I hadn't implemented misalignment
> emulation at that time (the firmware just forwarded the trap to the
> supervisor).

Similarly, I noticed exactly the same issue as you did: mainline U-Boot
SPL suddenly stopped booting on the SiFive Unleashed. Bisecting leads
to the commit that brought the kernel's buggy assembly mem* version
into U-Boot.

> Because 04091d6 was accepted, my assumption is that we need an assembly
> version.

That's my understanding as well. I thought kernel folks prefer
hand-written assembly over pure C :)

> So I spent some time writing, testing and optimising the assembly.

Thank you for the fix!

>
> > >
> > > IIRC Linux only has trivial C string routines in lib; I think the
> > > best way to go about this would be to add higher-performance
> > > versions in there. That will allow other ports to use them.
> >
> > I certainly wonder how much benefit these massively unrolled
> > loops have on modern superscalar processors - especially those
> > with any form of 'out of order' execution.
> >
> > It is often easy to write assembler where all the loop
> > control instructions happen in parallel with the memory
> > accesses - which cannot be avoided.
> > Loop unrolling is so 1970s.
> >
> > Sometimes you need to unroll once.
> > And maybe interleave the loads and stores.
> > But after that you can just be trashing the i-cache.
>
> I didn't introduce the loop unrolling though. The loop-unrolled
> assembly was there before this patch, and I didn't even change the
> unroll factor. I only added a path to handle the misaligned case.
>
> The diff is large because I did make some changes to the register
> allocation so that the code is more optimal. I also made a few
> cleanups and added a few comments. It might be easier to review if you
> apply the patch locally and just look at the file.

I would vote to apply this patch, unless we wanted to do "git revert
04091d6", but that would break KASAN I guess?

Regards,
Bin

2021-06-15 14:10:35

by David Laight

Subject: RE: [PATCH] riscv: fix memmove and optimise memcpy when misalign

From: Bin Meng
> Sent: 15 June 2021 14:40
...
> > I prefer C versions as well, and indeed before commit 04091d6 we were
> > using the generic C version. The issue is that 04091d6 introduced an
> > assembly version that's very broken. It does not offer any performance
> > improvement over the C version, and it breaks all processors without
> > hardware misalignment support

There may need to be a few C implementations for different CPU
instruction sets.
While the compiler might manage to DTRT (or do the wrong thing given
the right source), using a loop that matches the instruction set
is a good idea.

For instance, x86 can do *(reg_1 + reg_2 * (1|2|4|8) + constant)
so you can increment reg_2 and use it for both buffers while
still unrolling enough to hit memory bandwidth.

With only *(reg_1 + constant) you need to increment both the
source and destination addresses.

OTOH you can save an instruction on x86 by adding to 'reg_2'
until it becomes zero (so you don't need add, cmp and jmp).
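
In C that trick looks something like this (untested sketch):

#include <stddef.h>

/* One index register, biased so the loop ends when the index hits zero. */
static void copy_bytes(char *d, const char *s, size_t n)
{
        ptrdiff_t i = -(ptrdiff_t)n;

        d += n;
        s += n;
        while (i != 0) {
                d[i] = s[i];
                i++;
        }
}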

But a mips-like instruction set (includes riscv and nios2)
has 'compare and branch' so you only ever need one instruction
at the end of the loop.

Having to handle misaligned copies is another distinct issue.
For some 32-bit CPUs byte copies may be as fast as any
shift-and-mask code.

> > (yes, firmware is expected to
> > trap and handle these, but they are painfully slow).

Yes, to the point where the system should just panic and
force you to fix the code.

When I were a lad we forced everyone to fix their code
so it would run on sparc.

David
