LinuxLists.cc - [PATCH v3 0/1] riscv: improving uaccess with logs from network bench

2021-06-23 12:38:46

Subject: [PATCH v3 0/1] riscv: improving uaccess with logs from network bench

Optimizing copy_to_user and copy_from_user.

I rewrote the functions but heavily influenced by Garry's memcpy
function [1]. It must be written in assembler to handle page faults
manually inside the function unlike other memcpy functions.

This patch will reduce cpu usage dramatically in kernel space especially
for applications which use sys-call with large buffer size, such as network
applications. The main reason behind this is that every unaligned memory
access will raise exceptions and switch between s-mode and m-mode causing
large overhead.

The motivation to create the patch was to improve network performance on
beaglev beta board. By observing with perf, the memcpy and
__asm_copy_to_user had heavy cpu usage and the network speed was limited
at around 680Mbps on 1Gbps lan. Matteo is creating the patches to improve
memcpy in C and this patch is meant to be used with them.

Typical network applications use system calls with a large buffer on
send/recv() and sendto/recvfrom() for the optimization.

The bench result, when patching only copy_user. The memcpy is without
Matteo's patches but listing the both since they are the top two largest
overhead.

All results are from the same base kernel, same rootfs and same BeagleV
beta board.

Results of iperf3 have speedup on UDP with the copy_user patch alone.

--- UDP send ---
306 Mbits/sec 362 Mbits/sec
305 Mbits/sec 362 Mbits/sec

--- UDP recv ---
772 Mbits/sec 787 Mbits/sec
773 Mbits/sec 784 Mbits/sec

Comparison by "perf top -Ue task-clock" while running iperf3.

--- TCP recv ---
* Before
40.40% [kernel] [k] memcpy
33.09% [kernel] [k] __asm_copy_to_user
* With patch
50.35% [kernel] [k] memcpy
13.76% [kernel] [k] __asm_copy_to_user

--- TCP send ---
* Before
19.96% [kernel] [k] memcpy
9.84% [kernel] [k] __asm_copy_to_user
* With patch
14.27% [kernel] [k] memcpy
7.37% [kernel] [k] __asm_copy_to_user

--- UDP recv ---
* Before
44.45% [kernel] [k] memcpy
31.04% [kernel] [k] __asm_copy_to_user
* With patch
55.62% [kernel] [k] memcpy
11.22% [kernel] [k] __asm_copy_to_user

--- UDP send ---
* Before
25.18% [kernel] [k] memcpy
22.50% [kernel] [k] __asm_copy_to_user
* With patch
28.90% [kernel] [k] memcpy
9.49% [kernel] [k] __asm_copy_to_user

---
v2 -> v3:
- Merged all patches

v1 -> v2:
- Added shift copy
- Separated patches for readability of changes in assembler
- Using perf results

[1] https://lkml.org/lkml/2021/2/16/778

Akira Tsukamoto (1):
riscv: __asm_copy_to-from_user: Optimize unaligned memory access and
pipeline stall

arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
1 file changed, 146 insertions(+), 35 deletions(-)

--
2.17.1

2021-06-23 12:42:37

by Akira Tsukamoto

[permalink] [raw]

Subject: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall

This patch will reduce cpu usage dramatically in kernel space especially
for application which use sys-call with large buffer size, such as network
applications. The main reason behind this is that every unaligned memory
access will raise exceptions and switch between s-mode and m-mode causing
large overhead.

First copy in bytes until reaches the first word aligned boundary in
destination memory address. This is the preparation before the bulk
aligned word copy.

The destination address is aligned now, but oftentimes the source address
is not in an aligned boundary. To reduce the unaligned memory access, it
reads the data from source in aligned boundaries, which will cause the
data to have an offset, and then combines the data in the next iteration
by fixing offset with shifting before writing to destination. The majority
of the improving copy speed comes from this shift copy.

In the lucky situation that the both source and destination address are on
the aligned boundary, perform load and store with register size to copy the
data. Without the unrolling, it will reduce the speed since the next store
instruction for the same register using from the load will stall the
pipeline.

At last, copying the remainder in one byte at a time.

Signed-off-by: Akira Tsukamoto <[email protected]>
---
arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
1 file changed, 146 insertions(+), 35 deletions(-)

diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
index fceaeb18cc64..bceb0629e440 100644
--- a/arch/riscv/lib/uaccess.S
+++ b/arch/riscv/lib/uaccess.S
@@ -19,50 +19,161 @@ ENTRY(__asm_copy_from_user)
li t6, SR_SUM
csrs CSR_STATUS, t6

- add a3, a1, a2
- /* Use word-oriented copy only if low-order bits match */
- andi t0, a0, SZREG-1
- andi t1, a1, SZREG-1
- bne t0, t1, 2f
+ /* Save for return value */
+ mv t5, a2

- addi t0, a1, SZREG-1
- andi t1, a3, ~(SZREG-1)
- andi t0, t0, ~(SZREG-1)
/*
- * a3: terminal address of source region
- * t0: lowest XLEN-aligned address in source
- * t1: highest XLEN-aligned address in source
+ * Register allocation for code below:
+ * a0 - start of uncopied dst
+ * a1 - start of uncopied src
+ * a2 - size
+ * t0 - end of uncopied dst
*/
- bgeu t0, t1, 2f
- bltu a1, t0, 4f
+ add t0, a0, a2
+ bgtu a0, t0, 5f
+
+ /*
+ * Use byte copy only if too small.
+ */
+ li a3, 8*SZREG /* size must be larger than size in word_copy */
+ bltu a2, a3, .Lbyte_copy_tail
+
+ /*
+ * Copy first bytes until dst is align to word boundary.
+ * a0 - start of dst
+ * t1 - start of aligned dst
+ */
+ addi t1, a0, SZREG-1
+ andi t1, t1, ~(SZREG-1)
+ /* dst is already aligned, skip */
+ beq a0, t1, .Lskip_first_bytes
1:
- fixup REG_L, t2, (a1), 10f
- fixup REG_S, t2, (a0), 10f
- addi a1, a1, SZREG
- addi a0, a0, SZREG
- bltu a1, t1, 1b
+ /* a5 - one byte for copying data */
+ fixup lb a5, 0(a1), 10f
+ addi a1, a1, 1 /* src */
+ fixup sb a5, 0(a0), 10f
+ addi a0, a0, 1 /* dst */
+ bltu a0, t1, 1b /* t1 - start of aligned dst */
+
+.Lskip_first_bytes:
+ /*
+ * Now dst is aligned.
+ * Use shift-copy if src is misaligned.
+ * Use word-copy if both src and dst are aligned because
+ * can not use shift-copy which do not require shifting
+ */
+ /* a1 - start of src */
+ andi a3, a1, SZREG-1
+ bnez a3, .Lshift_copy
+
+.Lword_copy:
+ /*
+ * Both src and dst are aligned, unrolled word copy
+ *
+ * a0 - start of aligned dst
+ * a1 - start of aligned src
+ * a3 - a1 & mask:(SZREG-1)
+ * t0 - end of aligned dst
+ */
+ addi t0, t0, -(8*SZREG-1) /* not to over run */
2:
- bltu a1, a3, 5f
+ fixup REG_L a4, 0(a1), 10f
+ fixup REG_L a5, SZREG(a1), 10f
+ fixup REG_L a6, 2*SZREG(a1), 10f
+ fixup REG_L a7, 3*SZREG(a1), 10f
+ fixup REG_L t1, 4*SZREG(a1), 10f
+ fixup REG_L t2, 5*SZREG(a1), 10f
+ fixup REG_L t3, 6*SZREG(a1), 10f
+ fixup REG_L t4, 7*SZREG(a1), 10f
+ fixup REG_S a4, 0(a0), 10f
+ fixup REG_S a5, SZREG(a0), 10f
+ fixup REG_S a6, 2*SZREG(a0), 10f
+ fixup REG_S a7, 3*SZREG(a0), 10f
+ fixup REG_S t1, 4*SZREG(a0), 10f
+ fixup REG_S t2, 5*SZREG(a0), 10f
+ fixup REG_S t3, 6*SZREG(a0), 10f
+ fixup REG_S t4, 7*SZREG(a0), 10f
+ addi a0, a0, 8*SZREG
+ addi a1, a1, 8*SZREG
+ bltu a0, t0, 2b
+
+ addi t0, t0, 8*SZREG-1 /* revert to original value */
+ j .Lbyte_copy_tail
+
+.Lshift_copy:
+
+ /*
+ * Word copy with shifting.
+ * For misaligned copy we still perform aligned word copy, but
+ * we need to use the value fetched from the previous iteration and
+ * do some shifts.
+ * This is safe because reading less than a word size.
+ *
+ * a0 - start of aligned dst
+ * a1 - start of src
+ * a3 - a1 & mask:(SZREG-1)
+ * t0 - end of uncopied dst
+ * t1 - end of aligned dst
+ */
+ /* calculating aligned word boundary for dst */
+ andi t1, t0, ~(SZREG-1)
+ /* Converting unaligned src to aligned arc */
+ andi a1, a1, ~(SZREG-1)
+
+ /*
+ * Calculate shifts
+ * t3 - prev shift
+ * t4 - current shift
+ */
+ slli t3, a3, LGREG
+ li a5, SZREG*8
+ sub t4, a5, t3
+
+ /* Load the first word to combine with seceond word */
+ fixup REG_L a5, 0(a1), 10f

3:
+ /* Main shifting copy
+ *
+ * a0 - start of aligned dst
+ * a1 - start of aligned src
+ * t1 - end of aligned dst
+ */
+
+ /* At least one iteration will be executed */
+ srl a4, a5, t3
+ fixup REG_L a5, SZREG(a1), 10f
+ addi a1, a1, SZREG
+ sll a2, a5, t4
+ or a2, a2, a4
+ fixup REG_S a2, 0(a0), 10f
+ addi a0, a0, SZREG
+ bltu a0, t1, 3b
+
+ /* Revert src to original unaligned value */
+ add a1, a1, a3
+
+.Lbyte_copy_tail:
+ /*
+ * Byte copy anything left.
+ *
+ * a0 - start of remaining dst
+ * a1 - start of remaining src
+ * t0 - end of remaining dst
+ */
+ bgeu a0, t0, 5f
+4:
+ fixup lb a5, 0(a1), 10f
+ addi a1, a1, 1 /* src */
+ fixup sb a5, 0(a0), 10f
+ addi a0, a0, 1 /* dst */
+ bltu a0, t0, 4b /* t0 - end of dst */
+
+5:
/* Disable access to user memory */
csrc CSR_STATUS, t6
- li a0, 0
+ li a0, 0
ret
-4: /* Edge case: unalignment */
- fixup lbu, t2, (a1), 10f
- fixup sb, t2, (a0), 10f
- addi a1, a1, 1
- addi a0, a0, 1
- bltu a1, t0, 4b
- j 1b
-5: /* Edge case: remainder */
- fixup lbu, t2, (a1), 10f
- fixup sb, t2, (a0), 10f
- addi a1, a1, 1
- addi a0, a0, 1
- bltu a1, a3, 5b
- j 3b
ENDPROC(__asm_copy_to_user)
ENDPROC(__asm_copy_from_user)
EXPORT_SYMBOL(__asm_copy_to_user)
@@ -117,7 +228,7 @@ EXPORT_SYMBOL(__clear_user)
10:
/* Disable access to user memory */
csrs CSR_STATUS, t6
- mv a0, a2
+ mv a0, t5
ret
11:
csrs CSR_STATUS, t6
--
2.17.1

2021-07-06 23:17:39

by Palmer Dabbelt

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall

On Wed, 23 Jun 2021 05:40:39 PDT (-0700), [email protected] wrote:
>
> This patch will reduce cpu usage dramatically in kernel space especially
> for application which use sys-call with large buffer size, such as network
> applications. The main reason behind this is that every unaligned memory
> access will raise exceptions and switch between s-mode and m-mode causing
> large overhead.
>
> First copy in bytes until reaches the first word aligned boundary in
> destination memory address. This is the preparation before the bulk
> aligned word copy.
>
> The destination address is aligned now, but oftentimes the source address
> is not in an aligned boundary. To reduce the unaligned memory access, it
> reads the data from source in aligned boundaries, which will cause the
> data to have an offset, and then combines the data in the next iteration
> by fixing offset with shifting before writing to destination. The majority
> of the improving copy speed comes from this shift copy.
>
> In the lucky situation that the both source and destination address are on
> the aligned boundary, perform load and store with register size to copy the
> data. Without the unrolling, it will reduce the speed since the next store
> instruction for the same register using from the load will stall the
> pipeline.
>
> At last, copying the remainder in one byte at a time.
>
> Signed-off-by: Akira Tsukamoto <[email protected]>
> ---
> arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
> 1 file changed, 146 insertions(+), 35 deletions(-)
>
> diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
> index fceaeb18cc64..bceb0629e440 100644
> --- a/arch/riscv/lib/uaccess.S
> +++ b/arch/riscv/lib/uaccess.S
> @@ -19,50 +19,161 @@ ENTRY(__asm_copy_from_user)
> li t6, SR_SUM
> csrs CSR_STATUS, t6
>
> - add a3, a1, a2
> - /* Use word-oriented copy only if low-order bits match */
> - andi t0, a0, SZREG-1
> - andi t1, a1, SZREG-1
> - bne t0, t1, 2f
> + /* Save for return value */
> + mv t5, a2
>
> - addi t0, a1, SZREG-1
> - andi t1, a3, ~(SZREG-1)
> - andi t0, t0, ~(SZREG-1)
> /*
> - * a3: terminal address of source region
> - * t0: lowest XLEN-aligned address in source
> - * t1: highest XLEN-aligned address in source
> + * Register allocation for code below:
> + * a0 - start of uncopied dst
> + * a1 - start of uncopied src
> + * a2 - size
> + * t0 - end of uncopied dst
> */
> - bgeu t0, t1, 2f
> - bltu a1, t0, 4f
> + add t0, a0, a2
> + bgtu a0, t0, 5f
> +
> + /*
> + * Use byte copy only if too small.
> + */
> + li a3, 8*SZREG /* size must be larger than size in word_copy */
> + bltu a2, a3, .Lbyte_copy_tail
> +
> + /*
> + * Copy first bytes until dst is align to word boundary.
> + * a0 - start of dst
> + * t1 - start of aligned dst
> + */
> + addi t1, a0, SZREG-1
> + andi t1, t1, ~(SZREG-1)
> + /* dst is already aligned, skip */
> + beq a0, t1, .Lskip_first_bytes
> 1:
> - fixup REG_L, t2, (a1), 10f
> - fixup REG_S, t2, (a0), 10f
> - addi a1, a1, SZREG
> - addi a0, a0, SZREG
> - bltu a1, t1, 1b
> + /* a5 - one byte for copying data */
> + fixup lb a5, 0(a1), 10f
> + addi a1, a1, 1 /* src */
> + fixup sb a5, 0(a0), 10f
> + addi a0, a0, 1 /* dst */
> + bltu a0, t1, 1b /* t1 - start of aligned dst */
> +
> +.Lskip_first_bytes:
> + /*
> + * Now dst is aligned.
> + * Use shift-copy if src is misaligned.
> + * Use word-copy if both src and dst are aligned because
> + * can not use shift-copy which do not require shifting
> + */
> + /* a1 - start of src */
> + andi a3, a1, SZREG-1
> + bnez a3, .Lshift_copy
> +
> +.Lword_copy:
> + /*
> + * Both src and dst are aligned, unrolled word copy
> + *
> + * a0 - start of aligned dst
> + * a1 - start of aligned src
> + * a3 - a1 & mask:(SZREG-1)
> + * t0 - end of aligned dst
> + */
> + addi t0, t0, -(8*SZREG-1) /* not to over run */
> 2:
> - bltu a1, a3, 5f
> + fixup REG_L a4, 0(a1), 10f
> + fixup REG_L a5, SZREG(a1), 10f
> + fixup REG_L a6, 2*SZREG(a1), 10f
> + fixup REG_L a7, 3*SZREG(a1), 10f
> + fixup REG_L t1, 4*SZREG(a1), 10f
> + fixup REG_L t2, 5*SZREG(a1), 10f
> + fixup REG_L t3, 6*SZREG(a1), 10f
> + fixup REG_L t4, 7*SZREG(a1), 10f
> + fixup REG_S a4, 0(a0), 10f
> + fixup REG_S a5, SZREG(a0), 10f
> + fixup REG_S a6, 2*SZREG(a0), 10f
> + fixup REG_S a7, 3*SZREG(a0), 10f
> + fixup REG_S t1, 4*SZREG(a0), 10f
> + fixup REG_S t2, 5*SZREG(a0), 10f
> + fixup REG_S t3, 6*SZREG(a0), 10f
> + fixup REG_S t4, 7*SZREG(a0), 10f

This seems like a suspiciously large unrolling factor, at least without
a fallback. My guess is that some workloads will want some smaller
unrolling factors, but given that we run on these single-issue in-order
processors it's probably best to have some big unrolling factors as well
since they're pretty limited WRT integer bandwidth.

> + addi a0, a0, 8*SZREG
> + addi a1, a1, 8*SZREG
> + bltu a0, t0, 2b
> +
> + addi t0, t0, 8*SZREG-1 /* revert to original value */
> + j .Lbyte_copy_tail
> +
> +.Lshift_copy:
> +
> + /*
> + * Word copy with shifting.
> + * For misaligned copy we still perform aligned word copy, but
> + * we need to use the value fetched from the previous iteration and
> + * do some shifts.
> + * This is safe because reading less than a word size.
> + *
> + * a0 - start of aligned dst
> + * a1 - start of src
> + * a3 - a1 & mask:(SZREG-1)
> + * t0 - end of uncopied dst
> + * t1 - end of aligned dst
> + */
> + /* calculating aligned word boundary for dst */
> + andi t1, t0, ~(SZREG-1)
> + /* Converting unaligned src to aligned arc */
> + andi a1, a1, ~(SZREG-1)
> +
> + /*
> + * Calculate shifts
> + * t3 - prev shift
> + * t4 - current shift
> + */
> + slli t3, a3, LGREG
> + li a5, SZREG*8
> + sub t4, a5, t3
> +
> + /* Load the first word to combine with seceond word */
> + fixup REG_L a5, 0(a1), 10f
>
> 3:
> + /* Main shifting copy
> + *
> + * a0 - start of aligned dst
> + * a1 - start of aligned src
> + * t1 - end of aligned dst
> + */
> +
> + /* At least one iteration will be executed */
> + srl a4, a5, t3
> + fixup REG_L a5, SZREG(a1), 10f
> + addi a1, a1, SZREG
> + sll a2, a5, t4
> + or a2, a2, a4
> + fixup REG_S a2, 0(a0), 10f
> + addi a0, a0, SZREG
> + bltu a0, t1, 3b
> +
> + /* Revert src to original unaligned value */
> + add a1, a1, a3
> +
> +.Lbyte_copy_tail:
> + /*
> + * Byte copy anything left.
> + *
> + * a0 - start of remaining dst
> + * a1 - start of remaining src
> + * t0 - end of remaining dst
> + */
> + bgeu a0, t0, 5f
> +4:
> + fixup lb a5, 0(a1), 10f
> + addi a1, a1, 1 /* src */
> + fixup sb a5, 0(a0), 10f
> + addi a0, a0, 1 /* dst */
> + bltu a0, t0, 4b /* t0 - end of dst */
> +
> +5:
> /* Disable access to user memory */
> csrc CSR_STATUS, t6
> - li a0, 0
> + li a0, 0
> ret
> -4: /* Edge case: unalignment */
> - fixup lbu, t2, (a1), 10f
> - fixup sb, t2, (a0), 10f
> - addi a1, a1, 1
> - addi a0, a0, 1
> - bltu a1, t0, 4b
> - j 1b
> -5: /* Edge case: remainder */
> - fixup lbu, t2, (a1), 10f
> - fixup sb, t2, (a0), 10f
> - addi a1, a1, 1
> - addi a0, a0, 1
> - bltu a1, a3, 5b
> - j 3b
> ENDPROC(__asm_copy_to_user)
> ENDPROC(__asm_copy_from_user)
> EXPORT_SYMBOL(__asm_copy_to_user)
> @@ -117,7 +228,7 @@ EXPORT_SYMBOL(__clear_user)
> 10:
> /* Disable access to user memory */
> csrs CSR_STATUS, t6
> - mv a0, a2
> + mv a0, t5
> ret
> 11:
> csrs CSR_STATUS, t6

That said, this is good enough for me. If someone comes up with a case
where the extra unrolling is an issue I'm happy to take something to fix
it, but until then I'm fine with this as-is. Like the string fuctions
it's probably best to eventually put this in C, but IIRC last time I
tried it was kind of a headache.

This is on for-next.

Thanks!

2021-07-07 10:09:04

by David Laight

[permalink] [raw]

Subject: RE: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall

...
> > + fixup REG_L a4, 0(a1), 10f
> > + fixup REG_L a5, SZREG(a1), 10f
> > + fixup REG_L a6, 2*SZREG(a1), 10f
> > + fixup REG_L a7, 3*SZREG(a1), 10f
> > + fixup REG_L t1, 4*SZREG(a1), 10f
> > + fixup REG_L t2, 5*SZREG(a1), 10f
> > + fixup REG_L t3, 6*SZREG(a1), 10f
> > + fixup REG_L t4, 7*SZREG(a1), 10f
> > + fixup REG_S a4, 0(a0), 10f
> > + fixup REG_S a5, SZREG(a0), 10f
> > + fixup REG_S a6, 2*SZREG(a0), 10f
> > + fixup REG_S a7, 3*SZREG(a0), 10f
> > + fixup REG_S t1, 4*SZREG(a0), 10f
> > + fixup REG_S t2, 5*SZREG(a0), 10f
> > + fixup REG_S t3, 6*SZREG(a0), 10f
> > + fixup REG_S t4, 7*SZREG(a0), 10f
>
> This seems like a suspiciously large unrolling factor, at least without
> a fallback. My guess is that some workloads will want some smaller
> unrolling factors, but given that we run on these single-issue in-order
> processors it's probably best to have some big unrolling factors as well
> since they're pretty limited WRT integer bandwidth.

But a single-issue cpu is unlikely to have an 8 clock data delay.
OTOH a cpu than can do concurrent memory read and write might
not have enough 'out of order' capability to do so with the above loop.

You may want to interleave the reads and writes - starting with
two or three reads (possibly with the extra ones outside the loop).

I don't know the microarchitectures well enough (well at all)
to know the exact pitfalls.

The very simple cpu might have the same 'issue' the Nios2 has
(another MIPS clone fpga soft cpu) where there can be a pipeline
stall between a write and read.
I doubt the non-trivial riscv have that issue though.

> > + addi a0, a0, 8*SZREG
> > + addi a1, a1, 8*SZREG
> > + bltu a0, t0, 2b

For a dual-issue cpu you want to move the two 'addi' higher
up the loop so that they are 'free'.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-07-10 01:51:26

by Guenter Roeck

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall

Hi,

On Wed, Jun 23, 2021 at 09:40:39PM +0900, Akira Tsukamoto wrote:
> This patch will reduce cpu usage dramatically in kernel space especially
> for application which use sys-call with large buffer size, such as network
> applications. The main reason behind this is that every unaligned memory
> access will raise exceptions and switch between s-mode and m-mode causing
> large overhead.
>
> First copy in bytes until reaches the first word aligned boundary in
> destination memory address. This is the preparation before the bulk
> aligned word copy.
>
> The destination address is aligned now, but oftentimes the source address
> is not in an aligned boundary. To reduce the unaligned memory access, it
> reads the data from source in aligned boundaries, which will cause the
> data to have an offset, and then combines the data in the next iteration
> by fixing offset with shifting before writing to destination. The majority
> of the improving copy speed comes from this shift copy.
>
> In the lucky situation that the both source and destination address are on
> the aligned boundary, perform load and store with register size to copy the
> data. Without the unrolling, it will reduce the speed since the next store
> instruction for the same register using from the load will stall the
> pipeline.
>
> At last, copying the remainder in one byte at a time.
>
> Signed-off-by: Akira Tsukamoto <[email protected]>

This patch causes all riscv32 qemu emulations to stall during boot.
The log suggests that something in kernel/user communication may be wrong.

Bad case:

Starting syslogd: OK
Starting klogd: OK
/etc/init.d/S02sysctl: line 68: syntax error: EOF in backquote substitution
/etc/init.d/S20urandom: line 1: syntax error: unterminated quoted string
Starting network: /bin/sh: syntax error: unterminated quoted string
sed: unmatched '/'
/bin/sh: syntax error: unterminated quoted string
FAIL
/etc/init.d/S55runtest: line 48: syntax error: EOF in backquote substitution

Good case (this patch reverted):

Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Saving random seed: [ 12.277714] random: dd: uninitialized urandom read (512 bytes read)
OK
Starting network: [ 12.949529] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ 12.951170] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
udhcpc: started, v1.33.0
udhcpc: sending discover
udhcpc: sending select for 10.0.2.15
udhcpc: lease of 10.0.2.15 obtained, lease time 86400
deleting routers
adding dns 10.0.2.3
OK
Found console ttyS0

Reverting this patch fixes the problem. Bisect log attached.

Guenter

---
# bad: [50be9417e23af5a8ac860d998e1e3f06b8fd79d7] Merge tag 'io_uring-5.14-2021-07-09' of git://git.kernel.dk/linux-block
# good: [f55966571d5eb2876a11e48e798b4592fa1ffbb7] Merge tag 'drm-next-2021-07-08-1' of git://anongit.freedesktop.org/drm/drm
git bisect start 'HEAD' 'f55966571d5e'
# good: [7a400bf28334fc7734639db3566394e1fc80670c] Merge tag 'for-linus-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
git bisect good 7a400bf28334fc7734639db3566394e1fc80670c
# bad: [d8dc121eeab9abfbc510097f8db83e87560f753b] Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect bad d8dc121eeab9abfbc510097f8db83e87560f753b
# bad: [7761e36bc7222d1221242c5f195ee0fd40caea40] riscv: Fix PTDUMP output now BPF region moved back to module region
git bisect bad 7761e36bc7222d1221242c5f195ee0fd40caea40
# good: [5def4429aefe65b494816d9ba8ae7f971d522251] riscv: mm: Use better bitmap_zalloc()
git bisect good 5def4429aefe65b494816d9ba8ae7f971d522251
# good: [47513f243b452a5e21180dcf3d6ac1c57e1781a6] riscv: Enable KFENCE for riscv64
git bisect good 47513f243b452a5e21180dcf3d6ac1c57e1781a6
# good: [01112e5e20f5298a81639806cd0a3c587aade467] Merge branch 'riscv-wx-mappings' into for-next
git bisect good 01112e5e20f5298a81639806cd0a3c587aade467
# good: [70eee556b678d1e4cd4ea6742a577b596963fa25] riscv: ptrace: add argn syntax
git bisect good 70eee556b678d1e4cd4ea6742a577b596963fa25
# bad: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
git bisect bad ca6eaaa210deec0e41cbfc380bf89cf079203569
# good: [31da94c25aea835ceac00575a9fd206c5a833fed] riscv: add VMAP_STACK overflow detection
git bisect good 31da94c25aea835ceac00575a9fd206c5a833fed
# first bad commit: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall

2021-07-13 18:12:05

by Geert Uytterhoeven

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall

Hi Günter, Tsukamoto-san,

On Sat, Jul 10, 2021 at 3:50 AM Guenter Roeck <[email protected]> wrote:
> On Wed, Jun 23, 2021 at 09:40:39PM +0900, Akira Tsukamoto wrote:
> > This patch will reduce cpu usage dramatically in kernel space especially
> > for application which use sys-call with large buffer size, such as network
> > applications. The main reason behind this is that every unaligned memory
> > access will raise exceptions and switch between s-mode and m-mode causing
> > large overhead.
> >
> > First copy in bytes until reaches the first word aligned boundary in
> > destination memory address. This is the preparation before the bulk
> > aligned word copy.
> >
> > The destination address is aligned now, but oftentimes the source address
> > is not in an aligned boundary. To reduce the unaligned memory access, it
> > reads the data from source in aligned boundaries, which will cause the
> > data to have an offset, and then combines the data in the next iteration
> > by fixing offset with shifting before writing to destination. The majority
> > of the improving copy speed comes from this shift copy.
> >
> > In the lucky situation that the both source and destination address are on
> > the aligned boundary, perform load and store with register size to copy the
> > data. Without the unrolling, it will reduce the speed since the next store
> > instruction for the same register using from the load will stall the
> > pipeline.
> >
> > At last, copying the remainder in one byte at a time.
> >
> > Signed-off-by: Akira Tsukamoto <[email protected]>
>
> This patch causes all riscv32 qemu emulations to stall during boot.
> The log suggests that something in kernel/user communication may be wrong.
>
> Bad case:
>
> Starting syslogd: OK
> Starting klogd: OK
> /etc/init.d/S02sysctl: line 68: syntax error: EOF in backquote substitution
> /etc/init.d/S20urandom: line 1: syntax error: unterminated quoted string
> Starting network: /bin/sh: syntax error: unterminated quoted string

> # first bad commit: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall

Same here on vexriscv. Bisected to the same commit.

The actual scripts look fine when using "cat", but contain some garbage
when executing them using "sh -v".

Tsukamoto-san: glancing at the patch:

+ addi a0, a0, 8*SZREG
+ addi a1, a1, 8*SZREG

I think you forgot about rv32, where registers cover only 4
bytes each?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2021-07-15 08:01:52

by Akira Tsukamoto

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall

On 7/14/2021 3:10 AM, Geert Uytterhoeven wrote:
> Hi Günter, Tsukamoto-san,
>
> On Sat, Jul 10, 2021 at 3:50 AM Guenter Roeck <[email protected]> wrote:
>> On Wed, Jun 23, 2021 at 09:40:39PM +0900, Akira Tsukamoto wrote:
>>> This patch will reduce cpu usage dramatically in kernel space especially
>>> for application which use sys-call with large buffer size, such as network
>>> applications. The main reason behind this is that every unaligned memory
>>> access will raise exceptions and switch between s-mode and m-mode causing
>>> large overhead.
>>>
>>> First copy in bytes until reaches the first word aligned boundary in
>>> destination memory address. This is the preparation before the bulk
>>> aligned word copy.
>>>
>>> The destination address is aligned now, but oftentimes the source address
>>> is not in an aligned boundary. To reduce the unaligned memory access, it
>>> reads the data from source in aligned boundaries, which will cause the
>>> data to have an offset, and then combines the data in the next iteration
>>> by fixing offset with shifting before writing to destination. The majority
>>> of the improving copy speed comes from this shift copy.
>>>
>>> In the lucky situation that the both source and destination address are on
>>> the aligned boundary, perform load and store with register size to copy the
>>> data. Without the unrolling, it will reduce the speed since the next store
>>> instruction for the same register using from the load will stall the
>>> pipeline.
>>>
>>> At last, copying the remainder in one byte at a time.
>>>
>>> Signed-off-by: Akira Tsukamoto <[email protected]>
>>
>> This patch causes all riscv32 qemu emulations to stall during boot.
>> The log suggests that something in kernel/user communication may be wrong.
>>
>> Bad case:
>>
>> Starting syslogd: OK
>> Starting klogd: OK
>> /etc/init.d/S02sysctl: line 68: syntax error: EOF in backquote substitution
>> /etc/init.d/S20urandom: line 1: syntax error: unterminated quoted string
>> Starting network: /bin/sh: syntax error: unterminated quoted string
>
>> # first bad commit: [ca6eaaa210deec0e41cbfc380bf89cf079203569] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
>
> Same here on vexriscv. Bisected to the same commit.
>
> The actual scripts look fine when using "cat", but contain some garbage
> when executing them using "sh -v".
>
> Tsukamoto-san: glancing at the patch:
>
> + addi a0, a0, 8*SZREG
> + addi a1, a1, 8*SZREG
>
> I think you forgot about rv32, where registers cover only 4
> bytes each?

Thanks Günter and Geert for the pointing out the errors.
I will send the fixes, probably this weekend.

Akira