2021-06-15 02:43:41

by Matteo Croce

[permalink] [raw]
Subject: [PATCH 1/3] riscv: optimized memcpy

From: Matteo Croce <[email protected]>

Write a C version of memcpy() which uses the biggest data size allowed,
without generating unaligned accesses.

The procedure is made of three steps:
First copy data one byte at a time until the destination buffer is aligned
to a long boundary.
Then copy the data one long at a time, shifting the current and the next
long to compose a long at every cycle.
Finally, copy the remainder one byte at a time.
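
For example, with 64-bit longs and a source one byte past an aligned
address (distance = 1 in the code below), the middle step composes each
destination long roughly as (illustrative only, little endian):

        d.ulong[0] = s.ulong[0] >> 8    /* bytes 1..7 of the current long */
                   | s.ulong[1] << 56;  /* byte 0 of the next long */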

On a BeagleV, the TCP RX throughput increased by 45%:

before:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[ 5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 76.4 MBytes 641 Mbits/sec 27 624 KBytes
[ 5] 1.00-2.00 sec 72.5 MBytes 608 Mbits/sec 0 708 KBytes
[ 5] 2.00-3.00 sec 73.8 MBytes 619 Mbits/sec 10 451 KBytes
[ 5] 3.00-4.00 sec 72.5 MBytes 608 Mbits/sec 0 564 KBytes
[ 5] 4.00-5.00 sec 73.8 MBytes 619 Mbits/sec 0 658 KBytes
[ 5] 5.00-6.00 sec 73.8 MBytes 619 Mbits/sec 14 522 KBytes
[ 5] 6.00-7.00 sec 73.8 MBytes 619 Mbits/sec 0 621 KBytes
[ 5] 7.00-8.00 sec 72.5 MBytes 608 Mbits/sec 0 706 KBytes
[ 5] 8.00-9.00 sec 73.8 MBytes 619 Mbits/sec 20 580 KBytes
[ 5] 9.00-10.00 sec 73.8 MBytes 619 Mbits/sec 0 672 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 736 MBytes 618 Mbits/sec 71 sender
[ 5] 0.00-10.01 sec 733 MBytes 615 Mbits/sec receiver

after:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[ 5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 109 MBytes 912 Mbits/sec 48 559 KBytes
[ 5] 1.00-2.00 sec 108 MBytes 902 Mbits/sec 0 690 KBytes
[ 5] 2.00-3.00 sec 106 MBytes 891 Mbits/sec 36 396 KBytes
[ 5] 3.00-4.00 sec 108 MBytes 902 Mbits/sec 0 567 KBytes
[ 5] 4.00-5.00 sec 106 MBytes 891 Mbits/sec 0 699 KBytes
[ 5] 5.00-6.00 sec 106 MBytes 891 Mbits/sec 32 414 KBytes
[ 5] 6.00-7.00 sec 106 MBytes 891 Mbits/sec 0 583 KBytes
[ 5] 7.00-8.00 sec 106 MBytes 891 Mbits/sec 0 708 KBytes
[ 5] 8.00-9.00 sec 106 MBytes 891 Mbits/sec 28 433 KBytes
[ 5] 9.00-10.00 sec 108 MBytes 902 Mbits/sec 0 591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.04 GBytes 897 Mbits/sec 144 sender
[ 5] 0.00-10.01 sec 1.04 GBytes 894 Mbits/sec receiver

The decrease in CPU time spent in memcpy() is also observable with perf top.
This is the `perf top -Ue task-clock` output when doing the test:

before:

Overhead Shared O Symbol
42.22% [kernel] [k] memcpy
35.00% [kernel] [k] __asm_copy_to_user
3.50% [kernel] [k] sifive_l2_flush64_range
2.30% [kernel] [k] stmmac_napi_poll_rx
1.11% [kernel] [k] memset

after:

Overhead Shared O Symbol
45.69% [kernel] [k] __asm_copy_to_user
29.06% [kernel] [k] memcpy
4.09% [kernel] [k] sifive_l2_flush64_range
2.77% [kernel] [k] stmmac_napi_poll_rx
1.24% [kernel] [k] memset

Signed-off-by: Matteo Croce <[email protected]>
---
arch/riscv/include/asm/string.h | 8 ++-
arch/riscv/kernel/riscv_ksyms.c | 2 -
arch/riscv/lib/Makefile | 2 +-
arch/riscv/lib/memcpy.S | 108 --------------------------------
arch/riscv/lib/string.c | 94 +++++++++++++++++++++++++++
5 files changed, 101 insertions(+), 113 deletions(-)
delete mode 100644 arch/riscv/lib/memcpy.S
create mode 100644 arch/riscv/lib/string.c

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 909049366555..6b5d6fc3eab4 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -12,9 +12,13 @@
#define __HAVE_ARCH_MEMSET
extern asmlinkage void *memset(void *, int, size_t);
extern asmlinkage void *__memset(void *, int, size_t);
+
+#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
#define __HAVE_ARCH_MEMCPY
-extern asmlinkage void *memcpy(void *, const void *, size_t);
-extern asmlinkage void *__memcpy(void *, const void *, size_t);
+extern void *memcpy(void *dest, const void *src, size_t count);
+extern void *__memcpy(void *dest, const void *src, size_t count);
+#endif
+
#define __HAVE_ARCH_MEMMOVE
extern asmlinkage void *memmove(void *, const void *, size_t);
extern asmlinkage void *__memmove(void *, const void *, size_t);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 5ab1c7e1a6ed..3f6d512a5b97 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,8 +10,6 @@
* Assembly functions that may be used (directly or indirectly) by modules
*/
EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memcpy);
EXPORT_SYMBOL(memmove);
EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memcpy);
EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 25d5c9664e57..2ffe85d4baee 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,9 +1,9 @@
# SPDX-License-Identifier: GPL-2.0-only
lib-y += delay.o
-lib-y += memcpy.o
lib-y += memset.o
lib-y += memmove.o
lib-$(CONFIG_MMU) += uaccess.o
lib-$(CONFIG_64BIT) += tishift.o
+lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o

obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
deleted file mode 100644
index 51ab716253fa..000000000000
--- a/arch/riscv/lib/memcpy.S
+++ /dev/null
@@ -1,108 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memcpy(void *, const void *, size_t) */
-ENTRY(__memcpy)
-WEAK(memcpy)
- move t6, a0 /* Preserve return value */
-
- /* Defer to byte-oriented copy for small sizes */
- sltiu a3, a2, 128
- bnez a3, 4f
- /* Use word-oriented copy only if low-order bits match */
- andi a3, t6, SZREG-1
- andi a4, a1, SZREG-1
- bne a3, a4, 4f
-
- beqz a3, 2f /* Skip if already aligned */
- /*
- * Round to nearest double word-aligned address
- * greater than or equal to start address
- */
- andi a3, a1, ~(SZREG-1)
- addi a3, a3, SZREG
- /* Handle initial misalignment */
- sub a4, a3, a1
-1:
- lb a5, 0(a1)
- addi a1, a1, 1
- sb a5, 0(t6)
- addi t6, t6, 1
- bltu a1, a3, 1b
- sub a2, a2, a4 /* Update count */
-
-2:
- andi a4, a2, ~((16*SZREG)-1)
- beqz a4, 4f
- add a3, a1, a4
-3:
- REG_L a4, 0(a1)
- REG_L a5, SZREG(a1)
- REG_L a6, 2*SZREG(a1)
- REG_L a7, 3*SZREG(a1)
- REG_L t0, 4*SZREG(a1)
- REG_L t1, 5*SZREG(a1)
- REG_L t2, 6*SZREG(a1)
- REG_L t3, 7*SZREG(a1)
- REG_L t4, 8*SZREG(a1)
- REG_L t5, 9*SZREG(a1)
- REG_S a4, 0(t6)
- REG_S a5, SZREG(t6)
- REG_S a6, 2*SZREG(t6)
- REG_S a7, 3*SZREG(t6)
- REG_S t0, 4*SZREG(t6)
- REG_S t1, 5*SZREG(t6)
- REG_S t2, 6*SZREG(t6)
- REG_S t3, 7*SZREG(t6)
- REG_S t4, 8*SZREG(t6)
- REG_S t5, 9*SZREG(t6)
- REG_L a4, 10*SZREG(a1)
- REG_L a5, 11*SZREG(a1)
- REG_L a6, 12*SZREG(a1)
- REG_L a7, 13*SZREG(a1)
- REG_L t0, 14*SZREG(a1)
- REG_L t1, 15*SZREG(a1)
- addi a1, a1, 16*SZREG
- REG_S a4, 10*SZREG(t6)
- REG_S a5, 11*SZREG(t6)
- REG_S a6, 12*SZREG(t6)
- REG_S a7, 13*SZREG(t6)
- REG_S t0, 14*SZREG(t6)
- REG_S t1, 15*SZREG(t6)
- addi t6, t6, 16*SZREG
- bltu a1, a3, 3b
- andi a2, a2, (16*SZREG)-1 /* Update count */
-
-4:
- /* Handle trailing misalignment */
- beqz a2, 6f
- add a3, a1, a2
-
- /* Use word-oriented copy if co-aligned to word boundary */
- or a5, a1, t6
- or a5, a5, a3
- andi a5, a5, 3
- bnez a5, 5f
-7:
- lw a4, 0(a1)
- addi a1, a1, 4
- sw a4, 0(t6)
- addi t6, t6, 4
- bltu a1, a3, 7b
-
- ret
-
-5:
- lb a4, 0(a1)
- addi a1, a1, 1
- sb a4, 0(t6)
- addi t6, t6, 1
- bltu a1, a3, 5b
-6:
- ret
-END(__memcpy)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
new file mode 100644
index 000000000000..525f9ee25a74
--- /dev/null
+++ b/arch/riscv/lib/string.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * String functions optimized for hardware which doesn't
+ * handle unaligned memory accesses efficiently.
+ *
+ * Copyright (C) 2021 Matteo Croce
+ */
+
+#include <linux/types.h>
+#include <linux/module.h>
+
+/* size below which a classic byte at a time copy is done */
+#define MIN_THRESHOLD 64
+
+/* convenience types to avoid cast between different pointer types */
+union types {
+ u8 *u8;
+ unsigned long *ulong;
+ uintptr_t uptr;
+};
+
+union const_types {
+ const u8 *u8;
+ unsigned long *ulong;
+};
+
+void *memcpy(void *dest, const void *src, size_t count)
+{
+ const int bytes_long = BITS_PER_LONG / 8;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+ const int mask = bytes_long - 1;
+ const int distance = (src - dest) & mask;
+#endif
+ union const_types s = { .u8 = src };
+ union types d = { .u8 = dest };
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+ if (count <= MIN_THRESHOLD)
+ goto copy_remainder;
+
+ /* copy a byte at time until destination is aligned */
+ for (; count && d.uptr & mask; count--)
+ *d.u8++ = *s.u8++;
+
+ if (distance) {
+ unsigned long last, next;
+
+ /* move s backward to the previous alignment boundary */
+ s.u8 -= distance;
+
+ /* 32/64 bit wide copy from s to d.
+ * d is aligned now but s is not, so read s alignment wise,
+ * and do proper shift to get the right value.
+ * Works only on Little Endian machines.
+ */
+ for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
+ last = next;
+ next = s.ulong[1];
+
+ d.ulong[0] = last >> (distance * 8) |
+ next << ((bytes_long - distance) * 8);
+
+ d.ulong++;
+ s.ulong++;
+ }
+
+ /* restore s with the original offset */
+ s.u8 += distance;
+ } else
+#endif
+ {
+ /* if the source and dest lower bits are the same, do a simple
+ * 32/64 bit wide copy.
+ */
+ for (; count >= bytes_long; count -= bytes_long)
+ *d.ulong++ = *s.ulong++;
+ }
+
+ /* suppress warning when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y */
+ goto copy_remainder;
+
+copy_remainder:
+ while (count--)
+ *d.u8++ = *s.u8++;
+
+ return dest;
+}
+EXPORT_SYMBOL(memcpy);
+
+void *__memcpy(void *dest, const void *src, size_t count)
+{
+ return memcpy(dest, src, count);
+}
+EXPORT_SYMBOL(__memcpy);
--
2.31.1


2021-06-15 08:59:32

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 1/3] riscv: optimized memcpy

From: Matteo Croce
> Sent: 15 June 2021 03:38
>
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.

I'm surprised that the C loop:

> + for (; count >= bytes_long; count -= bytes_long)
> + *d.ulong++ = *s.ulong++;

ends up being faster than the ASM 'read lots' - 'write lots' loop.

Especially since there was an earlier patch to convert
copy_to/from_user() to use the ASM 'read lots' - 'write lots' loop
instead of a tight single register copy loop.

I'd also guess that the performance needs to be measured on
different classes of riscv cpu.

A simple cpu will behave differently to one that can execute
multiple instructions per clock.
Any form of 'out of order' execution also changes things.
The other big change is whether the cpu can do a memory
read and write in the same clock.

I'd guess that riscv cpus exist with some/all of those features.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-06-15 13:09:55

by Bin Meng

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Tue, Jun 15, 2021 at 4:57 PM David Laight <[email protected]> wrote:
>
> From: Matteo Croce
> > Sent: 15 June 2021 03:38
> >
> > Write a C version of memcpy() which uses the biggest data size allowed,
> > without generating unaligned accesses.
>
> I'm surprised that the C loop:
>
> > + for (; count >= bytes_long; count -= bytes_long)
> > + *d.ulong++ = *s.ulong++;
>
> ends up being faster than the ASM 'read lots' - 'write lots' loop.

I believe that's because the assembly version has some unaligned
access cases, which end up being trap-n-emulated in the OpenSBI
firmware, and that is a big overhead.

>
> Especially since there was an earlier patch to convert
> copy_to/from_user() to use the ASM 'read lots' - 'write lots' loop
> instead of a tight single register copy loop.
>
> I'd also guess that the performance needs to be measured on
> different classes of riscv cpu.
>
> A simple cpu will behave differently to one that can execute
> multiple instructions per clock.
> Any form of 'out of order' execution also changes things.
> The other big change is whether the cpu can do a memory
> read and write in the same clock.
>
> I'd guess that riscv cpus exist with some/all of those features.

Regards,
Bin

2021-06-15 13:19:58

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 1/3] riscv: optimized memcpy

From: Bin Meng
> Sent: 15 June 2021 14:09
>
> On Tue, Jun 15, 2021 at 4:57 PM David Laight <[email protected]> wrote:
> >
...
> > I'm surprised that the C loop:
> >
> > > + for (; count >= bytes_long; count -= bytes_long)
> > > + *d.ulong++ = *s.ulong++;
> >
> > ends up being faster than the ASM 'read lots' - 'write lots' loop.
>
> I believe that's because the assembly version has some unaligned
> access cases, which end up being trap-n-emulated in the OpenSBI
> firmware, and that is a big overhead.

Ah, that would make sense since the asm user copy code
was broken for misaligned copies.
I suspect memcpy() was broken the same way.

I'm surprised IP_NET_ALIGN isn't set to 2 to try to
avoid all these misaligned copies in the network stack.
Although avoiding 8n+4 aligned data is rather harder.

Misaligned copies are just best avoided - really even on x86.
The 'real fun' is when the access crosses TLB boundaries.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-06-15 13:31:50

by Bin Meng

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Tue, Jun 15, 2021 at 9:18 PM David Laight <[email protected]> wrote:
>
> From: Bin Meng
> > Sent: 15 June 2021 14:09
> >
> > On Tue, Jun 15, 2021 at 4:57 PM David Laight <[email protected]> wrote:
> > >
> ...
> > > I'm surprised that the C loop:
> > >
> > > > + for (; count >= bytes_long; count -= bytes_long)
> > > > + *d.ulong++ = *s.ulong++;
> > >
> > > ends up being faster than the ASM 'read lots' - 'write lots' loop.
> >
> > I believe that's because the assembly version has some unaligned
> > access cases, which end up being trap-n-emulated in the OpenSBI
> > firmware, and that is a big overhead.
>
> Ah, that would make sense since the asm user copy code
> was broken for misaligned copies.
> I suspect memcpy() was broken the same way.
>

Yes, Gary Guo sent one patch long time ago against the broken assembly
version, but that patch was still not applied as of today.
https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/

I suggest Matteo re-test using Gary's version.

> I'm surprised IP_NET_ALIGN isn't set to 2 to try to
> avoid all these misaligned copies in the network stack.
> Although avoiding 8n+4 aligned data is rather harder.
>
> Misaligned copies are just best avoided - really even on x86.
> The 'real fun' is when the access crosses TLB boundaries.

Regards,
Bin

2021-06-15 13:46:57

by Matteo Croce

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Tue, Jun 15, 2021 at 3:18 PM David Laight <[email protected]> wrote:
>
> From: Bin Meng
> > Sent: 15 June 2021 14:09
> >
> > On Tue, Jun 15, 2021 at 4:57 PM David Laight <[email protected]> wrote:
> > >
> ...
> > > I'm surprised that the C loop:
> > >
> > > > + for (; count >= bytes_long; count -= bytes_long)
> > > > + *d.ulong++ = *s.ulong++;
> > >
> > > ends up being faster than the ASM 'read lots' - 'write lots' loop.
> >
> > I believe that's because the assembly version has some unaligned
> > access cases, which end up being trap-n-emulated in the OpenSBI
> > firmware, and that is a big overhead.
>
> Ah, that would make sense since the asm user copy code
> was broken for misaligned copies.
> I suspect memcpy() was broken the same way.
>
> I'm surprised IP_NET_ALIGN isn't set to 2 to try to
> avoid all these misaligned copies in the network stack.
> Although avoiding 8n+4 aligned data is rather harder.
>

That's up to the network driver; indeed, I already have a patch for the
BeagleV one:

https://lore.kernel.org/netdev/[email protected]/T/

> Misaligned copies are just best avoided - really even on x86.
> The 'real fun' is when the access crosses TLB boundaries.
>

--
per aspera ad upstream

2021-06-15 16:14:21

by Emil Renner Berthing

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Tue, 15 Jun 2021 at 15:29, Bin Meng <[email protected]> wrote:
> ...
> Yes, Gary Guo sent one patch long time ago against the broken assembly
> version, but that patch was still not applied as of today.
> https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/
>
> I suggest Matteo re-test using Gary's version.

That's a good idea, but if you read the replies to Gary's original patch
https://lore.kernel.org/linux-riscv/[email protected]/
.. both Gary, Palmer and David would rather like a C-based version.
This is one attempt at providing that.

> > I'm surprised IP_NET_ALIGN isn't set to 2 to try to
> > avoid all these misaligned copies in the network stack.
> > Although avoiding 8n+4 aligned data is rather harder.
> >
> > Misaligned copies are just best avoided - really even on x86.
> > The 'real fun' is when the access crosses TLB boundaries.
>
> Regards,
> Bin

2021-06-16 00:36:07

by Bin Meng

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Wed, Jun 16, 2021 at 12:12 AM Emil Renner Berthing <[email protected]> wrote:
>
> On Tue, 15 Jun 2021 at 15:29, Bin Meng <[email protected]> wrote:
> > ...
> > Yes, Gary Guo sent one patch long time ago against the broken assembly
> > version, but that patch was still not applied as of today.
> > https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/
> >
> > I suggest Matteo re-test using Gary's version.
>
> That's a good idea, but if you read the replies to Gary's original patch
> https://lore.kernel.org/linux-riscv/[email protected]/
> .. both Gary, Palmer and David would rather like a C-based version.
> This is one attempt at providing that.

Yep, I prefer C as well :)

But if you check commit 04091d6, the assembly version was introduced
for KASAN. So if we are to change it back to C, please make sure KASAN
is not broken.

Regards,
Bin

2021-06-16 02:02:27

by Matteo Croce

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Wed, 16 Jun 2021 08:33:21 +0800
Bin Meng <[email protected]> wrote:

> On Wed, Jun 16, 2021 at 12:12 AM Emil Renner Berthing
> <[email protected]> wrote:
> >
> > On Tue, 15 Jun 2021 at 15:29, Bin Meng <[email protected]> wrote:
> > > ...
> > > Yes, Gary Guo sent one patch long time ago against the broken
> > > assembly version, but that patch was still not applied as of
> > > today.
> > > https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/
> > >
> > > I suggest Matteo re-test using Gary's version.
> >
> > That's a good idea, but if you read the replies to Gary's original
> > patch
> > https://lore.kernel.org/linux-riscv/[email protected]/
> > .. both Gary, Palmer and David would rather like a C-based version.
> > This is one attempt at providing that.
>
> Yep, I prefer C as well :)
>
> But if you check commit 04091d6, the assembly version was introduced
> for KASAN. So if we are to change it back to C, please make sure KASAN
> is not broken.
>
> Regards,
> Bin
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

I added a small benchmark for memcpy() and memset() in lib/test_string.c:

memcpy_align_selftest():

#define PG_SIZE (1 << (MAX_ORDER - 1 + PAGE_SHIFT))

        page1 = alloc_pages(GFP_KERNEL, MAX_ORDER-1);
        page2 = alloc_pages(GFP_KERNEL, MAX_ORDER-1);
        src = page_address(page1);      /* map the pages to usable addresses */
        dst = page_address(page2);

        for (i = 0; i < sizeof(void*); i++) {
                for (j = 0; j < sizeof(void*); j++) {
                        t0 = ktime_get();
                        memcpy(dst + j, src + i, PG_SIZE - max(i, j));
                        t1 = ktime_get();
                        printk("Strings selftest: memcpy(src+%d, dst+%d): %llu Mb/s\n",
                               i, j, PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
                }
        }

memset_align_selftest():

        page = alloc_pages(GFP_KERNEL, MAX_ORDER-1);
        dst = page_address(page);

        for (i = 0; i < sizeof(void*); i++) {
                t0 = ktime_get();
                memset(dst + i, 0, PG_SIZE - i);
                t1 = ktime_get();
                printk("Strings selftest: memset(dst+%d): %llu Mb/s\n", i,
                       PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
        }

And I ran it against the three implementations: the current one, Gary's
assembly, and mine in C.

Current:
[ 38.980687] Strings selftest: memcpy(src+0, dst+0): 231 Mb/s
[ 39.021612] Strings selftest: memcpy(src+0, dst+1): 113 Mb/s
[ 39.062191] Strings selftest: memcpy(src+0, dst+2): 114 Mb/s
[ 39.102669] Strings selftest: memcpy(src+0, dst+3): 114 Mb/s
[ 39.127423] Strings selftest: memcpy(src+0, dst+4): 209 Mb/s
[ 39.167836] Strings selftest: memcpy(src+0, dst+5): 115 Mb/s
[ 39.208305] Strings selftest: memcpy(src+0, dst+6): 114 Mb/s
[ 39.248712] Strings selftest: memcpy(src+0, dst+7): 115 Mb/s
[ 39.288144] Strings selftest: memcpy(src+1, dst+0): 118 Mb/s
[ 39.309190] Strings selftest: memcpy(src+1, dst+1): 260 Mb/s
[ 39.349721] Strings selftest: memcpy(src+1, dst+2): 114 Mb/s
[...]
[ 41.289423] Strings selftest: memcpy(src+7, dst+5): 114 Mb/s
[ 41.328801] Strings selftest: memcpy(src+7, dst+6): 118 Mb/s
[ 41.349907] Strings selftest: memcpy(src+7, dst+7): 259 Mb/s

[ 41.377735] Strings selftest: memset(dst+0): 241 Mb/s
[ 41.397882] Strings selftest: memset(dst+1): 265 Mb/s
[ 41.417666] Strings selftest: memset(dst+2): 272 Mb/s
[ 41.437169] Strings selftest: memset(dst+3): 277 Mb/s
[ 41.456656] Strings selftest: memset(dst+4): 277 Mb/s
[ 41.476125] Strings selftest: memset(dst+5): 278 Mb/s
[ 41.495555] Strings selftest: memset(dst+6): 278 Mb/s
[ 41.515002] Strings selftest: memset(dst+7): 278 Mb/s

Gary's:
[ 27.438112] Strings selftest: memcpy(src+0, dst+0): 232 Mb/s
[ 27.461586] Strings selftest: memcpy(src+0, dst+1): 224 Mb/s
[ 27.484691] Strings selftest: memcpy(src+0, dst+2): 229 Mb/s
[ 27.507693] Strings selftest: memcpy(src+0, dst+3): 230 Mb/s
[ 27.530758] Strings selftest: memcpy(src+0, dst+4): 229 Mb/s
[ 27.553840] Strings selftest: memcpy(src+0, dst+5): 229 Mb/s
[ 27.576793] Strings selftest: memcpy(src+0, dst+6): 231 Mb/s
[ 27.599862] Strings selftest: memcpy(src+0, dst+7): 230 Mb/s
[ 27.622888] Strings selftest: memcpy(src+1, dst+0): 230 Mb/s
[ 27.643964] Strings selftest: memcpy(src+1, dst+1): 259 Mb/s
[ 27.666926] Strings selftest: memcpy(src+1, dst+2): 231 Mb/s
[...]
[ 28.831726] Strings selftest: memcpy(src+7, dst+5): 230 Mb/s
[ 28.854790] Strings selftest: memcpy(src+7, dst+6): 229 Mb/s
[ 28.875844] Strings selftest: memcpy(src+7, dst+7): 260 Mb/s

[ 28.903666] Strings selftest: memset(dst+0): 240 Mb/s
[ 28.923533] Strings selftest: memset(dst+1): 269 Mb/s
[ 28.943100] Strings selftest: memset(dst+2): 275 Mb/s
[ 28.962554] Strings selftest: memset(dst+3): 277 Mb/s
[ 28.982009] Strings selftest: memset(dst+4): 277 Mb/s
[ 29.001412] Strings selftest: memset(dst+5): 278 Mb/s
[ 29.020894] Strings selftest: memset(dst+6): 277 Mb/s
[ 29.040383] Strings selftest: memset(dst+7): 276 Mb/s

Mine:
[ 33.916144] Strings selftest: memcpy(src+0, dst+0): 222 Mb/s
[ 33.939520] Strings selftest: memcpy(src+0, dst+1): 226 Mb/s
[ 33.962666] Strings selftest: memcpy(src+0, dst+2): 229 Mb/s
[ 33.985749] Strings selftest: memcpy(src+0, dst+3): 229 Mb/s
[ 34.008748] Strings selftest: memcpy(src+0, dst+4): 231 Mb/s
[ 34.031970] Strings selftest: memcpy(src+0, dst+5): 228 Mb/s
[ 34.055065] Strings selftest: memcpy(src+0, dst+6): 229 Mb/s
[ 34.078068] Strings selftest: memcpy(src+0, dst+7): 231 Mb/s
[ 34.101177] Strings selftest: memcpy(src+1, dst+0): 229 Mb/s
[ 34.122995] Strings selftest: memcpy(src+1, dst+1): 247 Mb/s
[ 34.146072] Strings selftest: memcpy(src+1, dst+2): 229 Mb/s
[...]
[ 35.315594] Strings selftest: memcpy(src+7, dst+5): 229 Mb/s
[ 35.338617] Strings selftest: memcpy(src+7, dst+6): 230 Mb/s
[ 35.360464] Strings selftest: memcpy(src+7, dst+7): 247 Mb/s

[ 35.388929] Strings selftest: memset(dst+0): 232 Mb/s
[ 35.409351] Strings selftest: memset(dst+1): 260 Mb/s
[ 35.429434] Strings selftest: memset(dst+2): 266 Mb/s
[ 35.449460] Strings selftest: memset(dst+3): 267 Mb/s
[ 35.469479] Strings selftest: memset(dst+4): 267 Mb/s
[ 35.489481] Strings selftest: memset(dst+5): 268 Mb/s
[ 35.509443] Strings selftest: memset(dst+6): 269 Mb/s
[ 35.529449] Strings selftest: memset(dst+7): 268 Mb/s

Leaving out the first memcpy/set of every test which is always slower, (maybe
because of a cache miss?), the current implementation copies 260 Mb/s when
the low order bits match, and 114 otherwise.
Memset is stable at 278 Mb/s.

Gary's implementation is much faster, copies still 260 Mb/s when equally placed,
and 230 Mb/s otherwise. Memset is the same as the current one.

Mine has the same speed as Gary's when the low order bits mismatch, but
it's slower when equally aligned: it stops at 247 Mb/s.
Memset is slightly slower at 269 Mb/s.


I'm not familiar with RISC-V assembly, but looking at Gary's code I think
that he manually unrolled the loop, copying 16 uint64_t at a time using
16 registers.
I managed to do the same with a small change in the C code and a pragma directive:

This for memcpy():

        if (distance) {
                unsigned long last, next;
                int i;

                s.u8 -= distance;

                for (; count >= bytes_long * 8 + mask; count -= bytes_long * 8) {
                        next = s.ulong[0];
                        for (i = 0; i < 8; i++) {
                                last = next;
                                next = s.ulong[i + 1];

                                d.ulong[i] = last >> (distance * 8) |
                                             next << ((bytes_long - distance) * 8);
                        }

                        d.ulong += 8;
                        s.ulong += 8;
                }

                s.u8 += distance;
        } else {
                /* 8 byte wide copy */
                int i;
                for (; count >= bytes_long * 8; count -= bytes_long * 8) {
#pragma GCC unroll 8
                        for (i = 0; i < 8; i++)
                                d.ulong[i] = s.ulong[i];
                        d.ulong += 8;
                        s.ulong += 8;
                }
        }

And this for memset:

        for (; count >= bytes_long * 8; count -= bytes_long * 8) {
#pragma GCC unroll 8
                for (i = 0; i < 8; i++)
                        dest.ulong[i] = cu;

                dest.ulong += 8;
        }

And the generated machine code is very, very similar to Gary's!
And these are the results:

[ 35.898366] Strings selftest: memcpy(src+0, dst+0): 231 Mb/s
[ 35.920942] Strings selftest: memcpy(src+0, dst+1): 236 Mb/s
[ 35.943171] Strings selftest: memcpy(src+0, dst+2): 241 Mb/s
[ 35.965291] Strings selftest: memcpy(src+0, dst+3): 242 Mb/s
[ 35.987374] Strings selftest: memcpy(src+0, dst+4): 244 Mb/s
[ 36.009554] Strings selftest: memcpy(src+0, dst+5): 242 Mb/s
[ 36.031721] Strings selftest: memcpy(src+0, dst+6): 242 Mb/s
[ 36.053881] Strings selftest: memcpy(src+0, dst+7): 242 Mb/s
[ 36.075949] Strings selftest: memcpy(src+1, dst+0): 243 Mb/s
[ 36.097084] Strings selftest: memcpy(src+1, dst+1): 258 Mb/s
[ 36.119269] Strings selftest: memcpy(src+1, dst+2): 242 Mb/s
[...]
[ 37.242433] Strings selftest: memcpy(src+7, dst+5): 242 Mb/s
[ 37.264571] Strings selftest: memcpy(src+7, dst+6): 242 Mb/s
[ 37.285609] Strings selftest: memcpy(src+7, dst+7): 260 Mb/s

[ 37.313633] Strings selftest: memset(dst+0): 237 Mb/s
[ 37.333682] Strings selftest: memset(dst+1): 266 Mb/s
[ 37.353375] Strings selftest: memset(dst+2): 273 Mb/s
[ 37.373000] Strings selftest: memset(dst+3): 274 Mb/s
[ 37.392608] Strings selftest: memset(dst+4): 274 Mb/s
[ 37.412220] Strings selftest: memset(dst+5): 274 Mb/s
[ 37.431848] Strings selftest: memset(dst+6): 274 Mb/s
[ 37.451467] Strings selftest: memset(dst+7): 274 Mb/s

This version is even faster than the assembly one, but it won't work for
copies/sets smaller than at least 64 bytes, or even 128. With small buffers
it will copy bytes one at a time, so I don't know if it's worth it.

What is preferred in your opinion: an implementation which is always fast
at all sizes, or one which is a bit faster but slow with small copies?

--
per aspera ad upstream

2021-06-16 08:28:15

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 1/3] riscv: optimized memcpy

From: Matteo Croce
> Sent: 16 June 2021 03:02
...
> > > That's a good idea, but if you read the replies to Gary's original
> > > patch
> > > https://lore.kernel.org/linux-riscv/[email protected]/
> > > .. both Gary, Palmer and David would rather like a C-based version.
> > > This is one attempt at providing that.
> >
> > Yep, I prefer C as well :)
> >
> > But if you check commit 04091d6, the assembly version was introduced
> > for KASAN. So if we are to change it back to C, please make sure KASAN
> > is not broken.
> >
...
> Leaving out the first memcpy/set of every test which is always slower, (maybe
> because of a cache miss?), the current implementation copies 260 Mb/s when
> the low order bits match, and 114 otherwise.
> Memset is stable at 278 Mb/s.
>
> Gary's implementation is much faster, copies still 260 Mb/s when equally placed,
> and 230 Mb/s otherwise. Memset is the same as the current one.

Any idea what the attainable performance is for the cpu you are using?
Since both memset and memcpy are running at much the same speed
I suspect it is all limited by the writes.

272MB/s is only 34M writes/sec.
This seems horribly slow for a modern cpu.
So is this actually really limited by the cache writes to physical memory?

You might want to do some tests (userspace is fine) where you
check much smaller lengths that definitely sit within the data cache.

It is also worth checking how much overhead there is for
short copies - they are almost certainly more common than
you might expect.
This is one problem with excessive loop unrolling - the 'special
cases' for the ends of the buffer start having a big effect
on small copies.

For cpu that support misaligned memory accesses, one 'trick'
for transfers longer than a 'word' is to do a (probably) misaligned
transfer of the last word of the buffer first followed by the
transfer of the rest of the buffer (overlapping a few bytes at the end).
This saves on conditionals and temporary values.
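
A rough sketch of that trick in C, assuming unaligned accesses are cheap
(the helper below is only an illustration, not something in this series):

        #include <string.h>
        #include <stddef.h>

        /* Copy the (possibly misaligned) last word first, then copy the body
         * a word at a time from the start; the final body word may overlap
         * the already-written tail, so no byte-wise trailer is needed.
         * Fixed-size memcpy() is the portable way to spell an unaligned access.
         */
        static void *copy_overlap_tail(void *dest, const void *src, size_t count)
        {
                unsigned char *d = dest;
                const unsigned char *s = src;
                unsigned long w;
                size_t off;

                if (count < sizeof(w)) {        /* too short for the trick */
                        while (count--)
                                *d++ = *s++;
                        return dest;
                }

                memcpy(&w, s + count - sizeof(w), sizeof(w));   /* unaligned load  */
                memcpy(d + count - sizeof(w), &w, sizeof(w));   /* unaligned store */

                for (off = 0; off < count - sizeof(w); off += sizeof(w)) {
                        memcpy(&w, s + off, sizeof(w));
                        memcpy(d + off, &w, sizeof(w));
                }

                return dest;
        }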

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-06-16 10:52:26

by Akira Tsukamoto

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Wed, Jun 16, 2021 at 5:24 PM David Laight <[email protected]> wrote:
>
> From: Matteo Croce
> > Sent: 16 June 2021 03:02
> ...
> > > > That's a good idea, but if you read the replies to Gary's original
> > > > patch
> > > > https://lore.kernel.org/linux-riscv/[email protected]/
> > > > .. both Gary, Palmer and David would rather like a C-based version.
> > > > This is one attempt at providing that.
> > >
> > > Yep, I prefer C as well :)
> > >
> > > But if you check commit 04091d6, the assembly version was introduced
> > > for KASAN. So if we are to change it back to C, please make sure KASAN
> > > is not broken.
> > >
> ...
> > Leaving out the first memcpy/set of every test which is always slower, (maybe
> > because of a cache miss?), the current implementation copies 260 Mb/s when
> > the low order bits match, and 114 otherwise.
> > Memset is stable at 278 Mb/s.
> >
> > Gary's implementation is much faster, copies still 260 Mb/s when equally placed,
> > and 230 Mb/s otherwise. Memset is the same as the current one.
>
> Any idea what the attainable performance is for the cpu you are using?
> Since both memset and memcpy are running at much the same speed
> I suspect it is all limited by the writes.
>
> 272MB/s is only 34M writes/sec.
> This seems horribly slow for a modern cpu.
> So is this actually really limited by the cache writes to physical memory?
>
> You might want to do some tests (userspace is fine) where you
> check much smaller lengths that definitely sit within the data cache.
>
> It is also worth checking how much overhead there is for
> short copies - they are almost certainly more common than
> you might expect.
> This is one problem with excessive loop unrolling - the 'special
> cases' for the ends of the buffer start having a big effect
> on small copies.
>
> For cpu that support misaligned memory accesses, one 'trick'
> for transfers longer than a 'word' is to do a (probably) misaligned
> transfer of the last word of the buffer first followed by the
> transfer of the rest of the buffer (overlapping a few bytes at the end).
> This saves on conditionals and temporary values.

I am fine with Matteo's memcpy.

The two culprits seen in the `perf top -Ue task-clock` output during the
tcp and udp network tests are

> Overhead Shared O Symbol
> 42.22% [kernel] [k] memcpy
> 35.00% [kernel] [k] __asm_copy_to_user

so we really need to optimize both memcpy and __asm_copy_to_user.

The main reason for the speed up in memcpy is that

> Gary's assembly version of memcpy improves things by not doing unaligned
> accesses across a 64 bit boundary: it reads with aligned accesses and then
> shifts, because every misaligned access is trapped and switched to
> OpenSBI in M-mode. The main speed up comes from avoiding the S-mode
> (kernel) to M-mode (OpenSBI) switching.

which are in the code:

Gary's:
+ /* Calculate shifts */
+ slli t3, a3, 3
+ sub t4, x0, t3 /* negate is okay as shift will only look at LSBs */
+
+ /* Load the initial value and align a1 */
+ andi a1, a1, ~(SZREG-1)
+ REG_L a5, 0(a1)
+
+ addi t0, t0, -(SZREG-1)
+ /* At least one iteration will be executed here, no check */
+1:
+ srl a4, a5, t3
+ REG_L a5, SZREG(a1)
+ addi a1, a1, SZREG
+ sll a2, a5, t4
+ or a2, a2, a4
+ REG_S a2, 0(a0)
+ addi a0, a0, SZREG
+ bltu a0, t0, 1b

and Matteo ported to C:

+#pragma GCC unroll 8
+ for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
+ last = next;
+ next = s.ulong[1];
+
+ d.ulong[0] = last >> (distance * 8) |
+ next << ((bytes_long - distance) * 8);
+
+ d.ulong++;
+ s.ulong++;
+ }

I believe this is reasonable and enough to go upstream.

Akira


>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

2021-06-16 12:27:03

by Guo Ren

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

Hi Matteo,

Have you tried Glibc generic implementation code?
ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t

If Glibc codes have the same performance in your hardware, then you
could give a generic implementation first.

The current Linux generic implementation is so simple in lib/string.c:
#ifndef __HAVE_ARCH_MEMCPY
/**
 * memcpy - Copy one area of memory to another
 * @dest: Where to copy to
 * @src: Where to copy from
 * @count: The size of the area.
 *
 * You should not use this function to access IO space, use memcpy_toio()
 * or memcpy_fromio() instead.
 */
void *memcpy(void *dest, const void *src, size_t count)
{
        char *tmp = dest;
        const char *s = src;

        while (count--)
                *tmp++ = *s++;
        return dest;
}
EXPORT_SYMBOL(memcpy);
#endif

On Tue, Jun 15, 2021 at 10:42 AM Matteo Croce
<[email protected]> wrote:
>
> From: Matteo Croce <[email protected]>
>
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.
>
> ...


--
Best Regards
Guo Ren

ML: https://lore.kernel.org/linux-csky/

2021-06-17 03:14:06

by Matteo Croce

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Wed, Jun 16, 2021 at 10:24 AM David Laight <[email protected]> wrote:
>
> From: Matteo Croce
> > Sent: 16 June 2021 03:02
> ...
> > > > That's a good idea, but if you read the replies to Gary's original
> > > > patch
> > > > https://lore.kernel.org/linux-riscv/[email protected]/
> > > > .. both Gary, Palmer and David would rather like a C-based version.
> > > > This is one attempt at providing that.
> > >
> > > Yep, I prefer C as well :)
> > >
> > > But if you check commit 04091d6, the assembly version was introduced
> > > for KASAN. So if we are to change it back to C, please make sure KASAN
> > > is not broken.
> > >
> ...
> > Leaving out the first memcpy/set of every test which is always slower, (maybe
> > because of a cache miss?), the current implementation copies 260 Mb/s when
> > the low order bits match, and 114 otherwise.
> > Memset is stable at 278 Mb/s.
> >
> > Gary's implementation is much faster, copies still 260 Mb/s when equally placed,
> > and 230 Mb/s otherwise. Memset is the same as the current one.
>
> Any idea what the attainable performance is for the cpu you are using?
> Since both memset and memcpy are running at much the same speed
> I suspect it is all limited by the writes.
>
> 272MB/s is only 34M writes/sec.
> This seems horribly slow for a modern cpu.
> So is this actually really limited by the cache writes to physical memory?
>
> You might want to do some tests (userspace is fine) where you
> check much smaller lengths that definitely sit within the data cache.
>

I get similar results in userspace; this tool writes to RAM with a
variable data width:

root@beaglev:~/src# ./unalign_check 1 0 1
size: 1 Mb
write size: 8 bit
unalignment: 0 byte
elapsed time: 0.01 sec
throughput: 124.36 Mb/s

# ./unalign_check 1 0 8
size: 1 Mb
write size: 64 bit
unalignment: 0 byte
elapsed time: 0.00 sec
throughput: 252.12 Mb/s
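
(The tool itself isn't included in the thread; a rough userspace equivalent
of this kind of check, with made-up option handling, could look like:)

        /* usage: unalign_check <size_mb> <offset_bytes> <width_bytes: 1 or 8> */
        #include <stdio.h>
        #include <stdlib.h>
        #include <stdint.h>
        #include <string.h>
        #include <time.h>

        int main(int argc, char **argv)
        {
                size_t size = (size_t)(argc > 1 ? atoi(argv[1]) : 1) << 20;
                size_t offset = argc > 2 ? atoi(argv[2]) : 0;
                int width = argc > 3 ? atoi(argv[3]) : 8;
                uint8_t *buf = malloc(size + 16);
                struct timespec t0, t1;
                double secs;
                size_t i;

                memset(buf, 0, size + 16);      /* pre-fault the pages */

                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (width == 8) {
                        /* deliberately (mis)aligned 64 bit stores */
                        uint64_t *p = (uint64_t *)(buf + offset);
                        for (i = 0; i < size / 8; i++)
                                p[i] = i;
                } else {
                        for (i = 0; i < size; i++)
                                buf[offset + i] = (uint8_t)i;
                }
                clock_gettime(CLOCK_MONOTONIC, &t1);

                secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
                printf("throughput: %.2f Mb/s\n", size / secs / (1 << 20));
                return 0;
        }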

> It is also worth checking how much overhead there is for
> short copies - they are almost certainly more common than
> you might expect.
> This is one problem with excessive loop unrolling - the 'special
> cases' for the ends of the buffer start having a big effect
> on small copies.
>

I too believe that they are much more common than long ones.
Indeed, I wish to reduce the MIN_THRESHOLD value from 64 to 32 or even 16,
or have it depend on the word size, e.g. sizeof(long) * 2.
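
That could be as small as (sketch, not what's in the posted patch):

        /* below this, just do a byte at a time copy */
        #define MIN_THRESHOLD   (sizeof(unsigned long) * 2)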

Suggestions?

> For cpu that support misaligned memory accesses, one 'trick'
> for transfers longer than a 'word' is to do a (probably) misaligned
> transfer of the last word of the buffer first followed by the
> transfer of the rest of the buffer (overlapping a few bytes at the end).
> This saves on conditionals and temporary values.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

Regards,
--
per aspera ad upstream

2021-06-17 03:14:45

by Matteo Croce

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <[email protected]> wrote:
>
> Hi Matteo,
>
> Have you tried Glibc generic implementation code?
> ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
>
> If Glibc codes have the same performance in your hardware, then you
> could give a generic implementation first.
>

Hi,

I had a look, it seems that it's a C unrolled version with the
'register' keyword.
The same one was already merged in nios2:
https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68

I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
result of the other versions:

[ 563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s

Regards,
--
per aspera ad upstream

2021-06-17 21:42:24

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 1/3] riscv: optimized memcpy

From: Matteo Croce
> Sent: 16 June 2021 19:52
> To: Guo Ren <[email protected]>
>
> On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <[email protected]> wrote:
> >
> > Hi Matteo,
> >
> > Have you tried Glibc generic implementation code?
> > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> >
> > If Glibc codes have the same performance in your hardware, then you
> > could give a generic implementation first.

Isn't that a byte copy loop - the performance of that ought to be terrible.
...

> I had a look, it seems that it's a C unrolled version with the
> 'register' keyword.
> The same one was already merged in nios2:
> https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68

I know a lot about the nios2 instruction timings.
(I've looked at code execution in the fpga's Intel 'logic analyser'.)
It is a very simple 4-clock pipeline cpu with a 2-clock delay
before a value read from 'tightly coupled memory' (aka cache)
can be used in another instruction.
There is also a subtle pipeline stall if a read follows a write
to the same memory block because the write is executed one
clock later - and would collide with the read.
Since it only ever executes one instruction per clock loop
unrolling does help - since you never get the loop control 'for free'.
OTOH you don't need to use that many registers.
But an unrolled loop should approach 2 bytes/clock (32bit cpu).

> I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> result of the other versions:
>
> [ 563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s

What clock speed is that running at?
It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).

While the small riscv cpu might be similar to the nios2 (and mips
for that matter), there are also bigger/faster cpu.
I'm sure these can execute multiple instructions/clock
and possibly even read and write at the same time.
Unless they also support significant instruction re-ordering
the trivial copy loops are going to be slow on such cpu.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-06-17 21:51:12

by Matteo Croce

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Thu, Jun 17, 2021 at 11:30 PM David Laight <[email protected]> wrote:
>
> From: Matteo Croce
> > Sent: 16 June 2021 19:52
> > To: Guo Ren <[email protected]>
> >
> > On Wed, Jun 16, 2021 at 1:46 PM Guo Ren <[email protected]> wrote:
> > >
> > > Hi Matteo,
> > >
> > > Have you tried Glibc generic implementation code?
> > > ref: https://lore.kernel.org/linux-arch/20190629053641.3iBfk9-
> > I_D29cDp9yJnIdIg7oMtHNZlDmhLQPTumhEc@z/#t
> > >
> > > If Glibc codes have the same performance in your hardware, then you
> > > could give a generic implementation first.
>
> Isn't that a byte copy loop - the performance of that ought to be terrible.
> ...
>
> > I had a look, it seems that it's a C unrolled version with the
> > 'register' keyword.
> > The same one was already merged in nios2:
> > https://elixir.bootlin.com/linux/latest/source/arch/nios2/lib/memcpy.c#L68
>
> I know a lot about the nios2 instruction timings.
> (I've looked at code execution in the fpga's Intel 'logic analyser'.)
> It is a very simple 4-clock pipeline cpu with a 2-clock delay
> before a value read from 'tightly coupled memory' (aka cache)
> can be used in another instruction.
> There is also a subtle pipeline stall if a read follows a write
> to the same memory block because the write is executed one
> clock later - and would collide with the read.
> Since it only ever executes one instruction per clock loop
> unrolling does help - since you never get the loop control 'for free'.
> OTOH you don't need to use that many registers.
> But an unrolled loop should approach 2 bytes/clock (32bit cpu).
>
> > I copied _wordcopy_fwd_aligned() from Glibc, and I have a very similar
> > result of the other versions:
> >
> > [ 563.359126] Strings selftest: memcpy(src+7, dst+7): 257 Mb/s
>
> What clock speed is that running at?
> It seems very slow for a 64bit cpu (that isn't an fpga soft-cpu).
>
> While the small riscv cpu might be similar to the nios2 (and mips
> for that matter), there are also bigger/faster cpu.
> I'm sure these can execute multiple instructions/clock
> and possibly even read and write at the same time.
> Unless they also support significant instruction re-ordering
> the trivial copy loops are going to be slow on such cpu.
>

It's running at 1 GHz.

I get 257 Mb/s with a memcpy, a bit more with a memset,
but I get 1200 Mb/s with a loop which just reads memory with 64 bit addressing.

--
per aspera ad upstream

2021-06-18 03:09:49

by Matteo Croce

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Thu, Jun 17, 2021 at 11:48 PM Matteo Croce
<[email protected]> wrote:
>
> On Thu, Jun 17, 2021 at 11:30 PM David Laight <[email protected]> wrote:
> >
> > ...
> It's running at 1 GHz.
>
> I get 257 Mb/s with a memcpy, a bit more with a memset,
> but I get 1200 Mb/s with a loop which just reads memory with 64 bit addressing.
>

Err, I forgot a mlock() before accessing the memory in userspace.

The real speed here is:

8 bit read: 155.42 Mb/s
64 bit read: 277.29 Mb/s
8 bit write: 138.57 Mb/s
64 bit write: 239.21 Mb/s

--
per aspera ad upstream

2021-06-18 03:21:37

by Matteo Croce

[permalink] [raw]
Subject: Re: [PATCH 1/3] riscv: optimized memcpy

On Fri, Jun 18, 2021 at 2:32 AM Matteo Croce <[email protected]> wrote:
>
> On Thu, Jun 17, 2021 at 11:48 PM Matteo Croce
> <[email protected]> wrote:
> >
> > On Thu, Jun 17, 2021 at 11:30 PM David Laight <[email protected]> wrote:
> > >
> > > ...
> >
> > It's running at 1 GHz.
> >
> > I get 257 Mb/s with a memcpy, a bit more with a memset,
> > but I get 1200 Mb/s with a loop which just reads memory with 64 bit addressing.
> >
>
> Err, I forgot a mlock() before accessing the memory in userspace.
>
> The real speed here is:
>
> 8 bit read: 155.42 Mb/s
> 64 bit read: 277.29 Mb/s
> 8 bit write: 138.57 Mb/s
> 64 bit write: 239.21 Mb/s
>

Anyway, thanks for the info on nios2 timings.
If you think that an unrolled loop would help, we can achieve the same in C.
I think we could code something similar to a Duff's device (or with jump
labels) to unroll the loop but at the same time do efficient small copies.
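
Just to sketch the idea (illustrative only, not something I've tested; it
assumes both pointers are already long-aligned):

        #include <stddef.h>

        /* Duff's-device style unrolled word copy: the switch jumps into the
         * middle of the unrolled body, so short copies don't pay for a full
         * 8-word pass.
         */
        static void copy_words_duff(unsigned long *d, const unsigned long *s,
                                    size_t words)
        {
                size_t n = (words + 7) / 8;     /* passes through the body */

                if (!words)
                        return;

                switch (words % 8) {
                case 0: do {    *d++ = *s++;
                case 7:         *d++ = *s++;
                case 6:         *d++ = *s++;
                case 5:         *d++ = *s++;
                case 4:         *d++ = *s++;
                case 3:         *d++ = *s++;
                case 2:         *d++ = *s++;
                case 1:         *d++ = *s++;
                        } while (--n > 0);
                }
        }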

Regards,

--
per aspera ad upstream

2021-06-18 09:26:05

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 1/3] riscv: optimized memcpy

From: Matteo Croce
> Sent: 18 June 2021 02:05
...
> > > It's running at 1 GHz.
> > >
> > > I get 257 Mb/s with a memcpy, a bit more with a memset,
> > > but I get 1200 Mb/s with a loop which just reads memory with 64 bit addressing.
> > >
> >
> > Err, I forgot a mlock() before accessing the memory in userspace.

What is the mlock() for?
The data for a quick loop won't get paged out.
You want to test cache to cache copies, so the first loop
will always be slow.
After that each iteration should be much the same.
I use code like:
        for (;;) {
                start = read_tsc();
                do_test();
                histogram[(read_tsc() - start) >> n]++;
        }
(You need to exclude outliers)
to get a distribution for the execution times.
Tends to be pretty stable - even though different program
runs can give different values!

> > The real speed here is:
> >
> > 8 bit read: 155.42 Mb/s
> > 64 bit read: 277.29 Mb/s
> > 8 bit write: 138.57 Mb/s
> > 64 bit write: 239.21 Mb/s
> >
>
> Anyway, thanks for the info on nios2 timings.
> If you think that an unrolled loop would help, we can achieve the same in C.
> I think we could code something similar to a Duff's device (or with jump
> labels) to unroll the loop but at the same time do efficient small copies.

Unrolling has to be done with care.
It tends to improve benchmarks, but the extra code displaces
other code from the i-cache and slows down overall performance.
So you need 'just enough' unrolling to avoid cpu stalls.

On your system it looks like the memory/cache subsystem
is the bottleneck for the tests you are doing.
I'd really expect a 1GHz cpu to be able to read/write from
its data cache every clock.
So I'd expect transfer rates nearer 8000 MB/s, not 250 MB/s.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)