2021-07-02 12:33:39

by Matteo Croce

Subject: [PATCH v2 0/3] lib/string: optimized mem* functions

From: Matteo Croce <[email protected]>

Rewrite the generic mem{cpy,move,set} so that memory is accessed with
the widest size possible, but without doing unaligned accesses.

This was originally posted as C string functions for RISC-V[1], but as
there was no RISC-V specific code, it was proposed for the generic
lib/string.c implementation.

Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
and HAVE_EFFICIENT_UNALIGNED_ACCESS.

These are the performance figures for memcpy() and memset() on a RISC-V
machine, measured on a 32 MiB buffer:

memcpy:
original aligned: 75 Mb/s
original unaligned: 75 Mb/s
new aligned: 114 Mb/s
new unaligned: 107 Mb/s

memset:
original aligned: 140 Mb/s
original unaligned: 140 Mb/s
new aligned: 241 Mb/s
new unaligned: 241 Mb/s
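
For reference, throughput numbers like these can be collected with a
simple timing loop. Below is a minimal userspace sketch of such a
measurement (an illustration only, not the harness actually used for the
figures above); it calls the routine through a volatile function pointer
so the compiler doesn't substitute a builtin:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (32UL << 20)	/* 32 MiB, as in the figures above */
#define LOOPS	 64

int main(void)
{
	/* volatile pointer keeps the compiler from inlining a builtin memcpy */
	void *(*volatile do_copy)(void *, const void *, size_t) = memcpy;
	char *src = malloc(BUF_SIZE + 1);
	char *dst = malloc(BUF_SIZE + 1);
	struct timespec a, b;
	double secs;

	if (!src || !dst)
		return 1;
	memset(src, 0x5a, BUF_SIZE + 1);

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (int i = 0; i < LOOPS; i++)
		do_copy(dst, src + 1, BUF_SIZE);	/* src + 1: unaligned case */
	clock_gettime(CLOCK_MONOTONIC, &b);

	secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
	printf("%.1f MB/s\n", LOOPS * (BUF_SIZE >> 20) / secs);
	return 0;
}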

The size increase is negligible:

$ scripts/bloat-o-meter vmlinux.orig vmlinux
add/remove: 0/0 grow/shrink: 4/1 up/down: 427/-6 (421)
Function old new delta
memcpy 29 351 +322
memset 29 117 +88
strlcat 68 78 +10
strlcpy 50 57 +7
memmove 56 50 -6
Total: Before=8556964, After=8557385, chg +0.00%

These functions will be used for RISC-V initially.

[1] https://lore.kernel.org/linux-riscv/[email protected]/

Matteo Croce (3):
lib/string: optimized memcpy
lib/string: optimized memmove
lib/string: optimized memset

lib/string.c | 130 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 113 insertions(+), 17 deletions(-)

--
2.31.1


2021-07-02 12:33:49

by Matteo Croce

Subject: [PATCH v2 1/3] lib/string: optimized memcpy

From: Matteo Croce <[email protected]>

Rewrite the generic memcpy() to copy a word at a time, without generating
unaligned accesses.

The procedure is made of three steps:
First, copy data one byte at a time until the destination buffer is aligned
to a long boundary.
Then copy the data one long at a time, shifting the current and the next
long to compose a long at every cycle.
Finally, copy the remainder one byte at a time.
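
As an aside, the merging step can be illustrated with a tiny standalone
program (an illustration only, not part of the patch), assuming a 64-bit
little-endian host and a source that sits 3 bytes past a long boundary:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
	uint8_t src[16];
	uint64_t last, next, merged, expect;
	int distance = 3;	/* source is 3 bytes past an 8-byte boundary */

	for (int i = 0; i < 16; i++)
		src[i] = i;

	memcpy(&last, src, 8);			/* aligned word, bytes 0..7 */
	memcpy(&next, src + 8, 8);		/* aligned word, bytes 8..15 */
	memcpy(&expect, src + distance, 8);	/* the unaligned word we want */

	/* little-endian merge of the two aligned words into one */
	merged = (last >> (distance * 8)) | (next << ((8 - distance) * 8));

	printf("merge ok: %d\n", merged == expect);	/* prints "merge ok: 1" */
	return 0;
}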

This is the improvement on RISC-V:

original aligned: 75 Mb/s
original unaligned: 75 Mb/s
new aligned: 114 Mb/s
new unaligned: 107 Mb/s

and this is the binary size increase according to bloat-o-meter:

Function old new delta
memcpy 36 324 +288


Signed-off-by: Matteo Croce <[email protected]>
---
lib/string.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 77 insertions(+), 3 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 546d59711a12..caeef4264c43 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -33,6 +33,23 @@
#include <asm/word-at-a-time.h>
#include <asm/page.h>

+#define BYTES_LONG sizeof(long)
+#define WORD_MASK (BYTES_LONG - 1)
+#define MIN_THRESHOLD (BYTES_LONG * 2)
+
+/* convenience union to avoid cast between different pointer types */
+union types {
+ u8 *as_u8;
+ unsigned long *as_ulong;
+ uintptr_t as_uptr;
+};
+
+union const_types {
+ const u8 *as_u8;
+ const unsigned long *as_ulong;
+ uintptr_t as_uptr;
+};
+
#ifndef __HAVE_ARCH_STRNCASECMP
/**
* strncasecmp - Case insensitive, length-limited string comparison
@@ -869,6 +886,13 @@ EXPORT_SYMBOL(memset64);
#endif

#ifndef __HAVE_ARCH_MEMCPY
+
+#ifdef __BIG_ENDIAN
+#define MERGE_UL(h, l, d) ((h) << ((d) * 8) | (l) >> ((BYTES_LONG - (d)) * 8))
+#else
+#define MERGE_UL(h, l, d) ((h) >> ((d) * 8) | (l) << ((BYTES_LONG - (d)) * 8))
+#endif
+
/**
* memcpy - Copy one area of memory to another
* @dest: Where to copy to
@@ -880,14 +904,64 @@ EXPORT_SYMBOL(memset64);
*/
void *memcpy(void *dest, const void *src, size_t count)
{
- char *tmp = dest;
- const char *s = src;
+ union const_types s = { .as_u8 = src };
+ union types d = { .as_u8 = dest };
+ int distance = 0;
+
+ if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+ if (count < MIN_THRESHOLD)
+ goto copy_remainder;
+
+ /* Copy a byte at time until destination is aligned. */
+ for (; d.as_uptr & WORD_MASK; count--)
+ *d.as_u8++ = *s.as_u8++;
+
+ distance = s.as_uptr & WORD_MASK;
+ }
+
+ if (distance) {
+ unsigned long last, next;

+ /*
+ * s is distance bytes ahead of d, and d just reached
+ * the alignment boundary. Move s backward to word align it
+ * and shift data to compensate for distance, in order to do
+ * word-by-word copy.
+ */
+ s.as_u8 -= distance;
+
+ next = s.as_ulong[0];
+ for (; count >= BYTES_LONG; count -= BYTES_LONG) {
+ last = next;
+ next = s.as_ulong[1];
+
+ d.as_ulong[0] = MERGE_UL(last, next, distance);
+
+ d.as_ulong++;
+ s.as_ulong++;
+ }
+
+ /* Restore s with the original offset. */
+ s.as_u8 += distance;
+ } else {
+ /*
+ * If the source and dest lower bits are the same, do a simple
+ * 32/64 bit wide copy.
+ */
+ for (; count >= BYTES_LONG; count -= BYTES_LONG)
+ *d.as_ulong++ = *s.as_ulong++;
+ }
+
+copy_remainder:
while (count--)
- *tmp++ = *s++;
+ *d.as_u8++ = *s.as_u8++;
+
return dest;
}
EXPORT_SYMBOL(memcpy);
+
+#undef MERGE_UL
+
#endif

#ifndef __HAVE_ARCH_MEMMOVE
--
2.31.1

2021-07-02 12:34:00

by Matteo Croce

Subject: [PATCH v2 2/3] lib/string: optimized memmove

From: Matteo Croce <[email protected]>

When the destination buffer is before the source one, or when the
buffers don't overlap, it's safe to use memcpy() instead, which is
optimized to use the widest access size possible.

This "optimization" only covers a common case. In the future, proper code
which does the same thing as memcpy() does, but backwards, can be added.
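
To make the remaining case concrete, here is a tiny usage example
(illustrative only, not part of the patch): only an overlapping move
with the destination above the source still takes the backward
byte-by-byte path, since a forward copy would overwrite source bytes
before reading them.

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[8] = "abcdef";

	/* dest > src and the regions overlap: a backward copy is required */
	memmove(buf + 2, buf, 4);

	printf("%s\n", buf);	/* prints "ababcd" */
	return 0;
}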

Signed-off-by: Matteo Croce <[email protected]>
---
lib/string.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index caeef4264c43..108b83c34cec 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -975,19 +975,13 @@ EXPORT_SYMBOL(memcpy);
*/
void *memmove(void *dest, const void *src, size_t count)
{
- char *tmp;
- const char *s;
+ if (dest < src || src + count <= dest)
+ return memcpy(dest, src, count);
+
+ if (dest > src) {
+ const char *s = src + count;
+ char *tmp = dest + count;

- if (dest <= src) {
- tmp = dest;
- s = src;
- while (count--)
- *tmp++ = *s++;
- } else {
- tmp = dest;
- tmp += count;
- s = src;
- s += count;
while (count--)
*--tmp = *--s;
}
--
2.31.1

2021-07-02 12:34:09

by Matteo Croce

Subject: [PATCH v2 3/3] lib/string: optimized memset

From: Matteo Croce <[email protected]>

The generic memset is defined as a byte-at-a-time write. This is always
safe, but it's slower than a 4 byte or even 8 byte write.

Write a generic memset which fills the data one byte at a time until the
destination is aligned, then fills using the largest size allowed,
and finally fills the remaining data one byte at a time.
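
The word-sized fill relies on spreading the byte value across a long;
a minimal standalone sketch of that step (illustration only, not part of
the patch) for c = 0xab:

#include <stdio.h>

int main(void)
{
	unsigned long cu = 0xab;

	cu |= cu << 8;
	cu |= cu << 16;
	/* the extra shift is split in two to avoid a warning on 32 bit */
	cu |= (cu << 16) << 16;

	/* prints abababababababab on 64 bit, abababab on 32 bit */
	printf("%lx\n", cu);
	return 0;
}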

On a RISC-V machine the speed goes from 140 Mb/s to 241 Mb/s,
and this is the binary size increase according to bloat-o-meter:

Function old new delta
memset 32 148 +116

Signed-off-by: Matteo Croce <[email protected]>
---
lib/string.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 108b83c34cec..264821f0e795 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -810,10 +810,38 @@ EXPORT_SYMBOL(__sysfs_match_string);
*/
void *memset(void *s, int c, size_t count)
{
- char *xs = s;
+ union types dest = { .as_u8 = s };

+ if (count >= MIN_THRESHOLD) {
+ unsigned long cu = (unsigned long)c;
+
+ /* Compose an ulong with 'c' repeated 4/8 times */
+#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
+ cu *= 0x0101010101010101UL;
+#else
+ cu |= cu << 8;
+ cu |= cu << 16;
+ /* Suppress warning on 32 bit machines */
+ cu |= (cu << 16) << 16;
+#endif
+ if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+ /*
+ * Fill the buffer one byte at time until
+ * the destination is word aligned.
+ */
+ for (; count && dest.as_uptr & WORD_MASK; count--)
+ *dest.as_u8++ = c;
+ }
+
+ /* Copy using the largest size allowed */
+ for (; count >= BYTES_LONG; count -= BYTES_LONG)
+ *dest.as_ulong++ = cu;
+ }
+
+ /* copy the remainder */
while (count--)
- *xs++ = c;
+ *dest.as_u8++ = c;
+
return s;
}
EXPORT_SYMBOL(memset);
--
2.31.1

2021-07-02 15:17:15

by Ben Dooks

Subject: Re: [PATCH v2 1/3] lib/string: optimized memcpy

On 02/07/2021 13:31, Matteo Croce wrote:
> From: Matteo Croce <[email protected]>
>
> Rewrite the generic memcpy() to copy a word at time, without generating
> unaligned accesses.
>
> The procedure is made of three steps:
> First copy data one byte at time until the destination buffer is aligned
> to a long boundary.
> Then copy the data one long at time shifting the current and the next long
> to compose a long at every cycle.
> Finally, copy the remainder one byte at time.
>
> This is the improvement on RISC-V:
>
> original aligned: 75 Mb/s
> original unaligned: 75 Mb/s
> new aligned: 114 Mb/s
> new unaligned: 107 Mb/s
>
> and this the binary size increase according to bloat-o-meter:
>
> Function old new delta
> memcpy 36 324 +288
>
>
> Signed-off-by: Matteo Croce <[email protected]>
> ---
> lib/string.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 77 insertions(+), 3 deletions(-)

Doesn't arch/riscv/lib/memcpy.S also exist as an architecture-optimised
version? I would have thought the lib/string.c version
was not being used?


--
Ben Dooks http://www.codethink.co.uk/
Senior Engineer Codethink - Providing Genius


2021-07-02 15:30:08

by Matteo Croce

Subject: Re: [PATCH v2 1/3] lib/string: optimized memcpy

On Fri, Jul 2, 2021 at 4:37 PM Ben Dooks <[email protected]> wrote:
>
> On 02/07/2021 13:31, Matteo Croce wrote:
> > From: Matteo Croce <[email protected]>
> >
> > Rewrite the generic memcpy() to copy a word at time, without generating
> > unaligned accesses.
> >
> > The procedure is made of three steps:
> > First copy data one byte at time until the destination buffer is aligned
> > to a long boundary.
> > Then copy the data one long at time shifting the current and the next long
> > to compose a long at every cycle.
> > Finally, copy the remainder one byte at time.
> >
> > This is the improvement on RISC-V:
> >
> > original aligned: 75 Mb/s
> > original unaligned: 75 Mb/s
> > new aligned: 114 Mb/s
> > new unaligned: 107 Mb/s
> >
> > and this the binary size increase according to bloat-o-meter:
> >
> > Function old new delta
> > memcpy 36 324 +288
> >
> >
> > Signed-off-by: Matteo Croce <[email protected]>
> > ---
> > lib/string.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 77 insertions(+), 3 deletions(-)
>
> Doesn't arch/riscv/lib/memcpy.S also exist for an architecture
> optimised version? I would have thought the lib/string.c version
> was not being used?
>
>

Yes, but this series started as a C replacement for the assembly one,
which generates unaligned accesses.
Unfortunately, existing RISC-V processors can't handle unaligned
accesses, so they are emulated with a terrible slowdown.
Then, since there wasn't any RISC-V specific code, it was proposed as
generic code:

Discussion: https://lore.kernel.org/linux-riscv/[email protected]/

--
per aspera ad upstream

2021-07-10 21:33:55

by Andrew Morton

Subject: Re: [PATCH v2 0/3] lib/string: optimized mem* functions

On Fri, 2 Jul 2021 14:31:50 +0200 Matteo Croce <[email protected]> wrote:

> From: Matteo Croce <[email protected]>
>
> Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> the widest size possible, but without doing unaligned accesses.
>
> This was originally posted as C string functions for RISC-V[1], but as
> there was no specific RISC-V code, it was proposed for the generic
> lib/string.c implementation.
>
> Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> and HAVE_EFFICIENT_UNALIGNED_ACCESS.
>
> These are the performances of memcpy() and memset() of a RISC-V machine
> on a 32 mbyte buffer:
>
> memcpy:
> original aligned: 75 Mb/s
> original unaligned: 75 Mb/s
> new aligned: 114 Mb/s
> new unaligned: 107 Mb/s
>
> memset:
> original aligned: 140 Mb/s
> original unaligned: 140 Mb/s
> new aligned: 241 Mb/s
> new unaligned: 241 Mb/s

Did you record the x86_64 performance?


Which other architectures are affected by this change?

2021-07-10 23:09:49

by Matteo Croce

Subject: Re: [PATCH v2 0/3] lib/string: optimized mem* functions

On Sat, Jul 10, 2021 at 11:31 PM Andrew Morton
<[email protected]> wrote:
>
> On Fri, 2 Jul 2021 14:31:50 +0200 Matteo Croce <[email protected]> wrote:
>
> > From: Matteo Croce <[email protected]>
> >
> > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > the widest size possible, but without doing unaligned accesses.
> >
> > This was originally posted as C string functions for RISC-V[1], but as
> > there was no specific RISC-V code, it was proposed for the generic
> > lib/string.c implementation.
> >
> > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> >
> > These are the performances of memcpy() and memset() of a RISC-V machine
> > on a 32 mbyte buffer:
> >
> > memcpy:
> > original aligned: 75 Mb/s
> > original unaligned: 75 Mb/s
> > new aligned: 114 Mb/s
> > new unaligned: 107 Mb/s
> >
> > memset:
> > original aligned: 140 Mb/s
> > original unaligned: 140 Mb/s
> > new aligned: 241 Mb/s
> > new unaligned: 241 Mb/s
>
> Did you record the x86_64 performance?
>
>
> Which other architectures are affected by this change?

x86_64 won't use these functions because it defines __HAVE_ARCH_MEMCPY
and has optimized implementations in arch/x86/lib.
Anyway, I was curious and tested them on x86_64 too; there was zero
gain over the generic ones.

The only architecture which will use all three functions will be
riscv, while memmove() will be used by arc, h8300, hexagon, ia64,
openrisc and parisc.
Keep in mind that memmove() isn't anything special: it just calls
memcpy() when possible (e.g. buffers not overlapping), and falls back
to the byte-by-byte copy otherwise.

In the future we can write two functions, one which copies forward and
another which copies backward, and call the right one depending on the
buffers' positions, roughly as sketched below.
Then, we could alias memcpy() and memmove(), as proposed by Linus:

https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132
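
A rough sketch of that future shape (hypothetical code, nothing like
this is in the series yet; the byte loops stand in for the optimized
forward/backward word-at-a-time copies):

#include <stddef.h>

static void *my_memmove(void *dest, const void *src, size_t count)
{
	char *d = dest;
	const char *s = src;

	if (d < s || s + count <= d) {
		/* no overlap, or dest below src: copy forward */
		while (count--)
			*d++ = *s++;
	} else {
		/* dest overlaps the tail of src: copy backward */
		d += count;
		s += count;
		while (count--)
			*--d = *--s;
	}

	return dest;
}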

Regards,
--
per aspera ad upstream

2021-07-12 09:06:19

by David Laight

Subject: RE: [PATCH v2 0/3] lib/string: optimized mem* functions

From: Matteo Croce
> Sent: 11 July 2021 00:08
>
> On Sat, Jul 10, 2021 at 11:31 PM Andrew Morton
> <[email protected]> wrote:
> >
> > On Fri, 2 Jul 2021 14:31:50 +0200 Matteo Croce <[email protected]> wrote:
> >
> > > From: Matteo Croce <[email protected]>
> > >
> > > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > > the widest size possible, but without doing unaligned accesses.
> > >
> > > This was originally posted as C string functions for RISC-V[1], but as
> > > there was no specific RISC-V code, it was proposed for the generic
> > > lib/string.c implementation.
> > >
> > > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> > >
> > > These are the performances of memcpy() and memset() of a RISC-V machine
> > > on a 32 mbyte buffer:
> > >
> > > memcpy:
> > > original aligned: 75 Mb/s
> > > original unaligned: 75 Mb/s
> > > new aligned: 114 Mb/s
> > > new unaligned: 107 Mb/s
> > >
> > > memset:
> > > original aligned: 140 Mb/s
> > > original unaligned: 140 Mb/s
> > > new aligned: 241 Mb/s
> > > new unaligned: 241 Mb/s
> >
> > Did you record the x86_64 performance?
> >
> >
> > Which other architectures are affected by this change?
>
> x86_64 won't use these functions because it defines __HAVE_ARCH_MEMCPY
> and has optimized implementations in arch/x86/lib.
> Anyway, I was curious and I tested them on x86_64 too, there was zero
> gain over the generic ones.

x86 performance (and attainable performance) does depend on the cpu
micro-architecture.

Any recent 'desktop' intel cpu will almost certainly manage to
re-order the execution of almost any copy loop and attain 1 write per clock.
(Even the trivial 'while (count--) *dest++ = *src++;' loop.)

The same isn't true of the Atom-based cpus that may be in small servers.
These are no slouches (e.g. 4 cores at 2.4GHz) but only have limited
out-of-order execution and so are much more sensitive to instruction
ordering.

David
