2010-09-02 05:46:18

by Miao Xie

Subject: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

Changes from V1 to V2:
- change the license from LGPL version 2.1 to GPL version 2

The kernel's memcpy and memmove are very inefficient, while the glibc versions are
quite fast; in some cases they are 10 times faster than the kernel versions. So I
introduce some memory copy macros and functions from glibc to improve the
performance of the kernel versions.

The strategy of the memory functions is (a simplified sketch follows this list):
1. Copy bytes until the destination pointer is aligned.
2. Copy words in unrolled loops. If the source and destination are not aligned
in the same way, use word memory operations, but shift and merge two read
words before writing.
3. Copy the few remaining bytes.
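
A simplified sketch of this strategy for the forward direction (illustration
only; the real code below unrolls the word loop and uses the MERGE() macro to
shift and merge two source words when the source is aligned differently from
the destination):

#include <stddef.h>

static void copy_fwd_sketch(char *dst, const char *src, size_t n)
{
	typedef unsigned long word;

	/* 1. Copy bytes until the destination pointer is word aligned. */
	while (n > 0 && (unsigned long)dst % sizeof(word) != 0) {
		*dst++ = *src++;
		n--;
	}

	/*
	 * 2. Copy whole words. This sketch only takes the word path when
	 * the source is now aligned too; the real code instead reads two
	 * source words and merges them for the misaligned case.
	 */
	if ((unsigned long)src % sizeof(word) == 0) {
		while (n >= sizeof(word)) {
			*(word *)dst = *(const word *)src;
			dst += sizeof(word);
			src += sizeof(word);
			n -= sizeof(word);
		}
	}

	/* 3. Copy the few remaining bytes. */
	while (n > 0) {
		*dst++ = *src++;
		n--;
	}
}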

Signed-off-by: Miao Xie <[email protected]>
---
include/linux/memcopy.h | 226 ++++++++++++++++++++++++++
lib/Makefile | 3 +-
lib/memcopy.c | 403 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 631 insertions(+), 1 deletions(-)
create mode 100644 include/linux/memcopy.h
create mode 100644 lib/memcopy.c

diff --git a/include/linux/memcopy.h b/include/linux/memcopy.h
new file mode 100644
index 0000000..9c65ac8
--- /dev/null
+++ b/include/linux/memcopy.h
@@ -0,0 +1,226 @@
+/*
+ * memcopy.h -- definitions for memory copy functions. Generic C version.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
+ * Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * The code is derived from the GNU C Library.
+ * Copyright (C) 1991, 1992, 1993, 1997, 2004 Free Software Foundation, Inc.
+ */
+#ifndef _LINUX_MEMCOPY_H_
+#define _LINUX_MEMCOPY_H_
+
+/*
+ * The strategy of the memory functions is:
+ *
+ * 1. Copy bytes until the destination pointer is aligned.
+ *
+ * 2. Copy words in unrolled loops. If the source and destination
+ * are not aligned in the same way, use word memory operations,
+ * but shift and merge two read words before writing.
+ *
+ * 3. Copy the few remaining bytes.
+ *
+ * This is fast on processors that have at least 10 registers for
+ * allocation by GCC, and that can access memory at reg+const in one
+ * instruction.
+ */
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <asm/byteorder.h>
+
+/*
+ * The macros defined in this file are:
+ *
+ * BYTE_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy)
+ *
+ * BYTE_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy)
+ *
+ * WORD_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_remaining, nbytes_to_copy)
+ *
+ * WORD_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_remaining, nbytes_to_copy)
+ *
+ * MERGE(old_word, sh_1, new_word, sh_2)
+ *
+ * MEM_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy)
+ *
+ * MEM_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy)
+ */
+
+#define OP_T_THRESHOLD 16
+
+/*
+ * Type to use for aligned memory operations.
+ * This should normally be the biggest type supported by a single load
+ * and store.
+ */
+#define op_t unsigned long int
+#define OPSIZ (sizeof(op_t))
+
+/* Type to use for unaligned operations. */
+typedef unsigned char byte;
+
+#ifndef MERGE
+# ifdef __LITTLE_ENDIAN
+# define MERGE(w0, sh_1, w1, sh_2) (((w0) >> (sh_1)) | ((w1) << (sh_2)))
+# elif defined(__BIG_ENDIAN)
+# define MERGE(w0, sh_1, w1, sh_2) (((w0) << (sh_1)) | ((w1) >> (sh_2)))
+# else
+# error "Macro MERGE() is not defined!"
+# endif
+#endif
+
+/*
+ * Copy exactly NBYTES bytes from SRC_BP to DST_BP,
+ * without any assumptions about alignment of the pointers.
+ */
+#ifndef BYTE_COPY_FWD
+#define BYTE_COPY_FWD(dst_bp, src_bp, nbytes) \
+do { \
+ size_t __nbytes = (nbytes); \
+ while (__nbytes > 0) { \
+ byte __x = ((byte *) src_bp)[0]; \
+ src_bp += 1; \
+ __nbytes -= 1; \
+ ((byte *) dst_bp)[0] = __x; \
+ dst_bp += 1; \
+ } \
+} while (0)
+#endif
+
+/*
+ * Copy exactly NBYTES_TO_COPY bytes from SRC_END_PTR to DST_END_PTR,
+ * beginning at the bytes right before the pointers and continuing towards
+ * smaller addresses. Don't assume anything about alignment of the
+ * pointers.
+ */
+#ifndef BYTE_COPY_BWD
+#define BYTE_COPY_BWD(dst_ep, src_ep, nbytes) \
+do { \
+ size_t __nbytes = (nbytes); \
+ while (__nbytes > 0) { \
+ byte __x; \
+ src_ep -= 1; \
+ __x = ((byte *) src_ep)[0]; \
+ dst_ep -= 1; \
+ __nbytes -= 1; \
+ ((byte *) dst_ep)[0] = __x; \
+ } \
+} while (0)
+#endif
+/*
+ * Copy *up to* NBYTES bytes from SRC_BP to DST_BP, with
+ * the assumption that DST_BP is aligned on an OPSIZ multiple. If
+ * not all bytes could be easily copied, store remaining number of bytes
+ * in NBYTES_LEFT, otherwise store 0.
+ */
+extern void _wordcopy_fwd_aligned(long int, long int, size_t);
+extern void _wordcopy_fwd_dest_aligned(long int, long int, size_t);
+#ifndef WORD_COPY_FWD
+#define WORD_COPY_FWD(dst_bp, src_bp, nbytes_left, nbytes) \
+do { \
+ if (src_bp % OPSIZ == 0) \
+ _wordcopy_fwd_aligned (dst_bp, src_bp, (nbytes) / OPSIZ); \
+ else \
+ _wordcopy_fwd_dest_aligned (dst_bp, src_bp, (nbytes) / OPSIZ);\
+ \
+ src_bp += (nbytes) & -OPSIZ; \
+ dst_bp += (nbytes) & -OPSIZ; \
+ (nbytes_left) = (nbytes) % OPSIZ; \
+} while (0)
+#endif
+
+/*
+ * Copy *up to* NBYTES_TO_COPY bytes from SRC_END_PTR to DST_END_PTR,
+ * beginning at the words (of type op_t) right before the pointers and
+ * continuing towards smaller addresses. May take advantage of that
+ * DST_END_PTR is aligned on an OPSIZ multiple. If not all bytes could be
+ * easily copied, store remaining number of bytes in NBYTES_REMAINING,
+ * otherwise store 0.
+ */
+extern void _wordcopy_bwd_aligned(long int, long int, size_t);
+extern void _wordcopy_bwd_dest_aligned(long int, long int, size_t);
+#ifndef WORD_COPY_BWD
+#define WORD_COPY_BWD(dst_ep, src_ep, nbytes_left, nbytes) \
+do { \
+ if (src_ep % OPSIZ == 0) \
+ _wordcopy_bwd_aligned (dst_ep, src_ep, (nbytes) / OPSIZ); \
+ else \
+ _wordcopy_bwd_dest_aligned (dst_ep, src_ep, (nbytes) / OPSIZ);\
+ \
+ src_ep -= (nbytes) & -OPSIZ; \
+ dst_ep -= (nbytes) & -OPSIZ; \
+ (nbytes_left) = (nbytes) % OPSIZ; \
+} while (0)
+#endif
+
+/* Copy memory from the beginning to the end */
+#ifndef MEM_COPY_FWD
+static __always_inline void mem_copy_fwd(unsigned long dstp,
+ unsigned long srcp,
+ size_t count)
+{
+ /* If there are enough bytes to copy, use word copy. */
+ if (count >= OP_T_THRESHOLD) {
+ /* Copy just a few bytes to make dstp aligned. */
+ count -= (-dstp) % OPSIZ;
+ BYTE_COPY_FWD(dstp, srcp, (-dstp) % OPSIZ);
+
+ /*
+ * Copy from srcp to dstp taking advantage of the known
+ * alignment of dstp. The number of bytes remaining is put in
+ * the third argument.
+ */
+ WORD_COPY_FWD(dstp, srcp, count, count);
+
+ /* Fall out and copy the tail. */
+ }
+
+ /* There are just a few bytes to copy. Use byte memory operations. */
+ BYTE_COPY_FWD(dstp, srcp, count);
+}
+#endif
+
+/* Copy memory from the end to the beginning. */
+#ifndef MEM_COPY_BWD
+static __always_inline void mem_copy_bwd(unsigned long dstp,
+ unsigned long srcp,
+ size_t count)
+{
+ srcp += count;
+ dstp += count;
+
+ /* If there are enough bytes to copy, use word copy. */
+ if (count >= OP_T_THRESHOLD) {
+ /* Copy just a few bytes to make dstp aligned. */
+ count -= dstp % OPSIZ;
+ BYTE_COPY_BWD(dstp, srcp, dstp % OPSIZ);
+
+ /*
+ * Copy from srcp to dstp taking advantage of the known
+ * alignment of dstp. The number of bytes remaining is put in
+ * the third argument.
+ */
+ WORD_COPY_BWD(dstp, srcp, count, count);
+
+ /* Fall out and copy the tail. */
+ }
+
+ /* There are just a few bytes to copy. Use byte memory operations. */
+ BYTE_COPY_BWD (dstp, srcp, count);
+}
+#endif
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index e6a3763..10a319e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,8 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
idr.o int_sqrt.o extable.o prio_tree.o \
sha1.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o flex_array.o
+ is_single_threaded.o plist.o decompress.o flex_array.o \
+ memcopy.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/memcopy.c b/lib/memcopy.c
new file mode 100644
index 0000000..70fb6b2
--- /dev/null
+++ b/lib/memcopy.c
@@ -0,0 +1,403 @@
+/*
+ * memcopy.c -- subroutines for memory copy functions.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
+ * Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * The code is derived from the GNU C Library.
+ * Copyright (C) 1991, 1992, 1993, 1997, 2004 Free Software Foundation, Inc.
+ */
+
+/* BE VERY CAREFUL IF YOU CHANGE THIS CODE...! */
+
+#include <linux/memcopy.h>
+
+/*
+ * _wordcopy_fwd_aligned -- Copy block beginning at SRCP to block beginning
+ * at DSTP with LEN `op_t' words (not LEN bytes!).
+ * Both SRCP and DSTP should be aligned for memory operations on `op_t's.
+ */
+void _wordcopy_fwd_aligned (long int dstp, long int srcp, size_t len)
+{
+ op_t a0, a1;
+
+ switch (len % 8) {
+ case 2:
+ a0 = ((op_t *) srcp)[0];
+ srcp -= 6 * OPSIZ;
+ dstp -= 7 * OPSIZ;
+ len += 6;
+ goto do1;
+ case 3:
+ a1 = ((op_t *) srcp)[0];
+ srcp -= 5 * OPSIZ;
+ dstp -= 6 * OPSIZ;
+ len += 5;
+ goto do2;
+ case 4:
+ a0 = ((op_t *) srcp)[0];
+ srcp -= 4 * OPSIZ;
+ dstp -= 5 * OPSIZ;
+ len += 4;
+ goto do3;
+ case 5:
+ a1 = ((op_t *) srcp)[0];
+ srcp -= 3 * OPSIZ;
+ dstp -= 4 * OPSIZ;
+ len += 3;
+ goto do4;
+ case 6:
+ a0 = ((op_t *) srcp)[0];
+ srcp -= 2 * OPSIZ;
+ dstp -= 3 * OPSIZ;
+ len += 2;
+ goto do5;
+ case 7:
+ a1 = ((op_t *) srcp)[0];
+ srcp -= 1 * OPSIZ;
+ dstp -= 2 * OPSIZ;
+ len += 1;
+ goto do6;
+ case 0:
+ if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+ return;
+ a0 = ((op_t *) srcp)[0];
+ srcp -= 0 * OPSIZ;
+ dstp -= 1 * OPSIZ;
+ goto do7;
+ case 1:
+ a1 = ((op_t *) srcp)[0];
+ srcp -= -1 * OPSIZ;
+ dstp -= 0 * OPSIZ;
+ len -= 1;
+ if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+ goto do0;
+ goto do8; /* No-op. */
+ }
+
+ do {
+do8:
+ a0 = ((op_t *) srcp)[0];
+ ((op_t *) dstp)[0] = a1;
+do7:
+ a1 = ((op_t *) srcp)[1];
+ ((op_t *) dstp)[1] = a0;
+do6:
+ a0 = ((op_t *) srcp)[2];
+ ((op_t *) dstp)[2] = a1;
+do5:
+ a1 = ((op_t *) srcp)[3];
+ ((op_t *) dstp)[3] = a0;
+do4:
+ a0 = ((op_t *) srcp)[4];
+ ((op_t *) dstp)[4] = a1;
+do3:
+ a1 = ((op_t *) srcp)[5];
+ ((op_t *) dstp)[5] = a0;
+do2:
+ a0 = ((op_t *) srcp)[6];
+ ((op_t *) dstp)[6] = a1;
+do1:
+ a1 = ((op_t *) srcp)[7];
+ ((op_t *) dstp)[7] = a0;
+
+ srcp += 8 * OPSIZ;
+ dstp += 8 * OPSIZ;
+ len -= 8;
+ } while (len != 0);
+
+ /*
+ * This is the right position for do0. Please don't move it into
+ * the loop.
+ */
+do0:
+ ((op_t *) dstp)[0] = a1;
+}
+
+/*
+ * _wordcopy_fwd_dest_aligned -- Copy block beginning at SRCP to block
+ * beginning at DSTP with LEN `op_t' words (not LEN bytes!). DSTP should
+ * be aligned for memory operations on `op_t's, but SRCP must *not* be aligned.
+ */
+
+void _wordcopy_fwd_dest_aligned (long int dstp, long int srcp, size_t len)
+{
+ op_t a0, a1, a2, a3;
+ int sh_1, sh_2;
+
+ /*
+ * Calculate how to shift a word read at the memory operation aligned
+ * srcp to make it aligned for copy.
+ */
+ sh_1 = 8 * (srcp % OPSIZ);
+ sh_2 = 8 * OPSIZ - sh_1;
+
+ /*
+ * Make SRCP aligned by rounding it down to the beginning of the `op_t'
+ * it points in the middle of.
+ */
+ srcp &= -OPSIZ;
+
+ switch (len % 4) {
+ case 2:
+ a1 = ((op_t *) srcp)[0];
+ a2 = ((op_t *) srcp)[1];
+ srcp -= 1 * OPSIZ;
+ dstp -= 3 * OPSIZ;
+ len += 2;
+ goto do1;
+ case 3:
+ a0 = ((op_t *) srcp)[0];
+ a1 = ((op_t *) srcp)[1];
+ srcp -= 0 * OPSIZ;
+ dstp -= 2 * OPSIZ;
+ len += 1;
+ goto do2;
+ case 0:
+ if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+ return;
+ a3 = ((op_t *) srcp)[0];
+ a0 = ((op_t *) srcp)[1];
+ srcp -= -1 * OPSIZ;
+ dstp -= 1 * OPSIZ;
+ len += 0;
+ goto do3;
+ case 1:
+ a2 = ((op_t *) srcp)[0];
+ a3 = ((op_t *) srcp)[1];
+ srcp -= -2 * OPSIZ;
+ dstp -= 0 * OPSIZ;
+ len -= 1;
+ if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+ goto do0;
+ goto do4; /* No-op. */
+ }
+
+ do {
+do4:
+ a0 = ((op_t *) srcp)[0];
+ ((op_t *) dstp)[0] = MERGE (a2, sh_1, a3, sh_2);
+do3:
+ a1 = ((op_t *) srcp)[1];
+ ((op_t *) dstp)[1] = MERGE (a3, sh_1, a0, sh_2);
+do2:
+ a2 = ((op_t *) srcp)[2];
+ ((op_t *) dstp)[2] = MERGE (a0, sh_1, a1, sh_2);
+do1:
+ a3 = ((op_t *) srcp)[3];
+ ((op_t *) dstp)[3] = MERGE (a1, sh_1, a2, sh_2);
+
+ srcp += 4 * OPSIZ;
+ dstp += 4 * OPSIZ;
+ len -= 4;
+ } while (len != 0);
+
+ /*
+ * This is the right position for do0. Please don't move it into
+ * the loop.
+ */
+do0:
+ ((op_t *) dstp)[0] = MERGE (a2, sh_1, a3, sh_2);
+}
+
+/*
+ * _wordcopy_bwd_aligned -- Copy block finishing right before
+ * SRCP to block finishing right before DSTP with LEN `op_t' words (not LEN
+ * bytes!). Both SRCP and DSTP should be aligned for memory operations
+ * on `op_t's.
+ */
+void _wordcopy_bwd_aligned (long int dstp, long int srcp, size_t len)
+{
+ op_t a0, a1;
+
+ switch (len % 8) {
+ case 2:
+ srcp -= 2 * OPSIZ;
+ dstp -= 1 * OPSIZ;
+ a0 = ((op_t *) srcp)[1];
+ len += 6;
+ goto do1;
+ case 3:
+ srcp -= 3 * OPSIZ;
+ dstp -= 2 * OPSIZ;
+ a1 = ((op_t *) srcp)[2];
+ len += 5;
+ goto do2;
+ case 4:
+ srcp -= 4 * OPSIZ;
+ dstp -= 3 * OPSIZ;
+ a0 = ((op_t *) srcp)[3];
+ len += 4;
+ goto do3;
+ case 5:
+ srcp -= 5 * OPSIZ;
+ dstp -= 4 * OPSIZ;
+ a1 = ((op_t *) srcp)[4];
+ len += 3;
+ goto do4;
+ case 6:
+ srcp -= 6 * OPSIZ;
+ dstp -= 5 * OPSIZ;
+ a0 = ((op_t *) srcp)[5];
+ len += 2;
+ goto do5;
+ case 7:
+ srcp -= 7 * OPSIZ;
+ dstp -= 6 * OPSIZ;
+ a1 = ((op_t *) srcp)[6];
+ len += 1;
+ goto do6;
+ case 0:
+ if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+ return;
+ srcp -= 8 * OPSIZ;
+ dstp -= 7 * OPSIZ;
+ a0 = ((op_t *) srcp)[7];
+ goto do7;
+ case 1:
+ srcp -= 9 * OPSIZ;
+ dstp -= 8 * OPSIZ;
+ a1 = ((op_t *) srcp)[8];
+ len -= 1;
+ if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+ goto do0;
+ goto do8; /* No-op. */
+ }
+
+ do {
+do8:
+ a0 = ((op_t *) srcp)[7];
+ ((op_t *) dstp)[7] = a1;
+do7:
+ a1 = ((op_t *) srcp)[6];
+ ((op_t *) dstp)[6] = a0;
+do6:
+ a0 = ((op_t *) srcp)[5];
+ ((op_t *) dstp)[5] = a1;
+do5:
+ a1 = ((op_t *) srcp)[4];
+ ((op_t *) dstp)[4] = a0;
+do4:
+ a0 = ((op_t *) srcp)[3];
+ ((op_t *) dstp)[3] = a1;
+do3:
+ a1 = ((op_t *) srcp)[2];
+ ((op_t *) dstp)[2] = a0;
+do2:
+ a0 = ((op_t *) srcp)[1];
+ ((op_t *) dstp)[1] = a1;
+do1:
+ a1 = ((op_t *) srcp)[0];
+ ((op_t *) dstp)[0] = a0;
+
+ srcp -= 8 * OPSIZ;
+ dstp -= 8 * OPSIZ;
+ len -= 8;
+ } while (len != 0);
+
+ /*
+ * This is the right position for do0. Please don't move it into
+ * the loop.
+ */
+do0:
+ ((op_t *) dstp)[7] = a1;
+}
+
+/*
+ * _wordcopy_bwd_dest_aligned -- Copy block finishing right before SRCP to
+ * block finishing right before DSTP with LEN `op_t' words (not LEN bytes!).
+ * DSTP should be aligned for memory operations on `op_t', but SRCP must *not*
+ * be aligned.
+ */
+void _wordcopy_bwd_dest_aligned (long int dstp, long int srcp, size_t len)
+{
+ op_t a0, a1, a2, a3;
+ int sh_1, sh_2;
+
+ /*
+ * Calculate how to shift a word read at the memory operation aligned
+ * srcp to make it aligned for copy.
+ */
+
+ sh_1 = 8 * (srcp % OPSIZ);
+ sh_2 = 8 * OPSIZ - sh_1;
+
+ /*
+ * Make srcp aligned by rounding it down to the beginning of the op_t
+ * it points in the middle of.
+ */
+ srcp &= -OPSIZ;
+ srcp += OPSIZ;
+
+ switch (len % 4) {
+ case 2:
+ srcp -= 3 * OPSIZ;
+ dstp -= 1 * OPSIZ;
+ a2 = ((op_t *) srcp)[2];
+ a1 = ((op_t *) srcp)[1];
+ len += 2;
+ goto do1;
+ case 3:
+ srcp -= 4 * OPSIZ;
+ dstp -= 2 * OPSIZ;
+ a3 = ((op_t *) srcp)[3];
+ a2 = ((op_t *) srcp)[2];
+ len += 1;
+ goto do2;
+ case 0:
+ if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+ return;
+ srcp -= 5 * OPSIZ;
+ dstp -= 3 * OPSIZ;
+ a0 = ((op_t *) srcp)[4];
+ a3 = ((op_t *) srcp)[3];
+ goto do3;
+ case 1:
+ srcp -= 6 * OPSIZ;
+ dstp -= 4 * OPSIZ;
+ a1 = ((op_t *) srcp)[5];
+ a0 = ((op_t *) srcp)[4];
+ len -= 1;
+ if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+ goto do0;
+ goto do4; /* No-op. */
+ }
+
+ do {
+do4:
+ a3 = ((op_t *) srcp)[3];
+ ((op_t *) dstp)[3] = MERGE (a0, sh_1, a1, sh_2);
+do3:
+ a2 = ((op_t *) srcp)[2];
+ ((op_t *) dstp)[2] = MERGE (a3, sh_1, a0, sh_2);
+do2:
+ a1 = ((op_t *) srcp)[1];
+ ((op_t *) dstp)[1] = MERGE (a2, sh_1, a3, sh_2);
+do1:
+ a0 = ((op_t *) srcp)[0];
+ ((op_t *) dstp)[0] = MERGE (a1, sh_1, a2, sh_2);
+
+ srcp -= 4 * OPSIZ;
+ dstp -= 4 * OPSIZ;
+ len -= 4;
+ } while (len != 0);
+
+ /*
+ * This is the right position for do0. Please don't move it into
+ * the loop.
+ */
+do0:
+ ((op_t *) dstp)[3] = MERGE (a0, sh_1, a1, sh_2);
+}
+
--
1.7.0.1


2010-09-02 08:56:04

by Andi Kleen

Subject: Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

Miao Xie <[email protected]> writes:

> Changes from V1 to V2:
> - change the license from LGPL version 2.1 to GPL version 2
>
> The kernel's memcpy and memmove are very inefficient, while the glibc versions are
> quite fast; in some cases they are 10 times faster than the kernel versions. So I


Can you elaborate on which CPUs and with what workloads you measured that?

The kernel memcpy is optimized for copies smaller than a page size,
for example (the kernel very rarely does anything larger than 4k);
the glibc one isn't, etc. There are various other differences.

memcpy and memmove are very different. AFAIK no one has tried
to optimize memmove() before because traditionally it wasn't
used for anything performance critical in the kernel. Has that
changed? memcpy, on the other hand, while not perfect,
is actually quite optimized for typical workloads.

One big difference between the kernel and glibc is that the kernel
is often cache cold, so, e.g., the cost of a memcpy/memset with a very
large code footprint is harder to amortize.

Microbenchmarks often leave out that crucial variable.

I have some systemtap scripts to measure size/alignment distributions of
copies on a kernel; if you have a particular workload you're interested
in, those could be tried.

Just copying the glibc bloat uncritically is very likely
the wrong move, at least.

-Andi
--
[email protected] -- Speaking for myself only.

2010-09-02 10:11:31

by Miao Xie

Subject: Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

On Thu, 02 Sep 2010 10:55:58 +0200, Andi Kleen wrote:
> Miao Xie<[email protected]> writes:
>
>> Changes from V1 to V2:
>> - change the license from LGPL version 2.1 to GPL version 2
>>
>> The kernel's memcpy and memmove are very inefficient, while the glibc versions are
>> quite fast; in some cases they are 10 times faster than the kernel versions. So I
>
>
> Can you elaborate on which CPUs and with what workloads you measured that?

I did this test on an x86_64 box with 4 cores; the workload was quite low,
and I just did a 500-byte copy 5,000,000 times.

The attached file is my test program.

> The kernel memcpy is optimized for copies smaller than a page size,
> for example (the kernel very rarely does anything larger than 4k);
> the glibc one isn't, etc. There are various other differences.
>
> memcpy and memmove are very different. AFAIK no one has tried
> to optimize memmove() before because traditionally it wasn't
> used for anything performance critical in the kernel. Has that
> changed? memcpy, on the other hand, while not perfect,
> is actually quite optimized for typical workloads.

Yes, the performance of memcpy is good on most architectures.

But some memmove implementations do a byte-by-byte copy, which is quite inefficient.
Unfortunately those memmoves are used to modify the metadata of some filesystems,
such as btrfs. That is, memmove is important for the performance of those filesystems.

So I improve the generic version of memcpy and memmove, and x86_64's memmove,
which are implemented as byte-by-byte copies.
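
For reference, the generic fallback in lib/string.c is essentially a
byte-by-byte loop along these lines (a paraphrased sketch, not a verbatim
quote of the kernel source):

#include <linux/string.h>

void *memmove(void *dest, const void *src, size_t count)
{
	char *d = dest;
	const char *s = src;

	if (d <= s) {
		/* Forward, one byte at a time. */
		while (count--)
			*d++ = *s++;
	} else {
		/* Backward, one byte at a time, to handle overlap. */
		d += count;
		s += count;
		while (count--)
			*--d = *--s;
	}
	return dest;
}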

> One big difference between the kernel and glibc is that the kernel
> is often cache cold, so, e.g., the cost of a memcpy/memset with a very
> large code footprint is harder to amortize.
>
> Microbenchmarks often leave out that crucial variable.
>
> I have some systemtap scripts to measure size/alignment distributions of
> copies on a kernel; if you have a particular workload you're interested
> in, those could be tried.

Good! Could you give me these scripts?

> Just copying the glibc bloat uncritically is very likely
> the wrong move, at least.

Agree!

Thanks!
Miao


Attachments:
perf_memcopy.c (1.22 kB)

2010-09-02 10:40:10

by Andi Kleen

Subject: Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

> So I improve the generic version of memcpy and memmove, and x86_64's memmove,
> which are implemented as byte-by-byte copies.

One should also add that most memmove()s and memcpy()s are actually
generated by gcc as inlines (especially if you don't use the
"make my code slow" option aka -Os) and don't use the fallback.
The fallback depends on the gcc version and on whether gcc thinks the
data is aligned or not.

Sometimes one can get better code in the caller by making sure
gcc knows the correct alignment (e.g. with suitable
types) and size. This might be worth looking at for btrfs
if it's really that memmove-heavy.
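
To illustrate that point with a hypothetical example (the type and function
names here are made up, not btrfs code): when gcc can see both the alignment
and the size, its builtin expansion can become a few word moves instead of a
conservative byte loop:

#include <string.h>

struct btree_key {		/* hypothetical word-aligned type */
	unsigned long objectid;
	unsigned long offset;
};

/*
 * gcc only sees 'void *' and a runtime size here, so the inline
 * expansion (or the out-of-line call) must stay conservative.
 */
static void copy_opaque(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
}

/*
 * Here both the alignment and the size are known at compile time, so
 * gcc can expand the copy as two register-sized loads and stores.
 */
static void copy_key(struct btree_key *dst, const struct btree_key *src)
{
	memcpy(dst, src, sizeof(*dst));
}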

> >
> >I have some systemtap scripts to measure size/alignment distributions of
> >copies on a kernel; if you have a particular workload you're interested
> >in, those could be tried.
>
> Good! Could you give me these scripts?

ftp://firstfloor.org/pub/ak/probes/csum.m4

You need to run them through m4 first.
They don't measure memmove, but that should be easy to add.

-Andi

2010-09-08 11:06:12

by Miao Xie

Subject: Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

Hi, Andi

On Thu, 2 Sep 2010 12:40:08 +0200, Andi Kleen wrote:
>> So I improve the generic version of memcpy and memmove, and x86_64's memmove,
>> which are implemented as byte-by-byte copies.
>
> One should also add that most memmove()s and memcpy()s are actually
> generated by gcc as inlines (especially if you don't use the
> "make my code slow" option aka -Os) and don't use the fallback.
> The fallback depends on the gcc version and on whether gcc thinks the
> data is aligned or not.
>
> Sometimes one can get better code in the caller by making sure
> gcc knows the correct alignment (e.g. with suitable
> types) and size. This might be worth looking at for btrfs
> if it's really that memmove-heavy.

Right! But the src and dest addresses are not fixed, so it is hard to
tell gcc whether the addresses are aligned or not.

The problem is that memmove is in fact very inefficient, and it is used on some
fast paths, such as metadata copies for filesystems, so we must improve it.

>>> I have some systemtap scripts to measure size/alignment distributions of
>>> copies on a kernel; if you have a particular workload you're interested
>>> in, those could be tried.
>>
>> Good! Could you give me these scripts?
>
> ftp://firstfloor.org/pub/ak/probes/csum.m4
>
> You need to run them through m4 first.
> They don't measure memmove, but that should be easy to add.

I used your script to measure size/alignment distributions of copies on a kernel when
I ran the btrfs test, and got some data:

memmove
total 325903
length
value |-------------------------------------------------- count
1 | 0
2 | 0
4 | 3
8 | 0
16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 57062
32 |@@ 5903
64 |@@@ 7296
128 |@@@@@@@@@ 18868
256 |@@@@@@@@@@@@@@@@@ 33790
512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 64886
1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 98450
2048 |@@@@@@@@@@@@@@@@@@@@ 39645
4096 | 0
8192 | 0

length upto 50
value |-------------------------------------------------- count
2 | 0
3 | 0
4 | 3
5 | 0
6 | 0
~
21 | 0
22 | 0
23 | 24
24 | 0
25 |@@@@@@@@@@ 57037
26 | 0
27 | 0
~
29 | 0
30 | 0
31 | 1
32 | 3
33 | 78
34 | 215
35 | 1865
36 | 432
37 | 0
38 | 0
39 | 0
40 | 0
41 | 130
42 | 0
43 | 80
44 | 0
45 | 0
46 | 0
47 | 0
48 | 80
49 | 0
50 | 1077
>50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 264878

src unalignments
value |-------------------------------------------------- count
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 23173
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17623
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 23760
3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17372
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 19185
5 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 26264
6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 20288
7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18474
8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 20160
9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17754
10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 21450
11 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18127
12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 23075
13 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18582
14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 21879
15 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18737
16 | 0

dst unalignments
value |-------------------------------------------------- count
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 28566
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17449
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 20980
3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17239
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17171
5 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 15691
6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 20558
7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18590
8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 20644
9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 21459
10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 20384
11 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 19460
12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 23087
13 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 19656
14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 26330
15 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18639
16 | 0

same unalignments
value |-------------------------------------------------- count
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5695
1 |@@@@@@@@@@@@@@@ 1815
2 |@@@@@@@@@@@@@@@@@@@@@@@@@ 2850
3 |@@@@@@@@@@@@@@@ 1819
4 |@@@@@@@@@@@@@@@@@@@@@@@@ 2791
5 |@@@@@ 573
6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3358
7 |@@@@@@@@@@@@@@@@@@@@@ 2411
8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3340
9 |@@@@@@@@@@@@@@@@@@@@@ 2404
10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3475
11 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 3019
12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4153
13 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 3052
14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4212
15 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 3011
16 | 0

different unalignments
value |-------------------------------------------------- count
0 | 0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 41652
2 |@@@@@@@@@@@@@@@@@@ 25447
3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 69946
4 |@ 1800
5 |@ 1701
6 |@ 1573
7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 62478
8 | 810
9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 49180
10 | 684
11 | 148
12 | 1148
13 |@@@@@@@@@@ 15298
14 |@@ 3459
15 |@ 2601
16 | 0

According to the data, the length of most copies is >= 128.

Besides that, I did some tests under various conditions on my x86_64 box.
The test does a 500-byte memory copy 50,000 times.

The test results are as follows:

dest < src || dest >= src + len:
Length Src Unalign Dest Unalign Without Patch Patch applied
------ ----------- ------------ ------------- -------------
8 0 0 0s 223086us 0s 230612us
8 0 3 0s 133857us 0s 138364us
8 0 6 0s 133852us 0s 138364us
8 3 0 0s 133845us 0s 138365us
8 3 3 0s 133845us 0s 138364us
8 3 6 0s 133841us 0s 138363us
8 6 0 0s 133848us 0s 138364us
8 6 3 0s 133842us 0s 138363us
8 6 6 0s 133840us 0s 138360us
16 0 0 0s 133847us 0s 138364us
16 0 3 0s 133851us 0s 138362us
16 0 6 0s 133842us 0s 138368us
16 3 0 0s 133849us 0s 138360us
16 3 3 0s 133844us 0s 138362us
16 3 6 0s 133839us 0s 138365us
16 6 0 0s 133845us 0s 138359us
16 6 3 0s 133841us 0s 138368us
16 6 6 0s 133841us 0s 138363us
32 0 0 0s 160914us 0s 165435us
32 0 3 0s 160925us 0s 165427us
32 0 6 0s 160898us 0s 165443us
32 3 0 0s 160930us 0s 165432us
32 3 3 0s 160898us 0s 165434us
32 3 6 0s 160919us 0s 165433us
32 6 0 0s 160909us 0s 165436us
32 6 3 0s 160914us 0s 165425us
32 6 6 0s 160910us 0s 165439us
256 0 0 0s 294756us 0s 299280us
256 0 3 0s 500777us 0s 505367us
256 0 6 0s 655671us 0s 660232us
256 3 0 0s 497769us 0s 503386us
256 3 3 0s 497790us 0s 502358us
256 3 6 0s 500793us 0s 505253us
256 6 0 0s 497751us 0s 503097us
256 6 3 0s 497773us 0s 502242us
256 6 6 0s 497769us 0s 502192us
1024 0 0 0s 457170us 0s 461843us
1024 0 3 1s 655705us 1s 660707us
1024 0 6 2s 388031us 2s 391429us
1024 3 0 1s 652717us 1s 660362us
1024 3 3 2s 755214us 1s 657005us
1024 3 6 1s 655735us 1s 660939us
1024 6 0 1s 669425us 1s 662643us
1024 6 3 1s 653472us 1s 659986us
1024 6 6 1s 653559us 1s 662163us

dest > src && dest < src + len:
Length Src Unalign Dest Unalign Without Patch Patch applied
------ ----------- ------------ ------------- -------------
8 0 3 0s 45029us 0s 45775us
8 0 6 0s 43634us 0s 48014us
8 3 0 0s 43625us 0s 43993us
8 3 6 0s 44324us 0s 45081us
8 6 0 0s 43613us 0s 44014us
8 6 3 0s 43620us 0s 44011us
16 0 0 0s 67680us 0s 49631us
16 0 3 0s 67677us 0s 77417us
16 0 6 0s 67671us 0s 81879us
16 3 0 0s 67676us 0s 52492us
16 3 3 0s 67675us 0s 75199us
16 3 6 0s 67677us 0s 81215us
16 6 0 0s 67673us 0s 52490us
16 6 3 0s 67676us 0s 77304us
16 6 6 0s 67676us 0s 79712us
32 0 0 0s 115800us 0s 49632us
32 0 3 0s 115812us 0s 81214us
32 0 6 0s 115792us 0s 85728us
32 3 0 0s 116488us 0s 60198us
32 3 3 0s 115803us 0s 79709us
32 3 6 0s 115798us 0s 85726us
32 6 0 0s 115844us 0s 60202us
32 6 3 0s 115805us 0s 81212us
32 6 6 0s 115802us 0s 84228us
256 0 0 0s 815079us 0s 91737us
256 0 3 0s 815075us 0s 194652us
256 0 6 0s 815079us 0s 198521us
256 3 0 0s 815074us 0s 171470us
256 3 3 0s 817181us 0s 114299us
256 3 6 0s 816478us 0s 198524us
256 6 0 0s 815822us 0s 173566us
256 6 3 0s 814936us 0s 196515us
256 6 6 0s 815152us 0s 118811us
1024 0 0 3s 125598us 0s 244644us
1024 0 3 3s 132574us 0s 590261us
1024 0 6 3s 125613us 0s 595143us
1024 3 0 3s 128452us 0s 568743us
1024 3 3 3s 124862us 0s 263189us
1024 3 6 3s 127440us 0s 595515us
1024 6 0 3s 122370us 0s 569479us
1024 6 3 3s 127429us 0s 590554us
1024 6 6 3s 126732us 0s 269415us

Though there is a little regression when the length is <= 16, the performance
is quite good when the length is >= 32.

Thanks!
Miao

2010-09-08 12:19:29

by Andi Kleen

Subject: Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions


> According to the data, the length of most copies is >= 128.

Thanks for the data. Large is easier to optimize than small, that's good.

Could you also measure how many memsets need the backwards copy?
(should be easy to add)

If the number that needs the backwards copy is small, then the easiest fix
would be to simply call the normal memcpy in the forward case.

The backward case could also use a string instruction copy,
of course; you just have to set the direction flag.

That would be a very small code change.
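
A rough sketch of that idea for x86 (hypothetical code, not the patch; note
that the kernel expects the direction flag to be clear, so cld must follow
the backward copy):

#include <linux/string.h>

void *memmove(void *dest, const void *src, size_t count)
{
	unsigned long d0, d1, d2;

	/* No overlap, or dest below src: the forward memcpy is safe. */
	if (dest <= src || (char *)dest >= (const char *)src + count)
		return memcpy(dest, src, count);

	/*
	 * Overlapping with dest above src: copy backwards with a string
	 * instruction, then clear the direction flag again.
	 */
	asm volatile("std\n\t"
		     "rep movsb\n\t"
		     "cld"
		     : "=&D" (d0), "=&S" (d1), "=&c" (d2)
		     : "0" ((char *)dest + count - 1),
		       "1" ((const char *)src + count - 1),
		       "2" (count)
		     : "memory");
	return dest;
}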

-Andi



2010-09-08 12:57:07

by Miao Xie

Subject: Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote:
>
>> According to the data, the length of most copies is >= 128.
>
> Thanks for the data. Large is easier to optimize than small, that's good.
>
> Could you also measure how many memsets need the backwards copy?
> (should be easy to add)

I think memset doesn't need the backwards copy.
The reason why memmove needs the backward copy is that the forward copy may
overwrite data which will be used later if the dest region overlaps the
src region. But memset needn't worry about this problem.
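
A tiny userspace example of the overlap problem (illustration only):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[8] = "abcdef";

	/*
	 * Shift the string right by one byte: dst (buf + 1) overlaps
	 * src (buf). A naive forward byte copy would store 'a' into
	 * buf[1] before reading it and produce "aaaaaaa"; copying
	 * backwards preserves the data.
	 */
	memmove(buf + 1, buf, 6);
	printf("%s\n", buf);	/* prints "aabcdef" */
	return 0;
}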

Thanks!
Miao

>
> If the number that needs the backwards copy is small, then the easiest fix
> would be to simply call the normal memcpy in the forward case.
>
> The backward case could also use a string instruction copy,
> of course; you just have to set the direction flag.
>
> That would be a very small code change.
>
> -Andi
>
>
>
>

2010-09-08 13:05:22

by Andi Kleen

Subject: Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

On Wed, 08 Sep 2010 20:57:07 +0800
Miao Xie <[email protected]> wrote:

> On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote:
> >
> >> According to the data, the length of most copies is >= 128.
> >
> > Thanks for the data. Large is easier to optimize than small, that's
> > good.
> >
> > Could you also measure how many memsets need the backwards copy?
> > (should be easy to add)
>
> I think memset doesn't need the backwards copy.

I meant for memmove of course. Obviously memset doesn't need a backwards
copy. That was just the only thing the script didn't measure because
the original version didn't have memmove support.

Your whole thread was about making memmove faster, right?

-Andi
--
[email protected] -- Speaking for myself only.

2010-09-08 13:32:42

by Miao Xie

Subject: Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

On Wed, 8 Sep 2010 15:05:19 +0200, Andi Kleen wrote:
> On Wed, 08 Sep 2010 20:57:07 +0800
> Miao Xie<[email protected]> wrote:
>
>> On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote:
>>>
>>>> According to the data, the length of most copies is >= 128.
>>>
>>> Thanks for the data. Large is easier to optimize than small, that's
>>> good.
>>>
>>> Could you also measure how many memsets need the backwards copy?
^^^^^^^ :-)
>>> (should be easy to add)
>>
>> I think memset doesn't need the backwards copy.
>
> I meant for memmove of course. Obviously memset doesn't need a backwards
> copy. That was just the only thing the script didn't measure because
> the original version didn't have memmove support.

memmove
total 5224252
[snip]

forward copy
value |-------------------------------------------------- count
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 980253


backward copy
value |-------------------------------------------------- count
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4243999

The backward copy is much more common than the forward copy.

Thanks
Miao