2023-09-15 17:06:06

by Charlie Jenkins

Subject: [PATCH v6 0/4] riscv: Add fine-tuned checksum functions

Each architecture generally implements fine-tuned checksum functions to
leverage the instruction set. This series adds the main checksum
functions that are used in networking.

Vector support was included in earlier revisions of this series to
start a discussion; it can probably be optimized further. The vector
patches still need some work because they rely on GCC vector intrinsic
types, which cannot work in the kernel since they require C vector
support rather than just assembler support. I have tested the vector
patches as standalone algorithms in QEMU.

This series makes heavy use of the Zbb extension, applied via
alternatives patching.

To test this series, enable the configs for KUNIT, then CHECKSUM_KUNIT
and RISCV_CHECKSUM_KUNIT.
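
A minimal .kunitconfig-style fragment would look something like the
below (a sketch; the symbols match the Kconfig options named above):

	CONFIG_KUNIT=y
	CONFIG_CHECKSUM_KUNIT=y
	CONFIG_RISCV_CHECKSUM_KUNIT=y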

I have attempted to make these functions as optimal as possible, but I
have not run anything on actual riscv hardware. My performance testing
has been limited to inspecting the assembly, running the algorithms on
x86 hardware, and running in QEMU.

ip_fast_csum is a relatively small function, so even though it is
possible to read 64 bits at a time on compatible hardware, the setup
and cleanup code becomes the bottleneck; loading 32 bits at a time is
actually faster.

---

The algorithm proposed to replace the default csum_fold can be shown to
compute the same result by exhaustively checking all 2^32 possible
inputs:

#include <stdio.h>

static inline unsigned int ror32(unsigned int word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/* Default generic csum_fold */
unsigned short csum_fold(unsigned int csum)
{
	unsigned int sum = csum;

	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return ~sum;
}

/* Proposed replacement, taken from arch/arc */
unsigned short csum_fold_arc(unsigned int csum)
{
	return ((~csum - ror32(csum, 16)) >> 16);
}

int main(void)
{
	unsigned int start = 0x0;

	do {
		if (csum_fold(start) != csum_fold_arc(start)) {
			printf("Not the same %u\n", start);
			return -1;
		}
		start += 1;
	} while (start != 0x0);
	printf("The same\n");
	return 0;
}

Cc: Paul Walmsley <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Arnd Bergmann <[email protected]>
To: Charlie Jenkins <[email protected]>
To: Palmer Dabbelt <[email protected]>
To: Conor Dooley <[email protected]>
To: Samuel Holland <[email protected]>
To: David Laight <[email protected]>
To: [email protected]
To: [email protected]
To: [email protected]
Signed-off-by: Charlie Jenkins <[email protected]>

---
Changes in v6:
- Fix accuracy of commit message for csum_fold
- Fix indentation
- Link to v5: https://lore.kernel.org/r/[email protected]

Changes in v5:
- Drop vector patches
- Check ZBB enabled before doing any ZBB code (Conor)
- Check endianness in IS_ENABLED
- Revert to the simpler non-tree based version of ipv6_csum_magic since
David pointed out that the tree based version is not better.
- Link to v4: https://lore.kernel.org/r/[email protected]

Changes in v4:
- Suggestion by David Laight to use an improved checksum used in
  arch/arc.
  - Eliminates zero-extension on rv32, but not on rv64.
  - Reduces data dependency which should improve execution speed on
    rv32 and rv64.
  - Still passes CHECKSUM_KUNIT and RISCV_CHECKSUM_KUNIT on rv32 and
    rv64 with and without zbb.
- Link to v3: https://lore.kernel.org/r/[email protected]

Changes in v3:
- Use riscv_has_extension_likely and has_vector where possible (Conor)
- Reduce ifdefs by using IS_ENABLED where possible (Conor)
- Use kernel_vector_begin in the vector code (Samuel)
- Link to v2: https://lore.kernel.org/r/[email protected]

Changes in v2:
- After more benchmarking, rework functions to improve performance.
- Remove tests that overlapped with the already existing checksum
tests and make tests more extensive.
- Use alternatives to activate code with Zbb and vector extensions
- Link to v1: https://lore.kernel.org/r/[email protected]

---
Charlie Jenkins (4):
asm-generic: Improve csum_fold
riscv: Checksum header
riscv: Add checksum library
riscv: Test checksum functions

arch/riscv/Kconfig.debug | 1 +
arch/riscv/include/asm/checksum.h | 91 ++++++++++
arch/riscv/lib/Kconfig.debug | 31 ++++
arch/riscv/lib/Makefile | 3 +
arch/riscv/lib/csum.c | 198 ++++++++++++++++++++
arch/riscv/lib/riscv_checksum_kunit.c | 330 ++++++++++++++++++++++++++++++++++
include/asm-generic/checksum.h | 4 +-
7 files changed, 655 insertions(+), 3 deletions(-)
---
base-commit: af3c30d33476bc2694b0d699173544b07f7ae7de
change-id: 20230804-optimize_checksum-db145288ac21
--
- Charlie


2023-09-15 19:09:57

by Charlie Jenkins

Subject: [PATCH v6 1/4] asm-generic: Improve csum_fold

This csum_fold implementation, introduced into arch/arc by Vineet
Gupta, is better than the default implementation on at least arc, x86,
and riscv. Using GCC trunk and compiling a non-inlined version, this
implementation uses 41.67% and 25% fewer instructions on riscv64 and
x86-64 respectively with -O3 optimization. Most implementations
override this default in asm, but this should be more performant than
all of those other implementations except for arm, which has barrel
shifting, and sparc32, which has a carry flag.

Signed-off-by: Charlie Jenkins <[email protected]>
Reviewed-by: David Laight <[email protected]>
---
include/asm-generic/checksum.h | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/asm-generic/checksum.h b/include/asm-generic/checksum.h
index 43e18db89c14..37f5ec70ac93 100644
--- a/include/asm-generic/checksum.h
+++ b/include/asm-generic/checksum.h
@@ -31,9 +31,7 @@ extern __sum16 ip_fast_csum(const void *iph, unsigned int ihl);
static inline __sum16 csum_fold(__wsum csum)
{
u32 sum = (__force u32)csum;
- sum = (sum & 0xffff) + (sum >> 16);
- sum = (sum & 0xffff) + (sum >> 16);
- return (__force __sum16)~sum;
+ return (__force __sum16)((~sum - ror32(sum, 16)) >> 16);
}
#endif


--
2.42.0

2023-09-15 19:29:09

by Charlie Jenkins

Subject: [PATCH v6 3/4] riscv: Add checksum library

Provide a 32-bit and a 64-bit version of do_csum. When compiled for
32-bit it will load from the buffer in groups of 32 bits, and when
compiled for 64-bit it will load in groups of 64 bits.

Signed-off-by: Charlie Jenkins <[email protected]>
---
arch/riscv/include/asm/checksum.h | 12 +++
arch/riscv/lib/Makefile | 1 +
arch/riscv/lib/csum.c | 198 ++++++++++++++++++++++++++++++++++++++
3 files changed, 211 insertions(+)

diff --git a/arch/riscv/include/asm/checksum.h b/arch/riscv/include/asm/checksum.h
index dc0dd89f2a13..7fcd07edb8b3 100644
--- a/arch/riscv/include/asm/checksum.h
+++ b/arch/riscv/include/asm/checksum.h
@@ -12,6 +12,18 @@

#define ip_fast_csum ip_fast_csum

+extern unsigned int do_csum(const unsigned char *buff, int len);
+#define do_csum do_csum
+
+/* Default version is sufficient for 32 bit */
+#ifdef CONFIG_64BIT
+#define _HAVE_ARCH_IPV6_CSUM
+__sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ const struct in6_addr *daddr,
+ __u32 len, __u8 proto, __wsum sum);
+#endif
+
+// Define riscv versions of functions before importing asm-generic/checksum.h
#include <asm-generic/checksum.h>

/*
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 26cb2502ecf8..2aa1a4ad361f 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -6,6 +6,7 @@ lib-y += memmove.o
lib-y += strcmp.o
lib-y += strlen.o
lib-y += strncmp.o
+lib-y += csum.o
lib-$(CONFIG_MMU) += uaccess.o
lib-$(CONFIG_64BIT) += tishift.o
lib-$(CONFIG_RISCV_ISA_ZICBOZ) += clear_page.o
diff --git a/arch/riscv/lib/csum.c b/arch/riscv/lib/csum.c
new file mode 100644
index 000000000000..1fda96d2bd8d
--- /dev/null
+++ b/arch/riscv/lib/csum.c
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * IP checksum library
+ *
+ * Influenced by arch/arm64/lib/csum.c
+ * Copyright (C) 2023 Rivos Inc.
+ */
+#include <linux/bitops.h>
+#include <linux/compiler.h>
+#include <linux/kasan-checks.h>
+#include <linux/kernel.h>
+
+#include <net/checksum.h>
+
+/* Default version is sufficient for 32 bit */
+#ifndef CONFIG_32BIT
+__sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ const struct in6_addr *daddr,
+ __u32 len, __u8 proto, __wsum csum)
+{
+ unsigned int ulen, uproto;
+ unsigned long sum = csum;
+
+ sum += saddr->s6_addr32[0];
+ sum += saddr->s6_addr32[1];
+ sum += saddr->s6_addr32[2];
+ sum += saddr->s6_addr32[3];
+
+ sum += daddr->s6_addr32[0];
+ sum += daddr->s6_addr32[1];
+ sum += daddr->s6_addr32[2];
+ sum += daddr->s6_addr32[3];
+
+ ulen = htonl((unsigned int)len);
+ sum += ulen;
+
+ uproto = htonl(proto);
+ sum += uproto;
+
+ if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
+ IS_ENABLED(CONFIG_RISCV_ALTERNATIVE)) {
+ unsigned long fold_temp;
+
+ /*
+ * Zbb is likely available when the kernel is compiled with Zbb
+ * support, so nop when Zbb is available and jump when Zbb is
+ * not available.
+ */
+ asm_volatile_goto(ALTERNATIVE("j %l[no_zbb]", "nop", 0,
+ RISCV_ISA_EXT_ZBB, 1)
+ :
+ :
+ :
+ : no_zbb);
+ asm(".option push \n\
+ .option arch,+zbb \n\
+ rori %[fold_temp], %[sum], 32 \n\
+ add %[sum], %[fold_temp], %[sum] \n\
+ srli %[sum], %[sum], 32 \n\
+ not %[fold_temp], %[sum] \n\
+ roriw %[sum], %[sum], 16 \n\
+ subw %[sum], %[fold_temp], %[sum] \n\
+ .option pop"
+ : [sum] "+r" (sum), [fold_temp] "=&r" (fold_temp));
+ return (__force __sum16)(sum >> 16);
+ }
+no_zbb:
+ sum += (sum >> 32) | (sum << 32);
+ sum >>= 32;
+ return csum_fold((__force __wsum)sum);
+}
+EXPORT_SYMBOL(csum_ipv6_magic);
+#endif // !CONFIG_32BIT
+
+#ifdef CONFIG_32BIT
+#define OFFSET_MASK 3
+#elif CONFIG_64BIT
+#define OFFSET_MASK 7
+#endif
+
+/*
+ * Perform a checksum on an arbitrary memory address.
+ * Algorithm accounts for buff being misaligned.
+ * If buff is not aligned, will over-read bytes but not use the bytes that it
+ * shouldn't. The same thing will occur on the tail-end of the read.
+ */
+unsigned int __no_sanitize_address do_csum(const unsigned char *buff, int len)
+{
+ unsigned int offset, shift;
+ unsigned long csum = 0, data;
+ const unsigned long *ptr;
+
+ if (unlikely(len <= 0))
+ return 0;
+ /*
+ * To align the address, read the whole first word that contains the
+ * start of buff. Since the load stays within a single word, it will
+ * never cross pages or cache lines.
+ * Directly call KASAN with the alignment we will be using.
+ */
+ offset = (unsigned long)buff & OFFSET_MASK;
+ kasan_check_read(buff, len);
+ ptr = (const unsigned long *)(buff - offset);
+ len = len + offset - sizeof(unsigned long);
+
+ /*
+ * Mask out the bytes that were over-read at the start if buff was
+ * not aligned.
+ */
+ shift = offset * 8;
+ data = *ptr;
+ if (IS_ENABLED(__LITTLE_ENDIAN))
+ data = (data >> shift) << shift;
+ else
+ data = (data << shift) >> shift;
+
+ /*
+ * Do 32-bit reads on RV32 and 64-bit reads otherwise. This should be
+ * faster than doing 32-bit reads on architectures that support larger
+ * reads.
+ */
+ while (len > 0) {
+ csum += data;
+ csum += csum < data;
+ len -= sizeof(unsigned long);
+ ptr += 1;
+ data = *ptr;
+ }
+
+ /*
+ * Mask out the alignment (and over-read) bytes on the tail if any
+ * bytes are left over.
+ */
+ shift = len * -8;
+ if (IS_ENABLED(__LITTLE_ENDIAN))
+ data = (data << shift) >> shift;
+ else
+ data = (data >> shift) << shift;
+
+ csum += data;
+ csum += csum < data;
+
+ if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
+ riscv_has_extension_likely(RISCV_ISA_EXT_ZBB)) {
+ unsigned int fold_temp;
+
+ if (IS_ENABLED(CONFIG_32BIT)) {
+ asm_volatile_goto(".option push \n\
+ .option arch,+zbb \n\
+ rori %[fold_temp], %[csum], 16 \n\
+ andi %[offset], %[offset], 1 \n\
+ add %[csum], %[fold_temp], %[csum] \n\
+ beq %[offset], zero, %l[end] \n\
+ rev8 %[csum], %[csum] \n\
+ zext.h %[csum], %[csum] \n\
+ .option pop"
+ : [csum] "+r" (csum),
+ [fold_temp] "=&r" (fold_temp)
+ : [offset] "r" (offset)
+ :
+ : end);
+
+ return csum;
+ } else {
+ asm_volatile_goto(".option push \n\
+ .option arch,+zbb \n\
+ rori %[fold_temp], %[csum], 32 \n\
+ add %[csum], %[fold_temp], %[csum] \n\
+ srli %[csum], %[csum], 32 \n\
+ roriw %[fold_temp], %[csum], 16 \n\
+ addw %[csum], %[fold_temp], %[csum] \n\
+ andi %[offset], %[offset], 1 \n\
+ beq %[offset], zero, %l[end] \n\
+ rev8 %[csum], %[csum] \n\
+ srli %[csum], %[csum], 32 \n\
+ zext.h %[csum], %[csum] \n\
+ .option pop"
+ : [csum] "+r" (csum),
+ [fold_temp] "=&r" (fold_temp)
+ : [offset] "r" (offset)
+ :
+ : end);
+
+ return csum;
+ }
+end:
+ return csum >> 16;
+ }
+
+#ifndef CONFIG_32BIT
+ csum += (csum >> 32) | (csum << 32);
+ csum >>= 32;
+#endif
+ csum = (unsigned int)csum + (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));
+ if (offset & 1)
+ return (unsigned short)swab32(csum);
+ return csum >> 16;
+}

--
2.42.0

2023-09-15 21:41:45

by Charlie Jenkins

Subject: [PATCH v6 4/4] riscv: Test checksum functions

Add Kconfig support for riscv-specific testing modules. This was
created to supplement lib/checksum_kunit.c, adding tests for
ip_fast_csum and csum_ipv6_magic.

Signed-off-by: Charlie Jenkins <[email protected]>
---
arch/riscv/Kconfig.debug | 1 +
arch/riscv/lib/Kconfig.debug | 31 ++++
arch/riscv/lib/Makefile | 2 +
arch/riscv/lib/riscv_checksum_kunit.c | 330 ++++++++++++++++++++++++++++++++++
4 files changed, 364 insertions(+)

diff --git a/arch/riscv/Kconfig.debug b/arch/riscv/Kconfig.debug
index e69de29bb2d1..53a84ec4f91f 100644
--- a/arch/riscv/Kconfig.debug
+++ b/arch/riscv/Kconfig.debug
@@ -0,0 +1 @@
+source "arch/riscv/lib/Kconfig.debug"
diff --git a/arch/riscv/lib/Kconfig.debug b/arch/riscv/lib/Kconfig.debug
new file mode 100644
index 000000000000..fc7da3b107ad
--- /dev/null
+++ b/arch/riscv/lib/Kconfig.debug
@@ -0,0 +1,31 @@
+# SPDX-License-Identifier: GPL-2.0-only
+menu "riscv lib Testing and Coverage"
+
+menuconfig RUNTIME_LIB_TESTING_MENU
+ bool "Runtime Lib Testing"
+ def_bool y
+ help
+ Enable riscv runtime lib testing.
+
+if RUNTIME_LIB_TESTING_MENU
+
+config RISCV_CHECKSUM_KUNIT
+ tristate "KUnit test riscv checksum functions at runtime" if !KUNIT_ALL_TESTS
+ depends on KUNIT
+ default KUNIT_ALL_TESTS
+ help
+ Enable this option to test the checksum functions at boot.
+
+ KUnit tests run during boot and output the results to the debug log
+ in TAP format (http://testanything.org/). Only useful for kernel devs
+ running the KUnit test harness, and not intended for inclusion into a
+ production build.
+
+ For more information on KUnit and unit tests in general please refer
+ to the KUnit documentation in Documentation/dev-tools/kunit/.
+
+ If unsure, say N.
+
+endif # RUNTIME_LIB_TESTING_MENU
+
+endmenu # "riscv lib Testing and Coverage"
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 2aa1a4ad361f..1535a8c81430 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -12,3 +12,5 @@ lib-$(CONFIG_64BIT) += tishift.o
lib-$(CONFIG_RISCV_ISA_ZICBOZ) += clear_page.o

obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
+
+obj-$(CONFIG_RISCV_CHECKSUM_KUNIT) += riscv_checksum_kunit.o
diff --git a/arch/riscv/lib/riscv_checksum_kunit.c b/arch/riscv/lib/riscv_checksum_kunit.c
new file mode 100644
index 000000000000..27f0e465447f
--- /dev/null
+++ b/arch/riscv/lib/riscv_checksum_kunit.c
@@ -0,0 +1,330 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Test cases for checksum
+ */
+
+#include <linux/in6.h>
+
+#include <kunit/test.h>
+#include <net/checksum.h>
+#include <net/ip6_checksum.h>
+
+#define CHECK_EQ(lhs, rhs) KUNIT_ASSERT_EQ(test, lhs, rhs)
+
+static const u8 random_buf[] = {
+ 0x3d, 0xf9, 0x6f, 0x81, 0x84, 0x11, 0xb8, 0x03, 0x8f, 0x00, 0x1e, 0xfd,
+ 0xc6, 0x77, 0xf7, 0x72, 0xde, 0x16, 0xe2, 0xf7, 0xf8, 0x81, 0x4b, 0x3e,
+ 0x36, 0x57, 0x9c, 0x10, 0x4e, 0x53, 0x44, 0x94, 0x5e, 0x6c, 0x5b, 0xde,
+ 0x98, 0x8a, 0xc5, 0x0a, 0x5d, 0x24, 0x38, 0x4c, 0x50, 0xef, 0x20, 0xe8,
+ 0x14, 0x4e, 0x8d, 0x3e, 0x80, 0x9a, 0xd9, 0xf1, 0xb5, 0x2d, 0x27, 0x6d,
+ 0xb4, 0x99, 0x9b, 0x10, 0xf7, 0x12, 0x14, 0xff, 0xe8, 0xe1, 0xd5, 0x1a,
+ 0x96, 0x86, 0x6a, 0xb3, 0xde, 0x10, 0xf3, 0xa5, 0x08, 0xbd, 0x74, 0x27,
+ 0x5a, 0x72, 0x4f, 0x5a, 0xd3, 0x4b, 0xbb, 0x73, 0xe3, 0x71, 0xd1, 0x1d,
+ 0x8c, 0xb3, 0x69, 0xd9, 0x3c, 0xda, 0x58, 0x73, 0x86, 0x19, 0xd1, 0xf9,
+ 0x58, 0xee, 0x4a, 0x39, 0xf9, 0x43, 0x38, 0x22, 0x8a, 0x6f, 0xee, 0xb5,
+ 0x7a, 0x31, 0x52, 0x32, 0x80, 0xf1, 0x70, 0x60, 0x7c, 0x0a, 0xa6, 0x54,
+ 0x08, 0x11, 0x99, 0xa1, 0x4b, 0x58, 0xc1, 0xbe, 0x6d, 0x5e, 0xd1, 0x32,
+ 0x79, 0xcf, 0xaf, 0x7c, 0x52, 0x6f, 0x26, 0xc4, 0xa8, 0x1d, 0x67, 0x04,
+ 0x2f, 0xb8, 0x10, 0x9d, 0x97, 0x2f, 0xe3, 0xa1, 0xf7, 0x88, 0xa4, 0xab,
+ 0xd9, 0x22, 0xaa, 0x8d, 0x11, 0x3b, 0x27, 0x34, 0x31, 0xd6, 0x44, 0xeb,
+ 0x9f, 0x4c, 0x22, 0x29, 0xea, 0x83, 0xa4, 0x6b, 0x48, 0x7a, 0xe7, 0x4c,
+ 0x84, 0x5b, 0x24, 0xbe, 0x1e, 0x1f, 0xf6, 0xc7, 0x9e, 0xd4, 0xc1, 0x52,
+ 0x23, 0x18, 0xaa, 0xfe, 0x72, 0x63, 0x7f, 0x2f, 0xcd, 0xda, 0x0e, 0x39,
+ 0x09, 0xbb, 0x84, 0x24, 0xa4, 0xa9, 0x2f, 0x01, 0x55, 0xfa, 0xb4, 0xa7,
+ 0x0c, 0x9c, 0xb0, 0x22, 0x71, 0x85, 0x91, 0x62, 0x97, 0xdc, 0x8d, 0xaf
+};
+
+static const __sum16 expected_csum_ipv6_magic[] = {
+ 0xf45f, 0x1b52, 0xa787, 0x5002, 0x562d, 0x2aed, 0x54b4, 0xc018, 0xfdc9,
+ 0xae5a, 0xa9d1, 0x79e1, 0x12c6, 0xa262, 0x1290, 0x7632, 0x85d, 0xfa1c,
+ 0xbe47, 0x304b, 0x506e, 0x4dd0, 0x1ce7, 0x49f5, 0x4c39, 0xa900, 0x16d6,
+ 0x4c3d, 0xf8b7, 0x71ab, 0x9109, 0x992b, 0x19a9, 0x8b0f, 0xff9c, 0x3113,
+ 0x152f, 0xcffc, 0xb3af, 0xfb87, 0x7015, 0x2005, 0x2fa5, 0x4c99, 0xd8fe,
+ 0xffb5, 0x4610, 0xe437, 0xa888, 0x49b0, 0x8705, 0xabfa, 0x2ed2, 0x8788,
+ 0xdff8, 0x662f, 0x3ac0, 0xf00c, 0x863a, 0xce3f, 0xfe40, 0x38e0, 0xb0a9,
+ 0x181, 0xee1d, 0x707a, 0x922, 0xd470, 0x3fad, 0x6b7b, 0x3945, 0x8991,
+ 0x3ffb, 0xc8c5, 0xfae1, 0x59cb, 0xfc51, 0x6954, 0x8955, 0x49d3, 0xc582,
+ 0x61bd, 0xe5a4, 0xaf1d, 0xa2d0, 0xb02b, 0xbf1e, 0x20ac, 0xd5d4, 0x2450,
+ 0xc454, 0x6a16, 0x4f9c, 0xeecf, 0xb7de, 0x9f27, 0x99fe, 0xb715, 0xfdc0,
+ 0xc6a2, 0xbb1a, 0xf0c2, 0xbb01, 0x8f53, 0xad2f, 0x9bf7, 0x9f3, 0x87ca,
+ 0xb445, 0xc220, 0x8b20, 0xd65a, 0xba07, 0x6b33, 0x4139, 0xbeef, 0x673a,
+ 0xbab8, 0xa929, 0x54cf, 0x2a18, 0xbbd1, 0x2d8, 0x2269, 0xa025, 0xeece,
+ 0x64a6, 0x5b74, 0x5ef7, 0xbaf5, 0x26e9, 0x2009, 0xabc0, 0x97a1, 0x41f,
+ 0xe0a7, 0x6d8b, 0x2845, 0x374a, 0x76e0, 0x7303, 0x1384, 0x854e, 0xcfac,
+ 0xc102, 0xc7f1, 0x479d, 0x9d8b, 0xd587, 0xc173, 0xb00c, 0xc4d1, 0xe8ed,
+ 0x51d2, 0x48d4, 0xd9eb, 0x6744, 0xcaf, 0xf785, 0xe8dc, 0x9034, 0x7413,
+ 0x26ce, 0x3b4b, 0xbf9, 0xba2a, 0xe9d8, 0x89de, 0x5150, 0x28ef, 0xbefb,
+ 0xb67f, 0xee07, 0x1c10, 0x2534, 0x78ce, 0xfc75, 0x7a6d, 0x5cdd, 0x7edb,
+ 0xf3ad, 0xd7bf, 0x3b1, 0xc411, 0xacfc, 0xe3b5, 0xca9d, 0x174e, 0x893b,
+ 0x442c, 0x4dec, 0x827d, 0x5783, 0x2dac, 0x7d26, 0x3530, 0xb0db, 0x11bc,
+ 0xb2ac, 0x4462
+};
+
+static const __sum16 expected_fast_csum[] = {
+ 0x78e9, 0x2e78, 0xf02e, 0x52f0, 0x3353, 0xa133, 0xeda0, 0xbced, 0xe0bc,
+ 0xfde0, 0x8dfd, 0xd78d, 0xf6d7, 0x3ff7, 0x240, 0xdc02, 0x96db, 0x2197,
+ 0x2321, 0x3e23, 0x103f, 0xda10, 0x6dda, 0xed6d, 0x5fed, 0xd35f, 0xc7d2,
+ 0x4ec8, 0xf04d, 0x87f0, 0xf587, 0x3ef5, 0x4b3f, 0x1d4b, 0x1d1d, 0x9f1c,
+ 0x99f, 0x8209, 0x6682, 0x2067, 0x420, 0xc903, 0x8ec8, 0x658e, 0xca65,
+ 0xbec9, 0xa6bf, 0xcba6, 0x8fcb, 0xf78e, 0xfbf6, 0xaefb, 0x1faf, 0x991f,
+ 0x3399, 0x834, 0x7208, 0xdf71, 0x8edf, 0x138e, 0x5613, 0xbf56, 0x32bf,
+ 0xe632, 0x1be6, 0x831c, 0xbc82, 0x47bc, 0x6148, 0x5d61, 0xf75d, 0x77f7,
+ 0x9e77, 0x2a9e, 0xb32a, 0xc3b2, 0x48c4, 0x1649, 0xa615, 0x9fa6, 0x729f,
+ 0x6b72, 0x556b, 0x8755, 0x987, 0x5b09, 0x625b, 0xd961, 0xc2d8, 0x53c3,
+ 0x2053, 0xc420, 0x5ac4, 0xae5a, 0x88ae, 0x4789, 0x8447, 0x4984, 0xc849,
+ 0xc4c7, 0xebc4, 0x86eb, 0x9487, 0x8d94, 0xd58d, 0x93d5, 0xfd92, 0xf3fd,
+ 0x96f4, 0xd096, 0x7ad1, 0x757a, 0x5f75, 0x6660, 0x9266, 0x592, 0x1305,
+ 0x4413, 0x2a44, 0x712a, 0x2171, 0x7e21, 0xf47d, 0xfef3, 0xf3fe, 0x5f4,
+ 0x1606, 0xc715, 0xf9c6, 0xf0f9, 0x94f0, 0x7095, 0x2570, 0xd024, 0x18d0,
+ 0x219, 0xb602, 0x1eb6, 0x561e, 0xcf56, 0x77cf, 0xa577, 0xa6a5, 0x93a6,
+ 0x3793, 0x1537, 0x7e15, 0x207e, 0x4f20, 0x994e, 0x9b99, 0x159b, 0xd215,
+ 0xacd2, 0xb4ac, 0xecb4, 0x84ec, 0xea84, 0x66ea, 0xb666, 0x18b6, 0xae18,
+ 0xfbad, 0x6efc, 0x746f, 0x7c74, 0x797c, 0x7c79, 0xb97c, 0xdba, 0x620d,
+ 0xd061, 0xa2d0, 0x5da2, 0x825d, 0x6082, 0xf85f, 0x72f8, 0xaf73, 0xc1ae,
+ 0xd2c1, 0xb8a5, 0xacb8, 0x5aad, 0x805a, 0xcb80, 0xb6cb, 0x89b6, 0x2a8a,
+ 0xf929, 0x5af9, 0x8d5a, 0x1d8d, 0xac1d, 0x4bac, 0x994b, 0x7d99, 0x17e,
+ 0xff01, 0xf3fe, 0xa8f4, 0x9fa9, 0x51a0, 0x3251, 0x7c32, 0x887b, 0x9d88,
+ 0x919d, 0xac91, 0x63ac, 0x7a63, 0x1c7a, 0xe51b, 0xbee4, 0x8dbe, 0xfd8d,
+ 0xc1fd, 0x6ec2, 0xa66e, 0x5fa6, 0xd05f, 0x59d0, 0x3659, 0x6b36, 0x5a6b,
+ 0xb859, 0xc1b7, 0xc5c1, 0xcc5, 0x930d, 0x8b92, 0x5a8b, 0xae5a, 0xe5ad,
+ 0x4fe5, 0x6f50, 0x366f, 0xbb36, 0xe3bb, 0x2be3, 0x962b, 0x7196, 0xf071,
+ 0x98f0, 0x3c99, 0x4f3c, 0x604f, 0x1660, 0xb915, 0xa1b9, 0xbea1, 0x11bf,
+ 0xc311, 0xec3, 0xcd0e, 0xe1cc, 0xcde1, 0xbbcd, 0x6fbc, 0xf26e, 0x9f3,
+ 0x250a, 0x8c24, 0xc88c, 0x2fc8, 0xf62e, 0x30f6, 0x7a30, 0x357a, 0x9b35,
+ 0xf9b, 0xa30f, 0x92a3, 0xf492, 0xebf4, 0xf6eb, 0xcef6, 0x5ece, 0xe05e,
+ 0xe0e0, 0xf7e0, 0x87f8, 0xb487, 0x70b4, 0x9c70, 0x839c, 0xa683, 0x92a6,
+ 0xd192, 0x37d2, 0x2238, 0x1523, 0xd414, 0xacd3, 0x81ad, 0x9881, 0xf897,
+ 0xfbf7, 0x14fc, 0xd15, 0x320d, 0x9032, 0x3390, 0xf232, 0xd5f1, 0xa7d5,
+ 0x3a8, 0x2a04, 0x4e2a, 0xc64d, 0x21c6, 0xb321, 0x60b3, 0x361, 0x3a03,
+ 0x5c39, 0xc25c, 0x60c2, 0x7660, 0x8976, 0x5489, 0xa654, 0xcaa5, 0x7bca,
+ 0xf77b, 0x2f7, 0x9702, 0xaf97, 0x9caf, 0x9e9c, 0xdd9e, 0xd2dd, 0xdcd2,
+ 0x62dd, 0x5463, 0xaa53, 0x76aa, 0xc375, 0x5c3, 0x2f06, 0xf42e, 0xa2f4,
+ 0xa1a2, 0x4ea1, 0xe04e, 0x84e0, 0x8f85, 0x938f, 0x4c93, 0xf24c, 0xa1f2,
+ 0xb9a1, 0x27ba, 0x8927, 0x1a89, 0xa51a, 0x4ba4, 0x114b, 0xde10, 0x12de,
+ 0x6112, 0xab61, 0x50d3, 0xc250, 0xf6c2, 0xedf6, 0xe3ed, 0x13e4, 0x8913,
+ 0x7089, 0xae6f, 0x66ae, 0x2466, 0xbf23, 0x16c0, 0x2917, 0x6a29, 0xe86a,
+ 0x90e8, 0x7691, 0xb875, 0x37b9, 0xc837, 0x1bc9, 0xfc1b, 0xd9fb, 0xfbd9,
+ 0x8ffb, 0xb88f, 0x52b8, 0xd751, 0xead6, 0xfcea, 0x7fd, 0x2408, 0xb223,
+ 0xf6b1, 0x71f6, 0xc472, 0x13c4, 0x3c14, 0xc53c, 0x47c4, 0x3947, 0x8a38,
+ 0x9b89, 0xbb9b, 0x55bb, 0x2456, 0xc24, 0x590c, 0x4258, 0x9642, 0xdc95,
+ 0x2edc, 0x542f, 0xc54, 0xb90c, 0xd6b9, 0x14d7, 0x9214, 0xec91, 0xa4ec,
+ 0xcda4, 0xf2cd, 0xadf2, 0x8fad, 0xc18f, 0x30c1, 0x430, 0x1205, 0x6112,
+ 0x4061, 0xcd40, 0x81cc, 0x2682, 0x2e26, 0x382e, 0x6e38, 0x906e, 0x6590,
+ 0xb265, 0x11b2, 0x6211, 0xe061, 0x8be0, 0xce8b, 0xeccd, 0xfcec, 0x3fd,
+ 0x3504, 0x4d35, 0x114d, 0x1a11, 0xcf19, 0x82cf, 0xf83, 0x210, 0xfb01,
+ 0xdfb, 0xbd0d, 0x6bd, 0x3607, 0xc735, 0x5c8, 0x7a05, 0x247a, 0xf824,
+ 0x2cf8, 0x302d, 0x8530, 0x3d85, 0x1b3e, 0xc71a, 0x95c6, 0x5296, 0x7b52,
+ 0xb97a, 0x6ab9, 0xca6a, 0xaca, 0x90b, 0x4409, 0x3144, 0x631, 0x5d06,
+ 0x745c, 0x3474, 0x4835, 0x3e48, 0xa43e, 0x8ba4, 0xf68a, 0x20f7, 0xae20,
+ 0x91ad, 0x8f91, 0x478f, 0x8f47, 0x9b8e, 0x5e9b, 0xb85e, 0x71b8, 0x4c71,
+ 0xad4c, 0x73ad, 0x5273, 0xdb52, 0xe6db, 0x63e7, 0x2f64, 0x852f, 0xc884,
+ 0x66c8, 0xa166, 0x6fa1, 0x726f, 0xb472, 0x4db4, 0xf94c, 0x81f9, 0x6581,
+ 0xb365, 0xb4b3, 0x68b4, 0xb068, 0xbdb0, 0x23be, 0xeb23, 0xa3eb, 0xd8a3,
+ 0x5ed9, 0xdc5e, 0x12dc, 0xa212, 0x85a1, 0x885, 0xeb07, 0xe9ea, 0xf8e9,
+ 0xa7f9, 0x93a7, 0x9493, 0x6940, 0x1f69, 0xf61f, 0x33f6, 0x9933, 0x1f99,
+ 0x201f, 0x1220, 0x1912, 0x4419, 0xf543, 0x29f5, 0xa62a, 0xa0a6, 0x2ea0,
+ 0x772f, 0xb976, 0x40ba, 0x8240, 0x9582, 0x3b96, 0xe3c, 0x230e, 0x8022,
+ 0x6f7f, 0x6f, 0x9900, 0x7599, 0x3c75, 0xf3c, 0xf60e, 0xb7f5, 0x79b8,
+ 0x1f79, 0xd31f, 0x66d3, 0xb266, 0x16b2, 0x5b16, 0x65b, 0x4b06, 0xcd4a,
+ 0xe8cc, 0x9ae8, 0x819a, 0xc81, 0x600d, 0x3a5f, 0xa23a, 0x46a2, 0x3346,
+ 0x5f33, 0x4a5f, 0x854a, 0x7285, 0xf73, 0xa10, 0xf209, 0xebf1, 0x5deb,
+ 0xe55d, 0x2ee5, 0xd2f, 0xf90c, 0xfff8, 0x6400, 0x5f63, 0xe5f, 0x850e,
+ 0xba85, 0x8cba, 0x378d, 0x3437, 0x4734, 0xa147, 0xe0a0, 0x5ae0, 0x665b,
+ 0x7d65, 0xe7e, 0xea0e, 0x1de9, 0x631e, 0x5a63, 0x685a, 0x2a68, 0x6b2a,
+ 0x8b6a, 0xf8b, 0xe40f, 0x29e4, 0x4d2a, 0x6b4d, 0xb06b, 0xebaf, 0x10ec,
+ 0xa910, 0x20a9, 0x5221, 0xe451, 0xd6e4, 0x18d7, 0xa019, 0xd89f, 0x71d8,
+ 0x1372, 0x3313, 0x2333, 0x6e23, 0xe6e, 0xfe0e, 0x87fd, 0x488, 0x805,
+ 0x7907, 0x9078, 0x1e90, 0xc81e, 0x1ec8, 0x901f, 0x1090, 0x6210, 0x2462,
+ 0x4d24, 0x524d, 0x9e52, 0x8b9e, 0xfe8b, 0x4efe, 0xe34e, 0x29e3, 0xa629,
+ 0xdca5, 0xb6db, 0x64b6, 0xab64, 0x5aab, 0x1d5a, 0x901d, 0x3490, 0xc134,
+ 0x90c1, 0xe490, 0x3ae5, 0xe33a, 0x82e3, 0xdc82, 0xeddc, 0x6ded, 0xa06d,
+ 0x90a0, 0xa490, 0x2ba5, 0x632b, 0xc562, 0x25c5, 0x5e25, 0xc5e, 0x9c0c,
+ 0x359b, 0xec35, 0x48ec, 0xc048, 0x7c1, 0xa407, 0xe0a4, 0xde1, 0x8f0d,
+ 0xf18e, 0xc9f1, 0x3fc9, 0xb23f, 0x7ab2, 0xa07a, 0x9da0, 0x1d9d, 0xd31c,
+ 0xdbd2, 0x45dc, 0xa145, 0x1a2, 0x1e86, 0x2b1e, 0x8d2b, 0xd58c, 0x3d6,
+ 0xfd03, 0xf0fc, 0x7cf1, 0xa87c, 0xbba8, 0xb9ba, 0xb8b9, 0xceb8, 0x6acf,
+ 0xf86a, 0xd4f8, 0x2cd5, 0x332d, 0xa932, 0x3ba9, 0xaf3b, 0x7eaf, 0x37f,
+ 0xa303, 0xd4a2, 0x24d4, 0x9224, 0x2592, 0x9225, 0x7c91, 0xd27c, 0xacd2,
+ 0x67ac, 0x2267, 0xf221, 0xa7f1, 0xb5a8, 0xaab5, 0xb9aa, 0x5ba, 0x1105,
+ 0x8410, 0x2484, 0xc923, 0xcac8, 0x10cb, 0xfd10, 0xbcfc, 0xbdbd, 0x77bd,
+ 0x9977, 0xb599, 0x7db5, 0x627d, 0xcc62, 0x80cc, 0x4a81, 0x534a, 0x653,
+ 0xa905, 0x55a9, 0xd155, 0x3bd1, 0x33c, 0x7302, 0xbd73, 0xabbc, 0x78ab,
+ 0x3779, 0xdb37, 0xffdb, 0xdfff, 0x20df, 0x1d21, 0xb91c, 0x3cb9, 0x333d,
+ 0x2233, 0x22, 0xdd00, 0x83dd, 0x5b83, 0xd15b, 0xe1d0, 0x42e1, 0xc142,
+ 0x83c1, 0xbe83, 0xabbe, 0x11ac, 0x611, 0x5c06, 0x195c, 0xc319, 0x80c3,
+ 0xee80, 0x49ee, 0x724a, 0xec72, 0x42ec, 0x2443, 0x3424, 0xa634, 0xcba5,
+ 0x5acb, 0xe45a, 0x15e4, 0xe415, 0xdce4, 0xc3dc, 0xfbc3, 0x5efb, 0xb85e,
+ 0x5b9, 0x8d05, 0x178d, 0xeb16, 0xf8ea, 0x3cf9, 0x803d, 0xee80, 0xcbee,
+ 0x67cb, 0xd68, 0xfd0c, 0xf5fc, 0xbef6, 0x83be, 0x7d83, 0x87d, 0xff07,
+ 0x9ff, 0xa809, 0x38a7, 0x9638, 0x2796, 0xaa27, 0x61aa, 0xc761, 0xfbc7,
+ 0x51fc, 0x3852, 0xda37, 0xc4da, 0x21c4, 0x9e21, 0xa49e, 0x2ba5, 0xf82b,
+ 0x93f7, 0xe393, 0x15e3, 0x3c16, 0x763c, 0xdf75, 0xf5de, 0x96f5, 0xa096,
+ 0xf3a0, 0x8cf3, 0xd28c, 0x5d3, 0xe305, 0xf2e2, 0xbcf2, 0x4bbd, 0x714b,
+ 0x2e71, 0xca2e, 0xe4ca, 0xd4e4, 0xe4d4, 0x63e4, 0x8363, 0x3b83, 0x2b3b,
+ 0x402b, 0x8f3f, 0x3b8f, 0xc53b, 0xedc5, 0x8928, 0x889, 0x5e09, 0x405e,
+ 0x9340, 0x7493, 0xb573, 0xbb6, 0xd10a, 0x85d1, 0x8385, 0x1683, 0x4217,
+ 0x5d42, 0x1f5d, 0x7b1f, 0xa07a, 0xa3a0, 0x89a3, 0x5e8a, 0x145f, 0xa314,
+ 0xfca2, 0x52fc, 0x2a53, 0x9229, 0x6e92, 0x1a6f, 0x8019, 0x7f7f, 0xf17e,
+ 0xedf0, 0x6aee, 0xb66a, 0x50b6, 0xa750, 0x7ba7, 0x617b, 0xf561, 0x33f5,
+ 0x5a33, 0x885a, 0xc187, 0x4bc1, 0xe64b, 0x41e6, 0x6342, 0x1363, 0xf113,
+ 0x54f0, 0xf354, 0x26f3, 0xbe26, 0xc3bd, 0xe6c3, 0xcbe6, 0xbacc, 0xf5ba,
+ 0x34f5, 0xb334, 0xc8b2, 0x2ac9, 0x882a, 0x6d88, 0x256d, 0xde25, 0x1ede,
+ 0x211e, 0x2421, 0xb124, 0x17b1, 0x3c18, 0xf93b, 0xd8f8, 0x3bd9, 0xb3c,
+ 0xcd0b, 0x5fcd, 0x6e5f, 0x646e, 0x5e64, 0xf25d, 0xe9f2, 0x14ea, 0xdf14,
+ 0xeede, 0x5fee, 0xcd5f, 0x59cd, 0x245a, 0x9b24, 0x399b, 0xba39, 0x14bb,
+ 0x1b15, 0x4d1b, 0x974c, 0x8d97, 0xf28d, 0x35f2, 0xd36, 0x50d, 0x8905,
+ 0x8c88, 0xc98c, 0x99c9, 0x1399, 0xbb13, 0x90bb, 0xc190, 0xfc2, 0xe60f,
+ 0x84e5, 0x3685, 0xab36, 0x7ab, 0xc907, 0x62c9, 0x8062, 0x4081, 0x9940,
+ 0x2399, 0x9b23, 0x929a, 0x2b92, 0x1b2b, 0x941b, 0xe793, 0x48e7, 0x8a48,
+ 0x308a, 0x8630, 0xf785, 0x7cf7, 0xcd7c, 0xeecd, 0x3aef, 0x93b, 0xbd08,
+ 0x85bd, 0x9085, 0x5390, 0xa253, 0x2a3, 0xac02, 0x91ab, 0xf791, 0x9cf7,
+ 0x89d, 0xa708, 0xfda6, 0xe5fc, 0x74e6, 0xa75, 0x370a, 0x4d37, 0x7d4c,
+ 0x5d7d, 0x165e, 0x7815, 0xeb77, 0x70eb, 0x4670, 0x9246, 0x9592, 0x6696,
+ 0x667, 0x6106, 0xb360, 0xc7b3, 0x72c7, 0xf272, 0xd0f2, 0x36d0, 0x3136,
+ 0x4f31, 0x2c4f, 0x772c, 0x4777, 0x3747, 0xe38, 0x1893, 0x8018, 0x2280,
+ 0xcf22, 0xbbce, 0x3ebc, 0x7f3e, 0x697f, 0x4469, 0x7844, 0xaa77, 0xbca9,
+ 0xb5bc, 0xcdb5, 0xffcd, 0x9e00, 0x59e, 0xc805, 0x82c7, 0xe83, 0x6a0f,
+ 0x106a, 0xd910, 0x47d9, 0x1847, 0x9517, 0x8d94, 0x5b8d, 0x835b, 0x1383,
+ 0x5013, 0xed4f, 0x30ed, 0x6d30, 0x8c6d, 0xd58b, 0xc4d5, 0x65c5, 0x9265,
+ 0xb692, 0x75b6, 0xb975, 0x27b9, 0xa227, 0x19a2, 0x1f19, 0xbd1f, 0x84bc,
+ 0x3185, 0xb630, 0xdb6, 0x720d, 0x2e72, 0x662e, 0x1566, 0xd615, 0x2dd6,
+ 0x4f2e, 0x814e, 0x1d81, 0x7b1d, 0x4b7b, 0xfb4b, 0x15fb, 0x1215, 0xb412,
+ 0x36b3, 0x7d36, 0xfc7d, 0x6cfc, 0x9a6d, 0xa9b, 0x930a, 0x1693, 0xaa16,
+ 0x92a9, 0xa792, 0xf6a7, 0x86f6, 0x9787, 0xfa97, 0x1ffa, 0xc61f, 0x23c6,
+ 0x8d23, 0x18d, 0xf501, 0xaaf4, 0xfaaa, 0x75fb, 0x3576, 0x9835, 0x798,
+ 0x3008, 0x2130, 0x4021, 0x803f, 0x5e80, 0xd55e, 0xf6d4, 0x7bf7, 0xba7b,
+ 0x86ba, 0x6386, 0x7d63, 0x977d, 0x2797, 0x4228, 0x5d42, 0xf25c, 0x2df3,
+ 0xd62d, 0x62d6, 0xa063, 0xee9f, 0xc7ee, 0x73c7, 0xba73, 0xb3ba, 0xc5b3,
+ 0xc7c5, 0x48c7, 0x7048, 0xf66f, 0xf6f5, 0x9cf6, 0xc59d, 0x63c5, 0x9863,
+ 0xce98, 0x67ce, 0x4d68, 0x884d, 0x2488, 0xc323, 0x78c3, 0x7978, 0x2479,
+ 0x8524, 0xc385, 0x1ac4, 0x471a, 0xf546, 0x73f5, 0xbc73, 0xa4bc, 0x11a5,
+ 0x6d11, 0x416d, 0x3b41, 0x553b, 0x3d55, 0x5b3d, 0xc75b, 0x59c7, 0x3859,
+ 0x9637, 0xc895, 0x79c8, 0x1779, 0xc417, 0x8bc4, 0xdb8b, 0xc4db, 0x7ec4,
+ 0x497f, 0xa449, 0x6ea4, 0x206f, 0x7b20, 0x687a, 0x1669, 0xbd16, 0x1ebd,
+ 0x3d1e, 0xc13c, 0x4cc1, 0x4e4c, 0x794e, 0x6379, 0x6364, 0x4121, 0x4a41,
+ 0xec4a, 0x2cec, 0x2f2d, 0x312f, 0xa630, 0xfa6, 0xb80e, 0xe8b7, 0x8ae8,
+ 0xdf8a, 0x1ae0, 0xf21a, 0xf8f1, 0x4df9, 0x5b4e, 0x355b, 0x5f35, 0x360,
+ 0x5803, 0x1358, 0xf812, 0x88f7, 0x1b89, 0x291b, 0xec28, 0x5aec, 0x495a,
+ 0xca48, 0x8bca, 0x1b8b, 0x7a1b, 0x717a, 0x2971, 0x5829, 0xe058, 0x96e0,
+ 0xf896, 0xcf9, 0xa90c, 0x96a8, 0x8196, 0x1381, 0x5a13, 0x8059, 0xd780,
+ 0xcfd6, 0xa1d0, 0x58a1, 0x3c58, 0x7c3c, 0xa17b, 0xbfa1, 0x61bf, 0x4062,
+ 0xe040, 0x6fe0, 0xf46f, 0xc5f3, 0x67c5, 0x2168, 0x1321, 0x7213, 0xea71,
+ 0x6fea, 0xb96f, 0x4bb9, 0x964c, 0xaa96, 0x8ab, 0x9208, 0x6d91, 0xad6d,
+ 0xc2ad, 0xc5c2, 0x43c6, 0x2444, 0x6323, 0xa663, 0xa8a6, 0x32a8, 0x5b33,
+ 0x15b, 0x2e01, 0x532e, 0x8f53, 0x98f, 0x4809, 0x9148, 0x3b91, 0x8b3b,
+ 0xf08a, 0xf1, 0x401, 0x104, 0xef00, 0x13ef, 0xd313, 0xcdd2, 0x2fce,
+ 0xb82f, 0x9ab8, 0xea9a, 0x49ea, 0xc849, 0x45c8, 0x3246, 0x3b33, 0x5c3b,
+ 0x715c, 0x9671, 0xd96, 0xf80d, 0x21f8, 0x4d21, 0xa24c, 0xdfa1, 0x88df,
+ 0x2989, 0x9329, 0xca92, 0xa1ca, 0x72a1, 0x4672, 0xe146, 0xfce1, 0x2afd,
+ 0x292b, 0x7629, 0x5d75, 0xd75d, 0xc6d6, 0x3fc6, 0x8b3f, 0xb68b, 0x3b7,
+ 0x1803, 0xd817, 0x34d8, 0x2b35, 0x5a2b, 0xf5a, 0x440f, 0xf543, 0x38f5,
+ 0x6939, 0xc469, 0x27c4, 0xf827, 0x77f8, 0x2877, 0x7428, 0x3274, 0xbd31,
+ 0xd7bc, 0x6ed7, 0xe36e, 0xee4, 0x4a0e, 0xad49, 0x6ead, 0x796e, 0xd279,
+ 0xebd2, 0xfceb, 0x99fc, 0x929a, 0xc93, 0x630d, 0x7462, 0x8874, 0xdd88,
+ 0xf5dc, 0x6ef5, 0xed6e, 0xa1ed, 0xc9a1, 0x7dc9, 0x597d, 0xc159, 0xb47f,
+ 0x3cb4, 0x133d, 0xd312, 0xa2d2, 0xa1a2, 0x86a1, 0x3287, 0x1d32, 0xd1d,
+ 0x840c, 0x8f83, 0x7090, 0x5f70, 0xd55f, 0x42d6, 0x4942, 0x3849, 0x7e37,
+ 0x447e, 0x5b45, 0xa75b, 0x56a7, 0x8856, 0xe187, 0xdfe0, 0x27e0, 0x8927,
+ 0x9288, 0xce92, 0x28ce, 0x9e28, 0x959e, 0xa295, 0x8fa2, 0xae8f, 0x13af,
+ 0x7413, 0x5274, 0x7e52, 0xe97d, 0xf7e8, 0x9bf7, 0x5e9b, 0xca5e, 0x22ca,
+ 0x623, 0xda05, 0x14da, 0xb214, 0x88b1, 0xe688, 0x53e6, 0xe053, 0xd4e0,
+ 0xe8d4, 0xcce8, 0x45cd, 0xc45, 0x220c, 0x4022, 0xdd3f, 0x95dd, 0x4096,
+ 0x8440, 0xad84, 0x27ad, 0xd326, 0x70d3, 0x4171, 0x2142, 0xc521, 0x9c5,
+ 0xdb09, 0x9eda, 0xd49e, 0xf1d4, 0x36f2, 0xf836, 0x83f8, 0x4984, 0x8449,
+ 0xf584, 0x5ff5, 0x7b5f, 0x6e7b, 0x956e, 0xfc94, 0x30fc, 0x6231, 0x1e62,
+ 0x4c1e, 0x5f4c, 0xb65f, 0x1b6, 0xd801, 0xa2d7, 0x11a3, 0xe711, 0x54e7,
+ 0xfc54, 0xe8fb, 0xb8e9, 0xdab8, 0x27db, 0x3228, 0x8931, 0xf289, 0xe5f2,
+ 0xb3e5, 0xa4b4, 0x1ba4, 0x3c1b, 0x1d3c, 0xf71c, 0xb0f6, 0x6db0, 0x616d,
+ 0xba61, 0xa5ba, 0xe2a5, 0xee3, 0xd90e, 0x39d9, 0xd739, 0x88d7, 0xf288,
+ 0xb4f2, 0x67b4, 0x9167, 0x2591, 0x1526, 0x5115, 0x3350, 0xde32, 0x27de,
+ 0x1428, 0x2b14, 0xf22a, 0x4f2, 0x6405, 0xee63, 0x66ee, 0x9b67, 0xdb9a,
+ 0xf5db, 0x8bf6, 0xaf8b, 0x40af, 0x6340, 0xdb62, 0xc7da, 0x4cc8, 0x4d4d,
+ 0x524d, 0xa52, 0x5809, 0xc657, 0xacc6, 0x57ac, 0x1a58, 0x221a, 0x6f21,
+ 0xf66f, 0xd7f6, 0xe4d8, 0xa5e4, 0x4a6, 0x2d05, 0x3a2d, 0xa639, 0xb4a6,
+ 0x32b5, 0x7132, 0x7370, 0xe372, 0xffe2, 0x800, 0x3a08, 0x9c39, 0x29d,
+ 0x2825, 0xad27, 0xf3ad, 0xf5f3, 0x7f6, 0xc607, 0x7fc5, 0xe27f, 0x72e2,
+ 0x7a72, 0x607a, 0x8460, 0x5e84, 0x625e, 0xf461, 0x83f4, 0x4c84, 0xcc4c,
+ 0xdccb, 0x43dd, 0x2144, 0x5e21, 0x925e, 0xb691, 0x2ab6, 0xe42a, 0xc4e3,
+ 0xbc5, 0xae0b, 0xffad, 0x8eff, 0xf48e, 0xc8f4, 0x7fc8, 0xe97f, 0x1fe9,
+ 0x5420, 0xd553, 0x6cd5, 0xc96c, 0x59c9, 0x9a59, 0xca99, 0x68ca, 0x3d68,
+ 0x7c3d, 0x527c, 0x4452, 0xc744, 0xd2c6, 0xfbd2, 0x8efb, 0x408e, 0xb640,
+ 0xecb5, 0x44ed, 0xa545, 0x1a5, 0x8f01, 0xf08e, 0xd9f0, 0x1ada, 0x41b,
+ 0xc803, 0x5ec7, 0x445f, 0x4044, 0x640, 0xd07, 0x6f0d, 0xfd6e, 0xd3fd,
+ 0xb7d3, 0xedb7, 0x33ee, 0xb233, 0x92b2, 0x8893, 0x9288, 0xe292, 0x96e2,
+ 0x9f96, 0xfb9f, 0x52fb, 0x6452, 0x3f64, 0x783f, 0xbd77, 0x9fbd, 0x17a0,
+ 0x1c17, 0x231c, 0x1323, 0xb413, 0x15b4, 0x5f16, 0x6f5e, 0x426f, 0x543,
+ 0x4505, 0xda45, 0x52da, 0xfc52, 0x9afc, 0xd29a, 0x89d2, 0xbc89, 0x77bc,
+ 0x1478, 0xd913, 0x79d9, 0x7f79, 0x77f, 0x9f07, 0x289f, 0x2d28, 0xbd2c,
+ 0xa5bd, 0xf1a5, 0x6cf2, 0x736d, 0xb673, 0xceb5, 0xc3ce, 0x15c3, 0xa415,
+ 0xbaa4, 0xf2ba, 0xf1f2, 0x84f1, 0x7884, 0x8678, 0x6186, 0x4661, 0xf845,
+ 0xf7f7, 0x4cf8, 0xbf4c, 0x49bf, 0x5c4a, 0x4a5c, 0xab4a, 0x89ab, 0x8689,
+ 0xf485, 0x60f4, 0xef60, 0x4eef, 0x194f, 0x7e19, 0x707e, 0xfa6f, 0x35fa,
+ 0x3036, 0xf02f, 0x17f0, 0xc517, 0x79c4, 0xa279, 0x7ba2, 0x67c, 0xa07,
+ 0x7b09, 0x687b, 0xf868, 0xbbf8, 0xd7bb, 0x30d8, 0x8231, 0xb582, 0xaab4,
+ 0xaaaa, 0x90aa, 0xaf90, 0x2faf, 0x262f, 0x4126, 0xe640, 0x91e6, 0x9991,
+ 0x1a9a
+};
+
+#define IPv4_MIN_WORDS 5
+#define IPv4_MAX_WORDS 15
+#define NUM_IPv6_TESTS 200
+#define NUM_IP_FAST_CSUM_TESTS 181
+
+static void test_ip_fast_csum(struct kunit *test)
+{
+ __sum16 csum_result, expected;
+
+ for (int len = IPv4_MIN_WORDS; len < IPv4_MAX_WORDS; len++) {
+ for (int index = 0; index < NUM_IP_FAST_CSUM_TESTS; index++) {
+ csum_result = ip_fast_csum(random_buf + index, len);
+ expected =
+ expected_fast_csum[(len - IPv4_MIN_WORDS) *
+ NUM_IP_FAST_CSUM_TESTS +
+ index];
+ CHECK_EQ(expected, csum_result);
+ }
+ }
+}
+
+static void test_csum_ipv6_magic(struct kunit *test)
+{
+ const struct in6_addr *saddr;
+ const struct in6_addr *daddr;
+ unsigned int len;
+ unsigned char proto;
+ unsigned int csum;
+
+ const int daddr_offset = sizeof(struct in6_addr);
+ const int len_offset = sizeof(struct in6_addr) + sizeof(struct in6_addr);
+ const int proto_offset = sizeof(struct in6_addr) + sizeof(struct in6_addr) +
+ sizeof(int);
+ const int csum_offset = sizeof(struct in6_addr) + sizeof(struct in6_addr) +
+ sizeof(int) + sizeof(char);
+
+ for (int i = 0; i < NUM_IPv6_TESTS; i++) {
+ saddr = (const struct in6_addr *)(random_buf + i);
+ daddr = (const struct in6_addr *)(random_buf + i +
+ daddr_offset);
+ len = *(unsigned int *)(random_buf + i + len_offset);
+ proto = *(random_buf + i + proto_offset);
+ csum = *(unsigned int *)(random_buf + i + csum_offset);
+ CHECK_EQ(expected_csum_ipv6_magic[i],
+ csum_ipv6_magic(saddr, daddr, len, proto, csum));
+ }
+}
+
+static struct kunit_case __refdata riscv_checksum_test_cases[] = {
+ KUNIT_CASE(test_ip_fast_csum),
+ KUNIT_CASE(test_csum_ipv6_magic),
+ {}
+};
+
+static struct kunit_suite riscv_checksum_test_suite = {
+ .name = "riscv_checksum",
+ .test_cases = riscv_checksum_test_cases,
+};
+
+kunit_test_suites(&riscv_checksum_test_suite);
+
+MODULE_AUTHOR("Charlie Jenkins <[email protected]>");
+MODULE_LICENSE("GPL");

--
2.42.0

2023-09-15 21:48:01

by Charlie Jenkins

Subject: [PATCH v6 2/4] riscv: Checksum header

Provide checksum algorithms that have been designed to leverage riscv
instructions such as rotate. In 64-bit, they can take advantage of the
larger register width to avoid some overflow checking.
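
For illustration, on rv64 an IPv4 header (at most 15 32-bit words) can
be summed in a 64-bit register without any per-addition carry check,
since the total cannot overflow 64 bits. A sketch of the idea,
mirroring the ip_fast_csum loop in the patch below (the rv64 shape of
it, not new code):

	unsigned long csum = 0;
	int pos = 0;

	do {
		/* no "csum += csum < word" carry check needed on rv64 */
		csum += ((const unsigned int *)iph)[pos];
	} while (++pos < ihl);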

Signed-off-by: Charlie Jenkins <[email protected]>
---
arch/riscv/include/asm/checksum.h | 79 +++++++++++++++++++++++++++++++++++++++
1 file changed, 79 insertions(+)

diff --git a/arch/riscv/include/asm/checksum.h b/arch/riscv/include/asm/checksum.h
new file mode 100644
index 000000000000..dc0dd89f2a13
--- /dev/null
+++ b/arch/riscv/include/asm/checksum.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * IP checksum routines
+ *
+ * Copyright (C) 2023 Rivos Inc.
+ */
+#ifndef __ASM_RISCV_CHECKSUM_H
+#define __ASM_RISCV_CHECKSUM_H
+
+#include <linux/in6.h>
+#include <linux/uaccess.h>
+
+#define ip_fast_csum ip_fast_csum
+
+#include <asm-generic/checksum.h>
+
+/*
+ * Quickly compute an IP checksum with the assumption that IPv4 headers will
+ * always be in multiples of 32-bits, and have an ihl of at least 5.
+ * @ihl is the number of 32 bit segments and must be greater than or equal to 5.
+ * @iph is assumed to be word aligned.
+ */
+static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
+{
+ unsigned long csum = 0;
+ int pos = 0;
+
+ do {
+ csum += ((const unsigned int *)iph)[pos];
+ if (IS_ENABLED(CONFIG_32BIT))
+ csum += csum < ((const unsigned int *)iph)[pos];
+ } while (++pos < ihl);
+
+ /*
+ * ZBB only saves three instructions on 32-bit and five on 64-bit, so
+ * it is not worth checking for support without Alternatives.
+ */
+ if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
+ IS_ENABLED(CONFIG_RISCV_ALTERNATIVE)) {
+ unsigned long fold_temp;
+
+ asm_volatile_goto(ALTERNATIVE("j %l[no_zbb]", "nop", 0,
+ RISCV_ISA_EXT_ZBB, 1)
+ :
+ :
+ :
+ : no_zbb);
+
+ if (IS_ENABLED(CONFIG_32BIT)) {
+ asm(".option push \n\
+ .option arch,+zbb \n\
+ not %[fold_temp], %[csum] \n\
+ rori %[csum], %[csum], 16 \n\
+ sub %[csum], %[fold_temp], %[csum] \n\
+ .option pop"
+ : [csum] "+r" (csum), [fold_temp] "=&r" (fold_temp));
+ } else {
+ asm(".option push \n\
+ .option arch,+zbb \n\
+ rori %[fold_temp], %[csum], 32 \n\
+ add %[csum], %[fold_temp], %[csum] \n\
+ srli %[csum], %[csum], 32 \n\
+ not %[fold_temp], %[csum] \n\
+ roriw %[csum], %[csum], 16 \n\
+ subw %[csum], %[fold_temp], %[csum] \n\
+ .option pop"
+ : [csum] "+r" (csum), [fold_temp] "=&r" (fold_temp));
+ }
+ return csum >> 16;
+ }
+no_zbb:
+#ifndef CONFIG_32BIT
+ csum += (csum >> 32) | (csum << 32);
+ csum >>= 32;
+#endif
+ return csum_fold((__force __wsum)csum);
+}
+
+#endif // __ASM_RISCV_CHECKSUM_H

--
2.42.0

2023-09-16 08:52:04

by Conor Dooley

Subject: Re: [PATCH v6 1/4] asm-generic: Improve csum_fold

On Fri, Sep 15, 2023 at 10:01:17AM -0700, Charlie Jenkins wrote:
> This csum_fold implementation, introduced into arch/arc by Vineet
> Gupta, is better than the default implementation on at least arc, x86,
> and riscv. Using GCC trunk and compiling a non-inlined version, this
> implementation uses 41.67% and 25% fewer instructions on riscv64 and
> x86-64 respectively with -O3 optimization. Most implementations
> override this default in asm, but this should be more performant than
> all of those other implementations except for arm, which has barrel
> shifting, and sparc32, which has a carry flag.
>
> Signed-off-by: Charlie Jenkins <[email protected]>
> Reviewed-by: David Laight <[email protected]>
> ---
> include/asm-generic/checksum.h | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/include/asm-generic/checksum.h b/include/asm-generic/checksum.h
> index 43e18db89c14..37f5ec70ac93 100644
> --- a/include/asm-generic/checksum.h
> +++ b/include/asm-generic/checksum.h
> @@ -31,9 +31,7 @@ extern __sum16 ip_fast_csum(const void *iph, unsigned int ihl);
> static inline __sum16 csum_fold(__wsum csum)
> {
> u32 sum = (__force u32)csum;
> - sum = (sum & 0xffff) + (sum >> 16);
> - sum = (sum & 0xffff) + (sum >> 16);
> - return (__force __sum16)~sum;
> + return (__force __sum16)((~sum - ror32(sum, 16)) >> 16);

Breaks the build on RISC-V in a way that is repaired by later patches in
the series, so you likely did not notice:

./include/asm-generic/checksum.h:34:35: error: call to undeclared function 'ror32'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
../include/linux/bitops.h:134:21: error: static declaration of 'ror32' follows non-static declaration

Cheers,
Conor.



2023-09-16 10:08:11

by David Laight

Subject: RE: [PATCH v6 3/4] riscv: Add checksum library

From: Charlie Jenkins
> Sent: 15 September 2023 18:01
>
> Provide a 32-bit and a 64-bit version of do_csum. When compiled for
> 32-bit it will load from the buffer in groups of 32 bits, and when
> compiled for 64-bit it will load in groups of 64 bits.
>
...
> + /*
> + * Do 32-bit reads on RV32 and 64-bit reads otherwise. This should be
> + * faster than doing 32-bit reads on architectures that support larger
> + * reads.
> + */
> + while (len > 0) {
> + csum += data;
> + csum += csum < data;
> + len -= sizeof(unsigned long);
> + ptr += 1;
> + data = *ptr;
> + }

I think you'd be better adding the 'carry' bits in a separate
variable.
It reduces the register dependency chain length in the loop.
(Helps if the cpu can execute two instructions in one clock.)

The masked misaligned data values are max 24 bits
(if

You'll also almost certainly remove at least one instruction
from the loop by comparing against the end address rather than
changing 'len'.

So ending up with (something like):
	end = buff + length;
	...
	while (++ptr < end) {
		csum += data;
		carry += csum < data;
		data = ptr[-1];
	}
(Although a do-while loop tends to generate better code
and gcc will pretty much always make that transformation.)

I think that is 4 instructions per word (load, add, cmp+set, add).
In principle they could be completely pipelined and all
execute (for different loop iterations) in the same clock.
(But that is pretty unlikely to happen - even x86 isn't that good.)
But taking two clocks is quite plausible.
Plus 2 instructions per loop (inc, cmp+jmp).
They might execute in parallel, but unrolling once
may be required.
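
A fuller sketch of that shape (illustrative only; the helper name and
the final fold are mine, not the patch's code):

	static unsigned long wsum(const unsigned long *ptr,
				  const unsigned long *end)
	{
		unsigned long csum = 0, carry = 0;

		do {
			unsigned long data = *ptr++;

			csum += data;
			carry += csum < data;	/* carry-out, kept apart */
		} while (ptr < end);

		/* fold the accumulated carries back in once */
		csum += carry;
		csum += csum < carry;
		return csum;
	}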

...
> + if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
> + riscv_has_extension_likely(RISCV_ISA_EXT_ZBB)) {
...
> + }
> +end:
> + return csum >> 16;
> + }

Is it really worth doing all that to save (I think) 4 instructions?
(shift, shift, or with rotate twice).
There is much more to be gained by careful inspection
of the loop (even leaving it in C).

> +
> +#ifndef CONFIG_32BIT
> + csum += (csum >> 32) | (csum << 32);
> + csum >>= 32;
> +#endif
> + csum = (unsigned int)csum + (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));

Use ror64() and ror32().
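
That is, as a sketch (using the rotate helpers from linux/bitops.h,
not the code as posted):

	csum += ror64(csum, 32);
	csum >>= 32;
	csum = (unsigned int)csum + ror32((unsigned int)csum, 16);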

David

> + if (offset & 1)
> + return (unsigned short)swab32(csum);
> + return csum >> 16;
> +}
>
> --
> 2.42.0

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-09-19 02:58:36

by Charlie Jenkins

Subject: Re: [PATCH v6 3/4] riscv: Add checksum library

On Sat, Sep 16, 2023 at 09:32:40AM +0000, David Laight wrote:
> From: Charlie Jenkins
> > Sent: 15 September 2023 18:01
> >
> > Provide a 32-bit and a 64-bit version of do_csum. When compiled for
> > 32-bit it will load from the buffer in groups of 32 bits, and when
> > compiled for 64-bit it will load in groups of 64 bits.
> >
> ...
> > + /*
> > + * Do 32-bit reads on RV32 and 64-bit reads otherwise. This should be
> > + * faster than doing 32-bit reads on architectures that support larger
> > + * reads.
> > + */
> > + while (len > 0) {
> > + csum += data;
> > + csum += csum < data;
> > + len -= sizeof(unsigned long);
> > + ptr += 1;
> > + data = *ptr;
> > + }
>
> I think you'd be better adding the 'carry' bits in a separate
> variable.
> It reduces the register dependency chain length in the loop.
> (Helps if the cpu can execute two instructions in one clock.)
>
> The masked misaligned data values are max 24 bits
> (if
>
> You'll also almost certainly remove at least one instruction
> from the loop by comparing against the end address rather than
> changing 'len'.
>
> So ending up with (something like):
> 	end = buff + length;
> 	...
> 	while (++ptr < end) {
> 		csum += data;
> 		carry += csum < data;
> 		data = ptr[-1];
> 	}
> (Although a do-while loop tends to generate better code
> and gcc will pretty much always make that transformation.)
>
> I think that is 4 instructions per word (load, add, cmp+set, add).
> In principle they could be completely pipelined and all
> execute (for different loop iterations) in the same clock.
> (But that is pretty unlikely to happen - even x86 isn't that good.)
> But taking two clocks is quite plausible.
> Plus 2 instructions per loop (inc, cmp+jmp).
> They might execute in parallel, but unrolling once
> may be required.
>
It looks like GCC actually ends up generating 7 total instructions:
ffffffff808d2acc: 97b6 add a5,a5,a3
ffffffff808d2ace: 00d7b533 sltu a0,a5,a3
ffffffff808d2ad2: 0721 add a4,a4,8
ffffffff808d2ad4: 86be mv a3,a5
ffffffff808d2ad6: 962a add a2,a2,a0
ffffffff808d2ad8: ff873783 ld a5,-8(a4)
ffffffff808d2adc: feb768e3 bltu a4,a1,ffffffff808d2acc <do_csum+0x34>

This mv instruction could be avoided if the registers were shuffled
around, but perhaps this way reduces some dependency chains.
> ...
> > + if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
> > + riscv_has_extension_likely(RISCV_ISA_EXT_ZBB)) {
> ...
> > + }
> > +end:
> > + return csum >> 16;
> > + }
>
> Is it really worth doing all that to save (I think) 4 instructions?
> (shift, shift, or with rotate twice).
> There is much more to be gained by careful inspection
> of the loop (even leaving it in C).
>

The main benefit was from using rev8 to replace swab32. However, now
that I am looking at the assembly in the kernel, it does not match the
asm I get from an out-of-kernel test case, so rev8 might not be
beneficial. I am going to have to look at this more to figure out what
is happening.

> > +
> > +#ifndef CONFIG_32BIT
> > + csum += (csum >> 32) | (csum << 32);
> > + csum >>= 32;
> > +#endif
> > + csum = (unsigned int)csum + (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));
>
> Use ror64() and ror32().
>
> David
>

Good idea.

- Charlie

> > + if (offset & 1)
> > + return (unsigned short)swab32(csum);
> > + return csum >> 16;
> > +}
> >
> > --
> > 2.42.0
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2023-09-19 08:05:28

by David Laight

Subject: RE: [PATCH v6 3/4] riscv: Add checksum library

...
> > So ending up with (something like):
> > 	end = buff + length;
> > 	...
> > 	while (++ptr < end) {
> > 		csum += data;
> > 		carry += csum < data;
> > 		data = ptr[-1];
> > 	}
> > (Although a do-while loop tends to generate better code
> > and gcc will pretty much always make that transformation.)
> >
> > I think that is 4 instructions per word (load, add, cmp+set, add).
> > In principle they could be completely pipelined and all
> > execute (for different loop iterations) in the same clock.
> > (But that is pretty unlikely to happen - even x86 isn't that good.)
> > But taking two clocks is quite plausible.
> > Plus 2 instructions per loop (inc, cmp+jmp).
> > They might execute in parallel, but unrolling once
> > may be required.
> >
> It looks like GCC actually ends up generating 7 total instructions:
> ffffffff808d2acc: 97b6 add a5,a5,a3
> ffffffff808d2ace: 00d7b533 sltu a0,a5,a3
> ffffffff808d2ad2: 0721 add a4,a4,8
> ffffffff808d2ad4: 86be mv a3,a5
> ffffffff808d2ad6: 962a add a2,a2,a0
> ffffffff808d2ad8: ff873783 ld a5,-8(a4)
> ffffffff808d2adc: feb768e3 bltu a4,a1,ffffffff808d2acc <do_csum+0x34>
>
> This mv instruction could be avoided if the registers were shuffled
> around, but perhaps this way reduces some dependency chains.

gcc managed to do 'data += csum' so had to add 'csum = data'.
If you unroll once that might go away.
It might then be 10 instructions for 16 bytes.
Although you then need slightly larger alignment code.
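
Something like this once-unrolled shape (a sketch continuing the
variables from the earlier loop; it assumes the alignment code leaves
an even number of words):

	do {
		unsigned long d0 = ptr[0], d1 = ptr[1];

		csum += d0;
		carry += csum < d0;
		csum += d1;
		carry += csum < d1;
		ptr += 2;
	} while (ptr < end);

That is eight instructions of work plus the pointer increment and
branch, i.e. roughly the 10 instructions per 16 bytes mentioned above.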

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-09-20 01:22:06

by Charlie Jenkins

Subject: Re: [PATCH v6 3/4] riscv: Add checksum library

On Tue, Sep 19, 2023 at 08:00:12AM +0000, David Laight wrote:
> ...
> > > So ending up with (something like):
> > > 	end = buff + length;
> > > 	...
> > > 	while (++ptr < end) {
> > > 		csum += data;
> > > 		carry += csum < data;
> > > 		data = ptr[-1];
> > > 	}
> > > (Although a do-while loop tends to generate better code
> > > and gcc will pretty much always make that transformation.)
> > >
> > > I think that is 4 instructions per word (load, add, cmp+set, add).
> > > In principle they could be completely pipelined and all
> > > execute (for different loop iterations) in the same clock.
> > > (But that is pretty unlikely to happen - even x86 isn't that good.)
> > > But taking two clocks is quite plausible.
> > > Plus 2 instructions per loop (inc, cmp+jmp).
> > > They might execute in parallel, but unrolling once
> > > may be required.
> > >
> > It looks like GCC actually ends up generating 7 total instructions:
> > ffffffff808d2acc: 97b6 add a5,a5,a3
> > ffffffff808d2ace: 00d7b533 sltu a0,a5,a3
> > ffffffff808d2ad2: 0721 add a4,a4,8
> > ffffffff808d2ad4: 86be mv a3,a5
> > ffffffff808d2ad6: 962a add a2,a2,a0
> > ffffffff808d2ad8: ff873783 ld a5,-8(a4)
> > ffffffff808d2adc: feb768e3 bltu a4,a1,ffffffff808d2acc <do_csum+0x34>
> >
> > This mv instruction could be avoided if the registers were shuffled
> > around, but perhaps this way reduces some dependency chains.
>
> gcc managed to do 'data += csum' so had to add 'csum = data'.
> If you unroll once that might go away.
> It might then be 10 instructions for 16 bytes.
> Although you then need slightly larger alignment code.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
I messed with it a bit and couldn't get the mv to go away. I would
expect the mv to be very cheap, so it should be fine. I would also like
to avoid adding too much to the alignment code since it is already
large, and I assume that buff will be aligned more often than not.

Interestingly, the mv does not appear before GCC 12, and does not
appear with clang.

- Charlie