2024-01-05 08:08:46

by Wang, Xiao W

Subject: [PATCH] riscv: Optimize crc32 with Zbc extension

As suggested by the B-ext spec, the Zbc (carry-less multiplication)
instructions can be used to accelerate CRC calculations. Currently,
crc32 is the most widely used CRC function inside the kernel, so this
patch focuses on optimizing just the crc32 APIs.

Compared with the current table-lookup based optimization, the Zbc based
optimization can also achieve a large stride in the CRC calculation
loop, while avoiding the memory access latency of the table lookup and
reducing the memory footprint.

If the Zbc extension is not supported in the runtime environment, the
table-lookup based implementation serves as the fallback via the
alternative mechanism. To avoid the performance penalty of unaligned
access, the fallback implementation is also used to handle the head and
tail bytes, if any; the main body is accelerated by the Zbc extension.

This patch has been tested on a QEMU VM with the kernel CRC32 selftest.

Signed-off-by: Xiao Wang <[email protected]>
---
arch/riscv/Kconfig | 23 +++++
arch/riscv/lib/Makefile | 1 +
arch/riscv/lib/crc32.c | 216 ++++++++++++++++++++++++++++++++++++++++
include/linux/crc32.h | 3 +
4 files changed, 243 insertions(+)
create mode 100644 arch/riscv/lib/crc32.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 95a2a06acc6a..d8b4fb697df2 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -549,6 +549,29 @@ config RISCV_ISA_ZBB

If you don't know what to do here, say Y.

+config TOOLCHAIN_HAS_ZBC
+ bool
+ default y
+ depends on !64BIT || $(cc-option,-mabi=lp64 -march=rv64ima_zbc)
+ depends on !32BIT || $(cc-option,-mabi=ilp32 -march=rv32ima_zbc)
+ depends on LLD_VERSION >= 150000 || LD_VERSION >= 23900
+ depends on AS_HAS_OPTION_ARCH
+
+config RISCV_ISA_ZBC
+ bool "Zbc extension support for carry-less multiplication instructions"
+ depends on TOOLCHAIN_HAS_ZBC
+ depends on MMU
+ depends on RISCV_ALTERNATIVE
+ default y
+ help
+ Adds support to dynamically detect the presence of the Zbc
+ extension (carry-less multiplication) and enable its usage.
+
+ The Zbc extension could accelerate CRC (cyclic redundancy check)
+ calculations.
+
+ If you don't know what to do here, say Y.
+
config RISCV_ISA_ZICBOM
bool "Zicbom extension support for non-coherent DMA operation"
depends on MMU
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 26cb2502ecf8..183bf2097d57 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -9,5 +9,6 @@ lib-y += strncmp.o
lib-$(CONFIG_MMU) += uaccess.o
lib-$(CONFIG_64BIT) += tishift.o
lib-$(CONFIG_RISCV_ISA_ZICBOZ) += clear_page.o
+lib-$(CONFIG_RISCV_ISA_ZBC) += crc32.o

obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/riscv/lib/crc32.c b/arch/riscv/lib/crc32.c
new file mode 100644
index 000000000000..1a24994df826
--- /dev/null
+++ b/arch/riscv/lib/crc32.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Accelerated CRC32 implementation with Zbc extension.
+ *
+ * Copyright (C) 2024 Intel Corporation
+ *
+ * Authors:
+ * Xiao Wang <[email protected]>
+ */
+
+#include <asm/hwcap.h>
+#include <asm/alternative-macros.h>
+#include <asm/byteorder.h>
+
+#include <linux/types.h>
+#include <linux/crc32poly.h>
+#include <linux/crc32.h>
+#include <linux/byteorder/generic.h>
+
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+
+#if (BITS_PER_LONG == 64)
+/* Slide by XLEN bits per iteration */
+# define STEP_ORDER 3
+
+/* Each below polynomial quotient has an implicit bit for 2^XLEN */
+
+/* Polynomial quotient of (2^(XLEN+32))/CRC32_POLY, in LE format */
+# define CRC32_POLY_QT_LE 0x5a72d812fb808b20
+
+/* Polynomial quotient of (2^(XLEN+32))/CRC32C_POLY, in LE format */
+# define CRC32C_POLY_QT_LE 0xa434f61c6f5389f8
+
+/* Polynomial quotient of (2^(XLEN+32))/CRC32_POLY, in BE format, it should be
+ * the same as the bit-reversed version of CRC32_POLY_QT_LE
+ */
+# define CRC32_POLY_QT_BE 0x04d101df481b4e5a
+#elif (BITS_PER_LONG == 32)
+# define STEP_ORDER 2
+/* Each quotient should match the upper half of its analog in RV64 */
+# define CRC32_POLY_QT_LE 0xfb808b20
+# define CRC32C_POLY_QT_LE 0x6f5389f8
+# define CRC32_POLY_QT_BE 0x04d101df
+#else
+# error "Unexpected BITS_PER_LONG"
+#endif
+
+#define STEP (1 << STEP_ORDER)
+#define OFFSET_MASK (STEP - 1)
+
+/*
+ * Refer to https://www.corsix.org/content/barrett-reduction-polynomials for
+ * better understanding of how this math works.
+ *
+ * let "+" denotes polynomial add (XOR)
+ * let "-" denotes polynomial sub (XOR)
+ * let "*" denotes polynomial multiplication
+ * let "/" denotes polynomial floor division
+ * let "S" denotes source data, XLEN bit wide
+ * let "P" denotes CRC32 polynomial
+ * let "T" denotes 2^(XLEN+32)
+ * let "QT" denotes quotient of T/P, with the bit for 2^XLEN being implicit
+ *
+ * crc32(S, P)
+ * => S * (2^32) - S * (2^32) / P * P
+ * => lowest 32 bits of: S * (2^32) / P * P
+ * => lowest 32 bits of: S * (2^32) * (T / P) / T * P
+ * => lowest 32 bits of: S * (2^32) * quotient / T * P
+ * => lowest 32 bits of: S * quotient / 2^XLEN * P
+ * => lowest 32 bits of: (clmul_high_part(S, QT) + S) * P
+ * => clmul_low_part(clmul_high_part(S, QT) + S, P)
+ *
+ * In terms of below implementations, the BE case is more intuitive, since the
+ * higher order bit sits at more significant position.
+ */
+
+typedef u32 (*fallback)(u32 crc, unsigned char const *p, size_t len);
+
+static inline u32 __pure crc32_le_generic(u32 crc, unsigned char const *p,
+#if (BITS_PER_LONG == 64)
+ size_t len, u32 poly, u64 poly_qt,
+#else
+ size_t len, u32 poly, u32 poly_qt,
+#endif
+ fallback crc_fb)
+{
+ size_t offset, head_len, tail_len;
+ const unsigned long *p_ul;
+ unsigned long s;
+
+ asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
+ RISCV_ISA_EXT_ZBC, 1)
+ : : : : legacy);
+
+ /* Handle the unalignment head. */
+ offset = (unsigned long)p & OFFSET_MASK;
+ if (offset) {
+ head_len = MIN(STEP - offset, len);
+ crc = crc_fb(crc, p, head_len);
+ len -= head_len;
+ p += head_len;
+ }
+
+ tail_len = len & OFFSET_MASK;
+ len = len >> STEP_ORDER;
+ p_ul = (unsigned long *)p;
+
+ for (int i = 0; i < len; i++) {
+#if (BITS_PER_LONG == 64)
+ s = (unsigned long)crc ^ __cpu_to_le64(*p_ul++);
+ /* We don't have a "clmulrh" insn, so use clmul + slli instead.
+ */
+ asm volatile (".option push\n"
+ ".option arch,+zbc\n"
+ "clmul %0, %1, %2\n"
+ "slli %0, %0, 1\n"
+ "xor %0, %0, %1\n"
+ "clmulr %0, %0, %3\n"
+ "srli %0, %0, 32\n"
+ ".option pop\n"
+ : "=&r" (crc)
+ : "r" (s),
+ "r" (poly_qt),
+ "r" ((u64)poly << 32)
+ :);
+#else
+ s = crc ^ __cpu_to_le32(*p_ul++);
+ /* We don't have a "clmulrh" insn, so use clmul + slli instead.
+ */
+ asm volatile (".option push\n"
+ ".option arch,+zbc\n"
+ "clmul %0, %1, %2\n"
+ "slli %0, %0, 1\n"
+ "xor %0, %0, %1\n"
+ "clmulr %0, %0, %3\n"
+ ".option pop\n"
+ : "=&r" (crc)
+ : "r" (s),
+ "r" (poly_qt),
+ "r" (poly)
+ :);
+#endif
+ }
+
+ /* Handle the tail bytes. */
+ if (tail_len)
+ crc = crc_fb(crc, (unsigned char const *)p_ul, tail_len);
+ return crc;
+
+legacy:
+ return crc_fb(crc, p, len);
+}
+
+u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
+{
+ return crc32_le_generic(crc, p, len, CRC32_POLY_LE, CRC32_POLY_QT_LE,
+ crc32_le_base);
+}
+
+u32 __pure __crc32c_le(u32 crc, unsigned char const *p, size_t len)
+{
+ return crc32_le_generic(crc, p, len, CRC32C_POLY_LE,
+ CRC32C_POLY_QT_LE, __crc32c_le_base);
+}
+
+u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
+{
+ size_t offset, head_len, tail_len;
+ const unsigned long *p_ul;
+ unsigned long s;
+
+ asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
+ RISCV_ISA_EXT_ZBC, 1)
+ : : : : legacy);
+
+ /* Handle the unalignment head. */
+ offset = (unsigned long)p & OFFSET_MASK;
+ if (offset) {
+ head_len = MIN(STEP - offset, len);
+ crc = crc32_be_base(crc, p, head_len);
+ len -= head_len;
+ p += head_len;
+ }
+
+ tail_len = len & OFFSET_MASK;
+ len = len >> STEP_ORDER;
+ p_ul = (unsigned long *)p;
+
+ for (int i = 0; i < len; i++) {
+#if (BITS_PER_LONG == 64)
+ s = (unsigned long)crc << 32;
+ s ^= __cpu_to_be64(*p_ul++);
+#else
+ s = crc ^ __cpu_to_be32(*p_ul++);
+#endif
+ asm volatile (".option push\n"
+ ".option arch,+zbc\n"
+ "clmulh %0, %1, %2\n"
+ "xor %0, %0, %1\n"
+ "clmul %0, %0, %3\n"
+ ".option pop\n"
+ : "=&r" (crc)
+ : "r" (s),
+ "r" (CRC32_POLY_QT_BE),
+ "r" (CRC32_POLY_BE)
+ :);
+ }
+
+ /* Handle the tail bytes. */
+ if (tail_len)
+ crc = crc32_be_base(crc, (unsigned char const *)p_ul, tail_len);
+ return crc;
+
+legacy:
+ return crc32_be_base(crc, p, len);
+}
diff --git a/include/linux/crc32.h b/include/linux/crc32.h
index 9e8a032c1788..87f788c0d607 100644
--- a/include/linux/crc32.h
+++ b/include/linux/crc32.h
@@ -9,7 +9,9 @@
#include <linux/bitrev.h>

u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len);
+u32 __pure crc32_le_base(u32 crc, unsigned char const *p, size_t len);
u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len);
+u32 __pure crc32_be_base(u32 crc, unsigned char const *p, size_t len);

/**
* crc32_le_combine - Combine two crc32 check values into one. For two
@@ -37,6 +39,7 @@ static inline u32 crc32_le_combine(u32 crc1, u32 crc2, size_t len2)
}

u32 __pure __crc32c_le(u32 crc, unsigned char const *p, size_t len);
+u32 __pure __crc32c_le_base(u32 crc, unsigned char const *p, size_t len);

/**
* __crc32c_le_combine - Combine two crc32c check values into one. For two
--
2.25.1



2024-01-08 14:26:40

by kernel test robot

Subject: Re: [PATCH] riscv: Optimize crc32 with Zbc extension

Hi Xiao,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.7 next-20240108]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Xiao-Wang/riscv-Optimize-crc32-with-Zbc-extension/20240105-161053
base: linus/master
patch link: https://lore.kernel.org/r/20240105080830.3738117-1-xiao.w.wang%40intel.com
patch subject: [PATCH] riscv: Optimize crc32 with Zbc extension
config: riscv-randconfig-r121-20240106 (https://download.01.org/0day-ci/archive/20240108/[email protected]/config)
compiler: riscv64-linux-gcc (GCC) 13.2.0
reproduce: (https://download.01.org/0day-ci/archive/20240108/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

sparse warnings: (new ones prefixed by >>)
>> arch/riscv/lib/crc32.c:192:19: sparse: sparse: invalid assignment: ^=
arch/riscv/lib/crc32.c:192:19: sparse: left side has type unsigned long
arch/riscv/lib/crc32.c:192:19: sparse: right side has type restricted __be64
>> arch/riscv/lib/crc32.c:110:42: sparse: sparse: restricted __le64 degrades to integer
>> arch/riscv/lib/crc32.c:110:42: sparse: sparse: restricted __le64 degrades to integer

vim +192 arch/riscv/lib/crc32.c

78
79 static inline u32 __pure crc32_le_generic(u32 crc, unsigned char const *p,
80 #if (BITS_PER_LONG == 64)
81 size_t len, u32 poly, u64 poly_qt,
82 #else
83 size_t len, u32 poly, u32 poly_qt,
84 #endif
85 fallback crc_fb)
86 {
87 size_t offset, head_len, tail_len;
88 const unsigned long *p_ul;
89 unsigned long s;
90
91 asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
92 RISCV_ISA_EXT_ZBC, 1)
93 : : : : legacy);
94
95 /* Handle the unalignment head. */
96 offset = (unsigned long)p & OFFSET_MASK;
97 if (offset) {
98 head_len = MIN(STEP - offset, len);
99 crc = crc_fb(crc, p, head_len);
100 len -= head_len;
101 p += head_len;
102 }
103
104 tail_len = len & OFFSET_MASK;
105 len = len >> STEP_ORDER;
106 p_ul = (unsigned long *)p;
107
108 for (int i = 0; i < len; i++) {
109 #if (BITS_PER_LONG == 64)
> 110 s = (unsigned long)crc ^ __cpu_to_le64(*p_ul++);
111 /* We don't have a "clmulrh" insn, so use clmul + slli instead.
112 */
113 asm volatile (".option push\n"
114 ".option arch,+zbc\n"
115 "clmul %0, %1, %2\n"
116 "slli %0, %0, 1\n"
117 "xor %0, %0, %1\n"
118 "clmulr %0, %0, %3\n"
119 "srli %0, %0, 32\n"
120 ".option pop\n"
121 : "=&r" (crc)
122 : "r" (s),
123 "r" (poly_qt),
124 "r" ((u64)poly << 32)
125 :);
126 #else
127 s = crc ^ __cpu_to_le32(*p_ul++);
128 /* We don't have a "clmulrh" insn, so use clmul + slli instead.
129 */
130 asm volatile (".option push\n"
131 ".option arch,+zbc\n"
132 "clmul %0, %1, %2\n"
133 "slli %0, %0, 1\n"
134 "xor %0, %0, %1\n"
135 "clmulr %0, %0, %3\n"
136 ".option pop\n"
137 : "=&r" (crc)
138 : "r" (s),
139 "r" (poly_qt),
140 "r" (poly)
141 :);
142 #endif
143 }
144
145 /* Handle the tail bytes. */
146 if (tail_len)
147 crc = crc_fb(crc, (unsigned char const *)p_ul, tail_len);
148 return crc;
149
150 legacy:
151 return crc_fb(crc, p, len);
152 }
153
154 u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
155 {
156 return crc32_le_generic(crc, p, len, CRC32_POLY_LE, CRC32_POLY_QT_LE,
157 crc32_le_base);
158 }
159
160 u32 __pure __crc32c_le(u32 crc, unsigned char const *p, size_t len)
161 {
162 return crc32_le_generic(crc, p, len, CRC32C_POLY_LE,
163 CRC32C_POLY_QT_LE, __crc32c_le_base);
164 }
165
166 u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
167 {
168 size_t offset, head_len, tail_len;
169 const unsigned long *p_ul;
170 unsigned long s;
171
172 asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
173 RISCV_ISA_EXT_ZBC, 1)
174 : : : : legacy);
175
176 /* Handle the unalignment head. */
177 offset = (unsigned long)p & OFFSET_MASK;
178 if (offset) {
179 head_len = MIN(STEP - offset, len);
180 crc = crc32_be_base(crc, p, head_len);
181 len -= head_len;
182 p += head_len;
183 }
184
185 tail_len = len & OFFSET_MASK;
186 len = len >> STEP_ORDER;
187 p_ul = (unsigned long *)p;
188
189 for (int i = 0; i < len; i++) {
190 #if (BITS_PER_LONG == 64)
191 s = (unsigned long)crc << 32;
> 192 s ^= __cpu_to_be64(*p_ul++);

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-01-16 15:04:47

by Andrew Jones

Subject: Re: [PATCH] riscv: Optimize crc32 with Zbc extension

On Fri, Jan 05, 2024 at 04:08:30PM +0800, Xiao Wang wrote:
> As suggested by the B-ext spec, the Zbc (carry-less multiplication)
> instructions can be used to accelerate CRC calculations. Currently,
> crc32 is the most widely used CRC function inside the kernel, so this
> patch focuses on optimizing just the crc32 APIs.
>
> Compared with the current table-lookup based optimization, the Zbc based
> optimization can also achieve a large stride in the CRC calculation
> loop, while avoiding the memory access latency of the table lookup and
> reducing the memory footprint.
>
> If the Zbc extension is not supported in the runtime environment, the
> table-lookup based implementation serves as the fallback via the
> alternative mechanism. To avoid the performance penalty of unaligned
> access, the fallback implementation is also used to handle the head and
> tail bytes, if any; the main body is accelerated by the Zbc extension.
>
> This patch has been tested on a QEMU VM with the kernel CRC32 selftest.

Ideally we'd also have the results of some benchmarks, or at least a
dynamic instruction count or something, but I don't suspect that the
table-lookup would "win" over Zbc. The bigger question is whether we
want three implementations, because I assume we want Zvbc, particularly
since [1] says

"""
Zvbc is listed as a development option for use in other algorithms,
and will become mandatory. Scalar Zbc is now listed as an expansion
option, i.e., it will probably not become mandatory.
"""

[1] https://github.com/riscv/riscv-profiles/blob/main/rva23-profile.adoc


A few nits below.

>
> Signed-off-by: Xiao Wang <[email protected]>
> ---
> arch/riscv/Kconfig | 23 +++++
> arch/riscv/lib/Makefile | 1 +
> arch/riscv/lib/crc32.c | 216 ++++++++++++++++++++++++++++++++++++++++
> include/linux/crc32.h | 3 +
> 4 files changed, 243 insertions(+)
> create mode 100644 arch/riscv/lib/crc32.c
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 95a2a06acc6a..d8b4fb697df2 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -549,6 +549,29 @@ config RISCV_ISA_ZBB
>
> If you don't know what to do here, say Y.
>
> +config TOOLCHAIN_HAS_ZBC
> + bool
> + default y
> + depends on !64BIT || $(cc-option,-mabi=lp64 -march=rv64ima_zbc)
> + depends on !32BIT || $(cc-option,-mabi=ilp32 -march=rv32ima_zbc)
> + depends on LLD_VERSION >= 150000 || LD_VERSION >= 23900
> + depends on AS_HAS_OPTION_ARCH
> +
> +config RISCV_ISA_ZBC
> + bool "Zbc extension support for carry-less multiplication instructions"
> + depends on TOOLCHAIN_HAS_ZBC
> + depends on MMU
> + depends on RISCV_ALTERNATIVE
> + default y
> + help
> + Adds support to dynamically detect the presence of the Zbc
> + extension (carry-less multiplication) and enable its usage.
> +
> + The Zbc extension could accelerate CRC (cyclic redundancy check)
> + calculations.
> +
> + If you don't know what to do here, say Y.
> +
> config RISCV_ISA_ZICBOM
> bool "Zicbom extension support for non-coherent DMA operation"
> depends on MMU
> diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
> index 26cb2502ecf8..183bf2097d57 100644
> --- a/arch/riscv/lib/Makefile
> +++ b/arch/riscv/lib/Makefile
> @@ -9,5 +9,6 @@ lib-y += strncmp.o
> lib-$(CONFIG_MMU) += uaccess.o
> lib-$(CONFIG_64BIT) += tishift.o
> lib-$(CONFIG_RISCV_ISA_ZICBOZ) += clear_page.o
> +lib-$(CONFIG_RISCV_ISA_ZBC) += crc32.o
>
> obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> diff --git a/arch/riscv/lib/crc32.c b/arch/riscv/lib/crc32.c
> new file mode 100644
> index 000000000000..1a24994df826
> --- /dev/null
> +++ b/arch/riscv/lib/crc32.c
> @@ -0,0 +1,216 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Accelerated CRC32 implementation with Zbc extension.
> + *
> + * Copyright (C) 2024 Intel Corporation
> + *
> + * Authors:
> + * Xiao Wang <[email protected]>
> + */
> +
> +#include <asm/hwcap.h>
> +#include <asm/alternative-macros.h>
> +#include <asm/byteorder.h>
> +
> +#include <linux/types.h>
> +#include <linux/crc32poly.h>
> +#include <linux/crc32.h>
> +#include <linux/byteorder/generic.h>
> +
> +#define MIN(a, b) ((a) < (b) ? (a) : (b))

min() from linux/minmax.h?
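i.e. #include <linux/minmax.h> and then head_len = min(STEP - offset, len);
should work here, I think, since both sides end up as size_t.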

> +
> +#if (BITS_PER_LONG == 64)

I think riscv prefers

#if __riscv_xlen == 64

In any case, the () are unnecessary.

> +/* Slide by XLEN bits per iteration */
> +# define STEP_ORDER 3
> +
> +/* Each below polynomial quotient has an implicit bit for 2^XLEN */
> +
> +/* Polynomial quotient of (2^(XLEN+32))/CRC32_POLY, in LE format */
> +# define CRC32_POLY_QT_LE 0x5a72d812fb808b20
> +
> +/* Polynomial quotient of (2^(XLEN+32))/CRC32C_POLY, in LE format */
> +# define CRC32C_POLY_QT_LE 0xa434f61c6f5389f8
> +
> +/* Polynomial quotient of (2^(XLEN+32))/CRC32_POLY, in BE format, it should be
> + * the same as the bit-reversed version of CRC32_POLY_QT_LE
> + */
> +# define CRC32_POLY_QT_BE 0x04d101df481b4e5a
> +#elif (BITS_PER_LONG == 32)
> +# define STEP_ORDER 2
> +/* Each quotient should match the upper half of its analog in RV64 */
> +# define CRC32_POLY_QT_LE 0xfb808b20
> +# define CRC32C_POLY_QT_LE 0x6f5389f8
> +# define CRC32_POLY_QT_BE 0x04d101df
> +#else
> +# error "Unexpected BITS_PER_LONG"
> +#endif
> +
> +#define STEP (1 << STEP_ORDER)
> +#define OFFSET_MASK (STEP - 1)

Should tab out the values of these defines so they line up.
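i.e. something like

  #define STEP        (1 << STEP_ORDER)
  #define OFFSET_MASK (STEP - 1)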

> +
> +/*
> + * Refer to https://www.corsix.org/content/barrett-reduction-polynomials for
> + * better understanding of how this math works.

Thanks for this pointer. I read it, but I didn't digest it enough to
understand it, which means my review isn't confirming the math is
correct.

> + *
> + * let "+" denotes polynomial add (XOR)
> + * let "-" denotes polynomial sub (XOR)
> + * let "*" denotes polynomial multiplication
> + * let "/" denotes polynomial floor division
> + * let "S" denotes source data, XLEN bit wide
> + * let "P" denotes CRC32 polynomial
> + * let "T" denotes 2^(XLEN+32)
> + * let "QT" denotes quotient of T/P, with the bit for 2^XLEN being implicit
> + *
> + * crc32(S, P)
> + * => S * (2^32) - S * (2^32) / P * P
> + * => lowest 32 bits of: S * (2^32) / P * P
> + * => lowest 32 bits of: S * (2^32) * (T / P) / T * P
> + * => lowest 32 bits of: S * (2^32) * quotient / T * P
> + * => lowest 32 bits of: S * quotient / 2^XLEN * P
> + * => lowest 32 bits of: (clmul_high_part(S, QT) + S) * P
> + * => clmul_low_part(clmul_high_part(S, QT) + S, P)
> + *
> + * In terms of below implementations, the BE case is more intuitive, since the
> + * higher order bit sits at more significant position.
> + */
> +
> +typedef u32 (*fallback)(u32 crc, unsigned char const *p, size_t len);
> +
> +static inline u32 __pure crc32_le_generic(u32 crc, unsigned char const *p,
> +#if (BITS_PER_LONG == 64)
> + size_t len, u32 poly, u64 poly_qt,
> +#else
> + size_t len, u32 poly, u32 poly_qt,
> +#endif

How about creating a new type for poly_qt, defined as u64 for xlen=64
and u32 for xlen=32 to avoid the #ifdef?
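Something like this, perhaps (untested, type name made up):

  #if (BITS_PER_LONG == 64)
  typedef u64 poly_qt_t;
  #else
  typedef u32 poly_qt_t;
  #endif

and then the prototype just takes "poly_qt_t poly_qt".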

> + fallback crc_fb)
> +{
> + size_t offset, head_len, tail_len;
> + const unsigned long *p_ul;
> + unsigned long s;
> +
> + asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> + RISCV_ISA_EXT_ZBC, 1)
> + : : : : legacy);
> +
> + /* Handle the unalignment head. */
> + offset = (unsigned long)p & OFFSET_MASK;
> + if (offset) {
> + head_len = MIN(STEP - offset, len);
> + crc = crc_fb(crc, p, head_len);
> + len -= head_len;
> + p += head_len;
> + }
> +
> + tail_len = len & OFFSET_MASK;
> + len = len >> STEP_ORDER;
> + p_ul = (unsigned long *)p;
> +
> + for (int i = 0; i < len; i++) {
> +#if (BITS_PER_LONG == 64)
> + s = (unsigned long)crc ^ __cpu_to_le64(*p_ul++);
> + /* We don't have a "clmulrh" insn, so use clmul + slli instead.
> + */

need opening /* comment wing

> + asm volatile (".option push\n"
> + ".option arch,+zbc\n"
> + "clmul %0, %1, %2\n"
> + "slli %0, %0, 1\n"
> + "xor %0, %0, %1\n"
> + "clmulr %0, %0, %3\n"
> + "srli %0, %0, 32\n"
> + ".option pop\n"
> + : "=&r" (crc)
> + : "r" (s),
> + "r" (poly_qt),
> + "r" ((u64)poly << 32)
> + :);
> +#else
> + s = crc ^ __cpu_to_le32(*p_ul++);
> + /* We don't have a "clmulrh" insn, so use clmul + slli instead.

also here

> + */
> + asm volatile (".option push\n"
> + ".option arch,+zbc\n"
> + "clmul %0, %1, %2\n"
> + "slli %0, %0, 1\n"
> + "xor %0, %0, %1\n"
> + "clmulr %0, %0, %3\n"
> + ".option pop\n"
> + : "=&r" (crc)
> + : "r" (s),
> + "r" (poly_qt),
> + "r" (poly)
> + :);
> +#endif

Instead of the #ifdef here, we could provide function wrappers for the asm
which would be defined above in the first #ifdef ladder.
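
For example (rough, untested sketch, helper name made up), the 64-bit branch
of the #ifdef ladder at the top could define

  static inline unsigned long crc32_le_zbc(unsigned long s, u32 poly,
                                           unsigned long poly_qt)
  {
        unsigned long crc;

        /* Same clmul sequence as the loop body below, just wrapped. */
        asm volatile (".option push\n"
                      ".option arch,+zbc\n"
                      "clmul   %0, %1, %2\n"
                      "slli    %0, %0, 1\n"
                      "xor     %0, %0, %1\n"
                      "clmulr  %0, %0, %3\n"
                      "srli    %0, %0, 32\n"
                      ".option pop\n"
                      : "=&r" (crc)
                      : "r" (s), "r" (poly_qt), "r" ((u64)poly << 32)
                      :);
        return crc;
  }

with a 32-bit variant (no trailing srli, plain poly operand) in the other
branch, so the loop body just does crc = crc32_le_zbc(s, poly, poly_qt);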

> + }
> +
> + /* Handle the tail bytes. */
> + if (tail_len)
> + crc = crc_fb(crc, (unsigned char const *)p_ul, tail_len);
> + return crc;
> +
> +legacy:
> + return crc_fb(crc, p, len);
> +}
> +
> +u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
> +{
> + return crc32_le_generic(crc, p, len, CRC32_POLY_LE, CRC32_POLY_QT_LE,
> + crc32_le_base);
> +}
> +
> +u32 __pure __crc32c_le(u32 crc, unsigned char const *p, size_t len)
> +{
> + return crc32_le_generic(crc, p, len, CRC32C_POLY_LE,
> + CRC32C_POLY_QT_LE, __crc32c_le_base);
> +}
> +
> +u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
> +{
> + size_t offset, head_len, tail_len;
> + const unsigned long *p_ul;
> + unsigned long s;
> +
> + asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> + RISCV_ISA_EXT_ZBC, 1)
> + : : : : legacy);
> +
> + /* Handle the unalignment head. */
> + offset = (unsigned long)p & OFFSET_MASK;
> + if (offset) {
> + head_len = MIN(STEP - offset, len);
> + crc = crc32_be_base(crc, p, head_len);
> + len -= head_len;
> + p += head_len;
> + }
> +
> + tail_len = len & OFFSET_MASK;
> + len = len >> STEP_ORDER;
> + p_ul = (unsigned long *)p;
> +
> + for (int i = 0; i < len; i++) {
> +#if (BITS_PER_LONG == 64)
> + s = (unsigned long)crc << 32;
> + s ^= __cpu_to_be64(*p_ul++);
> +#else
> + s = crc ^ __cpu_to_be32(*p_ul++);
> +#endif

Could write the above without #ifdef with

if (IS_ENABLED(CONFIG_64BIT)) {
...
} else {
...
}
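
i.e. (untested, just folding the two existing branches into the if/else):

	if (IS_ENABLED(CONFIG_64BIT)) {
		s = (unsigned long)crc << 32;
		s ^= __cpu_to_be64(*p_ul++);
	} else {
		s = crc ^ __cpu_to_be32(*p_ul++);
	}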

> + asm volatile (".option push\n"
> + ".option arch,+zbc\n"
> + "clmulh %0, %1, %2\n"
> + "xor %0, %0, %1\n"
> + "clmul %0, %0, %3\n"
> + ".option pop\n"
> + : "=&r" (crc)
> + : "r" (s),
> + "r" (CRC32_POLY_QT_BE),
> + "r" (CRC32_POLY_BE)
> + :);
> + }
> +
> + /* Handle the tail bytes. */
> + if (tail_len)
> + crc = crc32_be_base(crc, (unsigned char const *)p_ul, tail_len);
> + return crc;
> +
> +legacy:
> + return crc32_be_base(crc, p, len);
> +}
> diff --git a/include/linux/crc32.h b/include/linux/crc32.h
> index 9e8a032c1788..87f788c0d607 100644
> --- a/include/linux/crc32.h
> +++ b/include/linux/crc32.h
> @@ -9,7 +9,9 @@
> #include <linux/bitrev.h>
>
> u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len);
> +u32 __pure crc32_le_base(u32 crc, unsigned char const *p, size_t len);
> u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len);
> +u32 __pure crc32_be_base(u32 crc, unsigned char const *p, size_t len);
>
> /**
> * crc32_le_combine - Combine two crc32 check values into one. For two
> @@ -37,6 +39,7 @@ static inline u32 crc32_le_combine(u32 crc1, u32 crc2, size_t len2)
> }
>
> u32 __pure __crc32c_le(u32 crc, unsigned char const *p, size_t len);
> +u32 __pure __crc32c_le_base(u32 crc, unsigned char const *p, size_t len);
>
> /**
> * __crc32c_le_combine - Combine two crc32c check values into one. For two
> --
> 2.25.1
>

Thanks,
drew

2024-01-16 17:17:56

by David Laight

Subject: RE: [PATCH] riscv: Optimize crc32 with Zbc extension

..
> > +static inline u32 __pure crc32_le_generic(u32 crc, unsigned char const *p,
> > +#if (BITS_PER_LONG == 64)
> > + size_t len, u32 poly, u64 poly_qt,
> > +#else
> > + size_t len, u32 poly, u32 poly_qt,
> > +#endif
>
> How about creating a new type for poly_qt, defined as u64 for xlen=64
> and u32 for xlen=32 to avoid the #ifdef?

unsigned long ?

..
> > + for (int i = 0; i < len; i++) {
> > +#if (BITS_PER_LONG == 64)
> > + s = (unsigned long)crc << 32;
> > + s ^= __cpu_to_be64(*p_ul++);
> > +#else
> > + s = crc ^ __cpu_to_be32(*p_ul++);
> > +#endif
>
> Could write the above without #ifdef with

Haven't I seen a bpf patch that rather implies that byteswap
is likely to be truly horrid?

I've not tried to parse the crc code (although I do understand
how it should work). But I'm surprised you need a byteswap.

After all, the crc is basically a long division of the buffer
by the crc constant.

The CRC I've done recently is the hdlc crc-16.
My nios version (also mips-like) has:

static __inline__ uint32_t
crc_step(uint32_t crc, uint32_t byte_val)
{
#if defined(crc_step_ci)
return crc_step_ci(byte_val, crc);
#else
uint32_t t = crc ^ (byte_val & 0xff);
t = (t ^ t << 4) & 0xff;
return crc >> 8 ^ t << 8 ^ t << 3 ^ t >> 4;
#endif
}

I normally use a custom instruction for the logic - one clock.
But the C code is only a couple of clocks slower than the best
table lookup version.
On anything pipelined and multi-issue the C code is likely to
be faster than a lookup table!
I don't know if any of the 32-bit CRCs can be reduced the same way.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2024-01-24 05:45:15

by Wang, Xiao W

Subject: RE: [PATCH] riscv: Optimize crc32 with Zbc extension

Hi Andrew,

> -----Original Message-----
> From: Andrew Jones <[email protected]>
> Sent: Tuesday, January 16, 2024 11:04 PM
> To: Wang, Xiao W <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; Li,
> Haicheng <[email protected]>; [email protected]; linux-
> [email protected]
> Subject: Re: [PATCH] riscv: Optimize crc32 with Zbc extension
>
> On Fri, Jan 05, 2024 at 04:08:30PM +0800, Xiao Wang wrote:
> > As suggested by the B-ext spec, the Zbc (carry-less multiplication)
> > instructions can be used to accelerate CRC calculations. Currently,
> > crc32 is the most widely used CRC function inside the kernel, so this
> > patch focuses on optimizing just the crc32 APIs.
> >
> > Compared with the current table-lookup based optimization, the Zbc based
> > optimization can also achieve a large stride in the CRC calculation
> > loop, while avoiding the memory access latency of the table lookup and
> > reducing the memory footprint.
> >
> > If the Zbc extension is not supported in the runtime environment, the
> > table-lookup based implementation serves as the fallback via the
> > alternative mechanism. To avoid the performance penalty of unaligned
> > access, the fallback implementation is also used to handle the head and
> > tail bytes, if any; the main body is accelerated by the Zbc extension.
> >
> > This patch has been tested on a QEMU VM with the kernel CRC32 selftest.
>
> Ideally we'd also have the results of some benchmarks, or at least a
> dynamic instruction count or something, but I don't suspect that the
> table-lookup would "win" over Zbc. The bigger question is whether we
> want three implementations, because I assume we want Zvbc, particularly
> since [1] says
>
> """
> Zvbc is listed as a development option for use in other algorithms,
> and will become mandatory. Scalar Zbc is now listed as an expansion
> option, i.e., it will probably not become mandatory.
> """
>
> [1] https://github.com/riscv/riscv-profiles/blob/main/rva23-profile.adoc

I will collect the instruction count change and include the info in the commit log.
Regarding the bigger question, I have some thoughts:

- It looks like this question also applies to some other cases like lib/str*, which is
currently accelerated by the Zbb extension; since in-kernel RVV support has been enabled,
we may also use RVV to accelerate lib/str*.
The ALTERNATIVE_2 mechanism allows a second code patching on the same location, so
two alternative versions of the same function can be supported.

- At the moment, I can't figure out how Zvbc could be used to better accelerate these
crc32 APIs, which take just one data stream as input. I have a rough idea for a batched
version of the API which takes multiple data streams as input (a parallel implementation
of the Zbc acceleration below), but not for these 3 APIs here.
BTW, a vector "clmulr" insn is not listed in the Zvbc spec, but it would be useful in the
implementation below.
https://github.com/riscv/riscv-crypto/blob/main/doc/vector/riscv-crypto-vector-zvbc.adoc
Anyway, if we come up with a Zvbc based optimization in the future, then ALTERNATIVE_2
would help.

>
>
> A few nits below.

Thanks for all the above & below comments. I will adopt most of the suggestions, except
two; please check below.

>
> >
> > Signed-off-by: Xiao Wang <[email protected]>
> > ---
> > arch/riscv/Kconfig | 23 +++++
> > arch/riscv/lib/Makefile | 1 +
> > arch/riscv/lib/crc32.c | 216
> ++++++++++++++++++++++++++++++++++++++++
> > include/linux/crc32.h | 3 +
> > 4 files changed, 243 insertions(+)
> > create mode 100644 arch/riscv/lib/crc32.c
> >
[...]

> > +#if (BITS_PER_LONG == 64)
> > + size_t len, u32 poly, u64 poly_qt,
> > +#else
> > + size_t len, u32 poly, u32 poly_qt,
> > +#endif
>
> How about creating a new type for poly_qt, defined as u64 for xlen=64
> and u32 for xlen=32 to avoid the #ifdef?

I would make it "unsigned long", as per David's comment.

>
> > + fallback crc_fb)
> > +{
> > + size_t offset, head_len, tail_len;
> > + const unsigned long *p_ul;
> > + unsigned long s;
> > +
> > + asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> > + RISCV_ISA_EXT_ZBC, 1)
> > + : : : : legacy);
> > +
> > + /* Handle the unalignment head. */
> > + offset = (unsigned long)p & OFFSET_MASK;
> > + if (offset) {
> > + head_len = MIN(STEP - offset, len);
> > + crc = crc_fb(crc, p, head_len);
> > + len -= head_len;
> > + p += head_len;
> > + }
> > +
> > + tail_len = len & OFFSET_MASK;
> > + len = len >> STEP_ORDER;
> > + p_ul = (unsigned long *)p;
> > +
> > + for (int i = 0; i < len; i++) {
> > +#if (BITS_PER_LONG == 64)
> > + s = (unsigned long)crc ^ __cpu_to_le64(*p_ul++);
> > + /* We don't have a "clmulrh" insn, so use clmul + slli instead.
> > + */
>
> need opening /* comment wing
>
> > + asm volatile (".option push\n"
> > + ".option arch,+zbc\n"
> > + "clmul %0, %1, %2\n"
> > + "slli %0, %0, 1\n"
> > + "xor %0, %0, %1\n"
> > + "clmulr %0, %0, %3\n"
> > + "srli %0, %0, 32\n"
> > + ".option pop\n"
> > + : "=&r" (crc)
> > + : "r" (s),
> > + "r" (poly_qt),
> > + "r" ((u64)poly << 32)
> > + :);
> > +#else
> > + s = crc ^ __cpu_to_le32(*p_ul++);
> > + /* We don't have a "clmulrh" insn, so use clmul + slli instead.
>
> also here
>
> > + */
> > + asm volatile (".option push\n"
> > + ".option arch,+zbc\n"
> > + "clmul %0, %1, %2\n"
> > + "slli %0, %0, 1\n"
> > + "xor %0, %0, %1\n"
> > + "clmulr %0, %0, %3\n"
> > + ".option pop\n"
> > + : "=&r" (crc)
> > + : "r" (s),
> > + "r" (poly_qt),
> > + "r" (poly)
> > + :);
> > +#endif
>
> Instead of the #ifdef here, we could provide function wrappers for the asm
> which would be defined above in the first #ifdef ladder.
>
> > + }
> > +
> > + /* Handle the tail bytes. */
> > + if (tail_len)
> > + crc = crc_fb(crc, (unsigned char const *)p_ul, tail_len);
> > + return crc;
> > +
> > +legacy:
> > + return crc_fb(crc, p, len);
> > +}
> > +
> > +u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
> > +{
> > + return crc32_le_generic(crc, p, len, CRC32_POLY_LE,
> CRC32_POLY_QT_LE,
> > + crc32_le_base);
> > +}
> > +
> > +u32 __pure __crc32c_le(u32 crc, unsigned char const *p, size_t len)
> > +{
> > + return crc32_le_generic(crc, p, len, CRC32C_POLY_LE,
> > + CRC32C_POLY_QT_LE, __crc32c_le_base);
> > +}
> > +
> > +u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
> > +{
> > + size_t offset, head_len, tail_len;
> > + const unsigned long *p_ul;
> > + unsigned long s;
> > +
> > + asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
> > + RISCV_ISA_EXT_ZBC, 1)
> > + : : : : legacy);
> > +
> > + /* Handle the unalignment head. */
> > + offset = (unsigned long)p & OFFSET_MASK;
> > + if (offset) {
> > + head_len = MIN(STEP - offset, len);
> > + crc = crc32_be_base(crc, p, head_len);
> > + len -= head_len;
> > + p += head_len;
> > + }
> > +
> > + tail_len = len & OFFSET_MASK;
> > + len = len >> STEP_ORDER;
> > + p_ul = (unsigned long *)p;
> > +
> > + for (int i = 0; i < len; i++) {
> > +#if (BITS_PER_LONG == 64)
> > + s = (unsigned long)crc << 32;
> > + s ^= __cpu_to_be64(*p_ul++);
> > +#else
> > + s = crc ^ __cpu_to_be32(*p_ul++);
> > +#endif
>
> Could write the above without #ifdef with
>
> if (IS_ENABLED(CONFIG_64BIT)) {
> ...
> } else {
> ...
> }

I got a warning for the riscv32 build with this change (gcc v12.2.0):
./arch/riscv/lib/crc32.c:191:40: warning: left shift count >= width of type [-Wshift-count-overflow]
It looks like the compiler still takes the never-taken code branch into consideration.

BRs,
Xiao


2024-01-24 07:22:12

by Wang, Xiao W

Subject: RE: [PATCH] riscv: Optimize crc32 with Zbc extension

Hi David,

> -----Original Message-----
> From: David Laight <[email protected]>
> Sent: Wednesday, January 17, 2024 1:16 AM
> To: 'Andrew Jones' <[email protected]>; Wang, Xiao W
> <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; Li,
> Haicheng <[email protected]>; [email protected]; linux-
> [email protected]
> Subject: RE: [PATCH] riscv: Optimize crc32 with Zbc extension
>
> ...
> > > +static inline u32 __pure crc32_le_generic(u32 crc, unsigned char const *p,
> > > +#if (BITS_PER_LONG == 64)
> > > + size_t len, u32 poly, u64 poly_qt,
> > > +#else
> > > + size_t len, u32 poly, u32 poly_qt,
> > > +#endif
> >
> > How about creating a new type for poly_qt, defined as u64 for xlen=64
> > and u32 for xlen=32 to avoid the #ifdef?
>
> unsigned long ?

Yes, it's simpler. Thanks.

>
> ...
> > > + for (int i = 0; i < len; i++) {
> > > +#if (BITS_PER_LONG == 64)
> > > + s = (unsigned long)crc << 32;
> > > + s ^= __cpu_to_be64(*p_ul++);
> > > +#else
> > > + s = crc ^ __cpu_to_be32(*p_ul++);
> > > +#endif
> >
> > Could write the above without #ifdef with
>
> Haven't I seen a bpf patch that rather implies that byteswap
> is likely to be truly horrid?
>
> I've not tried to parse the crc code (although I do understand
> how it should work). But I'm surprised you need a byteswap.
>
> After all, the crc is basically a long division of the buffer
> by the crc constant.

For the *_be mode, the API expects that within each byte from the data
stream, the higher order poly bit sits at the more significant bit position,
and the same holds within the parameter "u32 crc", regardless of CPU
endianness.

After loading the unsigned long data from memory into a register, the
"higher order poly bit sits at the more significant bit position" property
still holds within each byte, but not across the whole register on a little
endian CPU. In order to do the XOR in s ^= __cpu_to_be64(*p_ul++), and to
run the CLMUL insn at XLEN scale, we need to rearrange the bits of the whole
register so that they are ordered correctly.
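For example, loading the big-endian byte stream b0..b7 with a little-endian
64-bit load puts b0 in the least significant byte of the register, so the
__cpu_to_be64() swap is exactly what moves b0 back to the most significant
position before the clmulh.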

>
> The CRC I've done recently is the hdlc crc-16.
> My nios version (also mips-like) has:
>
> static __inline__ uint32_t
> crc_step(uint32_t crc, uint32_t byte_val)
> {
> #if defined(crc_step_ci)
> return crc_step_ci(byte_val, crc);
> #else
> uint32_t t = crc ^ (byte_val & 0xff);
> t = (t ^ t << 4) & 0xff;
> return crc >> 8 ^ t << 8 ^ t << 3 ^ t >> 4;
> #endif
> }
>
> I normally use a custom instruction for the logic - one clock.
> But the C code is only a couple of clocks slower that the best
> table lookup version.
> On anything pipelined and multi-issue the C code is likely to
> be faster than a lookup table!

I don't fully understand your code snippet, but it looks like only the least
significant byte of "uint32_t byte_val" is involved in this crc16 calculation,
so I don't see how it can work.

BRs,
Xiao

> I don't know if any of the 32bit crc can be reduced the same way.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1
> 1PT, UK
> Registration No: 1397386 (Wales)