2020-11-03 12:16:11

by Li Qiang

Subject: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

From: liqiang <[email protected]>

Dear all,

Thank you for taking the time to read this email!

Let me introduce the idea behind my implementation here.

While using the compression library libz, I found that the adler32
checksum is consistently one of the hottest functions, so I looked into
this algorithm. After studying the SVE instruction set of Armv8, I
realized that SVE can accelerate adler32 effectively, so I made some
attempts and obtained correct results with better performance. I very
much hope that this modification can be applied to the kernel.

Below is my analysis process:

Adler32 algorithm
=================

Reference: https://en.wikipedia.org/wiki/Adler-32

Assume the buffer to be checksummed is D and its length is n:

A = 1 + D1 + D2 + ... + Dn (mod 65521)

B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
= n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521)

Adler-32(D) = B × 65536 + A

In C, an inefficient but straightforward implementation is:

#include <stddef.h>
#include <stdint.h>

const uint32_t MOD_ADLER = 65521;

uint32_t adler32(unsigned char *data, size_t len)
{
        uint32_t a = 1, b = 0;
        size_t index;

        // Process each byte of the data in order
        for (index = 0; index < len; ++index) {
                a = (a + data[index]) % MOD_ADLER;
                b = (b + a) % MOD_ADLER;
        }

        return (b << 16) | a;
}

SVE vector method
=================

Step 1. Determine the block size:
Use the addvl instruction to get the SVE vector length.
Assume the vector length (the block size) is x bytes here.

Step 2. Calculate the first block:
The calculation formula is:
A1 = 1 + D1 + D2 + ... + Dx (mod 65521)
B1 = x*D1 + (x-1)*D2 + ... + Dx + x (mod 65521)
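
For example, with x = 4 and D = "abcd" (ASCII 97, 98, 99, 100):

A1 = 1 + 97 + 98 + 99 + 100 = 395
B1 = 4*97 + 3*98 + 2*99 + 1*100 + 4 = 984

which matches what the byte-at-a-time C loop above gives after four
bytes (a = 395, b = 984).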

Step 3. Calculate the following blocks:
The formula for A2 is simple; we just keep accumulating:
A2 = A1 + Dx+1 + Dx+2 + ... + D2x (mod 65521)

The formula for B2 is more involved, because the result depends on the
length of buf. When calculating B1, the length was effectively assumed
to be the block length x. Now that the length has grown to 2x, B2
becomes:
B2 = 2x*D1 + (2x-1)*D2 + ... + (x+1)*Dx + x*D(x+1) + ... + D2x + 2x
   = [x*D1 + x*D2 + ... + x*Dx + x]            <- x * A1
   + [x*D1 + (x-1)*D2 + ... + 1*Dx + x]        <- B1
   + [x*D(x+1) + (x-1)*D(x+2) + ... + 1*D2x]   <- the weighted sum of the next block

In other words, B2 = B1 + x*A1 + (the weighted sum of the next block).

So we get a recurrence for calculating each block from the previous
one (the bytes of the new block are renumbered D1 ... Dx):
An+1 = An + D1 + D2 + ... + Dx (mod 65521)
Bn+1 = Bn + x*An + x*D1 + (x-1)*D2 + ... + Dx (mod 65521)
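
To make this recurrence concrete, here is a scalar C sketch of the
block-wise scheme (illustrative only; the function name and helper code
are mine and are not part of the patch, and the two inner sums are the
parts that the SVE code computes with vector instructions):

#include <stddef.h>
#include <stdint.h>

#define MOD_ADLER 65521

/* Scalar sketch of the block-wise recurrence with block size x:
 *   A(n+1) = A(n) + D1 + D2 + ... + Dx                  (mod 65521)
 *   B(n+1) = B(n) + x*A(n) + x*D1 + (x-1)*D2 + ... + Dx (mod 65521)
 */
uint32_t adler32_blockwise(uint32_t adler, const unsigned char *buf,
                           size_t len, size_t x)
{
        uint32_t a = adler & 0xffff;
        uint32_t b = adler >> 16;

        while (len >= x) {
                uint64_t sum = 0, weighted = 0;
                size_t i;

                for (i = 0; i < x; i++) {
                        sum += buf[i];                /* D1 + D2 + ... + Dx */
                        weighted += (x - i) * buf[i]; /* x*D1 + ... + 1*Dx  */
                }
                /* B must be updated with the previous value of A. */
                b = (b + (uint64_t)a * x + weighted) % MOD_ADLER;
                a = (a + sum) % MOD_ADLER;
                buf += x;
                len -= x;
        }
        /* Trailing bytes fall back to the byte-at-a-time loop. */
        while (len--) {
                a = (a + *buf++) % MOD_ADLER;
                b = (b + a) % MOD_ADLER;
        }
        return (b << 16) | a;
}

The same ordering shows up in the assembly below: the madd that folds
An-1 * x into Bn is issued before the new byte sum is added to An.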

Step 4. Implement and test with the SVE instruction set:
Implement the above idea with SVE instructions (please refer to the
patch for details), and run correctness and performance tests.

The following are three sets of test results, measured on random data
of 1M, 10M and 100M bytes:
[root@xxx adler32]# ./benchmark 1000000
Libz alg: Time used: 615 us, 1626.0 Mb/s.
SVE alg: Time used: 163 us, 6135.0 Mb/s.
Libz result: 0x7a6a200
Sve result: 0x7a6a200
Equal

[root@xxx adler32]# ./benchmark 10000000
Libz alg: Time used: 6486 us, 1541.8 Mb/s.
SVE alg: Time used: 2077 us, 4814.6 Mb/s.
Libz result: 0xf92a8b92
Sve result: 0xf92a8b92
Equal

[root@xxx adler32]# ./benchmark 100000000
Libz alg: Time used: 64352 us, 1554.0 Mb/s.
SVE alg: Time used: 20697 us, 4831.6 Mb/s.
Libz result: 0x295bf401
Sve result: 0x295bf401
Equal
Test environment: Taishan 1951.

The test results show that on Taishan 1951, the SVE implementation is
generally about 3 to 4 times faster than the C implementation.

liqiang (1):
Accelerate Adler32 using arm64 SVE instructions.

arch/arm64/crypto/Kconfig | 5 ++
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/adler32-sve-glue.c | 93 ++++++++++++++++++++
arch/arm64/crypto/adler32-sve.S | 127 +++++++++++++++++++++++++++
crypto/testmgr.c | 8 +-
crypto/testmgr.h | 13 +++
6 files changed, 248 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/crypto/adler32-sve-glue.c
create mode 100644 arch/arm64/crypto/adler32-sve.S

--
2.19.1


2020-11-03 12:17:25

by Li Qiang

Subject: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

From: liqiang <[email protected]>

In the libz library, the adler32 checksum is usually a significant hot
spot, and the SVE instruction set can accelerate it easily, so the
performance of the libz library improves significantly.

We can divide buf into blocks according to the bit width of SVE, and
then use vector registers to operate on the data in units of blocks to
achieve the acceleration.

On machines that support arm64 SVE instructions, this algorithm is
about 3~4 times faster than the C implementation in libz. The wider
the SVE vectors, the better the acceleration.

Measured on a Taishan 1951 machine with 256-bit SVE, below are the
results for random data of 1M and 10M bytes:

[root@xxx adler32]# ./benchmark 1000000
Libz alg: Time used: 608 us, 1644.7 Mb/s.
SVE alg: Time used: 166 us, 6024.1 Mb/s.

[root@xxx adler32]# ./benchmark 10000000
Libz alg: Time used: 6484 us, 1542.3 Mb/s.
SVE alg: Time used: 2034 us, 4916.4 Mb/s.

The blocks can be of any size, so the algorithm can automatically adapt
to SVE hardware with different bit widths without modifying the code.


Signed-off-by: liqiang <[email protected]>
---
arch/arm64/crypto/Kconfig | 5 ++
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/adler32-sve-glue.c | 93 ++++++++++++++++++++
arch/arm64/crypto/adler32-sve.S | 127 +++++++++++++++++++++++++++
crypto/testmgr.c | 8 +-
crypto/testmgr.h | 13 +++
6 files changed, 248 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/crypto/adler32-sve-glue.c
create mode 100644 arch/arm64/crypto/adler32-sve.S

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index b8eb045..cfe58b9 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -126,4 +126,9 @@ config CRYPTO_AES_ARM64_BS
select CRYPTO_LIB_AES
select CRYPTO_SIMD

+config SVE_ADLER32
+ tristate "Accelerate Adler32 using arm64 SVE instructions."
+ depends on ARM64_SVE
+ select CRYPTO_HASH
+
endif
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index d0901e6..45fe649 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -63,6 +63,9 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o

+obj-$(CONFIG_SVE_ADLER32) += sve-adler32.o
+sve-adler32-y := adler32-sve.o adler32-sve-glue.o
+
CFLAGS_aes-glue-ce.o := -DUSE_V8_CRYPTO_EXTENSIONS

$(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
diff --git a/arch/arm64/crypto/adler32-sve-glue.c b/arch/arm64/crypto/adler32-sve-glue.c
new file mode 100644
index 0000000..cb74514
--- /dev/null
+++ b/arch/arm64/crypto/adler32-sve-glue.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Accelerate Adler32 using arm64 SVE instructions.
+ * Automatically support all bit width of SVE
+ * vector(128~2048).
+ *
+ * Copyright (C) 2020 Huawei Technologies Co., Ltd.
+ *
+ * Author: Li Qiang <[email protected]>
+ */
+#include <linux/cpufeature.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/zutil.h>
+
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+
+/* Scalable vector extension min size 128bit */
+#define SVE_ADLER32_MIN_SIZE 16U
+#define SVE_ADLER32_DIGEST_SIZE 4
+#define SVE_ADLER32_BLOCKSIZE 1
+
+asmlinkage u32 adler32_sve(u32 adler, const u8 *buf, u32 len);
+
+static int adler32_sve_init(struct shash_desc *desc)
+{
+ u32 *adler32 = shash_desc_ctx(desc);
+
+ *adler32 = 1;
+ return 0;
+}
+
+static int adler32_sve_update(struct shash_desc *desc, const u8 *data,
+ unsigned int length)
+{
+ u32 *adler32 = shash_desc_ctx(desc);
+
+ if (length >= SVE_ADLER32_MIN_SIZE && crypto_simd_usable()) {
+ kernel_neon_begin();
+ *adler32 = adler32_sve(*adler32, data, length);
+ kernel_neon_end();
+ } else {
+ *adler32 = zlib_adler32(*adler32, data, length);
+ }
+ return 0;
+}
+
+static int adler32_sve_final(struct shash_desc *desc, u8 *out)
+{
+ u32 *adler32 = shash_desc_ctx(desc);
+
+ *(u32 *)out = *adler32;
+ return 0;
+}
+
+static struct shash_alg adler32_sve_alg[] = {{
+ .digestsize = SVE_ADLER32_DIGEST_SIZE,
+ .descsize = SVE_ADLER32_DIGEST_SIZE,
+ .init = adler32_sve_init,
+ .update = adler32_sve_update,
+ .final = adler32_sve_final,
+
+ .base.cra_name = "adler32",
+ .base.cra_driver_name = "adler32-arm64-sve",
+ .base.cra_priority = 200,
+ .base.cra_blocksize = SVE_ADLER32_BLOCKSIZE,
+ .base.cra_module = THIS_MODULE,
+}};
+
+static int __init adler32_sve_mod_init(void)
+{
+ if (!cpu_have_named_feature(SVE))
+ return 0;
+
+ return crypto_register_shash(adler32_sve_alg);
+}
+
+static void __exit adler32_sve_mod_exit(void)
+{
+ crypto_unregister_shash(adler32_sve_alg);
+}
+
+module_init(adler32_sve_mod_init);
+module_exit(adler32_sve_mod_exit);
+
+MODULE_AUTHOR("Li Qiang <[email protected]>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("adler32");
+MODULE_ALIAS_CRYPTO("adler32-arm64-sve");
diff --git a/arch/arm64/crypto/adler32-sve.S b/arch/arm64/crypto/adler32-sve.S
new file mode 100644
index 0000000..34ee4bb
--- /dev/null
+++ b/arch/arm64/crypto/adler32-sve.S
@@ -0,0 +1,127 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Accelerate Adler32 using arm64 SVE instructions. Automatically support all bit
+ * width of SVE vector(128~2048).
+ *
+ * Copyright (C) 2020 Huawei Technologies Co., Ltd.
+ *
+ * Author: Li Qiang <[email protected]>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+.arch armv8-a+sve
+.file "adler32_sve.S"
+.text
+.align 6
+
+//The supported sve vector length range is 128~2048 by this Adler_sequence
+.Adler_sequence:
+ .short 256,255,254,253,252,251,250,249,248,247,246,245,244,243,242,241
+ .short 240,239,238,237,236,235,234,233,232,231,230,229,228,227,226,225
+ .short 224,223,222,221,220,219,218,217,216,215,214,213,212,211,210,209
+ .short 208,207,206,205,204,203,202,201,200,199,198,197,196,195,194,193
+ .short 192,191,190,189,188,187,186,185,184,183,182,181,180,179,178,177
+ .short 176,175,174,173,172,171,170,169,168,167,166,165,164,163,162,161
+ .short 160,159,158,157,156,155,154,153,152,151,150,149,148,147,146,145
+ .short 144,143,142,141,140,139,138,137,136,135,134,133,132,131,130,129
+ .short 128,127,126,125,124,123,122,121,120,119,118,117,116,115,114,113
+ .short 112,111,110,109,108,107,106,105,104,103,102,101,100, 99, 98, 97
+ .short 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81
+ .short 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65
+ .short 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49
+ .short 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33
+ .short 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17
+ .short 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
+
+SYM_FUNC_START(adler32_sve)
+ and w10, w0, #0xffff
+ lsr w11, w0, #16
+
+ // Get the length of the sve vector to x6.
+ mov x6, #0
+ mov x9, #256
+ addvl x6, x6, #1
+ adr x12, .Adler_sequence
+ ptrue p1.h
+
+ // Get the starting position of the required sequence.
+ sub x9, x9, x6
+ ld1h z24.h, p1/z, [x12, x9, lsl #1] // taps1 to z24.h
+ inch x9
+ ld1h z25.h, p1/z, [x12, x9, lsl #1] // taps2 to z25.h
+ mov x9, #0
+ // A little of byte, jumper to normal proc
+ mov x14, #3
+ mul x15, x14, x6
+ cmp x2, x15
+ b.le Lnormal_proc
+
+ ptrue p0.b
+.align 6
+LBig_loop:
+ // x is SVE vector length (byte).
+ // Bn = Bn-1 + An-1 * x + x * D1 + (x-1) * D2 + ... + 1 * Dx
+ // An = An-1 + D1 + D2 + D3 + ... + Dx
+
+ .macro ADLER_BLOCK_X
+ ld1b z0.b, p0/z, [x1, x9]
+ incb x9
+ uaddv d20, p0, z0.b // D1 + D2 + ... + Dx
+ mov x12, v20.2d[0]
+ madd x11, x10, x6, x11 // Bn = An-1 * x + Bn-1
+
+ uunpklo z26.h, z0.b
+ uunpkhi z27.h, z0.b
+ mul z26.h, p1/m, z26.h, z24.h // x * D1 + (x-1) * D2 + ... + (x/2 + 1) * D(x/2)
+ mul z27.h, p1/m, z27.h, z25.h // (x/2) * D(x/2 + 1) + (x/2 - 1) * D(x/2 + 2) + ... + 1 * Dx
+
+ uaddv d21, p1, z26.h
+ uaddv d22, p1, z27.h
+ mov x13, v21.2d[0]
+ mov x14, v22.2d[0]
+
+ add x11, x13, x11
+ add x11, x14, x11 // Bn += x * D1 + (x-1) * D2 + ... + 1 * Dx
+ add x10, x12, x10 // An += D1 + D2 + ... + Dx
+ .endm
+ ADLER_BLOCK_X
+ ADLER_BLOCK_X
+ ADLER_BLOCK_X
+ // calc = reg0 % 65521
+ .macro mod65521, reg0, reg1, reg2
+ mov w\reg1, #0x8071
+ mov w\reg2, #0xfff1
+ movk w\reg1, #0x8007, lsl #16
+ umull x\reg1, w\reg0, w\reg1
+ lsr x\reg1, x\reg1, #47
+ msub w\reg0, w\reg1, w\reg2, w\reg0
+ .endm
+
+ mod65521 10, 14, 12
+ mod65521 11, 14, 12
+
+ sub x2, x2, x15
+ cmp x2, x15
+ b.ge LBig_loop
+
+.align 6
+Lnormal_proc:
+ cmp x2, #0
+ b.eq Lret
+
+ ldrb w12, [x1, x9]
+ add x9, x9, #1
+ add x10, x12, x10
+ add x11, x10, x11
+ sub x2, x2, #1
+ b Lnormal_proc
+
+Lret:
+ mod65521 10, 14, 12
+ mod65521 11, 14, 12
+ lsl x11, x11, #16
+ orr x0, x10, x11
+ ret
+SYM_FUNC_END(adler32_sve)
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index a64a639..58b8020 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4174,6 +4174,13 @@ static const struct alg_test_desc alg_test_descs[] = {
.suite = {
.cipher = __VECS(adiantum_xchacha20_aes_tv_template)
},
+ }, {
+ .alg = "adler32",
+ .test = alg_test_hash,
+ .fips_allowed = 1,
+ .suite = {
+ .hash = __VECS(adler32_tv_template)
+ }
}, {
.alg = "aegis128",
.test = alg_test_aead,
@@ -5640,7 +5647,6 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
}

DO_ONCE(testmgr_onetime_init);
-
if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
char nalg[CRYPTO_MAX_ALG_NAME];

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 8c83811..5233960 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -3676,6 +3676,19 @@ static const struct hash_testvec crct10dif_tv_template[] = {
}
};

+static const struct hash_testvec adler32_tv_template[] = {
+ {
+ .plaintext = "abcde",
+ .psize = 5,
+ .digest = "\xf0\x01\xc8\x05",
+ },
+ {
+ .plaintext = "0123456789101112131415",
+ .psize = 22,
+ .digest = "\x63\x04\xa8\x32",
+ },
+};
+
/*
* Streebog test vectors from RFC 6986 and GOST R 34.11-2012
*/
--
2.19.1

2020-11-03 14:36:58

by Ard Biesheuvel

Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

(+ Dave)

Hello liqiang,

First of all, I don't think it is safe at the moment to use SVE in the
kernel, as we don't preserve all state IIRC. My memory is a bit hazy,
though, so perhaps Dave can elaborate?

I'll give my feedback on the code itself below, but please don't
consider this an ack on the intent of using SVE in the kernel.

Could you explain why SVE is so much better than NEON here?

On Tue, 3 Nov 2020 at 13:16, l00374334 <[email protected]> wrote:
>
> From: liqiang <[email protected]>
>
> In the libz library, the adler32 checksum is usually a significant hot
> spot, and the SVE instruction set can accelerate it easily, so the
> performance of the libz library improves significantly.
>
> We can divide buf into blocks according to the bit width of SVE, and
> then use vector registers to operate on the data in units of blocks to
> achieve the acceleration.
>
> On machines that support arm64 SVE instructions, this algorithm is
> about 3~4 times faster than the C implementation in libz. The wider
> the SVE vectors, the better the acceleration.
>
> Measured on a Taishan 1951 machine with 256-bit SVE, below are the
> results for random data of 1M and 10M bytes:
>
> [root@xxx adler32]# ./benchmark 1000000
> Libz alg: Time used: 608 us, 1644.7 Mb/s.
> SVE alg: Time used: 166 us, 6024.1 Mb/s.
>
> [root@xxx adler32]# ./benchmark 10000000
> Libz alg: Time used: 6484 us, 1542.3 Mb/s.
> SVE alg: Time used: 2034 us, 4916.4 Mb/s.
>
> The blocks can be of any size, so the algorithm can automatically adapt
> to SVE hardware with different bit widths without modifying the code.
>

Please drop this indentation from the commit log.

>
> Signed-off-by: liqiang <[email protected]>
> ---
> arch/arm64/crypto/Kconfig | 5 ++
> arch/arm64/crypto/Makefile | 3 +
> arch/arm64/crypto/adler32-sve-glue.c | 93 ++++++++++++++++++++
> arch/arm64/crypto/adler32-sve.S | 127 +++++++++++++++++++++++++++
> crypto/testmgr.c | 8 +-
> crypto/testmgr.h | 13 +++

Please split into two patches. Also, who is going to use this "adler32" shash?

> 6 files changed, 248 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/crypto/adler32-sve-glue.c
> create mode 100644 arch/arm64/crypto/adler32-sve.S
>
> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> index b8eb045..cfe58b9 100644
> --- a/arch/arm64/crypto/Kconfig
> +++ b/arch/arm64/crypto/Kconfig
> @@ -126,4 +126,9 @@ config CRYPTO_AES_ARM64_BS
> select CRYPTO_LIB_AES
> select CRYPTO_SIMD
>
> +config SVE_ADLER32
> + tristate "Accelerate Adler32 using arm64 SVE instructions."
> + depends on ARM64_SVE
> + select CRYPTO_HASH
> +
> endif
> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> index d0901e6..45fe649 100644
> --- a/arch/arm64/crypto/Makefile
> +++ b/arch/arm64/crypto/Makefile
> @@ -63,6 +63,9 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
> obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
> aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
>
> +obj-$(CONFIG_SVE_ADLER32) += sve-adler32.o
> +sve-adler32-y := adler32-sve.o adler32-sve-glue.o
> +
> CFLAGS_aes-glue-ce.o := -DUSE_V8_CRYPTO_EXTENSIONS
>
> $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
> diff --git a/arch/arm64/crypto/adler32-sve-glue.c b/arch/arm64/crypto/adler32-sve-glue.c
> new file mode 100644
> index 0000000..cb74514
> --- /dev/null
> +++ b/arch/arm64/crypto/adler32-sve-glue.c
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Accelerate Adler32 using arm64 SVE instructions.
> + * Automatically support all bit width of SVE
> + * vector(128~2048).
> + *
> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
> + *
> + * Author: Li Qiang <[email protected]>
> + */
> +#include <linux/cpufeature.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/zutil.h>
> +
> +#include <crypto/internal/hash.h>
> +#include <crypto/internal/simd.h>
> +
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +
> +/* Scalable vector extension min size 128bit */
> +#define SVE_ADLER32_MIN_SIZE 16U
> +#define SVE_ADLER32_DIGEST_SIZE 4
> +#define SVE_ADLER32_BLOCKSIZE 1
> +
> +asmlinkage u32 adler32_sve(u32 adler, const u8 *buf, u32 len);
> +
> +static int adler32_sve_init(struct shash_desc *desc)
> +{
> + u32 *adler32 = shash_desc_ctx(desc);
> +
> + *adler32 = 1;
> + return 0;
> +}
> +
> +static int adler32_sve_update(struct shash_desc *desc, const u8 *data,
> + unsigned int length)

Please indent function parameters

> +{
> + u32 *adler32 = shash_desc_ctx(desc);
> +
> + if (length >= SVE_ADLER32_MIN_SIZE && crypto_simd_usable()) {
> + kernel_neon_begin();
> + *adler32 = adler32_sve(*adler32, data, length);
> + kernel_neon_end();
> + } else {
> + *adler32 = zlib_adler32(*adler32, data, length);
> + }
> + return 0;
> +}
> +
> +static int adler32_sve_final(struct shash_desc *desc, u8 *out)
> +{
> + u32 *adler32 = shash_desc_ctx(desc);
> +
> + *(u32 *)out = *adler32;

Please use put_unaligned here

> + return 0;
> +}
> +
> +static struct shash_alg adler32_sve_alg[] = {{
> + .digestsize = SVE_ADLER32_DIGEST_SIZE,
> + .descsize = SVE_ADLER32_DIGEST_SIZE,
> + .init = adler32_sve_init,
> + .update = adler32_sve_update,
> + .final = adler32_sve_final,
> +
> + .base.cra_name = "adler32",
> + .base.cra_driver_name = "adler32-arm64-sve",
> + .base.cra_priority = 200,
> + .base.cra_blocksize = SVE_ADLER32_BLOCKSIZE,
> + .base.cra_module = THIS_MODULE,

Please make sure the indentation is correct here.

> +}};
> +
> +static int __init adler32_sve_mod_init(void)
> +{
> + if (!cpu_have_named_feature(SVE))
> + return 0;
> +
> + return crypto_register_shash(adler32_sve_alg);
> +}
> +
> +static void __exit adler32_sve_mod_exit(void)
> +{
> + crypto_unregister_shash(adler32_sve_alg);
> +}
> +
> +module_init(adler32_sve_mod_init);
> +module_exit(adler32_sve_mod_exit);
> +
> +MODULE_AUTHOR("Li Qiang <[email protected]>");
> +MODULE_LICENSE("GPL v2");
> +MODULE_ALIAS_CRYPTO("adler32");
> +MODULE_ALIAS_CRYPTO("adler32-arm64-sve");
> diff --git a/arch/arm64/crypto/adler32-sve.S b/arch/arm64/crypto/adler32-sve.S
> new file mode 100644
> index 0000000..34ee4bb
> --- /dev/null
> +++ b/arch/arm64/crypto/adler32-sve.S
> @@ -0,0 +1,127 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Accelerate Adler32 using arm64 SVE instructions. Automatically support all bit
> + * width of SVE vector(128~2048).
> + *
> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
> + *
> + * Author: Li Qiang <[email protected]>
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/assembler.h>
> +
> +.arch armv8-a+sve
> +.file "adler32_sve.S"

Drop the .file

Please indent the rest 1 tab

> +.text
> +.align 6
> +
> +//The supported sve vector length range is 128~2048 by this Adler_sequence
> +.Adler_sequence:

This should be in .rodata. Also, if you use . or L prefixes, use .L
because that is what you need to make these local symbols.


> + .short 256,255,254,253,252,251,250,249,248,247,246,245,244,243,242,241
> + .short 240,239,238,237,236,235,234,233,232,231,230,229,228,227,226,225
> + .short 224,223,222,221,220,219,218,217,216,215,214,213,212,211,210,209
> + .short 208,207,206,205,204,203,202,201,200,199,198,197,196,195,194,193
> + .short 192,191,190,189,188,187,186,185,184,183,182,181,180,179,178,177
> + .short 176,175,174,173,172,171,170,169,168,167,166,165,164,163,162,161
> + .short 160,159,158,157,156,155,154,153,152,151,150,149,148,147,146,145
> + .short 144,143,142,141,140,139,138,137,136,135,134,133,132,131,130,129
> + .short 128,127,126,125,124,123,122,121,120,119,118,117,116,115,114,113
> + .short 112,111,110,109,108,107,106,105,104,103,102,101,100, 99, 98, 97
> + .short 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81
> + .short 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65
> + .short 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49
> + .short 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33
> + .short 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17
> + .short 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
> +
> +SYM_FUNC_START(adler32_sve)
> + and w10, w0, #0xffff
> + lsr w11, w0, #16
> +

Please put the instruction mnemonics in different columns using tabs

> + // Get the length of the sve vector to x6.
> + mov x6, #0
> + mov x9, #256
> + addvl x6, x6, #1
> + adr x12, .Adler_sequence
> + ptrue p1.h
> +
> + // Get the starting position of the required sequence.
> + sub x9, x9, x6
> + ld1h z24.h, p1/z, [x12, x9, lsl #1] // taps1 to z24.h
> + inch x9
> + ld1h z25.h, p1/z, [x12, x9, lsl #1] // taps2 to z25.h
> + mov x9, #0
> + // A little of byte, jumper to normal proc

What does this comment mean?

> + mov x14, #3
> + mul x15, x14, x6
> + cmp x2, x15
> + b.le Lnormal_proc
> +
> + ptrue p0.b
> +.align 6

Indent.

> +LBig_loop:

Use .L prefix

> + // x is SVE vector length (byte).
> + // Bn = Bn-1 + An-1 * x + x * D1 + (x-1) * D2 + ... + 1 * Dx
> + // An = An-1 + D1 + D2 + D3 + ... + Dx
> +
> + .macro ADLER_BLOCK_X

Please use lower case for asm macros, to distinguish them from CPP macros
Also, indent the name, and move the macro out of the function for legibility.

> + ld1b z0.b, p0/z, [x1, x9]
> + incb x9
> + uaddv d20, p0, z0.b // D1 + D2 + ... + Dx
> + mov x12, v20.2d[0]
> + madd x11, x10, x6, x11 // Bn = An-1 * x + Bn-1
> +
> + uunpklo z26.h, z0.b
> + uunpkhi z27.h, z0.b
> + mul z26.h, p1/m, z26.h, z24.h // x * D1 + (x-1) * D2 + ... + (x/2 + 1) * D(x/2)
> + mul z27.h, p1/m, z27.h, z25.h // (x/2) * D(x/2 + 1) + (x/2 - 1) * D(x/2 + 2) + ... + 1 * Dx
> +
> + uaddv d21, p1, z26.h
> + uaddv d22, p1, z27.h
> + mov x13, v21.2d[0]
> + mov x14, v22.2d[0]
> +
> + add x11, x13, x11
> + add x11, x14, x11 // Bn += x * D1 + (x-1) * D2 + ... + 1 * Dx
> + add x10, x12, x10 // An += D1 + D2 + ... + Dx
> + .endm
> + ADLER_BLOCK_X
> + ADLER_BLOCK_X
> + ADLER_BLOCK_X
> + // calc = reg0 % 65521
> + .macro mod65521, reg0, reg1, reg2
> + mov w\reg1, #0x8071
> + mov w\reg2, #0xfff1
> + movk w\reg1, #0x8007, lsl #16
> + umull x\reg1, w\reg0, w\reg1
> + lsr x\reg1, x\reg1, #47
> + msub w\reg0, w\reg1, w\reg2, w\reg0
> + .endm
> +

Same as above

> + mod65521 10, 14, 12
> + mod65521 11, 14, 12
> +
> + sub x2, x2, x15
> + cmp x2, x15
> + b.ge LBig_loop
> +
> +.align 6

Indent

> +Lnormal_proc:

.L


> + cmp x2, #0
> + b.eq Lret
> +
> + ldrb w12, [x1, x9]
> + add x9, x9, #1
> + add x10, x12, x10
> + add x11, x10, x11
> + sub x2, x2, #1
> + b Lnormal_proc
> +
> +Lret:
> + mod65521 10, 14, 12
> + mod65521 11, 14, 12
> + lsl x11, x11, #16
> + orr x0, x10, x11
> + ret
> +SYM_FUNC_END(adler32_sve)
> diff --git a/crypto/testmgr.c b/crypto/testmgr.c
> index a64a639..58b8020 100644
> --- a/crypto/testmgr.c
> +++ b/crypto/testmgr.c
> @@ -4174,6 +4174,13 @@ static const struct alg_test_desc alg_test_descs[] = {
> .suite = {
> .cipher = __VECS(adiantum_xchacha20_aes_tv_template)
> },
> + }, {
> + .alg = "adler32",
> + .test = alg_test_hash,
> + .fips_allowed = 1,
> + .suite = {
> + .hash = __VECS(adler32_tv_template)
> + }
> }, {
> .alg = "aegis128",
> .test = alg_test_aead,
> @@ -5640,7 +5647,6 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
> }
>
> DO_ONCE(testmgr_onetime_init);
> -
> if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
> char nalg[CRYPTO_MAX_ALG_NAME];
>
> diff --git a/crypto/testmgr.h b/crypto/testmgr.h
> index 8c83811..5233960 100644
> --- a/crypto/testmgr.h
> +++ b/crypto/testmgr.h
> @@ -3676,6 +3676,19 @@ static const struct hash_testvec crct10dif_tv_template[] = {
> }
> };
>
> +static const struct hash_testvec adler32_tv_template[] = {
> + {
> + .plaintext = "abcde",
> + .psize = 5,
> + .digest = "\xf0\x01\xc8\x05",
> + },
> + {
> + .plaintext = "0123456789101112131415",
> + .psize = 22,
> + .digest = "\x63\x04\xa8\x32",
> + },
> +};
> +
> /*
> * Streebog test vectors from RFC 6986 and GOST R 34.11-2012
> */
> --
> 2.19.1
>

2020-11-03 18:02:04

by Dave Martin

Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Tue, Nov 03, 2020 at 03:34:27PM +0100, Ard Biesheuvel wrote:
> (+ Dave)
>
> Hello liqiang,
>
> First of all, I don't think it is safe at the moment to use SVE in the
> kernel, as we don't preserve all state IIRC. My memory is a bit hazy,

I'm not convinced that it's safe right now. SVE in the kernel is
unsupported, partly due to cost and partly due to the lack of a
compelling use case.

I think it would be preferable to see this algo accelerated for NEON
first, since all AArch64 hardware can benefit from that.

It's good to see someone experimenting with SVE though -- so feel
free to experiment with this in userspace and see what sort of benchmark
you can achieve. But there's no guarantee this can be merged in the
kernel any time soon. If there's a big enough advantage over NEON, then
it becomes more interesting.


> though, so perhaps Dave can elaborate?

Historically there was another reason: for every nested
kernel_neon_begin(), we would have potentially needed to save the entire
SVE state, which is a problem -- we'd likely run out of kernel stack,
plus it could get costly, especially for latency. Since we refactored
the kernel-mode NEON support to remove nesting support, this is not so
much of a problem. kernel_neon_begin() may incur a save of the full SVE
state anyway, so in some ways it would be a good thing if we could
actually make use of all those registers.

SVE hardware remains rare, so as a general policy I don't think we
should accept SVE implementations of any algorithm that does not
already have a NEON implementation -- unless the contributor can
explain why nobody with non-SVE hardware is going to care about the
performance of that algo.


>
> I'll give my feedback on the code itself below, but please don't
> consider this an ack on the intent of using SVE in the kernel.
>
> Could you explain why SVE is so much better than NEON here?

From my end, I think that SVE probably doesn't offer special advantages
here beyond less messy code and more number-crunching per loop
iteration.

That doesn't mean that SVE can't beat NEON, but NEON should still offer
a big advantage over scalar code.

Since the calculations are quite simple, even the NEON version may
tend to saturate load/store bandwidth on some hardware, in which case
the additional performance from SVE may be limited. Experimentation
would be needed in order to know for sure.



I've made some random comments on the code below -- but my knowledge of
the SVE instruction set is a little rusty, and I haven't tried to
understand precisely what this algo is trying to do!

>
> On Tue, 3 Nov 2020 at 13:16, l00374334 <[email protected]> wrote:
> >
> > From: liqiang <[email protected]>
> >
> > In the libz library, the adler32 checksum is usually a significant hot
> > spot, and the SVE instruction set can accelerate it easily, so the
> > performance of the libz library improves significantly.
> >
> > We can divide buf into blocks according to the bit width of SVE, and
> > then use vector registers to operate on the data in units of blocks to
> > achieve the acceleration.
> >
> > On machines that support arm64 SVE instructions, this algorithm is
> > about 3~4 times faster than the C implementation in libz. The wider
> > the SVE vectors, the better the acceleration.
> >
> > Measured on a Taishan 1951 machine with 256-bit SVE, below are the
> > results for random data of 1M and 10M bytes:
> >
> > [root@xxx adler32]# ./benchmark 1000000
> > Libz alg: Time used: 608 us, 1644.7 Mb/s.
> > SVE alg: Time used: 166 us, 6024.1 Mb/s.
> >
> > [root@xxx adler32]# ./benchmark 10000000
> > Libz alg: Time used: 6484 us, 1542.3 Mb/s.
> > SVE alg: Time used: 2034 us, 4916.4 Mb/s.
> >
> > The blocks can be of any size, so the algorithm can automatically adapt
> > to SVE hardware with different bit widths without modifying the code.
> >
>
> Please drop this indentation from the commit log.
>
> >
> > Signed-off-by: liqiang <[email protected]>
> > ---
> > arch/arm64/crypto/Kconfig | 5 ++
> > arch/arm64/crypto/Makefile | 3 +
> > arch/arm64/crypto/adler32-sve-glue.c | 93 ++++++++++++++++++++
> > arch/arm64/crypto/adler32-sve.S | 127 +++++++++++++++++++++++++++
> > crypto/testmgr.c | 8 +-
> > crypto/testmgr.h | 13 +++
>
> Please split into two patches. Also, who is going to use this "adler32" shash?
>
> > 6 files changed, 248 insertions(+), 1 deletion(-)
> > create mode 100644 arch/arm64/crypto/adler32-sve-glue.c
> > create mode 100644 arch/arm64/crypto/adler32-sve.S
> >
> > diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> > index b8eb045..cfe58b9 100644
> > --- a/arch/arm64/crypto/Kconfig
> > +++ b/arch/arm64/crypto/Kconfig
> > @@ -126,4 +126,9 @@ config CRYPTO_AES_ARM64_BS
> > select CRYPTO_LIB_AES
> > select CRYPTO_SIMD
> >
> > +config SVE_ADLER32
> > + tristate "Accelerate Adler32 using arm64 SVE instructions."
> > + depends on ARM64_SVE
> > + select CRYPTO_HASH
> > +
> > endif
> > diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> > index d0901e6..45fe649 100644
> > --- a/arch/arm64/crypto/Makefile
> > +++ b/arch/arm64/crypto/Makefile
> > @@ -63,6 +63,9 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
> > obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
> > aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
> >
> > +obj-$(CONFIG_SVE_ADLER32) += sve-adler32.o
> > +sve-adler32-y := adler32-sve.o adler32-sve-glue.o
> > +
> > CFLAGS_aes-glue-ce.o := -DUSE_V8_CRYPTO_EXTENSIONS
> >
> > $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
> > diff --git a/arch/arm64/crypto/adler32-sve-glue.c b/arch/arm64/crypto/adler32-sve-glue.c
> > new file mode 100644
> > index 0000000..cb74514
> > --- /dev/null
> > +++ b/arch/arm64/crypto/adler32-sve-glue.c
> > @@ -0,0 +1,93 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Accelerate Adler32 using arm64 SVE instructions.
> > + * Automatically support all bit width of SVE
> > + * vector(128~2048).
> > + *
> > + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
> > + *
> > + * Author: Li Qiang <[email protected]>
> > + */
> > +#include <linux/cpufeature.h>
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/zutil.h>
> > +
> > +#include <crypto/internal/hash.h>
> > +#include <crypto/internal/simd.h>
> > +
> > +#include <asm/neon.h>
> > +#include <asm/simd.h>
> > +
> > +/* Scalable vector extension min size 128bit */
> > +#define SVE_ADLER32_MIN_SIZE 16U
> > +#define SVE_ADLER32_DIGEST_SIZE 4
> > +#define SVE_ADLER32_BLOCKSIZE 1
> > +
> > +asmlinkage u32 adler32_sve(u32 adler, const u8 *buf, u32 len);
> > +
> > +static int adler32_sve_init(struct shash_desc *desc)
> > +{
> > + u32 *adler32 = shash_desc_ctx(desc);
> > +
> > + *adler32 = 1;
> > + return 0;
> > +}
> > +
> > +static int adler32_sve_update(struct shash_desc *desc, const u8 *data,
> > + unsigned int length)
>
> Please indent function parameters
>
> > +{
> > + u32 *adler32 = shash_desc_ctx(desc);
> > +
> > + if (length >= SVE_ADLER32_MIN_SIZE && crypto_simd_usable()) {
> > + kernel_neon_begin();
> > + *adler32 = adler32_sve(*adler32, data, length);
> > + kernel_neon_end();
> > + } else {
> > + *adler32 = zlib_adler32(*adler32, data, length);
> > + }
> > + return 0;
> > +}
> > +
> > +static int adler32_sve_final(struct shash_desc *desc, u8 *out)
> > +{
> > + u32 *adler32 = shash_desc_ctx(desc);
> > +
> > + *(u32 *)out = *adler32;
>
> Please use put_unaligned here
>
> > + return 0;
> > +}
> > +
> > +static struct shash_alg adler32_sve_alg[] = {{
> > + .digestsize = SVE_ADLER32_DIGEST_SIZE,
> > + .descsize = SVE_ADLER32_DIGEST_SIZE,
> > + .init = adler32_sve_init,
> > + .update = adler32_sve_update,
> > + .final = adler32_sve_final,
> > +
> > + .base.cra_name = "adler32",
> > + .base.cra_driver_name = "adler32-arm64-sve",
> > + .base.cra_priority = 200,
> > + .base.cra_blocksize = SVE_ADLER32_BLOCKSIZE,
> > + .base.cra_module = THIS_MODULE,
>
> Please make sure the indentation is correct here.
>
> > +}};
> > +
> > +static int __init adler32_sve_mod_init(void)
> > +{
> > + if (!cpu_have_named_feature(SVE))
> > + return 0;
> > +
> > + return crypto_register_shash(adler32_sve_alg);
> > +}
> > +
> > +static void __exit adler32_sve_mod_exit(void)
> > +{
> > + crypto_unregister_shash(adler32_sve_alg);
> > +}
> > +
> > +module_init(adler32_sve_mod_init);
> > +module_exit(adler32_sve_mod_exit);
> > +
> > +MODULE_AUTHOR("Li Qiang <[email protected]>");
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_ALIAS_CRYPTO("adler32");
> > +MODULE_ALIAS_CRYPTO("adler32-arm64-sve");
> > diff --git a/arch/arm64/crypto/adler32-sve.S b/arch/arm64/crypto/adler32-sve.S
> > new file mode 100644
> > index 0000000..34ee4bb
> > --- /dev/null
> > +++ b/arch/arm64/crypto/adler32-sve.S
> > @@ -0,0 +1,127 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Accelerate Adler32 using arm64 SVE instructions. Automatically support all bit
> > + * width of SVE vector(128~2048).
> > + *
> > + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
> > + *
> > + * Author: Li Qiang <[email protected]>
> > + */
> > +
> > +#include <linux/linkage.h>
> > +#include <asm/assembler.h>
> > +
> > +.arch armv8-a+sve

Use .arch_extension sve instead.

The compiler frontend already sets .arch to the appropriate base
architecture; that shouldn't be overridden unless there is a
good reason.

> > +.file "adler32_sve.S"
>
> Drop the .file
>
> Please indent the rest 1 tab
>
> > +.text
> > +.align 6
> > +
> > +//The supported sve vector length range is 128~2048 by this Adler_sequence
> > +.Adler_sequence:
>
> This should be in .rodata. Also, if you use . or L prefixes, use .L
> because that is what you need to make these local symbols.
>
>
> > + .short 256,255,254,253,252,251,250,249,248,247,246,245,244,243,242,241
> > + .short 240,239,238,237,236,235,234,233,232,231,230,229,228,227,226,225
> > + .short 224,223,222,221,220,219,218,217,216,215,214,213,212,211,210,209
> > + .short 208,207,206,205,204,203,202,201,200,199,198,197,196,195,194,193
> > + .short 192,191,190,189,188,187,186,185,184,183,182,181,180,179,178,177
> > + .short 176,175,174,173,172,171,170,169,168,167,166,165,164,163,162,161
> > + .short 160,159,158,157,156,155,154,153,152,151,150,149,148,147,146,145
> > + .short 144,143,142,141,140,139,138,137,136,135,134,133,132,131,130,129
> > + .short 128,127,126,125,124,123,122,121,120,119,118,117,116,115,114,113
> > + .short 112,111,110,109,108,107,106,105,104,103,102,101,100, 99, 98, 97
> > + .short 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81
> > + .short 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65
> > + .short 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49
> > + .short 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33
> > + .short 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17
> > + .short 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
> > +
> > +SYM_FUNC_START(adler32_sve)
> > + and w10, w0, #0xffff
> > + lsr w11, w0, #16
> > +
>
> Please put the instruction mnemonics in different columns using tabs
>
> > + // Get the length of the sve vector to x6.
> > + mov x6, #0
> > + mov x9, #256
> > + addvl x6, x6, #1
> > + adr x12, .Adler_sequence
> > + ptrue p1.h
> > +
> > + // Get the starting position of the required sequence.
> > + sub x9, x9, x6
> > + ld1h z24.h, p1/z, [x12, x9, lsl #1] // taps1 to z24.h
> > + inch x9
> > + ld1h z25.h, p1/z, [x12, x9, lsl #1] // taps2 to z25.h

This seems cumbersome, and will explode if SVE is ever extended to
support vectors larger than 256 bytes (though that's probably not very
likely any time soon).

Can you do something like the following (untested)?

ptrue p0.h
index z0.h, #0, #1
mov z1.d, z0.d
dech z0.h
dech z1.h, all, mul #2
negh z0.h, p0/m, z0.h
negh z1.h, p0/m, z1.h


> > + mov x9, #0
> > + // A little of byte, jumper to normal proc
>
> What does this comment mean?
>
> > + mov x14, #3
> > + mul x15, x14, x6
> > + cmp x2, x15
> > + b.le Lnormal_proc
> > +
> > + ptrue p0.b
> > +.align 6
>
> Indent.
>
> > +LBig_loop:
>
> Use .L prefix
>
> > + // x is SVE vector length (byte).
> > + // Bn = Bn-1 + An-1 * x + x * D1 + (x-1) * D2 + ... + 1 * Dx
> > + // An = An-1 + D1 + D2 + D3 + ... + Dx
> > +
> > + .macro ADLER_BLOCK_X
>
> Please use lower case for asm macros, to distinguish them from CPP macros
> Also, indent the name, and move the macro out of the function for legibility.
>
> > + ld1b z0.b, p0/z, [x1, x9]
> > + incb x9
> > + uaddv d20, p0, z0.b // D1 + D2 + ... + Dx
> > + mov x12, v20.2d[0]
> > + madd x11, x10, x6, x11 // Bn = An-1 * x + Bn-1
> > +
> > + uunpklo z26.h, z0.b
> > + uunpkhi z27.h, z0.b

Instead of loading and then unpacking elements, it's best to do it in
one go. If you increment the base address as you go, this becomes
something like (untested):

ld1b z26.h, p0/z, [x1]
ld1b z27.h, p0/z, [x1, #1, mul vl]
inch x1, all, mul #2

(you could keep a pre-incremented version of x1 or x9 in some other
register to eliminate one of these inch instructions).

> > + mul z26.h, p1/m, z26.h, z24.h // x * D1 + (x-1) * D2 + ... + (x/2 + 1) * D(x/2)
> > + mul z27.h, p1/m, z27.h, z25.h // (x/2) * D(x/2 + 1) + (x/2 - 1) * D(x/2 + 2) + ... + 1 * Dx
> > +
> > + uaddv d21, p1, z26.h
> > + uaddv d22, p1, z27.h
> > + mov x13, v21.2d[0]
> > + mov x14, v22.2d[0]
> > +
> > + add x11, x13, x11
> > + add x11, x14, x11 // Bn += x * D1 + (x-1) * D2 + ... + 1 * Dx
> > + add x10, x12, x10 // An += D1 + D2 + ... + Dx

If you want best performance, you should do accumulations outside the
core loop. Vertical reduction instructions such as UADDV need to
collect the results of multiple element-by-element calculations which
can otherwise proceed independently, so doing this too often will heavily
constrain the ways in which the hardware can schedule the calculations.

Instead, you can accumulate column-wise partial sums with vector MAD
instructions (maybe with MOVPRFX, since MAD overwrites a source
operand).

To avoid frequent overflows, it may make sense to operate on quite wide
elements within the loop, but you will still need to limit the number of
loop iterations to ensure that an overflow cannot happen.
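
The rough structure, written out in C purely for illustration (with a
made-up lane count, and only the A-side byte sum shown; the B side
additionally needs the per-block weights and the x*A(n) term, but the
same accumulate-then-reduce shape applies):

#include <stddef.h>
#include <stdint.h>

#define LANES 16        /* hypothetical vector length in bytes */

/* Keep one partial sum per lane inside the loop and fold the lanes
 * together once at the end, instead of doing a horizontal reduction
 * (UADDV) on every block.
 */
static uint32_t sum_bytes_columnwise(const unsigned char *buf, size_t blocks)
{
        uint64_t lane[LANES] = { 0 };
        uint64_t total = 0;
        size_t i, j;

        for (i = 0; i < blocks; i++, buf += LANES)
                for (j = 0; j < LANES; j++)
                        lane[j] += buf[j];  /* element-wise; no reduction here */

        for (j = 0; j < LANES; j++)         /* single horizontal reduction */
                total += lane[j];

        return total % 65521;
}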

> > + .endm
> > + ADLER_BLOCK_X
> > + ADLER_BLOCK_X
> > + ADLER_BLOCK_X
> > + // calc = reg0 % 65521
> > + .macro mod65521, reg0, reg1, reg2
> > + mov w\reg1, #0x8071
> > + mov w\reg2, #0xfff1
> > + movk w\reg1, #0x8007, lsl #16
> > + umull x\reg1, w\reg0, w\reg1
> > + lsr x\reg1, x\reg1, #47
> > + msub w\reg0, w\reg1, w\reg2, w\reg0
> > + .endm
> > +
>
> Same as above
>
> > + mod65521 10, 14, 12
> > + mod65521 11, 14, 12
> > +
> > + sub x2, x2, x15
> > + cmp x2, x15
> > + b.ge LBig_loop
> > +
> > +.align 6
>
> Indent
>
> > +Lnormal_proc:
>
> .L
>
>
> > + cmp x2, #0
> > + b.eq Lret
> > +
> > + ldrb w12, [x1, x9]
> > + add x9, x9, #1
> > + add x10, x12, x10
> > + add x11, x10, x11
> > + sub x2, x2, #1

I haven't tried to understand this algorithm in detail, but there should
probably be no need for this special case to handle the trailing bytes.

You should search for examples of speculative vectorization using
WHILELO etc., to get a better feel for how to do this.


[...]

Cheers
---Dave

2020-11-04 08:02:39

by Li Qiang

Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

Hi Ard,

Thank you very much for your reply and comments on the code :)

On 2020/11/3 22:34, Ard Biesheuvel wrote:
> (+ Dave)
>
> Hello liqiang,
>
> First of all, I don't think it is safe at the moment to use SVE in the
> kernel, as we don't preserve all state IIRC. My memory is a bit hazy,
> though, so perhaps Dave can elaborate?

OK, I understand this problem now.

>
> I'll give my feedback on the code itself below, but please don't
> consider this an ack on the intent of using SVE in the kernel.
>
> Could you explain why SVE is so much better than NEON here?

Over the past few months I have spent some time researching Arm's SVE
instruction set, and tried to use SVE instructions to optimize adler32
and some other general algorithms; the results for Adler32 were good.

Based on my research and debugging so far (see the cover letter), I think
the vectorized form of the Adler32 algorithm is very well suited to an SVE
implementation, because the data can be divided into blocks of any length,
such as 16 bytes, 32 bytes, 128 bytes, 256 bytes, etc., so our code can
adapt to any SVE hardware from 128-bit to 2048-bit. Supporting SVE
instructions with different bit widths does not require any special
handling; we only need to determine the starting position in
"Adler_sequence" according to the SVE bit width, and hardware of any
width can then run at its full potential.

I am also trying to implement the algorithm with NEON instructions. I will
reply to you in time if there are results.

>
> On Tue, 3 Nov 2020 at 13:16, l00374334 <[email protected]> wrote:
>>
>> From: liqiang <[email protected]>
>>
>> In the libz library, the adler32 checksum is usually a significant hot
>> spot, and the SVE instruction set can accelerate it easily, so the
>> performance of the libz library improves significantly.
>>
>> We can divide buf into blocks according to the bit width of SVE, and
>> then use vector registers to operate on the data in units of blocks to
>> achieve the acceleration.
>>
>> On machines that support arm64 SVE instructions, this algorithm is
>> about 3~4 times faster than the C implementation in libz. The wider
>> the SVE vectors, the better the acceleration.
>>
>> Measured on a Taishan 1951 machine with 256-bit SVE, below are the
>> results for random data of 1M and 10M bytes:
>>
>> [root@xxx adler32]# ./benchmark 1000000
>> Libz alg: Time used: 608 us, 1644.7 Mb/s.
>> SVE alg: Time used: 166 us, 6024.1 Mb/s.
>>
>> [root@xxx adler32]# ./benchmark 10000000
>> Libz alg: Time used: 6484 us, 1542.3 Mb/s.
>> SVE alg: Time used: 2034 us, 4916.4 Mb/s.
>>
>> The blocks can be of any size, so the algorithm can automatically adapt
>> to SVE hardware with different bit widths without modifying the code.
>>
>
> Please drop this indentation from the commit log.
>
>>
>> Signed-off-by: liqiang <[email protected]>
>> ---
>> arch/arm64/crypto/Kconfig | 5 ++
>> arch/arm64/crypto/Makefile | 3 +
>> arch/arm64/crypto/adler32-sve-glue.c | 93 ++++++++++++++++++++
>> arch/arm64/crypto/adler32-sve.S | 127 +++++++++++++++++++++++++++
>> crypto/testmgr.c | 8 +-
>> crypto/testmgr.h | 13 +++
>
> Please split into two patches. Also, who is going to use this "adler32" shash?

In the kernel, adler32 is used by the zlib_deflate algorithm as a checksum algorithm,
and the same is used in the libz library.

>
>> 6 files changed, 248 insertions(+), 1 deletion(-)
>> create mode 100644 arch/arm64/crypto/adler32-sve-glue.c
>> create mode 100644 arch/arm64/crypto/adler32-sve.S
>>
>> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
>> index b8eb045..cfe58b9 100644
>> --- a/arch/arm64/crypto/Kconfig
>> +++ b/arch/arm64/crypto/Kconfig
>> @@ -126,4 +126,9 @@ config CRYPTO_AES_ARM64_BS
>> select CRYPTO_LIB_AES
>> select CRYPTO_SIMD
>>
>> +config SVE_ADLER32
>> + tristate "Accelerate Adler32 using arm64 SVE instructions."
>> + depends on ARM64_SVE
>> + select CRYPTO_HASH
>> +
>> endif
>> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
>> index d0901e6..45fe649 100644
>> --- a/arch/arm64/crypto/Makefile
>> +++ b/arch/arm64/crypto/Makefile
>> @@ -63,6 +63,9 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
>> obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
>> aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
>>
>> +obj-$(CONFIG_SVE_ADLER32) += sve-adler32.o
>> +sve-adler32-y := adler32-sve.o adler32-sve-glue.o
>> +
>> CFLAGS_aes-glue-ce.o := -DUSE_V8_CRYPTO_EXTENSIONS
>>
>> $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
>> diff --git a/arch/arm64/crypto/adler32-sve-glue.c b/arch/arm64/crypto/adler32-sve-glue.c
>> new file mode 100644
>> index 0000000..cb74514
>> --- /dev/null
>> +++ b/arch/arm64/crypto/adler32-sve-glue.c
>> @@ -0,0 +1,93 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Accelerate Adler32 using arm64 SVE instructions.
>> + * Automatically support all bit width of SVE
>> + * vector(128~2048).
>> + *
>> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
>> + *
>> + * Author: Li Qiang <[email protected]>
>> + */
>> +#include <linux/cpufeature.h>
>> +#include <linux/kernel.h>
>> +#include <linux/module.h>
>> +#include <linux/zutil.h>
>> +
>> +#include <crypto/internal/hash.h>
>> +#include <crypto/internal/simd.h>
>> +
>> +#include <asm/neon.h>
>> +#include <asm/simd.h>
>> +
>> +/* Scalable vector extension min size 128bit */
>> +#define SVE_ADLER32_MIN_SIZE 16U
>> +#define SVE_ADLER32_DIGEST_SIZE 4
>> +#define SVE_ADLER32_BLOCKSIZE 1
>> +
>> +asmlinkage u32 adler32_sve(u32 adler, const u8 *buf, u32 len);
>> +
>> +static int adler32_sve_init(struct shash_desc *desc)
>> +{
>> + u32 *adler32 = shash_desc_ctx(desc);
>> +
>> + *adler32 = 1;
>> + return 0;
>> +}
>> +
>> +static int adler32_sve_update(struct shash_desc *desc, const u8 *data,
>> + unsigned int length)
>
> Please indent function parameters
>
>> +{
>> + u32 *adler32 = shash_desc_ctx(desc);
>> +
>> + if (length >= SVE_ADLER32_MIN_SIZE && crypto_simd_usable()) {
>> + kernel_neon_begin();
>> + *adler32 = adler32_sve(*adler32, data, length);
>> + kernel_neon_end();
>> + } else {
>> + *adler32 = zlib_adler32(*adler32, data, length);
>> + }
>> + return 0;
>> +}
>> +
>> +static int adler32_sve_final(struct shash_desc *desc, u8 *out)
>> +{
>> + u32 *adler32 = shash_desc_ctx(desc);
>> +
>> + *(u32 *)out = *adler32;
>
> Please use put_unaligned here
>
>> + return 0;
>> +}
>> +
>> +static struct shash_alg adler32_sve_alg[] = {{
>> + .digestsize = SVE_ADLER32_DIGEST_SIZE,
>> + .descsize = SVE_ADLER32_DIGEST_SIZE,
>> + .init = adler32_sve_init,
>> + .update = adler32_sve_update,
>> + .final = adler32_sve_final,
>> +
>> + .base.cra_name = "adler32",
>> + .base.cra_driver_name = "adler32-arm64-sve",
>> + .base.cra_priority = 200,
>> + .base.cra_blocksize = SVE_ADLER32_BLOCKSIZE,
>> + .base.cra_module = THIS_MODULE,
>
> Please make sure the indentation is correct here.
>
>> +}};
>> +
>> +static int __init adler32_sve_mod_init(void)
>> +{
>> + if (!cpu_have_named_feature(SVE))
>> + return 0;
>> +
>> + return crypto_register_shash(adler32_sve_alg);
>> +}
>> +
>> +static void __exit adler32_sve_mod_exit(void)
>> +{
>> + crypto_unregister_shash(adler32_sve_alg);
>> +}
>> +
>> +module_init(adler32_sve_mod_init);
>> +module_exit(adler32_sve_mod_exit);
>> +
>> +MODULE_AUTHOR("Li Qiang <[email protected]>");
>> +MODULE_LICENSE("GPL v2");
>> +MODULE_ALIAS_CRYPTO("adler32");
>> +MODULE_ALIAS_CRYPTO("adler32-arm64-sve");
>> diff --git a/arch/arm64/crypto/adler32-sve.S b/arch/arm64/crypto/adler32-sve.S
>> new file mode 100644
>> index 0000000..34ee4bb
>> --- /dev/null
>> +++ b/arch/arm64/crypto/adler32-sve.S
>> @@ -0,0 +1,127 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +/*
>> + * Accelerate Adler32 using arm64 SVE instructions. Automatically support all bit
>> + * width of SVE vector(128~2048).
>> + *
>> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
>> + *
>> + * Author: Li Qiang <[email protected]>
>> + */
>> +
>> +#include <linux/linkage.h>
>> +#include <asm/assembler.h>
>> +
>> +.arch armv8-a+sve
>> +.file "adler32_sve.S"
>
> Drop the .file
>
> Please indent the rest 1 tab
>
>> +.text
>> +.align 6
>> +
>> +//The supported sve vector length range is 128~2048 by this Adler_sequence
>> +.Adler_sequence:
>
> This should be in .rodata. Also, if you use . or L prefixes, use .L
> because that is what you need to make these local symbols.
>
>
>> + .short 256,255,254,253,252,251,250,249,248,247,246,245,244,243,242,241
>> + .short 240,239,238,237,236,235,234,233,232,231,230,229,228,227,226,225
>> + .short 224,223,222,221,220,219,218,217,216,215,214,213,212,211,210,209
>> + .short 208,207,206,205,204,203,202,201,200,199,198,197,196,195,194,193
>> + .short 192,191,190,189,188,187,186,185,184,183,182,181,180,179,178,177
>> + .short 176,175,174,173,172,171,170,169,168,167,166,165,164,163,162,161
>> + .short 160,159,158,157,156,155,154,153,152,151,150,149,148,147,146,145
>> + .short 144,143,142,141,140,139,138,137,136,135,134,133,132,131,130,129
>> + .short 128,127,126,125,124,123,122,121,120,119,118,117,116,115,114,113
>> + .short 112,111,110,109,108,107,106,105,104,103,102,101,100, 99, 98, 97
>> + .short 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81
>> + .short 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65
>> + .short 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49
>> + .short 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33
>> + .short 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17
>> + .short 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
>> +
>> +SYM_FUNC_START(adler32_sve)
>> + and w10, w0, #0xffff
>> + lsr w11, w0, #16
>> +
>
> Please put the instruction mnemonics in different columns using tabs
>
>> + // Get the length of the sve vector to x6.
>> + mov x6, #0
>> + mov x9, #256
>> + addvl x6, x6, #1
>> + adr x12, .Adler_sequence
>> + ptrue p1.h
>> +
>> + // Get the starting position of the required sequence.
>> + sub x9, x9, x6
>> + ld1h z24.h, p1/z, [x12, x9, lsl #1] // taps1 to z24.h
>> + inch x9
>> + ld1h z25.h, p1/z, [x12, x9, lsl #1] // taps2 to z25.h
>> + mov x9, #0
>> + // A little of byte, jumper to normal proc
>
> What does this comment mean?
>
>> + mov x14, #3
>> + mul x15, x14, x6
>> + cmp x2, x15
>> + b.le Lnormal_proc
>> +
>> + ptrue p0.b
>> +.align 6
>
> Indent.
>
>> +LBig_loop:
>
> Use .L prefix
>
>> + // x is SVE vector length (byte).
>> + // Bn = Bn-1 + An-1 * x + x * D1 + (x-1) * D2 + ... + 1 * Dx
>> + // An = An-1 + D1 + D2 + D3 + ... + Dx
>> +
>> + .macro ADLER_BLOCK_X
>
> Please use lower case for asm macros, to distinguish them from CPP macros
> Also, indent the name, and move the macro out of the function for legibility.
>
>> + ld1b z0.b, p0/z, [x1, x9]
>> + incb x9
>> + uaddv d20, p0, z0.b // D1 + D2 + ... + Dx
>> + mov x12, v20.2d[0]
>> + madd x11, x10, x6, x11 // Bn = An-1 * x + Bn-1
>> +
>> + uunpklo z26.h, z0.b
>> + uunpkhi z27.h, z0.b
>> + mul z26.h, p1/m, z26.h, z24.h // x * D1 + (x-1) * D2 + ... + (x/2 + 1) * D(x/2)
>> + mul z27.h, p1/m, z27.h, z25.h // (x/2) * D(x/2 + 1) + (x/2 - 1) * D(x/2 + 2) + ... + 1 * Dx
>> +
>> + uaddv d21, p1, z26.h
>> + uaddv d22, p1, z27.h
>> + mov x13, v21.2d[0]
>> + mov x14, v22.2d[0]
>> +
>> + add x11, x13, x11
>> + add x11, x14, x11 // Bn += x * D1 + (x-1) * D2 + ... + 1 * Dx
>> + add x10, x12, x10 // An += D1 + D2 + ... + Dx
>> + .endm
>> + ADLER_BLOCK_X
>> + ADLER_BLOCK_X
>> + ADLER_BLOCK_X
>> + // calc = reg0 % 65521
>> + .macro mod65521, reg0, reg1, reg2
>> + mov w\reg1, #0x8071
>> + mov w\reg2, #0xfff1
>> + movk w\reg1, #0x8007, lsl #16
>> + umull x\reg1, w\reg0, w\reg1
>> + lsr x\reg1, x\reg1, #47
>> + msub w\reg0, w\reg1, w\reg2, w\reg0
>> + .endm
>> +
>
> Same as above
>
>> + mod65521 10, 14, 12
>> + mod65521 11, 14, 12
>> +
>> + sub x2, x2, x15
>> + cmp x2, x15
>> + b.ge LBig_loop
>> +
>> +.align 6
>
> Indent
>
>> +Lnormal_proc:
>
> .L
>
>
>> + cmp x2, #0
>> + b.eq Lret
>> +
>> + ldrb w12, [x1, x9]
>> + add x9, x9, #1
>> + add x10, x12, x10
>> + add x11, x10, x11
>> + sub x2, x2, #1
>> + b Lnormal_proc
>> +
>> +Lret:
>> + mod65521 10, 14, 12
>> + mod65521 11, 14, 12
>> + lsl x11, x11, #16
>> + orr x0, x10, x11
>> + ret
>> +SYM_FUNC_END(adler32_sve)
>> diff --git a/crypto/testmgr.c b/crypto/testmgr.c
>> index a64a639..58b8020 100644
>> --- a/crypto/testmgr.c
>> +++ b/crypto/testmgr.c
>> @@ -4174,6 +4174,13 @@ static const struct alg_test_desc alg_test_descs[] = {
>> .suite = {
>> .cipher = __VECS(adiantum_xchacha20_aes_tv_template)
>> },
>> + }, {
>> + .alg = "adler32",
>> + .test = alg_test_hash,
>> + .fips_allowed = 1,
>> + .suite = {
>> + .hash = __VECS(adler32_tv_template)
>> + }
>> }, {
>> .alg = "aegis128",
>> .test = alg_test_aead,
>> @@ -5640,7 +5647,6 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
>> }
>>
>> DO_ONCE(testmgr_onetime_init);
>> -
>> if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
>> char nalg[CRYPTO_MAX_ALG_NAME];
>>
>> diff --git a/crypto/testmgr.h b/crypto/testmgr.h
>> index 8c83811..5233960 100644
>> --- a/crypto/testmgr.h
>> +++ b/crypto/testmgr.h
>> @@ -3676,6 +3676,19 @@ static const struct hash_testvec crct10dif_tv_template[] = {
>> }
>> };
>>
>> +static const struct hash_testvec adler32_tv_template[] = {
>> + {
>> + .plaintext = "abcde",
>> + .psize = 5,
>> + .digest = "\xf0\x01\xc8\x05",
>> + },
>> + {
>> + .plaintext = "0123456789101112131415",
>> + .psize = 22,
>> + .digest = "\x63\x04\xa8\x32",
>> + },
>> +};
>> +
>> /*
>> * Streebog test vectors from RFC 6986 and GOST R 34.11-2012
>> */
>> --
>> 2.19.1
>>
> .
>

Thank you for your suggestions; I will address them one by one in my code.
:-)

--
Best regards,
Li Qiang

2020-11-04 08:06:29

by Ard Biesheuvel

Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Wed, 4 Nov 2020 at 09:02, Li Qiang <[email protected]> wrote:
>
> Hi Ard,
>
> Thank you very much for your reply and comments on the code :)
>
> On 2020/11/3 22:34, Ard Biesheuvel wrote:
> > (+ Dave)
> >
> > Hello liqiang,
> >
> > First of all, I don't think it is safe at the moment to use SVE in the
> > kernel, as we don't preserve all state IIRC. My memory is a bit hazy,
> > though, so perhaps Dave can elaborate?
>
> OK, I understand this problem now.
>
> >
> > I'll give my feedback on the code itself below, but please don't
> > consider this an ack on the intent of using SVE in the kernel.
> >
> > Could you explain why SVE is so much better than NEON here?
>
> In the previous months of work, I spent some time researching ARM's SVE
> instruction set, and tried to use SVE instructions to optimize adler32
> and some other general algorithms, and achieved good optimization results
> on Adler32.
>
> According to my current research and debugging (see cover-letter), I think
> the vectorization method of the Adler32 algorithm is very suitable for an SVE
> implementation, because the data can be divided into blocks of any length,
> such as 16 bytes, 32 bytes, 128 bytes or 256 bytes, so the code can adapt to
> SVE hardware of any width from 128-bit to 2048-bit. Supporting SVE
> instructions with different bit widths does not require any special handling;
> it only needs to determine the starting position of "Adler_sequence"
> according to the SVE bit width, so different hardware can make full use of
> its performance.
>
> I am also trying to implement the algorithm with NEON instructions. I will
> reply to you in time if there are results.
>
> >
> > On Tue, 3 Nov 2020 at 13:16, l00374334 <[email protected]> wrote:
> >>
> >> From: liqiang <[email protected]>
> >>
> >> In the libz library, the checksum algorithm adler32 usually occupies
> >> a relatively high hot spot, and the SVE instruction set can easily
> >> accelerate it, so that the performance of libz library will be
> >> significantly improved.
> >>
> >> We can divide buf into blocks according to the bit width of SVE,
> >> and then use vector registers to perform operations in units of blocks
> >> to achieve the purpose of acceleration.
> >>
> >> On machines that support ARM64 sve instructions, this algorithm is
> >> about 3~4 times faster than the algorithm implemented in C language
> >> in libz. The wider the SVE instruction, the better the acceleration effect.
> >>
> >> Measured on a Taishan 1951 machine that supports 256bit width SVE,
> >> below are the results of my measured random data of 1M and 10M:
> >>
> >> [root@xxx adler32]# ./benchmark 1000000
> >> Libz alg: Time used: 608 us, 1644.7 Mb/s.
> >> SVE alg: Time used: 166 us, 6024.1 Mb/s.
> >>
> >> [root@xxx adler32]# ./benchmark 10000000
> >> Libz alg: Time used: 6484 us, 1542.3 Mb/s.
> >> SVE alg: Time used: 2034 us, 4916.4 Mb/s.
> >>
> >> The blocks can be of any size, so the algorithm can automatically adapt
> >> to SVE hardware with different bit widths without modifying the code.
> >>
> >
> > Please drop this indentation from the commit log.
> >
> >>
> >> Signed-off-by: liqiang <[email protected]>
> >> ---
> >> arch/arm64/crypto/Kconfig | 5 ++
> >> arch/arm64/crypto/Makefile | 3 +
> >> arch/arm64/crypto/adler32-sve-glue.c | 93 ++++++++++++++++++++
> >> arch/arm64/crypto/adler32-sve.S | 127 +++++++++++++++++++++++++++
> >> crypto/testmgr.c | 8 +-
> >> crypto/testmgr.h | 13 +++
> >
> > Please split into two patches. Also, who is going to use this "adler32" shash?
>
> In the kernel, adler32 is used by the zlib_deflate algorithm as a checksum algorithm,
> and the same is used in the libz library.
>

I understand that zlib_deflate uses adler32 internally. But where does
it invoke the crypto API to use the shash abstraction to perform this
transformation?
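
To make that concrete: a kernel-side user of this shash would need something
along the lines of the sketch below, and nothing like it exists in
lib/zlib_deflate/ today (a rough sketch only, not code from any tree):

#include <crypto/hash.h>
#include <linux/err.h>

static int adler32_via_shash(const u8 *data, unsigned int len, u32 *out)
{
	struct crypto_shash *tfm;
	int err;

	tfm = crypto_alloc_shash("adler32", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	{
		SHASH_DESC_ON_STACK(desc, tfm);

		desc->tfm = tfm;
		err = crypto_shash_digest(desc, data, len, (u8 *)out);
	}

	crypto_free_shash(tfm);
	return err;
}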

> >
> >> 6 files changed, 248 insertions(+), 1 deletion(-)
> >> create mode 100644 arch/arm64/crypto/adler32-sve-glue.c
> >> create mode 100644 arch/arm64/crypto/adler32-sve.S
> >>
> >> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> >> index b8eb045..cfe58b9 100644
> >> --- a/arch/arm64/crypto/Kconfig
> >> +++ b/arch/arm64/crypto/Kconfig
> >> @@ -126,4 +126,9 @@ config CRYPTO_AES_ARM64_BS
> >> select CRYPTO_LIB_AES
> >> select CRYPTO_SIMD
> >>
> >> +config SVE_ADLER32
> >> + tristate "Accelerate Adler32 using arm64 SVE instructions."
> >> + depends on ARM64_SVE
> >> + select CRYPTO_HASH
> >> +
> >> endif
> >> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> >> index d0901e6..45fe649 100644
> >> --- a/arch/arm64/crypto/Makefile
> >> +++ b/arch/arm64/crypto/Makefile
> >> @@ -63,6 +63,9 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
> >> obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
> >> aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
> >>
> >> +obj-$(CONFIG_SVE_ADLER32) += sve-adler32.o
> >> +sve-adler32-y := adler32-sve.o adler32-sve-glue.o
> >> +
> >> CFLAGS_aes-glue-ce.o := -DUSE_V8_CRYPTO_EXTENSIONS
> >>
> >> $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
> >> diff --git a/arch/arm64/crypto/adler32-sve-glue.c b/arch/arm64/crypto/adler32-sve-glue.c
> >> new file mode 100644
> >> index 0000000..cb74514
> >> --- /dev/null
> >> +++ b/arch/arm64/crypto/adler32-sve-glue.c
> >> @@ -0,0 +1,93 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/*
> >> + * Accelerate Adler32 using arm64 SVE instructions.
> >> + * Automatically support all bit width of SVE
> >> + * vector(128~2048).
> >> + *
> >> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
> >> + *
> >> + * Author: Li Qiang <[email protected]>
> >> + */
> >> +#include <linux/cpufeature.h>
> >> +#include <linux/kernel.h>
> >> +#include <linux/module.h>
> >> +#include <linux/zutil.h>
> >> +
> >> +#include <crypto/internal/hash.h>
> >> +#include <crypto/internal/simd.h>
> >> +
> >> +#include <asm/neon.h>
> >> +#include <asm/simd.h>
> >> +
> >> +/* Scalable vector extension min size 128bit */
> >> +#define SVE_ADLER32_MIN_SIZE 16U
> >> +#define SVE_ADLER32_DIGEST_SIZE 4
> >> +#define SVE_ADLER32_BLOCKSIZE 1
> >> +
> >> +asmlinkage u32 adler32_sve(u32 adler, const u8 *buf, u32 len);
> >> +
> >> +static int adler32_sve_init(struct shash_desc *desc)
> >> +{
> >> + u32 *adler32 = shash_desc_ctx(desc);
> >> +
> >> + *adler32 = 1;
> >> + return 0;
> >> +}
> >> +
> >> +static int adler32_sve_update(struct shash_desc *desc, const u8 *data,
> >> + unsigned int length)
> >
> > Please indent function parameters
> >
> >> +{
> >> + u32 *adler32 = shash_desc_ctx(desc);
> >> +
> >> + if (length >= SVE_ADLER32_MIN_SIZE && crypto_simd_usable()) {
> >> + kernel_neon_begin();
> >> + *adler32 = adler32_sve(*adler32, data, length);
> >> + kernel_neon_end();
> >> + } else {
> >> + *adler32 = zlib_adler32(*adler32, data, length);
> >> + }
> >> + return 0;
> >> +}
> >> +
> >> +static int adler32_sve_final(struct shash_desc *desc, u8 *out)
> >> +{
> >> + u32 *adler32 = shash_desc_ctx(desc);
> >> +
> >> + *(u32 *)out = *adler32;
> >
> > Please use put_unaligned here
> >
> >> + return 0;
> >> +}
> >> +
> >> +static struct shash_alg adler32_sve_alg[] = {{
> >> + .digestsize = SVE_ADLER32_DIGEST_SIZE,
> >> + .descsize = SVE_ADLER32_DIGEST_SIZE,
> >> + .init = adler32_sve_init,
> >> + .update = adler32_sve_update,
> >> + .final = adler32_sve_final,
> >> +
> >> + .base.cra_name = "adler32",
> >> + .base.cra_driver_name = "adler32-arm64-sve",
> >> + .base.cra_priority = 200,
> >> + .base.cra_blocksize = SVE_ADLER32_BLOCKSIZE,
> >> + .base.cra_module = THIS_MODULE,
> >
> > Please make sure the indentation is correct here.
> >
> >> +}};
> >> +
> >> +static int __init adler32_sve_mod_init(void)
> >> +{
> >> + if (!cpu_have_named_feature(SVE))
> >> + return 0;
> >> +
> >> + return crypto_register_shash(adler32_sve_alg);
> >> +}
> >> +
> >> +static void __exit adler32_sve_mod_exit(void)
> >> +{
> >> + crypto_unregister_shash(adler32_sve_alg);
> >> +}
> >> +
> >> +module_init(adler32_sve_mod_init);
> >> +module_exit(adler32_sve_mod_exit);
> >> +
> >> +MODULE_AUTHOR("Li Qiang <[email protected]>");
> >> +MODULE_LICENSE("GPL v2");
> >> +MODULE_ALIAS_CRYPTO("adler32");
> >> +MODULE_ALIAS_CRYPTO("adler32-arm64-sve");
> >> diff --git a/arch/arm64/crypto/adler32-sve.S b/arch/arm64/crypto/adler32-sve.S
> >> new file mode 100644
> >> index 0000000..34ee4bb
> >> --- /dev/null
> >> +++ b/arch/arm64/crypto/adler32-sve.S
> >> @@ -0,0 +1,127 @@
> >> +/* SPDX-License-Identifier: GPL-2.0-only */
> >> +/*
> >> + * Accelerate Adler32 using arm64 SVE instructions. Automatically support all bit
> >> + * width of SVE vector(128~2048).
> >> + *
> >> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
> >> + *
> >> + * Author: Li Qiang <[email protected]>
> >> + */
> >> +
> >> +#include <linux/linkage.h>
> >> +#include <asm/assembler.h>
> >> +
> >> +.arch armv8-a+sve
> >> +.file "adler32_sve.S"
> >
> > Drop the .file
> >
> > Please indent the rest 1 tab
> >
> >> +.text
> >> +.align 6
> >> +
> >> +//The supported sve vector length range is 128~2048 by this Adler_sequence
> >> +.Adler_sequence:
> >
> > This should be in .rodata. Also, if you use . or L prefixes, use .L
> > because that is what you need to make these local symbols.
> >
> >
> >> + .short 256,255,254,253,252,251,250,249,248,247,246,245,244,243,242,241
> >> + .short 240,239,238,237,236,235,234,233,232,231,230,229,228,227,226,225
> >> + .short 224,223,222,221,220,219,218,217,216,215,214,213,212,211,210,209
> >> + .short 208,207,206,205,204,203,202,201,200,199,198,197,196,195,194,193
> >> + .short 192,191,190,189,188,187,186,185,184,183,182,181,180,179,178,177
> >> + .short 176,175,174,173,172,171,170,169,168,167,166,165,164,163,162,161
> >> + .short 160,159,158,157,156,155,154,153,152,151,150,149,148,147,146,145
> >> + .short 144,143,142,141,140,139,138,137,136,135,134,133,132,131,130,129
> >> + .short 128,127,126,125,124,123,122,121,120,119,118,117,116,115,114,113
> >> + .short 112,111,110,109,108,107,106,105,104,103,102,101,100, 99, 98, 97
> >> + .short 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81
> >> + .short 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65
> >> + .short 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49
> >> + .short 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33
> >> + .short 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17
> >> + .short 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
> >> +
> >> +SYM_FUNC_START(adler32_sve)
> >> + and w10, w0, #0xffff
> >> + lsr w11, w0, #16
> >> +
> >
> > Please put the instruction mnemonics in different columns using tabs
> >
> >> + // Get the length of the sve vector to x6.
> >> + mov x6, #0
> >> + mov x9, #256
> >> + addvl x6, x6, #1
> >> + adr x12, .Adler_sequence
> >> + ptrue p1.h
> >> +
> >> + // Get the starting position of the required sequence.
> >> + sub x9, x9, x6
> >> + ld1h z24.h, p1/z, [x12, x9, lsl #1] // taps1 to z24.h
> >> + inch x9
> >> + ld1h z25.h, p1/z, [x12, x9, lsl #1] // taps2 to z25.h
> >> + mov x9, #0
> >> + // A little of byte, jumper to normal proc
> >
> > What does this comment mean?
> >
> >> + mov x14, #3
> >> + mul x15, x14, x6
> >> + cmp x2, x15
> >> + b.le Lnormal_proc
> >> +
> >> + ptrue p0.b
> >> +.align 6
> >
> > Indent.
> >
> >> +LBig_loop:
> >
> > Use .L prefix
> >
> >> + // x is SVE vector length (byte).
> >> + // Bn = Bn-1 + An-1 * x + x * D1 + (x-1) * D2 + ... + 1 * Dx
> >> + // An = An-1 + D1 + D2 + D3 + ... + Dx
> >> +
> >> + .macro ADLER_BLOCK_X
> >
> > Please use lower case for asm macros, to distinguish them from CPP macros
> > Also, indent the name, and move the macro out of the function for legibility.
> >
> >> + ld1b z0.b, p0/z, [x1, x9]
> >> + incb x9
> >> + uaddv d20, p0, z0.b // D1 + D2 + ... + Dx
> >> + mov x12, v20.2d[0]
> >> + madd x11, x10, x6, x11 // Bn = An-1 * x + Bn-1
> >> +
> >> + uunpklo z26.h, z0.b
> >> + uunpkhi z27.h, z0.b
> >> + mul z26.h, p1/m, z26.h, z24.h // x * D1 + (x-1) * D2 + ... + (x/2 + 1) * D(x/2)
> >> + mul z27.h, p1/m, z27.h, z25.h // (x/2) * D(x/2 + 1) + (x/2 - 1) * D(x/2 + 2) + ... + 1 * Dx
> >> +
> >> + uaddv d21, p1, z26.h
> >> + uaddv d22, p1, z27.h
> >> + mov x13, v21.2d[0]
> >> + mov x14, v22.2d[0]
> >> +
> >> + add x11, x13, x11
> >> + add x11, x14, x11 // Bn += x * D1 + (x-1) * D2 + ... + 1 * Dx
> >> + add x10, x12, x10 // An += D1 + D2 + ... + Dx
> >> + .endm
> >> + ADLER_BLOCK_X
> >> + ADLER_BLOCK_X
> >> + ADLER_BLOCK_X
> >> + // calc = reg0 % 65521
> >> + .macro mod65521, reg0, reg1, reg2
> >> + mov w\reg1, #0x8071
> >> + mov w\reg2, #0xfff1
> >> + movk w\reg1, #0x8007, lsl #16
> >> + umull x\reg1, w\reg0, w\reg1
> >> + lsr x\reg1, x\reg1, #47
> >> + msub w\reg0, w\reg1, w\reg2, w\reg0
> >> + .endm
> >> +
> >
> > Same as above
> >
> >> + mod65521 10, 14, 12
> >> + mod65521 11, 14, 12
> >> +
> >> + sub x2, x2, x15
> >> + cmp x2, x15
> >> + b.ge LBig_loop
> >> +
> >> +.align 6
> >
> > Indent
> >
> >> +Lnormal_proc:
> >
> > .L
> >
> >
> >> + cmp x2, #0
> >> + b.eq Lret
> >> +
> >> + ldrb w12, [x1, x9]
> >> + add x9, x9, #1
> >> + add x10, x12, x10
> >> + add x11, x10, x11
> >> + sub x2, x2, #1
> >> + b Lnormal_proc
> >> +
> >> +Lret:
> >> + mod65521 10, 14, 12
> >> + mod65521 11, 14, 12
> >> + lsl x11, x11, #16
> >> + orr x0, x10, x11
> >> + ret
> >> +SYM_FUNC_END(adler32_sve)
> >> diff --git a/crypto/testmgr.c b/crypto/testmgr.c
> >> index a64a639..58b8020 100644
> >> --- a/crypto/testmgr.c
> >> +++ b/crypto/testmgr.c
> >> @@ -4174,6 +4174,13 @@ static const struct alg_test_desc alg_test_descs[] = {
> >> .suite = {
> >> .cipher = __VECS(adiantum_xchacha20_aes_tv_template)
> >> },
> >> + }, {
> >> + .alg = "adler32",
> >> + .test = alg_test_hash,
> >> + .fips_allowed = 1,
> >> + .suite = {
> >> + .hash = __VECS(adler32_tv_template)
> >> + }
> >> }, {
> >> .alg = "aegis128",
> >> .test = alg_test_aead,
> >> @@ -5640,7 +5647,6 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
> >> }
> >>
> >> DO_ONCE(testmgr_onetime_init);
> >> -
> >> if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
> >> char nalg[CRYPTO_MAX_ALG_NAME];
> >>
> >> diff --git a/crypto/testmgr.h b/crypto/testmgr.h
> >> index 8c83811..5233960 100644
> >> --- a/crypto/testmgr.h
> >> +++ b/crypto/testmgr.h
> >> @@ -3676,6 +3676,19 @@ static const struct hash_testvec crct10dif_tv_template[] = {
> >> }
> >> };
> >>
> >> +static const struct hash_testvec adler32_tv_template[] = {
> >> + {
> >> + .plaintext = "abcde",
> >> + .psize = 5,
> >> + .digest = "\xf0\x01\xc8\x05",
> >> + },
> >> + {
> >> + .plaintext = "0123456789101112131415",
> >> + .psize = 22,
> >> + .digest = "\x63\x04\xa8\x32",
> >> + },
> >> +};
> >> +
> >> /*
> >> * Streebog test vectors from RFC 6986 and GOST R 34.11-2012
> >> */
> >> --
> >> 2.19.1
> >>
> > .
> >
>
> Thank you for your suggestions, I will modify them one by one in my code.
> :-)
>
> --
> Best regards,
> Li Qiang

2020-11-04 08:15:35

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.



On 2020/11/4 16:04, Ard Biesheuvel wrote:
> I understand that zlib_deflate uses adler32 internally. But where does
> it invoke the crypto API to use the shash abstraction to perform this
> transformation?

Um... yes, I haven't finished this part yet.

--
Best regards,
Li Qiang

2020-11-04 09:20:48

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

Hi Dave,

Thank you very much for your reply and suggestions. :)

On 2020/11/4 2:00, Dave Martin wrote:
> On Tue, Nov 03, 2020 at 03:34:27PM +0100, Ard Biesheuvel wrote:
>> (+ Dave)
>>
>> Hello liqiang,
>>
>> First of all, I don't think it is safe at the moment to use SVE in the
>> kernel, as we don't preserve all state IIRC. My memory is a bit hazy,
>
> I'm not convinced that it's safe right now. SVE in the kernel is
> unsupported, partly due to cost and partly due to the lack of a
> compelling use case.
>
> I think it would be preferable to see this algo accelerated for NEON
> first, since all AArch64 hardware can benefit from that.

Yes, I am trying it seriously. :)

>
> It's good to see someone experimenting with SVE though -- so feel
> free to experiment with this in userspace and see what sort of benchmark
> you can achieve. But there's no guarantee this can be merged in the
> kernel any time soon. If there's a big enough advantage over NEON, then
> it becomes more interesting.

Oh yes, I think so too! :)
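
For the userspace comparison, a minimal harness along these lines is enough
(illustrative only; adler32_sve() here stands for the routine from the patch
built into the test program, and zlib's adler32() comes from -lz):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>
#include <zlib.h>

extern uint32_t adler32_sve(uint32_t adler, const unsigned char *buf,
			    uint32_t len);

static long us_now(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000000L + tv.tv_usec;
}

int main(int argc, char **argv)
{
	size_t len = argc > 1 ? strtoul(argv[1], NULL, 0) : 1000000;
	unsigned char *buf = malloc(len);
	uint32_t r_libz, r_sve;
	long t0, t1;
	size_t i;

	if (!buf)
		return 1;
	for (i = 0; i < len; i++)		/* random test data */
		buf[i] = rand();

	t0 = us_now();
	r_libz = adler32(1, buf, len);		/* zlib reference */
	t1 = us_now();
	printf("Libz alg: Time used: %ld us\n", t1 - t0);

	t0 = us_now();
	r_sve = adler32_sve(1, buf, len);
	t1 = us_now();
	printf("SVE alg:  Time used: %ld us\n", t1 - t0);

	printf("%s\n", r_libz == r_sve ? "Equal" : "NOT equal");
	free(buf);
	return 0;
}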

>
>
>> though, so perhaps Dave can elaborate?
>
> Historically there was another reason: for every nested
> kernel_neon_begin(), we would have potentially needed to save the entire
> SVE state, which is a problem -- we'd likely run out of kernel stack,
> plus it could get costly, especially for latency. Since we refactored
> the kernel-mode NEON support to remove nesting support, this is not so
> much of a problem. kernel_neon_begin() may incur a save of the full SVE
> state anyway, so in some ways it would be a good thing if we could
> actually make use of all those registers.
>
> SVE hardware remains rare, so as a general policy I don't think we
> should accept SVE implementations of any algorithm that does not
> already have a NEON implementation -- unless the contributor can
> explain why nobody with non-SVE hardware is going to care about the
> performance of that algo.
>
>
>>
>> I'll give my feedback on the code itself below, but please don't
>> consider this an ack on the intent of using SVE in the kernel.
>>
>> Could you explain why SVE is so much better than NEON here?
>
> From my end, I think that SVE probably doesn't offer special advantages
> here beyond less messy code and more number-crunching per loop
> iteration.
>
> That doesn't mean that SVE can't beat NEON, but NEON should still offer
> a big advantage over scalar code.
>
> Since the calculations are quite simple, even the NEON version may
> tend to saturate load/store bandwidth on some hardware, in which case
> the additional performance from SVE may be limited. Experimentation
> would be needed in order to know for sure.

Yes, I found this problem when I was testing the SVE version and NEON
version of other algorithms (higher data throughput). The load/store
unit limits the performance of SVE, resulting in just a slight improvement
in the performance of SVE compared to NEON.

>
>
>
> I've made some random comments on the code below -- but my knowledge of
> the SVE instruction set is a little rusty, and I haven't tried to
> understand precisely what this algo is trying to do!
>
>>
>> On Tue, 3 Nov 2020 at 13:16, l00374334 <[email protected]> wrote:
>>>
>>> From: liqiang <[email protected]>
>>>
>>> In the libz library, the checksum algorithm adler32 usually occupies
>>> a relatively high hot spot, and the SVE instruction set can easily
>>> accelerate it, so that the performance of libz library will be
>>> significantly improved.
>>>
>>> We can divide buf into blocks according to the bit width of SVE,
>>> and then use vector registers to perform operations in units of blocks
>>> to achieve the purpose of acceleration.
>>>
>>> On machines that support ARM64 sve instructions, this algorithm is
>>> about 3~4 times faster than the algorithm implemented in C language
>>> in libz. The wider the SVE instruction, the better the acceleration effect.
>>>
>>> Measured on a Taishan 1951 machine that supports 256bit width SVE,
>>> below are the results of my measured random data of 1M and 10M:
>>>
>>> [root@xxx adler32]# ./benchmark 1000000
>>> Libz alg: Time used: 608 us, 1644.7 Mb/s.
>>> SVE alg: Time used: 166 us, 6024.1 Mb/s.
>>>
>>> [root@xxx adler32]# ./benchmark 10000000
>>> Libz alg: Time used: 6484 us, 1542.3 Mb/s.
>>> SVE alg: Time used: 2034 us, 4916.4 Mb/s.
>>>
>>> The blocks can be of any size, so the algorithm can automatically adapt
>>> to SVE hardware with different bit widths without modifying the code.
>>>
>>
>> Please drop this indentation from the commit log.
>>
>>>
>>> Signed-off-by: liqiang <[email protected]>
>>> ---
>>> arch/arm64/crypto/Kconfig | 5 ++
>>> arch/arm64/crypto/Makefile | 3 +
>>> arch/arm64/crypto/adler32-sve-glue.c | 93 ++++++++++++++++++++
>>> arch/arm64/crypto/adler32-sve.S | 127 +++++++++++++++++++++++++++
>>> crypto/testmgr.c | 8 +-
>>> crypto/testmgr.h | 13 +++
>>
>> Please split into two patches. Also, who is going to use this "adler32" shash?
>>
>>> 6 files changed, 248 insertions(+), 1 deletion(-)
>>> create mode 100644 arch/arm64/crypto/adler32-sve-glue.c
>>> create mode 100644 arch/arm64/crypto/adler32-sve.S
>>>
>>> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
>>> index b8eb045..cfe58b9 100644
>>> --- a/arch/arm64/crypto/Kconfig
>>> +++ b/arch/arm64/crypto/Kconfig
>>> @@ -126,4 +126,9 @@ config CRYPTO_AES_ARM64_BS
>>> select CRYPTO_LIB_AES
>>> select CRYPTO_SIMD
>>>
>>> +config SVE_ADLER32
>>> + tristate "Accelerate Adler32 using arm64 SVE instructions."
>>> + depends on ARM64_SVE
>>> + select CRYPTO_HASH
>>> +
>>> endif
>>> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
>>> index d0901e6..45fe649 100644
>>> --- a/arch/arm64/crypto/Makefile
>>> +++ b/arch/arm64/crypto/Makefile
>>> @@ -63,6 +63,9 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
>>> obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
>>> aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
>>>
>>> +obj-$(CONFIG_SVE_ADLER32) += sve-adler32.o
>>> +sve-adler32-y := adler32-sve.o adler32-sve-glue.o
>>> +
>>> CFLAGS_aes-glue-ce.o := -DUSE_V8_CRYPTO_EXTENSIONS
>>>
>>> $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
>>> diff --git a/arch/arm64/crypto/adler32-sve-glue.c b/arch/arm64/crypto/adler32-sve-glue.c
>>> new file mode 100644
>>> index 0000000..cb74514
>>> --- /dev/null
>>> +++ b/arch/arm64/crypto/adler32-sve-glue.c
>>> @@ -0,0 +1,93 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Accelerate Adler32 using arm64 SVE instructions.
>>> + * Automatically support all bit width of SVE
>>> + * vector(128~2048).
>>> + *
>>> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
>>> + *
>>> + * Author: Li Qiang <[email protected]>
>>> + */
>>> +#include <linux/cpufeature.h>
>>> +#include <linux/kernel.h>
>>> +#include <linux/module.h>
>>> +#include <linux/zutil.h>
>>> +
>>> +#include <crypto/internal/hash.h>
>>> +#include <crypto/internal/simd.h>
>>> +
>>> +#include <asm/neon.h>
>>> +#include <asm/simd.h>
>>> +
>>> +/* Scalable vector extension min size 128bit */
>>> +#define SVE_ADLER32_MIN_SIZE 16U
>>> +#define SVE_ADLER32_DIGEST_SIZE 4
>>> +#define SVE_ADLER32_BLOCKSIZE 1
>>> +
>>> +asmlinkage u32 adler32_sve(u32 adler, const u8 *buf, u32 len);
>>> +
>>> +static int adler32_sve_init(struct shash_desc *desc)
>>> +{
>>> + u32 *adler32 = shash_desc_ctx(desc);
>>> +
>>> + *adler32 = 1;
>>> + return 0;
>>> +}
>>> +
>>> +static int adler32_sve_update(struct shash_desc *desc, const u8 *data,
>>> + unsigned int length)
>>
>> Please indent function parameters
>>
>>> +{
>>> + u32 *adler32 = shash_desc_ctx(desc);
>>> +
>>> + if (length >= SVE_ADLER32_MIN_SIZE && crypto_simd_usable()) {
>>> + kernel_neon_begin();
>>> + *adler32 = adler32_sve(*adler32, data, length);
>>> + kernel_neon_end();
>>> + } else {
>>> + *adler32 = zlib_adler32(*adler32, data, length);
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +static int adler32_sve_final(struct shash_desc *desc, u8 *out)
>>> +{
>>> + u32 *adler32 = shash_desc_ctx(desc);
>>> +
>>> + *(u32 *)out = *adler32;
>>
>> Please use put_unaligned here
>>
>>> + return 0;
>>> +}
>>> +
>>> +static struct shash_alg adler32_sve_alg[] = {{
>>> + .digestsize = SVE_ADLER32_DIGEST_SIZE,
>>> + .descsize = SVE_ADLER32_DIGEST_SIZE,
>>> + .init = adler32_sve_init,
>>> + .update = adler32_sve_update,
>>> + .final = adler32_sve_final,
>>> +
>>> + .base.cra_name = "adler32",
>>> + .base.cra_driver_name = "adler32-arm64-sve",
>>> + .base.cra_priority = 200,
>>> + .base.cra_blocksize = SVE_ADLER32_BLOCKSIZE,
>>> + .base.cra_module = THIS_MODULE,
>>
>> Please make sure the indentation is correct here.
>>
>>> +}};
>>> +
>>> +static int __init adler32_sve_mod_init(void)
>>> +{
>>> + if (!cpu_have_named_feature(SVE))
>>> + return 0;
>>> +
>>> + return crypto_register_shash(adler32_sve_alg);
>>> +}
>>> +
>>> +static void __exit adler32_sve_mod_exit(void)
>>> +{
>>> + crypto_unregister_shash(adler32_sve_alg);
>>> +}
>>> +
>>> +module_init(adler32_sve_mod_init);
>>> +module_exit(adler32_sve_mod_exit);
>>> +
>>> +MODULE_AUTHOR("Li Qiang <[email protected]>");
>>> +MODULE_LICENSE("GPL v2");
>>> +MODULE_ALIAS_CRYPTO("adler32");
>>> +MODULE_ALIAS_CRYPTO("adler32-arm64-sve");
>>> diff --git a/arch/arm64/crypto/adler32-sve.S b/arch/arm64/crypto/adler32-sve.S
>>> new file mode 100644
>>> index 0000000..34ee4bb
>>> --- /dev/null
>>> +++ b/arch/arm64/crypto/adler32-sve.S
>>> @@ -0,0 +1,127 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * Accelerate Adler32 using arm64 SVE instructions. Automatically support all bit
>>> + * width of SVE vector(128~2048).
>>> + *
>>> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
>>> + *
>>> + * Author: Li Qiang <[email protected]>
>>> + */
>>> +
>>> +#include <linux/linkage.h>
>>> +#include <asm/assembler.h>
>>> +
>>> +.arch armv8-a+sve
>
> Use .arch_extension sve instead.
>
> The compiler frontend already sets .arch to the the appropriate base
> architecture already; that shouldn't be overridden unless there is a
> good reason.
>
>>> +.file "adler32_sve.S"
>>
>> Drop the .file
>>
>> Please indent the rest 1 tab
>>
>>> +.text
>>> +.align 6
>>> +
>>> +//The supported sve vector length range is 128~2048 by this Adler_sequence
>>> +.Adler_sequence:
>>
>> This should be in .rodata. Also, if you use . or L prefixes, use .L
>> because that is what you need to make these local symbols.
>>
>>
>>> + .short 256,255,254,253,252,251,250,249,248,247,246,245,244,243,242,241
>>> + .short 240,239,238,237,236,235,234,233,232,231,230,229,228,227,226,225
>>> + .short 224,223,222,221,220,219,218,217,216,215,214,213,212,211,210,209
>>> + .short 208,207,206,205,204,203,202,201,200,199,198,197,196,195,194,193
>>> + .short 192,191,190,189,188,187,186,185,184,183,182,181,180,179,178,177
>>> + .short 176,175,174,173,172,171,170,169,168,167,166,165,164,163,162,161
>>> + .short 160,159,158,157,156,155,154,153,152,151,150,149,148,147,146,145
>>> + .short 144,143,142,141,140,139,138,137,136,135,134,133,132,131,130,129
>>> + .short 128,127,126,125,124,123,122,121,120,119,118,117,116,115,114,113
>>> + .short 112,111,110,109,108,107,106,105,104,103,102,101,100, 99, 98, 97
>>> + .short 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81
>>> + .short 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65
>>> + .short 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49
>>> + .short 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33
>>> + .short 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17
>>> + .short 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
>>> +
>>> +SYM_FUNC_START(adler32_sve)
>>> + and w10, w0, #0xffff
>>> + lsr w11, w0, #16
>>> +
>>
>> Please put the instruction mnemonics in different columns using tabs
>>
>>> + // Get the length of the sve vector to x6.
>>> + mov x6, #0
>>> + mov x9, #256
>>> + addvl x6, x6, #1
>>> + adr x12, .Adler_sequence
>>> + ptrue p1.h
>>> +
>>> + // Get the starting position of the required sequence.
>>> + sub x9, x9, x6
>>> + ld1h z24.h, p1/z, [x12, x9, lsl #1] // taps1 to z24.h
>>> + inch x9
>>> + ld1h z25.h, p1/z, [x12, x9, lsl #1] // taps2 to z25.h
>
> This seems cumbersome, and will explode if SVE is ever extended to
> support vectors larger than 256 bytes (though that's probably not very
> likely any time soon).
>
> Can you do something like the following (untested)?
>
> ptrue p0.h
> index z0.h, #0, #1
> mov z1.d, z0.d
> dech z0.h
> dech z1.h, all, mul #2
> negh z0.h, p0/m, z0.h
> negh z1.h, p0/m, z1.h
>
>
>>> + mov x9, #0
>>> + // A little of byte, jumper to normal proc
>>
>> What does this comment mean?
>>
>>> + mov x14, #3
>>> + mul x15, x14, x6
>>> + cmp x2, x15
>>> + b.le Lnormal_proc
>>> +
>>> + ptrue p0.b
>>> +.align 6
>>
>> Indent.
>>
>>> +LBig_loop:
>>
>> Use .L prefix
>>
>>> + // x is SVE vector length (byte).
>>> + // Bn = Bn-1 + An-1 * x + x * D1 + (x-1) * D2 + ... + 1 * Dx
>>> + // An = An-1 + D1 + D2 + D3 + ... + Dx
>>> +
>>> + .macro ADLER_BLOCK_X
>>
>> Please use lower case for asm macros, to distinguish them from CPP macros
>> Also, indent the name, and move the macro out of the function for legibility.
>>
>>> + ld1b z0.b, p0/z, [x1, x9]
>>> + incb x9
>>> + uaddv d20, p0, z0.b // D1 + D2 + ... + Dx
>>> + mov x12, v20.2d[0]
>>> + madd x11, x10, x6, x11 // Bn = An-1 * x + Bn-1
>>> +
>>> + uunpklo z26.h, z0.b
>>> + uunpkhi z27.h, z0.b
>
> Instead of loading and then unpacking elements, it's best to do it in
> one go. If you increment the base address as you go, this becomes
> something like (untested):
>
> ld1b z26.h, p0/z, [x1]
> ld1b z27.h, p0/z, [x1, #1, mul vl]
> inch x1, all, mul #2
>
> (you could keep a pre-incremented version of x1 or x9 in some other
> register to eliminate one of these inch instructions).
>
>>> + mul z26.h, p1/m, z26.h, z24.h // x * D1 + (x-1) * D2 + ... + (x/2 + 1) * D(x/2)
>>> + mul z27.h, p1/m, z27.h, z25.h // (x/2) * D(x/2 + 1) + (x/2 - 1) * D(x/2 + 2) + ... + 1 * Dx
>>> +
>>> + uaddv d21, p1, z26.h
>>> + uaddv d22, p1, z27.h
>>> + mov x13, v21.2d[0]
>>> + mov x14, v22.2d[0]
>>> +
>>> + add x11, x13, x11
>>> + add x11, x14, x11 // Bn += x * D1 + (x-1) * D2 + ... + 1 * Dx
>>> + add x10, x12, x10 // An += D1 + D2 + ... + Dx
>
> If you want best performance, you should do accumulations outside the
> core loop. Vertical reduction instructions such as UADDV need to
> collect the results of multiple element-by-element calculations which
> can otherwise proceed independently, so doing this too often will heavily
> constrain the ways in which the hardware can schedule the calculations.
>
> Instead, you can accumulate column-wise partial sums with vector MAD
> instructions (maybe with MOVPRFX, since MAD overwrites a source
> operand).
>
> To avoid frequent overflows, it may make sense to operate on quite wide
> elements within the loop, but you will still need to limit the number of
> loop iterations to ensure that an overflow cannot happen.

I will carefully read and test your code suggestions next, thank you.
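
If I understand the suggestion correctly, it can be modelled in plain C as
below (untested; one array element stands in for one vector lane, VL and the
block limit are only illustrative, and the decreasing taps are folded into the
single reduction at the end rather than multiplied inside the loop):

#include <stdint.h>

#define VL 32u	/* stand-in for the SVE vector length in bytes */

static uint32_t adler32_block_model(uint32_t adler, const unsigned char *p,
				    uint32_t blocks)
{
	uint32_t a = adler & 0xffff;
	uint32_t b = adler >> 16;
	uint32_t s1[VL] = { 0 };	/* per-lane byte sums            */
	uint32_t s2[VL] = { 0 };	/* per-lane running sums of s1   */
	uint64_t t1 = 0, t2 = 0;
	uint32_t i, j;

	/* The caller must cap 'blocks' (a few thousand is safe here) so the
	 * 32-bit lanes cannot overflow before the reduction below. */
	for (i = 0; i < blocks; i++, p += VL) {
		for (j = 0; j < VL; j++) {
			s1[j] += p[j];		/* one vector add per block */
			s2[j] += s1[j];		/* one vector add per block */
		}
	}

	for (j = 0; j < VL; j++) {		/* single horizontal reduction */
		t1 += s1[j];
		t2 += (uint64_t)VL * s2[j] - (uint64_t)j * s1[j];
	}

	b = (b + (uint64_t)blocks * VL * a + t2) % 65521;
	a = (a + t1) % 65521;
	return (b << 16) | a;
}

This reformulation keeps only plain adds in the loop body; the weighted part
of B reappears as the VL * s2[j] - j * s1[j] terms in the final reduction.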

>
>>> + .endm
>>> + ADLER_BLOCK_X
>>> + ADLER_BLOCK_X
>>> + ADLER_BLOCK_X
>>> + // calc = reg0 % 65521
>>> + .macro mod65521, reg0, reg1, reg2
>>> + mov w\reg1, #0x8071
>>> + mov w\reg2, #0xfff1
>>> + movk w\reg1, #0x8007, lsl #16
>>> + umull x\reg1, w\reg0, w\reg1
>>> + lsr x\reg1, x\reg1, #47
>>> + msub w\reg0, w\reg1, w\reg2, w\reg0
>>> + .endm
>>> +
>>
>> Same as above
>>
>>> + mod65521 10, 14, 12
>>> + mod65521 11, 14, 12
>>> +
>>> + sub x2, x2, x15
>>> + cmp x2, x15
>>> + b.ge LBig_loop
>>> +
>>> +.align 6
>>
>> Indent
>>
>>> +Lnormal_proc:
>>
>> .L
>>
>>
>>> + cmp x2, #0
>>> + b.eq Lret
>>> +
>>> + ldrb w12, [x1, x9]
>>> + add x9, x9, #1
>>> + add x10, x12, x10
>>> + add x11, x10, x11
>>> + sub x2, x2, #1
>
> I haven't tried to understand this algorithm in detail, but there should
> probably be no need for this special case to handle the trailing bytes.
>
> You should search for examples of speculative vectorization using
> WHILELO etc., to get a better feel for how to do this.

Yes, I have considered this problem, but I have not found a good way to achieve it,
because the decreasing sequence used in the calculation is already fixed
before the end of the loop is reached.

For example, if buf is divided into 32-byte blocks, the sequence should be
32,31,...,2,1; but if there are only 10 bytes left at the end of the loop,
then the sequence should be 10,9,8,...,2,1.

If I judge whether the end of the loop has been reached in the body of the loop,
and reset the starting point of the sequence according to the length of the tail,
it does not seem very good.

>
>
> [...]
>
> Cheers
> ---Dave
> .
>

--
Best regards,
Li Qiang

2020-11-04 14:51:19

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Wed, Nov 04, 2020 at 05:19:18PM +0800, Li Qiang wrote:
> Hi Dave,
>
> Thank you very much for your reply and suggestions. :)
>
> On 2020/11/4 2:00, Dave Martin wrote:
> > On Tue, Nov 03, 2020 at 03:34:27PM +0100, Ard Biesheuvel wrote:
> >> (+ Dave)
> >>
> >> Hello liqiang,
> >>
> >> First of all, I don't think it is safe at the moment to use SVE in the
> >> kernel, as we don't preserve all state IIRC. My memory is a bit hazy,
> >
> > I'm not convinced that it's safe right now. SVE in the kernel is
> > unsupported, partly due to cost and partly due to the lack of a
> > compelling use case.
> >
> > I think it would be preferable to see this algo accelerated for NEON
> > first, since all AArch64 hardware can benefit from that.
>
> Yes, I am trying it seriously. :)

OK, that's good.

>
> >
> > It's good to see someone experimenting with SVE though -- so feel
> > free to experiment with this in userspace and see what sort of benchmark
> > you can achieve. But there's no guarantee this can be merged in the
> > kernel any time soon. If there's a big enough advantage over NEON, then
> > it becomes more interesting.
>
> Oh yes, I think so too! :)
>
> >
> >
> >> though, so perhaps Dave can elaborate?
> >
> > Historically there was another reason: for every nested
> > kernel_neon_begin(), we would have potentially needed to save the entire
> > SVE state, which is a problem -- we'd likely run out of kernel stack,
> > plus it could get costly, especially for latency. Since we refactored
> > the kernel-mode NEON support to remove nesting support, this is not so
> > much of a problem. kernel_neon_begin() may incur a save of the full SVE
> > state anyway, so in some ways it would be a good thing if we could
> > actually make use of all those registers.
> >
> > SVE hardware remains rare, so as a general policy I don't think we
> > should accept SVE implementations of any algorithm that does not
> > already have a NEON implementation -- unless the contributor can
> > explain why nobody with non-SVE hardware is going to care about the
> > performance of that algo.
> >
> >
> >>
> >> I'll give my feedback on the code itself below, but please don't
> >> consider this an ack on the intent of using SVE in the kernel.
> >>
> >> Could you explain why SVE is so much better than NEON here?
> >
> > From my end, I think that SVE probably doesn't offer special advantages
> > here beyond less messy code and more number-crunching per loop
> > iteration.
> >
> > That doesn't mean that SVE can't beat NEON, but NEON should still offer
> > a big advantage over scalar code.
> >
> > Since the calculations are quite simple, even the NEON version may
> > tend to saturate load/store bandwidth on some hardware, in which case
> > the additional performance from SVE may be limited. Experimentation
> > would be needed in order to know for sure.
>
> Yes, I found this problem when I was testing the SVE version and NEON
> version of other algorithms (higher data throughput). The load/store
> unit limits the performance of SVE, resulting in just a slight improvement
> in the performance of SVE compared to NEON.

Right. Which suggests that NEON is the place to start.

> >
> > I've made some random comments on the code below -- but my knowledge of
> > the SVE instruction set is a little rusty, and I haven't tried to
> > understand precisely what this algo is trying to do!
> >
> >>
> >> On Tue, 3 Nov 2020 at 13:16, l00374334 <[email protected]> wrote:
> >>>
> >>> From: liqiang <[email protected]>
> >>>
> >>> In the libz library, the checksum algorithm adler32 usually occupies
> >>> a relatively high hot spot, and the SVE instruction set can easily
> >>> accelerate it, so that the performance of libz library will be
> >>> significantly improved.
> >>>
> >>> We can divide buf into blocks according to the bit width of SVE,
> >>> and then use vector registers to perform operations in units of blocks
> >>> to achieve the purpose of acceleration.
> >>>
> >>> On machines that support ARM64 sve instructions, this algorithm is
> >>> about 3~4 times faster than the algorithm implemented in C language
> >>> in libz. The wider the SVE instruction, the better the acceleration effect.
> >>>
> >>> Measured on a Taishan 1951 machine that supports 256bit width SVE,
> >>> below are the results of my measured random data of 1M and 10M:
> >>>
> >>> [root@xxx adler32]# ./benchmark 1000000
> >>> Libz alg: Time used: 608 us, 1644.7 Mb/s.
> >>> SVE alg: Time used: 166 us, 6024.1 Mb/s.
> >>>
> >>> [root@xxx adler32]# ./benchmark 10000000
> >>> Libz alg: Time used: 6484 us, 1542.3 Mb/s.
> >>> SVE alg: Time used: 2034 us, 4916.4 Mb/s.
> >>>
> >>> The blocks can be of any size, so the algorithm can automatically adapt
> >>> to SVE hardware with different bit widths without modifying the code.
> >>>
> >>
> >> Please drop this indentation from the commit log.
> >>
> >>>
> >>> Signed-off-by: liqiang <[email protected]>
> >>> ---
> >>> arch/arm64/crypto/Kconfig | 5 ++
> >>> arch/arm64/crypto/Makefile | 3 +
> >>> arch/arm64/crypto/adler32-sve-glue.c | 93 ++++++++++++++++++++
> >>> arch/arm64/crypto/adler32-sve.S | 127 +++++++++++++++++++++++++++
> >>> crypto/testmgr.c | 8 +-
> >>> crypto/testmgr.h | 13 +++
> >>
> >> Please split into two patches. Also, who is going to use this "adler32" shash?
> >>
> >>> 6 files changed, 248 insertions(+), 1 deletion(-)
> >>> create mode 100644 arch/arm64/crypto/adler32-sve-glue.c
> >>> create mode 100644 arch/arm64/crypto/adler32-sve.S
> >>>
> >>> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> >>> index b8eb045..cfe58b9 100644
> >>> --- a/arch/arm64/crypto/Kconfig
> >>> +++ b/arch/arm64/crypto/Kconfig
> >>> @@ -126,4 +126,9 @@ config CRYPTO_AES_ARM64_BS
> >>> select CRYPTO_LIB_AES
> >>> select CRYPTO_SIMD
> >>>
> >>> +config SVE_ADLER32
> >>> + tristate "Accelerate Adler32 using arm64 SVE instructions."
> >>> + depends on ARM64_SVE
> >>> + select CRYPTO_HASH
> >>> +
> >>> endif
> >>> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> >>> index d0901e6..45fe649 100644
> >>> --- a/arch/arm64/crypto/Makefile
> >>> +++ b/arch/arm64/crypto/Makefile
> >>> @@ -63,6 +63,9 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
> >>> obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
> >>> aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
> >>>
> >>> +obj-$(CONFIG_SVE_ADLER32) += sve-adler32.o
> >>> +sve-adler32-y := adler32-sve.o adler32-sve-glue.o
> >>> +
> >>> CFLAGS_aes-glue-ce.o := -DUSE_V8_CRYPTO_EXTENSIONS
> >>>
> >>> $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
> >>> diff --git a/arch/arm64/crypto/adler32-sve-glue.c b/arch/arm64/crypto/adler32-sve-glue.c
> >>> new file mode 100644
> >>> index 0000000..cb74514
> >>> --- /dev/null
> >>> +++ b/arch/arm64/crypto/adler32-sve-glue.c
> >>> @@ -0,0 +1,93 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * Accelerate Adler32 using arm64 SVE instructions.
> >>> + * Automatically support all bit width of SVE
> >>> + * vector(128~2048).
> >>> + *
> >>> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
> >>> + *
> >>> + * Author: Li Qiang <[email protected]>
> >>> + */
> >>> +#include <linux/cpufeature.h>
> >>> +#include <linux/kernel.h>
> >>> +#include <linux/module.h>
> >>> +#include <linux/zutil.h>
> >>> +
> >>> +#include <crypto/internal/hash.h>
> >>> +#include <crypto/internal/simd.h>
> >>> +
> >>> +#include <asm/neon.h>
> >>> +#include <asm/simd.h>
> >>> +
> >>> +/* Scalable vector extension min size 128bit */
> >>> +#define SVE_ADLER32_MIN_SIZE 16U
> >>> +#define SVE_ADLER32_DIGEST_SIZE 4
> >>> +#define SVE_ADLER32_BLOCKSIZE 1
> >>> +
> >>> +asmlinkage u32 adler32_sve(u32 adler, const u8 *buf, u32 len);
> >>> +
> >>> +static int adler32_sve_init(struct shash_desc *desc)
> >>> +{
> >>> + u32 *adler32 = shash_desc_ctx(desc);
> >>> +
> >>> + *adler32 = 1;
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int adler32_sve_update(struct shash_desc *desc, const u8 *data,
> >>> + unsigned int length)
> >>
> >> Please indent function parameters
> >>
> >>> +{
> >>> + u32 *adler32 = shash_desc_ctx(desc);
> >>> +
> >>> + if (length >= SVE_ADLER32_MIN_SIZE && crypto_simd_usable()) {
> >>> + kernel_neon_begin();
> >>> + *adler32 = adler32_sve(*adler32, data, length);
> >>> + kernel_neon_end();
> >>> + } else {
> >>> + *adler32 = zlib_adler32(*adler32, data, length);
> >>> + }
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int adler32_sve_final(struct shash_desc *desc, u8 *out)
> >>> +{
> >>> + u32 *adler32 = shash_desc_ctx(desc);
> >>> +
> >>> + *(u32 *)out = *adler32;
> >>
> >> Please use put_unaligned here
> >>
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static struct shash_alg adler32_sve_alg[] = {{
> >>> + .digestsize = SVE_ADLER32_DIGEST_SIZE,
> >>> + .descsize = SVE_ADLER32_DIGEST_SIZE,
> >>> + .init = adler32_sve_init,
> >>> + .update = adler32_sve_update,
> >>> + .final = adler32_sve_final,
> >>> +
> >>> + .base.cra_name = "adler32",
> >>> + .base.cra_driver_name = "adler32-arm64-sve",
> >>> + .base.cra_priority = 200,
> >>> + .base.cra_blocksize = SVE_ADLER32_BLOCKSIZE,
> >>> + .base.cra_module = THIS_MODULE,
> >>
> >> Please make sure the indentation is correct here.
> >>
> >>> +}};
> >>> +
> >>> +static int __init adler32_sve_mod_init(void)
> >>> +{
> >>> + if (!cpu_have_named_feature(SVE))
> >>> + return 0;
> >>> +
> >>> + return crypto_register_shash(adler32_sve_alg);
> >>> +}
> >>> +
> >>> +static void __exit adler32_sve_mod_exit(void)
> >>> +{
> >>> + crypto_unregister_shash(adler32_sve_alg);
> >>> +}
> >>> +
> >>> +module_init(adler32_sve_mod_init);
> >>> +module_exit(adler32_sve_mod_exit);
> >>> +
> >>> +MODULE_AUTHOR("Li Qiang <[email protected]>");
> >>> +MODULE_LICENSE("GPL v2");
> >>> +MODULE_ALIAS_CRYPTO("adler32");
> >>> +MODULE_ALIAS_CRYPTO("adler32-arm64-sve");
> >>> diff --git a/arch/arm64/crypto/adler32-sve.S b/arch/arm64/crypto/adler32-sve.S
> >>> new file mode 100644
> >>> index 0000000..34ee4bb
> >>> --- /dev/null
> >>> +++ b/arch/arm64/crypto/adler32-sve.S
> >>> @@ -0,0 +1,127 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0-only */
> >>> +/*
> >>> + * Accelerate Adler32 using arm64 SVE instructions. Automatically support all bit
> >>> + * width of SVE vector(128~2048).
> >>> + *
> >>> + * Copyright (C) 2020 Huawei Technologies Co., Ltd.
> >>> + *
> >>> + * Author: Li Qiang <[email protected]>
> >>> + */
> >>> +
> >>> +#include <linux/linkage.h>
> >>> +#include <asm/assembler.h>
> >>> +
> >>> +.arch armv8-a+sve
> >
> > Use .arch_extension sve instead.
> >
> > The compiler frontend already sets .arch to the the appropriate base
> > architecture already; that shouldn't be overridden unless there is a
> > good reason.
> >
> >>> +.file "adler32_sve.S"
> >>
> >> Drop the .file
> >>
> >> Please indent the rest 1 tab
> >>
> >>> +.text
> >>> +.align 6
> >>> +
> >>> +//The supported sve vector length range is 128~2048 by this Adler_sequence
> >>> +.Adler_sequence:
> >>
> >> This should be in .rodata. Also, if you use . or L prefixes, use .L
> >> because that is what you need to make these local symbols.
> >>
> >>
> >>> + .short 256,255,254,253,252,251,250,249,248,247,246,245,244,243,242,241
> >>> + .short 240,239,238,237,236,235,234,233,232,231,230,229,228,227,226,225
> >>> + .short 224,223,222,221,220,219,218,217,216,215,214,213,212,211,210,209
> >>> + .short 208,207,206,205,204,203,202,201,200,199,198,197,196,195,194,193
> >>> + .short 192,191,190,189,188,187,186,185,184,183,182,181,180,179,178,177
> >>> + .short 176,175,174,173,172,171,170,169,168,167,166,165,164,163,162,161
> >>> + .short 160,159,158,157,156,155,154,153,152,151,150,149,148,147,146,145
> >>> + .short 144,143,142,141,140,139,138,137,136,135,134,133,132,131,130,129
> >>> + .short 128,127,126,125,124,123,122,121,120,119,118,117,116,115,114,113
> >>> + .short 112,111,110,109,108,107,106,105,104,103,102,101,100, 99, 98, 97
> >>> + .short 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81
> >>> + .short 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65
> >>> + .short 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49
> >>> + .short 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33
> >>> + .short 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17
> >>> + .short 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
> >>> +
> >>> +SYM_FUNC_START(adler32_sve)
> >>> + and w10, w0, #0xffff
> >>> + lsr w11, w0, #16
> >>> +
> >>
> >> Please put the instruction mnemonics in different columns using tabs
> >>
> >>> + // Get the length of the sve vector to x6.
> >>> + mov x6, #0
> >>> + mov x9, #256
> >>> + addvl x6, x6, #1
> >>> + adr x12, .Adler_sequence
> >>> + ptrue p1.h
> >>> +
> >>> + // Get the starting position of the required sequence.
> >>> + sub x9, x9, x6
> >>> + ld1h z24.h, p1/z, [x12, x9, lsl #1] // taps1 to z24.h
> >>> + inch x9
> >>> + ld1h z25.h, p1/z, [x12, x9, lsl #1] // taps2 to z25.h
> >
> > This seems cumbersome, and will explode if SVE is ever extended to
> > support vectors larger than 256 bytes (though that's probably not very
> > likely any time soon).
> >
> > Can you do something like the following (untested)?
> >
> > ptrue p0.h
> > index z0.h, #0, #1
> > mov z1.d, z0.d
> > dech z0.h
> > dech z1.h, all, mul #2
> > negh z0.h, p0/m, z0.h
> > negh z1.h, p0/m, z1.h
> >
> >
> >>> + mov x9, #0
> >>> + // A little of byte, jumper to normal proc
> >>
> >> What does this comment mean?
> >>
> >>> + mov x14, #3
> >>> + mul x15, x14, x6
> >>> + cmp x2, x15
> >>> + b.le Lnormal_proc
> >>> +
> >>> + ptrue p0.b
> >>> +.align 6
> >>
> >> Indent.
> >>
> >>> +LBig_loop:
> >>
> >> Use .L prefix
> >>
> >>> + // x is SVE vector length (byte).
> >>> + // Bn = Bn-1 + An-1 * x + x * D1 + (x-1) * D2 + ... + 1 * Dx
> >>> + // An = An-1 + D1 + D2 + D3 + ... + Dx
> >>> +
> >>> + .macro ADLER_BLOCK_X
> >>
> >> Please use lower case for asm macros, to distinguish them from CPP macros
> >> Also, indent the name, and move the macro out of the function for legibility.
> >>
> >>> + ld1b z0.b, p0/z, [x1, x9]
> >>> + incb x9
> >>> + uaddv d20, p0, z0.b // D1 + D2 + ... + Dx
> >>> + mov x12, v20.2d[0]
> >>> + madd x11, x10, x6, x11 // Bn = An-1 * x + Bn-1
> >>> +
> >>> + uunpklo z26.h, z0.b
> >>> + uunpkhi z27.h, z0.b
> >
> > Instead of loading and then unpacking elements, it's best to do it in
> > one go. If you increment the base address as you go, this becomes
> > something like (untested):
> >
> > ld1b z26.h, p0/z, [x1]
> > ld1b z27.h, p0/z, [x1, #1, mul vl]
> > inch x1, all, mul #2
> >
> > (you could keep a pre-incremented version of x1 or x9 in some other
> > register to eliminate one of these inch instructions).
> >
> >>> + mul z26.h, p1/m, z26.h, z24.h // x * D1 + (x-1) * D2 + ... + (x/2 + 1) * D(x/2)
> >>> + mul z27.h, p1/m, z27.h, z25.h // (x/2) * D(x/2 + 1) + (x/2 - 1) * D(x/2 + 2) + ... + 1 * Dx
> >>> +
> >>> + uaddv d21, p1, z26.h
> >>> + uaddv d22, p1, z27.h
> >>> + mov x13, v21.2d[0]
> >>> + mov x14, v22.2d[0]
> >>> +
> >>> + add x11, x13, x11
> >>> + add x11, x14, x11 // Bn += x * D1 + (x-1) * D2 + ... + 1 * Dx
> >>> + add x10, x12, x10 // An += D1 + D2 + ... + Dx
> >
> > If you want best performance, you should do accumulations outside the
> > core loop. Vertical reduction instructions such as UADDV need to
> > collect the results of multiple element-by-element calculations which
> > can otherwise proceed independently, so doing this too often will heavily
> > constrain the ways in which the hardware can schedule the calculations.
> >
> > Instead, you can accumulate column-wise partial sums with vector MAD
> > instructions (maybe with MOVPRFX, since MAD overwrites a source
> > operand).
> >
> > To avoid frequent overflows, it may make sense to operate on quite wide
> > elements within the loop, but you will still need to limit the number of
> > loop iterations to ensure that an overflow cannot happen.
>
> I will carefully read and test your code suggestions next, thank you.
>
> >
> >>> + .endm
> >>> + ADLER_BLOCK_X
> >>> + ADLER_BLOCK_X
> >>> + ADLER_BLOCK_X
> >>> + // calc = reg0 % 65521
> >>> + .macro mod65521, reg0, reg1, reg2
> >>> + mov w\reg1, #0x8071
> >>> + mov w\reg2, #0xfff1
> >>> + movk w\reg1, #0x8007, lsl #16
> >>> + umull x\reg1, w\reg0, w\reg1
> >>> + lsr x\reg1, x\reg1, #47
> >>> + msub w\reg0, w\reg1, w\reg2, w\reg0
> >>> + .endm
> >>> +
> >>
> >> Same as above
> >>
> >>> + mod65521 10, 14, 12
> >>> + mod65521 11, 14, 12
> >>> +
> >>> + sub x2, x2, x15
> >>> + cmp x2, x15
> >>> + b.ge LBig_loop
> >>> +
> >>> +.align 6
> >>
> >> Indent
> >>
> >>> +Lnormal_proc:
> >>
> >> .L
> >>
> >>
> >>> + cmp x2, #0
> >>> + b.eq Lret
> >>> +
> >>> + ldrb w12, [x1, x9]
> >>> + add x9, x9, #1
> >>> + add x10, x12, x10
> >>> + add x11, x10, x11
> >>> + sub x2, x2, #1
> >
> > I haven't tried to understand this algorithm in detail, but there should
> > probably be no need for this special case to handle the trailing bytes.
> >
> > You should search for examples of speculative vectorization using
> > WHILELO etc., to get a better feel for how to do this.
>
> Yes, I have considered this problem, but I have not found a good way to achieve it,
> because the decreasing sequence used in the calculation is already fixed
> before the end of the loop is reached.
>
> For example, if buf is divided into 32-byte blocks, the sequence should be
> 32,31,...,2,1; but if there are only 10 bytes left at the end of the loop,
> then the sequence should be 10,9,8,...,2,1.
>
> If I judge whether the end of the loop has been reached in the body of the loop,
> and reset the starting point of the sequence according to the length of the tail,
> it does not seem very good.

That would indeed be inefficient, since the adjustment is only needed on
the last iteration.

Can you instead do the adjustment after the loop ends?

For example, if

y = x[n] * 32 + x[n+1] * 31 + x[n+2] * 30 ...

then

y - (x[n] * 22 + x[n+1] * 22 + x[n+2] * 22 ...)

equals

x[n] * 10 + x[n+1] * 9 + x[n+2] * 8 + ...

(This isn't exactly what the algorithm demands, but hopefully you see the
general idea.)
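
In C terms the fixup is roughly the following (untested; vl is the full block
width in bytes, rem < vl the number of bytes actually left, and a/b are the
running sums before the final mod 65521):

#include <stdint.h>

/* Sketch: run the final, partial block through the full-width weights,
 * then correct B once afterwards. */
static void adler_tail_fixup(uint32_t *a, uint32_t *b,
			     const uint8_t *p, uint32_t rem, uint32_t vl)
{
	uint32_t a_prev = *a;
	uint32_t sum = 0, wsum = 0;
	uint32_t k;

	for (k = 0; k < rem; k++) {	/* what the vector block computes */
		sum  += p[k];
		wsum += (vl - k) * p[k];	/* full-width taps vl, vl-1, ... */
	}

	*a += sum;
	*b += vl * a_prev + wsum;	/* as if a whole block were present */

	/*
	 * Each byte was weighted (vl - k) instead of (rem - k), and B picked
	 * up vl * a_prev instead of rem * a_prev; the combined excess is
	 * exactly (vl - rem) * (a_prev + sum) == (vl - rem) * *a.
	 */
	*b -= (vl - rem) * *a;
}

That identity is what would let the full-width vector path be reused for the
trailing bytes, with only one scalar correction at the very end.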

[...]

Cheers
---Dave

2020-11-04 19:03:21

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Tue, Nov 03, 2020 at 08:15:06PM +0800, l00374334 wrote:
> From: liqiang <[email protected]>
>
> In the libz library, the checksum algorithm adler32 usually occupies
> a relatively high hot spot, and the SVE instruction set can easily
> accelerate it, so that the performance of libz library will be
> significantly improved.
>
> We can divide buf into blocks according to the bit width of SVE,
> and then use vector registers to perform operations in units of blocks
> to achieve the purpose of acceleration.
>
> On machines that support ARM64 sve instructions, this algorithm is
> about 3~4 times faster than the algorithm implemented in C language
> in libz. The wider the SVE instruction, the better the acceleration effect.
>
> Measured on a Taishan 1951 machine that supports 256bit width SVE,
> below are the results of my measured random data of 1M and 10M:
>
> [root@xxx adler32]# ./benchmark 1000000
> Libz alg: Time used: 608 us, 1644.7 Mb/s.
> SVE alg: Time used: 166 us, 6024.1 Mb/s.
>
> [root@xxx adler32]# ./benchmark 10000000
> Libz alg: Time used: 6484 us, 1542.3 Mb/s.
> SVE alg: Time used: 2034 us, 4916.4 Mb/s.
>
> The blocks can be of any size, so the algorithm can automatically adapt
> to SVE hardware with different bit widths without modifying the code.
>
>
> Signed-off-by: liqiang <[email protected]>

Note that this patch does nothing to actually wire up the kernel's copy of libz
(lib/zlib_{deflate,inflate}/) to use this implementation of Adler32. To do so,
libz would either need to be changed to use the shash API, or you'd need to
implement an adler32() function in lib/crypto/ that automatically uses an
accelerated implementation if available, and make libz call it.

Also, in either case a C implementation would be required too. There can't be
just an architecture-specific implementation.
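
For the lib/crypto/ option, the overall shape would be something like the
sketch below -- purely illustrative, since none of these symbols exist today
and the __weak default is just one possible arch-override mechanism:

/* Hypothetical lib/crypto/adler32.c */
#include <linux/compiler.h>
#include <linux/export.h>
#include <linux/types.h>
#include <linux/zutil.h>

u32 __weak adler32_arch(u32 adler, const u8 *data, u32 len)
{
	/* Generic C fallback: reuse the existing zlib implementation. */
	return zlib_adler32(adler, data, len);
}

u32 adler32(u32 adler, const u8 *data, u32 len)
{
	return adler32_arch(adler, data, len);
}
EXPORT_SYMBOL(adler32);

lib/zlib_deflate/ would then call adler32() instead of its private copy, and
arm64 could provide an optimized adler32_arch().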

Also as others have pointed out, there's probably not much point in having a SVE
implementation of Adler32 when there isn't even a NEON implementation yet. It's
not too hard to implement Adler32 using NEON, and there are already several
permissively-licensed NEON implementations out there that could be used as a
reference, e.g. my implementation using NEON intrinsics here:
https://github.com/ebiggers/libdeflate/blob/v1.6/lib/arm/adler32_impl.h

- Eric

2020-11-04 19:03:28

by Mark Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Tue, Nov 03, 2020 at 06:00:32PM +0000, Dave Martin wrote:
> On Tue, Nov 03, 2020 at 03:34:27PM +0100, Ard Biesheuvel wrote:

> > First of all, I don't think it is safe at the moment to use SVE in the
> > kernel, as we don't preserve all state IIRC. My memory is a bit hazy,

> I'm not convinced that it's safe right now. SVE in the kernel is
> unsupported, partly due to cost and partly due to the lack of a
> compelling use case.

I think at a minimum we'd want to handle the vector length explicitly
for kernel mode SVE, vector length independent code will work most of
the time but at the very least it feels like a landmine waiting to cause
trouble. If nothing else there's probably going to be cases where it
makes a difference for performance. Other than that I'm not currently
seeing any issues since we're handling SVE in the same paths we handle
the rest of the FPSIMD stuff.

> I think it would be preferable to see this algo accelerated for NEON
> first, since all AArch64 hardware can benefit from that.

...

> much of a problem. kernel_neon_begin() may incur a save of the full SVE
> state anyway, so in some ways it would be a good thing if we could
> actually make use of all those registers.

> SVE hardware remains rare, so as a general policy I don't think we
> should accept SVE implementations of any algorithm that does not
> already have a NEON implementation -- unless the contributor can
> explain why nobody with non-SVE hardware is going to care about the
> performance of that algo.

I tend to agree here, my concerns are around the cost of maintaining a
SVE implementation relative to the number of users who'd benefit from it
rather than around the basic idea of using SVE at all. If we were
seeing substantial performance benefits over an implementation using
NEON, or had some other strong push to use SVE like a really solid
understanding of why SVE is a good fit for the algorithm but NEON isn't,
then it'd be worth finishing up the infrastructure. The infrastructure
itself doesn't seem fundamentally problematic.



2020-11-04 19:08:19

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Wed, Nov 04, 2020 at 05:50:33PM +0000, Mark Brown wrote:
> On Tue, Nov 03, 2020 at 06:00:32PM +0000, Dave Martin wrote:
> > On Tue, Nov 03, 2020 at 03:34:27PM +0100, Ard Biesheuvel wrote:
>
> > > First of all, I don't think it is safe at the moment to use SVE in the
> > > kernel, as we don't preserve all state IIRC. My memory is a bit hazy,
>
> > I'm not convinced that it's safe right now. SVE in the kernel is
> > unsupported, partly due to cost and partly due to the lack of a
> > compelling use case.
>
> I think at a minimum we'd want to handle the vector length explicitly
> for kernel mode SVE, vector length independent code will work most of
> the time but at the very least it feels like a landmine waiting to cause
> trouble. If nothing else there's probably going to be cases where it
> makes a difference for performance. Other than that I'm not currently
> seeing any issues since we're handling SVE in the same paths we handle
> the rest of the FPSIMD stuff.

Having a random vector length could be good for testing ;)

I was tempted to add that as a deliberate feature, but that sort of
thing doesn't really belong in the kernel...


Anyway:

The main reasons for constraining the vector length are a) to hide
mismatches between CPUs in heterogeneous systems, b) to ensure that
validated software doesn't run with a vector length it wasn't validated
for, and c) testing.

For kernel code, it's reasonable to say that all code should be vector-
length agnostic unless there's a really good reason not to be. So we
may not care too much about (b).

In that case, just setting ZCR_EL1.LEN to max in kernel_sve_begin() (or
whatever) probably makes sense.
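
Very roughly (illustrative only -- no kernel_sve_begin() exists today, and the
accessor and mask names here are from memory and would need checking):

#include <asm/barrier.h>
#include <asm/neon.h>
#include <asm/sysreg.h>

/* Illustrative only: no such helper exists today. */
static void kernel_sve_begin(void)
{
	kernel_neon_begin();	/* saves the task's FPSIMD/SVE state */

	/* Ask for the largest vector length this CPU implements. */
	write_sysreg_s(read_sysreg_s(SYS_ZCR_EL1) | ZCR_ELx_LEN_MASK,
		       SYS_ZCR_EL1);
	isb();
}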

For (c), it might be useful to have a command-line parameter or debugfs
widget to constrain the vector length for kernel code; perhaps globally
or perhaps per driver or algo.


Otherwise, I agree that using SVE in the kernel _should_ probably work
safely, using the same basic mechanism as kernel_mode_neon(). Also,
it shouldn't have higher overheads than kernel-mode-NEON now.


>
> > I think it would be preferable to see this algo accelerated for NEON
> > first, since all AArch64 hardware can benefit from that.
>
> ...
>
> > much of a problem. kernel_neon_begin() may incur a save of the full SVE
> > state anyway, so in some ways it would be a good thing if we could
> > actually make use of all those registers.
>
> > SVE hardware remains rare, so as a general policy I don't think we
> > should accept SVE implementations of any algorithm that does not
> > already have a NEON implementation -- unless the contributor can
> > explain why nobody with non-SVE hardware is going to care about the
> > performance of that algo.
>
> I tend to agree here, my concerns are around the cost of maintaining a
> SVE implementation relative to the number of users who'd benefit from it
> rather than around the basic idea of using SVE at all. If we were
> seeing substantial performance benefits over an implementation using
> NEON, or had some other strong push to use SVE like a really solid
> understanding of why SVE is a good fit for the algorithm but NEON isn't,
> then it'd be worth finishing up the infrastructure. The infrastructure
> itself doesn't seem fundamentally problematic.

Agreed

Nonetheless, working up a candidate algorithm to help us see whether
there is a good use case seems like a worthwhile project, so I don't
want to discourage that too much.

Cheers
---Dave

2020-11-04 19:19:21

by Mark Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Wed, Nov 04, 2020 at 06:13:06PM +0000, Dave Martin wrote:
> On Wed, Nov 04, 2020 at 05:50:33PM +0000, Mark Brown wrote:

> > I think at a minimum we'd want to handle the vector length explicitly
> > for kernel mode SVE, vector length independent code will work most of
> > the time but at the very least it feels like a landmine waiting to cause
> > trouble. If nothing else there's probably going to be cases where it
> > makes a difference for performance. Other than that I'm not currently

...

> The main reasons for constraining the vector length are a) to hide
> mismatches between CPUs in heterogeneous systems, b) to ensure that
> validated software doesn't run with a vector length it wasn't validated
> for, and c) testing.

> For kernel code, it's reasonable to say that all code should be vector-
> length agnostic unless there's a really good reason not to be. So we
> may not care too much about (b).

> In that case, just setting ZCR_EL1.LEN to max in kernel_sve_begin() (or
> whatever) probably makes sense.

I agree, that's most likely a good default.

> For (c), it might be useful to have a command-line parameter or debugfs
> widget to constrain the vector length for kernel code; perhaps globally
> or perhaps per driver or algo.

I think a global control would be good for testing, it seems simpler and
easier all round. The per thing tuning seems more useful for cases
where we run into something like a performance reason to use a limited
set of vector lengths but I think we should only add that when we have
at least one user for it, some examples of actual restrictions we want
would probably be helpful for designing the interface.

> Nonetheless, working up a candidate algorithm to help us see whether
> there is a good use case seems like a worthwhile project, so I don't
> want to discourage that too much.

Definitely worth exploring.



2020-11-05 04:52:39

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.



On 2020/11/4 22:49, Dave Martin wrote:
> On Wed, Nov 04, 2020 at 05:19:18PM +0800, Li Qiang wrote:

...

>>>
>>> I haven't tried to understand this algorithm in detail, but there should
>>> probably be no need for this special case to handle the trailing bytes.
>>>
>>> You should search for examples of speculative vectorization using
>>> WHILELO etc., to get a better feel for how to do this.
>>
>> Yes, I have considered this problem, but I have not found a good way to achieve it,
>> because before the end of the loop is reached, the decreasing sequence used for
>> calculation is determined.
>>
>> For example, buf is divided into 32-byte blocks. This sequence should be 32,31,...,2,1,
>> if there are only 10 bytes left at the end of the loop, then this sequence
>> should be 10,9,8,...,2,1.
>>
>> If I judge whether the end of the loop has been reached in the body of the loop,
>> and reset the starting point of the sequence according to the length of the tail,
>> it does not seem very good.
>
> That would indeed be inefficient, since the adjustment is only needed on
> the last iteration.
>
> Can you do instead do the adjustment after the loop ends?
>
> For example, if
>
> y = x[n] * 32 + x[n+1] * 31 + x[n+2] * 30 ...
>
> then
>
> y - (x[n] * 22 + x[n+1] * 22 + x[n+2] * 22 ...)
>
> equals
>
> x[n] * 10 + x[n+1] * 9 + x[n+2] * 8 + ...
>
> (This isn't exactly what the algorithm demands, but hopefully you see the
> general idea.)
>
> [...]
>
> Cheers
> ---Dave
> .
>

This idea seems feasible: the check only needs to be made once, after the loop
ends, and the extra part subtracted, with no need to enter another loop to
process the trailing bytes.

I will try this solution later. Thank you! :)
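
(For illustration, a tiny standalone C check of the post-loop adjustment
described above; not from the thread.  V = 32 and r = 10 mirror the example,
and the data values are arbitrary.)

--8<--

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* A trailing block of r bytes accumulated with the full-block weights
	 * V, V-1, ... can be fixed up afterwards by subtracting
	 * (V - r) * (sum of the bytes), leaving weights r, r-1, ..., 1. */
	enum { V = 32, r = 10 };
	const uint8_t x[r] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
	uint32_t weighted = 0, sum = 0, expected = 0;

	for (unsigned int j = 0; j < r; j++) {
		weighted += (V - j) * x[j];	/* weights 32, 31, ... */
		sum      += x[j];
		expected += (r - j) * x[j];	/* weights 10, 9, ..., 1 */
	}

	printf("%u == %u\n", weighted - (V - r) * sum, expected);
	return 0;
}

-->8--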

--
Best regards,
Li Qiang

2020-11-05 04:56:47

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

Hi Eric,

On 2020/11/5 1:57, Eric Biggers wrote:
> On Tue, Nov 03, 2020 at 08:15:06PM +0800, l00374334 wrote:
>> From: liqiang <[email protected]>
>>
>> In the libz library, the checksum algorithm adler32 usually shows up as
>> a significant hot spot, and the SVE instruction set can easily accelerate
>> it, so the performance of the libz library can be significantly improved.
>>
>> We can divide buf into blocks according to the bit width of SVE,
>> and then use vector registers to operate on the data block by block
>> to achieve the acceleration.
>>
>> On machines that support arm64 SVE instructions, this algorithm is
>> about 3~4 times faster than the C implementation in libz. The wider
>> the SVE registers, the better the acceleration effect.
>>
>> Measured on a Taishan 1951 machine that supports 256-bit wide SVE,
>> below are the results for random data of 1M and 10M:
>>
>> [root@xxx adler32]# ./benchmark 1000000
>> Libz alg: Time used: 608 us, 1644.7 Mb/s.
>> SVE alg: Time used: 166 us, 6024.1 Mb/s.
>>
>> [root@xxx adler32]# ./benchmark 10000000
>> Libz alg: Time used: 6484 us, 1542.3 Mb/s.
>> SVE alg: Time used: 2034 us, 4916.4 Mb/s.
>>
>> The blocks can be of any size, so the algorithm can automatically adapt
>> to SVE hardware with different bit widths without modifying the code.
>>
>>
>> Signed-off-by: liqiang <[email protected]>
>
> Note that this patch does nothing to actually wire up the kernel's copy of libz
> (lib/zlib_{deflate,inflate}/) to use this implementation of Adler32. To do so,
> libz would either need to be changed to use the shash API, or you'd need to
> implement an adler32() function in lib/crypto/ that automatically uses an
> accelerated implementation if available, and make libz call it.
>
> Also, in either case a C implementation would be required too. There can't be
> just an architecture-specific implementation.

Okay, thank you for the problems and suggestions you gave. I will continue to
improve my code.
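
(For reference, a minimal portable C fallback of the kind described above;
the function name and the zlib-style NMAX constant are illustrative.  5552 is
the longest run of bytes for which the 32-bit accumulators cannot overflow
before the deferred modulo.)

--8<--

#include <stdint.h>
#include <stddef.h>

#define MOD_ADLER	65521u
#define NMAX		5552u	/* longest safe run before reducing mod 65521 */

uint32_t adler32_generic(uint32_t adler, const unsigned char *data, size_t len)
{
	uint32_t a = adler & 0xffff;
	uint32_t b = (adler >> 16) & 0xffff;

	while (len) {
		size_t n = len < NMAX ? len : NMAX;

		len -= n;
		while (n--) {
			a += *data++;
			b += a;
		}
		a %= MOD_ADLER;
		b %= MOD_ADLER;
	}
	return (b << 16) | a;
}

-->8--

(Called with adler == 1 for a fresh checksum; deferring the modulo is the only
difference from a byte-at-a-time implementation.)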

>
> Also as others have pointed out, there's probably not much point in having a SVE
> implementation of Adler32 when there isn't even a NEON implementation yet. It's
> not too hard to implement Adler32 using NEON, and there are already several
> permissively-licensed NEON implementations out there that could be used as a
> reference, e.g. my implementation using NEON intrinsics here:
> https://github.com/ebiggers/libdeflate/blob/v1.6/lib/arm/adler32_impl.h
>
> - Eric
> .
>

I am very happy to get this NEON implementation code. :)

--
Best regards,
Li Qiang

2020-11-05 07:54:53

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Thu, 5 Nov 2020 at 03:50, Li Qiang <[email protected]> wrote:
>
> Hi Eric,
>
> On 2020/11/5 1:57, Eric Biggers wrote:
> > [...]
>
> Okay, thank you for the problems and suggestions you gave. I will continue to
> improve my code.
>
> >
> > Also as others have pointed out, there's probably not much point in having a SVE
> > implementation of Adler32 when there isn't even a NEON implementation yet. It's
> > not too hard to implement Adler32 using NEON, and there are already several
> > permissively-licensed NEON implementations out there that could be used as a
> > reference, e.g. my implementation using NEON intrinsics here:
> > https://github.com/ebiggers/libdeflate/blob/v1.6/lib/arm/adler32_impl.h
> >
> > - Eric
> > .
> >
>
> I am very happy to get this NEON implementation code. :)
>

Note that NEON intrinsics can be compiled for 32-bit ARM as well (with
a bit of care - please refer to lib/raid6/recov_neon_inner.c for an
example of how to deal with intrinsics that are only available on
arm64) and are less error prone, so intrinsics should be preferred if
feasible.
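
(A minimal illustration, not from the thread, of the pattern referred to here:
shared NEON intrinsics, with the arm64-only vaddvq_u32 hidden behind a small
wrapper so the same code builds for 32-bit ARM, in the spirit of
lib/raid6/recov_neon_inner.c.)

--8<--

#include <arm_neon.h>
#include <stdint.h>

static inline uint32_t hsum_u32(uint32x4_t v)
{
#ifdef __aarch64__
	return vaddvq_u32(v);			/* A64-only horizontal add */
#else
	uint32x2_t t = vadd_u32(vget_low_u32(v), vget_high_u32(v));

	return vget_lane_u32(vpadd_u32(t, t), 0);
#endif
}

/* Sum 16 bytes using intrinsics available on both 32-bit and 64-bit ARM. */
static inline uint32_t sum16(const uint8_t *p)
{
	uint8x16_t x   = vld1q_u8(p);
	uint16x8_t s16 = vpaddlq_u8(x);		/* pairwise widen 8 -> 16 bit */
	uint32x4_t s32 = vpaddlq_u16(s16);	/* pairwise widen 16 -> 32 bit */

	return hsum_u32(s32);
}

-->8--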

However, you have still not explained how optimizing Adler32 makes a
difference for a real-world use case. Where is libdeflate used on a
hot path?

2020-11-05 09:06:53

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.



On 2020/11/5 15:51, Ard Biesheuvel wrote:
> Note that NEON intrinsics can be compiled for 32-bit ARM as well (with
> a bit of care - please refer to lib/raid6/recov_neon_inner.c for an
> example of how to deal with intrinsics that are only available on
> arm64) and are less error prone, so intrinsics should be preferred if
> feasible.
>
> However, you have still not explained how optimizing Adler32 makes a
> difference for a real-world use case. Where is libdeflate used on a
> hot path?
> .

Sorry :(, I have not specifically searched for the use of this algorithm
in the kernel.

When I used perf to test the performance of the libz library before,
I saw that the adler32 algorithm occupies a lot of the hot spots. I just
saw this algorithm used in the kernel code, so I think optimizing it
may have some positive effect on the kernel. :)

--
Best regards,
Li Qiang

2020-11-05 16:54:12

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Tue, Nov 03, 2020 at 08:15:05PM +0800, l00374334 wrote:
> [...]
> So we can get the method of calculating the next block from
> the previous block(Assume that the first byte number of the
> new block starts from 1):
> An+1 = An + D1 + D2 + ... + Dx (mod 65521)
> Bn+1 = Bn + x*An + x*D1 + (x-1)*D2 + ... + Dx (mod 65521)

Putting aside people's concerns for the moment, I think this may be
formulated in a slightly more convenient way:

If
X0, X1, ... are the data bytes
An is 1 + Sum [i=0 .. n-1] Xi
Bn is n + Sum [i=0 .. n-1] (n-i)Xi
= Sum [i=1 .. n] Ai

(i.e., An, Bn are the accumulations for the first n bytes here, not the
first n blocks)

then

A[n+v] - An = Sum[i=n .. n+v-1] Xi

B[n+v] - Bn = v + (Sum [i=0 .. n+v-1] (n+v-i) Xi)
                - (Sum [i=0 .. n-1] (n-i) Xi)

            = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
                + (Sum [i=0 .. n-1] (n+v-i) Xi)
                - (Sum [i=0 .. n-1] (n-i) Xi)

            = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
                + Sum [i=0 .. n-1] ((n+v-i) - (n-i)) Xi

            = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
                + v Sum [i=0 .. n-1] Xi

            = v + v(An - 1) + Sum [i=n .. n+v-1] (n+v-i) Xi

            = vAn + Sum [i=n .. n+v-1] (n+v-i) Xi

            = vAn + v Sum [i=n .. n+v-1] Xi
                  + Sum [i=n .. n+v-1] (n-i) Xi

            = vA[n+v] + Sum [i=n .. n+v-1] (n-i) Xi

Let j = i - n; then:

B[n+v] - Bn = vA[n+v] - Sum [j=0 .. v-1] j X[j+n]

Which gives us a multiplier j that increases with the X[] index.


I think this gives a core loop along the following lines. I don't know
whether this is correct, or whether it works -- but feel free to take
ideas from it if it helps.

Accumulators are 32 bits. This provides for a fair number of iterations
without overflow, but large input data will still require splitting into
chunks, with modulo reduction steps in between. There are rather a lot
of serial dependencies in the core loop, but since the operations
involved are relatively cheap, this is probably not a significant issue
in practice: the load-to-use dependency is probably the bigger concern.
Pipelined loop unrolling could address these if necessary.

The horizontal reductions (UADDV) still probably don't need to be done
until after the last chunk.


Beware: I wasn't careful with the initial values for Bn / An, so some
additional adjustments might be needed...

--8<--

ptrue pP.s
ptrue pA.s

mov zA.s, #0 // accumulator for An
mov zB.s, #0 // accumulator for Bn
index zJ.s, #0, #1 // zJ.s = [0, 1, .. V-1]

mov zV.s, #0
incw zV.s // zV.s = [V, V, .. V]

// where V is number of elements per block
// = the number of 32-bit elements that fit in a Z-register

add xLimit, xX, xLen
b start

loop:
ld1b zX.s, pP/z, [xX]
incw xX

add zA.s, pP/m, zA.s, zX.s // zA.s += zX.s

msb zX.s, pP/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s

movprfx zB, zA
mad zB.s, pP/m, zV.s, zX.s // zB.s := zX.s + zA.s * V
start:
whilelo pP.s, xX, xLimit
b.first loop

// Collect the partial sums together:

uaddv d0, pA, zA.s
uaddv d1, pA, zB.s

// Finally, add 1 to d0, and xLen to d1, and do modulo reduction.

-->8--
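
(A scalar C model of the same block recurrence, not from the thread;
adler32_blockwise and v_max are made-up names, with v_max standing in for the
number of 32-bit elements per Z-register.  Each outer iteration corresponds to
one pass of the loop above.)

--8<--

#include <stdint.h>
#include <stddef.h>

#define MOD_ADLER 65521u

uint32_t adler32_blockwise(const unsigned char *data, size_t len, size_t v_max)
{
	uint64_t a = 1, b = 0;		/* A0 = 1, B0 = 0 */
	size_t n = 0;

	while (n < len) {
		size_t v = len - n < v_max ? len - n : v_max;	/* last block may be partial */
		uint64_t sum = 0, jsum = 0;

		for (size_t j = 0; j < v; j++) {
			sum  += data[n + j];			/* Sum X[n+j]   */
			jsum += j * (uint64_t)data[n + j];	/* Sum j*X[n+j] */
		}
		a += sum;		/* A[n+v] = A[n] + Sum X[n+j]              */
		b += v * a - jsum;	/* B[n+v] = B[n] + v*A[n+v] - Sum j*X[n+j] */
		a %= MOD_ADLER;
		b %= MOD_ADLER;
		n += v;
	}
	return (uint32_t)((b << 16) | a);
}

-->8--

(Unlike the sketch above, the +1 and the +len are folded into the recurrence
here: a starts at 1 and the v*a term carries the per-block length
contribution, so nothing needs to be added at the end.)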

[...]

Cheers
---Dave

2020-11-05 17:57:29

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Wed, Nov 04, 2020 at 06:49:05PM +0000, Mark Brown wrote:
> On Wed, Nov 04, 2020 at 06:13:06PM +0000, Dave Martin wrote:
> > On Wed, Nov 04, 2020 at 05:50:33PM +0000, Mark Brown wrote:
>
> > > I think at a minimum we'd want to handle the vector length explicitly
> > > for kernel mode SVE, vector length independent code will work most of
> > > the time but at the very least it feels like a landmine waiting to cause
> > > trouble. If nothing else there's probably going to be cases where it
> > > makes a difference for performance. Other than that I'm not currently
>
> ...
>
> > The main reasons for constraining the vector length are a) to hide
> > mismatches between CPUs in heterogeneous systems, b) to ensure that
> > validated software doesn't run with a vector length it wasn't validated
> > for, and c) testing.
>
> > For kernel code, it's reasonable to say that all code should be vector-
> > length agnostic unless there's a really good reason not to be. So we
> > may not care too much about (b).
>
> > In that case, just setting ZCR_EL1.LEN to max in kernel_sve_begin() (or
> > whatever) probably makes sense.
>
> I agree, that's most likely a good default.
>
> > For (c), it might be useful to have a command-line parameter or debugfs
> > widget to constrain the vector length for kernel code; perhaps globally
> > or perhaps per driver or algo.
>
> I think a global control would be good for testing, it seems simpler and
> easier all round. The per thing tuning seems more useful for cases
> where we run into something like a performance reason to use a limited
> set of vector lengths but I think we should only add that when we have
> at least one user for it, some examples of actual restrictions we want
> would probably be helpful for designing the interface.

Ack; note that an algo that wants to use a particular vector length can
do so by means of the special predicate patterns VLnnn, POW2, MUL3 etc.
So setting an explicit limit in ZCR_EL1.LEN should hopefully be an
uncommon requirement.

>
> > Nonetheless, working up a candidate algorithm to help us see whether
> > there is a good use case seems like a worthwhile project, so I don't
> > want to discourage that too much.
>
> Definitely worth exploring.

Cheers
---Dave

2020-11-05 18:23:28

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Thu, Nov 05, 2020 at 05:05:53PM +0800, Li Qiang wrote:
>
>
> On 2020/11/5 15:51, Ard Biesheuvel wrote:
> > Note that NEON intrinsics can be compiled for 32-bit ARM as well (with
> > a bit of care - please refer to lib/raid6/recov_neon_inner.c for an
> > example of how to deal with intrinsics that are only available on
> > arm64) and are less error prone, so intrinsics should be preferred if
> > feasible.
> >
> > However, you have still not explained how optimizing Adler32 makes a
> > difference for a real-world use case. Where is libdeflate used on a
> > hot path?
> > .
>
> Sorry :(, I have not specifically searched for the use of this algorithm
> in the kernel.
>
> When I used perf to test the performance of the libz library before,
> I saw that the adler32 algorithm occupies a lot of hot spots.I just
> saw this algorithm used in the kernel code, so I think optimizing this
> algorithm may have some positive optimization effects on the kernel.:)

Adler32 performance is important for zlib compression/decompression, which has a
few use cases in the kernel such as btrfs compression. However, these days
those few kernel use cases are mostly switching to newer algorithms like lz4 and
zstd. Also as I mentioned, your patch doesn't actually wire up your code to be
used by the kernel's implementation of zlib compression/decompression.

I think you'd be much better off contributing to a userspace project, where
DEFLATE/zlib/gzip support still has a long tail of use cases. The official zlib
isn't really being maintained and isn't accepting architecture-specific
optimizations, but there are some performance-oriented forks of zlib (e.g.
https://chromium.googlesource.com/chromium/src/third_party/zlib/ and
https://github.com/zlib-ng/zlib-ng), as well as other projects like libdeflate
(https://github.com/ebiggers/libdeflate). Generally I'm happy to accept
architecture-specific optimizations in libdeflate, but they need to be testable.

- Eric

2020-11-09 03:44:59

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

Hi Dave,

I carefully read the ideas you provided and the sample code you gave me.:)

On 2020/11/6 0:53, Dave Martin wrote:
> On Tue, Nov 03, 2020 at 08:15:05PM +0800, l00374334 wrote:
>> [...]
>> So we can get the method of calculating the next block from
>> the previous block(Assume that the first byte number of the
>> new block starts from 1):
>> An+1 = An + D1 + D2 + ... + Dx (mod 65521)
>> Bn+1 = Bn + x*An + x*D1 + (x-1)*D2 + ... + Dx (mod 65521)
>
> Putting aside people's concerns for the moment, I think this may be
> formulated in a slightly more convenient way:
>
> If
> X0, X1, ... are the data bytes
> An is 1 + Sum [i=0 .. n-1] Xi
> Bn is n + Sum [i=0 .. n-1] (n-i)Xi
> = Sum [i=1 .. n] Ai

Yes, this can be calculated from the original expression.

>
> (i.e., An, Bn are the accumulations for the first n bytes here, not the
> first n blocks)
>
> then
>
> A[n+v] - An = Sum[i=n .. n+v-1] Xi
>
> B[n+v] - Bn = v + (Sum [i=0 .. n+v-1] (n+v-i) Xi)
> - Sum [i=0 .. n-1] (n-i)Xi
>
> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
> + (Sum [i=0 .. n-1] (n+v-i) Xi)
> - Sum [i=0 .. n-1] (n-i)Xi
>
> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
> + Sum [i=0 .. n-1] ((n+v-i) - (n-i)) Xi
>
> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
> + vSum [i=0 .. n-1] Xi
>
> = v + v(An - 1) + Sum [i=n .. n+v-1] (n+v-i) Xi
>
> = vAn + Sum [i=n .. n+v-1] (n+v-i) Xi
>
> = vAn + vSum [i=n .. n+v-1] Xi
> + Sum [i=n .. n+v-1] (n-i) Xi
>
> = vAn + vSum [i=n .. n+v-1] Xi
> + Sum [i=n .. n+v-1] (n-i) Xi
>
> = vA[n+v] + Sum [i=n .. n+v-1] (n-i) Xi
>
> Let j = i - n; then:
>
> B[n+v] - Bn = vA[n+v] - Sum [j=0 .. v-1] j X[j+n]
>
> Which gives us a multiplier j that increases with the X[] index.

My original approach is to multiply the byte sequence with the **decreasing**
sequence. I think the increasing sequence here is more friendly to the trailing
bytes of the loop.:-)

>
>
> I think this gives a core loop along the following lines. I don't know
> whether this is correct, or whether it works -- but feel free to take
> ideas from it if it helps.
>
> Accumulators are 32 bits. This provides for a fair number of iterations
> without overflow, but large input data will still require splitting into
> chunks, with modulo reduction steps in between. There are rather a lot
> of serial dependencies in the core loop, but since the operations
> involved are relatively cheap, this is probably not a significant issue
> in practice: the load-to-use dependency is probably the bigger concern.
> Pipelined loop unrolling could address these if necessary.
>
> The horizontal reductions (UADDV) still probably don't need to be done
> until after the last chunk.
>
>
> Beware: I wasn't careful with the initial values for Bn / An, so some
> additional adjustments might be needed...
>
> --8<--
>
> ptrue pP.s
> ptrue pA.s
>
> mov zA.s, #0 // accumulator for An
> mov zB.s, #0 // accumulator for Bn
> index zJ.s, #0, #1 // zJ.s = [0, 1, .. V-1]
>
> mov zV.s, #0
> incw zV.s // zV.s = [V, V, .. V]

When I actually tested this code, I found that the result is only correct when
the byte length of the test is an integer multiple of the vector length divided
by 4 (that is, of the number of 32-bit elements in an SVE register).

And I think one of the reasons is that the value stored in zV.s needs to be
changed in the last iteration. It should be changed to the actual number of
bytes remaining in the last iteration. :)

>
> // where V is number of elements per block
> // = the number of 32-bit elements that fit in a Z-register
>
> add xLimit, xX, xLen
> b start
>
> loop:
> ld1b zX.s, pP/z, [xX]
> incw xX

In order to verify my conjecture, I added a bit of code here to subtract the
corresponding value from zV in the last iteration. But it is correct only when
the total number of bytes is less than one full vector; test cases that span
more than one complete iteration still fail.

So I guess zA should be summed before the last iteration is calculated: until
the loop ends, zA and zB are scattered across the elements of the vector, and
if the last iteration computes zB with only part of zA summed (because pP/m
ignores inactive elements), the result is incomplete.

This is just my guess and has not yet been verified. :)

>
> add zA.s, pP/m, zA.s, zX.s // zA.s += zX.s
>
> msb zX.s, pP/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
>
> movprfx zB, zA
> mad zB.s, pP/m, zV.s, zX.s // zB.s := zX.s + zA.s * V
> start:
> whilelo pP.s, xX, xLimit
> b.first loop
>
> // Collect the partial sums together:
>
> uaddv d0, pA, z0.s
> uaddv d1, pA, z1.s
>
> // Finally, add 1 to d0, and xLen to d1, and do modulo reduction.
>
> -->8--
>
> [...]
>
> Cheers
> ---Dave
> .
>

The code you sent me provides a correct way to deal with the trailing bytes;
I need to spend some more time debugging the problem encountered above.
Thank you!

Cheers.^_^

--
Best regards,
Li Qiang

2020-11-09 06:30:37

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.



On 2020/11/6 2:21, Eric Biggers wrote:
> On Thu, Nov 05, 2020 at 05:05:53PM +0800, Li Qiang wrote:
>> [...]
>
> Adler32 performance is important for zlib compression/decompression, which has a
> few use cases in the kernel such as btrfs compression. However, these days
> those few kernel use cases are mostly switching to newer algorithms like lz4 and
> zstd. Also as I mentioned, your patch doesn't actually wire up your code to be
> used by the kernel's implementation of zlib compression/decompression.
>
> I think you'd be much better off contributing to a userspace project, where
> DEFLATE/zlib/gzip support still has a long tail of use cases. The official zlib
> isn't really being maintained and isn't accepting architecture-specific
> optimizations, but there are some performance-oriented forks of zlib (e.g.
> https://chromium.googlesource.com/chromium/src/third_party/zlib/ and
> https://github.com/zlib-ng/zlib-ng), as well as other projects like libdeflate
> (https://github.com/ebiggers/libdeflate). Generally I'm happy to accept
> architecture-specific optimizations in libdeflate, but they need to be testable.
>
> - Eric
> .
>

Thank you for your answers and suggestions. I had not seen these repositories
before. Regarding the SVE implementation of adler32, I will look into the
repositories you mentioned. :)

--
Best regards,
Li Qiang

2020-11-10 10:47:14

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Mon, Nov 09, 2020 at 11:43:35AM +0800, Li Qiang wrote:
> Hi Dave,
>
> I carefully read the ideas you provided and the sample code you gave me.:)
>
> On 2020/11/6 0:53, Dave Martin wrote:
> > On Tue, Nov 03, 2020 at 08:15:05PM +0800, l00374334 wrote:
> >> [...]
> >> So we can get the method of calculating the next block from
> >> the previous block(Assume that the first byte number of the
> >> new block starts from 1):
> >> An+1 = An + D1 + D2 + ... + Dx (mod 65521)
> >> Bn+1 = Bn + x*An + x*D1 + (x-1)*D2 + ... + Dx (mod 65521)
> >
> > Putting aside people's concerns for the moment, I think this may be
> > formulated in a slightly more convenient way:
> >
> > If
> > X0, X1, ... are the data bytes
> > An is 1 + Sum [i=0 .. n-1] Xi
> > Bn is n + Sum [i=0 .. n-1] (n-i)Xi
> > = Sum [i=1 .. n] Ai
>
> Yes, this can be calculated from the original expression.
>
> >
> > (i.e., An, Bn are the accumulations for the first n bytes here, not the
> > first n blocks)
> >
> > then
> >
> > A[n+v] - An = Sum[i=n .. n+v-1] Xi
> >
> > B[n+v] - Bn = v + (Sum [i=0 .. n+v-1] (n+v-i) Xi)
> > - Sum [i=0 .. n-1] (n-i)Xi
> >
> > = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
> > + (Sum [i=0 .. n-1] (n+v-i) Xi)
> > - Sum [i=0 .. n-1] (n-i)Xi
> >
> > = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
> > + Sum [i=0 .. n-1] ((n+v-i) - (n-i)) Xi
> >
> > = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
> > + vSum [i=0 .. n-1] Xi
> >
> > = v + v(An - 1) + Sum [i=n .. n+v-1] (n+v-i) Xi
> >
> > = vAn + Sum [i=n .. n+v-1] (n+v-i) Xi
> >
> > = vAn + vSum [i=n .. n+v-1] Xi
> > + Sum [i=n .. n+v-1] (n-i) Xi
> >
> > = vAn + vSum [i=n .. n+v-1] Xi
> > + Sum [i=n .. n+v-1] (n-i) Xi
> >
> > = vA[n+v] + Sum [i=n .. n+v-1] (n-i) Xi
> >
> > Let j = i - n; then:
> >
> > B[n+v] - Bn = vA[n+v] - Sum [j=0 .. v-1] j X[j+n]
> >
> > Which gives us a multiplier j that increases with the X[] index.
>
> My original approach is to multiply the byte sequence with the **decreasing**
> sequence. I think the increasing sequence here is more friendly to the trailing
> bytes of the loop.:-)
>
> >
> >
> > I think this gives a core loop along the following lines. I don't know
> > whether this is correct, or whether it works -- but feel free to take
> > ideas from it if it helps.
> >
> > Accumulators are 32 bits. This provides for a fair number of iterations
> > without overflow, but large input data will still require splitting into
> > chunks, with modulo reduction steps in between. There are rather a lot
> > of serial dependencies in the core loop, but since the operations
> > involved are relatively cheap, this is probably not a significant issue
> > in practice: the load-to-use dependency is probably the bigger concern.
> > Pipelined loop unrolling could address these if necessary.
> >
> > The horizontal reductions (UADDV) still probably don't need to be done
> > until after the last chunk.
> >
> >
> > Beware: I wasn't careful with the initial values for Bn / An, so some
> > additional adjustments might be needed...
> >
> > --8<--
> >
> > ptrue pP.s
> > ptrue pA.s
> >
> > mov zA.s, #0 // accumulator for An
> > mov zB.s, #0 // accumulator for Bn
> > index zJ.s, #0, #1 // zJ.s = [0, 1, .. V-1]
> >
> > mov zV.s, #0
> > incw zV.s // zV.s = [V, V, .. V]
>
> When I actually tested this code, I found that the byte length of the
> test must be equal to the vector length divided by 4 (that is, an integer
> multiple of the number of words in the SVE) and the result is correct.
>
> And I think one of the reasons is that the value stored in zV.s needs to be
> changed in the last cycle. It should be changed to the actual number of
> bytes remaining in the last cycle. :)

Ah yes, you're quite right about this. In my derivation above, v will
need to be smaller than the vector length when processing the final,
partial block (if any).

>
> >
> > // where V is number of elements per block
> > // = the number of 32-bit elements that fit in a Z-register
> >
> > add xLimit, xX, xLen
> > b start
> >
> > loop:
> > ld1b zX.s, pP/z, [xX]
> > incw xX
>
> In order to verify my conjecture, I added a bit of code to here, which is to
> subtract the corresponding value from zV in the last loop. But it is correct
> only when the number of bytes is less than one cycle. Test cases that exceed
> one complete cycle will also fail.
>
> So I guess before calculating the last cycle, zA should be summed first,
> because before the end of the cycle, zA and zB are scattered in the elements
> of the vector, if the last cycle calculates zB, only part of zA is summed
> ( Because pP/m does not count inactive elements), it should be incomplete.
>
> This is just my guess and has not yet been verified.:)

I think zA shouldn't be wrong, since that only accumulates active
elements anyway. I think it's the bogus zV multiplier involved in the
update to zB that is the problem.

> >
> > add zA.s, pP/m, zA.s, zX.s // zA.s += zX.s
> >
> > msb zX.s, pP/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
> >
> > movprfx zB, zA
> > mad zB.s, pP/m, zV.s, zX.s // zB.s := zX.s + zA.s * V
> > start:
> > whilelo pP.s, xX, xLimit
> > b.first loop

There are a few options, I guess.

One way is to use CNTP inside the loop to get the number of active
elements, instead of just setting zV in advance. This approach may add
a slight overhead, but it is worth experimenting with. If the
overhead is negligible, then this approach has the advantage of being
simple to understand.

Alternatively, you could do an adjustment after the loop ends to
subtract the appropriate amounts from zB. Unfortunately, pP has already
been overwritten by the time the loop ends. It could be backed up into
another P-register before overwriting it: this should be pretty low-
overhead.

A final option would be to change the b.first to a b.last, so that the
loop ends after the final full block. Then copy-paste the loop body to
execute once more, but using CNTP to get the element count to multiply
by, as described above. This makes the code a bit larger, but it
is probably the best-performance option. You may be able to rotate the
loop so that it breaks out after updating zA (which IIUC doesn't need to
be done in a special way for the partial block). This would mean you
only have to specialise the zB update code.

> >
> > // Collect the partial sums together:
> >
> > uaddv d0, pA, z0.s
> > uaddv d1, pA, z1.s
> >
> > // Finally, add 1 to d0, and xLen to d1, and do modulo reduction.
> >
> > -->8--
> >
> > [...]
> >
> > Cheers
> > ---Dave
> > .
> >
>
> The code you sent me provides a correct way to deal with trailing bytes,
> I need to spend some more time to debug the problem encountered above.
> Thank you!
>
> Cheers.^_^

I was cheekily leaving the testing and debugging for you to do :)

Part of my reason for interest in this is that if we enable SVE in the
kernel, it would be good to have some clean examples for people to
follow -- even if not this algo. SVE is a bit different to use
compared with fixed-length vector ISAs, so worked examples are always
useful.

Cheers
---Dave

2020-11-10 13:21:36

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.



On 2020/11/10 18:46, Dave Martin wrote:
> On Mon, Nov 09, 2020 at 11:43:35AM +0800, Li Qiang wrote:
>> Hi Dave,
>>
>> I carefully read the ideas you provided and the sample code you gave me.:)
>>
>> On 2020/11/6 0:53, Dave Martin wrote:
>>> On Tue, Nov 03, 2020 at 08:15:05PM +0800, l00374334 wrote:
>>>> [...]
>>>> So we can get the method of calculating the next block from
>>>> the previous block(Assume that the first byte number of the
>>>> new block starts from 1):
>>>> An+1 = An + D1 + D2 + ... + Dx (mod 65521)
>>>> Bn+1 = Bn + x*An + x*D1 + (x-1)*D2 + ... + Dx (mod 65521)
>>>
>>> Putting aside people's concerns for the moment, I think this may be
>>> formulated in a slightly more convenient way:
>>>
>>> If
>>> X0, X1, ... are the data bytes
>>> An is 1 + Sum [i=0 .. n-1] Xi
>>> Bn is n + Sum [i=0 .. n-1] (n-i)Xi
>>> = Sum [i=1 .. n] Ai
>>
>> Yes, this can be calculated from the original expression.
>>
>>>
>>> (i.e., An, Bn are the accumulations for the first n bytes here, not the
>>> first n blocks)
>>>
>>> then
>>>
>>> A[n+v] - An = Sum[i=n .. n+v-1] Xi
>>>
>>> B[n+v] - Bn = v + (Sum [i=0 .. n+v-1] (n+v-i) Xi)
>>> - Sum [i=0 .. n-1] (n-i)Xi
>>>
>>> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
>>> + (Sum [i=0 .. n-1] (n+v-i) Xi)
>>> - Sum [i=0 .. n-1] (n-i)Xi
>>>
>>> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
>>> + Sum [i=0 .. n-1] ((n+v-i) - (n-i)) Xi
>>>
>>> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
>>> + vSum [i=0 .. n-1] Xi
>>>
>>> = v + v(An - 1) + Sum [i=n .. n+v-1] (n+v-i) Xi
>>>
>>> = vAn + Sum [i=n .. n+v-1] (n+v-i) Xi
>>>
>>> = vAn + vSum [i=n .. n+v-1] Xi
>>> + Sum [i=n .. n+v-1] (n-i) Xi
>>>
>>> = vAn + vSum [i=n .. n+v-1] Xi
>>> + Sum [i=n .. n+v-1] (n-i) Xi
>>>
>>> = vA[n+v] + Sum [i=n .. n+v-1] (n-i) Xi
>>>
>>> Let j = i - n; then:
>>>
>>> B[n+v] - Bn = vA[n+v] - Sum [j=0 .. v-1] j X[j+n]
>>>
>>> Which gives us a multiplier j that increases with the X[] index.
>>
>> My original approach is to multiply the byte sequence with the **decreasing**
>> sequence. I think the increasing sequence here is more friendly to the trailing
>> bytes of the loop.:-)
>>
>>>
>>>
>>> I think this gives a core loop along the following lines. I don't know
>>> whether this is correct, or whether it works -- but feel free to take
>>> ideas from it if it helps.
>>>
>>> Accumulators are 32 bits. This provides for a fair number of iterations
>>> without overflow, but large input data will still require splitting into
>>> chunks, with modulo reduction steps in between. There are rather a lot
>>> of serial dependencies in the core loop, but since the operations
>>> involved are relatively cheap, this is probably not a significant issue
>>> in practice: the load-to-use dependency is probably the bigger concern.
>>> Pipelined loop unrolling could address these if necessary.
>>>
>>> The horizontal reductions (UADDV) still probably don't need to be done
>>> until after the last chunk.
>>>
>>>
>>> Beware: I wasn't careful with the initial values for Bn / An, so some
>>> additional adjustments might be needed...
>>>
>>> --8<--
>>>
>>> ptrue pP.s
>>> ptrue pA.s
>>>
>>> mov zA.s, #0 // accumulator for An
>>> mov zB.s, #0 // accumulator for Bn
>>> index zJ.s, #0, #1 // zJ.s = [0, 1, .. V-1]
>>>
>>> mov zV.s, #0
>>> incw zV.s // zV.s = [V, V, .. V]
>>
>> When I actually tested this code, I found that the byte length of the
>> test must be equal to the vector length divided by 4 (that is, an integer
>> multiple of the number of words in the SVE) and the result is correct.
>>
>> And I think one of the reasons is that the value stored in zV.s needs to be
>> changed in the last cycle. It should be changed to the actual number of
>> bytes remaining in the last cycle. :)
>
> Ah yes, you're quite right about this. In my derivation above, v will
> need to be smaller than the vector length when processing the final,
> partial block (if any).
>
>>
>>>
>>> // where V is number of elements per block
>>> // = the number of 32-bit elements that fit in a Z-register
>>>
>>> add xLimit, xX, xLen
>>> b start
>>>
>>> loop:
>>> ld1b zX.s, pP/z, [xX]
>>> incw xX
>>
>> In order to verify my conjecture, I added a bit of code to here, which is to
>> subtract the corresponding value from zV in the last loop. But it is correct
>> only when the number of bytes is less than one cycle. Test cases that exceed
>> one complete cycle will also fail.
>>
>> So I guess before calculating the last cycle, zA should be summed first,
>> because before the end of the cycle, zA and zB are scattered in the elements
>> of the vector, if the last cycle calculates zB, only part of zA is summed
>> ( Because pP/m does not count inactive elements), it should be incomplete.
>>
>> This is just my guess and has not yet been verified.:)
>
> I think zA shouldn't be wrong, since that only accumulates active
> elements anyway. I think it's the bogus zV multiplier involved in the
> update to zB that is the problem.
>
>>>
>>> add zA.s, pP/m, zA.s, zX.s // zA.s += zX.s
>>>
>>> msb zX.s, pP/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
>>>
>>> movprfx zB, zA
>>> mad zB.s, pP/m, zV.s, zX.s // zB.s := zX.s + zA.s * V

I found the bug I encountered earlier: the calculation of zB here needs to use
pA, the predicate with all elements active. The reason is the same as my
previous guess: all elements of zA should be involved when calculating zB,
because that is what the original formula requires.

For example, in the last iteration:
remaining bytes: 3 | 4 | \0 |
zA.s is: 100 | 200 | 100 | 200 (sum = 600)
pP.s is: 1 | 1 | 0 | 0 (only the first 2 lanes are active)

At this point, if the calculation of zB only takes the first 2 active elements,
the data is incomplete, because according to the original algorithm zB is
always based on the sum of all the accumulated bytes.

Here we can simply change the predicate register used in the two instructions
related to zB to the all-true one (pA in our code), like this:
msb zX.s, pA/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s

movprfx zB, zA
mad zB.s, pA/m, zV.s, zX.s // zB.s := zX.s + zA.s * V
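
(To make the point concrete, here is a scalar C model, illustrative only, of
the per-lane accumulators; adler32_lanes, nlanes and MAX_LANES are made-up
names.  The zB update runs over all lanes, which is the pA behaviour;
restricting that loop to the first v lanes reproduces the bug described above.
There is no periodic modulo reduction, so it only works for short buffers, as
discussed further down.)

--8<--

#include <stdint.h>
#include <stddef.h>

#define MOD_ADLER 65521u
#define MAX_LANES 64u		/* model assumes nlanes <= MAX_LANES */

uint32_t adler32_lanes(const unsigned char *data, size_t len, size_t nlanes)
{
	uint64_t a[MAX_LANES] = { 0 };	/* per-lane zA */
	uint64_t b[MAX_LANES] = { 0 };	/* per-lane zB */
	uint64_t A = 1, B = len;	/* the final "+1" and "+len" */
	size_t n = 0;

	while (n < len) {
		size_t v = len - n < nlanes ? len - n : nlanes;	/* active lanes */

		for (size_t j = 0; j < v; j++)		/* predicated: zA += zX */
			a[j] += data[n + j];

		for (size_t j = 0; j < nlanes; j++) {	/* all-true: zB += v*zA - j*zX */
			uint64_t x = j < v ? data[n + j] : 0;	/* zX loaded with pP/z */

			b[j] += v * a[j] - j * x;
		}
		n += v;
	}

	for (size_t j = 0; j < nlanes; j++) {
		A += a[j];
		B += b[j];
	}
	return (uint32_t)(((B % MOD_ADLER) << 16) | (A % MOD_ADLER));
}

-->8--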

>>> start:
>>> whilelo pP.s, xX, xLimit
>>> b.first loop
>
> There are a few options, I guess.
>
> One way is to use CNTP inside the loop to get the number of active
> elements, instead of just setting zV in advance. This approach may add
> a slight overhead, but it is worth experimenting with it. If the
> overhead is neglibible, then this approach has the example of being
> simple to understand.
>
> Alternatively, you could do an adjustment after the loop ends to
> subtract the appropriate amounts from zB. Unfortunately, pP has already
> been overwritten by the time the loop ends. If could be backed up into
> another P-register before overwriting it: this should be pretty low-
> overhead.
>
> A final option would be to change the b.first to a b.last, so that the
> loop ends after the final full block. Then copy-paste the loop body to
> execute once more, but using CNTP to get the element count to multiply
> by, as described above. This makes the code a bit larger, but it
> probably the best-performance option. You may be able to rotate the
> loop so that it breaks out after updating zA (which IIUC doesn't need to
> be done in a special way for the partial block). This would mean you
> only have to specialise the zB update code.

Regarding the last loop, I tried to wrap another layer before and after the
core loop, and I have verified its correctness.
My code is at the end of this email.

>
>>>
>>> // Collect the partial sums together:
>>>
>>> uaddv d0, pA, z0.s
>>> uaddv d1, pA, z1.s
>>>
>>> // Finally, add 1 to d0, and xLen to d1, and do modulo reduction.
>>>
>>> -->8--
>>>
>>> [...]
>>>
>>> Cheers
>>> ---Dave
>>> .
>>>
>>
>> The code you sent me provides a correct way to deal with trailing bytes,
>> I need to spend some more time to debug the problem encountered above.
>> Thank you!
>>
>> Cheers.^_^
>
> I was cheekily leaving the testing and debugging for you to do :)
>
> Part of my reason for interest in this is that if we enable SVE in the
> kernel, it would be good to have some clean examples for people to
> follow -- even if not this algo. SVE is a bit different to use
> compared with fixed-length vector ISAs, so worked examples are always
> useful.
Yes it is necessary.

>
> Cheers
> ---Dave
> .
>

This is the main part of my newly modified code:
--8<--
ptrue p0.s
ptrue p1.s

mov zA.s, #0
mov zB.s, #0
index zJ.s, #0, #1
mov zV.s, #0
incw zV.s

add xLimit, x1, x2
b start //Enter the core loop first

trailloop: // Last cycle entrance
cntp x6, p1, p0.s // Get active element count of last cycle
cpy zV.s, p1/m, w6 // Set zV to the actual value.

loop: // Core loop entrance
ld1b zX.s, p0/z, [x1]
incw x1

add zA.s, p0/m, zA.s, zX.s // The calculation of zA still needs to use p0
msb zX.s, p1/m, zJ.s, zB.s // Change p register here
movprfx zB, zA
mad zB.s, p1/m, zV.s, zX.s // Change p register here
start:
whilelo p0.s, x1, xLimit
b.last loop // The condition for the core loop to continue is that b.last is true
b.first trailloop // If b.last is false and b.first is true, it means the last cycle

uaddv d0, p1, zA.s
uaddv d1, p1, zB.s

mov x12, v0.d[0]
mov x13, v1.d[0]
add x10, x10, x12
add x11, x11, x13
add x11, x11, x2

mod65521 10, 14, 12
mod65521 11, 14, 12
lsl x11, x11, #16
orr x0, x10, x11
ret
-->8--

After this modification, the test results are correct when the data length is less than about 8 Kbyte.
Part A is still correct beyond 8K, but an overflow error occurs in part B. This is because A only
accumulates the bytes, while the accumulated value of B grows much faster, since the formula for
B is:
B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
= n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521)

If we take the average value of Dx to be 128 and n to be 8192:
B = (1 + 2 + ... + 8192) * 128 + 8192
= 4,295,499,776 (32-bit overflow)

So I think the 32-bit accumulator is still not enough for part B here. :)
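
(For reference, a tiny standalone check, not from the thread, of how long a
run the 32-bit accumulators can tolerate in the worst case; it reproduces the
NMAX = 5552 bound that zlib's C implementation uses between modulo
reductions.)

--8<--

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Worst case: a and b start at 65520 (the largest residue) and every
	 * byte is 0xff.  After m bytes:
	 *     b = 65520 + m*65520 + 255*m*(m+1)/2
	 * Find the largest m for which this still fits in 32 bits. */
	uint64_t limit = (uint64_t)1 << 32;
	uint64_t m = 0;

	while (65520 + (m + 1) * 65520 + 255 * (m + 1) * (m + 2) / 2 < limit)
		m++;

	printf("longest safe run: %llu bytes\n", (unsigned long long)m);	/* 5552 */
	return 0;
}

-->8--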

--
Best regards,
Li Qiang

2020-11-10 16:08:12

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Tue, Nov 10, 2020 at 09:20:46PM +0800, Li Qiang wrote:
>
>
> On 2020/11/10 18:46, Dave Martin wrote:
> > On Mon, Nov 09, 2020 at 11:43:35AM +0800, Li Qiang wrote:
> >> Hi Dave,
> >>
> >> I carefully read the ideas you provided and the sample code you gave me.:)
> >>
> >> On 2020/11/6 0:53, Dave Martin wrote:
> >>> On Tue, Nov 03, 2020 at 08:15:05PM +0800, l00374334 wrote:
> >>>> [...]
> >>>> So we can get the method of calculating the next block from
> >>>> the previous block(Assume that the first byte number of the
> >>>> new block starts from 1):
> >>>> An+1 = An + D1 + D2 + ... + Dx (mod 65521)
> >>>> Bn+1 = Bn + x*An + x*D1 + (x-1)*D2 + ... + Dx (mod 65521)
> >>>
> >>> Putting aside people's concerns for the moment, I think this may be
> >>> formulated in a slightly more convenient way:
> >>>
> >>> If
> >>> X0, X1, ... are the data bytes
> >>> An is 1 + Sum [i=0 .. n-1] Xi
> >>> Bn is n + Sum [i=0 .. n-1] (n-i)Xi
> >>> = Sum [i=1 .. n] Ai
> >>
> >> Yes, this can be calculated from the original expression.
> >>
> >>>
> >>> (i.e., An, Bn are the accumulations for the first n bytes here, not the
> >>> first n blocks)
> >>>
> >>> then
> >>>
> >>> A[n+v] - An = Sum[i=n .. n+v-1] Xi
> >>>
> >>> B[n+v] - Bn = v + (Sum [i=0 .. n+v-1] (n+v-i) Xi)
> >>> - Sum [i=0 .. n-1] (n-i)Xi
> >>>
> >>> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
> >>> + (Sum [i=0 .. n-1] (n+v-i) Xi)
> >>> - Sum [i=0 .. n-1] (n-i)Xi
> >>>
> >>> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
> >>> + Sum [i=0 .. n-1] ((n+v-i) - (n-i)) Xi
> >>>
> >>> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi)
> >>> + vSum [i=0 .. n-1] Xi
> >>>
> >>> = v + v(An - 1) + Sum [i=n .. n+v-1] (n+v-i) Xi
> >>>
> >>> = vAn + Sum [i=n .. n+v-1] (n+v-i) Xi
> >>>
> >>> = vAn + vSum [i=n .. n+v-1] Xi
> >>> + Sum [i=n .. n+v-1] (n-i) Xi
> >>>
> >>> = vAn + vSum [i=n .. n+v-1] Xi
> >>> + Sum [i=n .. n+v-1] (n-i) Xi
> >>>
> >>> = vA[n+v] + Sum [i=n .. n+v-1] (n-i) Xi
> >>>
> >>> Let j = i - n; then:
> >>>
> >>> B[n+v] - Bn = vA[n+v] - Sum [j=0 .. v-1] j X[j+n]
> >>>
> >>> Which gives us a multiplier j that increases with the X[] index.
> >>
> >> My original approach is to multiply the byte sequence with the **decreasing**
> >> sequence. I think the increasing sequence here is more friendly to the trailing
> >> bytes of the loop.:-)
> >>
> >>>
> >>>
> >>> I think this gives a core loop along the following lines. I don't know
> >>> whether this is correct, or whether it works -- but feel free to take
> >>> ideas from it if it helps.
> >>>
> >>> Accumulators are 32 bits. This provides for a fair number of iterations
> >>> without overflow, but large input data will still require splitting into
> >>> chunks, with modulo reduction steps in between. There are rather a lot
> >>> of serial dependencies in the core loop, but since the operations
> >>> involved are relatively cheap, this is probably not a significant issue
> >>> in practice: the load-to-use dependency is probably the bigger concern.
> >>> Pipelined loop unrolling could address these if necessary.
> >>>
> >>> The horizontal reductions (UADDV) still probably don't need to be done
> >>> until after the last chunk.
> >>>
> >>>
> >>> Beware: I wasn't careful with the initial values for Bn / An, so some
> >>> additional adjustments might be needed...
> >>>
> >>> --8<--
> >>>
> >>> ptrue pP.s
> >>> ptrue pA.s
> >>>
> >>> mov zA.s, #0 // accumulator for An
> >>> mov zB.s, #0 // accumulator for Bn
> >>> index zJ.s, #0, #1 // zJ.s = [0, 1, .. V-1]
> >>>
> >>> mov zV.s, #0
> >>> incw zV.s // zV.s = [V, V, .. V]
> >>
> >> When I actually tested this code, I found that the byte length of the
> >> test must be equal to the vector length divided by 4 (that is, an integer
> >> multiple of the number of words in the SVE) and the result is correct.
> >>
> >> And I think one of the reasons is that the value stored in zV.s needs to be
> >> changed in the last cycle. It should be changed to the actual number of
> >> bytes remaining in the last cycle. :)
> >
> > Ah yes, you're quite right about this. In my derivation above, v will
> > need to be smaller than the vector length when processing the final,
> > partial block (if any).
> >
> >>
> >>>
> >>> // where V is number of elements per block
> >>> // = the number of 32-bit elements that fit in a Z-register
> >>>
> >>> add xLimit, xX, xLen
> >>> b start
> >>>
> >>> loop:
> >>> ld1b zX.s, pP/z, [xX]
> >>> incw xX
> >>
> >> In order to verify my conjecture, I added a bit of code to here, which is to
> >> subtract the corresponding value from zV in the last loop. But it is correct
> >> only when the number of bytes is less than one cycle. Test cases that exceed
> >> one complete cycle will also fail.
> >>
> >> So I guess before calculating the last cycle, zA should be summed first,
> >> because before the end of the cycle, zA and zB are scattered in the elements
> >> of the vector, if the last cycle calculates zB, only part of zA is summed
> >> ( Because pP/m does not count inactive elements), it should be incomplete.
> >>
> >> This is just my guess and has not yet been verified.:)
> >
> > I think zA shouldn't be wrong, since that only accumulates active
> > elements anyway. I think it's the bogus zV multiplier involved in the
> > update to zB that is the problem.
> >
> >>>
> >>> add zA.s, pP/m, zA.s, zX.s // zA.s += zX.s
> >>>
> >>> msb zX.s, pP/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
> >>>
> >>> movprfx zB, zA
> >>> mad zB.s, pP/m, zV.s, zX.s // zB.s := zX.s + zA.s * V
>
> I found the bug I encountered earlier, that is, the calculation of zB here
> needs to use pA with all elements activated. The reason is the same as my
> previous guess, because all elements of zA should be involved when calculating zB.
> Because the original calculation formula is like this.
>
> For example:
> In the last loop:
> left byte is: 3 | 4 | \0 |
> zA.s is: 100 | 200 | 100 | 200 (sum = 600)
> pP.s is: 1 | 1 | 0 | 0 (Only activate the first 2 channels)
>
> At this time, if the calculation of zB only takes the first 2 active elements, the data
> is incomplete, because according to the description of the original algorithm, zB is always
> based on the sum of all the accumulated bytes.

Yes, you're quite right here: zX is partial: only the elements pP are
valid; but all elements of zA and zB are always valid. I was focusing
too much on the handling of the partial input block.

>
> Here we can simply change the prediction register used in the two sentences related to
> zB to the one that is all true (it is pA in our code), like this:
> msb zX.s, pA/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s

Are you sure about this? In a final partial block, the trailing
elements of zX.s beyond pP will be leftover junk from the last
iteration.

This might require a bit more thought in order to get the final block
handling correct.

> movprfx zB, zA
> mad zB.s, pA/m, zV.s, zX.s // zB.s := zX.s + zA.s * V
>
> >>> start:
> >>> whilelo pP.s, xX, xLimit
> >>> b.first loop
> >
> > There are a few options, I guess.
> >
> > One way is to use CNTP inside the loop to get the number of active
> > elements, instead of just setting zV in advance. This approach may add
> > a slight overhead, but it is worth experimenting with it. If the
> > overhead is neglibible, then this approach has the example of being
> > simple to understand.
> >
> > Alternatively, you could do an adjustment after the loop ends to
> > subtract the appropriate amounts from zB. Unfortunately, pP has already
> > been overwritten by the time the loop ends. If could be backed up into
> > another P-register before overwriting it: this should be pretty low-
> > overhead.
> >
> > A final option would be to change the b.first to a b.last, so that the
> > loop ends after the final full block. Then copy-paste the loop body to
> > execute once more, but using CNTP to get the element count to multiply
> > by, as described above. This makes the code a bit larger, but it
> > probably the best-performance option. You may be able to rotate the
> > loop so that it breaks out after updating zA (which IIUC doesn't need to
> > be done in a special way for the partial block). This would mean you
> > only have to specialise the zB update code.
>
> Regarding the last loop, I tried to wrap another layer before and after the
> core loop, and I have verified its correctness.
> My code is at the end of this email.
>
> >
> >>>
> >>> // Collect the partial sums together:
> >>>
> >>> uaddv d0, pA, z0.s
> >>> uaddv d1, pA, z1.s
> >>>
> >>> // Finally, add 1 to d0, and xLen to d1, and do modulo reduction.
> >>>
> >>> -->8--
> >>>
> >>> [...]
> >>>
> >>> Cheers
> >>> ---Dave
> >>> .
> >>>
> >>
> >> The code you sent me provides a correct way to deal with trailing bytes,
> >> I need to spend some more time to debug the problem encountered above.
> >> Thank you!
> >>
> >> Cheers.^_^
> >
> > I was cheekily leaving the testing and debugging for you to do :)
> >
> > Part of my reason for interest in this is that if we enable SVE in the
> > kernel, it would be good to have some clean examples for people to
> > follow -- even if not this algo. SVE is a bit different to use
> > compared with fixed-length vector ISAs, so worked examples are always
> > useful.
> Yes it is necessary.
>
> >
> > Cheers
> > ---Dave
> > .
> >
>
> This is the main part of my newly modified code:
> --8<--
> ptrue p0.s
> ptrue p1.s
>
> mov zA.s, #0
> mov zB.s, #0
> index zJ.s, #0, #1
> mov zV.s, #0
> incw zV.s
>
> add xLimit, x1, x2
> b start //Enter the core loop first
>
> trailloop: // Last cycle entrance
> cntp x6, p1, p0.s // Get active element count of last cycle
> cpy zV.s, p1/m, w6 // Set zV to the actual value.

Note that you can also write "mov" here, but I'm not sure which alias is
preferred.

> loop: // Core loop entrance
> ld1b zX.s, p0/z, [x1]
> incw x1
>
> add zA.s, p0/m, zA.s, zX.s // The calculation of zA still needs to use p0
> msb zX.s, p1/m, zJ.s, zB.s // Change p register here
> movprfx zB, zA
> mad zB.s, p1/m, zV.s, zX.s // Change p register here

As discussed above, are you sure this is correct now?

> start:
> whilelo p0.s, x1, xLimit
> b.last loop // The condition for the core loop to continue is that b.last is true
> b.first trailloop // If b.last is false and b.first is true, it means the last cycle
>
> uaddv d0, p1, zA.s
> uaddv d1, p1, zB.s
>
> mov x12, v0.2d[0]
> mov x13, v1.2d[0]

The "2" here seems not to be required by the syntax, although it's
harmless.

> add x10, x10, x12
> add x11, x11, x13
> add x11, x11, x2

If x10 and x11 are used as accumulators by the caller, I guess this works.

>
> mod65521 10, 14, 12
> mod65521 11, 14, 12
> lsl x11, x11, #16
> orr x0, x10, x11
> ret
> -->8--
>
> After this modification, The test results are correct when the data length is less than about 8 Kbyte,
> part A will still be correct after 8K, and an overflow error will occur in part B. This is because A
> only accumulates all the bytes, and the accumulative acceleration of B expands faster, because the
> accumulative formula of B is:
> B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
> = n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521)
>
> If we take the average value of Dx to 128 and n to 8192:
> B = (1 + 2 + ... + 8129) * 128 + 8192
> = 4,295,499,776 (32bit overflow)
>
> So I think the 32-bit accumulator is still not enough for part B here. :)
>
> --
> Best regards,
> Li Qiang

That makes sense. I hadn't tried to calculate the actual bound.

It may be worth trying this with 64-bit accumulators. This will
probably slow things down, but it depends on the relative computation /
memory throughput exhibited by the hardware.

I think the code can't be bulletproof without breaking the input into
chunks anyway, though.

Cheers
---Dave

2020-11-12 07:21:35

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.



在 2020/11/11 0:07, Dave Martin 写道:
>>>>> add zA.s, pP/m, zA.s, zX.s // zA.s += zX.s
>>>>>
>>>>> msb zX.s, pP/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
>>>>>
>>>>> movprfx zB, zA
>>>>> mad zB.s, pP/m, zV.s, zX.s // zB.s := zX.s + zA.s * V
>> I found the bug I encountered earlier, that is, the calculation of zB here
>> needs to use pA with all elements activated. The reason is the same as my
>> previous guess, because all elements of zA should be involved when calculating zB.
>> Because the original calculation formula is like this.
>>
>> For example:
>> In the last loop:
>> left byte is: 3 | 4 | \0 |
>> zA.s is: 100 | 200 | 100 | 200 (sum = 600)
>> pP.s is: 1 | 1 | 0 | 0 (Only activate the first 2 channels)
>>
>> At this time, if the calculation of zB only takes the first 2 active elements, the data
>> is incomplete, because according to the description of the original algorithm, zB is always
>> based on the sum of all the accumulated bytes.
> Yes, you're quite right here: zX is partial: only the elements pP are
> valid; but all elements of zA and zB are always valid. I was focusing
> too much on the handling of the partial input block.
>
>> Here we can simply change the prediction register used in the two sentences related to
>> zB to the one that is all true (it is pA in our code), like this:
>> msb zX.s, pA/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
> Are you sure about this? In a final partial block, the trailing
> elements of zX.s beyond pP will be leftover junk from the last
> iteration.

Yes, I have verified this code and it is correct. The reason is that if pP is used here,
the inactive elements of zB are dropped when zX is computed, which causes data loss. (I think
this is because zB is overwritten by the multiply-accumulate of zX, zA and zV via movprfx and
mad. Have I got that right?) :)

On the other hand, zX is loaded with the predicate register pP/z, so its inactive elements are 0.
Those inactive elements of zX do not affect the final result: for them, zB is simply copied
through into zX unchanged.

Then, in the next instruction, it is folded back into zB along with zX.
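
To make this concrete, here is a tiny scalar C model of one iteration of the lane arithmetic for a
final partial block. It reuses the example lane values from above; the zB lane values and the
choice of V = 4 are made up for illustration, and the per-lane updates follow the instruction
sequence being discussed (ld1b .../z, add with p0, msb/mad with the all-true p1):

#include <stdint.h>
#include <stdio.h>

#define V 4 /* pretend the vector holds four 32-bit elements */

int main(void)
{
        uint32_t zA[V] = { 100, 200, 100, 200 }; /* per-lane A sums so far     */
        uint32_t zB[V] = { 50, 60, 70, 80 };     /* per-lane B sums (made up)  */
        uint32_t zJ[V] = { 0, 1, 2, 3 };         /* index zJ.s, #0, #1         */
        int      pP[V] = { 1, 1, 0, 0 };         /* last block: 2 active lanes */
        uint8_t  in[2] = { 3, 4 };               /* the two trailing bytes     */
        uint32_t v = 2;                          /* cntp result for this block */
        uint32_t zX[V];
        int i;

        for (i = 0; i < V; i++)
                zX[i] = pP[i] ? in[i] : 0;       /* ld1b .../z zeroes inactive lanes */

        for (i = 0; i < V; i++) {
                if (pP[i])
                        zA[i] += zX[i];          /* add zA.s, p0/m, zA.s, zX.s */
                zX[i] = zB[i] - zX[i] * zJ[i];   /* msb with all-true p1: 0*j leaves zB intact */
                zB[i] = zX[i] + zA[i] * v;       /* mad with all-true p1: every lane adds v*zA */
        }

        for (i = 0; i < V; i++)
                printf("lane %d: zA=%u zB=%u\n", i, zA[i], zB[i]);
        return 0;
}

The inactive lanes keep their previous zB value (0 * zJ leaves it untouched) and still pick up
v*zA, which is exactly what the identity B[n+v] = Bn + v*A[n+v] - Sum [j=0 .. v-1] j*X[j+n]
requires across all lanes.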

>
> This might require a bit more thought in order to get the final block
> handling correct.
>
>> trailloop: // Last cycle entrance
>> cntp x6, p1, p0.s // Get active element count of last cycle
>> cpy zV.s, p1/m, w6 // Set zV to the actual value.
> Note that you can also write "mov" here, but I'm not sure which alias is
> preferred.
>> loop: // Core loop entrance
>> ld1b zX.s, p0/z, [x1]
>> incw x1
>>
>> add zA.s, p0/m, zA.s, zX.s // The calculation of zA still needs to use p0
>> msb zX.s, p1/m, zJ.s, zB.s // Change p register here
>> movprfx zB, zA
>> mad zB.s, p1/m, zV.s, zX.s // Change p register here
> As discussed above, are you sure this is correct now?
>
>> start:
>> whilelo p0.s, x1, xLimit
>> b.last loop // The condition for the core loop to continue is that b.last is true
>> b.first trailloop // If b.last is false and b.first is true, it means the last cycle
>>
>> uaddv d0, p1, zA.s
>> uaddv d1, p1, zB.s
>>
>> mov x12, v0.2d[0]
>> mov x13, v1.2d[0]
> The "2" here seems not to be required by the syntax, although it's
> harmless.

Yes I deleted it.

>
>> add x10, x10, x12
>> add x11, x11, x13
>> add x11, x11, x2
> If x10 and x11 are used as accmulators by the caller, I guess this works.

X10 and X11 are part A and part B of the initial value of adler32 passed in by the caller.

>
>> mod65521 10, 14, 12
>> mod65521 11, 14, 12
>> lsl x11, x11, #16
>> orr x0, x10, x11
>> ret
>> -->8--
>>
>> After this modification, The test results are correct when the data length is less than about 8 Kbyte,
>> part A will still be correct after 8K, and an overflow error will occur in part B. This is because A
>> only accumulates all the bytes, and the accumulative acceleration of B expands faster, because the
>> accumulative formula of B is:
>> B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
>> = n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521)
>>
>> If we take the average value of Dx to 128 and n to 8192:
>> B = (1 + 2 + ... + 8129) * 128 + 8192
>> = 4,295,499,776 (32bit overflow)
>>
>> So I think the 32-bit accumulator is still not enough for part B here. :)
>>
>> --
>> Best regards,
>> Li Qiang
> That makes sense. I hadn't tried to calculate the actual bound.
>
> It may be worth trying this with 64-bit accumulators. This will
> probably slow things down, but it depends on the relative computation /
> memory throughput exhibited by the hardware.

If 64-bit wide vector elements are used, wouldn't that waste vector resources in the common case
where the amount of data is not particularly large?

Maybe we could also try 16-bit wide vector elements for the loads and calculations, and accumulate
into the scalar registers before they can overflow, just like my original patch, but this time
using the ascending-index form so that the handling of the last loop is more elegant.

>
> I think the code can't be bulletproof without breaking the input into
> chunks anyway, though.

I don't quite understand what is meant here. Do you mean that the input bytes are read into the vector in
blocks for calculation (which is how it is done now), or that the intermediate results are stored
block-by-block in different elements of the vector during the calculation? :-)

>
> Cheers
> ---Dave
> .
--
Best regards,
Li Qiang

2020-11-12 11:18:32

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Thu, Nov 12, 2020 at 03:20:53PM +0800, Li Qiang wrote:
>
>
> 在 2020/11/11 0:07, Dave Martin 写道:
> >>>>> add zA.s, pP/m, zA.s, zX.s // zA.s += zX.s
> >>>>>
> >>>>> msb zX.s, pP/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
> >>>>>
> >>>>> movprfx zB, zA
> >>>>> mad zB.s, pP/m, zV.s, zX.s // zB.s := zX.s + zA.s * V
> >> I found the bug I encountered earlier, that is, the calculation of zB here
> >> needs to use pA with all elements activated. The reason is the same as my
> >> previous guess, because all elements of zA should be involved when calculating zB.
> >> Because the original calculation formula is like this.
> >>
> >> For example:
> >> In the last loop:
> >> left byte is: 3 | 4 | \0 |
> >> zA.s is: 100 | 200 | 100 | 200 (sum = 600)
> >> pP.s is: 1 | 1 | 0 | 0 (Only activate the first 2 channels)
> >>
> >> At this time, if the calculation of zB only takes the first 2 active elements, the data
> >> is incomplete, because according to the description of the original algorithm, zB is always
> >> based on the sum of all the accumulated bytes.
> > Yes, you're quite right here: zX is partial: only the elements pP are
> > valid; but all elements of zA and zB are always valid. I was focusing
> > too much on the handling of the partial input block.
> >
> >> Here we can simply change the prediction register used in the two sentences related to
> >> zB to the one that is all true (it is pA in our code), like this:
> >> msb zX.s, pA/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
> > Are you sure about this? In a final partial block, the trailing
> > elements of zX.s beyond pP will be leftover junk from the last
> > iteration.
>
> Yes, I have verified this code and it is correct. The reason is that if pP is used here,
> the inactive elements of zB will be ignored in zX, which will cause data loss.(I think it is

Yes, you're quite right. I was forgetting about the /z (zeroing)
semantics for the ld1b. This means that we get away with not
inactivating those elements in the msb instruction, since zeros
multiplied by the elements of zJ remain zero.

> because the zB data is covered by the multiplication and addition results of zX, zA, and zV
> using movprfx and mad. Have I got that right?) :)

Yes, I think so.

> On the other hand zX uses the prediction register pP/z when loading data, the value of the
> inactive element is 0, the inactive element in zX will not affect the final result, the inactive
> element in zB will be directly assigned to the inactive element of zX element.
>
> Then in the next instruction, it will be added to zB along with zX.

Yes.

This might be part of the reason why the architects decided that SVE
loads zero the inactive elements instead of leaving them unchanged.

>
> >
> > This might require a bit more thought in order to get the final block
> > handling correct.
> >
> >> trailloop: // Last cycle entrance
> >> cntp x6, p1, p0.s // Get active element count of last cycle
> >> cpy zV.s, p1/m, w6 // Set zV to the actual value.
> > Note that you can also write "mov" here, but I'm not sure which alias is
> > > preferred.
> >> loop: // Core loop entrance
> >> ld1b zX.s, p0/z, [x1]
> >> incw x1
> >>
> >> add zA.s, p0/m, zA.s, zX.s // The calculation of zA still needs to use p0
> >> msb zX.s, p1/m, zJ.s, zB.s // Change p register here
> >> movprfx zB, zA
> >> mad zB.s, p1/m, zV.s, zX.s // Change p register here
> > As discussed above, are you sure this is correct now?

I think we've agreed that it is correct.

Thinking about the code flow, I think the initialisation of zV is
slightly wrong: for very short data, the first block may be shorter than
VL.

I think this is easily solved by getting rid of the constant
initialisation for zV and removing the branch that skips the CNTP code
when entering the loop for the first time. That way, zV gets
initialised to the correct thing on entering the loop, irrespective of
whether we have a whole vector on the first iteration.

Note, to squeeze maximum performance out of this, you still probably
want to unroll the loop a few times so that you can schedule useful work
in between each load and the computations that depend on it.

> >> start:
> >> whilelo p0.s, x1, xLimit
> >> b.last loop // The condition for the core loop to continue is that b.last is true
> >> b.first trailloop // If b.last is false and b.first is true, it means the last cycle
> >>
> >> uaddv d0, p1, zA.s
> >> uaddv d1, p1, zB.s
> >>
> >> mov x12, v0.2d[0]
> >> mov x13, v1.2d[0]
> > The "2" here seems not to be required by the syntax, although it's
> > harmless.
>
> Yes I deleted it.
>
> >
> >> add x10, x10, x12
> >> add x11, x11, x13
> >> add x11, x11, x2
> > If x10 and x11 are used as accmulators by the caller, I guess this works.
>
> X10 and X11 are part A and part B of the initial value of adler32 passed in by the caller.

Right, that makes sense.

>
> >
> >> mod65521 10, 14, 12
> >> mod65521 11, 14, 12

Note, can you replace these with udiv?

While using mul might be slightly cheaper to achieve this, it makes the
code more complex and will have a negligible impact on the overall
cost...

So, does something like this work:

mov x3, #65521
udiv x4, x10, x3
udiv x5, x11, x3
msub x10, x4, x3, x10
msub x11, x5, x3, x11
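
For reference, a scalar C sketch of what each udiv/msub pair computes (x3 above corresponds to m here):

#include <stdint.h>
#include <stdio.h>

/* x - (x / m) * m is exactly x % m, which is what udiv + msub implement. */
static uint64_t mod_adler(uint64_t x)
{
        const uint64_t m = 65521;
        uint64_t q = x / m;     /* udiv */

        return x - q * m;       /* msub */
}

int main(void)
{
        printf("%llu\n", (unsigned long long)mod_adler(4295499776ULL));
        return 0;
}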

> >> lsl x11, x11, #16
> >> orr x0, x10, x11
> >> ret
> >> -->8--
> >>
> >> After this modification, The test results are correct when the data length is less than about 8 Kbyte,
> >> part A will still be correct after 8K, and an overflow error will occur in part B. This is because A
> >> only accumulates all the bytes, and the accumulative acceleration of B expands faster, because the
> >> accumulative formula of B is:
> >> B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
> >> = n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521)
> >>
> >> If we take the average value of Dx to 128 and n to 8192:
> >> B = (1 + 2 + ... + 8129) * 128 + 8192
> >> = 4,295,499,776 (32bit overflow)
> >>
> >> So I think the 32-bit accumulator is still not enough for part B here. :)
> >>
> >> --
> >> Best regards,
> >> Li Qiang
> > That makes sense. I hadn't tried to calculate the actual bound.
> >
> > It may be worth trying this with 64-bit accumulators. This will
> > probably slow things down, but it depends on the relative computation /
> > memory throughput exhibited by the hardware.
>
> If a 64-bit wide vector register is used, for most scenes where the amount of data is not particularly large,
> is it wasted more vector resources?

Yes :)

Depending on the algorithm, this might be a better tradeoff if it meant
that the data could be processed in larger chunks. I suspect that the
tradeoff is unfavourable in this particular case though -- but I haven't
tried it.

> Maybe we can also try to use 16-bit wide vector registers to load data and calculations,
> and accumulate them into the scalar register xn before overflow, just like my original patch,
> but I can try to use ascending order to change the processing of the last loop Be more elegant.

Perhaps. But I assumed that 16-bit elements would overflow much too
fast to be practical. Just multiplying zX by zJ once can produce
element values up to 0xfe01 if VL is 256 bytes.

Did you have some idea for how to make this work?

It may be possible to do 16-bit multiplies with 32-bit accumulation.
SVE2 has some NEON-style mixed-width multiply-accumulate instructions
that can achieve this sort of thing directly, but in SVE(1) I think
you would have to break the multiply-accumulates up. Say:

mul zX.h, pP/m, zX.h, zJ.h
sub zX.s, pP/m, zB.s, zX.s

whilelo pP.s should generate a predicate with odd .h elements inactive,
and ld1b zX.s, pP/z, ... will make sure those elements are all zeroed in
zX.h.

I'm not sure this would be faster than a single 32-bit msb, since
multiply-accumulates are likely to be heavily optimised in the hardware,
but you could try it.

> >
> > I think the code can't be bulletproof without breaking the input into
> > chunks anyway, though.
>
> I don't quite understand what it means here. Does it mean that the input bytes are read into the vector in
> blocks for calculation (this is how it is done now) or the intermediate results are stored in different elements
> of the vector in blocks during the calculation process? :-)

I mean, you limit the number of iterations of the core loop so that
overflow doesn't happen. You're already doing this; I just wanted to
make the point that 64-bit accumulators probably don't solve this
problem, even though it will take a very large number of iterations to
cause an overflow.

Unless Adler32 specifies the maximum length of the input data not to
exceed some value, it could go on forever -- so overflow becomes
inevitable.

Cheers
---Dave

2020-11-14 07:36:02

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.



在 2020/11/12 19:17, Dave Martin 写道:
> On Thu, Nov 12, 2020 at 03:20:53PM +0800, Li Qiang wrote:
>>
>>
>> 在 2020/11/11 0:07, Dave Martin 写道:
>>>>>>> add zA.s, pP/m, zA.s, zX.s // zA.s += zX.s
>>>>>>>
>>>>>>> msb zX.s, pP/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
>>>>>>>
>>>>>>> movprfx zB, zA
>>>>>>> mad zB.s, pP/m, zV.s, zX.s // zB.s := zX.s + zA.s * V
>>>> I found the bug I encountered earlier, that is, the calculation of zB here
>>>> needs to use pA with all elements activated. The reason is the same as my
>>>> previous guess, because all elements of zA should be involved when calculating zB.
>>>> Because the original calculation formula is like this.
>>>>
>>>> For example:
>>>> In the last loop:
>>>> left byte is: 3 | 4 | \0 |
>>>> zA.s is: 100 | 200 | 100 | 200 (sum = 600)
>>>> pP.s is: 1 | 1 | 0 | 0 (Only activate the first 2 channels)
>>>>
>>>> At this time, if the calculation of zB only takes the first 2 active elements, the data
>>>> is incomplete, because according to the description of the original algorithm, zB is always
>>>> based on the sum of all the accumulated bytes.
>>> Yes, you're quite right here: zX is partial: only the elements pP are
>>> valid; but all elements of zA and zB are always valid. I was focusing
>>> too much on the handling of the partial input block.
>>>
>>>> Here we can simply change the prediction register used in the two sentences related to
>>>> zB to the one that is all true (it is pA in our code), like this:
>>>> msb zX.s, pA/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
>>> Are you sure about this? In a final partial block, the trailing
>>> elements of zX.s beyond pP will be leftover junk from the last
>>> iteration.
>>
>> Yes, I have verified this code and it is correct. The reason is that if pP is used here,
>> the inactive elements of zB will be ignored in zX, which will cause data loss.(I think it is
>
> Yes, you're quite right. I was forgetting about the /z (zeroing)
> semantics for the ld1b. This means that we get away with not
> inactivating those elements in the msb instruction, since zeros
> multiplied by the elements of zJ remain zero.
>
>> because the zB data is covered by the multiplication and addition results of zX, zA, and zV
>> using movprfx and mad. Have I got that right?) :)
>
> Yes, I think so.
>
>> On the other hand zX uses the prediction register pP/z when loading data, the value of the
>> inactive element is 0, the inactive element in zX will not affect the final result, the inactive
>> element in zB will be directly assigned to the inactive element of zX element.
>>
>> Then in the next instruction, it will be added to zB along with zX.
>
> Yes.
>
> This might be part of the reason why the architects decided that SVE
> loads zero the inactive elements instead of leaving them unchanged.

I think so; it is a useful feature in this scenario.

>
>>
>>>
>>> This might require a bit more thought in order to get the final block
>>> handling correct.
>>>
>>>> trailloop: // Last cycle entrance
>>>> cntp x6, p1, p0.s // Get active element count of last cycle
>>>> cpy zV.s, p1/m, w6 // Set zV to the actual value.
>>> Note that you can also write "mov" here, but I'm not sure which alias is
>>> preferred.
>>>> loop: // Core loop entrance
>>>> ld1b zX.s, p0/z, [x1]
>>>> incw x1
>>>>
>>>> add zA.s, p0/m, zA.s, zX.s // The calculation of zA still needs to use p0
>>>> msb zX.s, p1/m, zJ.s, zB.s // Change p register here
>>>> movprfx zB, zA
>>>> mad zB.s, p1/m, zV.s, zX.s // Change p register here
>>> As discussed above, are you sure this is correct now?
>
> I think we've agreed that it is correct.
>
> Thinking about the code flow, I think the initialisation of zV is
> slightly wrong: for very short data, the first block may be shorter than
> VL.
>
> I think this is easily solved by getting rid of the constant
> initialisation for zV and removing the branch that skips the CNTP code
> when entering the loop for the first time. That way, zV gets
> initialised to the correct thing on entering the loop, irrespective of
> whether we have a whole vector on the first iteration.

The problem you are worried about is a valid concern, and it must be considered when
the loop is unrolled. However, the code in my email a few days ago did not have this
problem, and I have also tested cases with a length less than VL. :)

The reason is that when the length of the test case is less than VL, 'b.last' is
false and 'b.first' is true, and then it will still jump directly to trailloop to update zV.

b start
trailloop:
cntp x6, p1, p0.s
mov zV.s, p1/m, w6
loop:
...
start:
whilelo p0.s, x1, xLimit
b.last loop
b.first trailloop

>
> Note, to squeeze maximum performance out of this, you still probably
> want to unroll the loop a few times so that you can schedule useful work
> in between each load and the computations that depend on it.

Yes, I tried unrolling the loop on top of the previous code, but the result was
not very satisfactory (the performance was only about 2 times that of the C version),
so I reconsidered the implementation, based on the formula you derived earlier.

>
>>>> start:
>>>> whilelo p0.s, x1, xLimit
>>>> b.last loop // The condition for the core loop to continue is that b.last is true
>>>> b.first trailloop // If b.last is false and b.first is true, it means the last cycle
>>>>
>>>> uaddv d0, p1, zA.s
>>>> uaddv d1, p1, zB.s
>>>>
>>>> mov x12, v0.2d[0]
>>>> mov x13, v1.2d[0]
>>> The "2" here seems not to be required by the syntax, although it's
>>> harmless.
>>
>> Yes I deleted it.
>>
>>>
>>>> add x10, x10, x12
>>>> add x11, x11, x13
>>>> add x11, x11, x2
>>> If x10 and x11 are used as accmulators by the caller, I guess this works.
>>
>> X10 and X11 are part A and part B of the initial value of adler32 passed in by the caller.
>
> Right, that makes sense.
>
>>
>>>
>>>> mod65521 10, 14, 12
>>>> mod65521 11, 14, 12
>
> Note, can you replace these with udiv?
>
> While using mul might be slightly cheaper to achieve this, it makes the
> code more complex and will have a negligible impact on the overall
> cost...
>
> So, does something like this work:
>
> mov x3, #65521
> udiv x4, x10, x3
> udiv x5, x11, x3
> msub x10, x4, x3, x10
> msub x11, x5, x3, x11

Yes, the reason for doing it this way is that I initially performed the modulo reduction
inside the loop body, so that the data could never overflow, which is why I considered
optimising the division into a multiplication. If we only do a single modulo reduction
at the end, we don't need to implement it like that.

>
>>>> lsl x11, x11, #16
>>>> orr x0, x10, x11
>>>> ret
>>>> -->8--
>>>>
>>>> After this modification, The test results are correct when the data length is less than about 8 Kbyte,
>>>> part A will still be correct after 8K, and an overflow error will occur in part B. This is because A
>>>> only accumulates all the bytes, and the accumulative acceleration of B expands faster, because the
>>>> accumulative formula of B is:
>>>> B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
>>>> = n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521)
>>>>
>>>> If we take the average value of Dx to 128 and n to 8192:
>>>> B = (1 + 2 + ... + 8129) * 128 + 8192
>>>> = 4,295,499,776 (32bit overflow)
>>>>
>>>> So I think the 32-bit accumulator is still not enough for part B here. :)
>>>>
>>>> --
>>>> Best regards,
>>>> Li Qiang
>>> That makes sense. I hadn't tried to calculate the actual bound.
>>>
>>> It may be worth trying this with 64-bit accumulators. This will
>>> probably slow things down, but it depends on the relative computation /
>>> memory throughput exhibited by the hardware.
>>
>> If a 64-bit wide vector register is used, for most scenes where the amount of data is not particularly large,
>> is it wasted more vector resources?
>
> Yes :)
>
> Depending on the algorithm, this might be a better tradeoff if it meant
> that the data could be processed in larger chunks. I suspect that the
> tradeoff is unfavourable in this particluar case though -- but I haven't
> tried it.
>
>> Maybe we can also try to use 16-bit wide vector registers to load data and calculations,
>> and accumulate them into the scalar register xn before overflow, just like my original patch,
>> but I can try to use ascending order to change the processing of the last loop Be more elegant.
>
> Perhaps. But I assumed that 16-bit elements would overflow much too
> fast to be practical. Just multiplying zX by zJ once can produce
> element values up to 0xfe01 if VL is 256 bytes.
>
> Did you have some idea for how to make this work?
>
> It may be possible to do 16-bit multiplies with 32-bit accumulation.
> SVE2 has some NEON-style mixed-width multiply-accumulate instructions
> that can achieve this sort of thing directly, but in SVE(1) I think
> you would have to break the multiply-accumulates up. Say:
>
> mul zX.h, pP/m, zX.h, zJ.h
> sub zX.s, pP/m, zB.s, zX.s
>
> whilelo pP.s should generate a predicate with odd .h elements inactive,
> and ld1b zX.s, pP/z, ... will make sure those elements are all zeroed in
> xX.h.
>
> I'm not sure this would be faster than a single 32-bit msb, since
> multiply-accumulates are likely to be heavily optimised in the hardware,
> but you could try it.

I took part of the patch I originally submitted, combined it with the formula you gave for
calculating B[n+v] using an increasing sequence, and modified the code again. The code is at the end of this email.

>
>>>
>>> I think the code can't be bulletproof without breaking the input into
>>> chunks anyway, though.
>>
>> I don't quite understand what it means here. Does it mean that the input bytes are read into the vector in
>> blocks for calculation (this is how it is done now) or the intermediate results are stored in different elements
>> of the vector in blocks during the calculation process? :-)
>
> I mean, you limit the number of iterations of the core loop so that
> overflow doesn't happen. You're already doing this; I just wanted to
> make the point that 64-bit accumulators probably don't solve this
> problem, even though it will take a very large number of iterations to
> cause an overflow.
>
> Unless Adler32 specifies the maximum length of the input data not to
> exceed some value, it could go on forever -- so overflow becomes
> inevitable.

Yes, I understand that if we do not perform the modulo reduction in time, the data will overflow sooner or later.
That is why my initial patch added the modulo reduction to the loop body and then optimised it. :)

As you said, if we don't want to perform the modulo reduction in the loop body, then we need to split the input
into blocks and perform the reduction often enough that the data can never overflow (we need to consider the
worst case, that is, all bytes being 0xff).

--8<--
...
adler_A .req x10
adler_B .req x11

.macro adler32_core
ld1b zX.h, p0/z, [x1] // load bytes
inch x1

uaddv d0, p0, zX.h
mul zX.h, p0/m, zX.h, zJ.h // Sum [j=0 .. v-1] j*X[j+n]
mov x9, v0.d[0]
uaddv d1, p0, zX.h
add adler_A, adler_A, x9 // A[n+v] = An + Sum [j=0 ... v-1] X[j]
mov x9, v1.d[0]
madd adler_B, x7, adler_A, adler_B // Bn + v*A[n+v]
sub adler_B, adler_B, x9 // B[n+v] = Bn + v*A[n+v] - Sum [j=0 .. v-1] j*X[j+n]
.endm

.macro adler32_loop
whilelo p0.h, x1, xLimit
cntp x7, p0, p0.h // x7 is used to store 'v'
adler32_core
.endm

ENTRY(XXX)
...
add xLimit, x1, x2

loop:
adler32_loop
adler32_loop
adler32_loop
adler32_loop
cmp x7, #0
b.ne loop

mov x3, #65521
udiv x4, x10, x3
udiv x5, x11, x3
msub x10, x4, x3, x10
msub x11, x5, x3, x11
...
-->8--


I tested 500 Mbyte of random data and the result is completely correct; I have not tested longer inputs.
I think we can also wrap a layer of block-control code around this: once the amount of data exceeds a
certain threshold, the control code splits the input into threshold-sized segments, and the initial
adler32 of each subsequent segment is the result of the previous segment's calculation.

Like:
adler32 = 1;
if (len <= MAX_BLOCK_LEN) {
        adler32 = adler32_sve(adler32, buf, len);
} else {
        int i = 0;

        while (i < len) {
                int block_len = (len - i > MAX_BLOCK_LEN) ? MAX_BLOCK_LEN : (len - i);

                adler32 = adler32_sve(adler32, &buf[i], block_len);
                i += block_len;
        }
}
return adler32;

In this way, the core algorithm code doesn't have to worry about when the data overflows or when to apply the modulo reduction.
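
As a rough guide for picking MAX_BLOCK_LEN, the following sketch searches for the largest block
length whose B sum can never overflow 32 bits under the worst case mentioned above (all bytes
0xff, with A and B both starting just below the modulus); the formula in the comment is an
assumption derived from the definitions earlier in the thread:

#include <stdint.h>
#include <stdio.h>

/*
 * Find the largest block length n for which B cannot overflow 32 bits,
 * assuming the worst case: every byte is 0xff and A and B both start
 * at 65520 (just below the modulus 65521).  After n such bytes:
 *   A <= 65520 + 255*n
 *   B <= 65520 + n*65520 + 255*n*(n+1)/2
 */
int main(void)
{
        uint64_t n = 0;

        while (65520 + (n + 1) * 65520 + 255 * (n + 1) * (n + 2) / 2 <= UINT32_MAX)
                n++;

        printf("largest safe block length: %llu bytes\n", (unsigned long long)n);
        return 0;
}

If I remember correctly, this gives 5552, which matches the NMAX constant that zlib uses for the
same purpose.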

--
Best regards,
Li Qiang

2020-11-16 15:59:41

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

On Sat, Nov 14, 2020 at 03:31:56PM +0800, Li Qiang wrote:
>
>
> 在 2020/11/12 19:17, Dave Martin 写道:
> > On Thu, Nov 12, 2020 at 03:20:53PM +0800, Li Qiang wrote:
> >>
> >>
> >> 在 2020/11/11 0:07, Dave Martin 写道:
> >>>>>>> add zA.s, pP/m, zA.s, zX.s // zA.s += zX.s
> >>>>>>>
> >>>>>>> msb zX.s, pP/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
> >>>>>>>
> >>>>>>> movprfx zB, zA
> >>>>>>> mad zB.s, pP/m, zV.s, zX.s // zB.s := zX.s + zA.s * V
> >>>> I found the bug I encountered earlier, that is, the calculation of zB here
> >>>> needs to use pA with all elements activated. The reason is the same as my
> >>>> previous guess, because all elements of zA should be involved when calculating zB.
> >>>> Because the original calculation formula is like this.
> >>>>
> >>>> For example:
> >>>> In the last loop:
> >>>> left byte is: 3 | 4 | \0 |
> >>>> zA.s is: 100 | 200 | 100 | 200 (sum = 600)
> >>>> pP.s is: 1 | 1 | 0 | 0 (Only activate the first 2 channels)
> >>>>
> >>>> At this time, if the calculation of zB only takes the first 2 active elements, the data
> >>>> is incomplete, because according to the description of the original algorithm, zB is always
> >>>> based on the sum of all the accumulated bytes.
> >>> Yes, you're quite right here: zX is partial: only the elements pP are
> >>> valid; but all elements of zA and zB are always valid. I was focusing
> >>> too much on the handling of the partial input block.
> >>>
> >>>> Here we can simply change the prediction register used in the two sentences related to
> >>>> zB to the one that is all true (it is pA in our code), like this:
> >>>> msb zX.s, pA/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s
> >>> Are you sure about this? In a final partial block, the trailing
> >>> elements of zX.s beyond pP will be leftover junk from the last
> >>> iteration.
> >>
> >> Yes, I have verified this code and it is correct. The reason is that if pP is used here,
> >> the inactive elements of zB will be ignored in zX, which will cause data loss.(I think it is
> >
> > Yes, you're quite right. I was forgetting about the /z (zeroing)
> > semantics for the ld1b. This means that we get away with not
> > inactivating those elements in the msb instruction, since zeros
> > multiplied by the elements of zJ remain zero.
> >
> >> because the zB data is covered by the multiplication and addition results of zX, zA, and zV
> >> using movprfx and mad. Have I got that right?) :)
> >
> > Yes, I think so.
> >
> >> On the other hand zX uses the prediction register pP/z when loading data, the value of the
> >> inactive element is 0, the inactive element in zX will not affect the final result, the inactive
> >> element in zB will be directly assigned to the inactive element of zX element.
> >>
> >> Then in the next instruction, it will be added to zB along with zX.
> >
> > Yes.
> >
> > This might be part of the reason why the architects decided that SVE
> > loads zero the inactive elements instead of leaving them unchanged.
>
> I think so, it is a useful feature in this scenario.
>
> >
> >>
> >>>
> >>> This might require a bit more thought in order to get the final block
> >>> handling correct.
> >>>
> >>>> trailloop: // Last cycle entrance
> >>>> cntp x6, p1, p0.s // Get active element count of last cycle
> >>>> cpy zV.s, p1/m, w6 // Set zV to the actual value.
> >>> Note that you can also write "mov" here, but I'm not sure which alias is
> >>> preferred.
> >>>> loop: // Core loop entrance
> >>>> ld1b zX.s, p0/z, [x1]
> >>>> incw x1
> >>>>
> >>>> add zA.s, p0/m, zA.s, zX.s // The calculation of zA still needs to use p0
> >>>> msb zX.s, p1/m, zJ.s, zB.s // Change p register here
> >>>> movprfx zB, zA
> >>>> mad zB.s, p1/m, zV.s, zX.s // Change p register here
> >>> As discussed above, are you sure this is correct now?
> >
> > I think we've agreed that it is correct.
> >
> > Thinking about the code flow, I think the initialisation of zV is
> > slightly wrong: for very short data, the first block may be shorter than
> > VL.
> >
> > I think this is easily solved by getting rid of the constant
> > initialisation for zV and removing the branch that skips the CNTP code
> > when entering the loop for the first time. That way, zV gets
> > initialised to the correct thing on entering the loop, irrespective of
> > whether we have a whole vector on the first iteration.
>
> The problem you are worried about is valuable. This problem must be considered when
> the loop is unrolled. However, the code in my email a few days ago did not have this
> problem, and I have also tested use cases with a length less than VL. :)
>
> The reason is that when the length of the test case is less than VL, 'b.last' is
> false and 'b.first' is true, and then it will still jump directly to trailloop to update zV.
>
> b start
> trailloop:
> cntp x6, p1, p0.s
> mov zV.s, p1/m, w6
> loop:
> ...
> start:
> whilelo p0.s, x1, xLimit
> b.last loop
> b.first trailloop

Yes, that makes sense. I rotated the loop so that it didn't need a
separate entry point, but the logic is similar otherwise.

> >
> > Note, to squeeze maximum performance out of this, you still probably
> > want to unroll the loop a few times so that you can schedule useful work
> > in between each load and the computations that depend on it.
>
> Yes, I tried to expand the loop on the basis of the previous code, but the effect was
> not very satisfactory(the performance is only about 2 times that of the C version),
> so I reconsidered the way of implementation, based on the formula you derived earlier.
>
> >
> >>>> start:
> >>>> whilelo p0.s, x1, xLimit
> >>>> b.last loop // The condition for the core loop to continue is that b.last is true
> >>>> b.first trailloop // If b.last is false and b.first is true, it means the last cycle
> >>>>
> >>>> uaddv d0, p1, zA.s
> >>>> uaddv d1, p1, zB.s
> >>>>
> >>>> mov x12, v0.2d[0]
> >>>> mov x13, v1.2d[0]
> >>> The "2" here seems not to be required by the syntax, although it's
> >>> harmless.
> >>
> >> Yes I deleted it.
> >>
> >>>
> >>>> add x10, x10, x12
> >>>> add x11, x11, x13
> >>>> add x11, x11, x2
> >>> If x10 and x11 are used as accmulators by the caller, I guess this works.
> >>
> >> X10 and X11 are part A and part B of the initial value of adler32 passed in by the caller.
> >
> > Right, that makes sense.
> >
> >>
> >>>
> >>>> mod65521 10, 14, 12
> >>>> mod65521 11, 14, 12
> >
> > Note, can you replace these with udiv?
> >
> > While using mul might be slightly cheaper to achieve this, it makes the
> > code more complex and will have a negligible impact on the overall
> > cost...
> >
> > So, does something like this work:
> >
> > mov x3, #65521
> > udiv x4, x10, x3
> > udiv x5, x11, x3
> > msub x10, x4, x3, x10
> > msub x11, x5, x3, x11
>
> Yes, the reason for doing this here is that I initially performed modulo division
> in the loop body, so that we can never overflow the data, so I considered optimizing
> the division to multiplication.If we only do a modular division at the end, we don't
> need to implement the code like this.
>
> >
> >>>> lsl x11, x11, #16
> >>>> orr x0, x10, x11
> >>>> ret
> >>>> -->8--
> >>>>
> >>>> After this modification, The test results are correct when the data length is less than about 8 Kbyte,
> >>>> part A will still be correct after 8K, and an overflow error will occur in part B. This is because A
> >>>> only accumulates all the bytes, and the accumulative acceleration of B expands faster, because the
> >>>> accumulative formula of B is:
> >>>> B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
> >>>> = n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521)
> >>>>
> >>>> If we take the average value of Dx to 128 and n to 8192:
> >>>> B = (1 + 2 + ... + 8129) * 128 + 8192
> >>>> = 4,295,499,776 (32bit overflow)
> >>>>
> >>>> So I think the 32-bit accumulator is still not enough for part B here. :)
> >>>>
> >>>> --
> >>>> Best regards,
> >>>> Li Qiang
> >>> That makes sense. I hadn't tried to calculate the actual bound.
> >>>
> >>> It may be worth trying this with 64-bit accumulators. This will
> >>> probably slow things down, but it depends on the relative computation /
> >>> memory throughput exhibited by the hardware.
> >>
> >> If a 64-bit wide vector register is used, for most scenes where the amount of data is not particularly large,
> >> is it wasted more vector resources?
> >
> > Yes :)
> >
> > Depending on the algorithm, this might be a better tradeoff if it meant
> > that the data could be processed in larger chunks. I suspect that the
> > tradeoff is unfavourable in this particluar case though -- but I haven't
> > tried it.
> >
> >> Maybe we can also try to use 16-bit wide vector registers to load data and calculations,
> >> and accumulate them into the scalar register xn before overflow, just like my original patch,
> >> but I can try to use ascending order to change the processing of the last loop Be more elegant.
> >
> > Perhaps. But I assumed that 16-bit elements would overflow much too
> > fast to be practical. Just multiplying zX by zJ once can produce
> > element values up to 0xfe01 if VL is 256 bytes.
> >
> > Did you have some idea for how to make this work?
> >
> > It may be possible to do 16-bit multiplies with 32-bit accumulation.
> > SVE2 has some NEON-style mixed-width multiply-accumulate instructions
> > that can achieve this sort of thing directly, but in SVE(1) I think
> > you would have to break the multiply-accumulates up. Say:
> >
> > mul zX.h, pP/m, zX.h, zJ.h
> > sub zX.s, pP/m, zB.s, zX.s
> >
> > whilelo pP.s should generate a predicate with odd .h elements inactive,
> > and ld1b zX.s, pP/z, ... will make sure those elements are all zeroed in
> > xX.h.
> >
> > I'm not sure this would be faster than a single 32-bit msb, since
> > multiply-accumulates are likely to be heavily optimised in the hardware,
> > but you could try it.
>
> I adopted a part of the patch I originally submitted, combined with the formula you gave
> to calculate B[n+v] using an increasing sequence, and then modify the code again. The code is at the end.
>
> >
> >>>
> >>> I think the code can't be bulletproof without breaking the input into
> >>> chunks anyway, though.
> >>
> >> I don't quite understand what it means here. Does it mean that the input bytes are read into the vector in
> >> blocks for calculation (this is how it is done now) or the intermediate results are stored in different elements
> >> of the vector in blocks during the calculation process? :-)
> >
> > I mean, you limit the number of iterations of the core loop so that
> > overflow doesn't happen. You're already doing this; I just wanted to
> > make the point that 64-bit accumulators probably don't solve this
> > problem, even though it will take a very large number of iterations to
> > cause an overflow.
> >
> > Unless Adler32 specifies the maximum length of the input data not to
> > exceed some value, it could go on forever -- so overflow becomes
> > inevitable.
>
> Yes, I understand that if we do not perform the modulo division in time, data overflow will happen sooner or later.
> So my initial patch added modulo division to the loop body and optimized it.:)
>
> As you said, if we don’t want to perform modular division in the loop body, then we need to block the input and
> perform the modular division in time before the data will never overflow (Need to consider the worst case,
> that is, all data is 0xff).
>
> --8<--
> ...
> adler_A .req x10
> adler_B .req x11
>
> .macro adler32_core
> ld1b zX.h, p0/z, [x1] // load bytes
> inch x1
>
> uaddv d0, p0, zX.h
> mul zX.h, p0/m, zX.h, zJ.h // Sum [j=0 .. v-1] j*X[j+n]
> mov x9, v0.d[0]
> uaddv d1, p0, zX.h
> add adler_A, adler_A, x9 // A[n+v] = An + Sum [j=0 ... v-1] X[j]
> mov x9, v1.d[0]
> madd adler_B, x7, adler_A, adler_B // Bn + v*A[n+v]
> sub adler_B, adler_B, x9 // B[n+v] = Bn + v*A[n+v] - Sum [j=0 .. v-1] j*X[j+n]
> .endm

If this has best performance, I find that quite surprising. Those uaddv
instructions will stop the vector lanes flowing independently inside the
loop, so if an individual element load is slow arriving then everything
will have to wait.

A decent hardware prefetcher may tend to hide that issue for sequential
memory access, though: i.e., if the hardware does a decent job of
fetching data before the actual loads are issued, the data may appear to
arrive with minimal delay.

The effect might be a lot worse for algorithms that have less
predictable memory access patterns.

Possibly you do win some additional performance due to processing twice
as many elements at once, here.

>
> .macro adler32_loop
> whilelo p0.h, x1, xLimit
> cntp x7, p0, p0.h // x7 is used to store 'v'
> adler32_core
> .endm
>
> ENTRY(XXX)
> ...
> add xLimit, x1, x2
>
> loop:
> adler32_loop
> adler32_loop
> adler32_loop
> adler32_loop
> cmp x7, #0
> b.ne loop
>
> mov x3, #65521
> udiv x4, x10, x3
> udiv x5, x11, x3
> msub x10, x4, x3, x10
> msub x11, x5, x3, x11
> ...
> -->8--
>
>
> I tested 500Mbyte random data, the result is completely correct, longer data volume did not test,
> I think we can also wrap a layer of data block control code, that is, after the data volume
> exceeds a certain threshold, the control code The input data is segmented by threshold, and
> the initial adler32 of the subsequent segment is the result of the previous segment calculation.
>
> Like:
> adler32 = 1;
> if (len <= MAX_BLOCK_LEN)
> adler32 = adler32_sve(adler32, buf, len);
> else {
> int i = 0;
> while (i < len) {
> int block_len = (len - i > MAX_BLOCK_LEN) ? MAX_BLOCK_LEN : (len - i);
> adler32 = adler32_sve(adler32, &buf[i], block_len);
> i += block_len;
> }
> }
> return adler32;
>
> In this way, we don't have to worry about when the data overflows and when to modulo in the core algorithm code.

Yes, something like that ought to work.

Cheers
---Dave

2020-11-17 12:47:17

by Li Qiang

[permalink] [raw]
Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions.



在 2020/11/16 23:56, Dave Martin 写道:
>> --8<--
>> ...
>> adler_A .req x10
>> adler_B .req x11
>>
>> .macro adler32_core
>> ld1b zX.h, p0/z, [x1] // load bytes
>> inch x1
>>
>> uaddv d0, p0, zX.h
>> mul zX.h, p0/m, zX.h, zJ.h // Sum [j=0 .. v-1] j*X[j+n]
>> mov x9, v0.d[0]
>> uaddv d1, p0, zX.h
>> add adler_A, adler_A, x9 // A[n+v] = An + Sum [j=0 ... v-1] X[j]
>> mov x9, v1.d[0]
>> madd adler_B, x7, adler_A, adler_B // Bn + v*A[n+v]
>> sub adler_B, adler_B, x9 // B[n+v] = Bn + v*A[n+v] - Sum [j=0 .. v-1] j*X[j+n]
>> .endm
> If this has best performance, I find that quite surprising. Those uaddv
> instructions will stop the vector lanes flowing independently inside the
> loop, so if an individual element load is slow arriving then everything
> will have to wait.

I don't know much about this issue; do you mean that the uaddv instructions used inside
the loop have a large impact on performance?

>
> A decent hardware prefetcher may tend to hide that issue for sequential
> memory access, though: i.e., if the hardware does a decent job of
> fetching data before the actual loads are issued, the data may appear to
> arrive with minimal delay.
>
> The effect might be a lot worse for algorithms that have less
> predictable memory access patterns.
>
> Possibly you do win some additional performance due to processing twice
> as many elements at once, here.

I think so. Compared with loading the bytes into zX.h, if they are loaded directly into zX.b
and then widened with uunpklo/uunpkhi, the performance is better still (about 20% faster).
This may be the reason.

--
Best regards,
Li Qiang