2020-12-17 22:29:08

by Eric Biggers

Subject: [PATCH v2 00/11] crypto: arm32-optimized BLAKE2b and BLAKE2s

This patchset adds 32-bit ARM assembly language implementations of
BLAKE2b and BLAKE2s.

The BLAKE2b implementation is NEON-accelerated, while the BLAKE2s
implementation uses scalar instructions since NEON doesn't work very
well for it. The BLAKE2b implementation is the faster of the two and is
expected to be useful as a replacement for SHA-1 in dm-verity, while the
BLAKE2s implementation would be useful for WireGuard, which uses BLAKE2s.

Both implementations are provided via the shash API, while BLAKE2s is
also provided via the library API.
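
As background (not part of the patchset), kernel code can reach these
algorithms through either interface; a minimal sketch of both paths,
assuming the usual <crypto/hash.h> shash helpers such as
crypto_shash_tfm_digest() and the blake2s() library call from
<crypto/blake2s.h>, might look like this (the example function names
below are made up for illustration):

#include <crypto/hash.h>	/* crypto_shash ("shash") API */
#include <crypto/blake2s.h>	/* BLAKE2s library API */
#include <linux/err.h>

/* shash path: "blake2s-256" resolves to the highest-priority provider,
 * e.g. "blake2s-256-arm" once the module from this series is loaded. */
static int blake2s_shash_example(const u8 *data, unsigned int len, u8 *digest)
{
	struct crypto_shash *tfm = crypto_alloc_shash("blake2s-256", 0, 0);
	int err;

	if (IS_ERR(tfm))
		return PTR_ERR(tfm);
	err = crypto_shash_tfm_digest(tfm, data, len, digest);
	crypto_free_shash(tfm);
	return err;
}

/* library path: no tfm allocation; unkeyed 256-bit digest. */
static void blake2s_lib_example(const u8 *data, size_t len,
				u8 digest[BLAKE2S_256_HASH_SIZE])
{
	blake2s(digest, data, NULL, BLAKE2S_256_HASH_SIZE, len, 0);
}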

While adding these, I also reworked the generic implementations of
BLAKE2b and BLAKE2s to provide helper functions that make implementing
other "shash" providers for these algorithms much easier.

See the individual commits for full details, including benchmarks.

This patchset was tested on a Raspberry Pi 2 (which uses a Cortex-A7
processor) with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y, plus other tests.

This patchset applies to mainline commit 0c6c887835b5.

Changed since v1:
- Added BLAKE2s implementation.
- Adjusted the BLAKE2b helper functions to be consistent with what I
decided to do for BLAKE2s.
- Fixed build error in blake2b-neon-core.S in some configurations.

Eric Biggers (11):
crypto: blake2b - rename constants for consistency with blake2s
crypto: blake2b - define shash_alg structs using macros
crypto: blake2b - export helpers for optimized implementations
crypto: blake2b - update file comment
crypto: arm/blake2b - add NEON-accelerated BLAKE2b
crypto: blake2s - define shash_alg structs using macros
crypto: x86/blake2s - define shash_alg structs using macros
crypto: blake2s - remove unneeded includes
crypto: blake2s - share the "shash" API boilerplate code
crypto: arm/blake2s - add ARM scalar optimized BLAKE2s
wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM

arch/arm/crypto/Kconfig | 20 ++
arch/arm/crypto/Makefile | 4 +
arch/arm/crypto/blake2b-neon-core.S | 345 ++++++++++++++++++++++++++++
arch/arm/crypto/blake2b-neon-glue.c | 105 +++++++++
arch/arm/crypto/blake2s-core.S | 272 ++++++++++++++++++++++
arch/arm/crypto/blake2s-glue.c | 78 +++++++
arch/x86/crypto/blake2s-glue.c | 150 +++---------
crypto/Kconfig | 5 +
crypto/Makefile | 1 +
crypto/blake2b_generic.c | 200 +++++++---------
crypto/blake2s_generic.c | 161 +++----------
crypto/blake2s_helpers.c | 87 +++++++
drivers/net/Kconfig | 1 +
include/crypto/blake2b.h | 27 +++
include/crypto/internal/blake2b.h | 33 +++
include/crypto/internal/blake2s.h | 17 ++
16 files changed, 1139 insertions(+), 367 deletions(-)
create mode 100644 arch/arm/crypto/blake2b-neon-core.S
create mode 100644 arch/arm/crypto/blake2b-neon-glue.c
create mode 100644 arch/arm/crypto/blake2s-core.S
create mode 100644 arch/arm/crypto/blake2s-glue.c
create mode 100644 crypto/blake2s_helpers.c
create mode 100644 include/crypto/blake2b.h
create mode 100644 include/crypto/internal/blake2b.h


base-commit: 0c6c887835b59c10602add88057c9c06f265effe
--
2.29.2


2020-12-17 22:29:17

by Eric Biggers

Subject: [PATCH v2 10/11] crypto: arm/blake2s - add ARM scalar optimized BLAKE2s

From: Eric Biggers <[email protected]>

Add an ARM scalar optimized implementation of BLAKE2s.

NEON isn't very useful for BLAKE2s because the BLAKE2s block size is too
small for NEON to help. Each NEON instruction would depend on the
previous one, resulting in poor performance.

With scalar instructions, on the other hand, we can take advantage of
ARM's "free" rotations (like I did in chacha-scalar-core.S) to get an
implementation that runs much faster than the C implementation.
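
(For reference, one BLAKE2s quarter-round in plain C looks roughly like
the sketch below; the rotation amounts 16, 12, 8, and 7 are the ones
the assembly defers and folds into the operands of the following add
and eor instructions rather than spending separate rotate instructions.)

#include <linux/bitops.h>	/* ror32() */
#include <linux/types.h>

/* The BLAKE2s G function, written out with explicit rotates. */
static void blake2s_g(u32 *a, u32 *b, u32 *c, u32 *d, u32 m0, u32 m1)
{
	*a += *b + m0;
	*d = ror32(*d ^ *a, 16);
	*c += *d;
	*b = ror32(*b ^ *c, 12);
	*a += *b + m1;
	*d = ror32(*d ^ *a, 8);
	*c += *d;
	*b = ror32(*b ^ *c, 7);
}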

Performance results on Cortex-A7 in cycles per byte using the shash API:

4096-byte messages:
blake2s-256-arm: 18.8
blake2s-256-generic: 26.1

500-byte messages:
blake2s-256-arm: 20.7
blake2s-256-generic: 28.3

100-byte messages:
blake2s-256-arm: 32.3
blake2s-256-generic: 42.4

32-byte messages:
blake2s-256-arm: 55.8
blake2s-256-generic: 72.1

Except on very short messages, this is still slower than the NEON
implementation of BLAKE2b which I've written; that is 14.0, 16.7, 26.2,
and 80.6 cpb on 4096, 500, 100, and 32-byte messages, respectively.
However, optimized BLAKE2s is useful for cases where BLAKE2s is used
instead of BLAKE2b, such as WireGuard.

This new implementation is added in the form of a new module
blake2s-arm.ko, which is analogous to blake2s-x86_64.ko in that it
provides blake2s_compress_arch() for use by the library API and also
optionally registers the algorithms with the shash API.
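
Roughly, the library side then dispatches to the optimized compression
function as in the simplified sketch below (not the exact lib/crypto
code, just the idea):

#include <crypto/internal/blake2s.h>

static void blake2s_compress(struct blake2s_state *state, const u8 *block,
			     size_t nblocks, u32 inc)
{
	/* Use the architecture-optimized compression function when the
	 * architecture declares one via CRYPTO_ARCH_HAVE_LIB_BLAKE2S. */
	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
		blake2s_compress_arch(state, block, nblocks, inc);
	else
		blake2s_compress_generic(state, block, nblocks, inc);
}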

Signed-off-by: Eric Biggers <[email protected]>
---
arch/arm/crypto/Kconfig | 10 ++
arch/arm/crypto/Makefile | 2 +
arch/arm/crypto/blake2s-core.S | 272 +++++++++++++++++++++++++++++++++
arch/arm/crypto/blake2s-glue.c | 78 ++++++++++
4 files changed, 362 insertions(+)
create mode 100644 arch/arm/crypto/blake2s-core.S
create mode 100644 arch/arm/crypto/blake2s-glue.c

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index f6a14c186b4ec..e87e3bc9b6d5e 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -72,6 +72,16 @@ config CRYPTO_BLAKE2B_NEON
Crypto Extensions, typically this BLAKE2b implementation is
much faster than SHA-2 and slightly faster than SHA-1.

+config CRYPTO_BLAKE2S_ARM
+ tristate "BLAKE2s digest algorithm (ARM)"
+ select CRYPTO_ARCH_HAVE_LIB_BLAKE2S
+ select CRYPTO_BLAKE2S_HELPERS if CRYPTO_HASH
+ help
+ BLAKE2s digest algorithm optimized with ARM scalar instructions. This
+ is faster than the generic implementations of BLAKE2s and BLAKE2b, but
+ slower than the NEON implementation of BLAKE2b. (There is no NEON
+ implementation of BLAKE2s, since NEON doesn't really help with it.)
+
config CRYPTO_AES_ARM
tristate "Scalar AES cipher for ARM"
select CRYPTO_ALGAPI
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index ab835ceeb4f2e..fcd80f660b762 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
obj-$(CONFIG_CRYPTO_BLAKE2B_NEON) += blake2b-neon.o
+obj-$(CONFIG_CRYPTO_BLAKE2S_ARM) += blake2s-arm.o
obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
@@ -31,6 +32,7 @@ sha256-arm-y := sha256-core.o sha256_glue.o $(sha256-arm-neon-y)
sha512-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha512-neon-glue.o
sha512-arm-y := sha512-core.o sha512-glue.o $(sha512-arm-neon-y)
blake2b-neon-y := blake2b-neon-core.o blake2b-neon-glue.o
+blake2s-arm-y := blake2s-core.o blake2s-glue.o
sha1-arm-ce-y := sha1-ce-core.o sha1-ce-glue.o
sha2-arm-ce-y := sha2-ce-core.o sha2-ce-glue.o
aes-arm-ce-y := aes-ce-core.o aes-ce-glue.o
diff --git a/arch/arm/crypto/blake2s-core.S b/arch/arm/crypto/blake2s-core.S
new file mode 100644
index 0000000000000..88aa1875911d8
--- /dev/null
+++ b/arch/arm/crypto/blake2s-core.S
@@ -0,0 +1,272 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * BLAKE2s digest algorithm, ARM scalar implementation
+ *
+ * Copyright 2020 Google LLC
+ *
+ * Author: Eric Biggers <[email protected]>
+ */
+
+#include <linux/linkage.h>
+
+ // Registers used to hold message words temporarily. There aren't
+ // enough ARM registers to hold the whole message block, so we have to
+ // load the words on-demand.
+ M_0 .req r12
+ M_1 .req r14
+
+// The BLAKE2s initialization vector
+.Lblake2s_IV:
+ .word 0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A
+ .word 0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+
+.macro __ldrd a, b, src, offset
+#if __LINUX_ARM_ARCH__ >= 6
+ ldrd \a, \b, [\src, #\offset]
+#else
+ ldr \a, [\src, #\offset]
+ ldr \b, [\src, #\offset + 4]
+#endif
+.endm
+
+.macro __strd a, b, dst, offset
+#if __LINUX_ARM_ARCH__ >= 6
+ strd \a, \b, [\dst, #\offset]
+#else
+ str \a, [\dst, #\offset]
+ str \b, [\dst, #\offset + 4]
+#endif
+.endm
+
+.macro _blake2s_quarterround a0, b0, c0, d0, a1, b1, c1, d1, s0, s1, s2, s3
+
+ ldr M_0, [sp, #32 + 4 * \s0]
+ ldr M_1, [sp, #32 + 4 * \s2]
+
+ // a += b + m[blake2s_sigma[r][2 * i + 0]];
+ add \a0, \a0, \b0, ror #brot
+ add \a1, \a1, \b1, ror #brot
+ add \a0, \a0, M_0
+ add \a1, \a1, M_1
+
+ // d = ror32(d ^ a, 16);
+ eor \d0, \a0, \d0, ror #drot
+ eor \d1, \a1, \d1, ror #drot
+
+ // c += d;
+ add \c0, \c0, \d0, ror #16
+ add \c1, \c1, \d1, ror #16
+
+ // b = ror32(b ^ c, 12);
+ eor \b0, \c0, \b0, ror #brot
+ eor \b1, \c1, \b1, ror #brot
+
+ ldr M_0, [sp, #32 + 4 * \s1]
+ ldr M_1, [sp, #32 + 4 * \s3]
+
+ // a += b + m[blake2s_sigma[r][2 * i + 1]];
+ add \a0, \a0, \b0, ror #12
+ add \a1, \a1, \b1, ror #12
+ add \a0, \a0, M_0
+ add \a1, \a1, M_1
+
+ // d = ror32(d ^ a, 8);
+ eor \d0, \a0, \d0, ror#16
+ eor \d1, \a1, \d1, ror#16
+
+ // c += d;
+ add \c0, \c0, \d0, ror#8
+ add \c1, \c1, \d1, ror#8
+
+ // b = ror32(b ^ c, 7);
+ eor \b0, \c0, \b0, ror#12
+ eor \b1, \c1, \b1, ror#12
+.endm
+
+// Execute one round of BLAKE2s by updating the state matrix v[0..15]. v[0..9]
+// are in r0..r9. The stack pointer points to 8 bytes of scratch space for
+// spilling v[8..9], then to v[9..15], then to the message block. r10-r12 and
+// r14 are free to use. The macro arguments s0-s15 give the order in which the
+// message words are used in this round.
+//
+// All rotates are performed using the implicit rotate operand accepted by the
+// 'add' and 'eor' instructions. This is faster than using explicit rotate
+// instructions. To make this work, we allow the values in the second and last
+// rows of the BLAKE2s state matrix (rows 'b' and 'd') to temporarily have the
+// wrong rotation amount. The rotation amount is then fixed up just in time
+// when the values are used. 'brot' is the number of bits the values in row 'b'
+// need to be rotated right to arrive at the correct values, and 'drot'
+// similarly for row 'd'. (brot, drot) start out as (0, 0) but we make it such
+// that they end up as (7, 8) after every round.
+.macro _blake2s_round s0, s1, s2, s3, s4, s5, s6, s7, \
+ s8, s9, s10, s11, s12, s13, s14, s15
+
+ // Mix first two columns:
+ // (v[0], v[4], v[8], v[12]) and (v[1], v[5], v[9], v[13]).
+ __ldrd r10, r11, sp, 16 // load v[12] and v[13]
+ _blake2s_quarterround r0, r4, r8, r10, r1, r5, r9, r11, \
+ \s0, \s1, \s2, \s3
+ __strd r8, r9, sp, 0
+ __strd r10, r11, sp, 16
+
+ // Mix second two columns:
+ // (v[2], v[6], v[10], v[14]) and (v[3], v[7], v[11], v[15]).
+ __ldrd r8, r9, sp, 8 // load v[10] and v[11]
+ __ldrd r10, r11, sp, 24 // load v[14] and v[15]
+ _blake2s_quarterround r2, r6, r8, r10, r3, r7, r9, r11, \
+ \s4, \s5, \s6, \s7
+ str r10, [sp, #24] // store v[14]
+ // v[10], v[11], and v[15] are used below, so no need to store them yet.
+
+ .set brot, 7
+ .set drot, 8
+
+ // Mix first two diagonals:
+ // (v[0], v[5], v[10], v[15]) and (v[1], v[6], v[11], v[12]).
+ ldr r10, [sp, #16] // load v[12]
+ _blake2s_quarterround r0, r5, r8, r11, r1, r6, r9, r10, \
+ \s8, \s9, \s10, \s11
+ __strd r8, r9, sp, 8
+ str r11, [sp, #28]
+ str r10, [sp, #16]
+
+ // Mix second two diagonals:
+ // (v[2], v[7], v[8], v[13]) and (v[3], v[4], v[9], v[14]).
+ __ldrd r8, r9, sp, 0 // load v[8] and v[9]
+ __ldrd r10, r11, sp, 20 // load v[13] and v[14]
+ _blake2s_quarterround r2, r7, r8, r10, r3, r4, r9, r11, \
+ \s12, \s13, \s14, \s15
+ __strd r10, r11, sp, 20
+.endm
+
+//
+// void blake2s_compress_arch(struct blake2s_state *state,
+// const u8 *block, size_t nblocks, u32 inc);
+//
+// Only the first three fields of struct blake2s_state are used:
+// u32 h[8]; (inout)
+// u32 t[2]; (in)
+// u32 f[2]; (in)
+//
+ .align 5
+ENTRY(blake2s_compress_arch)
+ push {r0-r2,r4-r11,lr} // keep this an even number
+
+.Lnext_block:
+ // r0 is 'state'
+ // r1 is 'block'
+ // r3 is counter increment amount
+
+ // Load and increment the counter t[0..1].
+ __ldrd r10, r11, r0, 32
+ adds r10, r10, r3
+ adc r11, r11, #0
+ __strd r10, r11, r0, 32
+
+ // _blake2s_round is very short on registers, so copy the message block
+ // to the stack to save a register during the rounds. This also has the
+ // advantage that misalignment only needs to be dealt with in one place.
+ sub sp, sp, #64
+ mov r12, sp
+ tst r1, #3
+ bne .Lcopy_block_misaligned
+ ldmia r1!, {r2-r9}
+ stmia r12!, {r2-r9}
+ ldmia r1!, {r2-r9}
+ stmia r12, {r2-r9}
+.Lcopy_block_done:
+ str r1, [sp, #68] // Update message pointer
+
+ // Calculate v[8..15]. Push v[9..15] onto the stack, and leave space
+ // for spilling v[8..9]. Leave v[8..9] in r8-r9.
+ mov r14, r0 // r14 = state
+ adr r12, .Lblake2s_IV
+ ldmia r12!, {r8-r9} // load IV[0..1]
+ __ldrd r0, r1, r14, 40 // load f[0..1]
+ ldm r12, {r2-r7} // load IV[3..7]
+ eor r4, r4, r10 // v[12] = IV[4] ^ t[0]
+ eor r5, r5, r11 // v[13] = IV[5] ^ t[1]
+ eor r6, r6, r0 // v[14] = IV[6] ^ f[0]
+ eor r7, r7, r1 // v[15] = IV[7] ^ f[1]
+ push {r2-r7}
+ sub sp, sp, #8
+
+ // Load h[0..7] == v[0..7].
+ ldm r14, {r0-r7}
+
+ // Execute the rounds. Each round is provided the order in which it
+ // needs to use the message words.
+ .set brot, 0
+ .set drot, 0
+ _blake2s_round 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+ _blake2s_round 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3
+ _blake2s_round 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4
+ _blake2s_round 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8
+ _blake2s_round 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13
+ _blake2s_round 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9
+ _blake2s_round 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11
+ _blake2s_round 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10
+ _blake2s_round 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5
+ _blake2s_round 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0
+
+ // Fold the final state matrix into the hash chaining value:
+ //
+ // for (i = 0; i < 8; i++)
+ // h[i] ^= v[i] ^ v[i + 8];
+ //
+ ldr r14, [sp, #96] // r14 = &h[0]
+ add sp, sp, #8 // v[8..9] are already loaded.
+ pop {r10-r11} // load v[10..11]
+ eor r0, r0, r8
+ eor r1, r1, r9
+ eor r2, r2, r10
+ eor r3, r3, r11
+ ldm r14, {r8-r11} // load h[0..3]
+ eor r0, r0, r8
+ eor r1, r1, r9
+ eor r2, r2, r10
+ eor r3, r3, r11
+ stmia r14!, {r0-r3} // store new h[0..3]
+ ldm r14, {r0-r3} // load old h[4..7]
+ pop {r8-r11} // load v[12..15]
+ eor r0, r0, r4, ror #brot
+ eor r1, r1, r5, ror #brot
+ eor r2, r2, r6, ror #brot
+ eor r3, r3, r7, ror #brot
+ eor r0, r0, r8, ror #drot
+ eor r1, r1, r9, ror #drot
+ eor r2, r2, r10, ror #drot
+ eor r3, r3, r11, ror #drot
+ add sp, sp, #64 // skip copy of message block
+ stm r14, {r0-r3} // store new h[4..7]
+
+ ldm sp, {r0, r1, r2} // load (state, block, nblocks)
+ mov r3, #64 // set counter increment amount
+ subs r2, r2, #1 // nblocks--
+ str r2, [sp, #8]
+ bne .Lnext_block // nblocks != 0?
+
+ pop {r0-r2,r4-r11,pc}
+
+ // The message block isn't 4-byte aligned, so it can't be loaded using
+ // ldmia. Copy it to the stack using an alternative method.
+.Lcopy_block_misaligned:
+ mov r2, #64
+1:
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+ ldr r3, [r1], #4
+#else
+ ldrb r3, [r1, #0]
+ ldrb r4, [r1, #1]
+ ldrb r5, [r1, #2]
+ ldrb r6, [r1, #3]
+ add r1, r1, #4
+ orr r3, r3, r4, lsl #8
+ orr r3, r3, r5, lsl #16
+ orr r3, r3, r6, lsl #24
+#endif
+ subs r2, r2, #4
+ str r3, [r12], #4
+ bne 1b
+ b .Lcopy_block_done
+ENDPROC(blake2s_compress_arch)
diff --git a/arch/arm/crypto/blake2s-glue.c b/arch/arm/crypto/blake2s-glue.c
new file mode 100644
index 0000000000000..f2cc1e5fc9ec1
--- /dev/null
+++ b/arch/arm/crypto/blake2s-glue.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BLAKE2s digest algorithm, ARM scalar implementation
+ *
+ * Copyright 2020 Google LLC
+ */
+
+#include <crypto/internal/blake2s.h>
+#include <crypto/internal/hash.h>
+
+#include <linux/module.h>
+
+/* defined in blake2s-core.S */
+EXPORT_SYMBOL(blake2s_compress_arch);
+
+static int crypto_blake2s_update_arm(struct shash_desc *desc,
+ const u8 *in, unsigned int inlen)
+{
+ return crypto_blake2s_update(desc, in, inlen, blake2s_compress_arch);
+}
+
+static int crypto_blake2s_final_arm(struct shash_desc *desc, u8 *out)
+{
+ return crypto_blake2s_final(desc, out, blake2s_compress_arch);
+}
+
+#define BLAKE2S_ALG(name, driver_name, digest_size) \
+ { \
+ .base.cra_name = name, \
+ .base.cra_driver_name = driver_name, \
+ .base.cra_priority = 200, \
+ .base.cra_flags = CRYPTO_ALG_OPTIONAL_KEY, \
+ .base.cra_blocksize = BLAKE2S_BLOCK_SIZE, \
+ .base.cra_ctxsize = sizeof(struct blake2s_tfm_ctx), \
+ .base.cra_module = THIS_MODULE, \
+ .digestsize = digest_size, \
+ .setkey = crypto_blake2s_setkey, \
+ .init = crypto_blake2s_init, \
+ .update = crypto_blake2s_update_arm, \
+ .final = crypto_blake2s_final_arm, \
+ .descsize = sizeof(struct blake2s_state), \
+ }
+
+static struct shash_alg blake2s_arm_algs[] = {
+ BLAKE2S_ALG("blake2s-128", "blake2s-128-arm", BLAKE2S_128_HASH_SIZE),
+ BLAKE2S_ALG("blake2s-160", "blake2s-160-arm", BLAKE2S_160_HASH_SIZE),
+ BLAKE2S_ALG("blake2s-224", "blake2s-224-arm", BLAKE2S_224_HASH_SIZE),
+ BLAKE2S_ALG("blake2s-256", "blake2s-256-arm", BLAKE2S_256_HASH_SIZE),
+};
+
+static int __init blake2s_arm_mod_init(void)
+{
+ return IS_REACHABLE(CONFIG_CRYPTO_HASH) ?
+ crypto_register_shashes(blake2s_arm_algs,
+ ARRAY_SIZE(blake2s_arm_algs)) : 0;
+}
+
+static void __exit blake2s_arm_mod_exit(void)
+{
+ if (IS_REACHABLE(CONFIG_CRYPTO_HASH))
+ crypto_unregister_shashes(blake2s_arm_algs,
+ ARRAY_SIZE(blake2s_arm_algs));
+}
+
+module_init(blake2s_arm_mod_init);
+module_exit(blake2s_arm_mod_exit);
+
+MODULE_DESCRIPTION("BLAKE2s digest algorithm, ARM scalar implementation");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Eric Biggers <[email protected]>");
+MODULE_ALIAS_CRYPTO("blake2s-128");
+MODULE_ALIAS_CRYPTO("blake2s-128-arm");
+MODULE_ALIAS_CRYPTO("blake2s-160");
+MODULE_ALIAS_CRYPTO("blake2s-160-arm");
+MODULE_ALIAS_CRYPTO("blake2s-224");
+MODULE_ALIAS_CRYPTO("blake2s-224-arm");
+MODULE_ALIAS_CRYPTO("blake2s-256");
+MODULE_ALIAS_CRYPTO("blake2s-256-arm");
--
2.29.2

2020-12-17 22:29:36

by Eric Biggers

Subject: [PATCH v2 06/11] crypto: blake2s - define shash_alg structs using macros

From: Eric Biggers <[email protected]>

The shash_alg structs for the four variants of BLAKE2s are identical
except for the algorithm name, driver name, and digest size. So, avoid
code duplication by using a macro to define these structs.

Signed-off-by: Eric Biggers <[email protected]>
---
crypto/blake2s_generic.c | 88 ++++++++++++----------------------------
1 file changed, 27 insertions(+), 61 deletions(-)

diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c
index 005783ff45ad0..e3aa6e7ff3d83 100644
--- a/crypto/blake2s_generic.c
+++ b/crypto/blake2s_generic.c
@@ -83,67 +83,33 @@ static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
return 0;
}

-static struct shash_alg blake2s_algs[] = {{
- .base.cra_name = "blake2s-128",
- .base.cra_driver_name = "blake2s-128-generic",
- .base.cra_flags = CRYPTO_ALG_OPTIONAL_KEY,
- .base.cra_ctxsize = sizeof(struct blake2s_tfm_ctx),
- .base.cra_priority = 200,
- .base.cra_blocksize = BLAKE2S_BLOCK_SIZE,
- .base.cra_module = THIS_MODULE,
-
- .digestsize = BLAKE2S_128_HASH_SIZE,
- .setkey = crypto_blake2s_setkey,
- .init = crypto_blake2s_init,
- .update = crypto_blake2s_update,
- .final = crypto_blake2s_final,
- .descsize = sizeof(struct blake2s_state),
-}, {
- .base.cra_name = "blake2s-160",
- .base.cra_driver_name = "blake2s-160-generic",
- .base.cra_flags = CRYPTO_ALG_OPTIONAL_KEY,
- .base.cra_ctxsize = sizeof(struct blake2s_tfm_ctx),
- .base.cra_priority = 200,
- .base.cra_blocksize = BLAKE2S_BLOCK_SIZE,
- .base.cra_module = THIS_MODULE,
-
- .digestsize = BLAKE2S_160_HASH_SIZE,
- .setkey = crypto_blake2s_setkey,
- .init = crypto_blake2s_init,
- .update = crypto_blake2s_update,
- .final = crypto_blake2s_final,
- .descsize = sizeof(struct blake2s_state),
-}, {
- .base.cra_name = "blake2s-224",
- .base.cra_driver_name = "blake2s-224-generic",
- .base.cra_flags = CRYPTO_ALG_OPTIONAL_KEY,
- .base.cra_ctxsize = sizeof(struct blake2s_tfm_ctx),
- .base.cra_priority = 200,
- .base.cra_blocksize = BLAKE2S_BLOCK_SIZE,
- .base.cra_module = THIS_MODULE,
-
- .digestsize = BLAKE2S_224_HASH_SIZE,
- .setkey = crypto_blake2s_setkey,
- .init = crypto_blake2s_init,
- .update = crypto_blake2s_update,
- .final = crypto_blake2s_final,
- .descsize = sizeof(struct blake2s_state),
-}, {
- .base.cra_name = "blake2s-256",
- .base.cra_driver_name = "blake2s-256-generic",
- .base.cra_flags = CRYPTO_ALG_OPTIONAL_KEY,
- .base.cra_ctxsize = sizeof(struct blake2s_tfm_ctx),
- .base.cra_priority = 200,
- .base.cra_blocksize = BLAKE2S_BLOCK_SIZE,
- .base.cra_module = THIS_MODULE,
-
- .digestsize = BLAKE2S_256_HASH_SIZE,
- .setkey = crypto_blake2s_setkey,
- .init = crypto_blake2s_init,
- .update = crypto_blake2s_update,
- .final = crypto_blake2s_final,
- .descsize = sizeof(struct blake2s_state),
-}};
+#define BLAKE2S_ALG(name, driver_name, digest_size) \
+ { \
+ .base.cra_name = name, \
+ .base.cra_driver_name = driver_name, \
+ .base.cra_priority = 100, \
+ .base.cra_flags = CRYPTO_ALG_OPTIONAL_KEY, \
+ .base.cra_blocksize = BLAKE2S_BLOCK_SIZE, \
+ .base.cra_ctxsize = sizeof(struct blake2s_tfm_ctx), \
+ .base.cra_module = THIS_MODULE, \
+ .digestsize = digest_size, \
+ .setkey = crypto_blake2s_setkey, \
+ .init = crypto_blake2s_init, \
+ .update = crypto_blake2s_update, \
+ .final = crypto_blake2s_final, \
+ .descsize = sizeof(struct blake2s_state), \
+ }
+
+static struct shash_alg blake2s_algs[] = {
+ BLAKE2S_ALG("blake2s-128", "blake2s-128-generic",
+ BLAKE2S_128_HASH_SIZE),
+ BLAKE2S_ALG("blake2s-160", "blake2s-160-generic",
+ BLAKE2S_160_HASH_SIZE),
+ BLAKE2S_ALG("blake2s-224", "blake2s-224-generic",
+ BLAKE2S_224_HASH_SIZE),
+ BLAKE2S_ALG("blake2s-256", "blake2s-256-generic",
+ BLAKE2S_256_HASH_SIZE),
+};

static int __init blake2s_mod_init(void)
{
--
2.29.2

2020-12-17 22:29:39

by Eric Biggers

Subject: [PATCH v2 05/11] crypto: arm/blake2b - add NEON-accelerated BLAKE2b

From: Eric Biggers <[email protected]>

Add a NEON-accelerated implementation of BLAKE2b.

On Cortex-A7 (which these days is the most common ARM processor that
doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
SHA-256, and slightly faster than SHA-1. It is also almost three times
as fast as the generic implementation of BLAKE2b:

Algorithm Cycles per byte (on 4096-byte messages)
=================== =======================================
blake2b-256-neon 14.0
sha1-neon 16.3
sha1-asm 20.8
blake2s-256-generic 26.1
sha256-neon 28.9
sha256-asm 32.0
blake2b-256-generic 39.9

This implementation isn't directly based on any other implementation,
but it borrows some ideas from previous NEON code I've written as well
as from chacha-neon-core.S. At least on Cortex-A7, it is faster than
the other NEON implementations of BLAKE2b I'm aware of (the
implementation in the BLAKE2 official repository using intrinsics, and
Andrew Moon's implementation, which can be found in SUPERCOP). It does
only one block at a time, so it performs well on short messages too.

NEON-accelerated BLAKE2b is useful because there is interest in using
BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
devices that lack the ARMv8 Crypto Extensions) to replace SHA-1. On
these devices, the performance cost of upgrading to SHA-256 may be
unacceptable, whereas BLAKE2b-256 would actually improve performance.

Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
BLAKE2b is actually faster than BLAKE2s. This is because NEON supports
64-bit operations, and because BLAKE2s's block size is too small for
NEON to be helpful for it. The best I've been able to do with BLAKE2s
on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.

(I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
they're more complex as they require running multiple hashes at once.
Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
from the smaller number of rounds, not from the extra parallelism.)

For now this BLAKE2b implementation is only exposed via the shash API,
since the library API doesn't support BLAKE2b yet. However, I've tried
to keep things somewhat consistent with BLAKE2s, e.g. by defining
blake2b_compress_arch() which is analogous to blake2s_compress_arch()
and could be exported for use by the library API later if needed.

Signed-off-by: Eric Biggers <[email protected]>
---
arch/arm/crypto/Kconfig | 10 +
arch/arm/crypto/Makefile | 2 +
arch/arm/crypto/blake2b-neon-core.S | 345 ++++++++++++++++++++++++++++
arch/arm/crypto/blake2b-neon-glue.c | 105 +++++++++
4 files changed, 462 insertions(+)
create mode 100644 arch/arm/crypto/blake2b-neon-core.S
create mode 100644 arch/arm/crypto/blake2b-neon-glue.c

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index c9bf2df85cb90..f6a14c186b4ec 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -62,6 +62,16 @@ config CRYPTO_SHA512_ARM
SHA-512 secure hash standard (DFIPS 180-2) implemented
using optimized ARM assembler and NEON, when available.

+config CRYPTO_BLAKE2B_NEON
+ tristate "BLAKE2b digest algorithm (ARM NEON)"
+ depends on KERNEL_MODE_NEON
+ select CRYPTO_BLAKE2B
+ help
+ BLAKE2b digest algorithm optimized with ARM NEON instructions.
+ On ARM processors that have NEON support but not the ARMv8
+ Crypto Extensions, typically this BLAKE2b implementation is
+ much faster than SHA-2 and slightly faster than SHA-1.
+
config CRYPTO_AES_ARM
tristate "Scalar AES cipher for ARM"
select CRYPTO_ALGAPI
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index b745c17d356fe..ab835ceeb4f2e 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
+obj-$(CONFIG_CRYPTO_BLAKE2B_NEON) += blake2b-neon.o
obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
@@ -29,6 +30,7 @@ sha256-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha256_neon_glue.o
sha256-arm-y := sha256-core.o sha256_glue.o $(sha256-arm-neon-y)
sha512-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha512-neon-glue.o
sha512-arm-y := sha512-core.o sha512-glue.o $(sha512-arm-neon-y)
+blake2b-neon-y := blake2b-neon-core.o blake2b-neon-glue.o
sha1-arm-ce-y := sha1-ce-core.o sha1-ce-glue.o
sha2-arm-ce-y := sha2-ce-core.o sha2-ce-glue.o
aes-arm-ce-y := aes-ce-core.o aes-ce-glue.o
diff --git a/arch/arm/crypto/blake2b-neon-core.S b/arch/arm/crypto/blake2b-neon-core.S
new file mode 100644
index 0000000000000..f246884c29cb4
--- /dev/null
+++ b/arch/arm/crypto/blake2b-neon-core.S
@@ -0,0 +1,345 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * BLAKE2b digest algorithm, NEON accelerated
+ *
+ * Copyright 2020 Google LLC
+ *
+ * Author: Eric Biggers <[email protected]>
+ */
+
+#include <linux/linkage.h>
+
+ .text
+ .fpu neon
+
+ // The arguments to blake2b_compress_neon()
+ STATE .req r0
+ BLOCK .req r1
+ NBLOCKS .req r2
+ INC .req r3
+
+ // Pointers to the rotation tables
+ ROR24_TABLE .req r4
+ ROR16_TABLE .req r5
+
+ // The original stack pointer
+ ORIG_SP .req r6
+
+ // NEON registers which contain the message words of the current block.
+ // M_0-M_3 are occasionally used for other purposes too.
+ M_0 .req d16
+ M_1 .req d17
+ M_2 .req d18
+ M_3 .req d19
+ M_4 .req d20
+ M_5 .req d21
+ M_6 .req d22
+ M_7 .req d23
+ M_8 .req d24
+ M_9 .req d25
+ M_10 .req d26
+ M_11 .req d27
+ M_12 .req d28
+ M_13 .req d29
+ M_14 .req d30
+ M_15 .req d31
+
+ .align 4
+ // Tables for computing ror64(x, 24) and ror64(x, 16) using the vtbl.8
+ // instruction. This is the most efficient way to implement these
+ // rotation amounts with NEON. (On Cortex-A53 it's the same speed as
+ // vshr.u64 + vsli.u64, while on Cortex-A7 it's faster.)
+.Lror24_table:
+ .byte 3, 4, 5, 6, 7, 0, 1, 2
+.Lror16_table:
+ .byte 2, 3, 4, 5, 6, 7, 0, 1
+ // The BLAKE2b initialization vector
+.Lblake2b_IV:
+ .quad 0x6a09e667f3bcc908, 0xbb67ae8584caa73b
+ .quad 0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1
+ .quad 0x510e527fade682d1, 0x9b05688c2b3e6c1f
+ .quad 0x1f83d9abfb41bd6b, 0x5be0cd19137e2179
+
+// Execute one round of BLAKE2b by updating the state matrix v[0..15] in the
+// NEON registers q0-q7. The message block is in q8..q15. The stack pointer
+// points to a 32-byte aligned buffer containing a copy of q8 and q9, so that
+// they can be reloaded if q8 and q9 are used as temporary registers. The macro
+// arguments s0-s15 give the order in which the message words are used in this
+// round. 'final' is 1 if this is the final round.
+.macro _blake2b_round s0, s1, s2, s3, s4, s5, s6, s7, \
+ s8, s9, s10, s11, s12, s13, s14, s15, final=0
+
+ // Mix the columns:
+ // (v[0], v[4], v[8], v[12]), (v[1], v[5], v[9], v[13]),
+ // (v[2], v[6], v[10], v[14]), and (v[3], v[7], v[11], v[15]).
+
+ // a += b + m[blake2b_sigma[r][2*i + 0]];
+ vadd.u64 q0, q0, q2
+ vadd.u64 q1, q1, q3
+ vadd.u64 d0, d0, M_\s0
+ vadd.u64 d1, d1, M_\s2
+ vadd.u64 d2, d2, M_\s4
+ vadd.u64 d3, d3, M_\s6
+
+ // d = ror64(d ^ a, 32);
+ veor q6, q6, q0
+ veor q7, q7, q1
+ vrev64.32 q6, q6
+ vrev64.32 q7, q7
+
+ // c += d;
+ vadd.u64 q4, q4, q6
+ vadd.u64 q5, q5, q7
+
+ // b = ror64(b ^ c, 24);
+ vld1.8 {M_0}, [ROR24_TABLE, :64]
+ veor q2, q2, q4
+ veor q3, q3, q5
+ vtbl.8 d4, {d4}, M_0
+ vtbl.8 d5, {d5}, M_0
+ vtbl.8 d6, {d6}, M_0
+ vtbl.8 d7, {d7}, M_0
+
+ // a += b + m[blake2b_sigma[r][2*i + 1]];
+ //
+ // M_0 got clobbered above, so we have to reload it if any of the four
+ // message words this step needs happens to be M_0. Otherwise we don't
+ // need to reload it here, as it will just get clobbered again below.
+.if \s1 == 0 || \s3 == 0 || \s5 == 0 || \s7 == 0
+ vld1.8 {M_0}, [sp, :64]
+.endif
+ vadd.u64 q0, q0, q2
+ vadd.u64 q1, q1, q3
+ vadd.u64 d0, d0, M_\s1
+ vadd.u64 d1, d1, M_\s3
+ vadd.u64 d2, d2, M_\s5
+ vadd.u64 d3, d3, M_\s7
+
+ // d = ror64(d ^ a, 16);
+ vld1.8 {M_0}, [ROR16_TABLE, :64]
+ veor q6, q6, q0
+ veor q7, q7, q1
+ vtbl.8 d12, {d12}, M_0
+ vtbl.8 d13, {d13}, M_0
+ vtbl.8 d14, {d14}, M_0
+ vtbl.8 d15, {d15}, M_0
+
+ // c += d;
+ vadd.u64 q4, q4, q6
+ vadd.u64 q5, q5, q7
+
+ // b = ror64(b ^ c, 63);
+ //
+ // This rotation amount isn't a multiple of 8, so it has to be
+ // implemented using a pair of shifts, which requires temporary
+ // registers. Use q8-q9 (M_0-M_3) for this, and reload them afterwards.
+ veor q8, q2, q4
+ veor q9, q3, q5
+ vshr.u64 q2, q8, #63
+ vshr.u64 q3, q9, #63
+ vsli.u64 q2, q8, #1
+ vsli.u64 q3, q9, #1
+ vld1.8 {q8-q9}, [sp, :256]
+
+ // Mix the diagonals:
+ // (v[0], v[5], v[10], v[15]), (v[1], v[6], v[11], v[12]),
+ // (v[2], v[7], v[8], v[13]), and (v[3], v[4], v[9], v[14]).
+ //
+ // There are two possible ways to do this: use 'vext' instructions to
+ // shift the rows of the matrix so that the diagonals become columns,
+ // and undo it afterwards; or just use 64-bit operations on 'd'
+ // registers instead of 128-bit operations on 'q' registers. We use the
+ // latter approach, as it performs much better on Cortex-A7.
+
+ // a += b + m[blake2b_sigma[r][2*i + 0]];
+ vadd.u64 d0, d0, d5
+ vadd.u64 d1, d1, d6
+ vadd.u64 d2, d2, d7
+ vadd.u64 d3, d3, d4
+ vadd.u64 d0, d0, M_\s8
+ vadd.u64 d1, d1, M_\s10
+ vadd.u64 d2, d2, M_\s12
+ vadd.u64 d3, d3, M_\s14
+
+ // d = ror64(d ^ a, 32);
+ veor d15, d15, d0
+ veor d12, d12, d1
+ veor d13, d13, d2
+ veor d14, d14, d3
+ vrev64.32 d15, d15
+ vrev64.32 d12, d12
+ vrev64.32 d13, d13
+ vrev64.32 d14, d14
+
+ // c += d;
+ vadd.u64 d10, d10, d15
+ vadd.u64 d11, d11, d12
+ vadd.u64 d8, d8, d13
+ vadd.u64 d9, d9, d14
+
+ // b = ror64(b ^ c, 24);
+ vld1.8 {M_0}, [ROR24_TABLE, :64]
+ veor d5, d5, d10
+ veor d6, d6, d11
+ veor d7, d7, d8
+ veor d4, d4, d9
+ vtbl.8 d5, {d5}, M_0
+ vtbl.8 d6, {d6}, M_0
+ vtbl.8 d7, {d7}, M_0
+ vtbl.8 d4, {d4}, M_0
+
+ // a += b + m[blake2b_sigma[r][2*i + 1]];
+.if \s9 == 0 || \s11 == 0 || \s13 == 0 || \s15 == 0
+ vld1.8 {M_0}, [sp, :64]
+.endif
+ vadd.u64 d0, d0, d5
+ vadd.u64 d1, d1, d6
+ vadd.u64 d2, d2, d7
+ vadd.u64 d3, d3, d4
+ vadd.u64 d0, d0, M_\s9
+ vadd.u64 d1, d1, M_\s11
+ vadd.u64 d2, d2, M_\s13
+ vadd.u64 d3, d3, M_\s15
+
+ // d = ror64(d ^ a, 16);
+ vld1.8 {M_0}, [ROR16_TABLE, :64]
+ veor d15, d15, d0
+ veor d12, d12, d1
+ veor d13, d13, d2
+ veor d14, d14, d3
+ vtbl.8 d12, {d12}, M_0
+ vtbl.8 d13, {d13}, M_0
+ vtbl.8 d14, {d14}, M_0
+ vtbl.8 d15, {d15}, M_0
+
+ // c += d;
+ vadd.u64 d10, d10, d15
+ vadd.u64 d11, d11, d12
+ vadd.u64 d8, d8, d13
+ vadd.u64 d9, d9, d14
+
+ // b = ror64(b ^ c, 63);
+ veor d16, d4, d9
+ veor d17, d5, d10
+ veor d18, d6, d11
+ veor d19, d7, d8
+ vshr.u64 q2, q8, #63
+ vshr.u64 q3, q9, #63
+ vsli.u64 q2, q8, #1
+ vsli.u64 q3, q9, #1
+ // Reloading q8-q9 can be skipped on the final round.
+.if ! \final
+ vld1.8 {q8-q9}, [sp, :256]
+.endif
+.endm
+
+//
+// void blake2b_compress_neon(struct blake2b_state *state,
+// const u8 *block, size_t nblocks, u32 inc);
+//
+// Only the first three fields of struct blake2b_state are used:
+// u64 h[8]; (inout)
+// u64 t[2]; (in)
+// u64 f[2]; (in)
+//
+ .align 5
+ENTRY(blake2b_compress_neon)
+ push {r4-r10}
+
+ // Allocate a 32-byte stack buffer that is 32-byte aligned.
+ mov ORIG_SP, sp
+ sub ip, sp, #32
+ bic ip, ip, #31
+ mov sp, ip
+
+ adr ROR24_TABLE, .Lror24_table
+ adr ROR16_TABLE, .Lror16_table
+
+ mov ip, STATE
+ vld1.64 {q0-q1}, [ip]! // Load h[0..3]
+ vld1.64 {q2-q3}, [ip]! // Load h[4..7]
+.Lnext_block:
+ adr r10, .Lblake2b_IV
+ vld1.64 {q14-q15}, [ip] // Load t[0..1] and f[0..1]
+ vld1.64 {q4-q5}, [r10]! // Load IV[0..3]
+ vmov r7, r8, d28 // Copy t[0] to (r7, r8)
+ vld1.64 {q6-q7}, [r10] // Load IV[4..7]
+ adds r7, r7, INC // Increment counter
+ bcs .Lslow_inc_ctr
+ vmov.i32 d28[0], r7
+ vst1.64 {d28}, [ip] // Update t[0]
+.Linc_ctr_done:
+
+ // Load the next message block and finish initializing the state matrix
+ // 'v'. Fortunately, there are exactly enough NEON registers to fit the
+ // entire state matrix in q0-q7 and the entire message block in q8-15.
+ //
+ // However, _blake2b_round also needs some extra registers for rotates,
+ // so we have to spill some registers. It's better to spill the message
+ // registers than the state registers, as the message doesn't change.
+ // Therefore we store a copy of the first 32 bytes of the message block
+ // (q8-q9) in an aligned buffer on the stack so that they can be
+ // reloaded when needed. (We could just reload directly from the
+ // message buffer, but it's faster to use aligned loads.)
+ vld1.8 {q8-q9}, [BLOCK]!
+ veor q6, q6, q14 // v[12..13] = IV[4..5] ^ t[0..1]
+ vld1.8 {q10-q11}, [BLOCK]!
+ veor q7, q7, q15 // v[14..15] = IV[6..7] ^ f[0..1]
+ vld1.8 {q12-q13}, [BLOCK]!
+ vst1.8 {q8-q9}, [sp, :256]
+ mov ip, STATE
+ vld1.8 {q14-q15}, [BLOCK]!
+
+ // Execute the rounds. Each round is provided the order in which it
+ // needs to use the message words.
+ _blake2b_round 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+ _blake2b_round 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3
+ _blake2b_round 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4
+ _blake2b_round 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8
+ _blake2b_round 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13
+ _blake2b_round 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9
+ _blake2b_round 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11
+ _blake2b_round 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10
+ _blake2b_round 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5
+ _blake2b_round 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0
+ _blake2b_round 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+ _blake2b_round 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 \
+ final=1
+
+ // Fold the final state matrix into the hash chaining value:
+ //
+ // for (i = 0; i < 8; i++)
+ // h[i] ^= v[i] ^ v[i + 8];
+ //
+ vld1.64 {q8-q9}, [ip]! // Load old h[0..3]
+ veor q0, q0, q4 // v[0..1] ^= v[8..9]
+ veor q1, q1, q5 // v[2..3] ^= v[10..11]
+ vld1.64 {q10-q11}, [ip] // Load old h[4..7]
+ veor q2, q2, q6 // v[4..5] ^= v[12..13]
+ veor q3, q3, q7 // v[6..7] ^= v[14..15]
+ veor q0, q0, q8 // v[0..1] ^= h[0..1]
+ veor q1, q1, q9 // v[2..3] ^= h[2..3]
+ mov ip, STATE
+ subs NBLOCKS, NBLOCKS, #1 // nblocks--
+ vst1.64 {q0-q1}, [ip]! // Store new h[0..3]
+ veor q2, q2, q10 // v[4..5] ^= h[4..5]
+ veor q3, q3, q11 // v[6..7] ^= h[6..7]
+ vst1.64 {q2-q3}, [ip]! // Store new h[4..7]
+ bne .Lnext_block // nblocks != 0?
+
+ mov sp, ORIG_SP
+ pop {r4-r10}
+ mov pc, lr
+
+.Lslow_inc_ctr:
+ // Handle the case where the counter overflowed its low 32 bits, by
+ // carrying the overflow bit into the full 128-bit counter.
+ vmov r9, r10, d29
+ adcs r8, r8, #0
+ adcs r9, r9, #0
+ adc r10, r10, #0
+ vmov d28, r7, r8
+ vmov d29, r9, r10
+ vst1.64 {q14}, [ip] // Update t[0] and t[1]
+ b .Linc_ctr_done
+ENDPROC(blake2b_compress_neon)
diff --git a/arch/arm/crypto/blake2b-neon-glue.c b/arch/arm/crypto/blake2b-neon-glue.c
new file mode 100644
index 0000000000000..34d73200e7fa6
--- /dev/null
+++ b/arch/arm/crypto/blake2b-neon-glue.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BLAKE2b digest algorithm, NEON accelerated
+ *
+ * Copyright 2020 Google LLC
+ */
+
+#include <crypto/internal/blake2b.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+
+#include <linux/module.h>
+#include <linux/sizes.h>
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+
+asmlinkage void blake2b_compress_neon(struct blake2b_state *state,
+ const u8 *block, size_t nblocks, u32 inc);
+
+static void blake2b_compress_arch(struct blake2b_state *state,
+ const u8 *block, size_t nblocks, u32 inc)
+{
+ if (!crypto_simd_usable()) {
+ blake2b_compress_generic(state, block, nblocks, inc);
+ return;
+ }
+
+ do {
+ const size_t blocks = min_t(size_t, nblocks,
+ SZ_4K / BLAKE2B_BLOCK_SIZE);
+
+ kernel_neon_begin();
+ blake2b_compress_neon(state, block, blocks, inc);
+ kernel_neon_end();
+
+ nblocks -= blocks;
+ block += blocks * BLAKE2B_BLOCK_SIZE;
+ } while (nblocks);
+}
+
+static int crypto_blake2b_update_neon(struct shash_desc *desc,
+ const u8 *in, unsigned int inlen)
+{
+ return crypto_blake2b_update(desc, in, inlen, blake2b_compress_arch);
+}
+
+static int crypto_blake2b_final_neon(struct shash_desc *desc, u8 *out)
+{
+ return crypto_blake2b_final(desc, out, blake2b_compress_arch);
+}
+
+#define BLAKE2B_ALG(name, driver_name, digest_size) \
+ { \
+ .base.cra_name = name, \
+ .base.cra_driver_name = driver_name, \
+ .base.cra_priority = 200, \
+ .base.cra_flags = CRYPTO_ALG_OPTIONAL_KEY, \
+ .base.cra_blocksize = BLAKE2B_BLOCK_SIZE, \
+ .base.cra_ctxsize = sizeof(struct blake2b_tfm_ctx), \
+ .base.cra_module = THIS_MODULE, \
+ .digestsize = digest_size, \
+ .setkey = crypto_blake2b_setkey, \
+ .init = crypto_blake2b_init, \
+ .update = crypto_blake2b_update_neon, \
+ .final = crypto_blake2b_final_neon, \
+ .descsize = sizeof(struct blake2b_state), \
+ }
+
+static struct shash_alg blake2b_neon_algs[] = {
+ BLAKE2B_ALG("blake2b-160", "blake2b-160-neon", BLAKE2B_160_HASH_SIZE),
+ BLAKE2B_ALG("blake2b-256", "blake2b-256-neon", BLAKE2B_256_HASH_SIZE),
+ BLAKE2B_ALG("blake2b-384", "blake2b-384-neon", BLAKE2B_384_HASH_SIZE),
+ BLAKE2B_ALG("blake2b-512", "blake2b-512-neon", BLAKE2B_512_HASH_SIZE),
+};
+
+static int __init blake2b_neon_mod_init(void)
+{
+ if (!(elf_hwcap & HWCAP_NEON))
+ return -ENODEV;
+
+ return crypto_register_shashes(blake2b_neon_algs,
+ ARRAY_SIZE(blake2b_neon_algs));
+}
+
+static void __exit blake2b_neon_mod_exit(void)
+{
+ return crypto_unregister_shashes(blake2b_neon_algs,
+ ARRAY_SIZE(blake2b_neon_algs));
+}
+
+module_init(blake2b_neon_mod_init);
+module_exit(blake2b_neon_mod_exit);
+
+MODULE_DESCRIPTION("BLAKE2b digest algorithm, NEON accelerated");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Eric Biggers <[email protected]>");
+MODULE_ALIAS_CRYPTO("blake2b-160");
+MODULE_ALIAS_CRYPTO("blake2b-160-neon");
+MODULE_ALIAS_CRYPTO("blake2b-256");
+MODULE_ALIAS_CRYPTO("blake2b-256-neon");
+MODULE_ALIAS_CRYPTO("blake2b-384");
+MODULE_ALIAS_CRYPTO("blake2b-384-neon");
+MODULE_ALIAS_CRYPTO("blake2b-512");
+MODULE_ALIAS_CRYPTO("blake2b-512-neon");
--
2.29.2

2020-12-17 22:29:55

by Eric Biggers

Subject: [PATCH v2 09/11] crypto: blake2s - share the "shash" API boilerplate code

From: Eric Biggers <[email protected]>

Move the boilerplate code for setkey(), init(), update(), and final()
from blake2s_generic.ko into a new module blake2s_helpers.ko, and export
it so that it can be used by other shash implementations of BLAKE2s.

setkey() and init() are exported as-is, while update() and final() have
a blake2s_compress_t function pointer argument added. This allows the
implementation of the compression function to be overridden, which is
the only part that optimized implementations really care about.
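
For example, an optimized driver with its own compression function
(here the hypothetical blake2s_compress_mydriver()) only needs thin
wrappers around the helpers; the real x86 wrappers appear in the diff
below, and the ARM ones in the arm/blake2s patch:

static int crypto_blake2s_update_mydriver(struct shash_desc *desc,
					  const u8 *in, unsigned int inlen)
{
	return crypto_blake2s_update(desc, in, inlen,
				     blake2s_compress_mydriver);
}

static int crypto_blake2s_final_mydriver(struct shash_desc *desc, u8 *out)
{
	return crypto_blake2s_final(desc, out, blake2s_compress_mydriver);
}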

The helper functions are defined in a separate module blake2s_helpers.ko
(rather than just in blake2s_generic.ko) because we can't simply
select CRYPTO_BLAKE2S from CRYPTO_BLAKE2S_X86. Doing this selection
unconditionally would make the library API select the shash API, while
doing it conditionally on CRYPTO_HASH would create a recursive kconfig
dependency on CRYPTO_HASH. As a bonus, using a separate module also
allows the generic implementation to be omitted when unneeded.

These helper functions very closely match the ones I defined for
BLAKE2b, except that the BLAKE2b ones didn't go in a separate module,
since BLAKE2b isn't exposed through the library API yet.

Finally, use these new helper functions in the x86 implementation of
BLAKE2s. (This part should be a separate patch, but unfortunately the
x86 implementation used the exact same function names like
"crypto_blake2s_update()", so it had to be updated at the same time.)

Signed-off-by: Eric Biggers <[email protected]>
---
arch/x86/crypto/blake2s-glue.c | 74 +++-----------------------
crypto/Kconfig | 5 ++
crypto/Makefile | 1 +
crypto/blake2s_generic.c | 79 ++++------------------------
crypto/blake2s_helpers.c | 87 +++++++++++++++++++++++++++++++
include/crypto/internal/blake2s.h | 17 ++++++
6 files changed, 126 insertions(+), 137 deletions(-)
create mode 100644 crypto/blake2s_helpers.c

diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c
index 4dcb2ee89efc9..1306b3272c77f 100644
--- a/arch/x86/crypto/blake2s-glue.c
+++ b/arch/x86/crypto/blake2s-glue.c
@@ -58,75 +58,15 @@ void blake2s_compress_arch(struct blake2s_state *state,
}
EXPORT_SYMBOL(blake2s_compress_arch);

-static int crypto_blake2s_setkey(struct crypto_shash *tfm, const u8 *key,
- unsigned int keylen)
+static int crypto_blake2s_update_x86(struct shash_desc *desc, const u8 *in,
+ unsigned int inlen)
{
- struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
-
- if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE)
- return -EINVAL;
-
- memcpy(tctx->key, key, keylen);
- tctx->keylen = keylen;
-
- return 0;
-}
-
-static int crypto_blake2s_init(struct shash_desc *desc)
-{
- struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
- struct blake2s_state *state = shash_desc_ctx(desc);
- const int outlen = crypto_shash_digestsize(desc->tfm);
-
- if (tctx->keylen)
- blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
- else
- blake2s_init(state, outlen);
-
- return 0;
-}
-
-static int crypto_blake2s_update(struct shash_desc *desc, const u8 *in,
- unsigned int inlen)
-{
- struct blake2s_state *state = shash_desc_ctx(desc);
- const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
-
- if (unlikely(!inlen))
- return 0;
- if (inlen > fill) {
- memcpy(state->buf + state->buflen, in, fill);
- blake2s_compress_arch(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
- state->buflen = 0;
- in += fill;
- inlen -= fill;
- }
- if (inlen > BLAKE2S_BLOCK_SIZE) {
- const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
- /* Hash one less (full) block than strictly possible */
- blake2s_compress_arch(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
- in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
- inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
- }
- memcpy(state->buf + state->buflen, in, inlen);
- state->buflen += inlen;
-
- return 0;
+ return crypto_blake2s_update(desc, in, inlen, blake2s_compress_arch);
}

-static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
+static int crypto_blake2s_final_x86(struct shash_desc *desc, u8 *out)
{
- struct blake2s_state *state = shash_desc_ctx(desc);
-
- blake2s_set_lastblock(state);
- memset(state->buf + state->buflen, 0,
- BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
- blake2s_compress_arch(state, state->buf, 1, state->buflen);
- cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
- memcpy(out, state->h, state->outlen);
- memzero_explicit(state, sizeof(*state));
-
- return 0;
+ return crypto_blake2s_final(desc, out, blake2s_compress_arch);
}

#define BLAKE2S_ALG(name, driver_name, digest_size) \
@@ -141,8 +81,8 @@ static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
.digestsize = digest_size, \
.setkey = crypto_blake2s_setkey, \
.init = crypto_blake2s_init, \
- .update = crypto_blake2s_update, \
- .final = crypto_blake2s_final, \
+ .update = crypto_blake2s_update_x86, \
+ .final = crypto_blake2s_final_x86, \
.descsize = sizeof(struct blake2s_state), \
}

diff --git a/crypto/Kconfig b/crypto/Kconfig
index a367fcfeb5d45..e3a51154ac0cf 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -681,6 +681,7 @@ config CRYPTO_BLAKE2B
config CRYPTO_BLAKE2S
tristate "BLAKE2s digest algorithm"
select CRYPTO_LIB_BLAKE2S_GENERIC
+ select CRYPTO_BLAKE2S_HELPERS
select CRYPTO_HASH
help
Implementation of cryptographic hash function BLAKE2s
@@ -696,11 +697,15 @@ config CRYPTO_BLAKE2S

See https://blake2.net for further information.

+config CRYPTO_BLAKE2S_HELPERS
+ tristate
+
config CRYPTO_BLAKE2S_X86
tristate "BLAKE2s digest algorithm (x86 accelerated version)"
depends on X86 && 64BIT
select CRYPTO_LIB_BLAKE2S_GENERIC
select CRYPTO_ARCH_HAVE_LIB_BLAKE2S
+ select CRYPTO_BLAKE2S_HELPERS if CRYPTO_HASH

config CRYPTO_CRCT10DIF
tristate "CRCT10DIF algorithm"
diff --git a/crypto/Makefile b/crypto/Makefile
index b279483fba50b..75f99ab82bfc2 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -82,6 +82,7 @@ CFLAGS_wp512.o := $(call cc-option,-fno-schedule-insns) # https://gcc.gnu.org/b
obj-$(CONFIG_CRYPTO_TGR192) += tgr192.o
obj-$(CONFIG_CRYPTO_BLAKE2B) += blake2b_generic.o
obj-$(CONFIG_CRYPTO_BLAKE2S) += blake2s_generic.o
+obj-$(CONFIG_CRYPTO_BLAKE2S_HELPERS) += blake2s_helpers.o
obj-$(CONFIG_CRYPTO_GF128MUL) += gf128mul.o
obj-$(CONFIG_CRYPTO_ECB) += ecb.o
obj-$(CONFIG_CRYPTO_CBC) += cbc.o
diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c
index b89536c3671cf..ea2f1c6eb3231 100644
--- a/crypto/blake2s_generic.c
+++ b/crypto/blake2s_generic.c
@@ -1,84 +1,23 @@
// SPDX-License-Identifier: GPL-2.0 OR MIT
/*
+ * shash interface to the generic implementation of BLAKE2s
+ *
* Copyright (C) 2015-2019 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
*/

#include <crypto/internal/blake2s.h>
#include <crypto/internal/hash.h>
-
-#include <linux/types.h>
-#include <linux/kernel.h>
#include <linux/module.h>

-static int crypto_blake2s_setkey(struct crypto_shash *tfm, const u8 *key,
- unsigned int keylen)
-{
- struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
-
- if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE)
- return -EINVAL;
-
- memcpy(tctx->key, key, keylen);
- tctx->keylen = keylen;
-
- return 0;
-}
-
-static int crypto_blake2s_init(struct shash_desc *desc)
-{
- struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
- struct blake2s_state *state = shash_desc_ctx(desc);
- const int outlen = crypto_shash_digestsize(desc->tfm);
-
- if (tctx->keylen)
- blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
- else
- blake2s_init(state, outlen);
-
- return 0;
-}
-
-static int crypto_blake2s_update(struct shash_desc *desc, const u8 *in,
- unsigned int inlen)
+static int crypto_blake2s_update_generic(struct shash_desc *desc,
+ const u8 *in, unsigned int inlen)
{
- struct blake2s_state *state = shash_desc_ctx(desc);
- const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
-
- if (unlikely(!inlen))
- return 0;
- if (inlen > fill) {
- memcpy(state->buf + state->buflen, in, fill);
- blake2s_compress_generic(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
- state->buflen = 0;
- in += fill;
- inlen -= fill;
- }
- if (inlen > BLAKE2S_BLOCK_SIZE) {
- const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
- /* Hash one less (full) block than strictly possible */
- blake2s_compress_generic(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
- in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
- inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
- }
- memcpy(state->buf + state->buflen, in, inlen);
- state->buflen += inlen;
-
- return 0;
+ return crypto_blake2s_update(desc, in, inlen, blake2s_compress_generic);
}

-static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
+static int crypto_blake2s_final_generic(struct shash_desc *desc, u8 *out)
{
- struct blake2s_state *state = shash_desc_ctx(desc);
-
- blake2s_set_lastblock(state);
- memset(state->buf + state->buflen, 0,
- BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
- blake2s_compress_generic(state, state->buf, 1, state->buflen);
- cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
- memcpy(out, state->h, state->outlen);
- memzero_explicit(state, sizeof(*state));
-
- return 0;
+ return crypto_blake2s_final(desc, out, blake2s_compress_generic);
}

#define BLAKE2S_ALG(name, driver_name, digest_size) \
@@ -93,8 +32,8 @@ static int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
.digestsize = digest_size, \
.setkey = crypto_blake2s_setkey, \
.init = crypto_blake2s_init, \
- .update = crypto_blake2s_update, \
- .final = crypto_blake2s_final, \
+ .update = crypto_blake2s_update_generic, \
+ .final = crypto_blake2s_final_generic, \
.descsize = sizeof(struct blake2s_state), \
}

diff --git a/crypto/blake2s_helpers.c b/crypto/blake2s_helpers.c
new file mode 100644
index 0000000000000..0c3b9fcbd022c
--- /dev/null
+++ b/crypto/blake2s_helpers.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Helper functions for shash implementations of BLAKE2s. The caller must
+ * provide a pointer to their blake2s_compress_t implementation.
+ *
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <[email protected]>. All Rights Reserved.
+ */
+
+#include <crypto/internal/blake2s.h>
+#include <crypto/internal/hash.h>
+#include <linux/export.h>
+
+int crypto_blake2s_setkey(struct crypto_shash *tfm, const u8 *key,
+ unsigned int keylen)
+{
+ struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(tfm);
+
+ if (keylen == 0 || keylen > BLAKE2S_KEY_SIZE)
+ return -EINVAL;
+
+ memcpy(tctx->key, key, keylen);
+ tctx->keylen = keylen;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(crypto_blake2s_setkey);
+
+int crypto_blake2s_init(struct shash_desc *desc)
+{
+ struct blake2s_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+ struct blake2s_state *state = shash_desc_ctx(desc);
+ const int outlen = crypto_shash_digestsize(desc->tfm);
+
+ if (tctx->keylen)
+ blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
+ else
+ blake2s_init(state, outlen);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(crypto_blake2s_init);
+
+int crypto_blake2s_update(struct shash_desc *desc, const u8 *in,
+ unsigned int inlen, blake2s_compress_t compress)
+{
+ struct blake2s_state *state = shash_desc_ctx(desc);
+ const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;
+
+ if (unlikely(!inlen))
+ return 0;
+ if (inlen > fill) {
+ memcpy(state->buf + state->buflen, in, fill);
+ (*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
+ state->buflen = 0;
+ in += fill;
+ inlen -= fill;
+ }
+ if (inlen > BLAKE2S_BLOCK_SIZE) {
+ const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
+ /* Hash one less (full) block than strictly possible */
+ (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
+ in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+ inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
+ }
+ memcpy(state->buf + state->buflen, in, inlen);
+ state->buflen += inlen;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(crypto_blake2s_update);
+
+int crypto_blake2s_final(struct shash_desc *desc, u8 *out,
+ blake2s_compress_t compress)
+{
+ struct blake2s_state *state = shash_desc_ctx(desc);
+
+ blake2s_set_lastblock(state);
+ memset(state->buf + state->buflen, 0,
+ BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
+ (*compress)(state, state->buf, 1, state->buflen);
+ cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
+ memcpy(out, state->h, state->outlen);
+ memzero_explicit(state, sizeof(*state));
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(crypto_blake2s_final);
diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
index 6e376ae6b6b58..8c79af7662aa2 100644
--- a/include/crypto/internal/blake2s.h
+++ b/include/crypto/internal/blake2s.h
@@ -23,4 +23,21 @@ static inline void blake2s_set_lastblock(struct blake2s_state *state)
state->f[0] = -1;
}

+typedef void (*blake2s_compress_t)(struct blake2s_state *state,
+ const u8 *block, size_t nblocks, u32 inc);
+
+struct crypto_shash;
+struct shash_desc;
+
+int crypto_blake2s_setkey(struct crypto_shash *tfm, const u8 *key,
+ unsigned int keylen);
+
+int crypto_blake2s_init(struct shash_desc *desc);
+
+int crypto_blake2s_update(struct shash_desc *desc, const u8 *in,
+ unsigned int inlen, blake2s_compress_t compress);
+
+int crypto_blake2s_final(struct shash_desc *desc, u8 *out,
+ blake2s_compress_t compress);
+
#endif /* BLAKE2S_INTERNAL_H */
--
2.29.2

2020-12-18 16:16:25

by Jason A. Donenfeld

Subject: Re: [PATCH v2 09/11] crypto: blake2s - share the "shash" API boilerplate code

On Thu, Dec 17, 2020 at 11:25 PM Eric Biggers <[email protected]> wrote:
>
> From: Eric Biggers <[email protected]>
>
> Move the boilerplate code for setkey(), init(), update(), and final()
> from blake2s_generic.ko into a new module blake2s_helpers.ko, and export
> it so that it can be used by other shash implementations of BLAKE2s.
>
> setkey() and init() are exported as-is, while update() and final() have
> a blake2s_compress_t function pointer argument added. This allows the
> implementation of the compression function to be overridden, which is
> the only part that optimized implementations really care about.
>
> The helper functions are defined in a separate module blake2s_helpers.ko
> (rather than just than in blake2s_generic.ko) because we can't simply
> select CRYPTO_BLAKE2B from CRYPTO_BLAKE2S_X86. Doing this selection
> unconditionally would make the library API select the shash API, while
> doing it conditionally on CRYPTO_HASH would create a recursive kconfig
> dependency on CRYPTO_HASH. As a bonus, using a separate module also
> allows the generic implementation to be omitted when unneeded.
>
> These helper functions very closely match the ones I defined for
> BLAKE2b, except the BLAKE2b ones didn't go in a separate module yet
> because BLAKE2b isn't exposed through the library API yet.
>
> Finally, use these new helper functions in the x86 implementation of
> BLAKE2s. (This part should be a separate patch, but unfortunately the
> x86 implementation used the exact same function names like
> "crypto_blake2s_update()", so it had to be updated at the same time.)

There's a similar situation happening with chacha20poly1305 and with
curve25519. Each of these uses a mildly different approach to how we
split things up between library and crypto api code. The _helpers.ko
is another one. Is there any opportunity here to take a more
unified/consistent approach? Or is shash slightly different from the
others and benefits from a third way?

Jason

2020-12-18 16:35:46

by Ard Biesheuvel

Subject: Re: [PATCH v2 00/11] crypto: arm32-optimized BLAKE2b and BLAKE2s

On Thu, 17 Dec 2020 at 23:24, Eric Biggers <[email protected]> wrote:
>
> This patchset adds 32-bit ARM assembly language implementations of
> BLAKE2b and BLAKE2s.
>
> The BLAKE2b implementation is NEON-accelerated, while the BLAKE2s
> implementation uses scalar instructions since NEON doesn't work very
> well for it. The BLAKE2b implementation is faster and is expected to be
> useful as a replacement for SHA-1 in dm-verity, while the BLAKE2s
> implementation would be useful for WireGuard which uses BLAKE2s.
>
> Both implementations are provided via the shash API, while BLAKE2s is
> also provided via the library API.
>
> While adding these, I also reworked the generic implementations of
> BLAKE2b and BLAKE2s to provide helper functions that make implementing
> other "shash" providers for these algorithms much easier.
>
> See the individual commits for full details, including benchmarks.
>
> This patchset was tested on a Raspberry Pi 2 (which uses a Cortex-A7
> processor) with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y, plus other tests.
>
> This patchset applies to mainline commit 0c6c887835b5.
>
> Changed since v1:
> - Added BLAKE2s implementation.
> - Adjusted the BLAKE2b helper functions to be consistent with what I
> decided to do for BLAKE2s.
> - Fixed build error in blake2b-neon-core.S in some configurations.
>
> Eric Biggers (11):
> crypto: blake2b - rename constants for consistency with blake2s
> crypto: blake2b - define shash_alg structs using macros
> crypto: blake2b - export helpers for optimized implementations
> crypto: blake2b - update file comment
> crypto: arm/blake2b - add NEON-accelerated BLAKE2b
> crypto: blake2s - define shash_alg structs using macros
> crypto: x86/blake2s - define shash_alg structs using macros
> crypto: blake2s - remove unneeded includes
> crypto: blake2s - share the "shash" API boilerplate code
> crypto: arm/blake2s - add ARM scalar optimized BLAKE2s
> wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM
>

Very nice!

For the series,

Acked-by: Ard Biesheuvel <[email protected]>



2020-12-18 20:53:15

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v2 09/11] crypto: blake2s - share the "shash" API boilerplate code

On Fri, Dec 18, 2020 at 05:14:58PM +0100, Jason A. Donenfeld wrote:
> On Thu, Dec 17, 2020 at 11:25 PM Eric Biggers <[email protected]> wrote:
> >
> > From: Eric Biggers <[email protected]>
> >
> > Move the boilerplate code for setkey(), init(), update(), and final()
> > from blake2s_generic.ko into a new module blake2s_helpers.ko, and export
> > it so that it can be used by other shash implementations of BLAKE2s.
> >
> > setkey() and init() are exported as-is, while update() and final() have
> > a blake2s_compress_t function pointer argument added. This allows the
> > implementation of the compression function to be overridden, which is
> > the only part that optimized implementations really care about.
> >
> > The helper functions are defined in a separate module blake2s_helpers.ko
> > (rather than just in blake2s_generic.ko) because we can't simply
> > select CRYPTO_BLAKE2S from CRYPTO_BLAKE2S_X86. Doing this selection
> > unconditionally would make the library API select the shash API, while
> > doing it conditionally on CRYPTO_HASH would create a recursive kconfig
> > dependency on CRYPTO_HASH. As a bonus, using a separate module also
> > allows the generic implementation to be omitted when unneeded.
> >
> > These helper functions very closely match the ones I defined for
> > BLAKE2b, except the BLAKE2b ones didn't go in a separate module yet
> > because BLAKE2b isn't exposed through the library API yet.
> >
> > Finally, use these new helper functions in the x86 implementation of
> > BLAKE2s. (This part should be a separate patch, but unfortunately the
> > x86 implementation used the exact same function names like
> > "crypto_blake2s_update()", so it had to be updated at the same time.)
>
> There's a similar situation happening with chacha20poly1305 and with
> curve25519. Each of these uses a mildly different approach to how we
> split things up between library and crypto api code. The _helpers.ko
> is another one. Is there any opportunity here to take a more
> unified/consistent approach? Or is shash slightly different from the
> others, such that it benefits from a third way?

What I'm doing isn't that new; it's basically the same thing that's already done
for sha1, sha256, sha512, sm3, and nhpoly1305. The difference is just where the
shash helper functions are defined. For sha1, sha256, sha512, and sm3 it's in
headers (include/crypto/{sha1,sha256,sha512,sm3}_base.h). For nhpoly1305 it's
in the same module as the generic implementation (crypto/nhpoly1305.c).

As I mentioned in the commit message, putting the shash helpers in the same
module as the generic implementation doesn't work for blake2s due to the need
for the architecture-specific blake2s modules to support the library API (and
possibly *only* the library API) too. So that mainly leaves the options of
adding include/crypto/blake2s_base.h, or adding blake2s_helpers.ko. But these
approaches aren't fundamentally different -- it's just a matter of whether the
code is short enough for it to make sense to always inline it or not.

It's hard to compare to chacha20 and curve25519 (or chacha20poly1305), as those
are different types of algorithms. poly1305 is the most comparable, but for
poly1305 the library API works differently in that there are
poly1305_{init,update,final}_arch(), not just blake2s_compress_arch(). This
seems to be due to some of the characteristics that optimized poly1305
implementations usually have -- they need to compute key powers differently, and
they may have an optimized implementation of emitting the final hash value, not
just processing blocks.

So that is going to result in a different implementation, as the
architecture-specific poly1305 modules have to implement the full
init/update/final sequence for the library API, while for blake2s they only need
that full sequence for the shash API. Hence the usefulness of adding helper
functions for blake2s shash implementations but not poly1305. I don't think
this difference is necessarily bad; it just means that optimized blake2s
implementations don't need to do as much themselves, in comparison to poly1305.

What we *could* do to reuse even more code than what I've proposed is to make
the functions in lib/crypto/blake2s.c usable as the helper functions for shash
implementations, like the following. Then blake2s_helpers.ko wouldn't really be
necessary. (A few bits would still be needed, but they could be inline
functions in a header.) I thought this might be controversial though, as it
would add some indirection into the library implementation:

diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 41025a30c524c..5188e1ca43c60 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -20,6 +20,16 @@
bool blake2s_selftest(void);

void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
+{
+ if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
+ __blake2s_update(state, in, inlen, blake2s_compress_arch);
+ else
+ __blake2s_update(state, in, inlen, blake2s_compress_generic);
+}
+EXPORT_SYMBOL(blake2s_update);
+
+void __blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen,
+ blake2s_compress_t compress)
{
const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;

@@ -27,12 +37,7 @@ void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
return;
if (inlen > fill) {
memcpy(state->buf + state->buflen, in, fill);
- if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
- blake2s_compress_arch(state, state->buf, 1,
- BLAKE2S_BLOCK_SIZE);
- else
- blake2s_compress_generic(state, state->buf, 1,
- BLAKE2S_BLOCK_SIZE);
+ (*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
state->buflen = 0;
in += fill;
inlen -= fill;
@@ -40,35 +45,37 @@ void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
if (inlen > BLAKE2S_BLOCK_SIZE) {
const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
/* Hash one less (full) block than strictly possible */
- if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
- blake2s_compress_arch(state, in, nblocks - 1,
- BLAKE2S_BLOCK_SIZE);
- else
- blake2s_compress_generic(state, in, nblocks - 1,
- BLAKE2S_BLOCK_SIZE);
+ (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
}
memcpy(state->buf + state->buflen, in, inlen);
state->buflen += inlen;
}
-EXPORT_SYMBOL(blake2s_update);
+EXPORT_SYMBOL(__blake2s_update);

void blake2s_final(struct blake2s_state *state, u8 *out)
+{
+ if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
+ __blake2s_final(state, out, blake2s_compress_arch);
+ else
+ __blake2s_final(state, out, blake2s_compress_generic);
+}
+EXPORT_SYMBOL(blake2s_final);
+
+void __blake2s_final(struct blake2s_state *state, u8 *out,
+ blake2s_compress_t compress)
{
WARN_ON(IS_ENABLED(DEBUG) && !out);
blake2s_set_lastblock(state);
memset(state->buf + state->buflen, 0,
BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
- if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
- blake2s_compress_arch(state, state->buf, 1, state->buflen);
- else
- blake2s_compress_generic(state, state->buf, 1, state->buflen);
+ (*compress)(state, state->buf, 1, state->buflen);
cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
memcpy(out, state->h, state->outlen);
memzero_explicit(state, sizeof(*state));
}
-EXPORT_SYMBOL(blake2s_final);
+EXPORT_SYMBOL(__blake2s_final);

void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
const size_t keylen)

2020-12-19 00:03:11

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v2 09/11] crypto: blake2s - share the "shash" API boilerplate code

Hey Eric,

The solution you've proposed at the end of your email is actually kind
of similar to what we do with curve25519. Check out
include/crypto/curve25519.h. The critical difference between that and
the blake proposal is that it's in the header for curve25519, so the
indirection disappears.

Could we do that with headers for blake?

Jason

2020-12-22 08:57:29

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v2 09/11] crypto: blake2s - share the "shash" API boilerplate code

On Sat, Dec 19, 2020 at 01:01:53AM +0100, Jason A. Donenfeld wrote:
> Hey Eric,
>
> The solution you've proposed at the end of your email is actually kind
> of similar to what we do with curve25519. Check out
> include/crypto/curve25519.h. The critical difference between that and
> the blake proposal is that it's in the header for curve25519, so the
> indirection disappears.
>
> Could we do that with headers for blake?
>

That doesn't look too similar, since most of include/crypto/curve25519.h is just
for the library API. curve25519_generate_secret() is shared, but it's only a
few lines of code and there's no function pointer argument.

Either way, it would be possible to add __blake2s_update() and __blake2s_final()
(taking a blake2s_compress_t argument) to include/crypto/internal/blake2s.h, and
make these used by (and inlined into) both the library and shash functions.

Note, that's mostly separate from the question of whether blake2s_helpers.ko
should exist, since that depends on whether we want the functions in it to get
inlined into every shash implementation or not. I don't really have a strong
preference. They did seem long enough to make them out-of-line; however,
indirect calls are bad too. If we go with inlining, then the shash helper
functions (crypto_blake2s_{setkey,init,update,final}()) would just be inline
functions in include/crypto/internal/blake2s.h too, similar to sha256_base.h,
and they would get compiled into both blake2s_generic.ko and blake2s-${arch}.ko.
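
For instance, the header-inline option could look something like the following
(an untested sketch; the __blake2s_final() body is abbreviated from the diff
earlier in the thread, and the shash wrapper names are only illustrative):

/* Hypothetical inline helper in include/crypto/internal/blake2s.h. */
static inline void __blake2s_final(struct blake2s_state *state, u8 *out,
				   blake2s_compress_t compress)
{
	blake2s_set_lastblock(state);
	memset(state->buf + state->buflen, 0,
	       BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
	(*compress)(state, state->buf, 1, state->buflen);
	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
	memcpy(out, state->h, state->outlen);
	memzero_explicit(state, sizeof(*state));
}

/* An shash implementation would then pass its own compression function: */
static int crypto_blake2s_arch_final(struct shash_desc *desc, u8 *out)
{
	struct blake2s_state *state = shash_desc_ctx(desc);

	__blake2s_final(state, out, blake2s_compress_arch);
	return 0;
}

Since the compress argument is a compile-time constant at each call site, the
compiler can resolve the compression call directly once the helper is inlined,
so no function-pointer indirection would remain at runtime.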

- Eric