2022-01-25 08:22:10

by Nathan Huckleberry

Subject: [RFC PATCH 0/7] crypto: HCTR2 support

HCTR2 is a length-preserving encryption mode that is efficient on
processors with instructions to accelerate AES and carryless
multiplication, e.g. x86 processors with AES-NI and CLMUL, and ARM
processors with the ARMv8 Crypto Extensions.

HCTR2 is specified in “Length-preserving encryption with HCTR2”
(https://ia.cr/2021/1441), which proves that HCTR2 is secure as long as
the underlying block cipher (e.g. AES) is secure. Reference code and test
vectors are at https://github.com/google/hctr2.

As a length-preserving encryption mode, HCTR2 is suitable for applications
such as storage encryption where ciphertext expansion is not possible, and
thus authenticated encryption cannot be used. Currently, such
applications usually use XTS, or in some cases Adiantum. XTS has the
disadvantage that it is a narrow-block mode: flipping a single bit changes
only the corresponding 16-byte block of the resulting ciphertext or
plaintext. This reveals more information to an attacker than necessary.

HCTR2 is a wide-block mode, so it provides a stronger security property: a
bitflip will change the entire message. HCTR2 is somewhat similar to
Adiantum, which is also a wide-block mode. However, HCTR2 is designed to
take advantage of existing crypto instructions, while Adiantum targets
devices without such hardware support. Adiantum is also designed with
longer messages in mind, while HCTR2 is designed to be efficient even on
short messages.

The first intended use of this mode in the kernel is for the encryption of
filenames, where for efficiency reasons encryption must be fully
deterministic (a given plaintext always produces the same ciphertext) and
the existing CBC-based solution leaks more information than necessary for
filenames that share a common prefix.

HCTR2 uses two passes of an ε-almost-∆-universal hash function called
POLYVAL and one pass of a block cipher mode called XCTR. POLYVAL is a
polynomial hash designed for efficiency on modern processors and was
originally specified for use in AES-GCM-SIV (RFC 8452). XCTR mode is a
variant of CTR mode that is more efficient on little-endian machines.
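
For intuition: where CTR increments a big-endian counter block, XCTR
encrypts the XOR of the nonce with a little-endian block index, which is
cheaper on little-endian machines. A rough C sketch of the keystream
generation (illustrative only; aes_encrypt_block() is an assumed
single-block helper, not a kernel API):

	static void xctr_keystream(const void *key, const u8 nonce[16],
				   u8 *out, size_t nblocks)
	{
		u8 blk[16];
		size_t i;
		int j;

		for (i = 0; i < nblocks; i++) {
			u64 idx = i + 1;	/* XCTR block indices are 1-based */

			memcpy(blk, nonce, 16);
			/* XOR in the little-endian encoding of the index */
			for (j = 0; j < 8; j++)
				blk[j] ^= (u8)(idx >> (8 * j));
			/* assumed helper: E_K(blk) -> &out[16*i] */
			aes_encrypt_block(key, out + 16 * i, blk);
		}
	}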

This patchset adds HCTR2 to Linux's crypto API, including generic
implementations of XCTR and POLYVAL, hardware accelerated implementations
of XCTR and POLYVAL for both x86-64 and ARM64, and a templated
implementation of HCTR2.

Nathan Huckleberry (7):
crypto: xctr - Add XCTR support
crypto: polyval - Add POLYVAL support
crypto: hctr2 - Add HCTR2 support
crypto: x86/aesni-xctr: Add accelerated implementation of XCTR
crypto: arm64/aes-xctr: Add accelerated implementation of XCTR
crypto: x86/polyval: Add PCLMULQDQ accelerated implementation of
POLYVAL
crypto: arm64/polyval: Add PMULL accelerated implementation of POLYVAL

arch/arm64/crypto/Kconfig | 10 +-
arch/arm64/crypto/Makefile | 3 +
arch/arm64/crypto/aes-glue.c | 70 +-
arch/arm64/crypto/aes-modes.S | 128 ++
arch/arm64/crypto/polyval-ce-core.S | 317 ++++
arch/arm64/crypto/polyval-ce-glue.c | 164 ++
arch/x86/crypto/Makefile | 5 +-
arch/x86/crypto/aes_xctrby8_avx-x86_64.S | 529 ++++++
arch/x86/crypto/aesni-intel_asm.S | 70 +
arch/x86/crypto/aesni-intel_glue.c | 88 +
arch/x86/crypto/polyval-clmulni-intel_asm.S | 319 ++++
arch/x86/crypto/polyval-clmulni-intel_glue.c | 165 ++
crypto/Kconfig | 37 +
crypto/Makefile | 3 +
crypto/hctr2.c | 475 +++++
crypto/polyval-generic.c | 183 ++
crypto/tcrypt.c | 10 +
crypto/testmgr.c | 18 +
crypto/testmgr.h | 1617 ++++++++++++++++++
crypto/xctr.c | 202 +++
include/crypto/polyval.h | 22 +
include/crypto/xctr.h | 19 +
22 files changed, 4449 insertions(+), 5 deletions(-)
create mode 100644 arch/arm64/crypto/polyval-ce-core.S
create mode 100644 arch/arm64/crypto/polyval-ce-glue.c
create mode 100644 arch/x86/crypto/aes_xctrby8_avx-x86_64.S
create mode 100644 arch/x86/crypto/polyval-clmulni-intel_asm.S
create mode 100644 arch/x86/crypto/polyval-clmulni-intel_glue.c
create mode 100644 crypto/hctr2.c
create mode 100644 crypto/polyval-generic.c
create mode 100644 crypto/xctr.c
create mode 100644 include/crypto/polyval.h
create mode 100644 include/crypto/xctr.h

--
2.35.0.rc0.227.g00780c9af4-goog


2022-01-25 08:22:14

by Nathan Huckleberry

Subject: [RFC PATCH 6/7] crypto: x86/polyval: Add PCLMULQDQ accelerated implementation of POLYVAL

Add hardware accelerated version of POLYVAL for x86-64 CPUs with
PCLMULQDQ support.

This implementation is accelerated using PCLMULQDQ instructions to
perform the finite field computations. For added efficiency, 8 blocks
of the plaintext are processed simultaneously by precomputing the first
8 powers of the key.

Schoolbook multiplication is used instead of Karatsuba multiplication
because it was found to be slightly faster on x86-64 machines.
Montgomery reduction must be used instead of Barrett reduction due to
the difference in modulus between POLYVAL's field and other finite
fields.

More information on POLYVAL can be found in the HCTR2 paper:
Length-preserving encryption with HCTR2:
https://eprint.iacr.org/2021/1441.pdf
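
For reference, the value being computed is an ordinary Horner-style
polynomial evaluation over GF(2^128): each step XORs in a message block,
then multiplies the accumulator by the key. A rough C equivalent (sketch
only; gf128_polyval_mul() stands in for the Montgomery multiplication and
is not a real helper):

	static void polyval_update_ref(u8 acc[16], const u8 key[16],
				       const u8 *msg, size_t nblocks)
	{
		int i;

		while (nblocks--) {
			for (i = 0; i < 16; i++)	/* acc ^= M_j */
				acc[i] ^= *msg++;
			gf128_polyval_mul(acc, key);	/* acc = acc * h * x^-128 */
		}
	}

The assembly computes the same result, but processes eight such steps per
iteration using the precomputed powers h^1..h^8 and defers the modular
reduction.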

Signed-off-by: Nathan Huckleberry <[email protected]>
---
arch/x86/crypto/Makefile | 3 +
arch/x86/crypto/polyval-clmulni-intel_asm.S | 319 +++++++++++++++++++
arch/x86/crypto/polyval-clmulni-intel_glue.c | 165 ++++++++++
crypto/Kconfig | 9 +
4 files changed, 496 insertions(+)
create mode 100644 arch/x86/crypto/polyval-clmulni-intel_asm.S
create mode 100644 arch/x86/crypto/polyval-clmulni-intel_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index ed187fcd0b01..0214c5f22606 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -69,6 +69,9 @@ libblake2s-x86_64-y := blake2s-core.o blake2s-glue.o
obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o

+obj-$(CONFIG_CRYPTO_POLYVAL_CLMUL_NI_INTEL) += polyval-clmulni-intel.o
+polyval-clmulni-intel-y := polyval-clmulni-intel_asm.o polyval-clmulni-intel_glue.o
+
obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o
crc32c-intel-y := crc32c-intel_glue.o
crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o
diff --git a/arch/x86/crypto/polyval-clmulni-intel_asm.S b/arch/x86/crypto/polyval-clmulni-intel_asm.S
new file mode 100644
index 000000000000..4339b58e610d
--- /dev/null
+++ b/arch/x86/crypto/polyval-clmulni-intel_asm.S
@@ -0,0 +1,319 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2021 Google LLC
+ *
+ * Use of this source code is governed by an MIT-style
+ * license that can be found in the LICENSE file or at
+ * https://opensource.org/licenses/MIT.
+ */
+/*
+ * This is an efficient implementation of POLYVAL using intel PCLMULQDQ-NI
+ * instructions. It works on 8 blocks at a time, computing the 256 degree
+ * polynomial p(x) = h^8m_0 + ... + h^1m_7. It then computes the modular
+ * reduction of p(x) and XORs p(x) with the current digest.
+ */
+
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#define NUM_PRECOMPUTE_POWERS 8
+
+.align 16
+
+#define GSTAR %xmm7
+#define PL %xmm8
+#define PH %xmm9
+#define T %xmm10
+#define Z %xmm11
+#define C %xmm12
+#define D %xmm13
+#define EF %xmm14
+#define SUM %xmm15
+
+#define BLOCKS_LEFT %rdx
+#define OP1 %rdi
+#define OP2 %r10
+#define IDX %r11
+#define TMP %rax
+
+Lgstar:
+ .quad 0xc200000000000000, 0xc200000000000000
+
+.text
+
+/*
+ * Accepts operand lists of length b in rdi and rsi. Computes the product of
+ * each rdi,rsi pair then XORs the products into A, B, C, D.
+ *
+ * If first == 1 then XOR the value of SUM into the first block processed.
+ * This avoids an extra multiplication of SUM and h^N.
+ *
+ * XORs product into C, D, EF
+ * Preserves SUM
+ * All other xmm registers clobbered
+ */
+.macro schoolbook1 b
+ .set by, \b
+ .set i, 0
+ .rept (by)
+ schoolbook1_iteration i 0
+ .set i, (i +1)
+ .endr
+.endm
+
+.macro schoolbook1_iteration i first
+ .set first, \first
+ .set i, \i
+ movups (16*i)(OP1), %xmm0
+ .if(i == 0 && first == 1)
+ pxor SUM, %xmm0
+ .endif
+ vpclmulqdq $0x01, (16*i)(OP2), %xmm0, %xmm1
+ vpxor %xmm1, EF, EF
+ vpclmulqdq $0x00, (16*i)(OP2), %xmm0, %xmm2
+ vpxor %xmm2, C, C
+ vpclmulqdq $0x11, (16*i)(OP2), %xmm0, %xmm3
+ vpxor %xmm3, D, D
+ vpclmulqdq $0x10, (16*i)(OP2), %xmm0, %xmm4
+ vpxor %xmm4, EF, EF
+.endm
+
+/*
+ * Computes first schoolbook step of values loaded into xmm0 and xmm1. Used to
+ * multiply intermediate register values rather than memory stored values.
+ *
+ * XORs product into C, D, EF
+ * Preserves SUM
+ * All other xmm registers clobbered
+ */
+.macro schoolbook1_noload
+ vpclmulqdq $0x01, %xmm0, %xmm1, %xmm2
+ vpxor %xmm2, EF, EF
+ vpclmulqdq $0x00, %xmm0, %xmm1, %xmm3
+ vpxor %xmm3, C, C
+ vpclmulqdq $0x11, %xmm0, %xmm1, %xmm4
+ vpxor %xmm4, D, D
+ vpclmulqdq $0x10, %xmm0, %xmm1, %xmm5
+ vpxor %xmm5, EF, EF
+.endm
+
+/*
+ * Computes the 256-bit polynomial represented by C, D, EF. Stores
+ * the result in PL, PH.
+ *
+ * All other xmm registers are preserved.
+ */
+.macro schoolbook2
+ vpslldq $8, EF, PL
+ vpsrldq $8, EF, PH
+ pxor C, PL
+ pxor D, PH
+.endm
+
+/*
+ * Computes the 128-bit reduction of PL, PH. Stores the result in PH.
+ *
+ * Clobbers PL, PH, Z, T.
+ * All other xmm registers are preserved.
+ */
+.macro montgomery_reduction
+ movdqa PL, T
+ pclmulqdq $0x00, GSTAR, T # T = [X0 * g*(x)]
+ pshufd $0b01001110, T, Z # Z = [T0 : T1]
+ pxor Z, PL # PL = [X1 ^ T0 : X0 ^ T1]
+ pxor PL, PH # PH = [X1 ^ T0 ^ X3 : X0 ^ T1 ^ X2]
+ pclmulqdq $0x11, GSTAR, PL # PL = [X1 ^ T0 * g*(x)]
+ pxor PL, PH
+.endm
+
+/*
+ * Compute schoolbook multiplication for 8 blocks
+ * (M_0h + REDUCE(PL, PH))h^8 + ... + M_{7}h^1 (no constant term)
+ *
+ * Sets PL, PH
+ * Clobbers C, D, EF
+ *
+ * If reduce is set, computes the Montgomery reduction of the
+ * previous full_stride call.
+ */
+.macro full_stride reduce
+ .set reduce, \reduce
+ mov %rsi, OP2
+ pxor C, C
+ pxor D, D
+ pxor EF, EF
+
+ schoolbook1_iteration 7 0
+ .if(reduce)
+ movdqa PL, T
+ .endif
+
+ schoolbook1_iteration 6 0
+ .if(reduce)
+ pclmulqdq $0x00, GSTAR, T # T = [X0 * g*(x)]
+ .endif
+
+ schoolbook1_iteration 5 0
+ .if(reduce)
+ pshufd $0b01001110, T, Z # Z = [T0 : T1]
+ .endif
+
+ schoolbook1_iteration 4 0
+ .if(reduce)
+ pxor Z, PL # PL = [X1 ^ T0 : X0 ^ T1]
+ .endif
+
+ schoolbook1_iteration 3 0
+ .if(reduce)
+ pxor PL, PH # PH = [X1 ^ T0 ^ X3 : X0 ^ T1 ^ X2]
+ .endif
+
+ schoolbook1_iteration 2 0
+ .if(reduce)
+ pclmulqdq $0x11, GSTAR, PL # PL = [X1 ^ T0 * g*(x)]
+ .endif
+
+ schoolbook1_iteration 1 0
+ .if(reduce)
+ pxor PL, PH
+ movdqa PH, SUM
+ .endif
+
+ schoolbook1_iteration 0 1
+
+ addq $(8*16), OP1
+ addq $(8*16), OP2
+ schoolbook2
+.endm
+
+/*
+ * Compute the polynomial over a partial stride of %rdx blocks, where
+ * 0 < %rdx < NUM_PRECOMPUTE_POWERS
+ */
+.macro partial_stride
+ pxor C, C
+ pxor D, D
+ pxor EF, EF
+ mov BLOCKS_LEFT, TMP
+ shlq $4, TMP
+ mov %rsi, OP2
+ addq $(16*NUM_PRECOMPUTE_POWERS), OP2
+ subq TMP, OP2
+ # Multiply sum by h^N
+ movups (OP2), %xmm0
+ movdqa SUM, %xmm1
+ schoolbook1_noload
+ schoolbook2
+ montgomery_reduction
+ movdqa PH, SUM
+ pxor C, C
+ pxor D, D
+ pxor EF, EF
+ xor IDX, IDX
+.LloopPartial:
+ cmpq BLOCKS_LEFT, IDX # IDX < rdx
+ jae .LloopExitPartial
+
+ movq BLOCKS_LEFT, TMP
+ subq IDX, TMP # TMP = rdx - IDX
+
+ cmp $4, TMP # TMP < 4 ?
+ jl .Llt4Partial
+ schoolbook1 4
+ addq $4, IDX
+ addq $(4*16), OP1
+ addq $(4*16), OP2
+ jmp .LoutPartial
+.Llt4Partial:
+ cmp $3, TMP # TMP < 3 ?
+ jl .Llt3Partial
+ schoolbook1 3
+ addq $3, IDX
+ addq $(3*16), OP1
+ addq $(3*16), OP2
+ jmp .LoutPartial
+.Llt3Partial:
+ cmp $2, TMP # TMP < 2 ?
+ jl .Llt2Partial
+ schoolbook1 2
+ addq $2, IDX
+ addq $(2*16), OP1
+ addq $(2*16), OP2
+ jmp .LoutPartial
+.Llt2Partial:
+ schoolbook1 1 # TMP < 1 ?
+ addq $1, IDX
+ addq $(1*16), OP1
+ addq $(1*16), OP2
+.LoutPartial:
+ jmp .LloopPartial
+.LloopExitPartial:
+ schoolbook2
+ montgomery_reduction
+ pxor PH, SUM
+.endm
+
+/*
+ * Perform Montgomery multiplication in GF(2^128) and store result in op1.
+ *
+ * Computes op1*op2*x^{-128} mod x^128 + x^127 + x^126 + x^121 + 1.
+ * If op1, op2 are in Montgomery form, this computes the Montgomery
+ * form of op1*op2.
+ *
+ * void clmul_polyval_mul(ble128 *op1, const ble128 *op2);
+ */
+SYM_FUNC_START(clmul_polyval_mul)
+ FRAME_BEGIN
+ vmovdqa Lgstar(%rip), GSTAR
+ pxor C, C
+ pxor D, D
+ pxor EF, EF
+ mov %rsi, OP2
+ schoolbook1 1
+ schoolbook2
+ montgomery_reduction
+ movups PH, (%rdi)
+ FRAME_END
+ ret
+SYM_FUNC_END(clmul_polyval_mul)
+
+/*
+ * Perform polynomial evaluation as specified by POLYVAL. Multiplies the value
+ * stored at accumulator by h^k and XORs the evaluated polynomial into it.
+ *
+ * Computes h^k*accumulator + h^kM_0 + ... + h^1M_{k-1} (No constant term)
+ *
+ * rdi (OP1) - pointer to message blocks
+ * rsi - pointer to precomputed key struct
+ * rdx - number of blocks to hash
+ * rcx - location to XOR with evaluated polynomial
+ *
+ * void clmul_polyval_update(const u8 *in, const struct polyhash_key* keys,
+ * size_t nblocks, ble128* accumulator);
+ */
+SYM_FUNC_START(clmul_polyval_update)
+ FRAME_BEGIN
+ vmovdqa Lgstar(%rip), GSTAR
+ movups (%rcx), SUM
+ cmpq $NUM_PRECOMPUTE_POWERS, BLOCKS_LEFT
+ jb .LstrideLoopExit
+ full_stride 0
+ subq $NUM_PRECOMPUTE_POWERS, BLOCKS_LEFT
+.LstrideLoop:
+ cmpq $NUM_PRECOMPUTE_POWERS, BLOCKS_LEFT
+ jb .LstrideLoopExitReduce
+ full_stride 1
+ subq $NUM_PRECOMPUTE_POWERS, BLOCKS_LEFT
+ jmp .LstrideLoop
+.LstrideLoopExitReduce:
+ montgomery_reduction
+ movdqa PH, SUM
+.LstrideLoopExit:
+ test BLOCKS_LEFT, BLOCKS_LEFT
+ je .LskipPartial
+ partial_stride
+.LskipPartial:
+ movups SUM, (%rcx)
+ FRAME_END
+ ret
+SYM_FUNC_END(clmul_polyval_update)
diff --git a/arch/x86/crypto/polyval-clmulni-intel_glue.c b/arch/x86/crypto/polyval-clmulni-intel_glue.c
new file mode 100644
index 000000000000..64a432b67b49
--- /dev/null
+++ b/arch/x86/crypto/polyval-clmulni-intel_glue.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Accelerated POLYVAL implementation with Intel PCLMULQDQ-NI
+ * instructions. This file contains glue code.
+ *
+ * Copyright (c) 2007 Nokia Siemens Networks - Mikko Herranen <[email protected]>
+ * Copyright (c) 2009 Intel Corp.
+ * Author: Huang Ying <[email protected]>
+ * Copyright 2021 Google LLC
+ */
+/*
+ * Glue code based on ghash-clmulni-intel_glue.c.
+ *
+ * This implementation of POLYVAL uses Montgomery multiplication
+ * accelerated by PCLMULQDQ-NI to implement the finite field
+ * operations.
+ *
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/gf128mul.h>
+#include <crypto/internal/hash.h>
+#include <linux/crypto.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/simd.h>
+
+#define POLYVAL_BLOCK_SIZE 16
+#define POLYVAL_DIGEST_SIZE 16
+#define NUM_PRECOMPUTE_POWERS 8
+
+struct polyval_ctx {
+ be128 key_powers[NUM_PRECOMPUTE_POWERS];
+};
+
+struct polyval_desc_ctx {
+ u8 buffer[POLYVAL_BLOCK_SIZE];
+ u32 bytes;
+};
+
+asmlinkage void clmul_polyval_update(const u8 *in, const be128 *keys, size_t
+ nblocks, be128 *accumulator);
+asmlinkage void clmul_polyval_mul(be128 *op1, const be128 *op2);
+
+static int polyval_init(struct shash_desc *desc)
+{
+ struct polyval_desc_ctx *dctx = shash_desc_ctx(desc);
+
+ memset(dctx, 0, sizeof(*dctx));
+
+ return 0;
+}
+
+static int polyval_setkey(struct crypto_shash *tfm,
+ const u8 *key, unsigned int keylen)
+{
+ struct polyval_ctx *ctx = crypto_shash_ctx(tfm);
+ int i;
+
+ if (keylen != POLYVAL_BLOCK_SIZE)
+ return -EINVAL;
+
+ memcpy(&ctx->key_powers[NUM_PRECOMPUTE_POWERS-1], key, sizeof(be128));
+
+ for (i = NUM_PRECOMPUTE_POWERS-2; i >= 0; i--) {
+ memcpy(&ctx->key_powers[i], key, sizeof(be128));
+ clmul_polyval_mul(&ctx->key_powers[i], &ctx->key_powers[i+1]);
+ }
+
+ return 0;
+}
+
+static int polyval_update(struct shash_desc *desc,
+ const u8 *src, unsigned int srclen)
+{
+ struct polyval_desc_ctx *dctx = shash_desc_ctx(desc);
+ struct polyval_ctx *ctx = crypto_shash_ctx(desc->tfm);
+ u8 *dst = dctx->buffer;
+ u8 *pos;
+ unsigned int nblocks;
+ int n;
+
+ kernel_fpu_begin();
+ if (dctx->bytes) {
+ n = min(srclen, dctx->bytes);
+ pos = dst + POLYVAL_BLOCK_SIZE - dctx->bytes;
+
+ dctx->bytes -= n;
+ srclen -= n;
+
+ while (n--)
+ *pos++ ^= *src++;
+
+ if (!dctx->bytes)
+ clmul_polyval_mul((be128 *)dst, &ctx->key_powers[NUM_PRECOMPUTE_POWERS-1]);
+ }
+
+ nblocks = srclen/POLYVAL_BLOCK_SIZE;
+ clmul_polyval_update(src, ctx->key_powers, nblocks, (be128 *)dst);
+ srclen -= nblocks*POLYVAL_BLOCK_SIZE;
+ kernel_fpu_end();
+
+ if (srclen) {
+ dctx->bytes = POLYVAL_BLOCK_SIZE - srclen;
+ src += nblocks*POLYVAL_BLOCK_SIZE;
+ pos = dst;
+ while (srclen--)
+ *pos++ ^= *src++;
+ }
+
+ return 0;
+}
+
+static int polyval_final(struct shash_desc *desc, u8 *dst)
+{
+ struct polyval_desc_ctx *dctx = shash_desc_ctx(desc);
+ struct polyval_ctx *ctx = crypto_shash_ctx(desc->tfm);
+ u8 *buf = dctx->buffer;
+
+ if (dctx->bytes) {
+ kernel_fpu_begin();
+ clmul_polyval_mul((be128 *)dst, &ctx->key_powers[NUM_PRECOMPUTE_POWERS-1]);
+ kernel_fpu_end();
+ }
+
+ dctx->bytes = 0;
+ memcpy(dst, buf, POLYVAL_BLOCK_SIZE);
+
+ return 0;
+}
+
+static struct shash_alg polyval_alg = {
+ .digestsize = POLYVAL_DIGEST_SIZE,
+ .init = polyval_init,
+ .update = polyval_update,
+ .final = polyval_final,
+ .setkey = polyval_setkey,
+ .descsize = sizeof(struct polyval_desc_ctx),
+ .base = {
+ .cra_name = "polyval",
+ .cra_driver_name = "polyval-pclmulqdqni",
+ .cra_priority = 200,
+ .cra_blocksize = POLYVAL_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct polyval_ctx),
+ .cra_module = THIS_MODULE,
+ },
+};
+
+static int __init polyval_mod_init(void)
+{
+ return crypto_register_shash(&polyval_alg);
+}
+
+static void __exit polyval_mod_exit(void)
+{
+ crypto_unregister_shash(&polyval_alg);
+}
+
+subsys_initcall(polyval_mod_init);
+module_exit(polyval_mod_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("POLYVAL hash function accelerated by PCLMULQDQ-NI");
+MODULE_ALIAS_CRYPTO("polyval");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 3cdb6c351062..ecff82b77b42 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -779,6 +779,15 @@ config CRYPTO_POLYVAL
POLYVAL is the hash function used in HCTR2. It is not a general-purpose
cryptographic hash function.

+config CRYPTO_POLYVAL_CLMUL_NI_INTEL
+ tristate "POLYVAL hash function (CLMUL-NI accelerated)"
+ depends on X86 && 64BIT
+ select CRYPTO_POLYVAL
+ help
+ This is the x86_64 CLMUL-NI accelerated implementation of POLYVAL. It is
+ used to efficiently implement HCTR2 on x86-64 processors that support
+ carry-less multiplication instructions.
+
config CRYPTO_POLY1305
tristate "Poly1305 authenticator algorithm"
select CRYPTO_HASH
--
2.35.0.rc0.227.g00780c9af4-goog

2022-02-02 23:49:31

by Eric Biggers

Subject: Re: [RFC PATCH 6/7] crypto: x86/polyval: Add PCLMULQDQ accelerated implementation of POLYVAL

On Mon, Jan 24, 2022 at 07:44:21PM -0600, Nathan Huckleberry wrote:
> Add hardware accelerated version of POLYVAL for x86-64 CPUs with
> PCLMULQDQ support.
>
> This implementation is accelerated using PCLMULQDQ instructions to
> perform the finite field computations. For added efficiency, 8 blocks
> of the plaintext are processed simultaneously by precomputing the first
> 8 powers of the key.
>
> Schoolbook multiplication is used instead of Karatsuba multiplication
> because it was found to be slightly faster on x86-64 machines.
> Montgomery reduction must be used instead of Barrett reduction due to
> the difference in modulus between POLYVAL's field and other finite
> fields.
>
> More information on POLYVAL can be found in the HCTR2 paper:
> Length-preserving encryption with HCTR2:
> https://eprint.iacr.org/2021/1441.pdf
>
> Signed-off-by: Nathan Huckleberry <[email protected]>
> ---
> arch/x86/crypto/Makefile | 3 +
> arch/x86/crypto/polyval-clmulni-intel_asm.S | 319 +++++++++++++++++++

This file is causing a build-time warning:

arch/x86/crypto/polyval-clmulni-intel_asm.o: warning: objtool: .text+0x0: unreachable instruction

- Eric

2022-02-04 00:00:55

by Eric Biggers

Subject: Re: [RFC PATCH 6/7] crypto: x86/polyval: Add PCLMULQDQ accelerated implementation of POLYVAL

[Note that many of these comments will apply to the arm64 version too.]

On Mon, Jan 24, 2022 at 07:44:21PM -0600, Nathan Huckleberry wrote:
> Add hardware accelerated version of POLYVAL for x86-64 CPUs with
> PCLMULQDQ support.
>
> This implementation is accelerated using PCLMULQDQ instructions to
> perform the finite field computations. For added efficiency, 8 blocks
> of the plaintext are processed simultaneously by precomputing the first

plaintext => message

> diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
> index ed187fcd0b01..0214c5f22606 100644
> --- a/arch/x86/crypto/Makefile
> +++ b/arch/x86/crypto/Makefile
> @@ -69,6 +69,9 @@ libblake2s-x86_64-y := blake2s-core.o blake2s-glue.o
> obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
> ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
>
> +obj-$(CONFIG_CRYPTO_POLYVAL_CLMUL_NI_INTEL) += polyval-clmulni-intel.o
> +polyval-clmulni-intel-y := polyval-clmulni-intel_asm.o polyval-clmulni-intel_glue.o
> +

IMO this should be named just polyval-clmulni. Including "intel" is a bit
gratuitous, given that AMD supports these instructions too, and this is
already in the x86 directory. I guess some authors of the existing files
wanted to include their company name. It doesn't actually matter, though;
it's up to you.

> diff --git a/arch/x86/crypto/polyval-clmulni-intel_asm.S b/arch/x86/crypto/polyval-clmulni-intel_asm.S
> new file mode 100644
> index 000000000000..4339b58e610d
> --- /dev/null
> +++ b/arch/x86/crypto/polyval-clmulni-intel_asm.S
> @@ -0,0 +1,319 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright 2021 Google LLC
> + *
> + * Use of this source code is governed by an MIT-style
> + * license that can be found in the LICENSE file or at
> + * https://opensource.org/licenses/MIT.
> + */
> +/*
> + * This is an efficient implementation of POLYVAL using intel PCLMULQDQ-NI
> + * instructions. It works on 8 blocks at a time, computing the 256 degree
> + * polynomial p(x) = h^8m_0 + ... + h^1m_7. It then computes the modular
> + * reduction of p(x) and XORs p(x) with the current digest.
> + */

What does "256 degree polynomial" mean here?

> +/*
> + * Accepts operand lists of length b in rdi and rsi.

In general the first sentence of a comment describing a function or macro should
be a summary of what it does, not some particular detail.

> + * Computes the product of
> + * each rdi,rsi pair then XORs the products into A, B, C, D.

Where are A, B, and rsi used?

> + * If first == 1 then XOR the value of SUM into the first block processed.
> + * This avoids an extra multiplication of SUM and h^N.

first == 1 on the *last* call per 8 blocks. Perhaps it needs a better name?

> + * All other xmm registers clobbered

This doesn't appear to be true; the code relies on GSTAR not being clobbered.

> +.macro schoolbook1_iteration i first
> + .set first, \first
> + .set i, \i
> + movups (16*i)(OP1), %xmm0
> + .if(i == 0 && first == 1)
> + pxor SUM, %xmm0
> + .endif

I don't think the ".set" statements are necessary here. You can just use \i and
\first directly.

> +/*
> + * Computes first schoolbook step of values loaded into xmm0 and xmm1. Used to
> + * multiply intermediate register values rather than memory stored values.
> + *
> + * XORs product into C, D, EF
> + * Preserves SUM
> + * All other xmm registers clobbered
> + */
> +.macro schoolbook1_noload
> + vpclmulqdq $0x01, %xmm0, %xmm1, %xmm2
> + vpxor %xmm2, EF, EF
> + vpclmulqdq $0x00, %xmm0, %xmm1, %xmm3
> + vpxor %xmm3, C, C
> + vpclmulqdq $0x11, %xmm0, %xmm1, %xmm4
> + vpxor %xmm4, D, D
> + vpclmulqdq $0x10, %xmm0, %xmm1, %xmm5
> + vpxor %xmm5, EF, EF
> +.endm

So C holds the low part of the product, EF the middle part, and D the high part.
How about giving these better names, like LO, MID, and HI?
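
E.g.:

	#define LO	%xmm12	/* currently C: low halves of the products */
	#define MID	%xmm14	/* currently EF: middle terms */
	#define HI	%xmm13	/* currently D: high halves of the products */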

> +/*
> + * Computes the 128-bit reduction of PL, PH. Stores the result in PH.
> + *
> + * Clobbers PL, PH, Z, T.
> + * All other xmm registers are preserved.
> + */
> +.macro montgomery_reduction
> + movdqa PL, T
> + pclmulqdq $0x00, GSTAR, T # T = [X0 * g*(x)]
> + pshufd $0b01001110, T, Z # Z = [T0 : T1]
> + pxor Z, PL # PL = [X1 ^ T0 : X0 ^ T1]
> + pxor PL, PH # PH = [X1 ^ T0 ^ X3 : X0 ^ T1 ^ X2]
> + pclmulqdq $0x11, GSTAR, PL # PL = [X1 ^ T0 * g*(x)]
> + pxor PL, PH
> +.endm

This really needs a comment that describes at a high level what is going on --
adding multiples of the reduction polynomial to cancel out the low-order parts.
And also how Montgomery multiplication works in this context. The one-line
comments don't help much, especially since "X" is never defined.

Also, it seems like you've implemented an optimization that avoids a second
pshufd instruction, over the simpler approach of folding 64 bits up twice in the
same way. Can you add a comment that explains this?

Also what do the names T and Z mean? If they're just temporary values, TMP1 and
TMP2 might be better names.

> +/*
> + * Compute schoolbook multiplication for 8 blocks
> + * (M_0h + REDUCE(PL, PH))h^8 + ... + M_{7}h^1 (no constant term)

Shouldn't M_0h be just M_0?

Also, isn't the REDUCE part conditional on \reduce?

> +/*
> + * Perform polynomial evaluation as specified by POLYVAL. Multiplies the value
> + * stored at accumulator by h^k and XORs the evaluated polynomial into it.

What is 'k'?

> + *
> + * Computes h^k*accumulator + h^kM_0 + ... + h^1M_{k-1} (No constant term)
> + *
> + * rdi (OP1) - pointer to message blocks
> + * rsi - pointer to precomputed key struct
> + * rdx - number of blocks to hash
> + * rcx - location to XOR with evaluated polynomial
> + *
> + * void clmul_polyval_update(const u8 *in, const struct polyhash_key* keys,
> + * size_t nblocks, ble128* accumulator);
> + */

struct polyhash_key isn't defined anywhere.

> diff --git a/arch/x86/crypto/polyval-clmulni-intel_glue.c b/arch/x86/crypto/polyval-clmulni-intel_glue.c
> new file mode 100644
> index 000000000000..64a432b67b49
> --- /dev/null
> +++ b/arch/x86/crypto/polyval-clmulni-intel_glue.c
> @@ -0,0 +1,165 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Accelerated POLYVAL implementation with Intel PCLMULQDQ-NI
> + * instructions. This file contains glue code.
> + *
> + * Copyright (c) 2007 Nokia Siemens Networks - Mikko Herranen <[email protected]>
> + * Copyright (c) 2009 Intel Corp.
> + * Author: Huang Ying <[email protected]>
> + * Copyright 2021 Google LLC
> + */
> +/*
> + * Glue code based on ghash-clmulni-intel_glue.c.
> + *
> + * This implementation of POLYVAL uses Montgomery multiplication
> + * accelerated by PCLMULQDQ-NI to implement the finite field
> + * operations.
> + *
> + */
> +
> +#include <crypto/algapi.h>
> +#include <crypto/gf128mul.h>
> +#include <crypto/internal/hash.h>
> +#include <linux/crypto.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <asm/simd.h>
> +
> +#define POLYVAL_BLOCK_SIZE 16
> +#define POLYVAL_DIGEST_SIZE 16

How about including <crypto/polyval.h> (added by an earlier patch) to get these
definitions?

> +#define NUM_PRECOMPUTE_POWERS 8
> +
> +struct polyval_ctx {
> + be128 key_powers[NUM_PRECOMPUTE_POWERS];
> +};

There should be a comment that says what order the key_powers are in.
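
E.g., judging from the setkey loop below, something like:

	/* key_powers[i] = h^(NUM_PRECOMPUTE_POWERS - i), i.e. h^8, ..., h^1 */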

Also why is the type be128? These aren't big endian.

> +static int polyval_setkey(struct crypto_shash *tfm,
> + const u8 *key, unsigned int keylen)
> +{
> + struct polyval_ctx *ctx = crypto_shash_ctx(tfm);
> + int i;
> +
> + if (keylen != POLYVAL_BLOCK_SIZE)
> + return -EINVAL;

This could use a:

BUILD_BUG_ON(POLYVAL_BLOCK_SIZE != sizeof(be128));

> +
> + memcpy(&ctx->key_powers[NUM_PRECOMPUTE_POWERS-1], key, sizeof(be128));
> +
> + for (i = NUM_PRECOMPUTE_POWERS-2; i >= 0; i--) {
> + memcpy(&ctx->key_powers[i], key, sizeof(be128));
> + clmul_polyval_mul(&ctx->key_powers[i], &ctx->key_powers[i+1]);
> + }

It appears this is using the SIMD registers without first executing
kernel_fpu_begin(), which isn't valid.
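
Something like this should do it (sketch, untested):

	kernel_fpu_begin();
	memcpy(&ctx->key_powers[NUM_PRECOMPUTE_POWERS-1], key, sizeof(be128));
	for (i = NUM_PRECOMPUTE_POWERS-2; i >= 0; i--) {
		memcpy(&ctx->key_powers[i], key, sizeof(be128));
		clmul_polyval_mul(&ctx->key_powers[i], &ctx->key_powers[i+1]);
	}
	kernel_fpu_end();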

> +static int polyval_update(struct shash_desc *desc,
> + const u8 *src, unsigned int srclen)
> +{
> + struct polyval_desc_ctx *dctx = shash_desc_ctx(desc);
> + struct polyval_ctx *ctx = crypto_shash_ctx(desc->tfm);
> + u8 *dst = dctx->buffer;
> + u8 *pos;
> + unsigned int nblocks;
> + int n;
> +
> + kernel_fpu_begin();
> + if (dctx->bytes) {
> + n = min(srclen, dctx->bytes);
> + pos = dst + POLYVAL_BLOCK_SIZE - dctx->bytes;
> +
> + dctx->bytes -= n;
> + srclen -= n;
> +
> + while (n--)
> + *pos++ ^= *src++;
> +
> + if (!dctx->bytes)
> + clmul_polyval_mul((be128 *)dst, &ctx->key_powers[NUM_PRECOMPUTE_POWERS-1]);

Casting a u8 pointer to a be128 pointer violates alignment rules. Given that
clmul_polyval_mul() uses the unaligned load/store instructions on this
argument, its type should be a byte pointer.

> +static int polyval_final(struct shash_desc *desc, u8 *dst)
> +{
> + struct polyval_desc_ctx *dctx = shash_desc_ctx(desc);
> + struct polyval_ctx *ctx = crypto_shash_ctx(desc->tfm);
> + u8 *buf = dctx->buffer;
> +
> + if (dctx->bytes) {
> + kernel_fpu_begin();
> + clmul_polyval_mul((be128 *)dst, &ctx->key_powers[NUM_PRECOMPUTE_POWERS-1]);
> + kernel_fpu_end();
> + }

The above call to clmul_polyval_mul() is incorrect as it is reading from *dst
before writing to it. Presumably non-block-multiple messages aren't being
tested? I don't think that such messages make sense, so how about returning an
error in that case instead?
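
I.e., something like (sketch):

	if (dctx->bytes)
		return -EINVAL;

instead of doing the multiplication here.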

> +static struct shash_alg polyval_alg = {
> + .digestsize = POLYVAL_DIGEST_SIZE,
> + .init = polyval_init,
> + .update = polyval_update,
> + .final = polyval_final,
> + .setkey = polyval_setkey,
> + .descsize = sizeof(struct polyval_desc_ctx),
> + .base = {
> + .cra_name = "polyval",
> + .cra_driver_name = "polyval-pclmulqdqni",

How about "polyval-clmulni", like "ghash-clmulni"? pclmulqdqni is a mouthful.

> + .cra_priority = 200,
> + .cra_blocksize = POLYVAL_BLOCK_SIZE,
> + .cra_ctxsize = sizeof(struct polyval_ctx),
> + .cra_module = THIS_MODULE,
> + },
> +};
> +
> +static int __init polyval_mod_init(void)
> +{
> + return crypto_register_shash(&polyval_alg);
> +}
> +
> +static void __exit polyval_mod_exit(void)
> +{
> + crypto_unregister_shash(&polyval_alg);
> +}

Hmm, so this isn't being wrapped with an ahash like the ghash implementation is.
Unfortunately, I don't think that's allowed, since you are assuming that the
code is always called in a context where SIMD instructions are usable. I don't
think that's the case on x86; the other x86 crypto code goes to some length to
avoid this.

Unless anyone else has any better idea, I think you'll have to make the shash an
internal algorithm, and wrap it with an ahash algorithm, like "ghash-clmulni"
does.
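
From memory, the ghash-clmulni approach is an internal shash (cra_name
"__ghash", CRYPTO_ALG_INTERNAL) plus an ahash wrapper whose handlers defer
to cryptd whenever SIMD isn't usable; its update path looks roughly like
this (names adapted to polyval, sketch only):

	static int polyval_async_update(struct ahash_request *req)
	{
		struct ahash_request *cryptd_req = ahash_request_ctx(req);
		struct polyval_async_ctx *ctx =
			crypto_ahash_ctx(crypto_ahash_reqtfm(req));
		struct cryptd_ahash *cryptd_tfm = ctx->cryptd_tfm;

		if (!crypto_simd_usable() ||
		    (in_atomic() && cryptd_ahash_queued(cryptd_tfm))) {
			/* SIMD unusable here; punt to cryptd's task context */
			memcpy(cryptd_req, req, sizeof(*req));
			ahash_request_set_tfm(cryptd_req, &cryptd_tfm->base);
			return crypto_ahash_update(cryptd_req);
		} else {
			struct shash_desc *desc = cryptd_shash_desc(cryptd_req);

			desc->tfm = cryptd_ahash_child(cryptd_tfm);
			return shash_ahash_update(req, desc);
		}
	}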

Ideally you'd refactor the ahash helper code from ghash-clmulni into
crypto/simd.c, as otherwise you'll need to copy+paste essentially.

> +
> +subsys_initcall(polyval_mod_init);
> +module_exit(polyval_mod_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("POLYVAL hash function accelerated by PCLMULQDQ-NI");
> +MODULE_ALIAS_CRYPTO("polyval");

A MODULE_ALIAS_CRYPTO for the cra_driver_name should be added too.
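
E.g., with the rename suggested above:

	MODULE_ALIAS_CRYPTO("polyval-clmulni");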

- Eric