From: Ard Biesheuvel <[email protected]>
The AES-CCM driver was written more than ten years ago, based on the
very first kernel mode NEON API for arm64, which eagerly
preserved/restored the NEON registers on every call to
kernel_neon_begin() and kernel_neon_end(), respectively.
For this reason, the asm helpers were constructed in a way that used
only 6 NEON registers, as the kernel mode NEON API at the time
implemented an optimization where kernel_neon_begin() took an int
denoting the number of NEON registers to preserve/restore. Given that no
actual hardware existed at the time (except perhaps for APM X-Gene 1,
which did not implement the crypto instructions), all of this was based on
premature assumptions.
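(As an aside, the eager API looked roughly like the sketch below;
kernel_neon_begin_partial() is quoted from memory and is long gone, so
treat the exact name as historical trivia rather than gospel.)

    /*
     * Historical arm64 API: only the first N NEON registers were
     * eagerly preserved/restored, which is why the CCM asm confined
     * itself to v0-v5.
     */
    kernel_neon_begin_partial(6);
    /* ... asm helpers touching at most 6 NEON registers ... */
    kernel_neon_end();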
These days, the NEON API is a bit more sophisticated, and does not
bother to preserve/restore anything unless it is needed (e.g., when
context switching or returning to user space). It also no longer
disables preemption. Finally, we've developed some code patterns in the
meantime to deal with tail blocks more cleanly and efficiently.
So let's bring the CCM driver up to date with all of this.
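For reference, the calling pattern the series converges on looks
roughly like this (a sketch only; the real loops live in the individual
patches):

    err = skcipher_walk_aead_encrypt(&walk, req, false);
    if (unlikely(err))
            return err;

    kernel_neon_begin();
    do {
            u32 tail = walk.nbytes % AES_BLOCK_SIZE;

            /* en/decrypt walk.nbytes - tail bytes using NEON */

            if (walk.nbytes)
                    err = skcipher_walk_done(&walk, tail);
    } while (walk.nbytes);
    kernel_neon_end();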
Ard Biesheuvel (8):
crypto: arm64/aes-ccm - Revert "Rewrite skcipher walker loop"
crypto: arm64/aes-ccm - Keep NEON enabled during skcipher walk
crypto: arm64/aes-ccm - Pass short inputs via stack buffer
crypto: arm64/aes-ccm - Replace bytewise tail handling with NEON
permute
crypto: arm64/aes-ccm - Reuse existing MAC update for AAD input
crypto: arm64/aes-ccm - Cache round keys and unroll AES loops
crypto: arm64/aes-ccm - Merge encrypt and decrypt asm routines
crypto: arm64/aes-ccm - Merge finalization into en/decrypt asm helper
arch/arm64/crypto/Kconfig | 1 +
arch/arm64/crypto/aes-ce-ccm-core.S | 270 +++++++-------------
arch/arm64/crypto/aes-ce-ccm-glue.c | 154 +++++++----
arch/arm64/crypto/aes-glue.c | 1 +
4 files changed, 199 insertions(+), 227 deletions(-)
--
2.43.0.275.g3460e3d667-goog
From: Ard Biesheuvel <[email protected]>
This reverts commit 57ead1bf1c54, which updated the CCM code to only
rely on walk.nbytes to check for failures returned from the skcipher
walk API, mostly for the common good rather than to fix a particular
problem in the code.
This change introduces a problem of its own: the skcipher walk is
started with the 'atomic' argument set to false, which means that the
skcipher walk API is permitted to sleep. Subsequently, it invokes
skcipher_walk_done() with preemption disabled on the final iteration of
the loop. This appears to work by accident, but it is arguably a bad
example, and providing a better example was the point of the original
patch.
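In other words, the pre-revert loop had the following shape (simplified
sketch; error handling elided):

    err = skcipher_walk_aead_encrypt(&walk, req, false); /* may sleep */

    kernel_neon_begin();                /* used to disable preemption */
    while (walk.nbytes) {
            u32 tail = walk.nbytes % AES_BLOCK_SIZE;
            bool final = walk.nbytes == walk.total;

            /* en/decrypt using NEON */

            if (!final)
                    kernel_neon_end();
            /*
             * On the final iteration this ran with preemption
             * disabled, even though the walk was started with
             * atomic == false.
             */
            err = skcipher_walk_done(&walk, tail);
            if (!final)
                    kernel_neon_begin();
    }
    kernel_neon_end();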
Given that future changes to the CCM code will rely on the original
behavior of entering the loop even for zero-sized inputs, let's just
revert this change entirely, and proceed from there.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/crypto/aes-ce-ccm-glue.c | 57 +++++++++++---------
1 file changed, 31 insertions(+), 26 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index 25cd3808ecbe..c4f14415f5f0 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -161,39 +161,43 @@ static int ccm_encrypt(struct aead_request *req)
memcpy(buf, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_encrypt(&walk, req, false);
+ if (unlikely(err))
+ return err;
kernel_neon_begin();
if (req->assoclen)
ccm_calculate_auth_mac(req, mac);
- while (walk.nbytes) {
+ do {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
- bool final = walk.nbytes == walk.total;
- if (final)
+ if (walk.nbytes == walk.total)
tail = 0;
ce_aes_ccm_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
walk.nbytes - tail, ctx->key_enc,
num_rounds(ctx), mac, walk.iv);
- if (!final)
- kernel_neon_end();
- err = skcipher_walk_done(&walk, tail);
- if (!final)
- kernel_neon_begin();
- }
+ if (walk.nbytes == walk.total)
+ ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
- ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
+ kernel_neon_end();
- kernel_neon_end();
+ if (walk.nbytes) {
+ err = skcipher_walk_done(&walk, tail);
+ if (unlikely(err))
+ return err;
+ if (unlikely(walk.nbytes))
+ kernel_neon_begin();
+ }
+ } while (walk.nbytes);
/* copy authtag to end of dst */
scatterwalk_map_and_copy(mac, req->dst, req->assoclen + req->cryptlen,
crypto_aead_authsize(aead), 1);
- return err;
+ return 0;
}
static int ccm_decrypt(struct aead_request *req)
@@ -215,36 +219,37 @@ static int ccm_decrypt(struct aead_request *req)
memcpy(buf, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_decrypt(&walk, req, false);
+ if (unlikely(err))
+ return err;
kernel_neon_begin();
if (req->assoclen)
ccm_calculate_auth_mac(req, mac);
- while (walk.nbytes) {
+ do {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
- bool final = walk.nbytes == walk.total;
- if (final)
+ if (walk.nbytes == walk.total)
tail = 0;
ce_aes_ccm_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
walk.nbytes - tail, ctx->key_enc,
num_rounds(ctx), mac, walk.iv);
- if (!final)
- kernel_neon_end();
- err = skcipher_walk_done(&walk, tail);
- if (!final)
- kernel_neon_begin();
- }
+ if (walk.nbytes == walk.total)
+ ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
- ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
+ kernel_neon_end();
- kernel_neon_end();
-
- if (unlikely(err))
- return err;
+ if (walk.nbytes) {
+ err = skcipher_walk_done(&walk, tail);
+ if (unlikely(err))
+ return err;
+ if (unlikely(walk.nbytes))
+ kernel_neon_begin();
+ }
+ } while (walk.nbytes);
/* compare calculated auth tag with the stored one */
scatterwalk_map_and_copy(buf, req->src,
--
2.43.0.275.g3460e3d667-goog
From: Ard Biesheuvel <[email protected]>
Now that kernel mode NEON no longer disables preemption, we no longer
have to take care to disable and re-enable use of the NEON when calling
into the skcipher walk API. So just keep it enabled until done.
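The delta is easiest to see side by side (sketch):

    /* before: NEON disabled around every call into the walk API */
    kernel_neon_end();
    err = skcipher_walk_done(&walk, tail);
    kernel_neon_begin();

    /*
     * after: preemption remains enabled inside the NEON region, so
     * skcipher_walk_done() can be called directly, and the NEON unit
     * stays enabled until the walk completes
     */
    err = skcipher_walk_done(&walk, tail);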
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/crypto/aes-ce-ccm-glue.c | 22 +++++++++-----------
1 file changed, 10 insertions(+), 12 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index c4f14415f5f0..b177ebea7d09 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -182,17 +182,16 @@ static int ccm_encrypt(struct aead_request *req)
if (walk.nbytes == walk.total)
ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
- kernel_neon_end();
-
if (walk.nbytes) {
err = skcipher_walk_done(&walk, tail);
- if (unlikely(err))
- return err;
- if (unlikely(walk.nbytes))
- kernel_neon_begin();
}
} while (walk.nbytes);
+ kernel_neon_end();
+
+ if (unlikely(err))
+ return err;
+
/* copy authtag to end of dst */
scatterwalk_map_and_copy(mac, req->dst, req->assoclen + req->cryptlen,
crypto_aead_authsize(aead), 1);
@@ -240,17 +239,16 @@ static int ccm_decrypt(struct aead_request *req)
if (walk.nbytes == walk.total)
ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
- kernel_neon_end();
-
if (walk.nbytes) {
err = skcipher_walk_done(&walk, tail);
- if (unlikely(err))
- return err;
- if (unlikely(walk.nbytes))
- kernel_neon_begin();
}
} while (walk.nbytes);
+ kernel_neon_end();
+
+ if (unlikely(err))
+ return err;
+
/* compare calculated auth tag with the stored one */
scatterwalk_map_and_copy(buf, req->src,
req->assoclen + req->cryptlen - authsize,
--
2.43.0.275.g3460e3d667-goog
From: Ard Biesheuvel <[email protected]>
In preparation for optimizing the CCM core asm code using permutation
vectors and overlapping loads and stores, ensure that inputs shorter
than an AES block are passed via a buffer on the stack, in a way that
positions the data at the end of a 16 byte buffer. This removes
the need for the asm code to reason about a rare corner case where the
tail of the data cannot be read/written using a single NEON load/store
instruction.
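Concretely, short inputs are bounced via a 16 byte stack buffer, as in
the hunks below:

    u8 buf[AES_BLOCK_SIZE];

    if (unlikely(walk.total < AES_BLOCK_SIZE))
            /*
             * Place the data at the *end* of the buffer, so that a
             * single full-block NEON load/store ending at the last
             * byte of the data is always possible.
             */
            src = dst = memcpy(buf + sizeof(buf) - walk.total,
                               src, walk.total);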
While at it, tweak the copyright header and authorship to bring it up to
date.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/crypto/aes-ce-ccm-glue.c | 57 ++++++++++++++------
1 file changed, 40 insertions(+), 17 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index b177ebea7d09..2f4e6a318fcd 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -1,8 +1,11 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
- * aes-ccm-glue.c - AES-CCM transform for ARMv8 with Crypto Extensions
+ * aes-ce-ccm-glue.c - AES-CCM transform for ARMv8 with Crypto Extensions
*
- * Copyright (C) 2013 - 2017 Linaro Ltd <[email protected]>
+ * Copyright (C) 2013 - 2017 Linaro Ltd.
+ * Copyright (C) 2024 Google LLC
+ *
+ * Author: Ard Biesheuvel <[email protected]>
*/
#include <asm/neon.h>
@@ -149,7 +152,7 @@ static int ccm_encrypt(struct aead_request *req)
struct crypto_aes_ctx *ctx = crypto_aead_ctx(aead);
struct skcipher_walk walk;
u8 __aligned(8) mac[AES_BLOCK_SIZE];
- u8 buf[AES_BLOCK_SIZE];
+ u8 orig_iv[AES_BLOCK_SIZE];
u32 len = req->cryptlen;
int err;
@@ -158,7 +161,7 @@ static int ccm_encrypt(struct aead_request *req)
return err;
/* preserve the original iv for the final round */
- memcpy(buf, req->iv, AES_BLOCK_SIZE);
+ memcpy(orig_iv, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_encrypt(&walk, req, false);
if (unlikely(err))
@@ -171,16 +174,26 @@ static int ccm_encrypt(struct aead_request *req)
do {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ u8 buf[AES_BLOCK_SIZE];
if (walk.nbytes == walk.total)
tail = 0;
- ce_aes_ccm_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- walk.nbytes - tail, ctx->key_enc,
- num_rounds(ctx), mac, walk.iv);
+ if (unlikely(walk.total < AES_BLOCK_SIZE))
+ src = dst = memcpy(buf + sizeof(buf) - walk.total,
+ src, walk.total);
+
+ ce_aes_ccm_encrypt(dst, src, walk.nbytes - tail,
+ ctx->key_enc, num_rounds(ctx),
+ mac, walk.iv);
+
+ if (unlikely(walk.total < AES_BLOCK_SIZE))
+ memcpy(walk.dst.virt.addr, dst, walk.total);
if (walk.nbytes == walk.total)
- ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
+ ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
if (walk.nbytes) {
err = skcipher_walk_done(&walk, tail);
@@ -206,7 +219,7 @@ static int ccm_decrypt(struct aead_request *req)
unsigned int authsize = crypto_aead_authsize(aead);
struct skcipher_walk walk;
u8 __aligned(8) mac[AES_BLOCK_SIZE];
- u8 buf[AES_BLOCK_SIZE];
+ u8 orig_iv[AES_BLOCK_SIZE];
u32 len = req->cryptlen - authsize;
int err;
@@ -215,7 +228,7 @@ static int ccm_decrypt(struct aead_request *req)
return err;
/* preserve the original iv for the final round */
- memcpy(buf, req->iv, AES_BLOCK_SIZE);
+ memcpy(orig_iv, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_decrypt(&walk, req, false);
if (unlikely(err))
@@ -228,16 +241,26 @@ static int ccm_decrypt(struct aead_request *req)
do {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ u8 buf[AES_BLOCK_SIZE];
if (walk.nbytes == walk.total)
tail = 0;
- ce_aes_ccm_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- walk.nbytes - tail, ctx->key_enc,
- num_rounds(ctx), mac, walk.iv);
+ if (unlikely(walk.total < AES_BLOCK_SIZE))
+ src = dst = memcpy(buf + sizeof(buf) - walk.total,
+ src, walk.total);
+
+ ce_aes_ccm_decrypt(dst, src, walk.nbytes - tail,
+ ctx->key_enc, num_rounds(ctx),
+ mac, walk.iv);
+
+ if (unlikely(walk.total < AES_BLOCK_SIZE))
+ memcpy(walk.dst.virt.addr, dst, walk.total);
if (walk.nbytes == walk.total)
- ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
+ ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
if (walk.nbytes) {
err = skcipher_walk_done(&walk, tail);
@@ -250,11 +273,11 @@ static int ccm_decrypt(struct aead_request *req)
return err;
/* compare calculated auth tag with the stored one */
- scatterwalk_map_and_copy(buf, req->src,
+ scatterwalk_map_and_copy(orig_iv, req->src,
req->assoclen + req->cryptlen - authsize,
authsize, 0);
- if (crypto_memneq(mac, buf, authsize))
+ if (crypto_memneq(mac, orig_iv, authsize))
return -EBADMSG;
return 0;
}
@@ -293,6 +316,6 @@ module_init(aes_mod_init);
module_exit(aes_mod_exit);
MODULE_DESCRIPTION("Synchronous AES in CCM mode using ARMv8 Crypto Extensions");
-MODULE_AUTHOR("Ard Biesheuvel <[email protected]>");
+MODULE_AUTHOR("Ard Biesheuvel <[email protected]>");
MODULE_LICENSE("GPL v2");
MODULE_ALIAS_CRYPTO("ccm(aes)");
--
2.43.0.275.g3460e3d667-goog
From: Ard Biesheuvel <[email protected]>
Implement the CCM tail handling using a single sequence that uses
permute vectors and overlapping loads and stores, rather than processing
the tail byte by byte in a loop using scalar operations. This is
more efficient, even though the measured speedup is only around 1-2% on
the CPUs I have tried.
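The trick relies on the AArch64 tbl/tbx byte-table semantics; a scalar
model (illustrative C, not part of the patch) is:

    /*
     * tbl: each result byte is tab[idx] when idx is in range, and 0
     * otherwise; tbx is identical except that out-of-range indices
     * leave the destination byte unchanged. Indexing into .Lpermute
     * at an offset derived from the tail length thus yields vectors
     * that shift bytes up or down a register, with the 0xff filler
     * bytes meaning "zero" (tbl) or "keep" (tbx).
     */
    static uint8_t tbl_byte(const uint8_t tab[16], uint8_t idx)
    {
            return idx < 16 ? tab[idx] : 0;
    }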
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/crypto/aes-ce-ccm-core.S | 59 +++++++++++++-------
arch/arm64/crypto/aes-ce-ccm-glue.c | 20 +++----
2 files changed, 48 insertions(+), 31 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index b03f7f71f893..b21a9b759ab2 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -1,8 +1,11 @@
/* SPDX-License-Identifier: GPL-2.0-only */
/*
- * aesce-ccm-core.S - AES-CCM transform for ARMv8 with Crypto Extensions
+ * aes-ce-ccm-core.S - AES-CCM transform for ARMv8 with Crypto Extensions
*
- * Copyright (C) 2013 - 2017 Linaro Ltd <[email protected]>
+ * Copyright (C) 2013 - 2017 Linaro Ltd.
+ * Copyright (C) 2024 Google LLC
+ *
+ * Author: Ard Biesheuvel <[email protected]>
*/
#include <linux/linkage.h>
@@ -168,13 +171,13 @@ CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
ld1 {v2.16b}, [x1], #16 /* load next input block */
.if \enc == 1
eor v2.16b, v2.16b, v5.16b /* final round enc+mac */
- eor v1.16b, v1.16b, v2.16b /* xor with crypted ctr */
+ eor v6.16b, v1.16b, v2.16b /* xor with crypted ctr */
.else
eor v2.16b, v2.16b, v1.16b /* xor with crypted ctr */
- eor v1.16b, v2.16b, v5.16b /* final round enc */
+ eor v6.16b, v2.16b, v5.16b /* final round enc */
.endif
eor v0.16b, v0.16b, v2.16b /* xor mac with pt ^ rk[last] */
- st1 {v1.16b}, [x0], #16 /* write output block */
+ st1 {v6.16b}, [x0], #16 /* write output block */
bne 0b
CPU_LE( rev x8, x8 )
st1 {v0.16b}, [x5] /* store mac */
@@ -183,25 +186,31 @@ CPU_LE( rev x8, x8 )
6: eor v0.16b, v0.16b, v5.16b /* final round mac */
eor v1.16b, v1.16b, v5.16b /* final round enc */
- st1 {v0.16b}, [x5] /* store mac */
- add w2, w2, #16 /* process partial tail block */
-7: ldrb w9, [x1], #1 /* get 1 byte of input */
- umov w6, v1.b[0] /* get top crypted ctr byte */
- umov w7, v0.b[0] /* get top mac byte */
+
+ add x1, x1, w2, sxtw /* rewind the input pointer (w2 < 0) */
+ add x0, x0, w2, sxtw /* rewind the output pointer */
+
+ adr_l x8, .Lpermute /* load permute vectors */
+ add x9, x8, w2, sxtw
+ sub x8, x8, w2, sxtw
+ ld1 {v7.16b-v8.16b}, [x9]
+ ld1 {v9.16b}, [x8]
+
+ ld1 {v2.16b}, [x1] /* load a full block of input */
+ tbl v1.16b, {v1.16b}, v7.16b /* move keystream to end of register */
.if \enc == 1
- eor w7, w7, w9
- eor w9, w9, w6
+ tbl v7.16b, {v2.16b}, v9.16b /* copy plaintext to start of v7 */
+ eor v2.16b, v2.16b, v1.16b /* encrypt partial input block */
.else
- eor w9, w9, w6
- eor w7, w7, w9
+ eor v2.16b, v2.16b, v1.16b /* decrypt partial input block */
+ tbl v7.16b, {v2.16b}, v9.16b /* copy plaintext to start of v7 */
.endif
- strb w9, [x0], #1 /* store out byte */
- strb w7, [x5], #1 /* store mac byte */
- subs w2, w2, #1
- beq 5b
- ext v0.16b, v0.16b, v0.16b, #1 /* shift out mac byte */
- ext v1.16b, v1.16b, v1.16b, #1 /* shift out ctr byte */
- b 7b
+ eor v0.16b, v0.16b, v7.16b /* fold plaintext into mac */
+ tbx v2.16b, {v6.16b}, v8.16b /* insert output from previous iteration */
+
+ st1 {v0.16b}, [x5] /* store mac */
+ st1 {v2.16b}, [x0] /* store output block */
+ ret
.endm
/*
@@ -219,3 +228,11 @@ SYM_FUNC_END(ce_aes_ccm_encrypt)
SYM_FUNC_START(ce_aes_ccm_decrypt)
aes_ccm_do_crypt 0
SYM_FUNC_END(ce_aes_ccm_decrypt)
+
+ .section ".rodata", "a"
+ .align 6
+ .fill 15, 1, 0xff
+.Lpermute:
+ .byte 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
+ .byte 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf
+ .fill 15, 1, 0xff
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index 2f4e6a318fcd..4710e59075f5 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -181,16 +181,16 @@ static int ccm_encrypt(struct aead_request *req)
if (walk.nbytes == walk.total)
tail = 0;
- if (unlikely(walk.total < AES_BLOCK_SIZE))
- src = dst = memcpy(buf + sizeof(buf) - walk.total,
- src, walk.total);
+ if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
+ src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
+ src, walk.nbytes);
ce_aes_ccm_encrypt(dst, src, walk.nbytes - tail,
ctx->key_enc, num_rounds(ctx),
mac, walk.iv);
- if (unlikely(walk.total < AES_BLOCK_SIZE))
- memcpy(walk.dst.virt.addr, dst, walk.total);
+ if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
+ memcpy(walk.dst.virt.addr, dst, walk.nbytes);
if (walk.nbytes == walk.total)
ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
@@ -248,16 +248,16 @@ static int ccm_decrypt(struct aead_request *req)
if (walk.nbytes == walk.total)
tail = 0;
- if (unlikely(walk.total < AES_BLOCK_SIZE))
- src = dst = memcpy(buf + sizeof(buf) - walk.total,
- src, walk.total);
+ if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
+ src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
+ src, walk.nbytes);
ce_aes_ccm_decrypt(dst, src, walk.nbytes - tail,
ctx->key_enc, num_rounds(ctx),
mac, walk.iv);
- if (unlikely(walk.total < AES_BLOCK_SIZE))
- memcpy(walk.dst.virt.addr, dst, walk.total);
+ if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
+ memcpy(walk.dst.virt.addr, dst, walk.nbytes);
if (walk.nbytes == walk.total)
ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
--
2.43.0.275.g3460e3d667-goog
From: Ard Biesheuvel <[email protected]>
CCM combines the counter (CTR) encryption mode with a MAC based on the
same block cipher. This MAC construction is a bit clunky: it invokes the
block cipher in a way that cannot be parallelized, resulting in poor CPU
pipeline efficiency.
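Concretely, the MAC in CCM is a CBC-MAC, whose update is a serial
chain; a sketch using the kernel's generic AES library helpers:

    /*
     * Each block cipher invocation consumes the previous MAC value,
     * so the blocks cannot be processed in parallel.
     */
    for (i = 0; i < blocks; i++) {
            crypto_xor(mac, in + i * AES_BLOCK_SIZE, AES_BLOCK_SIZE);
            aes_encrypt(ctx, mac, mac);     /* mac = E_K(mac ^ block) */
    }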
The arm64 CCM code mitigates this by interleaving the encryption and MAC
at the AES round level, resulting in a substantial speedup. But this
approach does not apply to the additional authenticated data (AAD) which
is not encrypted.
This means the special asm routine dealing with the AAD is no better
than the MAC update routine used by the arm64 AES block encryption
driver, so let's reuse that, and drop the special AES-CCM version.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/crypto/Kconfig | 1 +
arch/arm64/crypto/aes-ce-ccm-core.S | 71 --------------------
arch/arm64/crypto/aes-ce-ccm-glue.c | 49 +++++++++++---
arch/arm64/crypto/aes-glue.c | 1 +
4 files changed, 43 insertions(+), 79 deletions(-)
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index eb7b423ba463..e7d9bd8e4709 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -268,6 +268,7 @@ config CRYPTO_AES_ARM64_CE_CCM
depends on ARM64 && KERNEL_MODE_NEON
select CRYPTO_ALGAPI
select CRYPTO_AES_ARM64_CE
+ select CRYPTO_AES_ARM64_CE_BLK
select CRYPTO_AEAD
select CRYPTO_LIB_AES
help
diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index b21a9b759ab2..0132872bd780 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -14,77 +14,6 @@
.text
.arch armv8-a+crypto
- /*
- * u32 ce_aes_ccm_auth_data(u8 mac[], u8 const in[], u32 abytes,
- * u32 macp, u8 const rk[], u32 rounds);
- */
-SYM_FUNC_START(ce_aes_ccm_auth_data)
- ld1 {v0.16b}, [x0] /* load mac */
- cbz w3, 1f
- sub w3, w3, #16
- eor v1.16b, v1.16b, v1.16b
-0: ldrb w7, [x1], #1 /* get 1 byte of input */
- subs w2, w2, #1
- add w3, w3, #1
- ins v1.b[0], w7
- ext v1.16b, v1.16b, v1.16b, #1 /* rotate in the input bytes */
- beq 8f /* out of input? */
- cbnz w3, 0b
- eor v0.16b, v0.16b, v1.16b
-1: ld1 {v3.4s}, [x4] /* load first round key */
- prfm pldl1strm, [x1]
- cmp w5, #12 /* which key size? */
- add x6, x4, #16
- sub w7, w5, #2 /* modified # of rounds */
- bmi 2f
- bne 5f
- mov v5.16b, v3.16b
- b 4f
-2: mov v4.16b, v3.16b
- ld1 {v5.4s}, [x6], #16 /* load 2nd round key */
-3: aese v0.16b, v4.16b
- aesmc v0.16b, v0.16b
-4: ld1 {v3.4s}, [x6], #16 /* load next round key */
- aese v0.16b, v5.16b
- aesmc v0.16b, v0.16b
-5: ld1 {v4.4s}, [x6], #16 /* load next round key */
- subs w7, w7, #3
- aese v0.16b, v3.16b
- aesmc v0.16b, v0.16b
- ld1 {v5.4s}, [x6], #16 /* load next round key */
- bpl 3b
- aese v0.16b, v4.16b
- subs w2, w2, #16 /* last data? */
- eor v0.16b, v0.16b, v5.16b /* final round */
- bmi 6f
- ld1 {v1.16b}, [x1], #16 /* load next input block */
- eor v0.16b, v0.16b, v1.16b /* xor with mac */
- bne 1b
-6: st1 {v0.16b}, [x0] /* store mac */
- beq 10f
- adds w2, w2, #16
- beq 10f
- mov w3, w2
-7: ldrb w7, [x1], #1
- umov w6, v0.b[0]
- eor w6, w6, w7
- strb w6, [x0], #1
- subs w2, w2, #1
- beq 10f
- ext v0.16b, v0.16b, v0.16b, #1 /* rotate out the mac bytes */
- b 7b
-8: cbz w3, 91f
- mov w7, w3
- add w3, w3, #16
-9: ext v1.16b, v1.16b, v1.16b, #1
- adds w7, w7, #1
- bne 9b
-91: eor v0.16b, v0.16b, v1.16b
- st1 {v0.16b}, [x0]
-10: mov w0, w3
- ret
-SYM_FUNC_END(ce_aes_ccm_auth_data)
-
/*
* void ce_aes_ccm_final(u8 mac[], u8 const ctr[], u8 const rk[],
* u32 rounds);
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index 4710e59075f5..ed3d79e05112 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -18,6 +18,8 @@
#include "aes-ce-setkey.h"
+MODULE_IMPORT_NS(CRYPTO_INTERNAL);
+
static int num_rounds(struct crypto_aes_ctx *ctx)
{
/*
@@ -30,8 +32,9 @@ static int num_rounds(struct crypto_aes_ctx *ctx)
return 6 + ctx->key_length / 4;
}
-asmlinkage u32 ce_aes_ccm_auth_data(u8 mac[], u8 const in[], u32 abytes,
- u32 macp, u32 const rk[], u32 rounds);
+asmlinkage u32 ce_aes_mac_update(u8 const in[], u32 const rk[], int rounds,
+ int blocks, u8 dg[], int enc_before,
+ int enc_after);
asmlinkage void ce_aes_ccm_encrypt(u8 out[], u8 const in[], u32 cbytes,
u32 const rk[], u32 rounds, u8 mac[],
@@ -97,6 +100,41 @@ static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
return 0;
}
+static u32 ce_aes_ccm_auth_data(u8 mac[], u8 const in[], u32 abytes,
+ u32 macp, u32 const rk[], u32 rounds)
+{
+ int enc_after = (macp + abytes) % AES_BLOCK_SIZE;
+
+ do {
+ u32 blocks = abytes / AES_BLOCK_SIZE;
+
+ if (macp == AES_BLOCK_SIZE || (!macp && blocks > 0)) {
+ u32 rem = ce_aes_mac_update(in, rk, rounds, blocks, mac,
+ macp, enc_after);
+ u32 adv = (blocks - rem) * AES_BLOCK_SIZE;
+
+ macp = enc_after ? 0 : AES_BLOCK_SIZE;
+ in += adv;
+ abytes -= adv;
+
+ if (unlikely(rem)) {
+ kernel_neon_end();
+ kernel_neon_begin();
+ macp = 0;
+ }
+ } else {
+ u32 l = min(AES_BLOCK_SIZE - macp, abytes);
+
+ crypto_xor(&mac[macp], in, l);
+ in += l;
+ macp += l;
+ abytes -= l;
+ }
+ } while (abytes > 0);
+
+ return macp;
+}
+
static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
{
struct crypto_aead *aead = crypto_aead_reqtfm(req);
@@ -104,7 +142,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
struct __packed { __be16 l; __be32 h; u16 len; } ltag;
struct scatter_walk walk;
u32 len = req->assoclen;
- u32 macp = 0;
+ u32 macp = AES_BLOCK_SIZE;
/* prepend the AAD with a length tag */
if (len < 0xff00) {
@@ -128,16 +166,11 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
scatterwalk_start(&walk, sg_next(walk.sg));
n = scatterwalk_clamp(&walk, len);
}
- n = min_t(u32, n, SZ_4K); /* yield NEON at least every 4k */
p = scatterwalk_map(&walk);
macp = ce_aes_ccm_auth_data(mac, p, n, macp, ctx->key_enc,
num_rounds(ctx));
- if (len / SZ_4K > (len - n) / SZ_4K) {
- kernel_neon_end();
- kernel_neon_begin();
- }
len -= n;
scatterwalk_unmap(p);
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 162787c7aa86..a147e847a5a1 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -1048,6 +1048,7 @@ static int __init aes_init(void)
#ifdef USE_V8_CRYPTO_EXTENSIONS
module_cpu_feature_match(AES, aes_init);
+EXPORT_SYMBOL_NS(ce_aes_mac_update, CRYPTO_INTERNAL);
#else
module_init(aes_init);
EXPORT_SYMBOL(neon_aes_ecb_encrypt);
--
2.43.0.275.g3460e3d667-goog
From: Ard Biesheuvel <[email protected]>
The CCM code as originally written attempted to use as few NEON
registers as possible, to avoid having to eagerly preserve/restore the
entire NEON register file at every call to kernel_neon_begin/end. At
that time, this API took the number of NEON registers as a parameter,
and preserved/restored only that many registers.
Today, the NEON register file is restored lazily, and the old API is
long gone. This means we can use as many NEON registers as we can make
meaningful use of, which means in the AES case that we can keep all
round keys in registers rather than reloading each of them for each AES
block processed.
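The effect is analogous to hoisting loop-invariant loads, as in this C
sketch (aes_round(), load_round_keys() and vreg[] are hypothetical
stand-ins, purely illustrative):

    /* before: the round keys were re-read from memory for every block */
    for (i = 0; i < blocks; i++)
            for (r = 0; r < rounds; r++)
                    aes_round(state, rk[r]);        /* load per round */

    /* after: the whole key schedule is parked in v3-v5/v10-v21 once */
    load_round_keys(rk, rounds);
    for (i = 0; i < blocks; i++)
            for (r = 0; r < rounds; r++)
                    aes_round(state, vreg[r]);      /* registers only */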
On Cortex-A53, this results in a speedup of more than 50% (from 4
cycles per byte down to 2.6 cycles per byte).
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/crypto/aes-ce-ccm-core.S | 95 ++++++++------------
1 file changed, 38 insertions(+), 57 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 0132872bd780..0ec59fc4ef3e 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -14,40 +14,46 @@
.text
.arch armv8-a+crypto
+ .macro load_round_keys, rk, nr, tmp
+ sub w\tmp, \nr, #10
+ add \tmp, \rk, w\tmp, sxtw #4
+ ld1 {v10.4s-v13.4s}, [\rk]
+ ld1 {v14.4s-v17.4s}, [\tmp], #64
+ ld1 {v18.4s-v21.4s}, [\tmp], #64
+ ld1 {v3.4s-v5.4s}, [\tmp]
+ .endm
+
+ .macro dround, va, vb, vk
+ aese \va\().16b, \vk\().16b
+ aesmc \va\().16b, \va\().16b
+ aese \vb\().16b, \vk\().16b
+ aesmc \vb\().16b, \vb\().16b
+ .endm
+
+ .macro aes_encrypt, va, vb, nr
+ tbz \nr, #2, .L\@
+ dround \va, \vb, v10
+ dround \va, \vb, v11
+ tbz \nr, #1, .L\@
+ dround \va, \vb, v12
+ dround \va, \vb, v13
+.L\@: .irp v, v14, v15, v16, v17, v18, v19, v20, v21, v3
+ dround \va, \vb, \v
+ .endr
+ aese \va\().16b, v4.16b
+ aese \vb\().16b, v4.16b
+ .endm
+
/*
* void ce_aes_ccm_final(u8 mac[], u8 const ctr[], u8 const rk[],
* u32 rounds);
*/
SYM_FUNC_START(ce_aes_ccm_final)
- ld1 {v3.4s}, [x2], #16 /* load first round key */
ld1 {v0.16b}, [x0] /* load mac */
- cmp w3, #12 /* which key size? */
- sub w3, w3, #2 /* modified # of rounds */
ld1 {v1.16b}, [x1] /* load 1st ctriv */
- bmi 0f
- bne 3f
- mov v5.16b, v3.16b
- b 2f
-0: mov v4.16b, v3.16b
-1: ld1 {v5.4s}, [x2], #16 /* load next round key */
- aese v0.16b, v4.16b
- aesmc v0.16b, v0.16b
- aese v1.16b, v4.16b
- aesmc v1.16b, v1.16b
-2: ld1 {v3.4s}, [x2], #16 /* load next round key */
- aese v0.16b, v5.16b
- aesmc v0.16b, v0.16b
- aese v1.16b, v5.16b
- aesmc v1.16b, v1.16b
-3: ld1 {v4.4s}, [x2], #16 /* load next round key */
- subs w3, w3, #3
- aese v0.16b, v3.16b
- aesmc v0.16b, v0.16b
- aese v1.16b, v3.16b
- aesmc v1.16b, v1.16b
- bpl 1b
- aese v0.16b, v4.16b
- aese v1.16b, v4.16b
+
+ aes_encrypt v0, v1, w3
+
/* final round key cancels out */
eor v0.16b, v0.16b, v1.16b /* en-/decrypt the mac */
st1 {v0.16b}, [x0] /* store result */
@@ -55,6 +61,8 @@ SYM_FUNC_START(ce_aes_ccm_final)
SYM_FUNC_END(ce_aes_ccm_final)
.macro aes_ccm_do_crypt,enc
+ load_round_keys x3, w4, x10
+
cbz x2, 5f
ldr x8, [x6, #8] /* load lower ctr */
ld1 {v0.16b}, [x5] /* load mac */
@@ -64,37 +72,10 @@ CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
prfm pldl1strm, [x1]
add x8, x8, #1
rev x9, x8
- cmp w4, #12 /* which key size? */
- sub w7, w4, #2 /* get modified # of rounds */
ins v1.d[1], x9 /* no carry in lower ctr */
- ld1 {v3.4s}, [x3] /* load first round key */
- add x10, x3, #16
- bmi 1f
- bne 4f
- mov v5.16b, v3.16b
- b 3f
-1: mov v4.16b, v3.16b
- ld1 {v5.4s}, [x10], #16 /* load 2nd round key */
-2: /* inner loop: 3 rounds, 2x interleaved */
- aese v0.16b, v4.16b
- aesmc v0.16b, v0.16b
- aese v1.16b, v4.16b
- aesmc v1.16b, v1.16b
-3: ld1 {v3.4s}, [x10], #16 /* load next round key */
- aese v0.16b, v5.16b
- aesmc v0.16b, v0.16b
- aese v1.16b, v5.16b
- aesmc v1.16b, v1.16b
-4: ld1 {v4.4s}, [x10], #16 /* load next round key */
- subs w7, w7, #3
- aese v0.16b, v3.16b
- aesmc v0.16b, v0.16b
- aese v1.16b, v3.16b
- aesmc v1.16b, v1.16b
- ld1 {v5.4s}, [x10], #16 /* load next round key */
- bpl 2b
- aese v0.16b, v4.16b
- aese v1.16b, v4.16b
+
+ aes_encrypt v0, v1, w4
+
subs w2, w2, #16
bmi 6f /* partial block? */
ld1 {v2.16b}, [x1], #16 /* load next input block */
--
2.43.0.275.g3460e3d667-goog
From: Ard Biesheuvel <[email protected]>
The encryption and decryption code paths are mostly identical, except
for a small difference where the plaintext input into the MAC is taken
from either the input or the output block.
We can handle this difference quite easily with a vector bit select and
a few additional XORs, without the need for branches. This way, the
same asm helper can be used on both the encrypt and decrypt code paths.
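Per byte, the NEON bsl instruction behaves like this scalar model
(illustrative only):

    /*
     * bsl vd, vn, vm: vd = (vd & vn) | (~vd & vm), i.e. the mask in
     * vd picks bits from vn where set and from vm where clear. The
     * entry points preload the mask with all ones (encrypt: feed the
     * input block into the MAC) or all zeroes (decrypt: feed the
     * output block into the MAC).
     */
    static inline uint8_t bsl(uint8_t sel, uint8_t a, uint8_t b)
    {
            return (sel & a) | (~sel & b);
    }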
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/crypto/aes-ce-ccm-core.S | 41 +++++++++-----------
1 file changed, 18 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 0ec59fc4ef3e..75be3157bae1 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -60,7 +60,7 @@ SYM_FUNC_START(ce_aes_ccm_final)
ret
SYM_FUNC_END(ce_aes_ccm_final)
- .macro aes_ccm_do_crypt,enc
+SYM_FUNC_START_LOCAL(aes_ccm_do_crypt)
load_round_keys x3, w4, x10
cbz x2, 5f
@@ -76,28 +76,24 @@ CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
aes_encrypt v0, v1, w4
+ eor v0.16b, v0.16b, v5.16b /* final round mac */
+ eor v1.16b, v1.16b, v5.16b /* final round enc */
subs w2, w2, #16
bmi 6f /* partial block? */
ld1 {v2.16b}, [x1], #16 /* load next input block */
- .if \enc == 1
- eor v2.16b, v2.16b, v5.16b /* final round enc+mac */
- eor v6.16b, v1.16b, v2.16b /* xor with crypted ctr */
- .else
- eor v2.16b, v2.16b, v1.16b /* xor with crypted ctr */
- eor v6.16b, v2.16b, v5.16b /* final round enc */
- .endif
- eor v0.16b, v0.16b, v2.16b /* xor mac with pt ^ rk[last] */
+ eor v6.16b, v2.16b, v1.16b /* en/decrypt input block */
+ mov v23.16b, v22.16b
+ bsl v23.16b, v2.16b, v6.16b /* select plaintext */
st1 {v6.16b}, [x0], #16 /* write output block */
+ eor v0.16b, v0.16b, v23.16b /* fold plaintext into mac */
+
bne 0b
CPU_LE( rev x8, x8 )
st1 {v0.16b}, [x5] /* store mac */
str x8, [x6, #8] /* store lsb end of ctr (BE) */
5: ret
-6: eor v0.16b, v0.16b, v5.16b /* final round mac */
- eor v1.16b, v1.16b, v5.16b /* final round enc */
-
- add x1, x1, w2, sxtw /* rewind the input pointer (w2 < 0) */
+6: add x1, x1, w2, sxtw /* rewind the input pointer (w2 < 0) */
add x0, x0, w2, sxtw /* rewind the output pointer */
adr_l x8, .Lpermute /* load permute vectors */
@@ -108,20 +104,17 @@ CPU_LE( rev x8, x8 )
ld1 {v2.16b}, [x1] /* load a full block of input */
tbl v1.16b, {v1.16b}, v7.16b /* move keystream to end of register */
- .if \enc == 1
- tbl v7.16b, {v2.16b}, v9.16b /* copy plaintext to start of v7 */
+ tbl v7.16b, {v2.16b}, v9.16b /* copy input block to start of v7 */
eor v2.16b, v2.16b, v1.16b /* encrypt partial input block */
- .else
- eor v2.16b, v2.16b, v1.16b /* decrypt partial input block */
- tbl v7.16b, {v2.16b}, v9.16b /* copy plaintext to start of v7 */
- .endif
- eor v0.16b, v0.16b, v7.16b /* fold plaintext into mac */
+ tbl v9.16b, {v2.16b}, v9.16b /* copy output block to start of v9 */
+ bsl v22.16b, v7.16b, v9.16b /* select plaintext */
+ eor v0.16b, v0.16b, v22.16b /* fold plaintext into mac */
tbx v2.16b, {v6.16b}, v8.16b /* insert output from previous iteration */
st1 {v0.16b}, [x5] /* store mac */
st1 {v2.16b}, [x0] /* store output block */
ret
- .endm
+SYM_FUNC_END(aes_ccm_do_crypt)
/*
* void ce_aes_ccm_encrypt(u8 out[], u8 const in[], u32 cbytes,
@@ -132,11 +125,13 @@ CPU_LE( rev x8, x8 )
* u8 ctr[]);
*/
SYM_FUNC_START(ce_aes_ccm_encrypt)
- aes_ccm_do_crypt 1
+ movi v22.16b, #255
+ b aes_ccm_do_crypt
SYM_FUNC_END(ce_aes_ccm_encrypt)
SYM_FUNC_START(ce_aes_ccm_decrypt)
- aes_ccm_do_crypt 0
+ movi v22.16b, #0
+ b aes_ccm_do_crypt
SYM_FUNC_END(ce_aes_ccm_decrypt)
.section ".rodata", "a"
--
2.43.0.275.g3460e3d667-goog
From: Ard Biesheuvel <[email protected]>
The C glue code already infers whether or not the current iteration is
the final one, by comparing walk.nbytes with walk.total. This means we
can easily inform the asm helper of this as well, by conditionally
passing a pointer to the original IV, which is used in the finalization
of the MAC. This removes the need for a separate call into the asm code
to perform the finalization.
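On the C side this boils down to the following, matching the hunks
below:

    u8 *final_iv = NULL;

    if (walk.nbytes == walk.total) {
            tail = 0;
            final_iv = orig_iv;     /* non-NULL: asm finalizes the MAC */
    }

    ce_aes_ccm_encrypt(dst, src, walk.nbytes - tail,
                       ctx->key_enc, num_rounds(ctx),
                       mac, walk.iv, final_iv);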
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/crypto/aes-ce-ccm-core.S | 32 ++++++++------------
arch/arm64/crypto/aes-ce-ccm-glue.c | 27 ++++++++---------
2 files changed, 24 insertions(+), 35 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 75be3157bae1..c0d89f8ae4c4 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -44,28 +44,12 @@
aese \vb\().16b, v4.16b
.endm
- /*
- * void ce_aes_ccm_final(u8 mac[], u8 const ctr[], u8 const rk[],
- * u32 rounds);
- */
-SYM_FUNC_START(ce_aes_ccm_final)
- ld1 {v0.16b}, [x0] /* load mac */
- ld1 {v1.16b}, [x1] /* load 1st ctriv */
-
- aes_encrypt v0, v1, w3
-
- /* final round key cancels out */
- eor v0.16b, v0.16b, v1.16b /* en-/decrypt the mac */
- st1 {v0.16b}, [x0] /* store result */
- ret
-SYM_FUNC_END(ce_aes_ccm_final)
-
SYM_FUNC_START_LOCAL(aes_ccm_do_crypt)
load_round_keys x3, w4, x10
+ ld1 {v0.16b}, [x5] /* load mac */
cbz x2, 5f
ldr x8, [x6, #8] /* load lower ctr */
- ld1 {v0.16b}, [x5] /* load mac */
CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
0: /* outer loop */
ld1 {v1.8b}, [x6] /* load upper ctr */
@@ -89,9 +73,9 @@ CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
bne 0b
CPU_LE( rev x8, x8 )
- st1 {v0.16b}, [x5] /* store mac */
str x8, [x6, #8] /* store lsb end of ctr (BE) */
-5: ret
+5: cbz x7, 8f
+ b 7f
6: add x1, x1, w2, sxtw /* rewind the input pointer (w2 < 0) */
add x0, x0, w2, sxtw /* rewind the output pointer */
@@ -111,8 +95,16 @@ CPU_LE( rev x8, x8 )
eor v0.16b, v0.16b, v22.16b /* fold plaintext into mac */
tbx v2.16b, {v6.16b}, v8.16b /* insert output from previous iteration */
- st1 {v0.16b}, [x5] /* store mac */
st1 {v2.16b}, [x0] /* store output block */
+ cbz x7, 8f /* time to finalize MAC? */
+7: ld1 {v1.16b}, [x7] /* load 1st ctriv */
+
+ aes_encrypt v0, v1, w4
+
+ /* final round key cancels out */
+ eor v0.16b, v0.16b, v1.16b /* en-/decrypt the mac */
+
+8: st1 {v0.16b}, [x5] /* store mac */
ret
SYM_FUNC_END(aes_ccm_do_crypt)
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index ed3d79e05112..ce9b28e3c7d6 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -38,14 +38,11 @@ asmlinkage u32 ce_aes_mac_update(u8 const in[], u32 const rk[], int rounds,
asmlinkage void ce_aes_ccm_encrypt(u8 out[], u8 const in[], u32 cbytes,
u32 const rk[], u32 rounds, u8 mac[],
- u8 ctr[]);
+ u8 ctr[], u8 const final_iv[]);
asmlinkage void ce_aes_ccm_decrypt(u8 out[], u8 const in[], u32 cbytes,
u32 const rk[], u32 rounds, u8 mac[],
- u8 ctr[]);
-
-asmlinkage void ce_aes_ccm_final(u8 mac[], u8 const ctr[], u32 const rk[],
- u32 rounds);
+ u8 ctr[], u8 const final_iv[]);
static int ccm_setkey(struct crypto_aead *tfm, const u8 *in_key,
unsigned int key_len)
@@ -210,9 +207,12 @@ static int ccm_encrypt(struct aead_request *req)
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
u8 buf[AES_BLOCK_SIZE];
+ u8 *final_iv = NULL;
- if (walk.nbytes == walk.total)
+ if (walk.nbytes == walk.total) {
tail = 0;
+ final_iv = orig_iv;
+ }
if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
@@ -220,14 +220,11 @@ static int ccm_encrypt(struct aead_request *req)
ce_aes_ccm_encrypt(dst, src, walk.nbytes - tail,
ctx->key_enc, num_rounds(ctx),
- mac, walk.iv);
+ mac, walk.iv, final_iv);
if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
memcpy(walk.dst.virt.addr, dst, walk.nbytes);
- if (walk.nbytes == walk.total)
- ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
-
if (walk.nbytes) {
err = skcipher_walk_done(&walk, tail);
}
@@ -277,9 +274,12 @@ static int ccm_decrypt(struct aead_request *req)
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;
u8 buf[AES_BLOCK_SIZE];
+ u8 *final_iv = NULL;
- if (walk.nbytes == walk.total)
+ if (walk.nbytes == walk.total) {
tail = 0;
+ final_iv = orig_iv;
+ }
if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
@@ -287,14 +287,11 @@ static int ccm_decrypt(struct aead_request *req)
ce_aes_ccm_decrypt(dst, src, walk.nbytes - tail,
ctx->key_enc, num_rounds(ctx),
- mac, walk.iv);
+ mac, walk.iv, final_iv);
if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
memcpy(walk.dst.virt.addr, dst, walk.nbytes);
- if (walk.nbytes == walk.total)
- ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
-
if (walk.nbytes) {
err = skcipher_walk_done(&walk, tail);
}
--
2.43.0.275.g3460e3d667-goog
On Thu, 11 Jan 2024 at 13:33, Ard Biesheuvel <[email protected]> wrote:
>
> From: Ard Biesheuvel <[email protected]>
>
> Implement the CCM tail handling using a single sequence that uses
> permute vectors and overlapping loads and stores, rather than processing
> the tail byte by byte in a loop using scalar operations. This is
> more efficient, even though the measured speedup is only around 1-2% on
> the CPUs I have tried.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/crypto/aes-ce-ccm-core.S | 59 +++++++++++++-------
> arch/arm64/crypto/aes-ce-ccm-glue.c | 20 +++----
> 2 files changed, 48 insertions(+), 31 deletions(-)
>
...
The hunks below don't belong here: they were supposed to be squashed
into the previous patch.
I will fix that up for the next revision.
> diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
> index 2f4e6a318fcd..4710e59075f5 100644
> --- a/arch/arm64/crypto/aes-ce-ccm-glue.c
> +++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
> @@ -181,16 +181,16 @@ static int ccm_encrypt(struct aead_request *req)
> if (walk.nbytes == walk.total)
> tail = 0;
>
> - if (unlikely(walk.total < AES_BLOCK_SIZE))
> - src = dst = memcpy(buf + sizeof(buf) - walk.total,
> - src, walk.total);
> + if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
> + src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
> + src, walk.nbytes);
>
> ce_aes_ccm_encrypt(dst, src, walk.nbytes - tail,
> ctx->key_enc, num_rounds(ctx),
> mac, walk.iv);
>
> - if (unlikely(walk.total < AES_BLOCK_SIZE))
> - memcpy(walk.dst.virt.addr, dst, walk.total);
> + if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
> + memcpy(walk.dst.virt.addr, dst, walk.nbytes);
>
> if (walk.nbytes == walk.total)
> ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
> @@ -248,16 +248,16 @@ static int ccm_decrypt(struct aead_request *req)
> if (walk.nbytes == walk.total)
> tail = 0;
>
> - if (unlikely(walk.total < AES_BLOCK_SIZE))
> - src = dst = memcpy(buf + sizeof(buf) - walk.total,
> - src, walk.total);
> + if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
> + src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
> + src, walk.nbytes);
>
> ce_aes_ccm_decrypt(dst, src, walk.nbytes - tail,
> ctx->key_enc, num_rounds(ctx),
> mac, walk.iv);
>
> - if (unlikely(walk.total < AES_BLOCK_SIZE))
> - memcpy(walk.dst.virt.addr, dst, walk.total);
> + if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
> + memcpy(walk.dst.virt.addr, dst, walk.nbytes);
>
> if (walk.nbytes == walk.total)
> ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
> --
> 2.43.0.275.g3460e3d667-goog
>