2022-08-24 15:59:49

by Taehee Yoo

Subject: [PATCH 0/3] crypto: aria: add ARIA AES-NI/AVX/x86_64 implementation

The purpose of this patchset is to add an AVX implementation of ARIA (aria-avx).
Many of the ideas in this implementation come from camellia-avx,
especially the byte-slicing technique.
Like Camellia, ARIA also uses a 16-way strategy.

The ARIA cipher algorithm is similar to AES.
There are four s-boxes in the ARIA spec, and the first and second s-boxes
are the same as AES's s-boxes.
Most functions are based on the aria-generic code, except for the s-box
related functions.
The aria-avx module doesn't implement the key expansion function;
it only supports encrypt() and decrypt().

The encryption and decryption logic is actually the same, but separate keys
(an encryption key and a decryption key) must be used.
The en/decryption steps are as follows:
1. Add-Round-Key
2. S-box
3. Diffusion layer

There is nothing special about the Add-Round-Key step.

There are some notable things in the s-box step.
Like Camellia, it doesn't use a lookup table; instead, it uses AES-NI.

To calculate the first s-box, it just uses aesenclast and then undoes
shift_rows. No further processing is needed because the
first s-box is the same as the AES encryption s-box.
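
For illustration, the same trick expressed with C intrinsics (a rough
sketch only, not the kernel assembly; the function name is made up):

#include <immintrin.h>

/*
 * Apply the AES S-box (ARIA's S1) to all 16 bytes of a vector.
 * With a zero round key, aesenclast performs only ShiftRows and
 * SubBytes; a pshufb with the inverse ShiftRows permutation (the
 * .Linv_shift_row table in the assembly) then undoes ShiftRows.
 */
static __m128i aria_s1_16bytes(__m128i x)
{
	const __m128i zero = _mm_setzero_si128();
	const __m128i inv_shift_row = _mm_set_epi8(
		0x03, 0x06, 0x09, 0x0c, 0x0f, 0x02, 0x05, 0x08,
		0x0b, 0x0e, 0x01, 0x04, 0x07, 0x0a, 0x0d, 0x00);

	x = _mm_aesenclast_si128(x, zero);         /* ShiftRows + SubBytes */
	return _mm_shuffle_epi8(x, inv_shift_row); /* undo ShiftRows       */
}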

To calculate the second s-box (the inverse of S1), it just uses aesdeclast
and then undoes the row shift. No further processing is needed
because the second s-box is the same as the AES decryption s-box.

To calculate the third and fourth s-boxes, it uses aesenclast,
then undoes shift_rows, and applies an affine transformation.
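
The affine transformations themselves are applied as a pair of 4-bit
table lookups (the filter_8bit helper in the assembly). A rough
C-intrinsics sketch of that technique, with the table contents left as
parameters (not the kernel code):

#include <immintrin.h>

/*
 * Split each byte into its low and high nibble, look both up in
 * 16-entry tables with pshufb, and XOR the results. lo_t/hi_t hold
 * the per-nibble tables of the wanted affine transformation.
 */
static __m128i filter_8bit_sketch(__m128i x, __m128i lo_t, __m128i hi_t)
{
	const __m128i mask4 = _mm_set1_epi8(0x0f);
	__m128i lo = _mm_and_si128(x, mask4);                    /* low nibbles  */
	__m128i hi = _mm_and_si128(_mm_srli_epi16(x, 4), mask4); /* high nibbles */

	return _mm_xor_si128(_mm_shuffle_epi8(lo_t, lo),
			     _mm_shuffle_epi8(hi_t, hi));
}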

The aria-generic implementation is based on a 32-bit implementation,
not an 8-bit one.
The aria-avx diffusion layer follows the aria-generic implementation
because the 8-bit form is not well suited to a parallel implementation,
whereas the 32-bit form is.
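
For reference, the assembly's comments describe the word-level mixing
(aria_diff_m) as the following operation on each 32-bit word; a plain-C
sketch (the helper name here is illustrative):

#include <linux/bitops.h>

/* Each output byte becomes the XOR of the other three bytes of the word. */
static inline u32 aria_diff_m_one_word(u32 x)
{
	u32 t = ror32(x, 8);     /* T = rotr32(X, 8)      */

	x ^= t;                  /* X ^= T                */
	return t ^ ror32(x, 16); /* X = T ^ rotr32(X, 16) */
}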

The first patch in this series exports functions for aria-avx;
aria-avx reuses existing functions from the aria-generic code.
The second patch implements aria-avx.
The last patch adds an async speed test for aria.

Benchmarks:
tcrypt was used.
CPU: i3-12100

How to test:
modprobe aria-generic
tcrypt mode=610 num_mb=8192

Result:
testing speed of multibuffer ecb(aria) (ecb(aria-generic)) encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 534 cycles
test 2 (128 bit key, 128 byte blocks): 1 operation in 2006 cycles
test 3 (128 bit key, 256 byte blocks): 1 operation in 3674 cycles
test 6 (128 bit key, 4096 byte blocks): 1 operation in 52374 cycles
test 7 (256 bit key, 16 byte blocks): 1 operation in 608 cycles
test 9 (256 bit key, 128 byte blocks): 1 operation in 2586 cycles
test 10 (256 bit key, 256 byte blocks): 1 operation in 4707 cycles
test 13 (256 bit key, 4096 byte blocks): 1 operation in 69794 cycles

testing speed of multibuffer ecb(aria) (ecb(aria-generic)) decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 545 cycles
test 2 (128 bit key, 128 byte blocks): 1 operation in 1995 cycles
test 3 (128 bit key, 256 byte blocks): 1 operation in 3673 cycles
test 6 (128 bit key, 4096 byte blocks): 1 operation in 52359 cycles
test 7 (256 bit key, 16 byte blocks): 1 operation in 615 cycles
test 9 (256 bit key, 128 byte blocks): 1 operation in 2588 cycles
test 10 (256 bit key, 256 byte blocks): 1 operation in 4712 cycles
test 13 (256 bit key, 4096 byte blocks): 1 operation in 69916 cycles

How to test:
modprobe aria
tcrypt mode=610 num_mb=8192

Result:
testing speed of multibuffer ecb(aria) (ecb-aria-avx) encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 727 cycles
test 2 (128 bit key, 128 byte blocks): 1 operation in 2040 cycles
test 3 (128 bit key, 256 byte blocks): 1 operation in 1399 cycles
test 6 (128 bit key, 4096 byte blocks): 1 operation in 14758 cycles
test 7 (256 bit key, 16 byte blocks): 1 operation in 702 cycles
test 9 (256 bit key, 128 byte blocks): 1 operation in 2615 cycles
test 10 (256 bit key, 256 byte blocks): 1 operation in 1677 cycles
test 13 (256 bit key, 4096 byte blocks): 1 operation in 19454 cycles
testing speed of multibuffer ecb(aria) (ecb-aria-avx) decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 638 cycles
test 2 (128 bit key, 128 byte blocks): 1 operation in 2090 cycles
test 3 (128 bit key, 256 byte blocks): 1 operation in 1394 cycles
test 6 (128 bit key, 4096 byte blocks): 1 operation in 14824 cycles
test 7 (256 bit key, 16 byte blocks): 1 operation in 719 cycles
test 9 (256 bit key, 128 byte blocks): 1 operation in 2633 cycles
test 10 (256 bit key, 256 byte blocks): 1 operation in 1684 cycles
test 13 (256 bit key, 4096 byte blocks): 1 operation in 19457 cycles

Taehee Yoo (3):
crypto: aria: prepare generic module for optimized implementations
crypto: aria-avx: add AES-NI/AVX/x86_64 assembler implementation of
aria cipher
crypto: tcrypt: add async speed test for aria cipher

arch/x86/crypto/Makefile | 3 +
arch/x86/crypto/aria-aesni-avx-asm_64.S | 648 ++++++++++++++++++++++++
arch/x86/crypto/aria_aesni_avx_glue.c | 165 ++++++
crypto/Kconfig | 21 +
crypto/Makefile | 2 +-
crypto/{aria.c => aria_generic.c} | 39 +-
crypto/tcrypt.c | 13 +
include/crypto/aria.h | 14 +-
8 files changed, 889 insertions(+), 16 deletions(-)
create mode 100644 arch/x86/crypto/aria-aesni-avx-asm_64.S
create mode 100644 arch/x86/crypto/aria_aesni_avx_glue.c
rename crypto/{aria.c => aria_generic.c} (86%)

--
2.17.1


2022-08-24 16:00:11

by Taehee Yoo

Subject: [PATCH 3/3] crypto: tcrypt: add async speed test for aria cipher

In order to test the performance of the aria-avx implementation,
an async speed test is needed.
So, this patch adds async speed tests to tcrypt.
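
With this applied, the new cases can be exercised with, for example:

modprobe tcrypt mode=519
modprobe tcrypt mode=610 num_mb=8192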

Signed-off-by: Taehee Yoo <[email protected]>
---
crypto/tcrypt.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 59eb8ec36664..f36d3bbf88f5 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -2648,6 +2648,13 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
speed_template_16);
break;

+ case 519:
+ test_acipher_speed("ecb(aria)", ENCRYPT, sec, NULL, 0,
+ speed_template_16_24_32);
+ test_acipher_speed("ecb(aria)", DECRYPT, sec, NULL, 0,
+ speed_template_16_24_32);
+ break;
+
case 600:
test_mb_skcipher_speed("ecb(aes)", ENCRYPT, sec, NULL, 0,
speed_template_16_24_32, num_mb);
@@ -2859,6 +2866,12 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
test_mb_skcipher_speed("ctr(blowfish)", DECRYPT, sec, NULL, 0,
speed_template_8_32, num_mb);
break;
+ case 610:
+ test_mb_skcipher_speed("ecb(aria)", ENCRYPT, sec, NULL, 0,
+ speed_template_16_32, num_mb);
+ test_mb_skcipher_speed("ecb(aria)", DECRYPT, sec, NULL, 0,
+ speed_template_16_32, num_mb);
+ break;

case 1000:
test_available();
--
2.17.1

2022-08-24 16:00:11

by Taehee Yoo

Subject: [PATCH 1/3] crypto: aria: prepare generic module for optimized implementations

This patch renames aria to aria_generic and exports some functions, such as
aria_set_key(), aria_encrypt(), and aria_decrypt(), so that they can be
used by the aria-avx implementation.
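
For reference, a minimal sketch of how an optimized module might consume
these exports (only the aria_* functions exist in this patch; the wrapper
names below are illustrative):

#include <crypto/aria.h>
#include <crypto/skcipher.h>

static int aria_avx_setkey(struct crypto_skcipher *tfm, const u8 *key,
			   unsigned int keylen)
{
	/* Key expansion stays in the generic module. */
	return aria_set_key(&tfm->base, key, keylen);
}

static void aria_avx_crypt_tail(struct aria_ctx *ctx, u8 *dst, const u8 *src)
{
	/* Per-block fallback for data shorter than the parallel path. */
	aria_encrypt(ctx, dst, src);
}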

Signed-off-by: Taehee Yoo <[email protected]>
---
crypto/Makefile | 2 +-
crypto/{aria.c => aria_generic.c} | 39 +++++++++++++++++++++++++------
include/crypto/aria.h | 14 +++++------
3 files changed, 39 insertions(+), 16 deletions(-)
rename crypto/{aria.c => aria_generic.c} (86%)

diff --git a/crypto/Makefile b/crypto/Makefile
index a6f94e04e1da..303b21c43df0 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -149,7 +149,7 @@ obj-$(CONFIG_CRYPTO_TEA) += tea.o
obj-$(CONFIG_CRYPTO_KHAZAD) += khazad.o
obj-$(CONFIG_CRYPTO_ANUBIS) += anubis.o
obj-$(CONFIG_CRYPTO_SEED) += seed.o
-obj-$(CONFIG_CRYPTO_ARIA) += aria.o
+obj-$(CONFIG_CRYPTO_ARIA) += aria_generic.o
obj-$(CONFIG_CRYPTO_CHACHA20) += chacha_generic.o
obj-$(CONFIG_CRYPTO_POLY1305) += poly1305_generic.o
obj-$(CONFIG_CRYPTO_DEFLATE) += deflate.o
diff --git a/crypto/aria.c b/crypto/aria_generic.c
similarity index 86%
rename from crypto/aria.c
rename to crypto/aria_generic.c
index ac3dffac34bb..4cc29b82b99d 100644
--- a/crypto/aria.c
+++ b/crypto/aria_generic.c
@@ -16,6 +16,14 @@

#include <crypto/aria.h>

+static const u32 key_rc[20] = {
+ 0x517cc1b7, 0x27220a94, 0xfe13abe8, 0xfa9a6ee0,
+ 0x6db14acc, 0x9e21c820, 0xff28b1d5, 0xef5de2b0,
+ 0xdb92371d, 0x2126e970, 0x03249775, 0x04e8c90e,
+ 0x517cc1b7, 0x27220a94, 0xfe13abe8, 0xfa9a6ee0,
+ 0x6db14acc, 0x9e21c820, 0xff28b1d5, 0xef5de2b0
+};
+
static void aria_set_encrypt_key(struct aria_ctx *ctx, const u8 *in_key,
unsigned int key_len)
{
@@ -25,7 +33,7 @@ static void aria_set_encrypt_key(struct aria_ctx *ctx, const u8 *in_key,
const u32 *ck;
int rkidx = 0;

- ck = &key_rc[(key_len - 16) / 8][0];
+ ck = &key_rc[(key_len - 16) / 2];

w0[0] = be32_to_cpu(key[0]);
w0[1] = be32_to_cpu(key[1]);
@@ -163,8 +171,7 @@ static void aria_set_decrypt_key(struct aria_ctx *ctx)
}
}

-static int aria_set_key(struct crypto_tfm *tfm, const u8 *in_key,
- unsigned int key_len)
+int aria_set_key(struct crypto_tfm *tfm, const u8 *in_key, unsigned int key_len)
{
struct aria_ctx *ctx = crypto_tfm_ctx(tfm);

@@ -179,6 +186,7 @@ static int aria_set_key(struct crypto_tfm *tfm, const u8 *in_key,

return 0;
}
+EXPORT_SYMBOL_GPL(aria_set_key);

static void __aria_crypt(struct aria_ctx *ctx, u8 *out, const u8 *in,
u32 key[][ARIA_RD_KEY_WORDS])
@@ -235,14 +243,30 @@ static void __aria_crypt(struct aria_ctx *ctx, u8 *out, const u8 *in,
dst[3] = cpu_to_be32(reg3);
}

-static void aria_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+void aria_encrypt(void *_ctx, u8 *out, const u8 *in)
+{
+ struct aria_ctx *ctx = (struct aria_ctx *)_ctx;
+
+ __aria_crypt(ctx, out, in, ctx->enc_key);
+}
+EXPORT_SYMBOL_GPL(aria_encrypt);
+
+void aria_decrypt(void *_ctx, u8 *out, const u8 *in)
+{
+ struct aria_ctx *ctx = (struct aria_ctx *)_ctx;
+
+ __aria_crypt(ctx, out, in, ctx->dec_key);
+}
+EXPORT_SYMBOL_GPL(aria_decrypt);
+
+static void __aria_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
{
struct aria_ctx *ctx = crypto_tfm_ctx(tfm);

__aria_crypt(ctx, out, in, ctx->enc_key);
}

-static void aria_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+static void __aria_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
{
struct aria_ctx *ctx = crypto_tfm_ctx(tfm);

@@ -263,8 +287,8 @@ static struct crypto_alg aria_alg = {
.cia_min_keysize = ARIA_MIN_KEY_SIZE,
.cia_max_keysize = ARIA_MAX_KEY_SIZE,
.cia_setkey = aria_set_key,
- .cia_encrypt = aria_encrypt,
- .cia_decrypt = aria_decrypt
+ .cia_encrypt = __aria_encrypt,
+ .cia_decrypt = __aria_decrypt
}
}
};
@@ -286,3 +310,4 @@ MODULE_DESCRIPTION("ARIA Cipher Algorithm");
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Taehee Yoo <[email protected]>");
MODULE_ALIAS_CRYPTO("aria");
+MODULE_ALIAS_CRYPTO("aria-generic");
diff --git a/include/crypto/aria.h b/include/crypto/aria.h
index 4a86661788e8..5b9fe2a224df 100644
--- a/include/crypto/aria.h
+++ b/include/crypto/aria.h
@@ -28,6 +28,7 @@
#define ARIA_MIN_KEY_SIZE 16
#define ARIA_MAX_KEY_SIZE 32
#define ARIA_BLOCK_SIZE 16
+#define ARIA_AVX_BLOCK_SIZE (ARIA_BLOCK_SIZE * 16)
#define ARIA_MAX_RD_KEYS 17
#define ARIA_RD_KEY_WORDS (ARIA_BLOCK_SIZE / sizeof(u32))

@@ -38,14 +39,6 @@ struct aria_ctx {
u32 dec_key[ARIA_MAX_RD_KEYS][ARIA_RD_KEY_WORDS];
};

-static const u32 key_rc[5][4] = {
- { 0x517cc1b7, 0x27220a94, 0xfe13abe8, 0xfa9a6ee0 },
- { 0x6db14acc, 0x9e21c820, 0xff28b1d5, 0xef5de2b0 },
- { 0xdb92371d, 0x2126e970, 0x03249775, 0x04e8c90e },
- { 0x517cc1b7, 0x27220a94, 0xfe13abe8, 0xfa9a6ee0 },
- { 0x6db14acc, 0x9e21c820, 0xff28b1d5, 0xef5de2b0 }
-};
-
static const u32 s1[256] = {
0x00636363, 0x007c7c7c, 0x00777777, 0x007b7b7b,
0x00f2f2f2, 0x006b6b6b, 0x006f6f6f, 0x00c5c5c5,
@@ -458,4 +451,9 @@ static inline void aria_gsrk(u32 *rk, u32 *x, u32 *y, u32 n)
((y[(q + 2) % 4]) << (32 - r));
}

+void aria_encrypt(void *ctx, u8 *out, const u8 *in);
+void aria_decrypt(void *ctx, u8 *out, const u8 *in);
+int aria_set_key(struct crypto_tfm *tfm, const u8 *in_key,
+ unsigned int key_len);
+
#endif
--
2.17.1

2022-08-24 16:00:12

by Taehee Yoo

Subject: [PATCH 2/3] crypto: aria-avx: add AES-NI/AVX/x86_64 assembler implementation of aria cipher

The implementation is based on the 32-bit implementation of ARIA.
Also, the aria-avx processing steps are similar to camellia-avx
(a conceptual sketch of the byte-slicing step follows this list):
1. Byteslice (16-way)
2. Add-Round-Key
3. S-box
4. Diffusion layer
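
Conceptually (ignoring the in-register shuffles the assembly also does),
byte-slicing rearranges 16 consecutive blocks so that each 128-bit
register holds the same byte position from every block. A plain-C
illustration of that data layout (not the kernel code):

#include <linux/types.h>

static void byteslice_16way_sketch(const u8 blocks[16][16], u8 slice[16][16])
{
	int i, j;

	for (i = 0; i < 16; i++)         /* block index                */
		for (j = 0; j < 16; j++) /* byte position in the block */
			slice[j][i] = blocks[i][j];
}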

Except for the s-box, all steps are the same as in the aria-generic
implementation. The s-box step is very similar to the camellia and
sm4 implementations.

There are four s-boxes in ARIA, and two of them are the same as
AES's s-boxes. The basic strategy is to use AES-NI.

To calculate the first s-box, it just uses aesenclast and then undoes
shift_rows.
No further processing is needed because the first s-box is
the same as the AES encryption s-box.

To calculate the second s-box (the inverse of S1), it just uses aesdeclast
and then undoes the row shift.
No further processing is needed because the second s-box is
the same as the AES decryption s-box.

To calculate the third and fourth s-boxes, it uses aesenclast,
then undoes shift_rows, and applies an affine transformation.

The aria-generic implementation is based on a 32-bit implementation,
not an 8-bit implementation. The aria-avx diffusion layer follows the
aria-generic implementation because the 8-bit form is not well suited
to a parallel implementation, whereas the 32-bit form is.

Signed-off-by: Taehee Yoo <[email protected]>
---
arch/x86/crypto/Makefile | 3 +
arch/x86/crypto/aria-aesni-avx-asm_64.S | 648 ++++++++++++++++++++++++
arch/x86/crypto/aria_aesni_avx_glue.c | 165 ++++++
crypto/Kconfig | 21 +
4 files changed, 837 insertions(+)
create mode 100644 arch/x86/crypto/aria-aesni-avx-asm_64.S
create mode 100644 arch/x86/crypto/aria_aesni_avx_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 04d07ab744b2..3b1d701a4f6c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -100,6 +100,9 @@ sm4-aesni-avx-x86_64-y := sm4-aesni-avx-asm_64.o sm4_aesni_avx_glue.o
obj-$(CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64) += sm4-aesni-avx2-x86_64.o
sm4-aesni-avx2-x86_64-y := sm4-aesni-avx2-asm_64.o sm4_aesni_avx2_glue.o

+obj-$(CONFIG_CRYPTO_ARIA_AESNI_AVX_X86_64) += aria-aesni-avx-x86_64.o
+aria-aesni-avx-x86_64-y := aria-aesni-avx-asm_64.o aria_aesni_avx_glue.o
+
quiet_cmd_perlasm = PERLASM $@
cmd_perlasm = $(PERL) $< > $@
$(obj)/%.S: $(src)/%.pl FORCE
diff --git a/arch/x86/crypto/aria-aesni-avx-asm_64.S b/arch/x86/crypto/aria-aesni-avx-asm_64.S
new file mode 100644
index 000000000000..3d01f5229f72
--- /dev/null
+++ b/arch/x86/crypto/aria-aesni-avx-asm_64.S
@@ -0,0 +1,648 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * ARIA Cipher 16-way parallel algorithm (AVX)
+ *
+ * Copyright (c) 2022 Taehee Yoo <[email protected]>
+ *
+ */
+
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#define filter_8bit(x, lo_t, hi_t, mask4bit, tmp0) \
+ vpand x, mask4bit, tmp0; \
+ vpandn x, mask4bit, x; \
+ vpsrld $4, x, x; \
+ \
+ vpshufb tmp0, lo_t, tmp0; \
+ vpshufb x, hi_t, x; \
+ vpxor tmp0, x, x;
+
+#define transpose_4x4(x0, x1, x2, x3, t1, t2) \
+ vpunpckhdq x1, x0, t2; \
+ vpunpckldq x1, x0, x0; \
+ \
+ vpunpckldq x3, x2, t1; \
+ vpunpckhdq x3, x2, x2; \
+ \
+ vpunpckhqdq t1, x0, x1; \
+ vpunpcklqdq t1, x0, x0; \
+ \
+ vpunpckhqdq x2, t2, x3; \
+ vpunpcklqdq x2, t2, x2;
+
+#define byteslice_16x16b(a0, b0, c0, d0, \
+ a1, b1, c1, d1, \
+ a2, b2, c2, d2, \
+ a3, b3, c3, d3, \
+ st0, st1) \
+ vmovdqu d2, st0; \
+ vmovdqu d3, st1; \
+ transpose_4x4(a0, a1, a2, a3, d2, d3); \
+ transpose_4x4(b0, b1, b2, b3, d2, d3); \
+ vmovdqu st0, d2; \
+ vmovdqu st1, d3; \
+ \
+ vmovdqu a0, st0; \
+ vmovdqu a1, st1; \
+ transpose_4x4(c0, c1, c2, c3, a0, a1); \
+ transpose_4x4(d0, d1, d2, d3, a0, a1); \
+ \
+ vmovdqu .Lshufb_16x16b, a0; \
+ vmovdqu st1, a1; \
+ vpshufb a0, a2, a2; \
+ vpshufb a0, a3, a3; \
+ vpshufb a0, b0, b0; \
+ vpshufb a0, b1, b1; \
+ vpshufb a0, b2, b2; \
+ vpshufb a0, b3, b3; \
+ vpshufb a0, a1, a1; \
+ vpshufb a0, c0, c0; \
+ vpshufb a0, c1, c1; \
+ vpshufb a0, c2, c2; \
+ vpshufb a0, c3, c3; \
+ vpshufb a0, d0, d0; \
+ vpshufb a0, d1, d1; \
+ vpshufb a0, d2, d2; \
+ vpshufb a0, d3, d3; \
+ vmovdqu d3, st1; \
+ vmovdqu st0, d3; \
+ vpshufb a0, d3, a0; \
+ vmovdqu d2, st0; \
+ \
+ transpose_4x4(a0, b0, c0, d0, d2, d3); \
+ transpose_4x4(a1, b1, c1, d1, d2, d3); \
+ vmovdqu st0, d2; \
+ vmovdqu st1, d3; \
+ \
+ vmovdqu b0, st0; \
+ vmovdqu b1, st1; \
+ transpose_4x4(a2, b2, c2, d2, b0, b1); \
+ transpose_4x4(a3, b3, c3, d3, b0, b1); \
+ vmovdqu st0, b0; \
+ vmovdqu st1, b1; \
+ /* does not adjust output bytes inside vectors */
+
+#define debyteslice_16x16b(a0, b0, c0, d0, \
+ a1, b1, c1, d1, \
+ a2, b2, c2, d2, \
+ a3, b3, c3, d3, \
+ st0, st1) \
+ vmovdqu d2, st0; \
+ vmovdqu d3, st1; \
+ transpose_4x4(a0, a1, a2, a3, d2, d3); \
+ transpose_4x4(b0, b1, b2, b3, d2, d3); \
+ vmovdqu st0, d2; \
+ vmovdqu st1, d3; \
+ \
+ vmovdqu a0, st0; \
+ vmovdqu a1, st1; \
+ transpose_4x4(c0, c1, c2, c3, a0, a1); \
+ transpose_4x4(d0, d1, d2, d3, a0, a1); \
+ \
+ vmovdqu .Lshufb_16x16b, a0; \
+ vmovdqu st1, a1; \
+ vpshufb a0, a2, a2; \
+ vpshufb a0, a3, a3; \
+ vpshufb a0, b0, b0; \
+ vpshufb a0, b1, b1; \
+ vpshufb a0, b2, b2; \
+ vpshufb a0, b3, b3; \
+ vpshufb a0, a1, a1; \
+ vpshufb a0, c0, c0; \
+ vpshufb a0, c1, c1; \
+ vpshufb a0, c2, c2; \
+ vpshufb a0, c3, c3; \
+ vpshufb a0, d0, d0; \
+ vpshufb a0, d1, d1; \
+ vpshufb a0, d2, d2; \
+ vpshufb a0, d3, d3; \
+ vmovdqu d3, st1; \
+ vmovdqu st0, d3; \
+ vpshufb a0, d3, a0; \
+ vmovdqu d2, st0; \
+ \
+ transpose_4x4(c0, d0, a0, b0, d2, d3); \
+ transpose_4x4(c1, d1, a1, b1, d2, d3); \
+ vmovdqu st0, d2; \
+ vmovdqu st1, d3; \
+ \
+ vmovdqu b0, st0; \
+ vmovdqu b1, st1; \
+ transpose_4x4(c2, d2, a2, b2, b0, b1); \
+ transpose_4x4(c3, d3, a3, b3, b0, b1); \
+ vmovdqu st0, b0; \
+ vmovdqu st1, b1; \
+ /* does not adjust output bytes inside vectors */
+
+/* load blocks to registers and apply pre-whitening */
+#define inpack16_pre(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ y0, y1, y2, y3, \
+ y4, y5, y6, y7, \
+ rio) \
+ vmovdqu (0 * 16)(rio), x0; \
+ vmovdqu (1 * 16)(rio), x1; \
+ vmovdqu (2 * 16)(rio), x2; \
+ vmovdqu (3 * 16)(rio), x3; \
+ vmovdqu (4 * 16)(rio), x4; \
+ vmovdqu (5 * 16)(rio), x5; \
+ vmovdqu (6 * 16)(rio), x6; \
+ vmovdqu (7 * 16)(rio), x7; \
+ vmovdqu (8 * 16)(rio), y0; \
+ vmovdqu (9 * 16)(rio), y1; \
+ vmovdqu (10 * 16)(rio), y2; \
+ vmovdqu (11 * 16)(rio), y3; \
+ vmovdqu (12 * 16)(rio), y4; \
+ vmovdqu (13 * 16)(rio), y5; \
+ vmovdqu (14 * 16)(rio), y6; \
+ vmovdqu (15 * 16)(rio), y7;
+
+/* byteslice pre-whitened blocks and store to temporary memory */
+#define inpack16_post(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ y0, y1, y2, y3, \
+ y4, y5, y6, y7, \
+ mem_ab, mem_cd) \
+ byteslice_16x16b(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ y0, y1, y2, y3, \
+ y4, y5, y6, y7, \
+ (mem_ab), (mem_cd)); \
+ \
+ vmovdqu x0, 0 * 16(mem_ab); \
+ vmovdqu x1, 1 * 16(mem_ab); \
+ vmovdqu x2, 2 * 16(mem_ab); \
+ vmovdqu x3, 3 * 16(mem_ab); \
+ vmovdqu x4, 4 * 16(mem_ab); \
+ vmovdqu x5, 5 * 16(mem_ab); \
+ vmovdqu x6, 6 * 16(mem_ab); \
+ vmovdqu x7, 7 * 16(mem_ab); \
+ vmovdqu y0, 0 * 16(mem_cd); \
+ vmovdqu y1, 1 * 16(mem_cd); \
+ vmovdqu y2, 2 * 16(mem_cd); \
+ vmovdqu y3, 3 * 16(mem_cd); \
+ vmovdqu y4, 4 * 16(mem_cd); \
+ vmovdqu y5, 5 * 16(mem_cd); \
+ vmovdqu y6, 6 * 16(mem_cd); \
+ vmovdqu y7, 7 * 16(mem_cd);
+
+/* de-byteslice, apply post-whitening and store blocks */
+#define outunpack16(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ y0, y1, y2, y3, \
+ y4, y5, y6, y7, \
+ mem_ab, mem_cd) \
+ debyteslice_16x16b(y0, y4, x0, x4, \
+ y1, y5, x1, x5, \
+ y2, y6, x2, x6, \
+ y3, y7, x3, x7, \
+ (mem_ab), (mem_cd)); \
+ vmovdqu x0, 0 * 16(mem_ab); \
+ vmovdqu x1, 1 * 16(mem_ab); \
+ vmovdqu x2, 2 * 16(mem_ab); \
+ vmovdqu x3, 3 * 16(mem_ab); \
+ vmovdqu x4, 4 * 16(mem_ab); \
+ vmovdqu x5, 5 * 16(mem_ab); \
+ vmovdqu x6, 6 * 16(mem_ab); \
+ vmovdqu x7, 7 * 16(mem_ab); \
+ vmovdqu y0, 8 * 16(mem_ab); \
+ vmovdqu y1, 9 * 16(mem_ab); \
+ vmovdqu y2, 10 * 16(mem_ab); \
+ vmovdqu y3, 11 * 16(mem_ab); \
+ vmovdqu y4, 12 * 16(mem_ab); \
+ vmovdqu y5, 13 * 16(mem_ab); \
+ vmovdqu y6, 14 * 16(mem_ab); \
+ vmovdqu y7, 15 * 16(mem_ab); \
+
+#define aria_store_state_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ mem_tmp, idx) \
+ vmovdqu x0, ((idx + 0) * 16)(mem_tmp); \
+ vmovdqu x1, ((idx + 1) * 16)(mem_tmp); \
+ vmovdqu x2, ((idx + 2) * 16)(mem_tmp); \
+ vmovdqu x3, ((idx + 3) * 16)(mem_tmp); \
+ vmovdqu x4, ((idx + 4) * 16)(mem_tmp); \
+ vmovdqu x5, ((idx + 5) * 16)(mem_tmp); \
+ vmovdqu x6, ((idx + 6) * 16)(mem_tmp); \
+ vmovdqu x7, ((idx + 7) * 16)(mem_tmp);
+
+#define aria_load_state_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ mem_tmp, idx) \
+ vmovdqu ((idx + 0) * 16)(mem_tmp), x0; \
+ vmovdqu ((idx + 1) * 16)(mem_tmp), x1; \
+ vmovdqu ((idx + 2) * 16)(mem_tmp), x2; \
+ vmovdqu ((idx + 3) * 16)(mem_tmp), x3; \
+ vmovdqu ((idx + 4) * 16)(mem_tmp), x4; \
+ vmovdqu ((idx + 5) * 16)(mem_tmp), x5; \
+ vmovdqu ((idx + 6) * 16)(mem_tmp), x6; \
+ vmovdqu ((idx + 7) * 16)(mem_tmp), x7;
+
+#define aria_ark_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ t0, rk, idx, round) \
+ /* AddRoundKey */ \
+ vpbroadcastb ((round * 16) + idx + 3)(rk), t0; \
+ vpxor t0, x0, x0; \
+ vpbroadcastb ((round * 16) + idx + 2)(rk), t0; \
+ vpxor t0, x1, x1; \
+ vpbroadcastb ((round * 16) + idx + 1)(rk), t0; \
+ vpxor t0, x2, x2; \
+ vpbroadcastb ((round * 16) + idx + 0)(rk), t0; \
+ vpxor t0, x3, x3; \
+ vpbroadcastb ((round * 16) + idx + 7)(rk), t0; \
+ vpxor t0, x4, x4; \
+ vpbroadcastb ((round * 16) + idx + 6)(rk), t0; \
+ vpxor t0, x5, x5; \
+ vpbroadcastb ((round * 16) + idx + 5)(rk), t0; \
+ vpxor t0, x6, x6; \
+ vpbroadcastb ((round * 16) + idx + 4)(rk), t0; \
+ vpxor t0, x7, x7;
+
+#define aria_sbox_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ t0, t1, t2, t3, \
+ t4, t5, t6, t7) \
+ vpxor t0, t0, t0; \
+ vaesenclast t0, x0, x0; \
+ vaesenclast t0, x4, x4; \
+ vaesenclast t0, x1, x1; \
+ vaesenclast t0, x5, x5; \
+ vaesdeclast t0, x2, x2; \
+ vaesdeclast t0, x6, x6; \
+ \
+ /* AES inverse shift rows */ \
+ vmovdqa .Linv_shift_row, t0; \
+ vmovdqa .Lshift_row, t1; \
+ vpshufb t0, x0, x0; \
+ vpshufb t0, x4, x4; \
+ vpshufb t0, x1, x1; \
+ vpshufb t0, x5, x5; \
+ vpshufb t0, x3, x3; \
+ vpshufb t0, x7, x7; \
+ vpshufb t1, x2, x2; \
+ vpshufb t1, x6, x6; \
+ \
+ vmovdqa .Linv_lo, t0; \
+ vmovdqa .Linv_hi, t1; \
+ vmovdqa .Ltf_lo_s2, t2; \
+ vmovdqa .Ltf_hi_s2, t3; \
+ vmovdqa .Ltf_lo_x2, t4; \
+ vmovdqa .Ltf_hi_x2, t5; \
+ vbroadcastss .L0f0f0f0f, t6; \
+ \
+ /* extract multiplicative inverse */ \
+ filter_8bit(x1, t0, t1, t6, t7); \
+ /* affine transformation for S2 */ \
+ filter_8bit(x1, t2, t3, t6, t7); \
+ /* extract multiplicative inverse */ \
+ filter_8bit(x5, t0, t1, t6, t7); \
+ /* affine transformation for S2 */ \
+ filter_8bit(x5, t2, t3, t6, t7); \
+ \
+ /* affine transformation for X2 */ \
+ filter_8bit(x3, t4, t5, t6, t7); \
+ vpxor t7, t7, t7; \
+ vaesenclast t7, x3, x3; \
+ /* extract multiplicative inverse */ \
+ filter_8bit(x3, t0, t1, t6, t7); \
+ /* affine transformation for X2 */ \
+ filter_8bit(x7, t4, t5, t6, t7); \
+ vpxor t7, t7, t7; \
+ vaesenclast t7, x7, x7; \
+ /* extract multiplicative inverse */ \
+ filter_8bit(x7, t0, t1, t6, t7);
+
+#define aria_diff_m(x0, x1, x2, x3, \
+ t0, t1, t2, t3) \
+ /* T = rotr32(X, 8); */ \
+ /* X ^= T */ \
+ vpxor x0, x3, t0; \
+ vpxor x1, x0, t1; \
+ vpxor x2, x1, t2; \
+ vpxor x3, x2, t3; \
+ /* X = T ^ rotr(X, 16); */ \
+ vpxor t2, x0, x0; \
+ vpxor x1, t3, t3; \
+ vpxor t0, x2, x2; \
+ vpxor t1, x3, x1; \
+ vmovdqu t3, x3;
+
+#define aria_diff_word(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ y0, y1, y2, y3, \
+ y4, y5, y6, y7) \
+ /* t1 ^= t2; */ \
+ vpxor y0, x4, x4; \
+ vpxor y1, x5, x5; \
+ vpxor y2, x6, x6; \
+ vpxor y3, x7, x7; \
+ \
+ /* t2 ^= t3; */ \
+ vpxor y4, y0, y0; \
+ vpxor y5, y1, y1; \
+ vpxor y6, y2, y2; \
+ vpxor y7, y3, y3; \
+ \
+ /* t0 ^= t1; */ \
+ vpxor x4, x0, x0; \
+ vpxor x5, x1, x1; \
+ vpxor x6, x2, x2; \
+ vpxor x7, x3, x3; \
+ \
+ /* t3 ^= t1; */ \
+ vpxor x4, y4, y4; \
+ vpxor x5, y5, y5; \
+ vpxor x6, y6, y6; \
+ vpxor x7, y7, y7; \
+ \
+ /* t2 ^= t0; */ \
+ vpxor x0, y0, y0; \
+ vpxor x1, y1, y1; \
+ vpxor x2, y2, y2; \
+ vpxor x3, y3, y3; \
+ \
+ /* t1 ^= t2; */ \
+ vpxor y0, x4, x4; \
+ vpxor y1, x5, x5; \
+ vpxor y2, x6, x6; \
+ vpxor y3, x7, x7;
+
+#define aria_fe(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ y0, y1, y2, y3, \
+ y4, y5, y6, y7, \
+ mem_tmp, rk, round) \
+ aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
+ y0, rk, 8, round); \
+ \
+ aria_sbox_8way(x2, x3, x0, x1, x6, x7, x4, x5, \
+ y0, y1, y2, y3, y4, y5, y6, y7); \
+ \
+ aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3); \
+ aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3); \
+ aria_store_state_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ mem_tmp, 8); \
+ \
+ aria_load_state_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ mem_tmp, 0); \
+ aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
+ y0, rk, 0, round); \
+ \
+ aria_sbox_8way(x2, x3, x0, x1, x6, x7, x4, x5, \
+ y0, y1, y2, y3, y4, y5, y6, y7); \
+ \
+ aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3); \
+ aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3); \
+ aria_store_state_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ mem_tmp, 0); \
+ aria_load_state_8way(y0, y1, y2, y3, \
+ y4, y5, y6, y7, \
+ mem_tmp, 8); \
+ aria_diff_word(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ y0, y1, y2, y3, \
+ y4, y5, y6, y7); \
+ /* aria_diff_byte() \
+ * T3 = ABCD -> BADC \
+ * T3 = y4, y5, y6, y7 -> y5, y4, y7, y6 \
+ * T0 = ABCD -> CDAB \
+ * T0 = x0, x1, x2, x3 -> x2, x3, x0, x1 \
+ * T1 = ABCD -> DCBA \
+ * T1 = x4, x5, x6, x7 -> x7, x6, x5, x4 \
+ */ \
+ aria_diff_word(x2, x3, x0, x1, \
+ x7, x6, x5, x4, \
+ y0, y1, y2, y3, \
+ y5, y4, y7, y6); \
+ aria_store_state_8way(x3, x2, x1, x0, \
+ x6, x7, x4, x5, \
+ mem_tmp, 0);
+
+#define aria_fo(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ y0, y1, y2, y3, \
+ y4, y5, y6, y7, \
+ mem_tmp, rk, round) \
+ aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
+ y0, rk, 8, round); \
+ \
+ aria_sbox_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
+ y0, y1, y2, y3, y4, y5, y6, y7); \
+ \
+ aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3); \
+ aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3); \
+ aria_store_state_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ mem_tmp, 8); \
+ \
+ aria_load_state_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ mem_tmp, 0); \
+ aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
+ y0, rk, 0, round); \
+ \
+ aria_sbox_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
+ y0, y1, y2, y3, y4, y5, y6, y7); \
+ \
+ aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3); \
+ aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3); \
+ aria_store_state_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ mem_tmp, 0); \
+ aria_load_state_8way(y0, y1, y2, y3, \
+ y4, y5, y6, y7, \
+ mem_tmp, 8); \
+ aria_diff_word(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ y0, y1, y2, y3, \
+ y4, y5, y6, y7); \
+ /* aria_diff_byte() \
+ * T1 = ABCD -> BADC \
+ * T1 = x4, x5, x6, x7 -> x5, x4, x7, x6 \
+ * T2 = ABCD -> CDAB \
+ * T2 = y0, y1, y2, y3, -> y2, y3, y0, y1 \
+ * T3 = ABCD -> DCBA \
+ * T3 = y4, y5, y6, y7 -> y7, y6, y5, y4 \
+ */ \
+ aria_diff_word(x0, x1, x2, x3, \
+ x5, x4, x7, x6, \
+ y2, y3, y0, y1, \
+ y7, y6, y5, y4); \
+ aria_store_state_8way(x3, x2, x1, x0, \
+ x6, x7, x4, x5, \
+ mem_tmp, 0);
+
+#define aria_ff(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ y0, y1, y2, y3, \
+ y4, y5, y6, y7, \
+ mem_tmp, rk, round, last_round) \
+ aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
+ y0, rk, 8, round); \
+ \
+ aria_sbox_8way(x2, x3, x0, x1, x6, x7, x4, x5, \
+ y0, y1, y2, y3, y4, y5, y6, y7); \
+ \
+ aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
+ y0, rk, 8, last_round); \
+ \
+ aria_store_state_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ mem_tmp, 8); \
+ \
+ aria_load_state_8way(x0, x1, x2, x3, \
+ x4, x5, x6, x7, \
+ mem_tmp, 0); \
+ aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
+ y0, rk, 0, round); \
+ \
+ aria_sbox_8way(x2, x3, x0, x1, x6, x7, x4, x5, \
+ y0, y1, y2, y3, y4, y5, y6, y7); \
+ \
+ aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
+ y0, rk, 0, last_round); \
+ \
+ aria_load_state_8way(y0, y1, y2, y3, \
+ y4, y5, y6, y7, \
+ mem_tmp, 8);
+
+/* NB: section is mergeable, all elements must be aligned 16-byte blocks */
+.section .rodata.cst16, "aM", @progbits, 16
+.align 16
+
+#define SHUFB_BYTES(idx) \
+ 0 + (idx), 4 + (idx), 8 + (idx), 12 + (idx)
+
+.Lshufb_16x16b:
+ .byte SHUFB_BYTES(0), SHUFB_BYTES(1), SHUFB_BYTES(2), SHUFB_BYTES(3);
+/* For isolating SubBytes from AESENCLAST, inverse shift row */
+.Linv_shift_row:
+ .byte 0x00, 0x0d, 0x0a, 0x07, 0x04, 0x01, 0x0e, 0x0b
+ .byte 0x08, 0x05, 0x02, 0x0f, 0x0c, 0x09, 0x06, 0x03
+.Lshift_row:
+ .byte 0x00, 0x05, 0x0a, 0x0f, 0x04, 0x09, 0x0e, 0x03
+ .byte 0x08, 0x0d, 0x02, 0x07, 0x0c, 0x01, 0x06, 0x0b
+/* extract multiplicative inverse from subByte(x) */
+.Linv_lo:
+ .byte 0x05, 0x4f, 0x91, 0xdb, 0x2c, 0x66, 0xb8, 0xf2
+ .byte 0x57, 0x1d, 0xc3, 0x89, 0x7e, 0x34, 0xea, 0xa0
+.Linv_hi:
+ .byte 0x00, 0xa4, 0x49, 0xed, 0x92, 0x36, 0xdb, 0x7f
+ .byte 0x25, 0x81, 0x6c, 0xc8, 0xb7, 0x13, 0xfe, 0x5a
+.Ltf_lo_s2:
+ .byte 0xe2, 0x4e, 0x1f, 0xb3, 0x24, 0x88, 0xd9, 0x75
+ .byte 0x61, 0xcd, 0x9c, 0x30, 0xa7, 0x0b, 0x5a, 0xf6
+.Ltf_hi_s2:
+ .byte 0x00, 0x26, 0xa7, 0x81, 0xfb, 0xdd, 0x5c, 0x7a
+ .byte 0x5f, 0x79, 0xf8, 0xde, 0xa4, 0x82, 0x03, 0x25
+.Ltf_lo_x2:
+ .byte 0x2c, 0xf4, 0x14, 0xcc, 0x56, 0x8e, 0x6e, 0xb6
+ .byte 0xed, 0x35, 0xd5, 0x0d, 0x97, 0x4f, 0xaf, 0x77
+.Ltf_hi_x2:
+ .byte 0x00, 0x75, 0x52, 0x27, 0xae, 0xdb, 0xfc, 0x89
+ .byte 0xe8, 0x9d, 0xba, 0xcf, 0x46, 0x33, 0x14, 0x61
+
+/* 4-bit mask */
+.section .rodata.cst4.L0f0f0f0f, "aM", @progbits, 4
+.align 4
+.L0f0f0f0f:
+ .long 0x0f0f0f0f
+
+.text
+
+.align 8
+SYM_FUNC_START(aria_aesni_avx_crypt_16way)
+ /* input:
+ * %rdi: rk
+ * %rsi: dst
+ * %rdx: src
+ * %rcx: rounds
+ */
+
+ FRAME_BEGIN
+
+ inpack16_pre(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rdx);
+
+ movq %rsi, %rax;
+ leaq 8 * 16(%rax), %r8;
+
+ inpack16_post(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %r8);
+ aria_fo(%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15,
+ %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
+ %rax, %rdi, 0);
+ aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %rdi, 1);
+ aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
+ %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
+ %rax, %rdi, 2);
+ aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %rdi, 3);
+ aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
+ %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
+ %rax, %rdi, 4);
+ aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %rdi, 5);
+ aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
+ %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
+ %rax, %rdi, 6);
+ aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %rdi, 7);
+ aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
+ %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
+ %rax, %rdi, 8);
+ aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %rdi, 9);
+ aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
+ %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
+ %rax, %rdi, 10);
+ cmp $12, %rcx;
+ jne .Laria192;
+ aria_ff(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %rdi, 11, 12);
+ jmp .Laria_end;
+.Laria192:
+ aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %rdi, 11);
+ aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
+ %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
+ %rax, %rdi, 12);
+ cmp $14, %rcx;
+ jne .Laria256;
+ aria_ff(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %rdi, 13, 14);
+ jmp .Laria_end;
+.Laria256:
+ aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %rdi, 13);
+ aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
+ %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
+ %rax, %rdi, 14);
+ aria_ff(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %rdi, 15, 16);
+.Laria_end:
+ outunpack16(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
+ %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
+ %xmm15, %rax, %r8);
+
+ FRAME_END
+ RET;
+SYM_FUNC_END(aria_aesni_avx_crypt_16way)
diff --git a/arch/x86/crypto/aria_aesni_avx_glue.c b/arch/x86/crypto/aria_aesni_avx_glue.c
new file mode 100644
index 000000000000..53121970d169
--- /dev/null
+++ b/arch/x86/crypto/aria_aesni_avx_glue.c
@@ -0,0 +1,165 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Glue Code for the AVX assembler implementation of the ARIA Cipher
+ *
+ * Copyright (c) 2022 Taehee Yoo <[email protected]>
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/internal/simd.h>
+#include <crypto/aria.h>
+#include <linux/crypto.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/types.h>
+
+#include "ecb_cbc_helpers.h"
+
+asmlinkage void aria_aesni_avx_crypt_16way(const u32 *rk, u8 *dst,
+ const u8 *src, int rounds);
+
+static int ecb_do_encrypt(struct skcipher_request *req, const u32 *rkey)
+{
+ struct aria_ctx *ctx = crypto_skcipher_ctx(crypto_skcipher_reqtfm(req));
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ while ((nbytes = walk.nbytes) > 0) {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+
+ kernel_fpu_begin();
+ while (nbytes >= ARIA_AVX_BLOCK_SIZE) {
+ aria_aesni_avx_crypt_16way(rkey, dst, src, ctx->rounds);
+ dst += ARIA_AVX_BLOCK_SIZE;
+ src += ARIA_AVX_BLOCK_SIZE;
+ nbytes -= ARIA_AVX_BLOCK_SIZE;
+ }
+ while (nbytes >= ARIA_BLOCK_SIZE) {
+ aria_encrypt(ctx, dst, src);
+ dst += ARIA_BLOCK_SIZE;
+ src += ARIA_BLOCK_SIZE;
+ nbytes -= ARIA_BLOCK_SIZE;
+ }
+ kernel_fpu_end();
+
+ err = skcipher_walk_done(&walk, nbytes);
+ }
+
+ return err;
+}
+
+static int ecb_do_decrypt(struct skcipher_request *req, const u32 *rkey)
+{
+ struct aria_ctx *ctx = crypto_skcipher_ctx(crypto_skcipher_reqtfm(req));
+ struct skcipher_walk walk;
+ unsigned int nbytes;
+ int err;
+
+ err = skcipher_walk_virt(&walk, req, false);
+
+ while ((nbytes = walk.nbytes) > 0) {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+
+ kernel_fpu_begin();
+ while (nbytes >= ARIA_AVX_BLOCK_SIZE) {
+ aria_aesni_avx_crypt_16way(rkey, dst, src, ctx->rounds);
+ dst += ARIA_AVX_BLOCK_SIZE;
+ src += ARIA_AVX_BLOCK_SIZE;
+ nbytes -= ARIA_AVX_BLOCK_SIZE;
+ }
+ while (nbytes >= ARIA_BLOCK_SIZE) {
+ aria_decrypt(ctx, dst, src);
+ dst += ARIA_BLOCK_SIZE;
+ src += ARIA_BLOCK_SIZE;
+ nbytes -= ARIA_BLOCK_SIZE;
+ }
+ kernel_fpu_end();
+
+ err = skcipher_walk_done(&walk, nbytes);
+ }
+
+ return err;
+}
+
+static int aria_avx_ecb_encrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct aria_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ return ecb_do_encrypt(req, ctx->enc_key[0]);
+}
+
+static int aria_avx_ecb_decrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct aria_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+ return ecb_do_decrypt(req, ctx->dec_key[0]);
+}
+
+static int aria_avx_set_key(struct crypto_skcipher *tfm, const u8 *key,
+ unsigned int keylen)
+{
+ return aria_set_key(&tfm->base, key, keylen);
+}
+
+static struct skcipher_alg aria_algs[] = {
+ {
+ .base.cra_name = "__ecb(aria)",
+ .base.cra_driver_name = "__ecb-aria-avx",
+ .base.cra_priority = 400,
+ .base.cra_flags = CRYPTO_ALG_INTERNAL,
+ .base.cra_blocksize = ARIA_BLOCK_SIZE,
+ .base.cra_ctxsize = sizeof(struct aria_ctx),
+ .base.cra_module = THIS_MODULE,
+ .min_keysize = ARIA_MIN_KEY_SIZE,
+ .max_keysize = ARIA_MAX_KEY_SIZE,
+ .setkey = aria_avx_set_key,
+ .encrypt = aria_avx_ecb_encrypt,
+ .decrypt = aria_avx_ecb_decrypt,
+ }
+};
+
+static struct simd_skcipher_alg *aria_simd_algs[ARRAY_SIZE(aria_algs)];
+
+static int __init aria_avx_init(void)
+{
+ const char *feature_name;
+
+ if (!boot_cpu_has(X86_FEATURE_AVX) ||
+ !boot_cpu_has(X86_FEATURE_AES) ||
+ !boot_cpu_has(X86_FEATURE_OSXSAVE)) {
+ pr_info("AVX or AES-NI instructions are not detected.\n");
+ return -ENODEV;
+ }
+
+ if (!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM,
+ &feature_name)) {
+ pr_info("CPU feature '%s' is not supported.\n", feature_name);
+ return -ENODEV;
+ }
+
+ return simd_register_skciphers_compat(aria_algs,
+ ARRAY_SIZE(aria_algs),
+ aria_simd_algs);
+}
+
+static void __exit aria_avx_exit(void)
+{
+ simd_unregister_skciphers(aria_algs, ARRAY_SIZE(aria_algs),
+ aria_simd_algs);
+}
+
+module_init(aria_avx_init);
+module_exit(aria_avx_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Taehee Yoo <[email protected]>");
+MODULE_DESCRIPTION("ARIA Cipher Algorithm, AVX/AES-NI optimized");
+MODULE_ALIAS_CRYPTO("aria");
+MODULE_ALIAS_CRYPTO("aria-aesni-avx");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index b1ccf873779d..cd63ea83ddd7 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1659,6 +1659,27 @@ config CRYPTO_ARIA
See also:
<https://seed.kisa.or.kr/kisa/algorithm/EgovAriaInfo.do>

+config CRYPTO_ARIA_AESNI_AVX_X86_64
+ tristate "ARIA cipher algorithm (x86_64/AES-NI/AVX)"
+ depends on X86 && 64BIT
+ select CRYPTO_SKCIPHER
+ select CRYPTO_ARIA
+ select CRYPTO_SIMD
+ help
+ ARIA cipher algorithm (RFC5794).
+
+ ARIA is a standard encryption algorithm of the Republic of Korea.
+ The ARIA specifies three key sizes and rounds.
+ 128-bit: 12 rounds.
+ 192-bit: 14 rounds.
+ 256-bit: 16 rounds.
+
+ This module provides the ARIA cipher algorithm that processes
+ sixteen blocks parallel using the AVX instruction set.
+
+ See also:
+ <https://seed.kisa.or.kr/kisa/algorithm/EgovAriaInfo.do>
+
config CRYPTO_SERPENT
tristate "Serpent cipher algorithm"
select CRYPTO_ALGAPI
--
2.17.1

Subject: RE: [PATCH 2/3] crypto: aria-avx: add AES-NI/AVX/x86_64 assembler implementation of aria cipher

> -----Original Message-----
> From: Taehee Yoo <[email protected]>
> Sent: Wednesday, August 24, 2022 10:59 AM
> Subject: [PATCH 2/3] crypto: aria-avx: add AES-NI/AVX/x86_64 assembler
> implementation of aria cipher
>
...

> +#include "ecb_cbc_helpers.h"
> +
> +asmlinkage void aria_aesni_avx_crypt_16way(const u32 *rk, u8 *dst,
> + const u8 *src, int rounds);
> +
> +static int ecb_do_encrypt(struct skcipher_request *req, const u32 *rkey)
> +{
> + struct aria_ctx *ctx =
> crypto_skcipher_ctx(crypto_skcipher_reqtfm(req));
> + struct skcipher_walk walk;
> + unsigned int nbytes;
> + int err;
> +
> + err = skcipher_walk_virt(&walk, req, false);
> +
> + while ((nbytes = walk.nbytes) > 0) {
> + const u8 *src = walk.src.virt.addr;
> + u8 *dst = walk.dst.virt.addr;
> +
> + kernel_fpu_begin();
> + while (nbytes >= ARIA_AVX_BLOCK_SIZE) {
> + aria_aesni_avx_crypt_16way(rkey, dst, src, ctx->rounds);
> + dst += ARIA_AVX_BLOCK_SIZE;
> + src += ARIA_AVX_BLOCK_SIZE;
> + nbytes -= ARIA_AVX_BLOCK_SIZE;
> + }
> + while (nbytes >= ARIA_BLOCK_SIZE) {
> + aria_encrypt(ctx, dst, src);
> + dst += ARIA_BLOCK_SIZE;
> + src += ARIA_BLOCK_SIZE;
> + nbytes -= ARIA_BLOCK_SIZE;
> + }
> + kernel_fpu_end();

Is aria_encrypt() an FPU function that belongs between kernel_fpu_begin()
and kernel_fpu_end()?

Several drivers (camellia, serpent, twofish, cast5, cast6) use ECB_WALK_START,
ECB_BLOCK, and ECB_WALK_END macros for these loops. Those macros only call
kernel_fpu_end() at the very end. That requires the functions to be structured
with (ctx, dst, src) arguments, not (rkey, dst, src, rounds).
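
For reference, a minimal sketch of how those helpers are used (assuming a
hypothetical aria_ecb_enc_16way() wrapper with (ctx, dst, src) arguments
and an ARIA_AESNI_PARALLEL_BLOCKS constant; this is not existing code):

static int ecb_encrypt(struct skcipher_request *req)
{
	ECB_WALK_START(req, ARIA_BLOCK_SIZE, ARIA_AESNI_PARALLEL_BLOCKS);
	ECB_BLOCK(ARIA_AESNI_PARALLEL_BLOCKS, aria_ecb_enc_16way);
	ECB_BLOCK(1, aria_encrypt);
	ECB_WALK_END();
}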


2022-08-25 13:44:31

by Taehee Yoo

Subject: Re: [PATCH 2/3] crypto: aria-avx: add AES-NI/AVX/x86_64 assembler implementation of aria cipher

Hi Elliott, Robert
Thank you so much for your review!

On 8/25/22 04:35, Elliott, Robert (Servers) wrote:
>> -----Original Message-----
>> From: Taehee Yoo <[email protected]>
>> Sent: Wednesday, August 24, 2022 10:59 AM
>> Subject: [PATCH 2/3] crypto: aria-avx: add AES-NI/AVX/x86_64 assembler
>> implementation of aria cipher
>>
> ...
>
>> +#include "ecb_cbc_helpers.h"
>> +
>> +asmlinkage void aria_aesni_avx_crypt_16way(const u32 *rk, u8 *dst,
>> + const u8 *src, int rounds);
>> +
>> +static int ecb_do_encrypt(struct skcipher_request *req, const u32 *rkey)
>> +{
>> + struct aria_ctx *ctx =
>> crypto_skcipher_ctx(crypto_skcipher_reqtfm(req));
>> + struct skcipher_walk walk;
>> + unsigned int nbytes;
>> + int err;
>> +
>> + err = skcipher_walk_virt(&walk, req, false);
>> +
>> + while ((nbytes = walk.nbytes) > 0) {
>> + const u8 *src = walk.src.virt.addr;
>> + u8 *dst = walk.dst.virt.addr;
>> +
>> + kernel_fpu_begin();
>> + while (nbytes >= ARIA_AVX_BLOCK_SIZE) {
>> + aria_aesni_avx_crypt_16way(rkey, dst, src, ctx->rounds);
>> + dst += ARIA_AVX_BLOCK_SIZE;
>> + src += ARIA_AVX_BLOCK_SIZE;
>> + nbytes -= ARIA_AVX_BLOCK_SIZE;
>> + }
>> + while (nbytes >= ARIA_BLOCK_SIZE) {
>> + aria_encrypt(ctx, dst, src);
>> + dst += ARIA_BLOCK_SIZE;
>> + src += ARIA_BLOCK_SIZE;
>> + nbytes -= ARIA_BLOCK_SIZE;
>> + }
>> + kernel_fpu_end();
>
> Is aria_encrypt() an FPU function that belongs between kernel_fpu_begin()
> and kernel_fpu_end()?
>
> Several drivers (camellia, serpent, twofish, cast5, cast6) use ECB_WALK_START,
> ECB_BLOCK, and ECB_WALK_END macros for these loops. Those macros only call
> kernel_fpu_end() at the very end. That requires the functions to be structured
> with (ctx, dst, src) arguments, not (rkey, dst, src, rounds).
>

aria_{encrypt | decrypt}() are not FPU functions.
I think that if the ECB macros are used, aria_{encrypt | decrypt}() will
still run in the FPU context.
So, I will move them out of the kernel FPU context.
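
For example, the inner loops could be restructured so that the generic
per-block fallback runs outside the FPU region (sketch only, not the
final patch):

	kernel_fpu_begin();
	while (nbytes >= ARIA_AVX_BLOCK_SIZE) {
		aria_aesni_avx_crypt_16way(rkey, dst, src, ctx->rounds);
		dst += ARIA_AVX_BLOCK_SIZE;
		src += ARIA_AVX_BLOCK_SIZE;
		nbytes -= ARIA_AVX_BLOCK_SIZE;
	}
	kernel_fpu_end();

	while (nbytes >= ARIA_BLOCK_SIZE) {
		aria_encrypt(ctx, dst, src);
		dst += ARIA_BLOCK_SIZE;
		src += ARIA_BLOCK_SIZE;
		nbytes -= ARIA_BLOCK_SIZE;
	}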

Thank you so much!
Taehee Yoo