2022-08-26 05:42:59

by Taehee Yoo

Subject: [PATCH v2 0/3] crypto: aria: add ARIA AES-NI/AVX/x86_64 implementation

The purpose of this patchset is to add support for an AVX implementation
of ARIA (aria-avx).
Many of the ideas in this implementation come from Camellia-avx,
especially byte slicing.
Like Camellia, ARIA also uses a 16-way strategy.
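
As a rough illustration of the byte-slicing idea (not the actual kernel
code, which does this transpose with unpack/shuffle instructions), the
16-way layout can be sketched in C like this:

#include <stdint.h>

/*
 * Byte-sliced layout for 16-way processing: 16 input blocks of 16
 * bytes each are transposed so that row i holds byte i of every
 * block. One 128-bit register per row can then apply the same
 * byte-wise operation (s-box, XOR, ...) to all 16 blocks at once.
 */
static void byteslice_16way(const uint8_t blocks[16][16],
                            uint8_t sliced[16][16])
{
        int blk, byte;

        for (byte = 0; byte < 16; byte++)
                for (blk = 0; blk < 16; blk++)
                        sliced[byte][blk] = blocks[blk][byte];
}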

The ARIA cipher algorithm is similar to AES.
There are four s-boxes in the ARIA spec, and the first and second
s-boxes are the same as AES's s-boxes.
Most functions are based on the aria-generic code, except for the s-box
related functions.
The aria-avx code doesn't implement the key expansion function;
it only supports encrypt() and decrypt().

Encryption and decryption logic is actually the same, but they must use
separate keys (an encryption key and a decryption key).
The en/decryption steps are as below (a scalar C sketch follows the
list):
1. Add-Round-Key
2. S-box
3. Diffusion Layer
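
For illustration only, here is a minimal scalar sketch of one round;
aria_sub_layer() and aria_diff_layer() are hypothetical stand-ins for
the real s-box and diffusion code, and the actual implementation does
all of this for 16 blocks at a time in vector registers:

#include <stdint.h>

/* hypothetical stand-ins for the s-box and diffusion layers */
void aria_sub_layer(uint8_t state[16]);
void aria_diff_layer(uint8_t state[16]);

static void aria_round_sketch(uint8_t state[16], const uint8_t rk[16])
{
        int i;

        /* 1. Add-Round-Key: XOR the 128-bit round key into the state. */
        for (i = 0; i < 16; i++)
                state[i] ^= rk[i];

        /* 2. S-box layer: apply the four s-boxes per byte position. */
        aria_sub_layer(state);

        /* 3. Diffusion layer: ARIA's 16x16 binary matrix over bytes. */
        aria_diff_layer(state);
}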

There is nothing special about the Add-Round-Key step.

There are some notable things in the s-box step.
Like Camellia, it doesn't use a lookup table; instead, it uses AES-NI.

To calculate the first s-box, it just uses aesenclast and then inverts
the shift_row step. Nothing more is needed because the first s-box is
the same as the AES encryption s-box.

To calculate the second s-box (the inverse of the first), it just uses
aesdeclast and then inverts the shift_row step. Nothing more is needed
because this s-box is the same as the AES decryption s-box.

To calculate the third and fourth s-boxes, it uses aesenclast, then
inverts the shift_row step, and finally applies an affine
transformation.
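
To make the s1/s2 trick concrete, here is a sketch with C intrinsics
instead of the patch's assembler (assumes AES-NI and SSSE3; the byte
masks are the standard forward/inverse AES ShiftRows permutations, and
the function names are invented for this sketch):

#include <stdint.h>
#include <immintrin.h>

/* pshufb masks: inverse and forward AES ShiftRows permutations */
static const uint8_t inv_shift_row[16] = {
        0x00, 0x0d, 0x0a, 0x07, 0x04, 0x01, 0x0e, 0x0b,
        0x08, 0x05, 0x02, 0x0f, 0x0c, 0x09, 0x06, 0x03,
};
static const uint8_t shift_row[16] = {
        0x00, 0x05, 0x0a, 0x0f, 0x04, 0x09, 0x0e, 0x03,
        0x08, 0x0d, 0x02, 0x07, 0x0c, 0x01, 0x06, 0x0b,
};

/* s1: aesenclast with a zero round key performs ShiftRows followed by
 * SubBytes; shuffling with the inverse ShiftRows mask afterwards
 * cancels the row shift, leaving pure SubBytes (= ARIA s1). */
static __m128i aria_s1(__m128i x)
{
        x = _mm_aesenclast_si128(x, _mm_setzero_si128());
        return _mm_shuffle_epi8(x,
                        _mm_loadu_si128((const __m128i *)inv_shift_row));
}

/* inverse of s1: aesdeclast with a zero round key performs
 * InvShiftRows followed by InvSubBytes; the forward ShiftRows mask
 * cancels the row shift, leaving pure InvSubBytes. */
static __m128i aria_s1_inv(__m128i x)
{
        x = _mm_aesdeclast_si128(x, _mm_setzero_si128());
        return _mm_shuffle_epi8(x,
                        _mm_loadu_si128((const __m128i *)shift_row));
}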

The aria-generic implementation is based on a 32-bit implementation,
not an 8-bit one.
The aria-avx Diffusion Layer follows the aria-generic implementation
because the 8-bit form is not well suited to a parallel implementation,
while the 32-bit form is.
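
For example, a word-level mixing step in the 32-bit style looks like
this (an illustrative sequence only, not the exact generic code); each
statement is a whole-word XOR, which maps directly to one vpxor on the
byte-sliced registers:

#include <stdint.h>

static void aria_diff_word_sketch(uint32_t *t0, uint32_t *t1,
                                  uint32_t *t2, uint32_t *t3)
{
        /* Whole-word XORs: trivially vectorizable 16 blocks wide. */
        *t1 ^= *t2;
        *t2 ^= *t3;
        *t0 ^= *t1;
        *t3 ^= *t1;
        *t2 ^= *t0;
        *t1 ^= *t2;
}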

The first patch in this series exports functions for aria-avx, which
reuses existing functions from the aria-generic code.
The second patch implements aria-avx.
The last patch adds an async speed test for aria.

Benchmarks:
tcrypt is used.
cpu: i3-12100

How to test:
modprobe aria-generic
modprobe tcrypt mode=610 num_mb=8192

Result:
testing speed of multibuffer ecb(aria) (ecb(aria-generic)) encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 534 cycles
test 2 (128 bit key, 128 byte blocks): 1 operation in 2006 cycles
test 3 (128 bit key, 256 byte blocks): 1 operation in 3674 cycles
test 6 (128 bit key, 4096 byte blocks): 1 operation in 52374 cycles
test 7 (256 bit key, 16 byte blocks): 1 operation in 608 cycles
test 9 (256 bit key, 128 byte blocks): 1 operation in 2586 cycles
test 10 (256 bit key, 256 byte blocks): 1 operation in 4707 cycles
test 13 (256 bit key, 4096 byte blocks): 1 operation in 69794 cycles

testing speed of multibuffer ecb(aria) (ecb(aria-generic)) decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 545 cycles
test 2 (128 bit key, 128 byte blocks): 1 operation in 1995 cycles
test 3 (128 bit key, 256 byte blocks): 1 operation in 3673 cycles
test 6 (128 bit key, 4096 byte blocks): 1 operation in 52359 cycles
test 7 (256 bit key, 16 byte blocks): 1 operation in 615 cycles
test 9 (256 bit key, 128 byte blocks): 1 operation in 2588 cycles
test 10 (256 bit key, 256 byte blocks): 1 operation in 4712 cycles
test 13 (256 bit key, 4096 byte blocks): 1 operation in 69916 cycles

How to test:
modprobe aria
modprobe tcrypt mode=610 num_mb=8192

Result:
testing speed of multibuffer ecb(aria) (ecb-aria-avx) encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 727 cycles
test 2 (128 bit key, 128 byte blocks): 1 operation in 2040 cycles
test 3 (128 bit key, 256 byte blocks): 1 operation in 1399 cycles
test 6 (128 bit key, 4096 byte blocks): 1 operation in 14758 cycles
test 7 (256 bit key, 16 byte blocks): 1 operation in 702 cycles
test 9 (256 bit key, 128 byte blocks): 1 operation in 2615 cycles
test 10 (256 bit key, 256 byte blocks): 1 operation in 1677 cycles
test 13 (256 bit key, 4096 byte blocks): 1 operation in 19454 cycles
testing speed of multibuffer ecb(aria) (ecb-aria-avx) decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 638 cycles
test 2 (128 bit key, 128 byte blocks): 1 operation in 2090 cycles
test 3 (128 bit key, 256 byte blocks): 1 operation in 1394 cycles
test 6 (128 bit key, 4096 byte blocks): 1 operation in 14824 cycles
test 7 (256 bit key, 16 byte blocks): 1 operation in 719 cycles
test 9 (256 bit key, 128 byte blocks): 1 operation in 2633 cycles
test 10 (256 bit key, 256 byte blocks): 1 operation in 1684 cycles
test 13 (256 bit key, 4096 byte blocks): 1 operation in 19457 cycles

v2:
- Do not call non-FPU functions (aria_{encrypt | decrypt}()) in the
  FPU context.
- Do not hold the FPU context for too long (the resulting pattern is
  sketched below).
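
For context, the resulting glue pattern looks roughly like this (a
sketch assuming the usual x86 kernel FPU API; the function names and
the 16-block chunking are placeholders, not the patch's actual code):

#include <linux/types.h>
#include <asm/fpu/api.h>

#define ARIA_BLOCK_SIZE			16
#define ARIA_AVX_PARALLEL_BLOCKS	16	/* hypothetical */

/* hypothetical prototypes standing in for the real ones */
void aria_avx_encrypt_16way(const u32 *rkey, u8 *dst, const u8 *src);
void aria_generic_encrypt_one(const u32 *rkey, u8 *dst, const u8 *src);

static void aria_avx_ecb_encrypt(const u32 *rkey, u8 *dst, const u8 *src,
                                 unsigned int nblocks)
{
        /* Keep each FPU section short: begin/end around every
         * 16-block chunk instead of around the whole request. */
        while (nblocks >= ARIA_AVX_PARALLEL_BLOCKS) {
                kernel_fpu_begin();
                aria_avx_encrypt_16way(rkey, dst, src);
                kernel_fpu_end();
                src += ARIA_AVX_PARALLEL_BLOCKS * ARIA_BLOCK_SIZE;
                dst += ARIA_AVX_PARALLEL_BLOCKS * ARIA_BLOCK_SIZE;
                nblocks -= ARIA_AVX_PARALLEL_BLOCKS;
        }

        /* Leftover blocks go through the generic C code, called
         * outside the FPU context as noted above. */
        while (nblocks--) {
                aria_generic_encrypt_one(rkey, dst, src);
                src += ARIA_BLOCK_SIZE;
                dst += ARIA_BLOCK_SIZE;
        }
}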

Taehee Yoo (3):
crypto: aria: prepare generic module for optimized implementations
crypto: aria-avx: add AES-NI/AVX/x86_64 assembler implementation of
aria cipher
crypto: tcrypt: add async speed test for aria cipher

arch/x86/crypto/Makefile | 3 +
arch/x86/crypto/aria-aesni-avx-asm_64.S | 648 ++++++++++++++++++++++++
arch/x86/crypto/aria_aesni_avx_glue.c | 165 ++++++
crypto/Kconfig | 21 +
crypto/Makefile | 2 +-
crypto/{aria.c => aria_generic.c} | 39 +-
crypto/tcrypt.c | 13 +
include/crypto/aria.h | 14 +-
8 files changed, 889 insertions(+), 16 deletions(-)
create mode 100644 arch/x86/crypto/aria-aesni-avx-asm_64.S
create mode 100644 arch/x86/crypto/aria_aesni_avx_glue.c
rename crypto/{aria.c => aria_generic.c} (86%)

--
2.17.1


2022-09-01 20:36:50

by Jussi Kivilinna

Subject: Re: [PATCH v2 0/3] crypto: aria: add ARIA AES-NI/AVX/x86_64 implementation

Hello,

On 26.8.2022 8.31, Taehee Yoo wrote:
> [...]
>
> Benchmarks:
> tcrypt is used.
> cpu: i3-12100

This CPU also supports the Galois Field New Instructions (GFNI), which
are even better suited for accelerating ciphers that use the same
building blocks as AES. For example, I've recently implemented Camellia
using GFNI for libgcrypt [1].

I quickly hacked GFNI into your implementation and it gives a nice
extra bit of performance (~55% faster on Intel Tiger Lake). Here's the
GFNI version of 'aria_sbox_8way' that I used:

/////////////////////////////////////////////////////////
#define aria_sbox_8way(x0, x1, x2, x3,                  \
                       x4, x5, x6, x7,                  \
                       t0, t1, t2, t3,                  \
                       t4, t5, t6, t7)                  \
        vpbroadcastq .Ltf_s2_bitmatrix, t0;             \
        vpbroadcastq .Ltf_inv_bitmatrix, t1;            \
        vpbroadcastq .Ltf_id_bitmatrix, t2;             \
        vpbroadcastq .Ltf_aff_bitmatrix, t3;            \
        vpbroadcastq .Ltf_x2_bitmatrix, t4;             \
        vgf2p8affineinvqb $(tf_s2_const), t0, x1, x1;   \
        vgf2p8affineinvqb $(tf_s2_const), t0, x5, x5;   \
        vgf2p8affineqb $(tf_inv_const), t1, x2, x2;     \
        vgf2p8affineqb $(tf_inv_const), t1, x6, x6;     \
        vgf2p8affineinvqb $0, t2, x2, x2;               \
        vgf2p8affineinvqb $0, t2, x6, x6;               \
        vgf2p8affineinvqb $(tf_aff_const), t3, x0, x0;  \
        vgf2p8affineinvqb $(tf_aff_const), t3, x4, x4;  \
        vgf2p8affineqb $(tf_x2_const), t4, x3, x3;      \
        vgf2p8affineqb $(tf_x2_const), t4, x7, x7;      \
        vgf2p8affineinvqb $0, t2, x3, x3;               \
        vgf2p8affineinvqb $0, t2, x7, x7;

#define BV8(a0,a1,a2,a3,a4,a5,a6,a7)    \
        ( (((a0) & 1) << 0) |           \
          (((a1) & 1) << 1) |           \
          (((a2) & 1) << 2) |           \
          (((a3) & 1) << 3) |           \
          (((a4) & 1) << 4) |           \
          (((a5) & 1) << 5) |           \
          (((a6) & 1) << 6) |           \
          (((a7) & 1) << 7) )

#define BM8X8(l0,l1,l2,l3,l4,l5,l6,l7)  \
        ( ((l7) << (0 * 8)) |           \
          ((l6) << (1 * 8)) |           \
          ((l5) << (2 * 8)) |           \
          ((l4) << (3 * 8)) |           \
          ((l3) << (4 * 8)) |           \
          ((l2) << (5 * 8)) |           \
          ((l1) << (6 * 8)) |           \
          ((l0) << (7 * 8)) )

/* AES affine: */
#define tf_aff_const BV8(1, 1, 0, 0, 0, 1, 1, 0)
.Ltf_aff_bitmatrix:
        .quad BM8X8(BV8(1, 0, 0, 0, 1, 1, 1, 1),
                    BV8(1, 1, 0, 0, 0, 1, 1, 1),
                    BV8(1, 1, 1, 0, 0, 0, 1, 1),
                    BV8(1, 1, 1, 1, 0, 0, 0, 1),
                    BV8(1, 1, 1, 1, 1, 0, 0, 0),
                    BV8(0, 1, 1, 1, 1, 1, 0, 0),
                    BV8(0, 0, 1, 1, 1, 1, 1, 0),
                    BV8(0, 0, 0, 1, 1, 1, 1, 1))

/* AES inverse affine: */
#define tf_inv_const BV8(1, 0, 1, 0, 0, 0, 0, 0)
.Ltf_inv_bitmatrix:
        .quad BM8X8(BV8(0, 0, 1, 0, 0, 1, 0, 1),
                    BV8(1, 0, 0, 1, 0, 0, 1, 0),
                    BV8(0, 1, 0, 0, 1, 0, 0, 1),
                    BV8(1, 0, 1, 0, 0, 1, 0, 0),
                    BV8(0, 1, 0, 1, 0, 0, 1, 0),
                    BV8(0, 0, 1, 0, 1, 0, 0, 1),
                    BV8(1, 0, 0, 1, 0, 1, 0, 0),
                    BV8(0, 1, 0, 0, 1, 0, 1, 0))

/* S2: */
#define tf_s2_const BV8(0, 1, 0, 0, 0, 1, 1, 1)
.Ltf_s2_bitmatrix:
        .quad BM8X8(BV8(0, 1, 0, 1, 0, 1, 1, 1),
                    BV8(0, 0, 1, 1, 1, 1, 1, 1),
                    BV8(1, 1, 1, 0, 1, 1, 0, 1),
                    BV8(1, 1, 0, 0, 0, 0, 1, 1),
                    BV8(0, 1, 0, 0, 0, 0, 1, 1),
                    BV8(1, 1, 0, 0, 1, 1, 1, 0),
                    BV8(0, 1, 1, 0, 0, 0, 1, 1),
                    BV8(1, 1, 1, 1, 0, 1, 1, 0))

/* X2: */
#define tf_x2_const BV8(0, 0, 1, 1, 0, 1, 0, 0)
.Ltf_x2_bitmatrix:
        .quad BM8X8(BV8(0, 0, 0, 1, 1, 0, 0, 0),
                    BV8(0, 0, 1, 0, 0, 1, 1, 0),
                    BV8(0, 0, 0, 0, 1, 0, 1, 0),
                    BV8(1, 1, 1, 0, 0, 0, 1, 1),
                    BV8(1, 1, 1, 0, 1, 1, 0, 0),
                    BV8(0, 1, 1, 0, 1, 0, 1, 1),
                    BV8(1, 0, 1, 1, 1, 1, 0, 1),
                    BV8(1, 0, 0, 1, 0, 0, 1, 1))

/* Identity matrix: */
.Ltf_id_bitmatrix:
        .quad BM8X8(BV8(1, 0, 0, 0, 0, 0, 0, 0),
                    BV8(0, 1, 0, 0, 0, 0, 0, 0),
                    BV8(0, 0, 1, 0, 0, 0, 0, 0),
                    BV8(0, 0, 0, 1, 0, 0, 0, 0),
                    BV8(0, 0, 0, 0, 1, 0, 0, 0),
                    BV8(0, 0, 0, 0, 0, 1, 0, 0),
                    BV8(0, 0, 0, 0, 0, 0, 1, 0),
                    BV8(0, 0, 0, 0, 0, 0, 0, 1))
/////////////////////////////////////////////////////////

GFNI also allows easy use of 256-bit vector registers, so there is a
way to get an additional 2x speed increase (but it requires doubling
the number of parallel processed blocks).
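
For reference, the same s2 step can also be written with 256-bit GFNI
intrinsics (a sketch assuming GFNI+AVX2; 0xE2 is tf_s2_const as
computed from the BV8() definition above, and s2_matrix would hold the
broadcast .Ltf_s2_bitmatrix value):

#include <immintrin.h>

/* One vgf2p8affineinvqb computes A * inverse(x) ^ b per byte, for 32
 * bytes at a time in a ymm register. */
static __m256i aria_s2_32bytes(__m256i x, __m256i s2_matrix)
{
        return _mm256_gf2p8affineinv_epi64_epi8(x, s2_matrix, 0xE2);
}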

-Jussi

[1] https://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=blob;f=cipher/camellia-aesni-avx2-amd64.h#l80

2022-09-02 09:51:41

by Taehee Yoo

Subject: Re: [PATCH v2 0/3] crypto: aria: add ARIA AES-NI/AVX/x86_64 implementation

Hi Jussi,
Thank you so much for this great work!

On 9/2/22 05:09, Jussi Kivilinna wrote:
> Hello,
>
> On 26.8.2022 8.31, Taehee Yoo wrote:
>> [...]
>>
>> Benchmarks:
>> tcrypt is used.
>> cpu: i3-12100
>
> This CPU also supports the Galois Field New Instructions (GFNI), which
> are even better suited for accelerating ciphers that use the same
> building blocks as AES. For example, I've recently implemented Camellia
> using GFNI for libgcrypt [1].
>
> I quickly hacked GFNI into your implementation and it gives a nice
> extra bit of performance (~55% faster on Intel Tiger Lake). Here's the
> GFNI version of 'aria_sbox_8way' that I used:
>
> [...]
>
> GFNI also allows easy use of 256-bit vector registers, so there is a
> way to get an additional 2x speed increase (but it requires doubling
> the number of parallel processed blocks).
>

I checked that my i3-12100 supports GFNI,
then benchmarked this implementation.
It works very well and shows amazing performance!
Before:
128-bit, 4096 bytes: 14758 cycles
After:
128-bit, 4096 bytes: 9404 cycles

I think I should add this implementation to the v3 patch.
Like your code [1], I will also add a check for whether the CPU
supports GFNI (a sketch follows below).
This will be very helpful for the AVX2 and AVX-512 implementations too.
I think the AVX2 implementation will use 256-bit vector registers,
so, as you mentioned, its potential performance gain is also great.
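
A minimal sketch of such a check, assuming the usual x86 feature-test
API (the dispatch structure and function names here are invented for
illustration):

#include <linux/module.h>
#include <asm/cpufeature.h>

/* hypothetical s-box implementations and dispatch structure */
void aria_sbox_aesni_16way(void *vecs);
void aria_sbox_gfni_16way(void *vecs);

static struct {
        void (*sbox_layer)(void *vecs);
} aria_ops;

static int __init aria_avx_init(void)
{
        if (!boot_cpu_has(X86_FEATURE_AVX) ||
            !boot_cpu_has(X86_FEATURE_AES)) {
                pr_info("AVX or AES-NI instructions are not detected.\n");
                return -ENODEV;
        }

        /* Use the GFNI s-box path only when the CPU supports it. */
        if (boot_cpu_has(X86_FEATURE_GFNI))
                aria_ops.sbox_layer = aria_sbox_gfni_16way;
        else
                aria_ops.sbox_layer = aria_sbox_aesni_16way;

        return 0;
}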

Thank you so much again for your great work!

Taehee Yoo

[1]
https://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=blob;f=cipher/camellia-aesni-avx2-amd64.h#l80

