Received-SPF: pass (google.com: domain of linux-crypto-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Message-ID: <9557c406-a3af-6b20-5933-b61fd759ca70@gmail.com>
Date:   Fri, 2 Sep 2022 18:39:47 +0900
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.11.0
Subject: Re: [PATCH v2 0/3] crypto: aria: add ARIA AES-NI/AVX/x86_64
 implementation
Content-Language: en-US
To:     Jussi Kivilinna <jussi.kivilinna@iki.fi>
Cc:     elliott@hpe.com, hpa@zytor.com, x86@kernel.org,
        davem@davemloft.net, mingo@redhat.com, tglx@linutronix.de,
        dave.hansen@linux.intel.com, bp@alien8.de,
        herbert@gondor.apana.org.au, linux-crypto@vger.kernel.org
References: <20220826053131.24792-1-ap420073@gmail.com>
 <afef1c3a-9a72-9006-da95-d63ec5aece5c@iki.fi>
From:   Taehee Yoo <ap420073@gmail.com>
In-Reply-To: <afef1c3a-9a72-9006-da95-d63ec5aece5c@iki.fi>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: bulk

Hi Jussi,
Thank you so much for this great work!

On 9/2/22 05:09, Jussi Kivilinna wrote:
 > Hello,
 >
 > On 26.8.2022 8.31, Taehee Yoo wrote:
 >> The purpose of this patchset is to support the implementation of
 >> ARIA-AVX.
 >> Many of the ideas in this implementation are from Camellia-avx,
 >> especially byte slicing.
 >> Like Camellia, ARIA also uses a 16way strategy.
 >>
 >> ARIA cipher algorithm is similar to AES.
 >> There are four s-boxes in the ARIA spec and the first and second s-boxes
 >> are the same as AES's s-boxes.
 >> Almost all functions are based on the aria-generic code, except for the
 >> s-box related functions.
 >> The aria-avx implementation doesn't implement the key expansion function;
 >> it only supports encrypt() and decrypt().
 >>
 >> Encryption and decryption logic is actually the same, but each uses its
 >> own key (encryption key and decryption key).
 >> The en/decryption steps are as follows:
 >> 1. Add-Round-Key
 >> 2. S-box.
 >> 3. Diffusion Layer.
 >>
 >> There is nothing special in the Add-Round-Key step.
 >>
 >> There are some notable things in the s-box step.
 >> Like Camellia, it doesn't use a lookup table; instead, it uses AES-NI.
 >>
 >> To calculate the first s-box, it just uses aesenclast and then
 >> inverts shift_row. No further processing is needed because the
 >> first s-box is the same as the AES encryption s-box.
 >>
 >> To calculate the second s-box (the inverse of s1), it just uses aesdeclast
 >> and then inverts shift_row. No further processing is needed
 >> because the second s-box is the same as the AES decryption s-box.
 >>
 >> To calculate the third and fourth s-boxes, it uses aesenclast,
 >> then inverts shift_row, and applies an affine transformation.
 >>
 >> The aria-generic implementation is based on a 32-bit implementation,
 >> not an 8-bit implementation.
 >> The aria-avx Diffusion Layer implementation is based on the aria-generic
 >> implementation because the 8-bit implementation does not suit a parallel
 >> implementation, while the 32-bit one does.
 >>
 >> The first patch in this series exports functions for aria-avx.
 >> The aria-avx code uses existing functions in the aria-generic code.
 >> The second patch implements aria-avx.
 >> The last patch adds an async test for aria.
 >>
 >> Benchmarks:
 >> The tcrypt is used.
 >> cpu: i3-12100
 >
 > This CPU also supports Galois Field New Instructions (GFNI) which are
 > even better suited for accelerating ciphers that use same building
 > blocks as AES. For example, I've recently implemented camellia using
 > GFNI for libgcrypt [1].
 >
 > I quickly hacked GFNI into your implementation and it gives a nice extra
 > bit of performance (~55% faster on Intel tiger-lake). Here's the GFNI
 > version of 'aria_sbox_8way' that I used:
 >
 > /////////////////////////////////////////////////////////
 > #define aria_sbox_8way(x0, x1, x2, x3,                  \
 >                         x4, x5, x6, x7,                  \
 >                         t0, t1, t2, t3,                  \
 >                         t4, t5, t6, t7)                  \
 >          vpbroadcastq .Ltf_s2_bitmatrix, t0;             \
 >          vpbroadcastq .Ltf_inv_bitmatrix, t1;            \
 >          vpbroadcastq .Ltf_id_bitmatrix, t2;             \
 >          vpbroadcastq .Ltf_aff_bitmatrix, t3;            \
 >          vpbroadcastq .Ltf_x2_bitmatrix, t4;             \
 >          vgf2p8affineinvqb $(tf_s2_const), t0, x1, x1;   \
 >          vgf2p8affineinvqb $(tf_s2_const), t0, x5, x5;   \
 >          vgf2p8affineqb $(tf_inv_const), t1, x2, x2;     \
 >          vgf2p8affineqb $(tf_inv_const), t1, x6, x6;     \
 >          vgf2p8affineinvqb $0, t2, x2, x2;               \
 >          vgf2p8affineinvqb $0, t2, x6, x6;               \
 >          vgf2p8affineinvqb $(tf_aff_const), t3, x0, x0;  \
 >          vgf2p8affineinvqb $(tf_aff_const), t3, x4, x4;  \
 >          vgf2p8affineqb $(tf_x2_const), t4, x3, x3;      \
 >          vgf2p8affineqb $(tf_x2_const), t4, x7, x7;      \
 >          vgf2p8affineinvqb $0, t2, x3, x3;               \
 >          vgf2p8affineinvqb $0, t2, x7, x7;
 >
 > #define BV8(a0,a1,a2,a3,a4,a5,a6,a7) \
 >          ( (((a0) & 1) << 0) | \
 >            (((a1) & 1) << 1) | \
 >            (((a2) & 1) << 2) | \
 >            (((a3) & 1) << 3) | \
 >            (((a4) & 1) << 4) | \
 >            (((a5) & 1) << 5) | \
 >            (((a6) & 1) << 6) | \
 >            (((a7) & 1) << 7) )
 >
 > #define BM8X8(l0,l1,l2,l3,l4,l5,l6,l7) \
 >          ( ((l7) << (0 * 8)) | \
 >            ((l6) << (1 * 8)) | \
 >            ((l5) << (2 * 8)) | \
 >            ((l4) << (3 * 8)) | \
 >            ((l3) << (4 * 8)) | \
 >            ((l2) << (5 * 8)) | \
 >            ((l1) << (6 * 8)) | \
 >            ((l0) << (7 * 8)) )
 >
 > /* AES affine: */
 > #define tf_aff_const BV8(1, 1, 0, 0, 0, 1, 1, 0)
 > .Ltf_aff_bitmatrix:
 >          .quad BM8X8(BV8(1, 0, 0, 0, 1, 1, 1, 1),
 >                      BV8(1, 1, 0, 0, 0, 1, 1, 1),
 >                      BV8(1, 1, 1, 0, 0, 0, 1, 1),
 >                      BV8(1, 1, 1, 1, 0, 0, 0, 1),
 >                      BV8(1, 1, 1, 1, 1, 0, 0, 0),
 >                      BV8(0, 1, 1, 1, 1, 1, 0, 0),
 >                      BV8(0, 0, 1, 1, 1, 1, 1, 0),
 >                      BV8(0, 0, 0, 1, 1, 1, 1, 1))
 >
 > /* AES inverse affine: */
 > #define tf_inv_const BV8(1, 0, 1, 0, 0, 0, 0, 0)
 > .Ltf_inv_bitmatrix:
 >          .quad BM8X8(BV8(0, 0, 1, 0, 0, 1, 0, 1),
 >                      BV8(1, 0, 0, 1, 0, 0, 1, 0),
 >                      BV8(0, 1, 0, 0, 1, 0, 0, 1),
 >                      BV8(1, 0, 1, 0, 0, 1, 0, 0),
 >                      BV8(0, 1, 0, 1, 0, 0, 1, 0),
 >                      BV8(0, 0, 1, 0, 1, 0, 0, 1),
 >                      BV8(1, 0, 0, 1, 0, 1, 0, 0),
 >                      BV8(0, 1, 0, 0, 1, 0, 1, 0))
 >
 > /* S2: */
 > #define tf_s2_const BV8(0, 1, 0, 0, 0, 1, 1, 1)
 > .Ltf_s2_bitmatrix:
 >          .quad BM8X8(BV8(0, 1, 0, 1, 0, 1, 1, 1),
 >                      BV8(0, 0, 1, 1, 1, 1, 1, 1),
 >                      BV8(1, 1, 1, 0, 1, 1, 0, 1),
 >                      BV8(1, 1, 0, 0, 0, 0, 1, 1),
 >                      BV8(0, 1, 0, 0, 0, 0, 1, 1),
 >                      BV8(1, 1, 0, 0, 1, 1, 1, 0),
 >                      BV8(0, 1, 1, 0, 0, 0, 1, 1),
 >                      BV8(1, 1, 1, 1, 0, 1, 1, 0))
 >
 > /* X2: */
 > #define tf_x2_const BV8(0, 0, 1, 1, 0, 1, 0, 0)
 > .Ltf_x2_bitmatrix:
 >          .quad BM8X8(BV8(0, 0, 0, 1, 1, 0, 0, 0),
 >                      BV8(0, 0, 1, 0, 0, 1, 1, 0),
 >                      BV8(0, 0, 0, 0, 1, 0, 1, 0),
 >                      BV8(1, 1, 1, 0, 0, 0, 1, 1),
 >                      BV8(1, 1, 1, 0, 1, 1, 0, 0),
 >                      BV8(0, 1, 1, 0, 1, 0, 1, 1),
 >                      BV8(1, 0, 1, 1, 1, 1, 0, 1),
 >                      BV8(1, 0, 0, 1, 0, 0, 1, 1))
 >
 > /* Identity matrix: */
 > .Ltf_id_bitmatrix:
 >          .quad BM8X8(BV8(1, 0, 0, 0, 0, 0, 0, 0),
 >                      BV8(0, 1, 0, 0, 0, 0, 0, 0),
 >                      BV8(0, 0, 1, 0, 0, 0, 0, 0),
 >                      BV8(0, 0, 0, 1, 0, 0, 0, 0),
 >                      BV8(0, 0, 0, 0, 1, 0, 0, 0),
 >                      BV8(0, 0, 0, 0, 0, 1, 0, 0),
 >                      BV8(0, 0, 0, 0, 0, 0, 1, 0),
 >                      BV8(0, 0, 0, 0, 0, 0, 0, 1))
 > /////////////////////////////////////////////////////////
 >
 > GFNI also allows easy use of 256-bit vector registers, so
 > there is a way to get an additional 2x speed increase (but it
 > requires doubling the number of blocks processed in parallel).
 >
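To make the constants in the quoted macro easier to check, here is a
minimal C sketch that rebuilds two of the bit matrices with the same
BV8/BM8X8 layout and models one byte of vgf2p8affineqb /
vgf2p8affineinvqb in software. It is only a slow reference model, not
kernel code. The helper names (gf_mul, gf_inv, affine_byte,
aes_sbox_emulated, gfni_sketch_selftest) are hypothetical, and the
per-byte semantics are my reading of the instruction description in the
Intel SDM, so please treat them as assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit/row packing as the quoted assembler macros, with explicit
 * 64-bit casts because C would otherwise shift in plain int. */
#define BV8(a0, a1, a2, a3, a4, a5, a6, a7) \
	((uint8_t)((((a0) & 1) << 0) | (((a1) & 1) << 1) | \
		   (((a2) & 1) << 2) | (((a3) & 1) << 3) | \
		   (((a4) & 1) << 4) | (((a5) & 1) << 5) | \
		   (((a6) & 1) << 6) | (((a7) & 1) << 7)))

#define BM8X8(l0, l1, l2, l3, l4, l5, l6, l7) \
	(((uint64_t)(l7) << (0 * 8)) | ((uint64_t)(l6) << (1 * 8)) | \
	 ((uint64_t)(l5) << (2 * 8)) | ((uint64_t)(l4) << (3 * 8)) | \
	 ((uint64_t)(l3) << (4 * 8)) | ((uint64_t)(l2) << (5 * 8)) | \
	 ((uint64_t)(l1) << (6 * 8)) | ((uint64_t)(l0) << (7 * 8)))

/* AES affine matrix and constant, copied from the quoted code. */
static const uint8_t tf_aff_const = BV8(1, 1, 0, 0, 0, 1, 1, 0);
static const uint64_t tf_aff_bitmatrix =
	BM8X8(BV8(1, 0, 0, 0, 1, 1, 1, 1), BV8(1, 1, 0, 0, 0, 1, 1, 1),
	      BV8(1, 1, 1, 0, 0, 0, 1, 1), BV8(1, 1, 1, 1, 0, 0, 0, 1),
	      BV8(1, 1, 1, 1, 1, 0, 0, 0), BV8(0, 1, 1, 1, 1, 1, 0, 0),
	      BV8(0, 0, 1, 1, 1, 1, 1, 0), BV8(0, 0, 0, 1, 1, 1, 1, 1));

/* Identity matrix, copied from the quoted code. */
static const uint64_t tf_id_bitmatrix =
	BM8X8(BV8(1, 0, 0, 0, 0, 0, 0, 0), BV8(0, 1, 0, 0, 0, 0, 0, 0),
	      BV8(0, 0, 1, 0, 0, 0, 0, 0), BV8(0, 0, 0, 1, 0, 0, 0, 0),
	      BV8(0, 0, 0, 0, 1, 0, 0, 0), BV8(0, 0, 0, 0, 0, 1, 0, 0),
	      BV8(0, 0, 0, 0, 0, 0, 1, 0), BV8(0, 0, 0, 0, 0, 0, 0, 1));

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (the AES polynomial). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0));
		b >>= 1;
	}
	return r;
}

/* Inverse as x^254; 0 maps to 0, as in the AES s-box definition. */
static uint8_t gf_inv(uint8_t x)
{
	uint8_t r = 1;

	for (int i = 0; i < 254; i++)
		r = gf_mul(r, x);
	return r;
}

/* One byte of the affine step: output bit i is the parity of
 * (matrix byte 7-i AND x), XORed with bit i of the immediate. */
static uint8_t affine_byte(uint64_t m, uint8_t x, uint8_t imm)
{
	uint8_t out = 0;

	for (int i = 0; i < 8; i++) {
		uint8_t p = (uint8_t)(m >> (8 * (7 - i))) & x;

		p ^= p >> 4;
		p ^= p >> 2;
		p ^= p >> 1;
		out |= (uint8_t)(((p ^ (imm >> i)) & 1) << i);
	}
	return out;
}

/* Model of the "inv" form with the AES affine: invert, then transform. */
static uint8_t aes_sbox_emulated(uint8_t x)
{
	return affine_byte(tf_aff_bitmatrix, gf_inv(x), tf_aff_const);
}

/* Small self-check against well-known AES s-box entries. */
void gfni_sketch_selftest(void)
{
	assert(aes_sbox_emulated(0x00) == 0x63);
	assert(aes_sbox_emulated(0x01) == 0x7c);
	assert(affine_byte(tf_id_bitmatrix, 0xa5, 0) == 0xa5);
}
```

With this model, the inverse-then-affine step with .Ltf_aff_bitmatrix
reproduces AES s-box entries, and .Ltf_id_bitmatrix with a zero
constant leaves a byte unchanged, which is why the macro can use it
with vgf2p8affineinvqb to get a plain field inversion.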

I confirmed that my i3-12100 supports GFNI and then benchmarked this
implementation.
It works very well and the performance gain is amazing!
Before:
128bit, 4096 bytes: 14758 cycles
After:
128bit, 4096 bytes: 9404 cycles
That is about 1.57x faster (roughly 36% fewer cycles).

I will add this implementation to the v3 patch.
Like your code [1], I will also add a check so that the GFNI path is
used only when the CPU supports GFNI.
This will be very helpful for the AVX2 and AVX-512 implementations too.
I expect the AVX2 implementation to use 256-bit vector registers, so,
as you mentioned, the potential performance gain there is also large.
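
As a rough illustration of the GFNI condition, here is a minimal
userspace C sketch that tests the GFNI feature bit, CPUID leaf 7,
subleaf 0, ECX bit 8. The function and macro names (ecx_has_gfni,
cpu_has_gfni, CPUID7_ECX_GFNI) are hypothetical and it relies on the
compiler-provided <cpuid.h>; in the kernel I would expect to use
boot_cpu_has(X86_FEATURE_GFNI) instead, so this is only a sketch of the
idea, not the v3 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/Clang helper for the CPUID instruction */
#endif

/* GFNI is reported in CPUID.(EAX=7,ECX=0):ECX bit 8. */
#define CPUID7_ECX_GFNI (1u << 8)
static_assert(CPUID7_ECX_GFNI == 0x100, "GFNI is ECX bit 8");

/* Pure helper so the bit test can be checked without real hardware. */
static bool ecx_has_gfni(uint32_t ecx)
{
	return (ecx & CPUID7_ECX_GFNI) != 0;
}

/* Returns true only when the running CPU advertises GFNI. */
bool cpu_has_gfni(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid_count() returns 0 if leaf 7 is not available. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return ecx_has_gfni(ecx);
#else
	return false;	/* not an x86 build */
#endif
}
```

The quoted macro uses the VEX-encoded forms on vector registers, so a
real check would also need to confirm AVX support in addition to this
bit.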

Thank you so much again for your great work!

Taehee Yoo

[1]
https://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=blob;f=cipher/camellia-aesni-avx2-amd64.h#l80 


 > -Jussi