From: Mathias Krause
Subject: Re: [PATCH v2 1/1] crypto: AES CTR x86_64 "by8" AVX optimization
Date: Wed, 4 Jun 2014 08:53:23 +0200
Message-ID: <20140604065323.GA22521@jig.fritz.box>
References: <1401842474.2367.180.camel@pegasus.jf.intel.com>
In-Reply-To: <1401842474.2367.180.camel@pegasus.jf.intel.com>
To: chandramouli narayanan
Cc: Herbert Xu, "H. Peter Anvin", "David S.Miller", Wajdi Feghali, Tim Chen, Erdinc Ozturk, Aidan O'Mahony, Adrian Hoban, James Guilford, Gabriele Paoloni, "Tadeusz Struk,Huang Ying", Vinodh Gopal, linux-crypto@vger.kernel.org

On Tue, Jun 03, 2014 at 05:41:14PM -0700, chandramouli narayanan wrote:
> This patch introduces "by8" AES CTR mode AVX optimization inspired by
> Intel Optimized IPSEC Cryptograhpic library. For additional information,
> please see:
> http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972
>
> The functions aes_ctr_enc_128_avx_by8(), aes_ctr_enc_192_avx_by8() and
> aes_ctr_enc_256_avx_by8() are adapted from
> Intel Optimized IPSEC Cryptographic library. When both AES and AVX features
> are enabled in a platform, the glue code in AESNI module overrieds the
> existing "by4" CTR mode en/decryption with the "by8"
> AES CTR mode en/decryption.
>
> On a Haswell desktop, with turbo disabled and all cpus running
> at maximum frequency, the "by8" CTR mode optimization
> shows better performance results across data & key sizes
> as measured by tcrypt.
>
> The average performance improvement of the "by8" version over the "by4"
> version is as follows:
>
> For 128 bit key and data sizes >= 256 bytes, there is a 10-16% improvement.
> For 192 bit key and data sizes >= 256 bytes, there is a 20-22% improvement.
> For 256 bit key and data sizes >= 256 bytes, there is a 20-25% improvement.

Nice improvement :)

How does it perform on older processors that do have a penalty for
unaligned loads (vmovdqu), e.g. SandyBridge? If those perform worse it
might be wise to extend the CPU feature test in the glue code by a model
test to enable the "by8" variant only for Haswell and newer processors
that don't have such a penalty.

>
> A typical run of tcrypt with AES CTR mode encryption of the "by4" and "by8"
> optimization shows the following results:
>
> tcrypt with "by4" AES CTR mode encryption optimization on a Haswell Desktop:
> ---------------------------------------------------------------------------
>
> testing speed of __ctr-aes-aesni encryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 343 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 336 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 491 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 1130 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 7309 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 346 cycles (16 bytes)
> test 6 (192 bit key, 64 byte blocks): 1 operation in 361 cycles (64 bytes)
> test 7 (192 bit key, 256 byte blocks): 1 operation in 543 cycles (256 bytes)
> test 8 (192 bit key, 1024 byte blocks): 1 operation in 1321 cycles (1024 bytes)
> test 9 (192 bit key, 8192 byte blocks): 1 operation in 9649 cycles (8192 bytes)
> test 10 (256 bit key, 16 byte blocks): 1 operation in 369 cycles (16 bytes)
> test 11 (256 bit key, 64 byte blocks): 1 operation in 366 cycles (64 bytes)
> test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
> test 13 (256 bit key, 1024 byte blocks): 1 operation in 1531 cycles (1024 bytes)
> test 14 (256 bit key, 8192 byte blocks): 1 operation in 10522 cycles (8192 bytes)
>
> testing speed of __ctr-aes-aesni decryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 336 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 350 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 487 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 1129 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 7287 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 350 cycles (16 bytes)
> test 6 (192 bit key, 64 byte blocks): 1 operation in 359 cycles (64 bytes)
> test 7 (192 bit key, 256 byte blocks): 1 operation in 635 cycles (256 bytes)
> test 8 (192 bit key, 1024 byte blocks): 1 operation in 1324 cycles (1024 bytes)
> test 9 (192 bit key, 8192 byte blocks): 1 operation in 9595 cycles (8192 bytes)
> test 10 (256 bit key, 16 byte blocks): 1 operation in 364 cycles (16 bytes)
> test 11 (256 bit key, 64 byte blocks): 1 operation in 377 cycles (64 bytes)
> test 12 (256 bit key, 256 byte blocks): 1 operation in 604 cycles (256 bytes)
> test 13 (256 bit key, 1024 byte blocks): 1 operation in 1527 cycles (1024 bytes)
> test 14 (256 bit key, 8192 byte blocks): 1 operation in 10549 cycles (8192 bytes)
>
> tcrypt with "by8" AES CTR mode encryption optimization on a Haswell Desktop:
> ---------------------------------------------------------------------------
>
> testing speed of __ctr-aes-aesni encryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 340 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 330 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 450 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 1043 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 6597 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 339 cycles (16 bytes)
> test 6 (192 bit key, 64 byte blocks): 1 operation in 352 cycles (64 bytes)
> test 7 (192 bit key, 256 byte blocks): 1 operation in 539 cycles (256 bytes)
> test 8 (192 bit key, 1024 byte blocks): 1 operation in 1153 cycles (1024 bytes)
> test 9 (192 bit key, 8192 byte blocks): 1 operation in 8458 cycles (8192 bytes)
> test 10 (256 bit key, 16 byte blocks): 1 operation in 353 cycles (16 bytes)
> test 11 (256 bit key, 64 byte blocks): 1 operation in 360 cycles (64 bytes)
> test 12 (256 bit key, 256 byte blocks): 1 operation in 512 cycles (256 bytes)
> test 13 (256 bit key, 1024 byte blocks): 1 operation in 1277 cycles (1024 bytes)
> test 14 (256 bit key, 8192 byte blocks): 1 operation in 8745 cycles (8192 bytes)
>
> testing speed of __ctr-aes-aesni decryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 348 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 335 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 451 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 1030 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 6611 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 354 cycles (16 bytes)
> test 6 (192 bit key, 64 byte blocks): 1 operation in 346 cycles (64 bytes)
> test 7 (192 bit key, 256 byte blocks): 1 operation in 488 cycles (256 bytes)
> test 8 (192 bit key, 1024 byte blocks): 1 operation in 1154 cycles (1024 bytes)
> test 9 (192 bit key, 8192 byte blocks): 1 operation in 8390 cycles (8192 bytes)
> test 10 (256 bit key, 16 byte blocks): 1 operation in 357 cycles (16 bytes)
> test 11 (256 bit key, 64 byte blocks): 1 operation in 362 cycles (64 bytes)
> test 12 (256 bit key, 256 byte blocks): 1 operation in 515 cycles (256 bytes)
> test 13 (256 bit key, 1024 byte blocks): 1 operation in 1284 cycles (1024 bytes)
> test 14 (256 bit key, 8192 byte blocks): 1 operation in 8681 cycles (8192 bytes)
>
> Signed-off-by: Chandramouli Narayanan
> ---
> arch/x86/crypto/Makefile | 2 +-
> arch/x86/crypto/aes_ctrby8_avx-x86_64.S | 545 ++++++++++++++++++++++++++++++++
> arch/x86/crypto/aesni-intel_glue.c | 41 ++-
> 3 files changed, 586 insertions(+), 2 deletions(-)
> create mode 100644 arch/x86/crypto/aes_ctrby8_avx-x86_64.S
>
> diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
> index 61d6e28..f6fe1e2 100644
> --- a/arch/x86/crypto/Makefile
> +++ b/arch/x86/crypto/Makefile
> @@ -76,7 +76,7 @@ ifeq ($(avx2_supported),yes)
> endif
>
> aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
> -aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o
> +aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
> ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
> sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
> ifeq ($(avx2_supported),yes)
> diff --git a/arch/x86/crypto/aes_ctrby8_avx-x86_64.S b/arch/x86/crypto/aes_ctrby8_avx-x86_64.S
> new file mode 100644
> index 0000000..e49595f
> --- /dev/null
> +++ b/arch/x86/crypto/aes_ctrby8_avx-x86_64.S
> @@ -0,0 +1,545 @@
> +/*
> + * Implement AES CTR mode by8 optimization with AVX instructions. (x86_64)
> + *
> + * This is AES128/192/256 CTR mode optimization implementation. It requires
> + * the support of Intel(R) AESNI and AVX instructions.
> + *
> + * This work was inspired by the AES CTR mode optimization published
> + * in Intel Optimized IPSEC Cryptograhpic library.
> + * Additional information on it can be found at:
> + * http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972
> + *
> + * This file is provided under a dual BSD/GPLv2 license. When using or
> + * redistributing this file, you may do so under either license.
> + *
> + * GPL LICENSE SUMMARY
> + *
> + * Copyright(c) 2014 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + *
> + * Contact Information:
> + * James Guilford
> + * Sean Gulley
> + * Chandramouli Narayanan
> + *
> + * BSD LICENSE
> + *
> + * Copyright(c) 2014 Intel Corporation.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + *
> + * Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + * Redistributions in binary form must reproduce the above copyright
> + * notice, this list of conditions and the following disclaimer in
> + * the documentation and/or other materials provided with the
> + * distribution.
> + * Neither the name of Intel Corporation nor the names of its
> + * contributors may be used to endorse or promote products derived
> + * from this software without specific prior written permission.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + *
> + */
> +
> +#include
> +#include
> +
> +#define CONCAT(a,b) a##b
> +#define VMOVDQ vmovdqu
> +
> +#define xdata0 %xmm0
> +#define xdata1 %xmm1
> +#define xdata2 %xmm2
> +#define xdata3 %xmm3
> +#define xdata4 %xmm4
> +#define xdata5 %xmm5
> +#define xdata6 %xmm6
> +#define xdata7 %xmm7
> +#define xcounter %xmm8
> +#define xbyteswap %xmm9
> +#define xkey0 %xmm10
> +#define xkey3 %xmm11
> +#define xkey6 %xmm12
> +#define xkey9 %xmm13
> +#define xkey4 %xmm11
> +#define xkey8 %xmm12
> +#define xkey12 %xmm13
> +#define xkeyA %xmm14
> +#define xkeyB %xmm15
> +
> +#define p_in %rdi
> +#define p_iv %rsi
> +#define p_keys %rdx
> +#define p_out %rcx
> +#define num_bytes %r8
> +
> +#define tmp %r10
> +#define DDQ(i) CONCAT(ddq_add_,i)
> +#define XMM(i) CONCAT(%xmm, i)
> +#define DDQ_DATA 0
> +#define XDATA 1
> +#define KEY_128 1
> +#define KEY_192 2
> +#define KEY_256 3
> +
> +.section .data

.section .rodata, as already mentioned by hpa.

> +.align 16
> +
> +byteswap_const:
> + .octa 0x000102030405060708090A0B0C0D0E0F
> +ddq_add_1:
> + .octa 0x00000000000000000000000000000001
> +ddq_add_2:
> + .octa 0x00000000000000000000000000000002
> +ddq_add_3:
> + .octa 0x00000000000000000000000000000003
> +ddq_add_4:
> + .octa 0x00000000000000000000000000000004
> +ddq_add_5:
> + .octa 0x00000000000000000000000000000005
> +ddq_add_6:
> + .octa 0x00000000000000000000000000000006
> +ddq_add_7:
> + .octa 0x00000000000000000000000000000007
> +ddq_add_8:
> + .octa 0x00000000000000000000000000000008
> +
> +.text
> +
> +/* generate a unique variable for ddq_add_x */
> +
> +.macro setddq n
> + var_ddq_add = DDQ(\n)
> +.endm
> +
> +/* generate a unique variable for xmm register */
> +.macro setxdata n
> + var_xdata = XMM(\n)
> +.endm
> +
> +/* club the numeric 'id' to the symbol 'name' */
> +
> +.macro club name, id
> +.altmacro
> + .if \name == DDQ_DATA
> + setddq %\id
> + .elseif \name == XDATA
> + setxdata %\id
> + .endif
> +.noaltmacro
> +.endm
> +
> +/*
> + * do_aes num_in_par load_keys key_len
> + * This increments p_in, but not p_out
> + */
> +.macro do_aes b, k, key_len
> + .set by, \b
> + .set load_keys, \k
> + .set klen, \key_len
> +
> + .if (load_keys)
> + vmovdqa 0*16(p_keys), xkey0
> + .endif
> +
> + vpshufb xbyteswap, xcounter, xdata0
> +
> + .set i, 1
> + .rept (by - 1)
> + club DDQ_DATA, i
> + club XDATA, i
> + vpaddd var_ddq_add(%rip), xcounter, var_xdata
> + vpshufb xbyteswap, var_xdata, var_xdata
> + .set i, (i +1)
> + .endr
> +
> + vmovdqa 1*16(p_keys), xkeyA
> +
> + vpxor xkey0, xdata0, xdata0
> + club DDQ_DATA, by
> + vpaddd var_ddq_add(%rip), xcounter, xcounter
> +
> + .set i, 1
> + .rept (by - 1)
> + club XDATA, i
> + vpxor xkey0, var_xdata, var_xdata
> + .set i, (i +1)
> + .endr
> +
> + vmovdqa 2*16(p_keys), xkeyB
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + vaesenc xkeyA, var_xdata, var_xdata /* key 1 */
> + .set i, (i +1)
> + .endr
> +
> + .if (klen == KEY_128)
> + .if (load_keys)
> + vmovdqa 3*16(p_keys), xkeyA
> + .endif
> + .else
> + vmovdqa 3*16(p_keys), xkeyA
> + .endif
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + vaesenc xkeyB, var_xdata, var_xdata /* key 2 */
> + .set i, (i +1)
> + .endr
> +
> + add $(16*by), p_in
> +
> + .if (klen == KEY_128)
> + vmovdqa 4*16(p_keys), xkey4
> + .else
> + .if (load_keys)
> + vmovdqa 4*16(p_keys), xkey4
> + .endif
> + .endif
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + vaesenc xkeyA, var_xdata, var_xdata /* key 3 */
> + .set i, (i +1)
> + .endr
> +
> + vmovdqa 5*16(p_keys), xkeyA
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + vaesenc xkey4, var_xdata, var_xdata /* key 4 */
> + .set i, (i +1)
> + .endr
> +
> + .if (klen == KEY_128)
> + .if (load_keys)
> + vmovdqa 6*16(p_keys), xkeyB
> + .endif
> + .else
> + vmovdqa 6*16(p_keys), xkeyB
> + .endif
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + vaesenc xkeyA, var_xdata, var_xdata /* key 5 */
> + .set i, (i +1)
> + .endr
> +
> + vmovdqa 7*16(p_keys), xkeyA
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + vaesenc xkeyB, var_xdata, var_xdata /* key 6 */
> + .set i, (i +1)
> + .endr
> +
> + .if (klen == KEY_128)
> + vmovdqa 8*16(p_keys), xkey8
> + .else
> + .if (load_keys)
> + vmovdqa 8*16(p_keys), xkey8
> + .endif
> + .endif
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + vaesenc xkeyA, var_xdata, var_xdata /* key 7 */
> + .set i, (i +1)
> + .endr
> +
> + .if (klen == KEY_128)
> + .if (load_keys)
> + vmovdqa 9*16(p_keys), xkeyA
> + .endif
> + .else
> + vmovdqa 9*16(p_keys), xkeyA
> + .endif
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + vaesenc xkey8, var_xdata, var_xdata /* key 8 */
> + .set i, (i +1)
> + .endr
> +
> + vmovdqa 10*16(p_keys), xkeyB
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + vaesenc xkeyA, var_xdata, var_xdata /* key 9 */
> + .set i, (i +1)
> + .endr
> +
> + .if (klen != KEY_128)
> + vmovdqa 11*16(p_keys), xkeyA
> + .endif
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + /* key 10 */
> + .if (klen == KEY_128)
> + vaesenclast xkeyB, var_xdata, var_xdata
> + .else
> + vaesenc xkeyB, var_xdata, var_xdata
> + .endif
> + .set i, (i +1)
> + .endr
> +
> + .if (klen != KEY_128)
> + .if (load_keys)
> + vmovdqa 12*16(p_keys), xkey12
> + .endif
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + vaesenc xkeyA, var_xdata, var_xdata /* key 11 */
> + .set i, (i +1)
> + .endr
> +
> + .if (klen == KEY_256)
> + vmovdqa 13*16(p_keys), xkeyA
> + .endif
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + .if (klen == KEY_256)
> + /* key 12 */
> + vaesenc xkey12, var_xdata, var_xdata
> + .else
> + vaesenclast xkey12, var_xdata, var_xdata
> + .endif
> + .set i, (i +1)
> + .endr
> +
> + .if (klen == KEY_256)
> + vmovdqa 14*16(p_keys), xkeyB
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + /* key 13 */
> + vaesenc xkeyA, var_xdata, var_xdata
> + .set i, (i +1)
> + .endr
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + /* key 14 */
> + vaesenclast xkeyB, var_xdata, var_xdata
> + .set i, (i +1)
> + .endr
> + .endif
> + .endif
> +
> + .set i, 0
> + .rept (by / 2)
> + .set j, (i+1)
> + VMOVDQ (i*16 - 16*by)(p_in), xkeyA
> + VMOVDQ (j*16 - 16*by)(p_in), xkeyB
> + club XDATA, i
> + vpxor xkeyA, var_xdata, var_xdata
> + club XDATA, j
> + vpxor xkeyB, var_xdata, var_xdata
> + .set i, (i+2)
> + .endr
> +
> + .if (i < by)
> + VMOVDQ (i*16 - 16*by)(p_in), xkeyA
> + club XDATA, i
> + vpxor xkeyA, var_xdata, var_xdata
> + .endif
> +
> + .set i, 0
> + .rept by
> + club XDATA, i
> + VMOVDQ var_xdata, i*16(p_out)
> + .set i, (i+1)
> + .endr
> +.endm
> +
> +.macro do_aes_load val, key_len
> + do_aes \val, 1, \key_len
> +.endm
> +
> +.macro do_aes_noload val, key_len
> + do_aes \val, 0, \key_len
> +.endm
> +
> +/* main body of aes ctr load */
> +
> +.macro do_aes_ctrmain key_len
> +
> + cmp $16, num_bytes
> + jb .Ldo_return2\key_len
> +
> + vmovdqa byteswap_const(%rip), xbyteswap
> + vmovdqu (p_iv), xcounter
> + vpshufb xbyteswap, xcounter, xcounter
> +
> + mov num_bytes, tmp
> + and $(7*16), tmp
> + jz .Lmult_of_8_blks\key_len
> +
> + /* 1 <= tmp <= 7 */
> + cmp $(4*16), tmp
> + jg .Lgt4\key_len
> + je .Leq4\key_len
> +
> +.Llt4\key_len:
> + cmp $(2*16), tmp
> + jg .Leq3\key_len
> + je .Leq2\key_len
> +
> +.Leq1\key_len:
> + do_aes_load 1, \key_len
> + add $(1*16), p_out
> + and $(~7*16), num_bytes
> + jz .Ldo_return2\key_len
> + jmp .Lmain_loop2\key_len
> +
> +.Leq2\key_len:
> + do_aes_load 2, \key_len
> + add $(2*16), p_out
> + and $(~7*16), num_bytes
> + jz .Ldo_return2\key_len
> + jmp .Lmain_loop2\key_len
> +
> +
> +.Leq3\key_len:
> + do_aes_load 3, \key_len
> + add $(3*16), p_out
> + and $(~7*16), num_bytes
> + jz .Ldo_return2\key_len
> + jmp .Lmain_loop2\key_len
> +
> +.Leq4\key_len:
> + do_aes_load 4, \key_len
> + add $(4*16), p_out
> + and $(~7*16), num_bytes
> + jz .Ldo_return2\key_len
> + jmp .Lmain_loop2\key_len
> +
> +.Lgt4\key_len:
> + cmp $(6*16), tmp
> + jg .Leq7\key_len
> + je .Leq6\key_len
> +
> +.Leq5\key_len:
> + do_aes_load 5, \key_len
> + add $(5*16), p_out
> + and $(~7*16), num_bytes
> + jz .Ldo_return2\key_len
> + jmp .Lmain_loop2\key_len
> +
> +.Leq6\key_len:
> + do_aes_load 6, \key_len
> + add $(6*16), p_out
> + and $(~7*16), num_bytes
> + jz .Ldo_return2\key_len
> + jmp .Lmain_loop2\key_len
> +
> +.Leq7\key_len:
> + do_aes_load 7, \key_len
> + add $(7*16), p_out
> + and $(~7*16), num_bytes
> + jz .Ldo_return2\key_len
> + jmp .Lmain_loop2\key_len
> +
> +.Lmult_of_8_blks\key_len:
> + .if (\key_len != KEY_128)
> + vmovdqa 0*16(p_keys), xkey0
> + vmovdqa 4*16(p_keys), xkey4
> + vmovdqa 8*16(p_keys), xkey8
> + vmovdqa 12*16(p_keys), xkey12
> + .else
> + vmovdqa 0*16(p_keys), xkey0
> + vmovdqa 3*16(p_keys), xkey4
> + vmovdqa 6*16(p_keys), xkey8
> + vmovdqa 9*16(p_keys), xkey12
> + .endif

You might want to align the main loop, e.g. add '.align 4' or even
'.align 16' here.

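For anyone following along, this is roughly how I read the head/main-loop
split in do_aes_ctrmain, modelled in C. It is only a sketch: the struct and
function names are invented for illustration and are not part of the patch,
and it assumes num_bytes is a multiple of 16 once it is at least one block,
which matches ctr_crypt() passing nbytes & AES_BLOCK_MASK.

```c
#include <assert.h>
#include <stddef.h>

#define AES_BLOCK_SIZE 16u

/* Hypothetical helper type, not in the patch: how one request is split. */
struct by8_split {
	size_t head_blocks;	/* 0..7 blocks done by "do_aes_load N" */
	size_t main_iters;	/* iterations of the 8-block main loop */
};

static struct by8_split by8_split_len(size_t num_bytes)
{
	struct by8_split s = { 0, 0 };

	/* Mirrors "cmp $16, num_bytes; jb .Ldo_return2": nothing to do. */
	if (num_bytes < AES_BLOCK_SIZE)
		return s;

	/* Assumed caller contract: only whole blocks are passed in. */
	assert(num_bytes % AES_BLOCK_SIZE == 0);

	/* Mirrors "and $(7*16), tmp": blocks beyond a multiple of eight. */
	s.head_blocks = (num_bytes & (7 * AES_BLOCK_SIZE)) / AES_BLOCK_SIZE;

	/* Mirrors "and $(~7*16), num_bytes": keep multiples of 128 bytes. */
	s.main_iters = (num_bytes & ~(size_t)(8 * AES_BLOCK_SIZE - 1)) /
		       (8 * AES_BLOCK_SIZE);

	return s;
}
```

So a 240 byte request (15 blocks) would take one 7-block head pass followed
by a single trip through the main loop, and only the main loop label is hot
enough for the alignment to matter.
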
> +.Lmain_loop2\key_len:
> + /* num_bytes is a multiple of 8 and >0 */
> + do_aes_noload 8, \key_len
> + add $(8*16), p_out
> + sub $(8*16), num_bytes
> + jne .Lmain_loop2\key_len
> +
> +.Ldo_return2\key_len:
> + /* return updated IV */
> + vpshufb xbyteswap, xcounter, xcounter
> + vmovdqu xcounter, (p_iv)
> + ret
> +.endm
> +
> +/*
> + * routine to do AES128 CTR enc/decrypt "by8"
> + * XMM registers are clobbered.
> + * Saving/restoring must be done at a higher level
> + * aes_ctr_enc_128_avx_by8(void *in, void *iv, void *keys, void *out,
> + * unsigned int num_bytes)
> + */
> +ENTRY(aes_ctr_enc_128_avx_by8)
> + /* call the aes main loop */
> + do_aes_ctrmain KEY_128
> +
> +ENDPROC(aes_ctr_enc_128_avx_by8)
> +
> +/*
> + * routine to do AES192 CTR enc/decrypt "by8"
> + * XMM registers are clobbered.
> + * Saving/restoring must be done at a higher level
> + * aes_ctr_enc_192_avx_by8(void *in, void *iv, void *keys, void *out,
> + * unsigned int num_bytes)
> + */
> +ENTRY(aes_ctr_enc_192_avx_by8)
> + /* call the aes main loop */
> + do_aes_ctrmain KEY_192
> +
> +ENDPROC(aes_ctr_enc_192_avx_by8)
> +
> +/*
> + * routine to do AES256 CTR enc/decrypt "by8"
> + * XMM registers are clobbered.
> + * Saving/restoring must be done at a higher level
> + * aes_ctr_enc_256_avx_by8(void *in, void *iv, void *keys, void *out,
> + * unsigned int num_bytes)
> + */
> +ENTRY(aes_ctr_enc_256_avx_by8)
> + /* call the aes main loop */
> + do_aes_ctrmain KEY_256
> +
> +ENDPROC(aes_ctr_enc_256_avx_by8)
> diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
> index 948ad0e..b06e20f 100644
> --- a/arch/x86/crypto/aesni-intel_glue.c
> +++ b/arch/x86/crypto/aesni-intel_glue.c
> @@ -105,6 +105,9 @@ void crypto_fpu_exit(void);
> #define AVX_GEN4_OPTSIZE 4096
>
> #ifdef CONFIG_X86_64
> +
> +static void (*aesni_ctr_enc_tfm)(struct crypto_aes_ctx *ctx, u8 *out,
> + const u8 *in, unsigned int len, u8 *iv);
> asmlinkage void aesni_ctr_enc(struct crypto_aes_ctx *ctx, u8 *out,
> const u8 *in, unsigned int len, u8 *iv);
>
> @@ -154,6 +157,15 @@ asmlinkage void aesni_gcm_dec(void *ctx, u8 *out,
> u8 *auth_tag, unsigned long auth_tag_len);
>
>
> +#if defined(CONFIG_AS_AVX)
> +asmlinkage void aes_ctr_enc_128_avx_by8(const u8 *in, u8 *iv,
> + void *keys, u8 *out, unsigned int num_bytes);
> +asmlinkage void aes_ctr_enc_192_avx_by8(const u8 *in, u8 *iv,
> + void *keys, u8 *out, unsigned int num_bytes);
> +asmlinkage void aes_ctr_enc_256_avx_by8(const u8 *in, u8 *iv,
> + void *keys, u8 *out, unsigned int num_bytes);
> +#endif
> +

Move that code below the following #ifdef. No need to introduce yet
another ifdef of the very same symbol.

> #ifdef CONFIG_AS_AVX
> /*
> * asmlinkage void aesni_gcm_precomp_avx_gen2()
> @@ -472,6 +484,25 @@ static void ctr_crypt_final(struct crypto_aes_ctx *ctx,
> crypto_inc(ctrblk, AES_BLOCK_SIZE);
> }
>
> +#if defined(CONFIG_AS_AVX)

Please use '#ifdef CONFIG_AS_AVX' for simple preprocessor tests. That's
easier to read and makes it consistent with the rest of the code in that
file.

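To spell out why this is purely a readability point: for a single symbol
the two spellings select the same branch, and defined() is only needed once
the condition is compound. A stand-alone sketch follows; the #define of
CONFIG_AS_AVX is a stand-in for what the build system would provide, and
EXAMPLE_OTHER_SYMBOL plus all function names are made up for the example.

```c
#include <assert.h>

/* Stand-in for the build-provided symbol (assumption for this sketch). */
#define CONFIG_AS_AVX 1

/* Short form, matching the style used elsewhere in the glue code. */
#ifdef CONFIG_AS_AVX
static int short_form(void) { return 1; }
#else
static int short_form(void) { return 0; }
#endif

/* Long form: selects the same branch for a single symbol. */
#if defined(CONFIG_AS_AVX)
static int long_form(void) { return 1; }
#else
static int long_form(void) { return 0; }
#endif

/* EXAMPLE_OTHER_SYMBOL is hypothetical and deliberately left undefined. */
#if defined(CONFIG_AS_AVX) && defined(EXAMPLE_OTHER_SYMBOL)
static int compound_form(void) { return 1; }
#else
static int compound_form(void) { return 0; }
#endif

/* Both single-symbol forms agree; the compound test is the odd one out. */
static void check_forms(void)
{
	assert(short_form() == long_form());
	assert(compound_form() == 0);
}
```
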
> +static void aesni_ctr_enc_avx_tfm(struct crypto_aes_ctx *ctx, u8 *out,
> + const u8 *in, unsigned int len, u8 *iv)
> +{
> + /*
> + * based on key length, override with the by8 version
> + * of ctr mode encryption/decryption for improved performance
> + */
> + if (ctx->key_length == AES_KEYSIZE_128)
> + aes_ctr_enc_128_avx_by8(in, iv, (void *)ctx, out, len);
> + else if (ctx->key_length == AES_KEYSIZE_192)
> + aes_ctr_enc_192_avx_by8(in, iv, (void *)ctx, out, len);
> + else if (ctx->key_length == AES_KEYSIZE_256)
> + aes_ctr_enc_256_avx_by8(in, iv, (void *)ctx, out, len);
> + else
> + aesni_ctr_enc(ctx, out, in, len, iv);

How would that last case even be possible? aes_set_key_common() only
allows the above three key lengths. How would we end up here with a key
length that is not one of AES_KEYSIZE_128, AES_KEYSIZE_192 or
AES_KEYSIZE_256?

> +}
> +#endif
> +
> static int ctr_crypt(struct blkcipher_desc *desc,
> struct scatterlist *dst, struct scatterlist *src,
> unsigned int nbytes)
> @@ -486,7 +517,7 @@ static int ctr_crypt(struct blkcipher_desc *desc,
>
> kernel_fpu_begin();
> while ((nbytes = walk.nbytes) >= AES_BLOCK_SIZE) {
> - aesni_ctr_enc(ctx, walk.dst.virt.addr, walk.src.virt.addr,
> + aesni_ctr_enc_tfm(ctx, walk.dst.virt.addr, walk.src.virt.addr,
> nbytes & AES_BLOCK_MASK, walk.iv);

Nitpick, but re-indent to one space after the parenthesis, please.

> nbytes &= AES_BLOCK_SIZE - 1;
> err = blkcipher_walk_done(desc, &walk, nbytes);
> @@ -1493,6 +1524,14 @@ static int __init aesni_init(void)
> aesni_gcm_enc_tfm = aesni_gcm_enc;
> aesni_gcm_dec_tfm = aesni_gcm_dec;
> }
> + aesni_ctr_enc_tfm = aesni_ctr_enc;
> +#if defined(CONFIG_AS_AVX)

Make that an #ifdef CONFIG_AS_AVX

> + if (boot_cpu_has(X86_FEATURE_AES) && boot_cpu_has(X86_FEATURE_AVX)) {

The test for X86_FEATURE_AES is already done a few lines before in the
x86_match_cpu() check. No need to duplicate it here.
Therefore you can reduce that test to
'if (boot_cpu_has(X86_FEATURE_AVX))' or even shorter to
'if (cpu_has_avx)' as X86_FEATURE_AVX has a convenience macro for that
test.

> + /* optimize performance of ctr mode encryption trasform */

Typo: transform

> + aesni_ctr_enc_tfm = aesni_ctr_enc_avx_tfm;
> + pr_info("AES CTR mode optimization enabled\n");

If you're emitting a message it should also say which kind of
optimization. In this case something like the following might be
appropriate: "AVX CTR mode optimization enabled".

> + }
> +#endif
> #endif
>
> err = crypto_fpu_init();

Regards,
Mathias

> --
> 1.8.2.1
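
P.S. Putting the glue code comments together, the selection logic could
shrink to something like the following. This is a user-space mock to show
the shape only, not a tested kernel change: the struct, the recording enum,
the empty routine bodies and the ctr_select() helper with its has_avx
parameter are all stand-ins I made up, while the function names and argument
order of the by8 routines are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define AES_KEYSIZE_128 16
#define AES_KEYSIZE_192 24
#define AES_KEYSIZE_256 32

typedef unsigned char u8;

/* Mock of the kernel context; only the field used here. */
struct crypto_aes_ctx { unsigned int key_length; };

/* Mock bookkeeping so the sketch can show which routine ran. */
enum impl { IMPL_NONE, IMPL_BY4, IMPL_BY8_128, IMPL_BY8_192, IMPL_BY8_256 };
static enum impl last_impl = IMPL_NONE;

/* Mock bodies; the real routines live in the assembly files. */
static void aesni_ctr_enc(struct crypto_aes_ctx *ctx, u8 *out, const u8 *in,
			  unsigned int len, u8 *iv)
{ (void)ctx; (void)out; (void)in; (void)len; (void)iv; last_impl = IMPL_BY4; }

static void aes_ctr_enc_128_avx_by8(const u8 *in, u8 *iv, void *keys,
				    u8 *out, unsigned int num_bytes)
{ (void)in; (void)iv; (void)keys; (void)out; (void)num_bytes; last_impl = IMPL_BY8_128; }

static void aes_ctr_enc_192_avx_by8(const u8 *in, u8 *iv, void *keys,
				    u8 *out, unsigned int num_bytes)
{ (void)in; (void)iv; (void)keys; (void)out; (void)num_bytes; last_impl = IMPL_BY8_192; }

static void aes_ctr_enc_256_avx_by8(const u8 *in, u8 *iv, void *keys,
				    u8 *out, unsigned int num_bytes)
{ (void)in; (void)iv; (void)keys; (void)out; (void)num_bytes; last_impl = IMPL_BY8_256; }

static void aesni_ctr_enc_avx_tfm(struct crypto_aes_ctx *ctx, u8 *out,
				  const u8 *in, unsigned int len, u8 *iv)
{
	/* setkey admits only three key lengths, so no by4 fallback branch */
	if (ctx->key_length == AES_KEYSIZE_128)
		aes_ctr_enc_128_avx_by8(in, iv, (void *)ctx, out, len);
	else if (ctx->key_length == AES_KEYSIZE_192)
		aes_ctr_enc_192_avx_by8(in, iv, (void *)ctx, out, len);
	else
		aes_ctr_enc_256_avx_by8(in, iv, (void *)ctx, out, len);
}

static void (*aesni_ctr_enc_tfm)(struct crypto_aes_ctx *ctx, u8 *out,
				 const u8 *in, unsigned int len, u8 *iv);

/* Hypothetical init helper; has_avx stands in for the CPU feature test. */
static void ctr_select(bool has_avx)
{
	aesni_ctr_enc_tfm = aesni_ctr_enc;
	if (has_avx)	/* AES itself was already checked earlier in init */
		aesni_ctr_enc_tfm = aesni_ctr_enc_avx_tfm;
}
```

If a model test for Haswell and newer turns out to be needed, it would
simply become part of whatever feeds has_avx here.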