Message-ID: <438feee8-e529-8614-41cb-4f7bec2abcf6@gmail.com>
Date: Fri, 2 Sep 2022 17:31:29 +0900
Subject: Re: [PATCH v2 2/3] crypto: aria-avx: add AES-NI/AVX/x86_64 assembler implementation of aria cipher
To: Jussi Kivilinna
Cc: linux-crypto@vger.kernel.org, elliott@hpe.com, herbert@gondor.apana.org.au, tglx@linutronix.de, mingo@redhat.com, dave.hansen@linux.intel.com, davem@davemloft.net, bp@alien8.de, x86@kernel.org, hpa@zytor.com
References: <20220826053131.24792-1-ap420073@gmail.com> <20220826053131.24792-3-ap420073@gmail.com>
From: Taehee Yoo
X-Mailing-List: linux-crypto@vger.kernel.org

Hi Jussi,
Thank you so much for this work!

On 9/2/22 04:51, Jussi Kivilinna wrote:
> Hello,
>
> On 26.8.2022 8.31, Taehee Yoo wrote:
>> +#define aria_sbox_8way(x0, x1, x2, x3,          \
>> +                       x4, x5, x6, x7,          \
>> +                       t0, t1, t2, t3,          \
>> +                       t4, t5, t6, t7)          \
>> +        vpxor t0, t0, t0;                       \
>> +        vaesenclast t0, x0, x0;                 \
>> +        vaesenclast t0, x4, x4;                 \
>> +        vaesenclast t0, x1, x1;                 \
>> +        vaesenclast t0, x5, x5;                 \
>> +        vaesdeclast t0, x2, x2;                 \
>> +        vaesdeclast t0, x6, x6;                 \
>> +                                                \
>> +        /* AES inverse shift rows */            \
>> +        vmovdqa .Linv_shift_row, t0;            \
>> +        vmovdqa .Lshift_row, t1;                \
>> +        vpshufb t0, x0, x0;                     \
>> +        vpshufb t0, x4, x4;                     \
>> +        vpshufb t0, x1, x1;                     \
>> +        vpshufb t0, x5, x5;                     \
>> +        vpshufb t0, x3, x3;                     \
>> +        vpshufb t0, x7, x7;                     \
>> +        vpshufb t1, x2, x2;                     \
>> +        vpshufb t1, x6, x6;                     \
>> +                                                \
>> +        vmovdqa .Linv_lo, t0;                   \
>> +        vmovdqa .Linv_hi, t1;                   \
>> +        vmovdqa .Ltf_lo_s2, t2;                 \
>> +        vmovdqa .Ltf_hi_s2, t3;                 \
>> +        vmovdqa .Ltf_lo_x2, t4;                 \
>> +        vmovdqa .Ltf_hi_x2, t5;                 \
>> +        vbroadcastss .L0f0f0f0f, t6;            \
>> +                                                \
>> +        /* extract multiplicative inverse */    \
>> +        filter_8bit(x1, t0, t1, t6, t7);        \
>> +        /* affine transformation for S2 */      \
>> +        filter_8bit(x1, t2, t3, t6, t7);        \
>
> Here's room for improvement. These two affine transformations
> could be combined into single filter_8bit...
>
>> +        /* extract multiplicative inverse */    \
>> +        filter_8bit(x5, t0, t1, t6, t7);        \
>> +        /* affine transformation for S2 */      \
>> +        filter_8bit(x5, t2, t3, t6, t7);        \
>> +                                                \
>> +        /* affine transformation for X2 */      \
>> +        filter_8bit(x3, t4, t5, t6, t7);        \
>> +        vpxor t7, t7, t7;                       \
>> +        vaesenclast t7, x3, x3;                 \
>> +        /* extract multiplicative inverse */    \
>> +        filter_8bit(x3, t0, t1, t6, t7);        \
>> +        /* affine transformation for X2 */      \
>> +        filter_8bit(x7, t4, t5, t6, t7);        \
>> +        vpxor t7, t7, t7;                       \
>> +        vaesenclast t7, x7, x7;                 \
>> +        /* extract multiplicative inverse */    \
>> +        filter_8bit(x7, t0, t1, t6, t7);
>
> ... as well as these two filter_8bit could be replaced with
> one operation if 'vaesenclast' would be changed to 'vaesdeclast'.
>
> With these optimizations, 'aria_sbox_8way' would look like this:
>
> /////////////////////////////////////////////////////////
> #define aria_sbox_8way(x0, x1, x2, x3,           \
>                        x4, x5, x6, x7,           \
>                        t0, t1, t2, t3,           \
>                        t4, t5, t6, t7)           \
>         vpxor t7, t7, t7;                        \
>         vmovdqa .Linv_shift_row, t0;             \
>         vmovdqa .Lshift_row, t1;                 \
>         vpbroadcastd .L0f0f0f0f, t6;             \
>         vmovdqa .Ltf_lo__inv_aff__and__s2, t2;   \
>         vmovdqa .Ltf_hi__inv_aff__and__s2, t3;   \
>         vmovdqa .Ltf_lo__x2__and__fwd_aff, t4;   \
>         vmovdqa .Ltf_hi__x2__and__fwd_aff, t5;   \
>                                                  \
>         vaesenclast t7, x0, x0;                  \
>         vaesenclast t7, x4, x4;                  \
>         vaesenclast t7, x1, x1;                  \
>         vaesenclast t7, x5, x5;                  \
>         vaesdeclast t7, x2, x2;                  \
>         vaesdeclast t7, x6, x6;                  \
>                                                  \
>         /* AES inverse shift rows */             \
>         vpshufb t0, x0, x0;                      \
>         vpshufb t0, x4, x4;                      \
>         vpshufb t0, x1, x1;                      \
>         vpshufb t0, x5, x5;                      \
>         vpshufb t1, x3, x3;                      \
>         vpshufb t1, x7, x7;                      \
>         vpshufb t1, x2, x2;                      \
>         vpshufb t1, x6, x6;                      \
>                                                  \
>         /* affine transformation for S2 */       \
>         filter_8bit(x1, t2, t3, t6, t0);         \
>         /* affine transformation for S2 */       \
>         filter_8bit(x5, t2, t3, t6, t0);         \
>                                                  \
>         /* affine transformation for X2 */       \
>         filter_8bit(x3, t4, t5, t6, t0);         \
>         /* affine transformation for X2 */       \
>         filter_8bit(x7, t4, t5, t6, t0);         \
>         vaesdeclast t7, x3, x3;                  \
>         vaesdeclast t7, x7, x7;
>
> /* AES inverse affine and S2 combined:
>  *      1 1 0 0 0 0 0 1       x0     0
>  *      0 1 0 0 1 0 0 0       x1     0
>  *      1 1 0 0 1 1 1 1       x2     0
>  *      0 1 1 0 1 0 0 1       x3     1
>  *      0 1 0 0 1 1 0 0   *   x4  +  0
>  *      0 1 0 1 1 0 0 0       x5     0
>  *      0 0 0 0 0 1 0 1       x6     0
>  *      1 1 1 0 0 1 1 1       x7     1
>  */
> .Ltf_lo__inv_aff__and__s2:
>         .octa 0x92172DA81A9FA520B2370D883ABF8500
> .Ltf_hi__inv_aff__and__s2:
>         .octa 0x2B15FFC1AF917B45E6D8320C625CB688
>
> /* X2 and AES forward affine combined:
>  *      1 0 1 1 0 0 0 1       x0     0
>  *      0 1 1 1 1 0 1 1       x1     0
>  *      0 0 0 1 1 0 1 0       x2     1
>  *      0 1 0 0 0 1 0 0       x3     0
>  *      0 0 1 1 1 0 1 1   *   x4  +  0
>  *      0 1 0 0 1 0 0 0       x5     0
>  *      1 1 0 1 0 0 1 1       x6     0
>  *      0 1 0 0 1 0 1 0       x7     0
>  */
> .Ltf_lo__x2__and__fwd_aff:
>         .octa 0xEFAE0544FCBD1657B8F95213ABEA4100
> .Ltf_hi__x2__and__fwd_aff:
>         .octa 0x3F893781E95FE1576CDA64D2BA0CB204
> /////////////////////////////////////////////////////////
>
> I tested above quickly in userspace against aria-generic
> and your original aria-avx implementation and output matches
> to these references. In quick and dirty benchmark, function
> execution time was ~30% faster on AMD Zen3 and ~20% faster
> on Intel tiger-lake.

I tested your implementation.
It works very well and as you mentioned, it improves performance so much!

Before:
128bit 4096bytes: 14758 cycles
After:
128bit 4096bytes: 11972 cycles

I will apply your implementation in the v3 patch!

Thank you so much!
Taehee Yoo
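
For reference, here is a small userspace sketch (Python, not kernel code) of how the combined lookup tables relate to the matrices in the quoted comments. It rests on two assumptions that are not spelled out in the thread: that filter_8bit looks up the low and the high nibble of each byte in two 16-entry tables and XORs the results, and that the additive constant is folded into the high-nibble table. The helper names (parse_matrix, affine, nibble_tables, as_octa) are made up for this illustration.

```python
# Sketch: rebuild the lo/hi nibble tables from the 8x8 bit matrices quoted
# in the mail. Matrix columns are x0..x7 left to right, with x0 as bit 0.

def parse_matrix(rows):
    """Return the 8 matrix columns as bytes (bit i set when row i has a 1)."""
    bits = [[int(b) for b in row.split()] for row in rows]
    return [sum(bits[i][j] << i for i in range(8)) for j in range(8)]

def affine(cols, const, x):
    """Compute M*x + const over GF(2) for one byte x."""
    y = const
    for j in range(8):
        if (x >> j) & 1:
            y ^= cols[j]
    return y

def nibble_tables(cols, const):
    """Split the affine map into low- and high-nibble lookup tables."""
    lo = [affine(cols, 0, n) for n in range(16)]            # linear part only
    hi = [affine(cols, const, n << 4) for n in range(16)]   # assumed: constant lives here
    return lo, hi

def as_octa(table):
    """Pack 16 table bytes into one 128-bit value, entry 0 in the lowest byte."""
    return sum(b << (8 * i) for i, b in enumerate(table))

def filter_8bit(x, lo, hi):
    """Scalar model of the assumed filter_8bit behaviour for a single byte."""
    return lo[x & 0x0F] ^ hi[x >> 4]

# "AES inverse affine and S2 combined", rows as printed in the comment.
INV_AFF_S2 = parse_matrix([
    "1 1 0 0 0 0 0 1", "0 1 0 0 1 0 0 0", "1 1 0 0 1 1 1 1", "0 1 1 0 1 0 0 1",
    "0 1 0 0 1 1 0 0", "0 1 0 1 1 0 0 0", "0 0 0 0 0 1 0 1", "1 1 1 0 0 1 1 1",
])
INV_AFF_S2_CONST = 0x88  # constant column has ones at x3 and x7

# "X2 and AES forward affine combined", rows as printed in the comment.
X2_FWD_AFF = parse_matrix([
    "1 0 1 1 0 0 0 1", "0 1 1 1 1 0 1 1", "0 0 0 1 1 0 1 0", "0 1 0 0 0 1 0 0",
    "0 0 1 1 1 0 1 1", "0 1 0 0 1 0 0 0", "1 1 0 1 0 0 1 1", "0 1 0 0 1 0 1 0",
])
X2_FWD_AFF_CONST = 0x04  # constant column has a one at x2

lo_s2, hi_s2 = nibble_tables(INV_AFF_S2, INV_AFF_S2_CONST)
lo_x2, hi_x2 = nibble_tables(X2_FWD_AFF, X2_FWD_AFF_CONST)

print(".Ltf_lo__inv_aff__and__s2:", hex(as_octa(lo_s2)))
print(".Ltf_hi__inv_aff__and__s2:", hex(as_octa(hi_s2)))
print(".Ltf_lo__x2__and__fwd_aff:", hex(as_octa(lo_x2)))
print(".Ltf_hi__x2__and__fwd_aff:", hex(as_octa(hi_x2)))
```

Because the map is affine, one lo/hi table pair covers any chain of affine steps, which is why two back-to-back filter_8bit calls can collapse into one once their matrices are multiplied together. The sketch only checks the tables against the printed matrices; it does not derive those matrices from the ARIA or AES S-box definitions.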