Received: by 2002:ac0:aa62:0:0:0:0:0 with SMTP id w31-v6csp318088ima; Sat, 20 Oct 2018 08:01:09 -0700 (PDT) X-Google-Smtp-Source: ACcGV60Kg58ZsmDbtubsX6eXz+PzQLlM7av4a36bcHdPt5cPPBwZuhk5duAM55Mi8ovU3Nlry4BB X-Received: by 2002:a17:902:bc8c:: with SMTP id bb12-v6mr37138457plb.275.1540047669295; Sat, 20 Oct 2018 08:01:09 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1540047669; cv=none; d=google.com; s=arc-20160816; b=GfZ7wKZhAlyXkWIqqI8zHdHtRAR8kaJ/l/xYk/IpAfLhj7JTCs7mTX2khE8WlZxwXW EftFZ0K6FOdfphTRwG1XaYyycvFcoWKG8/SBWLLZz7GNAKeRWk00DNN7iw4s7Lt/jRrg +5vIoCyG1hQNzg7wox3y3shRAM0fH/q7OIGqFc7V+aF3mZKh3hGaOhYz4AVh++5YaBxu XL14rcoaxWkjXBgzx9DB1EYmxLd8AwjHCAmxGA55PoR3G6dL5T1qRAqV3eutjpCH6UsB US0rXAyX+GuXBLn24uHtzINgUn6Ma0AC+1uP/ArbvV73234QtiZ6r1wOM9ePwGVkAiH1 nHfg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:references:in-reply-to:mime-version :dkim-signature; bh=DPV3GfcDQeGLJmx80Ehi4cTMV5YxhaOHhrX/OYXb34w=; b=qcLNJQXklJReydL9CYIHcXAKBqSeZvQaNGAwaJJvQ0GPmv/TERM5wqRuJB0ol0bzNk jS+JT9868Nk899lEelgfob0YyhpQtSrCJgFGwxKBHb5qGfNn76TXl3qraZEsjtiOf3IU +R1suifcng/6mHSYYM7jR3mpov3CKhWGXvdQKzGCZXaI/6u//xRRQjrUi/Js5Uv8zLF2 jyRonBNxWwMVFzQlvg3rInmw0NWvzn36sTpXODIENcuvZkTuAkHELqG5PRwNBdbS3t0M iU0FI+yS9NXWQRWTTZui6NkVtZA9tkVVVeeDyo9jmyGGv2SEl5LrMXnJHqQd7jUsSTs2 mhKw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b="ff/qLj9p"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id c19-v6si1605363plz.283.2018.10.20.08.00.39; Sat, 20 Oct 2018 08:01:09 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b="ff/qLj9p"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727488AbeJTXKx (ORCPT + 99 others); Sat, 20 Oct 2018 19:10:53 -0400 Received: from mail-it1-f195.google.com ([209.85.166.195]:51403 "EHLO mail-it1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727413AbeJTXKx (ORCPT ); Sat, 20 Oct 2018 19:10:53 -0400 Received: by mail-it1-f195.google.com with SMTP id 74-v6so7467309itw.1 for ; Sat, 20 Oct 2018 08:00:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=DPV3GfcDQeGLJmx80Ehi4cTMV5YxhaOHhrX/OYXb34w=; b=ff/qLj9p0coDG4/4+SaWaGcTTgJviD/v5F668DAQXYGa1H1CKHfJnpR3EDgP/xQ3sc BoLrS+NPM4RQpmTSD0HSCkYrIYRkJP4piUQU0cW/Ul2JFcEufcnOdv1w3QQuInqOAr78 Bq0YA8VRZ+Pq8XRlooXXRp0P51KqEVn0obIy0= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=DPV3GfcDQeGLJmx80Ehi4cTMV5YxhaOHhrX/OYXb34w=; b=nbzNEOiRiqLzCwa+dyk1WFo7bYdleSirMGvssduWKyT60R/e2a1FLhWiVNWoqndOyg TKy3kQ/3vEEoksKK8rk2fmbDqhZbnawD96aIDvgHWSenAGM2vhEDupTqHuW9qdW96jdj Y3vgkejCE98JdYGh5n2XjBJKVv5z2KcqFncMmG58xwZbtqj/Jt4VVEkIGtcHB7gF9XT9 KV9EMByR+jspXWgvASPEO3ERRlU1zLhAGSkywwTa8FSS2qslJ9eftI4qhn1g3bjRoyAa Qc7N9WIXNYYD8ZvCOWBtRuVUWnxJ4seJ/qa7sr+OuALEuGBRoGK7ECdEmsM+kDzkkbSx hbYg== X-Gm-Message-State: ABuFfog+VM/Vii+FxlLA1w3oMSeuEfgZcPSzy79vWr4DZp8ft8u0dmql UvbuSj0hjkZ3ajfqutNsgkqfXC1fEcacm6pd6H9K5g== X-Received: by 2002:a05:660c:383:: with SMTP id x3mr5235345itj.121.1540047605417; Sat, 20 Oct 2018 08:00:05 -0700 (PDT) MIME-Version: 1.0 Received: by 2002:a6b:5910:0:0:0:0:0 with HTTP; Sat, 20 Oct 2018 08:00:04 -0700 (PDT) In-Reply-To: <20181020055136.GD876@sol.localdomain> References: <20181015175424.97147-1-ebiggers@kernel.org> <20181015175424.97147-11-ebiggers@kernel.org> <20181020055136.GD876@sol.localdomain> From: Ard Biesheuvel Date: Sat, 20 Oct 2018 23:00:04 +0800 Message-ID: Subject: Re: [RFC PATCH v2 10/12] crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305 To: Eric Biggers Cc: "open list:HARDWARE RANDOM NUMBER GENERATOR CORE" , linux-fscrypt@vger.kernel.org, linux-arm-kernel , Linux Kernel Mailing List , Herbert Xu , Paul Crowley , Greg Kaiser , Michael Halcrow , "Jason A . Donenfeld" , Samuel Neves , Tomer Ashur Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 20 October 2018 at 13:51, Eric Biggers wrote: > On Sat, Oct 20, 2018 at 12:12:56PM +0800, Ard Biesheuvel wrote: >> On 16 October 2018 at 01:54, Eric Biggers wrote: >> > From: Eric Biggers >> > >> > Add an ARM NEON implementation of NHPoly1305, an =CE=B5-almost-=E2=88= =86-universal >> > hash function used in the Adiantum encryption mode. For now, only the >> > NH portion is actually NEON-accelerated; the Poly1305 part is less >> > performance-critical so is just implemented in C. >> > >> > Signed-off-by: Eric Biggers >> > --- >> > arch/arm/crypto/Kconfig | 5 ++ >> > arch/arm/crypto/Makefile | 2 + >> > arch/arm/crypto/nh-neon-core.S | 116 ++++++++++++++++++++++++= + >> > arch/arm/crypto/nhpoly1305-neon-glue.c | 78 +++++++++++++++++ >> > 4 files changed, 201 insertions(+) >> > create mode 100644 arch/arm/crypto/nh-neon-core.S >> > create mode 100644 arch/arm/crypto/nhpoly1305-neon-glue.c >> > >> > diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig >> > index cc932d9bba561..458562a34aabe 100644 >> > --- a/arch/arm/crypto/Kconfig >> > +++ b/arch/arm/crypto/Kconfig >> > @@ -122,4 +122,9 @@ config CRYPTO_CHACHA20_NEON >> > select CRYPTO_BLKCIPHER >> > select CRYPTO_CHACHA20 >> > >> > +config CRYPTO_NHPOLY1305_NEON >> > + tristate "NEON accelerated NHPoly1305 hash function (for Adian= tum)" >> > + depends on KERNEL_MODE_NEON >> > + select CRYPTO_NHPOLY1305 >> > + >> > endif >> > diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile >> > index 005482ff95047..b65d6bfab8e6b 100644 >> > --- a/arch/arm/crypto/Makefile >> > +++ b/arch/arm/crypto/Makefile >> > @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) +=3D sha1-arm-neo= n.o >> > obj-$(CONFIG_CRYPTO_SHA256_ARM) +=3D sha256-arm.o >> > obj-$(CONFIG_CRYPTO_SHA512_ARM) +=3D sha512-arm.o >> > obj-$(CONFIG_CRYPTO_CHACHA20_NEON) +=3D chacha-neon.o >> > +obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) +=3D nhpoly1305-neon.o >> > >> > ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) +=3D aes-arm-ce.o >> > ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) +=3D sha1-arm-ce.o >> > @@ -53,6 +54,7 @@ ghash-arm-ce-y :=3D ghash-ce-core.o ghash-ce-= glue.o >> > crct10dif-arm-ce-y :=3D crct10dif-ce-core.o crct10dif-ce-glue.o >> > crc32-arm-ce-y:=3D crc32-ce-core.o crc32-ce-glue.o >> > chacha-neon-y :=3D chacha-neon-core.o chacha-neon-glue.o >> > +nhpoly1305-neon-y :=3D nh-neon-core.o nhpoly1305-neon-glue.o >> > >> > ifdef REGENERATE_ARM_CRYPTO >> > quiet_cmd_perl =3D PERL $@ >> > diff --git a/arch/arm/crypto/nh-neon-core.S b/arch/arm/crypto/nh-neon-= core.S >> > new file mode 100644 >> > index 0000000000000..434d80ab531c2 >> > --- /dev/null >> > +++ b/arch/arm/crypto/nh-neon-core.S >> > @@ -0,0 +1,116 @@ >> > +/* SPDX-License-Identifier: GPL-2.0 */ >> > +/* >> > + * NH - =CE=B5-almost-universal hash function, NEON accelerated versi= on >> > + * >> > + * Copyright 2018 Google LLC >> > + * >> > + * Author: Eric Biggers >> > + */ >> > + >> > +#include >> > + >> > + .text >> > + .fpu neon >> > + >> > + KEY .req r0 >> > + MESSAGE .req r1 >> > + MESSAGE_LEN .req r2 >> > + HASH .req r3 >> > + >> > + PASS0_SUMS .req q0 >> > + PASS0_SUM_A .req d0 >> > + PASS0_SUM_B .req d1 >> > + PASS1_SUMS .req q1 >> > + PASS1_SUM_A .req d2 >> > + PASS1_SUM_B .req d3 >> > + PASS2_SUMS .req q2 >> > + PASS2_SUM_A .req d4 >> > + PASS2_SUM_B .req d5 >> > + PASS3_SUMS .req q3 >> > + PASS3_SUM_A .req d6 >> > + PASS3_SUM_B .req d7 >> > + K0 .req q4 >> > + K1 .req q5 >> > + K2 .req q6 >> > + K3 .req q7 >> > + T0 .req q8 >> > + T0_L .req d16 >> > + T0_H .req d17 >> > + T1 .req q9 >> > + T1_L .req d18 >> > + T1_H .req d19 >> > + T2 .req q10 >> > + T2_L .req d20 >> > + T2_H .req d21 >> > + T3 .req q11 >> > + T3_L .req d22 >> > + T3_H .req d23 >> > + >> > +.macro _nh_stride k0, k1, k2, k3 >> > + >> > + // Load next message stride >> > + vld1.8 {T3}, [MESSAGE]! >> > + >> > + // Load next key stride >> > + vld1.32 {\k3}, [KEY]! >> > + >> > + // Add message words to key words >> > + vadd.u32 T0, T3, \k0 >> > + vadd.u32 T1, T3, \k1 >> > + vadd.u32 T2, T3, \k2 >> > + vadd.u32 T3, T3, \k3 >> > + >> > + // Multiply 32x32 =3D> 64 and accumulate >> > + vmlal.u32 PASS0_SUMS, T0_L, T0_H >> > + vmlal.u32 PASS1_SUMS, T1_L, T1_H >> > + vmlal.u32 PASS2_SUMS, T2_L, T2_H >> > + vmlal.u32 PASS3_SUMS, T3_L, T3_H >> > +.endm >> > + >> >> Since we seem to have some spare NEON registers: would it help to have >> a double round version of this macro? >> > > It helps a little bit, but not much. The loads are the only part that ca= n be > optimized further. I think I'd rather have the shorter + simpler version= , > unless the loads can be optimized significantly more on other processors. > > Also, originally I had it loading the key and message for the next stride= while > doing the current one, but that didn't seem worthwhile either. > Fair enough. >> > +/* >> > + * void nh_neon(const u32 *key, const u8 *message, size_t message_len= , >> > + * u8 hash[NH_HASH_BYTES]) >> > + * >> > + * It's guaranteed that message_len % 16 =3D=3D 0. >> > + */ >> > +ENTRY(nh_neon) >> > + >> > + vld1.32 {K0,K1}, [KEY]! >> > + vmov.u64 PASS0_SUMS, #0 >> > + vmov.u64 PASS1_SUMS, #0 >> > + vld1.32 {K2}, [KEY]! >> > + vmov.u64 PASS2_SUMS, #0 >> > + vmov.u64 PASS3_SUMS, #0 >> > + >> > + subs MESSAGE_LEN, MESSAGE_LEN, #64 >> > + blt .Lloop4_done >> > +.Lloop4: >> > + _nh_stride K0, K1, K2, K3 >> > + _nh_stride K1, K2, K3, K0 >> > + _nh_stride K2, K3, K0, K1 >> > + _nh_stride K3, K0, K1, K2 >> > + subs MESSAGE_LEN, MESSAGE_LEN, #64 >> > + bge .Lloop4 >> > + >> > +.Lloop4_done: >> > + ands MESSAGE_LEN, MESSAGE_LEN, #63 >> > + beq .Ldone >> > + _nh_stride K0, K1, K2, K3 >> > + >> > + subs MESSAGE_LEN, MESSAGE_LEN, #16 >> > + beq .Ldone >> > + _nh_stride K1, K2, K3, K0 >> > + >> > + subs MESSAGE_LEN, MESSAGE_LEN, #16 >> > + beq .Ldone >> > + _nh_stride K2, K3, K0, K1 >> > + >> > +.Ldone: >> > + // Sum the accumulators for each pass, then store the sums to = 'hash' >> > + vadd.u64 T0_L, PASS0_SUM_A, PASS0_SUM_B >> > + vadd.u64 T0_H, PASS1_SUM_A, PASS1_SUM_B >> > + vadd.u64 T1_L, PASS2_SUM_A, PASS2_SUM_B >> > + vadd.u64 T1_H, PASS3_SUM_A, PASS3_SUM_B >> > + vst1.8 {T0-T1}, [HASH] >> > + bx lr >> > +ENDPROC(nh_neon) >> > diff --git a/arch/arm/crypto/nhpoly1305-neon-glue.c b/arch/arm/crypto/= nhpoly1305-neon-glue.c >> > new file mode 100644 >> > index 0000000000000..df48a00f4c50f >> > --- /dev/null >> > +++ b/arch/arm/crypto/nhpoly1305-neon-glue.c >> > @@ -0,0 +1,78 @@ >> > +// SPDX-License-Identifier: GPL-2.0 >> > +/* >> > + * NHPoly1305 - =CE=B5-almost-=E2=88=86-universal hash function for A= diantum >> > + * (NEON accelerated version) >> > + * >> > + * Copyright 2018 Google LLC >> > + */ >> > + >> > +#include >> > +#include >> > +#include >> > +#include >> > +#include >> > + >> > +asmlinkage void nh_neon(const u32 *key, const u8 *message, size_t mes= sage_len, >> > + u8 hash[NH_HASH_BYTES]); >> > + >> > +static void _nh_neon(const u32 *key, const u8 *message, size_t messag= e_len, >> > + __le64 hash[NH_NUM_PASSES]) >> > +{ >> > + nh_neon(key, message, message_len, (u8 *)hash); >> > +} >> > + >> >> Why do we need this function? >> > > For now it's not needed so I should probably just remove it, but it seems= likely > that indirect calls to assembly functions in the kernel will be going awa= y in > order to add support for CFI (control flow integrity). The android-4.9 a= nd > android-4.14 kernels support CFI on arm64, so you might notice that some = of the > arm64 crypto code had to be patched for this reason. > I didn't actually look at those kernel trees so I hadn't noticed yet. In any case, I'd suggest that we just keep this wrapper then, but please add a comment describing why it's there.