From: Ard Biesheuvel <ardb@kernel.org>
To: linux-crypto@vger.kernel.org
Cc: herbert@gondor.apana.org.au, Ard Biesheuvel, Ondrej Mosnacek, Eric Biggers
Subject: [PATCH v2 2/4] crypto: aegis128/neon - optimize tail block handling
Date: Tue, 10 Nov 2020 20:04:42 +0100
Message-Id: <20201110190444.10634-3-ardb@kernel.org>
In-Reply-To: <20201110190444.10634-1-ardb@kernel.org>
References: <20201110190444.10634-1-ardb@kernel.org>

Avoid copying the tail block via a stack buffer if the total size
exceeds a single AEGIS block. In this case, we can use overlapping
loads and stores and NEON permutation instructions instead, which
leads to a modest performance improvement on some cores (< 5%), and
is slightly cleaner.

Note that we still need to use a stack buffer if the entire input is
smaller than 16 bytes, given that we cannot use 16 byte NEON loads and
stores safely in this case.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 crypto/aegis128-neon-inner.c | 89 +++++++++++++++++---
 1 file changed, 75 insertions(+), 14 deletions(-)

diff --git a/crypto/aegis128-neon-inner.c b/crypto/aegis128-neon-inner.c
index 2a660ac1bc3a..cd1b3ad1d1f3 100644
--- a/crypto/aegis128-neon-inner.c
+++ b/crypto/aegis128-neon-inner.c
@@ -20,7 +20,6 @@
 extern int aegis128_have_aes_insn;
 
 void *memcpy(void *dest, const void *src, size_t n);
-void *memset(void *s, int c, size_t n);
 
 struct aegis128_state {
         uint8x16_t v[5];
@@ -173,10 +172,46 @@ void crypto_aegis128_update_neon(void *state, const void *msg)
         aegis128_save_state_neon(st, state);
 }
 
+#ifdef CONFIG_ARM
+/*
+ * AArch32 does not provide these intrinsics natively because it does not
+ * implement the underlying instructions. AArch32 only provides 64-bit
+ * wide vtbl.8/vtbx.8 instruction, so use those instead.
+ */
+static uint8x16_t vqtbl1q_u8(uint8x16_t a, uint8x16_t b)
+{
+        union {
+                uint8x16_t      val;
+                uint8x8x2_t     pair;
+        } __a = { a };
+
+        return vcombine_u8(vtbl2_u8(__a.pair, vget_low_u8(b)),
+                           vtbl2_u8(__a.pair, vget_high_u8(b)));
+}
+
+static uint8x16_t vqtbx1q_u8(uint8x16_t v, uint8x16_t a, uint8x16_t b)
+{
+        union {
+                uint8x16_t      val;
+                uint8x8x2_t     pair;
+        } __a = { a };
+
+        return vcombine_u8(vtbx2_u8(vget_low_u8(v), __a.pair, vget_low_u8(b)),
+                           vtbx2_u8(vget_high_u8(v), __a.pair, vget_high_u8(b)));
+}
+#endif
+
+static const uint8_t permute[] __aligned(64) = {
+        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+         0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
+        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+};
+
 void crypto_aegis128_encrypt_chunk_neon(void *state, void *dst, const void *src,
                                         unsigned int size)
 {
         struct aegis128_state st = aegis128_load_state_neon(state);
+        const int short_input = size < AEGIS_BLOCK_SIZE;
         uint8x16_t msg;
 
         preload_sbox();
@@ -186,7 +221,8 @@ void crypto_aegis128_encrypt_chunk_neon(void *state, void *dst, const void *src,
 
                 msg = vld1q_u8(src);
                 st = aegis128_update_neon(st, msg);
-                vst1q_u8(dst, msg ^ s);
+                msg ^= s;
+                vst1q_u8(dst, msg);
 
                 size -= AEGIS_BLOCK_SIZE;
                 src += AEGIS_BLOCK_SIZE;
@@ -195,13 +231,26 @@ void crypto_aegis128_encrypt_chunk_neon(void *state, void *dst, const void *src,
 
         if (size > 0) {
                 uint8x16_t s = st.v[1] ^ (st.v[2] & st.v[3]) ^ st.v[4];
-                uint8_t buf[AEGIS_BLOCK_SIZE] = {};
+                uint8_t buf[AEGIS_BLOCK_SIZE];
+                const void *in = src;
+                void *out = dst;
+                uint8x16_t m;
 
-                memcpy(buf, src, size);
-                msg = vld1q_u8(buf);
-                st = aegis128_update_neon(st, msg);
-                vst1q_u8(buf, msg ^ s);
-                memcpy(dst, buf, size);
+                if (__builtin_expect(short_input, 0))
+                        in = out = memcpy(buf + AEGIS_BLOCK_SIZE - size, src, size);
+
+                m = vqtbl1q_u8(vld1q_u8(in + size - AEGIS_BLOCK_SIZE),
+                               vld1q_u8(permute + 32 - size));
+
+                st = aegis128_update_neon(st, m);
+
+                vst1q_u8(out + size - AEGIS_BLOCK_SIZE,
+                         vqtbl1q_u8(m ^ s, vld1q_u8(permute + size)));
+
+                if (__builtin_expect(short_input, 0))
+                        memcpy(dst, out, size);
+                else
+                        vst1q_u8(out - AEGIS_BLOCK_SIZE, msg);
         }
 
         aegis128_save_state_neon(st, state);
@@ -211,6 +260,7 @@ void crypto_aegis128_decrypt_chunk_neon(void *state, void *dst, const void *src,
                                         unsigned int size)
 {
         struct aegis128_state st = aegis128_load_state_neon(state);
+        const int short_input = size < AEGIS_BLOCK_SIZE;
         uint8x16_t msg;
 
         preload_sbox();
@@ -228,14 +278,25 @@ void crypto_aegis128_decrypt_chunk_neon(void *state, void *dst, const void *src,
         if (size > 0) {
                 uint8x16_t s = st.v[1] ^ (st.v[2] & st.v[3]) ^ st.v[4];
                 uint8_t buf[AEGIS_BLOCK_SIZE];
+                const void *in = src;
+                void *out = dst;
+                uint8x16_t m;
 
-                vst1q_u8(buf, s);
-                memcpy(buf, src, size);
-                msg = vld1q_u8(buf) ^ s;
-                vst1q_u8(buf, msg);
-                memcpy(dst, buf, size);
+                if (__builtin_expect(short_input, 0))
+                        in = out = memcpy(buf + AEGIS_BLOCK_SIZE - size, src, size);
 
-                st = aegis128_update_neon(st, msg);
+                m = s ^ vqtbx1q_u8(s, vld1q_u8(in + size - AEGIS_BLOCK_SIZE),
+                                   vld1q_u8(permute + 32 - size));
+
+                st = aegis128_update_neon(st, m);
+
+                vst1q_u8(out + size - AEGIS_BLOCK_SIZE,
+                         vqtbl1q_u8(m, vld1q_u8(permute + size)));
+
+                if (__builtin_expect(short_input, 0))
+                        memcpy(dst, out, size);
+                else
+                        vst1q_u8(out - AEGIS_BLOCK_SIZE, msg);
         }
 
         aegis128_save_state_neon(st, state);
-- 
2.17.1
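
[Editor's illustration, not part of the patch.] The permute-table trick the
commit message describes can be shown in isolation. The sketch below is a
hypothetical standalone demo: the helper names (load_padded_tail, store_tail)
and the demo buffer are made up, and it assumes an AArch64 toolchain where
<arm_neon.h> provides vqtbl1q_u8 natively. It shows how an overlapping 16-byte
load plus a tbl lookup through a sliding window into the 48-byte table yields
the zero-padded tail block, and how the mirrored store shifts the bytes back
up, zeroing the lanes just before the tail - which is exactly why the patch
re-stores the preceding full ciphertext block in the "else" branch above.

/*
 * Standalone illustration (not from the patch). Assumes AArch64.
 * Build: gcc -O2 -o tail-demo tail-demo.c
 */
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

/* same sliding-window table as in the patch: zeros, identity, zeros */
static const uint8_t permute[48] = {
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
};

/*
 * Return the last 'size' bytes (0 < size < 16) of a buffer holding at
 * least 16 bytes in total, zero padded to a full vector. The 16-byte
 * load starts before the tail, which is safe because a full block
 * precedes it; tbl then shifts the wanted bytes down to lanes
 * 0..size-1 and zeroes the rest (out-of-range tbl indices yield 0).
 */
static uint8x16_t load_padded_tail(const uint8_t *end, unsigned int size)
{
        return vqtbl1q_u8(vld1q_u8(end - 16), vld1q_u8(permute + 32 - size));
}

/*
 * Mirror image of the load: shift lanes 0..size-1 up so a 16-byte store
 * ending at 'end' lands them in the final 'size' bytes. The 16 - size
 * bytes before the tail are overwritten with zeroes, so a caller must
 * re-store the preceding full block afterwards (as the patch does).
 */
static void store_tail(uint8_t *end, uint8x16_t v, unsigned int size)
{
        vst1q_u8(end - 16, vqtbl1q_u8(v, vld1q_u8(permute + size)));
}

int main(void)
{
        uint8_t buf[21], out[16];
        unsigned int size = sizeof(buf) % 16;   /* 5-byte tail after one full block */

        for (unsigned int i = 0; i < sizeof(buf); i++)
                buf[i] = i + 1;

        vst1q_u8(out, load_padded_tail(buf + sizeof(buf), size));
        for (int i = 0; i < 16; i++)
                printf("%02x ", out[i]);        /* 11 12 13 14 15 00 00 ... */
        printf("\n");

        /* round-trip: this restores buf[16..20] but zeroes buf[5..15] */
        store_tail(buf + sizeof(buf), vld1q_u8(out), size);
        return 0;
}

The short-input fallback in the patch exists because this overlapping access
pattern reads and writes up to 16 - size bytes outside the tail; that is only
safe when at least one full AEGIS block precedes it in the same buffer, hence
the stack-buffer path when the entire input is smaller than 16 bytes.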