From: "Jason A. Donenfeld"
Date: Thu, 3 Nov 2016 23:20:08 +0100
Subject: Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
To: David Miller
Cc: Herbert Xu, linux-crypto@vger.kernel.org, LKML, Martin Willi, WireGuard mailing list, René van Dorst
In-Reply-To: <20161103.130852.1456848512897088071.davem@davemloft.net>
References: <20161103004934.GA30775@gondor.apana.org.au> <20161103.130852.1456848512897088071.davem@davemloft.net>

Hi David,

On Thu, Nov 3, 2016 at 6:08 PM, David Miller wrote:
> In any event no piece of code should be doing 32-bit word reads from
> addresses like "x + 3" without, at a very minimum, going through the
> kernel unaligned access handlers.

Excellent point. In other words,

ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
ctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
ctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
ctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

should change to:

ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
ctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
ctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
ctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

(The reads at key + 0 and key + 12 can stay as plain le32_to_cpuvp()
loads, since those offsets are word aligned whenever the key itself is;
only the reads at key + 3, 6, and 9 are necessarily misaligned.)

> We know explicitly that these offsets will not be 32-bit aligned, so
> it is required that we use the helpers, or alternatively do things to
> avoid these unaligned accesses such as using temporary storage when
> the HAVE_EFFICIENT_UNALIGNED_ACCESS kconfig value is not set.

So the question is: is the clever avoidance of unaligned accesses in
the original patch faster or slower than simply changing the unaligned
accesses to use the helpers?

I've put together a little test harness for playing with this:

$ git clone git://git.zx2c4.com/polybench
$ cd polybench
$ make run

To test with the first method, run it as is. To test with the other,
remove "#define USE_FIRST_METHOD" from the source code.

@René: do you think you could retest on your MIPS32r2 hardware and
report back which method is faster? If anybody else has other hardware
and would like to give this a try, that would be helpful as well.

Regards,
Jason
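
P.S. In case it helps anyone following along, below is a small
standalone userspace sketch of the helper-based approach -- my own
illustration, not the actual kernel code. load_le32_unaligned() is just
a portable stand-in for the kernel's get_unaligned_le32(), and the key
bytes are arbitrary example values:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Portable stand-in for get_unaligned_le32(): assembling the word byte
 * by byte is alignment-safe and endian-independent. */
static uint32_t load_le32_unaligned(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

int main(void)
{
	/* Arbitrary example bytes standing in for the first 16 bytes of
	 * the one-time key. */
	static const uint8_t key[16] = {
		0x85, 0xd6, 0xbe, 0x78, 0x57, 0x55, 0x6d, 0x33,
		0x7f, 0x44, 0x52, 0xfe, 0x42, 0xd5, 0x06, 0xa8,
	};
	uint32_t r[5];
	int i;

	/* The same clamped 26-bit limb loads as above; key + 3, 6, and 9
	 * are the offsets that are necessarily misaligned. */
	r[0] = (load_le32_unaligned(key +  0) >> 0) & 0x3ffffff;
	r[1] = (load_le32_unaligned(key +  3) >> 2) & 0x3ffff03;
	r[2] = (load_le32_unaligned(key +  6) >> 4) & 0x3ffc0ff;
	r[3] = (load_le32_unaligned(key +  9) >> 6) & 0x3f03fff;
	r[4] = (load_le32_unaligned(key + 12) >> 8) & 0x00fffff;

	for (i = 0; i < 5; i++)
		printf("r[%d] = 0x%07" PRIx32 "\n", i, r[i]);
	return 0;
}

On architectures with fast unaligned access, compilers will typically
fuse those four byte loads back into a single 32-bit load anyway, which
is exactly why the helper-based version might turn out to be no slower
than the patch's clever avoidance scheme.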