From: "Jason A. Donenfeld"
Date: Mon, 7 Nov 2016 19:08:22 +0100
Subject: Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
To: Eric Biggers
Cc: David Miller, Herbert Xu, linux-crypto@vger.kernel.org, LKML, Martin Willi,
 WireGuard mailing list, René van Dorst

Hi Eric,

On Fri, Nov 4, 2016 at 6:37 PM, Eric Biggers wrote:
> I agree, and the current code is wrong; but do note that this proposal is
> correct for poly1305_setrkey() but not for poly1305_setskey() and
> poly1305_blocks(). In the latter two cases, 4-byte alignment of the source
> buffer is *not* guaranteed. Although crypto_poly1305_update() will be called
> with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
> algorithm operates on 16-byte blocks and therefore has to buffer partial
> blocks. If some number of bytes that is not 0 mod 4 is buffered, then the
> buffer will fall out of alignment on the next update call. Hence,
> get_unaligned_le32() is actually needed on all the loads, since the buffer
> will, in general, be of unknown alignment.

Hmm... The general data flow that strikes me as most pertinent is
something like this:

struct sk_buff *skb = get_it_from_somewhere();
skb = skb_share_check(skb, GFP_ATOMIC);
num_frags = skb_cow_data(skb, ..., ...);

struct scatterlist sg[num_frags];
sg_init_table(sg, num_frags);
skb_to_sgvec(skb, sg, ..., ...);

blkcipher_walk_init(&walk, sg, sg, len);
blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);

while (walk.nbytes >= BLOCK_SIZE) {
        size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
        poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
        blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
}

if (walk.nbytes) {
        poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
        blkcipher_walk_done(&desc, &walk, 0);
}

Is your suggestion that in the final if block, walk.src.virt.addr might
be unaligned? Like in the case of the last fragment being 67 bytes
long? If so, what a hassle. I hope the performance overhead isn't too
awful... I'll resubmit taking your suggestions into account.

By the way -- offlist benchmarks sent to me concluded that using the
unaligned load helpers, as David suggested, is just as fast as the
handrolled bit magic in v1.

Regards,
Jason
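
P.S. For concreteness, here is a rough standalone sketch of the two load
styles under discussion -- the helper-based unaligned load versus the
handrolled byte assembly from v1. The memcpy-based helper_le32() below is
only an illustrative stand-in for the kernel's get_unaligned_le32(), not
the actual crypto/poly1305_generic.c code, and it assumes a little-endian
host:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for get_unaligned_le32(): memcpy lets the compiler pick
 * whatever alignment-safe load the target supports. */
static uint32_t helper_le32(const uint8_t *p)
{
        uint32_t v;

        memcpy(&v, p, sizeof(v));
        return v;
}

/* The "handrolled bit magic" style: assemble the word one byte at a
 * time, which is alignment-safe by construction. */
static uint32_t handrolled_le32(const uint8_t *p)
{
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

int main(void)
{
        /* A deliberately misaligned source pointer, standing in for
         * what Eric describes: once a partial block has been buffered,
         * subsequent loads land on addresses that are not 0 mod 4. */
        uint8_t buf[32] = { 0 };
        const uint8_t *p = buf + 13;

        memcpy(buf + 13, "\x01\x02\x03\x04", 4);
        printf("helper:     %08x\n", (unsigned)helper_le32(p));
        printf("handrolled: %08x\n", (unsigned)handrolled_le32(p));
        return 0;
}

Both print 04030201; the difference is only in what the compiler can emit
for the load, which is presumably what those benchmarks were measuring.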