From: Ard Biesheuvel Subject: Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS Date: Tue, 19 Jun 2018 00:04:01 +0200 Message-ID: References: <20180214184223.254359-1-ebiggers@google.com> <20180214184223.254359-4-ebiggers@google.com> <8396d433caf1155f9ca422c6bad3200b@agner.ch> <20180618215657.GB8022@google.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Cc: Jeffrey Walton , Greg Kaiser , Herbert Xu , Michael Halcrow , Patrik Torstensson , Stefan Agner , Paul Lawrence , linux-fscrypt@vger.kernel.org, "open list:HARDWARE RANDOM NUMBER GENERATOR CORE" , Greg Kroah-Hartman , Alex Cope , linux-crypto-owner@vger.kernel.org, linux-arm-kernel , Paul Crowley To: Eric Biggers Return-path: In-Reply-To: <20180618215657.GB8022@google.com> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=m.gmane.org@lists.infradead.org List-Id: linux-crypto.vger.kernel.org On 18 June 2018 at 23:56, Eric Biggers wrote: > On Sun, Jun 17, 2018 at 01:10:41PM +0200, Ard Biesheuvel wrote: >> >>>>> + >> >>>>> + // One-time XTS preparation >> >>>>> + >> >>>>> + /* >> >>>>> + * Allocate stack space to store 128 bytes worth of tweaks. For >> >>>>> + * performance, this space is aligned to a 16-byte boundary so that we >> >>>>> + * can use the load/store instructions that declare 16-byte alignment. >> >>>>> + */ >> >>>>> + sub sp, #128 >> >>>>> + bic sp, #0xf >> >>>> >> >>>> >> >>>> This fails here when building with CONFIG_THUMB2_KERNEL=y >> >>>> >> >>>> AS arch/arm/crypto/speck-neon-core.o >> >>>> >> >>>> arch/arm/crypto/speck-neon-core.S: Assembler messages: >> >>>> >> >>>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here -- >> >>>> `bic sp,#0xf' >> >>>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here -- >> >>>> `bic sp,#0xf' >> >>>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here -- >> >>>> `bic sp,#0xf' >> >>>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here -- >> >>>> `bic sp,#0xf' >> >>>> >> >>>> In a quick hack this change seems to address it: >> >>>> >> >>>> >> >>>> - sub sp, #128 >> >>>> - bic sp, #0xf >> >>>> + mov r6, sp >> >>>> + sub r6, #128 >> >>>> + bic r6, #0xf >> >>>> + mov sp, r6 >> >>>> >> >>>> But there is probably a better solution to address this. >> >>>> >> >>> >> >>> Given that there is no NEON on M class cores, I recommend we put something like >> >>> >> >>> THUMB(bx pc) >> >>> THUMB(nop.w) >> >>> THUMB(.arm) >> >>> >> >>> at the beginning and be done with it. >> >> >> >> I mean nop.n or just nop, of course, and we may need a '.align 2' at >> >> the beginning as well. >> > >> > Wouldn't it be preferable to have it assemble it in Thumb2 too? It seems >> > that bic sp,#0xf is the only issue... >> > >> >> Well, in general, yes. In the case of NEON code, not really, since the >> resulting code will not be smaller anyway, because the Thumb2 NEON >> opcodes are all 4 bytes. Also, Thumb2-only cores don't have NEON >> units, so all cores that this code can run on will be able to run in >> ARM mode. >> >> So from a maintainability pov, having code that only assembles in one >> way is better than having code that must compile both to ARM and to >> Thumb2 opcodes. >> >> Just my 2 cents, anyway. > > I don't have too much of a preference, though Stefan's suggested 4 instructions > can be reduced to 3, which also matches what aes-neonbs-core.S does: > > sub r12, sp, #128 > bic r12, #0xf > mov sp, r12 > > Ard, is the following what you're suggesting instead? > Yes, but after looking at the actual code, I prefer the change above. The access occurs only once, not in the loop so the additional instructions should not affect performance. Apologies for the noise. > diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S > index 3c1e203e53b9..c989ce3dc057 100644 > --- a/arch/arm/crypto/speck-neon-core.S > +++ b/arch/arm/crypto/speck-neon-core.S > @@ -8,6 +8,7 @@ > */ > > #include > +#include > > .text > .fpu neon > @@ -233,6 +234,12 @@ > * nonzero multiple of 128. > */ > .macro _speck_xts_crypt n, decrypting > + > + .align 2 > + THUMB(bx pc) > + THUMB(nop) > + THUMB(.arm) > + > push {r4-r7} > mov r7, sp > > @@ -413,6 +420,8 @@ > mov sp, r7 > pop {r4-r7} > bx lr > + > + THUMB(.thumb) > .endm > > ENTRY(speck128_xts_encrypt_neon)