2020-04-22 20:18:14

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

On Wed, Apr 22, 2020 at 1:51 PM Jason A. Donenfeld <[email protected]> wrote:
>
> On Wed, Apr 22, 2020 at 1:39 AM Ard Biesheuvel <[email protected]> wrote:
> >
> > On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <[email protected]> wrote:
> > >
> > > On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <[email protected]> wrote:
> > > > Seems this should just be a 'while' loop?
> > > >
> > > > while (bytes) {
> > > > unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> > > >
> > > > kernel_neon_begin();
> > > > chacha_doneon(state, dst, src, todo, nrounds);
> > > > kernel_neon_end();
> > > >
> > > > bytes -= todo;
> > > > src += todo;
> > > > dst += todo;
> > > > }
> > >
> > > The for(;;) is how it's done elsewhere in the kernel (that this patch
> > > doesn't touch), because then we can break out of the loop before
> > > having to increment src and dst unnecessarily. Likely a pointless
> > > optimization as probably the compiler can figure out how to avoid
> > > that. But maybe it can't. If you have a strong preference, I can
> > > reactor everything to use `while (bytes)`, but if you don't care,
> > > let's keep this as-is. Opinion?
> > >
> >
> > Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here,
> > given that bytes is guaranteed to be non-zero before we enter the
> > loop. But in any case, I'd prefer avoiding for(;;) or while(1) where
> > we can.
>
> Okay, will do-while it up for v2.

I just sent v2 containing do-while, and I'm fine with that going in
that way. But just in the interest of curiosity in the pan-tone
palette, check this out:

https://godbolt.org/z/VxXien

It looks like on mine, the compiler avoids unnecessarily calling those
adds on the last iteration, but on the other hand, it results in an
otherwise unnecessary unconditional jump for the < 4096 case. Sort of
interesting. Arm64 code is more or less the same difference too.


2020-04-23 08:46:48

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

On Wed, 22 Apr 2020 at 22:17, Jason A. Donenfeld <[email protected]> wrote:
>
> On Wed, Apr 22, 2020 at 1:51 PM Jason A. Donenfeld <[email protected]> wrote:
> >
> > On Wed, Apr 22, 2020 at 1:39 AM Ard Biesheuvel <[email protected]> wrote:
> > >
> > > On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <[email protected]> wrote:
> > > >
> > > > On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <[email protected]> wrote:
> > > > > Seems this should just be a 'while' loop?
> > > > >
> > > > > while (bytes) {
> > > > > unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> > > > >
> > > > > kernel_neon_begin();
> > > > > chacha_doneon(state, dst, src, todo, nrounds);
> > > > > kernel_neon_end();
> > > > >
> > > > > bytes -= todo;
> > > > > src += todo;
> > > > > dst += todo;
> > > > > }
> > > >
> > > > The for(;;) is how it's done elsewhere in the kernel (that this patch
> > > > doesn't touch), because then we can break out of the loop before
> > > > having to increment src and dst unnecessarily. Likely a pointless
> > > > optimization as probably the compiler can figure out how to avoid
> > > > that. But maybe it can't. If you have a strong preference, I can
> > > > reactor everything to use `while (bytes)`, but if you don't care,
> > > > let's keep this as-is. Opinion?
> > > >
> > >
> > > Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here,
> > > given that bytes is guaranteed to be non-zero before we enter the
> > > loop. But in any case, I'd prefer avoiding for(;;) or while(1) where
> > > we can.
> >
> > Okay, will do-while it up for v2.
>
> I just sent v2 containing do-while, and I'm fine with that going in
> that way. But just in the interest of curiosity in the pan-tone
> palette, check this out:
>
> https://godbolt.org/z/VxXien
>
> It looks like on mine, the compiler avoids unnecessarily calling those
> adds on the last iteration, but on the other hand, it results in an
> otherwise unnecessary unconditional jump for the < 4096 case. Sort of
> interesting. Arm64 code is more or less the same difference too.

Yeah, even if shaving off 1 or 2 cycles mattered here (since we've
just decided that ugh() may take up to 20,000 cycles), hiding a couple
of ALU instructions in the slots between the subs (which sets the zero
flag) and the conditional branch that tests it probably comes for free
on in-order cores anyway. And even if it didn't, backwards branches
are usually statically predicted as taken, in which case their results
are actually needed.

On out-of-order cores under speculation, none of this matters anyway.