by Eric Biggers

[permalink] [raw]

Subject: Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library

On Fri, Aug 03, 2018 at 04:33:50AM +0200, Jason A. Donenfeld wrote:
>
> > Also, earlier when I tested OpenSSL's ChaCha NEON implementation on ARM
> > Cortex-A7 it was actually quite a bit slower than the one in the Linux
> > kernel written by Ard Biesheuvel... I trust that when claiming the
> > performance of all implementations you're adding is "state-of-the-art
> > and unrivaled", you actually compared them to the ones already in the
> > Linux kernel which you're advocating replacing, right? :-)
>
> Yes, I have, and my results don't corroborate your findings. It will
> be interesting to get out a wider variety of hardware for comparisons.
> I suspect, also, that if the snarky emoticons subside, AndyP would be
> very interested in whatever we find and could have interest in
> improving implementations, should we actually find performance
> differences.
>

On ARM Cortex-A7, OpenSSL's ChaCha20 implementation is 13.9 cpb (cycles per
byte), whereas Linux's is faster: 11.9 cpb. I've also recently improved the
Linux implementation to 11.3 cpb and would like to send out a patch soon...
I've also written a scalar ChaCha20 implementation (no NEON instructions!) that
is 12.2 cpb on one block at a time on Cortex-A7, taking advantage of the free
rotates; that would be useful for the single permutation used to compute
XChaCha's subkey, and also for the ends of messages.

The reason Linux's ChaCha20 NEON implementation is faster than OpenSSL's is that
Linux's does 4 blocks at once using NEON instructions, and the words are
de-interleaved so the rows don't need to be shifted between each round.
OpenSSL's implementation, on the other hand, only does 3 blocks at once with
NEON instructions and has to shift the rows between each round. OpenSSL's
implementation also does a 4th block at the same time using regular ARM
instructions, but that doesn't help on Cortex-A7; it just makes it slower.

I understand there are tradeoffs, and different implementations can be faster on
different CPUs. Just know that from my point of view, switching to the OpenSSL
implementation actually introduces a performance regression, and we care a *lot*
about this since we need ChaCha to be absolutely as fast as possible for HPolyC
disk encryption. So if your proposal goes in, I'd likely need to write a patch
to get the old performance back, at least on Cortex-A7...

Also, I don't know whether Andy P. considered the 4xNEON implementation
technique. It could even be fastest on other ARM CPUs too, I don't know.

- Eric

2018-08-16 06:31:11

by Jason A. Donenfeld

[permalink] [raw]

Subject: Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library

Hi Eric,

On Tue, Aug 14, 2018 at 2:12 PM Eric Biggers <[email protected]> wrote:
> On ARM Cortex-A7, OpenSSL's ChaCha20 implementation is 13.9 cpb (cycles per
> byte), whereas Linux's is faster: 11.9 cpb.
>
> The reason Linux's ChaCha20 NEON implementation is faster than OpenSSL's
>
> I understand there are tradeoffs, and different implementations can be faster on
> different CPUs.
>
> So if your proposal goes in, I'd likely need to write a patch
> to get the old performance back, at least on Cortex-A7...

Yes, absolutely. Different CPUs behave differently indeed, but if you
have improvements for hardware that matters to you, we should
certainly incorporate these, and also loop Andy Polyakov in (I've
added him to the CC for the WIP v2). ChaCha is generally pretty
obvious, but for big integer algorithms -- like Poly1305 and
Curve25519 -- I think it's all the more important to involve Andy and
the rest of the world in general, so that Linux benefits from bug
research and fuzzing in places that are typically and classically
prone to nasty issues. In other words, let's definitely incorporate
your improvements after the patchset goes in, and at the same time
we'll try to bring Andy and others into the fold, where our
improvements can generally track each others.

> Also, I don't know whether Andy P. considered the 4xNEON implementation
> technique. It could even be fastest on other ARM CPUs too, I don't know.

After v2, when he's CC'd in, let's plan to start discussing this with him.

Jason