2019-09-26 14:56:11

by Pascal Van Leeuwen

Subject: Chacha-Poly performance on ARM64

Hi,

I'm currently doing some performance benchmarking on a quad core Cortex
A72 (Macchiatobin dev board) for rfc7539esp (ChachaPoly) and the
relatively low performance kind of took me by surprise, considering how
everyone keeps shouting how efficient Chacha-Poly is in software on
modern CPUs.

Then I noticed that it was using chacha20-generic for the encrypt
direction, while a chacha20-neon implementation exists (it actually
DOES use that one for decryption). Why would that be?

It also uses poly1305-generic in both cases. Is that the best
possible on ARM64? I did a quick search in the codebase but couldn't
find any ARM64 optimized version ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
http://www.insidesecure.com


2019-09-26 15:02:58

by Ard Biesheuvel

Subject: Re: Chacha-Poly performance on ARM64

On Thu, 26 Sep 2019 at 16:55, Pascal Van Leeuwen
<[email protected]> wrote:
>
> Hi,
>
> I'm currently doing some performance benchmarking on a quad core Cortex
> A72 (Macchiatobin dev board) for rfc7539esp (ChachaPoly) and the
> relatively low performance kind of took me by surprise, considering how
> everyone keeps shouting how efficient Chacha-Poly is in software on
> modern CPUs.
>
> Then I noticed that it was using chacha20-generic for the encrypt
> direction, while a chacha20-neon implementation exists (it actually
> DOES use that one for decryption). Why would that be?
>
> It also uses poly1305-generic in both cases. Is that the best
> possible on ARM64? I did a quick search in the codebase but couldn't
> find any ARM64 optimized version ...
>

The Poly1305 implementation is part of the 18-piece WireGuard series I
just sent out yesterday (which I know you have seen :-))

The Chacha20 code should be used in preference to the generic code, so
if you end up with the wrong version, there's a bug somewhere we need
to fix.

Also, how do you know which direction uses which transform? What are
the refcounts for the transforms in /proc/crypto?

2019-09-26 20:05:08

by Pascal Van Leeuwen

Subject: RE: Chacha-Poly performance on ARM64

> -----Original Message-----
> From: Ard Biesheuvel <[email protected]>
> Sent: Thursday, September 26, 2019 4:59 PM
> To: Pascal Van Leeuwen <[email protected]>
> Cc: Linux Crypto Mailing List <[email protected]>
> Subject: Re: Chacha-Poly performance on ARM64
>
> On Thu, 26 Sep 2019 at 16:55, Pascal Van Leeuwen
> <[email protected]> wrote:
> >
> > Hi,
> >
> > I'm currently doing some performance benchmarking on a quad core Cortex
> > A72 (Macchiatobin dev board) for rfc7539esp (ChachaPoly) and the
> > relatively low performance kind of took me by surprise, considering how
> > everyone keeps shouting how efficient Chacha-Poly is in software on
> > modern CPUs.
> >
> > Then I noticed that it was using chacha20-generic for the encrypt
> > direction, while a chacha20-neon implementation exists (it actually
> > DOES use that one for decryption). Why would that be?
> >
> > It also uses poly1305-generic in both cases. Is that the best
> > possible on ARM64? I did a quick search in the codebase but couldn't
> > find any ARM64 optimized version ...
> >
>
> The Poly1305 implementation is part of the 18 piece WireGuard series I
> just sent out yesterday (which I know you have seen :-))
>
I've seen the series but I must have missed that detail. I had a hunch you
would be the one working on it though :-) I'll look it up and try it
tomorrow.

> The Chacha20 code should be used in preference to the generic code, so
> if you end up with the wrong version, there's a bug somewhere we need
> to fix.
>
Yes, I think so too. In fact, I think it may be the same bug I reported
earlier regarding the selftests, where it also unexpectedly picked the
generic implementation. IIRC the response I got back was that this was
a known issue where for the very first use of a cipher, the generic
implementation gets chosen instead of the optimal one. I guess no one
has looked into that yet ...

> Also, how do you know which direction uses which transform?
>
Well, tcrypt just logs that to the message log.

> What are the refcounts for the transforms in /proc/crypto?
>
All refcnt values in /proc/crypto are 1.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
http://www.insidesecure.com
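
For anyone wanting to repeat the /proc/crypto check discussed in this thread, here is a minimal sketch in Python. It assumes the usual /proc/crypto layout of "key : value" lines with a blank line between algorithm entries. The helper names (parse_proc_crypto, implementations) and the SAMPLE text, including its priority values, are made up for illustration and are not taken from the board used above.

```python
def parse_proc_crypto(text):
    """Split /proc/crypto-style text into one dict per registered algorithm."""
    entries = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # skip anything that is not a "key : value" line
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def implementations(entries, name):
    """Return all entries for algorithm `name`, highest priority first."""
    matches = [e for e in entries if e.get("name") == name]
    return sorted(matches, key=lambda e: int(e.get("priority", 0)), reverse=True)


# Illustrative sample only; real output has more fields and other values.
SAMPLE = """\
name         : chacha20
driver       : chacha20-neon
module       : chacha_neon
priority     : 300
refcnt       : 1
selftest     : passed

name         : chacha20
driver       : chacha20-generic
module       : chacha_generic
priority     : 100
refcnt       : 1
selftest     : passed
"""

if __name__ == "__main__":
    # On a real system, pass open("/proc/crypto").read() instead of SAMPLE.
    for e in implementations(parse_proc_crypto(SAMPLE), "chacha20"):
        print(e["driver"], "priority", e["priority"], "refcnt", e["refcnt"])
```

The first driver listed is the one a lookup by algorithm name would normally be expected to prefer, so seeing the generic driver in use while a higher-priority one is registered would point at the kind of selection problem described above.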