[+Cc linux-crypto]
Hi Jason,
Apologies for starting a new thread, but this patch apparently wasn't
Cc'ed to linux-crypto, despite adding over 24000 lines of crypto code.
So much for WireGuard being only 4000 lines :-)
(For anyone else like me who didn't receive the patch, it can be found at
https://patchwork.ozlabs.org/patch/951763/)
I have some preliminary comments below.
On Tue, 31 Jul 2018 21:11:01 +0200, Jason A. Donenfeld wrote:
> [PATCH v1 2/3] zinc: Introduce minimal cryptography library
>
> Zinc stands for "Zinc Is Not crypto/". It's also short, easy to type,
> and plays nicely with the recent trend of naming crypto libraries after
> elements. The guiding principle is "don't overdo it". It's less of a
> library and more of a directory tree for organizing well-curated direct
> implementations of cryptography primitives.
>
> Zinc is a new cryptography API that is much more minimal and lower-level
> than the current one. It intends to complement it and provide a basis
> upon which the current crypto API might build, and perhaps someday Zinc
> may altogether supplant the current crypto API. It is motivated by
> three primary observations in crypto API design:
>
> * Highly composable "cipher modes" and related abstractions from
> 90s cryptographers did not turn out to be as terrific an idea as
> hoped, leading to a host of API misuse problems.
>
> * Most programmers are afraid of crypto code, and so prefer to
> integrate it into libraries in a highly abstracted manner, so as to
> shield themselves from implementation details. Cryptographers, on
> the other hand, prefer simple direct implementations, which they're
> able to verify for high assurance and optimize in accordance with
> their expertise.
>
> * Overly abstracted and flexible cryptography APIs lead to a host of
> dangerous problems and performance issues. The kernel is usually not
> in the business of coming up with new uses of crypto, but rather of
> implementing various constructions, which means it essentially
> needs a library of primitives, not a highly abstracted enterprise-ready
> pluggable system, with a few particular exceptions.
>
> This last observation has played out over and over again within the
> kernel:
>
> * The perennial move of actual primitives away from crypto/ and into
> lib/, so that users can actually call these functions directly with
> no overhead and without lots of allocations, function pointers,
> string specifier parsing, and general clunkiness. For example:
> sha256, chacha20, siphash, sha1, and so forth live in lib/ rather
> than in crypto/. Zinc intends to stop the cluttering of lib/ and
> introduce these direct primitives into their proper place, lib/zinc/.
>
> * An abundance of misuse bugs with the present crypto API that have
> been very unpleasant to clean up.
>
> * A hesitance to even use cryptography, because of the overhead and
> headaches involved in accessing the routines.
>
> Zinc goes in a rather different direction. Rather than providing a
> thoroughly designed and abstracted API, Zinc gives you simple functions,
> which implement some primitive, or some particular and specific
> construction of primitives. It is not dynamic in the least, though one
> could imagine implementing a complex dynamic dispatch mechanism (such as
> the current crypto API) on top of these basic functions. After all,
> dynamic dispatch is usually needed for applications with cipher agility,
> such as IPsec, dm-crypt, AF_ALG, and so forth, and the existing crypto
> API will continue to play that role. However, Zinc will provide a non-
> haphazard way of directly utilizing crypto routines in applications
> that have neither need nor desire for abstraction and dynamic
> dispatch.
>
> It also organizes the implementations in a simple, straightforward,
> and direct manner, making it enjoyable and intuitive to work on.
> Rather than moving optimized assembly implementations into arch/, it
> keeps them all together in lib/zinc/, making it simple and obvious to
> compare and contrast what's happening. This is, notably, exactly what
> the lib/raid6/ tree does, and that seems to work out rather well. It's
> also the pattern of most successful crypto libraries. Likewise, the
> cascade of architecture-specific functions is done as ifdefs within one
> file, so that it's easy and obvious and clear what's happening, how each
> architecture differs, and how to optimize for shared patterns. This is
> very much preferable to architecture-specific file splitting.
>
> All implementations have been extensively tested and fuzzed, and are
> selected for their quality, trustworthiness, and performance. Wherever
> possible and performant, formally verified implementations are used,
> such as those from HACL* [1] and Fiat-Crypto [2]. The routines also take
> special care to zero out secrets using memzero_explicit (and future work
> is planned to have gcc do this more reliably and performantly with
> compiler plugins). The performance of the selected implementations is
> state-of-the-art and unrivaled. Each implementation also comes with
> extensive self-tests and crafted test vectors, pulled from various
> places such as Wycheproof [9].
>
> Regularity of function signatures is important, so that users can easily
> "guess" the name of the function they want. Though, individual
> primitives are oftentimes not trivially interchangeable, having been
> designed for different things and requiring different parameters and
> semantics, and so the function signatures they provide will directly
> reflect the realities of the primitives' usages, rather than hiding them
> behind (inevitably leaky) abstractions. Also, in contrast to the current
> crypto API, Zinc functions can work on stack buffers, and can be called
> with different keys, without requiring allocations or locking.
>
> SIMD is used automatically when available, though some routines may
> benefit from either having their SIMD disabled for particular
> invocations, or to have the SIMD initialization calls amortized over
> several invocations of the function, and so Zinc provides helpers and
> function signatures enabling that.
>
> More generally, Zinc provides function signatures that allow just what
> is required by the various callers. This isn't to say that users of the
> functions will be permitted to pollute the function semantics with weird
> particular needs, but we are trying very hard not to overdo it, and that
> means looking carefully at what's actually necessary, and doing just that,
> and not much more than that. Remember: practicality and cleanliness rather
> than over-zealous infrastructure.
>
> Zinc also provides an opening for the best implementers in academia to
> contribute their time and effort to the kernel, by being sufficiently
> simple and inviting. In discussing this commit with some of the best and
> brightest over the last few years, there are many who are eager to
> devote rare talent and energy to this effort.
>
> This initial commit adds implementations of the primitives used by WireGuard:
>
> * Curve25519 [3]: formally verified 64-bit C, formally verified 32-bit C,
> x86_64 BMI2, x86_64 ADX, ARM NEON.
>
> * ChaCha20 [4]: generic C, x86_64 SSSE3, x86_64 AVX-2, x86_64 AVX-512F,
> x86_64 AVX-512VL, ARM NEON, ARM64 NEON, MIPS.
>
> * HChaCha20 [5]: generic C, x86_64 SSSE3.
>
> * Poly1305 [6]: generic C, x86_64, x86_64 AVX, x86_64 AVX-2, x86_64
> AVX-512F, ARM NEON, ARM64 NEON, MIPS, MIPS64.
>
> * BLAKE2s [7]: generic C, x86_64 AVX, x86_64 AVX-512VL.
>
> * ChaCha20Poly1305 [8]: generic C construction for both full buffers and
> scatter gather.
>
> * XChaCha20Poly1305 [5]: generic C construction.
>
> Following the merging of this, I expect the primitives that currently
> exist in lib/ to first work their way into lib/zinc/, after intense
> scrutiny of each implementation, potentially replacing them with either
> formally-verified implementations, or better studied and faster
> state-of-the-art implementations. In a phase after that, I envision that
> certain instances from crypto/ will want to rebase themselves to simply
> be abstracted crypto API wrappers using the lower level Zinc functions.
> This is already what various crypto/ implementations do with the
> existing code in lib/.
>
> Currently Zinc exists as a single un-menued option, CONFIG_ZINC, but as
> this grows we will inevitably want to make that more granular. This will
> happen at the appropriate time, rather than doing so prematurely. There
> also is a CONFIG_ZINC_DEBUG menued option, performing several intense
> tests at startup and enabling various BUG_ONs.
>
> [1] https://github.com/project-everest/hacl-star
> [2] https://github.com/mit-plv/fiat-crypto
> [3] https://cr.yp.to/ecdh.html
> [4] https://cr.yp.to/chacha.html
> [5] https://cr.yp.to/snuffle/xsalsa-20081128.pdf
> [6] https://cr.yp.to/mac.html
> [7] https://blake2.net/
> [8] https://tools.ietf.org/html/rfc8439
> [9] https://github.com/google/wycheproof
>
> Signed-off-by: Jason A. Donenfeld <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Samuel Neves <[email protected]>
> Cc: D. J. Bernstein <[email protected]>
> Cc: Tanja Lange <[email protected]>
> Cc: Jean-Philippe Aumasson <[email protected]>
> Cc: Karthikeyan Bhargavan <[email protected]>
> ---
> MAINTAINERS | 7 +
> include/zinc/blake2s.h | 94 +
> include/zinc/chacha20.h | 46 +
> include/zinc/chacha20poly1305.h | 51 +
> include/zinc/curve25519.h | 25 +
> include/zinc/poly1305.h | 34 +
> include/zinc/simd.h | 60 +
> lib/Kconfig | 21 +
> lib/Makefile | 2 +
> lib/zinc/Makefile | 28 +
> lib/zinc/blake2s/blake2s-x86_64.S | 685 ++++++
> lib/zinc/blake2s/blake2s.c | 292 +++
> lib/zinc/chacha20/chacha20-arm.S | 1471 ++++++++++++
> lib/zinc/chacha20/chacha20-arm64.S | 1940 ++++++++++++++++
> lib/zinc/chacha20/chacha20-mips.S | 474 ++++
> lib/zinc/chacha20/chacha20-x86_64.S | 2630 +++++++++++++++++++++
> lib/zinc/chacha20/chacha20.c | 242 ++
> lib/zinc/chacha20poly1305.c | 286 +++
> lib/zinc/curve25519/curve25519-arm.S | 2110 +++++++++++++++++
> lib/zinc/curve25519/curve25519-arm.h | 14 +
> lib/zinc/curve25519/curve25519-fiat32.h | 838 +++++++
> lib/zinc/curve25519/curve25519-hacl64.h | 751 ++++++
> lib/zinc/curve25519/curve25519-x86_64.h | 2060 +++++++++++++++++
> lib/zinc/curve25519/curve25519.c | 86 +
> lib/zinc/main.c | 36 +
> lib/zinc/poly1305/poly1305-arm.S | 1115 +++++++++
> lib/zinc/poly1305/poly1305-arm64.S | 820 +++++++
> lib/zinc/poly1305/poly1305-mips.S | 417 ++++
> lib/zinc/poly1305/poly1305-mips64.S | 357 +++
> lib/zinc/poly1305/poly1305-x86_64.S | 2790 +++++++++++++++++++++++
> lib/zinc/poly1305/poly1305.c | 377 +++
> lib/zinc/selftest/blake2s.h | 559 +++++
> lib/zinc/selftest/chacha20poly1305.h | 1559 +++++++++++++
> lib/zinc/selftest/curve25519.h | 607 +++++
> lib/zinc/selftest/poly1305.h | 1568 +++++++++++++
> 35 files changed, 24452 insertions(+)
> create mode 100644 include/zinc/blake2s.h
> create mode 100644 include/zinc/chacha20.h
> create mode 100644 include/zinc/chacha20poly1305.h
> create mode 100644 include/zinc/curve25519.h
> create mode 100644 include/zinc/poly1305.h
> create mode 100644 include/zinc/simd.h
> create mode 100644 lib/zinc/Makefile
> create mode 100644 lib/zinc/blake2s/blake2s-x86_64.S
> create mode 100644 lib/zinc/blake2s/blake2s.c
> create mode 100644 lib/zinc/chacha20/chacha20-arm.S
> create mode 100644 lib/zinc/chacha20/chacha20-arm64.S
> create mode 100644 lib/zinc/chacha20/chacha20-mips.S
> create mode 100644 lib/zinc/chacha20/chacha20-x86_64.S
> create mode 100644 lib/zinc/chacha20/chacha20.c
> create mode 100644 lib/zinc/chacha20poly1305.c
> create mode 100644 lib/zinc/curve25519/curve25519-arm.S
> create mode 100644 lib/zinc/curve25519/curve25519-arm.h
> create mode 100644 lib/zinc/curve25519/curve25519-fiat32.h
> create mode 100644 lib/zinc/curve25519/curve25519-hacl64.h
> create mode 100644 lib/zinc/curve25519/curve25519-x86_64.h
> create mode 100644 lib/zinc/curve25519/curve25519.c
> create mode 100644 lib/zinc/main.c
> create mode 100644 lib/zinc/poly1305/poly1305-arm.S
> create mode 100644 lib/zinc/poly1305/poly1305-arm64.S
> create mode 100644 lib/zinc/poly1305/poly1305-mips.S
> create mode 100644 lib/zinc/poly1305/poly1305-mips64.S
> create mode 100644 lib/zinc/poly1305/poly1305-x86_64.S
> create mode 100644 lib/zinc/poly1305/poly1305.c
> create mode 100644 lib/zinc/selftest/blake2s.h
> create mode 100644 lib/zinc/selftest/chacha20poly1305.h
> create mode 100644 lib/zinc/selftest/curve25519.h
> create mode 100644 lib/zinc/selftest/poly1305.h
[...]
In general this is great work, and I'm very excited for WireGuard to be
upstreamed! But for the new crypto code, I think a few things are on
the wrong track, for example treating it as a special library. Even the
name is contradicting itself: Zinc is "not crypto/", yet as you stated
it's intended that the "Zinc" algorithms will be exposed through the
crypto API -- just like how most of the existing crypto code in lib/ is
also exposed through the crypto API. So, I think that what you're doing
isn't actually *that* different from what already exists in some cases;
and pretending that it is very different is just going to cause
problems. Rather, the actual truly new thing seems to be that the
dispatch to architecture specific implementations is done at the lib/
level instead of handled by the crypto API priority numbers.
So, I don't see why you don't just add lib/blake2s/, lib/chacha20/,
lib/poly1305/, etc., without pretending that they all have some special
new "Zinc" thing in common and are part of some holy crusade against the
crypto API.
They could even still go in subdirectory lib/crypto/ -- but just for
logical code organization purposes, as opposed to a special library with
a name that isn't self-explanatory and sounds like some third-party
library rather than first-class kernel code.
CONFIG_ZINC also needs to go. Algorithms will need to be independently
configurable as soon as anything other than WireGuard needs to use any
of them, so you might as well do it right from the start with
CONFIG_BLAKE2, CONFIG_CHACHA20, CONFIG_POLY1305, etc.
I think the above changes would also naturally lead to a much saner
patch series where each algorithm is added by its own patch, rather than
one monster patch that adds many algorithms and 24000 lines of code.
Note that adding all the algorithms in one patch also makes the
description of them conflated, e.g. you wrote that "formally verified"
implementations were used whenever possible, but AFAICS that actually
only applies to the C implementations of Curve25519, and even those have
your copyright statement so presumably you had to change something from
the "formally verified" code :-). Note also that Poly1305
implementations are somewhat error-prone, since there can be overflow
bugs that are extremely rarely hit; see e.g. how OpenSSL's Poly1305 NEON
implementation was initially buggy and had to be fixed:
https://mta.openssl.org/pipermail/openssl-commits/2016-April/006639.html.
Not to mention that C glue code is error-prone, especially with the tons
of #ifdefs. So, I'd strongly prefer that you don't oversell the crypto
code you're adding, e.g. by implying that most of it is formally
verified, as it likely still has bugs, like any other code...
I also want to compare the performance of some of the assembly
implementations you're adding to the existing implementations in the
kernel they duplicate. I'm especially interested in the NEON
implementation of ChaCha20. But adding 11 implementations in one single
patch means there isn't really a chance to comment on them individually.
Also, earlier when I tested OpenSSL's ChaCha NEON implementation on ARM
Cortex-A7 it was actually quite a bit slower than the one in the Linux
kernel written by Ard Biesheuvel... I trust that when claiming the
performance of all implementations you're adding is "state-of-the-art
and unrivaled", you actually compared them to the ones already in the
Linux kernel which you're advocating replacing, right? :-)
Your patch description is also missing any mention of crypto accelerator
hardware. Quite a bit of the complexity in the crypto API, such as
scatterlist support and asynchronous execution, exists because it
supports crypto accelerators. AFAICS your new APIs cannot support
crypto accelerators, as your APIs are synchronous and operate on virtual
addresses. I assume your justification is that "djb algorithms" like
ChaCha and Poly1305 don't need crypto accelerators as they are fast in
software. But you never explicitly stated this and discussed the
tradeoffs. Since this is basically the foundation for the design you've
chosen, it really needs to be addressed.
As for doing the architecture-specific dispatch in lib/ rather than
through the crypto API, there definitely are some arguments in favor of
it. The main problem, though, is that your code is a mess due to all
the #ifdefs, and it will only get worse as people add more
architectures. You may think you already added all the architectures
that matter, but tomorrow people will come and want to add PowerPC,
RISC-V, etc. I really think you should consider splitting up
implementations by architecture; this would *not*, however, preclude the
implementations from still being accessed through a single top-level
"glue" function. For example chacha20() could look like:
void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
              bool have_simd)
{
        if (chacha20_arch(dst, src, len, state->key, state->counter, have_simd))
                goto out;

        chacha20_generic(dst, src, len, state->key, state->counter);

out:
        state->counter[0] += (len + 63) / 64;
}
So, each architecture would optionally define its own chacha20_arch()
that returns true if the data was processed, or false if not. (The data
wouldn't be processed if, for example, 'have_simd' was false but only
SIMD implementations are available; or if the input was too short for an
SIMD implementation to be faster than the generic one.) Note that this
would make the code much more consistent with the usual Linux kernel
coding style, which strongly prefers calling functions unconditionally
rather than having core logic littered with unmaintainable #ifdefs.
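To make the shape of that concrete, here's a compilable sketch of the pattern with all names hypothetical (in the kernel the stub would come from a per-arch header rather than an #ifndef, and the generic function would be the real one):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t u8;
typedef uint32_t u32;

struct chacha20_ctx {
	u32 key[8];
	u32 counter[4];
};

/*
 * Architectures with an optimized implementation provide their own
 * chacha20_arch(); everyone else gets this stub, which declines the
 * request.  The caller stays completely #ifdef-free.
 */
#ifndef HAVE_ARCH_CHACHA20
static bool chacha20_arch(u8 *dst, const u8 *src, u32 len,
			  const u32 key[8], const u32 counter[4],
			  bool have_simd)
{
	(void)dst; (void)src; (void)len;
	(void)key; (void)counter; (void)have_simd;
	return false;	/* not handled; fall back to generic code */
}
#endif

/* Stand-in for the real generic implementation. */
static void chacha20_generic(u8 *dst, const u8 *src, u32 len,
			     const u32 key[8], const u32 counter[4])
{
	(void)key; (void)counter;
	memmove(dst, src, len);
}

void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
	      bool have_simd)
{
	if (chacha20_arch(dst, src, len, state->key, state->counter,
			  have_simd))
		goto out;

	chacha20_generic(dst, src, len, state->key, state->counter);
out:
	/* advance the block counter by the number of blocks consumed */
	state->counter[0] += (len + 63) / 64;
}
```

Adding a new architecture then means adding one file that defines chacha20_arch(), with zero churn in the glue function or in any caller.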
I'd also strongly prefer the patchset to include converting the crypto
API versions of ChaCha20 and Poly1305 over to use your new lib/ code, to
show that it's really possible. You mentioned that it's planned, but if
it's not done right away there will be things that were missed and will
require changes when someone finally does it. IMO it's not acceptable
to add your own completely separate ChaCha20 and Poly1305 just because
you don't like that the existing ones are part of the crypto API. You
need to refactor things properly. I think you'd need to expose the new
code under a cra_driver_name like "chacha20-software" and
"poly1305-software" to reflect that they use the fastest available
implementation of the algorithm on the CPU, e.g. "chacha20-generic",
"chacha20-neon", and "chacha20-simd" would all be replaced by a single
"chacha20-software". Is that what you had in mind?
I'm also wondering about the origin and licensing of some of the
assembly language files. Many have an OpenSSL copyright statement.
But, the OpenSSL license is often thought to be incompatible with GPL,
so using OpenSSL assembly code in the kernel has in the past required
getting special permission from Andy Polyakov (the person who's written
most of OpenSSL's assembly code and so holds the copyright on it). As one
example, see arch/arm/crypto/sha256-armv4.pl: the file explicitly states
that Andy has relicensed it under GPLv2. For your new OpenSSL-derived
files, have you gone through and explicitly gotten GPLv2 permission from
Andy / the copyright holders?
Each assembly language file should also explicitly state where it came
from. For example lib/zinc/curve25519/curve25519-arm.S has your
copyright statement and says it's "Based on algorithms from Daniel J.
Bernstein and Peter Schwabe.", but it's not clarified whether the *code*
was written by those other people, as opposed to those people designing
the *algorithms* and then you writing the code; and if you didn't write
it, where you retrieved the file from and when, what license it had
(even if it was "public domain" like some of djb's code, this should be
mentioned for informational purposes), and what changes you made if any.
Oh, and please use 80-character lines like the rest of the kernel, so
that people's eyes don't bleed when reading your code :-)
Thanks for all your hard work on WireGuard!
- Eric
[I reordered the snippets below for readability.]
> On Aug 1, 2018, at 12:22 AM, Eric Biggers <[email protected]> wrote:
> In general this is great work, and I'm very excited for WireGuard to be
> upstreamed! But for the new crypto code, I think a few things are on
> the wrong track, for example treating it as a special library. Even the
> name is contradicting itself: Zinc is "not crypto/", yet as you stated
> it's intended that the "Zinc" algorithms will be exposed through the
> crypto API -- just like how most of the existing crypto code in lib/ is
> also exposed through the crypto API. So, I think that what you're doing
> isn't actually *that* different from what already exists in some cases;
> and pretending that it is very different is just going to cause
> problems. Rather, the actual truly new thing seems to be that the
> dispatch to architecture specific implementations is done at the lib/
> level instead of handled by the crypto API priority numbers.
...
>
>
> I think the above changes would also naturally lead to a much saner
> patch series where each algorithm is added by its own patch, rather than
> one monster patch that adds many algorithms and 24000 lines of code.
>
Yes, please.
>
> As for doing the architecture-specific dispatch in lib/ rather than
> through the crypto API, there definitely are some arguments in favor of
> it. The main problem, though, is that your code is a mess due to all
> the #ifdefs, and it will only get worse as people add more
> architectures. You may think you already added all the architectures
> that matter, but tomorrow people will come and want to add PowerPC,
> RISC-V, etc. I really think you should consider splitting up
> implementations by architecture; this would *not*, however, preclude the
> implementations from still being accessed through a single top-level
> "glue" function. For example chacha20() could look like:
>
> void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
>               bool have_simd)
> {
>         if (chacha20_arch(dst, src, len, state->key, state->counter, have_simd))
>                 goto out;
>
>         chacha20_generic(dst, src, len, state->key, state->counter);
>
> out:
>         state->counter[0] += (len + 63) / 64;
> }
I like this a *lot*. (But why are you passing have_simd? Shouldn't
that check live in chacha20_arch? If there's some init code needed,
then chacha20_arch() should just return false before the init code
runs. Once the arch does whatever feature detection it needs, it can
make chacha20_arch() start returning true.)
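Something like the following (all names made up) is what I mean: the arch code flips a flag once its init has run, and chacha20_arch() declines until then, so callers never pass have_simd and never see an #ifdef. The bodies are stubs just to make the sketch self-contained:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t u8;
typedef uint32_t u32;

/*
 * Flipped exactly once by the arch's init code after CPU feature
 * detection; in the kernel this might be a static_key instead of a
 * plain bool.
 */
static bool chacha20_use_neon;

/* Stand-ins for the kernel's feature test and SIMD-context check. */
static bool cpu_has_neon(void)
{
	return true;	/* kernel: e.g. elf_hwcap & HWCAP_NEON */
}

static bool may_use_simd(void)
{
	return true;	/* kernel: asm/simd.h; false in some contexts */
}

static void chacha20_neon(u8 *dst, const u8 *src, u32 len)
{
	while (len--)
		*dst++ = *src++;	/* placeholder, not real ChaCha */
}

/* No have_simd parameter: the check lives here, out of callers' sight. */
static bool chacha20_arch(u8 *dst, const u8 *src, u32 len)
{
	if (!chacha20_use_neon || !may_use_simd())
		return false;
	chacha20_neon(dst, src, len);
	return true;
}

/* Would be a module_init() or arch_initcall() in the kernel. */
static void chacha20_mod_init(void)
{
	if (cpu_has_neon())
		chacha20_use_neon = true;
}
```

Before chacha20_mod_init() runs, chacha20_arch() returns false and the glue falls through to the generic code; afterward the optimized path takes over automatically.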
As I see it, there are two truly new things in the zinc patchset:
the direct (in the direct call sense) arch dispatch, and the fact that
the functions can be called directly, without allocating contexts,
using function pointers, etc.
In fact, I had a previous patch set that added such an interface for SHA256.
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=8c59a4dd8b7ba4f2e5a6461132bbd16c83ff7c1f
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=7e5fbc02972b03727b71bc71f84175c36cbf01f5
>
> Your patch description is also missing any mention of crypto accelerator
> hardware. Quite a bit of the complexity in the crypto API, such as
> scatterlist support and asynchronous execution, exists because it
> supports crypto accelerators. AFAICS your new APIs cannot support
> crypto accelerators, as your APIs are synchronous and operate on virtual
> addresses. I assume your justification is that "djb algorithms" like
> ChaCha and Poly1305 don't need crypto accelerators as they are fast in
> software. But you never explicitly stated this and discussed the
> tradeoffs. Since this is basically the foundation for the design you've
> chosen, it really needs to be addressed.
I see this as an advantage, not a disadvantage. A very large majority
of in-kernel crypto users (by number of call sites under a *very*
brief survey, not by number of CPU cycles) just want to do some
synchronous crypto on a buffer that is addressed by a regular pointer.
Most of these users would be slowed down if they used any form of
async crypto, since the CPU can complete the whole operation faster
than it could plausibly initiate and complete anything asynchronous.
And, right now, they suffer the full overhead of allocating a context
(often with alloca!), looking up (or caching) some crypto API data
structures, dispatching the operation, and cleaning up.
So I think the right way to do it is to have directly callable
functions like zinc uses and to have the fancy crypto API layer on top
of them. So if you actually want async accelerated crypto with
scatterlists or whatever, you can call into the fancy API, and the
fancy API can dispatch to hardware or it can dispatch to the normal
static API.
In fact, this is exactly what my patch above did. But Jason's work is
more complete than mine, and mine wasn't really done because it had
some configury issues.
All that being said, for algorithms where the crypto API already has a
reasonable implementation, I think the patch series should first
restructure the code (so the actual software implementation is moved
away from the crypto API wrappers) and then, if needed, replace the
implementation with something better. The latter should be its own
patch with its own justification.
Hi Eric,
Thanks for your feedback. Responses below.
On Wed, Aug 1, 2018 at 9:22 AM Eric Biggers <[email protected]> wrote:
> it's intended that the "Zinc" algorithms will be exposed through the
> crypto API -- just like how most of the existing crypto code in lib/ is
> also exposed through the crypto API. So, I think that what you're doing
> isn't actually *that* different from what already exists in some cases;
> and pretending that it is very different is just going to cause
> problems.
I think as an eventual thing, it'd be great to see the whole of the
software part of the Crypto API rebased on top of Zinc's software
implementations. But the Crypto API is huge, and it's not really
realistic to do this all at once. I'd also like to have a bit more
scrutiny of each individual Zinc submission, rather than just dumping
it all in there wholesale. Additionally, it seems reasonable to do
this in accordance with actual need for using the algorithms from
Zinc, i.e., not rushing toward DES.
> Rather, the actual truly new thing seems to be that the [...]
> [...] are part of some holy crusade against the crypto API.
> So much for WireGuard being only
I sense a bit of snark in your comments, and I am sorry if I ruffled
some feathers. I can understand a number of things might have rubbed
you the wrong way. But I am pretty committed to getting this done
right, taking into account your and Andy's opinions, as well as
working hard to keep Zinc close to its design goals. I think you'll
find we're closer to being on the same page than not, and that I only
want to make things better rather than be on some kind of "crusade"
against anything.
> They could even still go in subdirectory lib/crypto/ -- but just for
> logical code organization purposes,
Indeed a subdirectory is important for organization. We can bikeshed
about names, I suppose. I'm partial to zinc, and I think that it might
actually help foster contributors and interested academics. Anyway,
let's dive into the code issues first, though, as I think that'll be a
productive way to get going.
> CONFIG_ZINC also needs to go. Algorithms will need to be independently
> configurable as soon as anything other than WireGuard needs to use any
> of them, so you might as well do it right from the start with
> CONFIG_BLAKE2, CONFIG_CHACHA20, CONFIG_POLY1305, etc.
Okay, I can add this all as individual knobs. I'll do that for v2.
I'm tempted to still have this built as one module, though. For
example, should WireGuard depend on 5 mini-modules (blake2s, chacha20,
poly1305, chacha20poly1305, curve25519)? Rather, it seems a bit
sleeker to just have a single zinc.ko, and allow what's built into it
be configurable with the various algorithm configuration options. What
would you think of this? Or do you think you'd still prefer
one-algo-per-ko approach like with the current crypto API? I can see
good arguments both ways.
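To sketch what I'm imagining (option names entirely up for bikeshedding): zinc.ko stays a single tristate, and the per-algorithm options are plain bools controlling what gets compiled into it, e.g. in lib/Kconfig:

```kconfig
config ZINC
	tristate

config ZINC_CHACHA20
	bool
	depends on ZINC

config ZINC_POLY1305
	bool
	depends on ZINC

config ZINC_CHACHA20POLY1305
	bool
	depends on ZINC
	select ZINC_CHACHA20
	select ZINC_POLY1305

config ZINC_BLAKE2S
	bool
	depends on ZINC

config ZINC_CURVE25519
	bool
	depends on ZINC
```

WireGuard would then select ZINC plus the algorithms it needs, and lib/zinc/Makefile would pick up objects with the usual conditional-object style, along the lines of zinc-$(CONFIG_ZINC_CHACHA20) += chacha20/chacha20.o.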
> I think the above changes would also naturally lead to a much saner
> patch series where each algorithm is added by its own patch, rather than
> one monster patch that adds many algorithms and 24000 lines of code.
Yes, absolutely! And sorry for that monster patch; it even got
rejected from LKML as you pointed out. I had started to split things
up, but I wanted to send the patchset foremost to netdev, to start
getting some feedback on WireGuard, and wanted to keep the
organization simple. I'll have this split up for v2.
>
> Note that adding all the algorithms in one patch also makes the
> description of them conflated, e.g. you wrote that "formally verified"
> implementations were used whenever possible, but AFAICS that actually
> only applies to the C implementations of Curve25519,
Yes, the commit message mentioned which implementations have been
verified. It's quite likely that in the near future we'll be adding
more formally verified implementations, and even formally verified
assembly implementations. While BLAKE2s and ChaCha20 are somewhat
straightforward, bigint stuff, such as Curve25519 and Poly1305,
always makes me a bit nervous, and so we'll probably incorporate a
formally verified Poly1305 first of the bunch.
Either way, _all_ of those implementations have been reviewed in some
depth, fuzzed absurd amounts, and fed some pretty nasty test vectors.
I've spent a lot of time with all of them, and most have received a
good amount of external scrutiny, due to their origin.
> So, I'd strongly prefer that you don't oversell the crypto
> code you're adding, e.g. by implying that most of it is formally
> verified, as it likely still has bugs, like any other code...
This sounds like snark that's going to devolve into endless
bikeshedding, but I do note that I mentioned formally verified
implementations in these 3 contexts, which I think were appropriate
each time:
1) "Wherever possible and performant, formally verified
implementations are used, such as those from HACL* and Fiat-Crypto."
So, trying to state that we use it when we can, but we can't always. I
think it's important to mention it here, as I've already been
contacted by formal verification people who are eager to help.
This is exactly the kind of positive outcome I was hoping for.
2) "Curve25519: formally verified 64-bit C, formally verified 32-bit C, [...]"
Pointing out that these 2 implementations, in contradistinction to the
subsequent ones listed, are from formally verified C.
3) "potentially replacing them with either formally-verified
implementations [...]"
Again here, I indicate that I'd like to replace non formally-verified
implementations with formally-verified implementations wherever
there's such a potentiality. And like the previous comment, I believe
this has been interpreted as a welcome call to action by the people
who are actually working on this stuff.
So I don't think I'm "overselling" it. Formal verification is a good
thing. We have a little bit of it in Zinc. I'd like to have a lot more
of it. And all the while we're doing the best with what we've got.
Generally speaking, I believe the choices I've made on the individual
implementations in this submission have been made with a lot of care,
and I'm happy to work with you on any concerns you have about them and
do what's needed to get them in the proper shape.
> I also want to compare the performance of some of the assembly
> implementations you're adding to the existing implementations in the
> kernel they duplicate. I'm especially interested in the NEON
> implementation of ChaCha20. But adding 11 implementations in one single
> patch means there isn't really a chance to comment on them individually.
As noted, I'll certainly split this up for v2. With ChaCha20, my
benchmarks have indicated Zinc is faster than the Crypto API's
implementations. But there's also another benefit: the ChaCha20 and
Poly1305 implementations are based on Andy Polyakov's ones, which
means they benefit from the *huge* amount of scrutiny those have
already received and continue to receive. In fact, there are people
working on doing formally verified reproductions of some of his
implementations, and that alone is hugely valuable. The
implementations generally found in Linux haven't received nearly as
much scrutiny and attention, putting them at a bit of a disadvantage,
I think. In other words, there's a strong benefit from working
together and sharing in the enterprise with other projects.
There's another plus, too, which is that they all have a very common
interface, making the glue code simpler and more uniform. This is
probably mostly just a benefit for me, as a lazy coder who would
prefer an easier job, but usually less complexity and fewer
differences results in fewer bugs, too.
> Also, earlier when I tested OpenSSL's ChaCha NEON implementation on ARM
> Cortex-A7 it was actually quite a bit slower than the one in the Linux
> kernel written by Ard Biesheuvel... I trust that when claiming the
> performance of all implementations you're adding is "state-of-the-art
> and unrivaled", you actually compared them to the ones already in the
> Linux kernel which you're advocating replacing, right? :-)
Yes, I have, and my results don't corroborate your findings. It will
be interesting to try out a wider variety of hardware for comparisons.
I suspect, also, that if the snarky emoticons subside, AndyP would be
very interested in whatever we find and could have interest in
improving implementations, should we actually find performance
differences.
> Your patch description is also missing any mention of crypto accelerator
> hardware. Quite a bit of the complexity in the crypto API, such as
> scatterlist support and asynchronous execution, exists because it
> supports crypto accelerators.
I can augment that description, if you like, for v2. But with regard
to crypto accelerators -- I think the asynchronous API causes all
sorts of problems for some of the most common use cases, and so Zinc
is a software-based ordinary synchronous API. I think this is an
advantage. Of course you can continue to support wild async IPSec
accelerator hardware and such in the crypto API, but that's
intentionally not the focus of Zinc.
> I assume your justification is that "djb algorithms" like
> ChaCha and Poly1305 don't need crypto accelerators as they are fast
> in software. But you never explicitly stated this and discussed the
> tradeoffs.
It was already a pretty huge wall of text in the monster patch, but
even so there's probably tons of discussion about different details,
both large and small, that I must have left out. Sorry about that.
Hopefully these discussions here can help fill them in.
> The main problem, though, is that your code is a mess due to all
> the #ifdefs, and it will only get worse as people add more
> architectures.
I really thought the ifdefs were cleaner than the three other ways
I had implemented it. When I showed this all to Greg KH
several months ago, he also thought the ifdefs were disgusting. It
seems you agree with him, and I'll certainly change this for v2. I
already have most of that code written, and based on your comments and
suggestions below, I'm sure you'll find it much better.
> Note that this
> would make the code much more consistent with the usual Linux kernel
> coding style, which strongly prefers calling functions unconditionally
> rather than having core logic littered with unmaintainable #ifdefs.
Noted. Will be changed.
> I'd also strongly prefer the patchset to include converting the crypto
> API versions of ChaCha20 and Poly1305 over to use your new lib/ code, to
> show that it's really possible. You mentioned that it's planned, but if
> it's not done right away there will be things that were missed and will
> require changes when someone finally does it. IMO it's not acceptable
> to add your own completely separate ChaCha20 and Poly1305 just because
> you don't like that the existing ones are part of the crypto API.
Alright. I didn't really want to do this right off the bat, but you
make a good argument that doing so is important for working out the
required details and knowing how it's done. So I'll take a stab at
it for v2. I'll let you know if I run into any snags; we might be able
to collaborate nicely here.
> You
> need to refactor things properly. I think you'd need to expose the new
> code under a cra_driver_name like "chacha20-software" and
> "poly1305-software" to reflect that they use the fastest available
> implementation of the algorithm on the CPU, e.g. "chacha20-generic",
> "chacha20-neon", and "chacha20-simd" would all be replaced by a single
> "chacha20-software". Is that what you had in mind?
Calling it "chacha20-software" sounds like a reasonable way to name it.
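If that's the direction, the registration might look roughly like the
fragment below (field names from the kernel's crypto_alg base struct; the
glue callbacks are elided and the priority value is only a guess, not
anything from the actual patchset):

```c
/* Hypothetical sketch: one driver name covering all software
 * implementations, replacing -generic/-neon/-simd variants. */
static struct skcipher_alg chacha20_software_alg = {
        .base = {
                .cra_name        = "chacha20",
                .cra_driver_name = "chacha20-software",
                .cra_priority    = 300,  /* guess */
        },
        /* .setkey/.encrypt/.decrypt would thunk into the zinc
         * functions, which internally pick the fastest available
         * implementation for the CPU. */
};
```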
>
> I'm also wondering about the origin and licensing of some of the
> assembly language files. Many have an OpenSSL copyright statement.
> But, the OpenSSL license is often thought to be incompatible with GPL,
> so using OpenSSL assembly code in the kernel has in the past required
> getting special permission from Andy Polyakov (the person who's written
> most of OpenSSL's assembly code so holds the copyright on it). As one
> example, see arch/arm/crypto/sha256-armv4.pl: the file explicitly states
> that Andy has relicensed it under GPLv2. For your new OpenSSL-derived
> files, have you gone through and explicitly gotten GPLv2 permission from
> Andy / the copyright holders?
Andy and I spoke in January about this and he was okay with the usual
CRYPTOGAMS GPLv2 origin dance for that code.
> For example lib/zinc/curve25519/curve25519-arm.S has your
> copyright statement and says it's "Based on algorithms from Daniel J.
> Bernstein and Peter Schwabe.", but it's not clarified whether the *code*
> was written by those other people, as opposed to those people designing
> the *algorithms* and then you writing the code; and if you didn't write
> it, where you retrieved the file from and when, what license it had
> (even if it was "public domain" like some of djb's code, this should be
> mentioned for informational purposes), and what changes you made if any.
It's indeed based on their public domain code. I'll change that
wording to make it very clear that it's from public domain code.
> Oh, and please use 80-character lines like the rest of the kernel, so
> that people's eyes don't bleed when reading your code :-)
Ack. Will be so in v2.
>
> Thanks for all your hard work on WireGuard!
Thanks for your excellent review. I really appreciate you taking
the time. Hopefully v2 will start looking a lot more interesting to
you!
Regards,
Jason
Hey Andy,
Thanks too for the feedback. Responses below:
On Wed, Aug 1, 2018 at 7:09 PM Andy Lutomirski <[email protected]> wrote:
> > I think the above changes would also naturally lead to a much saner
> > patch series where each algorithm is added by its own patch, rather than
> > one monster patch that adds many algorithms and 24000 lines of code.
> >
>
> Yes, please.
Ack, will be in v2.
> I like this a *lot*. (But why are you passing have_simd? Shouldn't
> that check live in chacha20_arch? If there's some init code needed,
> then chacha20_arch() should just return false before the init code
> runs. Once the arch does whatever feature detection it needs, it can
> make chacha20_arch() start returning true.)
The have_simd stuff is so that the FPU state can be amortized across
several calls to the crypto functions. Here's a snippet from
WireGuard's send.c:
void packet_encrypt_worker(struct work_struct *work)
{
        struct crypt_queue *queue = container_of(work, struct multicore_worker, work)->ptr;
        struct sk_buff *first, *skb, *next;
        bool have_simd = simd_get();

        while ((first = ptr_ring_consume_bh(&queue->ring)) != NULL) {
                enum packet_state state = PACKET_STATE_CRYPTED;

                skb_walk_null_queue_safe(first, skb, next) {
                        if (likely(skb_encrypt(skb, PACKET_CB(first)->keypair, have_simd)))
                                skb_reset(skb);
                        else {
                                state = PACKET_STATE_DEAD;
                                break;
                        }
                }
                queue_enqueue_per_peer(&PACKET_PEER(first)->tx_queue, first, state);
                have_simd = simd_relax(have_simd);
        }
        simd_put(have_simd);
}
simd_get() and simd_put() do the usual irq_fpu_usable/kernel_fpu_begin
dance and return/take a boolean accordingly. simd_relax(on) is:
static inline bool simd_relax(bool was_on)
{
#ifdef CONFIG_PREEMPT
        if (was_on && need_resched()) {
                simd_put(true);
                return simd_get();
        }
#endif
        return was_on;
}
With this, we get the FPU amortization most of the time, while still
doing the right thing in the preemption case (since kernel_fpu_begin
disables preemption). This is quite an important performance
optimization. However, I'd prefer the lazy FPU restoration proposal
discussed a few months ago; it looks like that hasn't progressed,
hence the need for FPU call amortization:
https://lore.kernel.org/lkml/CALCETrU+2mBPDfkBz1i_GT1EOJau+mzj4yOK8N0UnT2pGjiUWQ@mail.gmail.com/
>
> As I see it, there there are two truly new thing in the zinc patchset:
> the direct (in the direct call sense) arch dispatch, and the fact that
> the functions can be called directly, without allocating contexts,
> using function pointers, etc.
>
> In fact, I had a previous patch set that added such an interface for SHA256.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=8c59a4dd8b7ba4f2e5a6461132bbd16c83ff7c1f
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=7e5fbc02972b03727b71bc71f84175c36cbf01f5
Seems like SHA256 will be a natural next candidate for Zinc, given the demand.
> > Your patch description is also missing any mention of crypto accelerator
> > hardware. Quite a bit of the complexity in the crypto API, such as
> > scatterlist support and asynchronous execution, exists because it
> > supports crypto accelerators. AFAICS your new APIs cannot support
> > crypto accelerators, as your APIs are synchronous and operate on virtual
> > addresses. I assume your justification is that "djb algorithms" like
> > ChaCha and Poly1305 don't need crypto accelerators as they are fast in
> > software. But you never explicitly stated this and discussed the
> > tradeoffs. Since this is basically the foundation for the design you've
> > chosen, it really needs to be addressed.
>
> I see this as an advantage, not a disadvantage. A very large majority
> of in-kernel crypto users (by number of call sites under a *very*
> brief survey, not by number of CPU cycles) just want to do some
> synchronous crypto on a buffer that is addressed by a regular pointer.
> Most of these users would be slowed down if they used any form of
> async crypto, since the CPU can complete the whole operation faster
> than it could plausibly initiate and complete anything asynchronous.
> And, right now, they suffer the full overhead of allocating a context
> (often with alloca!), looking up (or caching) some crypto API data
> structures, dispatching the operation, and cleaning up.
>
> So I think the right way to do it is to have directly callable
> functions like zinc uses and to have the fancy crypto API layer on top
> of them. So if you actually want async accelerated crypto with
> scatterlists or whatever, you can call into the fancy API, and the
> fancy API can dispatch to hardware or it can dispatch to the normal
> static API.
Yes, exactly this.
Regards,
Jason
On Thu, Aug 2, 2018 at 7:48 PM, Jason A. Donenfeld <[email protected]> wrote:
> Hey Andy,
>
> Thanks too for the feedback. Responses below:
>
> On Wed, Aug 1, 2018 at 7:09 PM Andy Lutomirski <[email protected]> wrote:
>> > I think the above changes would also naturally lead to a much saner
>> > patch series where each algorithm is added by its own patch, rather than
>> > one monster patch that adds many algorithms and 24000 lines of code.
>> >
>>
>> Yes, please.
>
> Ack, will be in v2.
>
>
>> I like this a *lot*. (But why are you passing have_simd? Shouldn't
>> that check live in chacha20_arch? If there's some init code needed,
>> then chacha20_arch() should just return false before the init code
>> runs. Once the arch does whatever feature detection it needs, it can
>> make chacha20_arch() start returning true.)
>
> The have_simd stuff is so that the FPU state can be amortized across
> several calls to the crypto functions. Here's a snippet from
> WireGuard's send.c:
>
> void packet_encrypt_worker(struct work_struct *work)
> {
>         struct crypt_queue *queue = container_of(work, struct multicore_worker, work)->ptr;
>         struct sk_buff *first, *skb, *next;
>         bool have_simd = simd_get();
Gotcha. That was very hidden in the 24k lines. Please make this (and
any similar goodies) be their own patches.
Also, please consider spelling it differently:
simd_context_t simd_context = simd_get();
Because we'll feel very silly the first time some architecture has
more than one possible state. (It wouldn't be entirely insane for x86
to distinguish between "no SIMD", "XMM only", and "go to town!", for
example.)
On Fri, Aug 3, 2018 at 11:29 PM Andy Lutomirski <[email protected]> wrote:
> Gotcha. That was very hidden in the 24k lines. Please make this (and
> any similar goodies) be their own patches.
I know, sorry, and I certainly will have this split out. The above
code snippet was from the much much shorter WireGuard patch (3/3), but
indeed the simd_get/put/relax code is in the monster patch here, so
that'll certainly be split for v2.
>
> Also, please consider spelling it differently:
>
> simd_context_t simd_context = simd_get();
>
> Because we'll feel very silly the first time some architecture has
> more than one possible state. (It wouldn't be entirely insane for x86
> to distinguish between "no SIMD", "XMM only", and "go to town!", for
> example.)
That's a great idea. It'll also make it clearer that users shouldn't
just dump a raw "true" in there; raw bools are more tempting to abuse
like that.
Hey Andy, Eric, & all,
I've started the work of separating this out into 16 individual
commits, have addressed numerous other things brought up like the
ifdef maze, and have begun rewriting (parts of) the original commit
message. It's still a work in progress, and I still have some work to
do, but if you want to follow along, things are happening here:
https://git.zx2c4.com/linux-dev/log/?h=jd/wireguard
I'll be rebasing and reworking this continuously, but that's how it's
shaping up.
As I'm still working on it, I won't be submitting v2 today, but if you
have suggestions or concerns or want to poke me while I'm working on
v2, don't hesitate to send me private email or ping me in IRC (I'm
"zx2c4" there) to chat.
Regards,
Jason
On Tue, Aug 7, 2018 at 11:54 AM, Jason A. Donenfeld <[email protected]> wrote:
> Hey Andy, Eric, & all,
>
> I've started the work of separating this out into 16 individual
> commits, have addressed numerous other things brought up like the
> ifdef maze, and have begun rewriting (parts of) the original commit
> message. It's still a work in progress, and I still have some work to
> do, but if you want to follow along, things are happening here:
>
> https://git.zx2c4.com/linux-dev/log/?h=jd/wireguard
>
> I'll be rebasing and reworking this continuously, but that's how it's
> shaping up.
>
> As I'm still working on it, I won't be submitting v2 today, but if you
> have suggestions or concerns or want to poke me while I'm working on
> v2, don't hesitate to send me private email or ping me in IRC (I'm
> "zx2c4" there) to chat.
>
> Regards,
> Jason
For "zinc: add simd helper", I think it should be in include/linux,
and include/linux/simd.h should (immediately or maybe in the future)
include <asm/simd.h> to pick up arch-specific stuff. And the patch
should get sent to [email protected].
In your blake2s_arch() implementation, you're not passing in a
simd_context_t. Is that still a work in progress? I thought the plan
was to pass it in rather than doing the check in the _arch()
functions.
Hey Andy,
On Tue, Aug 7, 2018 at 12:43 PM Andy Lutomirski <[email protected]> wrote:
> For "zinc: add simd helper", I think it should be in include/linux,
> and include/linux/simd.h should (immediately or maybe in the future)
> include <asm/simd.h> to pick up arch-specific stuff. And the patch
> should get sent to [email protected].
I guess you saw my prompt about that in the previous commit message?
Based on your encouragement, I implemented it:
https://git.zx2c4.com/linux-dev/commit/?h=simd This is _far_ more
invasive than I wanted to be, as I don't want this patch submission to
grow unwieldy and never be merged, but I guess we can roll with this
for now...
> In your blake2s_arch() implementation, you're not passing in a
> simd_context_t. Is that still a work in progress? I thought the plan
> was to pass it in rather than doing the check in the _arch()
> functions.
I'm inclined to do the explicit context passing only when a function
is likely to be used in that kind of environment, and adjust as
needed. Long term, anyway, that API will be removed once the x86 guys
figure out lazy FPU restoration and the amortization doesn't add
anything.
Jason
> On Aug 7, 2018, at 4:48 PM, Jason A. Donenfeld <[email protected]> wrote:
>
> Hey Andy,
>
>> On Tue, Aug 7, 2018 at 12:43 PM Andy Lutomirski <[email protected]> wrote:
>> For "zinc: add simd helper", I think it should be in include/linux,
>> and include/linux/simd.h should (immediately or maybe in the future)
>> include <asm/simd.h> to pick up arch-specific stuff. And the patch
>> should get sent to [email protected].
>
> I guess you saw my prompt about that in the previous commit message?
> Based on your encouragement, I implemented it:
> https://git.zx2c4.com/linux-dev/commit/?h=simd This is _far_ more
> invasive than I wanted to be, as I don't want this patch submission to
> grow unwieldy and never be merged, but I guess we can roll with this
> for now...
>
I really wish we had a way to say that we use asm-generic’s copy of a header in all cases except where an arch opts out.
>> In your blake2s_arch() implementation, you're not passing in a
>> simd_context_t. Is that still a work in progress? I thought the plan
>> was to pass it in rather than doing the check in the _arch()
>> functions.
>
> I'm inclined to do the explicit context passing only when a function
> is likely to be used in that kind of environment, and adjust as
> needed. Long term, anyway, that API will be removed once the x86 guys
> figure out lazy FPU restoration and the amortization doesn't add
> anything.
Fair enough.
>
> Jason
On Tue, Aug 7, 2018 at 6:49 PM Andy Lutomirski <[email protected]> wrote:
> I really wish we had a way to say that we use asm-generic’s copy of a header in all cases except where an arch opts out.
It's really not that hard to do -- symlink asm-generic to a target
called "asm" inside an otherwise empty directory, and add that
otherwise empty directory to the -I paths just after arch/include.
Since it's searched second, it's only used if the first fails. Maybe
I'm missing something though, as this seems a bit too obvious. Perhaps
a project for another day.
On Tue, Aug 7, 2018 at 6:51 PM, Jason A. Donenfeld <[email protected]> wrote:
> On Tue, Aug 7, 2018 at 6:49 PM Andy Lutomirski <[email protected]> wrote:
>> I really wish we had a way to say that we use asm-generic’s copy of a header in all cases except where an arch opts out.
>
> It's really not that hard to do -- symlink asm-generic to a target
> called "asm" inside an otherwise empty directory, and add that
> otherwise empty directory to the -I paths just after arch/include.
> Since it's searched second, it's only used if the first fails. Maybe
> I'm missing something though, as this seems a bit too obvious. Perhaps
> a project for another day.
The problem here (I think) is that it's preferable for stray files in
the kernel tree to have no effect and, with this scheme, stray files
in arch/*/include/asm could affect the build. So something explicit
has an advantage. I just think there should be a way to say "this
asm-generic file should affect all arches unless they generic-n or
override-generic-y it or whatever".
On Fri, Aug 03, 2018 at 04:33:50AM +0200, Jason A. Donenfeld wrote:
>
> > Also, earlier when I tested OpenSSL's ChaCha NEON implementation on ARM
> > Cortex-A7 it was actually quite a bit slower than the one in the Linux
> > kernel written by Ard Biesheuvel... I trust that when claiming the
> > performance of all implementations you're adding is "state-of-the-art
> > and unrivaled", you actually compared them to the ones already in the
> > Linux kernel which you're advocating replacing, right? :-)
>
> Yes, I have, and my results don't corroborate your findings. It will
> be interesting to get out a wider variety of hardware for comparisons.
> I suspect, also, that if the snarky emoticons subside, AndyP would be
> very interested in whatever we find and could have interest in
> improving implementations, should we actually find performance
> differences.
>
On ARM Cortex-A7, OpenSSL's ChaCha20 implementation is 13.9 cpb (cycles per
byte), whereas Linux's is faster: 11.9 cpb. I've also recently improved the
Linux implementation to 11.3 cpb and would like to send out a patch soon...
I've also written a scalar ChaCha20 implementation (no NEON instructions!) that
runs at 12.2 cpb for one block at a time on Cortex-A7, taking advantage of
ARM's free rotates; that would be useful for the single permutation used to
compute XChaCha's subkey, and also for the ends of messages.
The reason Linux's ChaCha20 NEON implementation is faster than OpenSSL's is that
Linux's does 4 blocks at once using NEON instructions, and the words are
de-interleaved so the rows don't need to be shifted between each round.
OpenSSL's implementation, on the other hand, only does 3 blocks at once with
NEON instructions and has to shift the rows between each round. OpenSSL's
implementation also does a 4th block at the same time using regular ARM
instructions, but that doesn't help on Cortex-A7; it just makes it slower.
I understand there are tradeoffs, and different implementations can be faster on
different CPUs. Just know that from my point of view, switching to the OpenSSL
implementation actually introduces a performance regression, and we care a *lot*
about this since we need ChaCha to be absolutely as fast as possible for HPolyC
disk encryption. So if your proposal goes in, I'd likely need to write a patch
to get the old performance back, at least on Cortex-A7...
Also, I don't know whether Andy P. considered the 4xNEON implementation
technique. It could even be fastest on other ARM CPUs too, I don't know.
- Eric
Hi Eric,
On Tue, Aug 14, 2018 at 2:12 PM Eric Biggers <[email protected]> wrote:
> On ARM Cortex-A7, OpenSSL's ChaCha20 implementation is 13.9 cpb (cycles per
> byte), whereas Linux's is faster: 11.9 cpb.
>
> The reason Linux's ChaCha20 NEON implementation is faster than OpenSSL's
>
> I understand there are tradeoffs, and different implementations can be faster on
> different CPUs.
>
> So if your proposal goes in, I'd likely need to write a patch
> to get the old performance back, at least on Cortex-A7...
Yes, absolutely. Different CPUs behave differently indeed, but if you
have improvements for hardware that matters to you, we should
certainly incorporate these, and also loop Andy Polyakov in (I've
added him to the CC for the WIP v2). ChaCha is generally pretty
obvious, but for big integer algorithms -- like Poly1305 and
Curve25519 -- I think it's all the more important to involve Andy and
the rest of the world in general, so that Linux benefits from bug
research and fuzzing in places that are typically and classically
prone to nasty issues. In other words, let's definitely incorporate
your improvements after the patchset goes in, and at the same time
we'll try to bring Andy and others into the fold, where our
improvements can generally track each other's.
> Also, I don't know whether Andy P. considered the 4xNEON implementation
> technique. It could even be fastest on other ARM CPUs too, I don't know.
After v2, when he's CC'd in, let's plan to start discussing this with him.
Jason