2016-05-26 16:18:33

by Stephan Mueller

[permalink] [raw]
Subject: AES-NI: slower than aes-generic?

Hi,

for the DRBG and the LRNG work I am doing, I also test the speed of the DRBG.
The DRBG can be considered as a form of block chaining mode on top of a raw
cipher.

What I am wondering about is that when encrypting 256 16-byte blocks, I get a
speed of about 170 MB/s with the AES-NI driver. When using aes-generic or
aes-asm, I get up to 180 MB/s with all else being equal. Note that this
figure includes a copy_to_user of the generated data.

To be precise, the code does the following steps:

1. setkey

2. AES encrypt 256 blocks and copy_to_user

3. Use AES to generate a new key and start at step 1.
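Roughly, those three steps can be sketched as the following loop (an illustration only; `prf` is a SHA-256 stand-in for AES, since the point is the rekeying pattern rather than the cipher, and Python's stdlib has no AES):

```python
import hashlib

BLOCK = 16  # AES block size in bytes

def prf(key: bytes, block: bytes) -> bytes:
    # Stand-in for one AES-ECB encryption of `block` under `key`.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def drbg_request(key: bytes, nblocks: int):
    """One request: generate `nblocks` output blocks (step 2), then
    derive fresh key material for the next setkey (step 3 -> step 1)."""
    out = bytearray()
    for ctr in range(nblocks):
        out += prf(key, ctr.to_bytes(BLOCK, "big"))
    new_key = prf(key, nblocks.to_bytes(BLOCK, "big"))  # rekey material
    return bytes(out), new_key

key = bytes(BLOCK)                    # step 1: initial setkey
data, key = drbg_request(key, 256)    # 256 blocks = 4096 bytes per request
```

The rekey after every request is what distinguishes this workload from a plain bulk-encryption benchmark.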

I am wondering why the AES-NI driver is slower than the C/ASM implementations
given that all else is equal.

Note, if I use fewer blocks in step 2 above, AES-NI becomes faster. E.g.
with 8 blocks or just one block in step 2 above, AES-NI is faster by 10
or 20%.

Ciao
Stephan
--
atsec information security GmbH, Steinstraße 70, 81667 München, Germany
P: +49 89 442 49 830 - F: +49 89 442 49 831
M: +49 172 216 55 78 - HRB: 129439 (Amtsgericht München)
US: +1 949 545 4096
GF: Salvatore la Pietra, Staffan Persson
atsec it security news blog - atsec-information-security.blogspot.com


2016-05-26 17:30:06

by Stephan Müller

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

On Thursday, 26 May 2016 at 13:25:02, Jeffrey Walton wrote:

Hi Jeffrey,

> > What I am wondering about is that when encrypting 256 16-byte blocks, I
> > get a speed of about 170 MB/s with the AES-NI driver. When using
> > aes-generic or aes-asm, I get up to 180 MB/s with all else being equal.
> > Note that this figure includes a copy_to_user of the generated data.
> >
> > ...
>
> Something sounds amiss.
>
> AES-NI should be an order of magnitude faster than a generic
> implementation. Can you verify AES-NI is actually using AES-NI, and
> aes-generic is a software implementation?

I am pretty sure I am using the right implementations as I checked the
refcount in /proc/crypto.
>
> Here are some OpenSSL numbers. EVP uses AES-NI when available.
> Omitting -evp means it's software only (no hardware acceleration, like
> AES-NI).

I understand that AES-NI should be faster. That is what I am wondering about.

However, the key difference to a standard speed test is that I set up a new
key schedule quite frequently. And I would suspect that something is going on
here...

Ciao
Stephan

2016-05-26 17:31:27

by Jeffrey Walton

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

> What I am wondering about is that when encrypting 256 16-byte blocks, I get
> a speed of about 170 MB/s with the AES-NI driver. When using aes-generic or
> aes-asm, I get up to 180 MB/s with all else being equal. Note that this
> figure includes a copy_to_user of the generated data.
>
> ...

Something sounds amiss.

AES-NI should be an order of magnitude faster than a generic
implementation. Can you verify AES-NI is actually using AES-NI, and
aes-generic is a software implementation?

Here are some OpenSSL numbers. EVP uses AES-NI when available.
Omitting -evp means it's software only (no hardware acceleration, like
AES-NI).

$ openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
...
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 626533.60k 669884.42k 680917.93k 682079.91k 684736.51k


$ openssl speed -elapsed aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
...
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128 cbc 106520.59k 114380.16k 116741.46k 117489.32k 117563.39k

Jeff

2016-05-26 18:14:30

by Stephan Müller

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

On Thursday, 26 May 2016 at 19:30:01, Stephan Mueller wrote:

Hi,

>
> However, the key difference to a standard speed test is that I set up a new
> key schedule quite frequently. And I would suspect that something is going
> on here...

With tcrypt, there is an interesting hint: on smaller blocks, the C
implementation is indeed faster:

[ 20.391510] testing speed of async ecb(aes) (ecb(aes-generic)) encryption
[ 20.391513] test 0 (128 bit key, 16 byte blocks): 1 operation in 275 cycles (16 bytes)
[ 20.391517] test 1 (128 bit key, 64 byte blocks): 1 operation in 702 cycles (64 bytes)
[ 20.391521] test 2 (128 bit key, 256 byte blocks): 1 operation in 2431 cycles (256 bytes)
[ 20.391532] test 3 (128 bit key, 1024 byte blocks): 1 operation in 9347 cycles (1024 bytes)
[ 20.391570] test 4 (128 bit key, 8192 byte blocks): 1 operation in 74375 cycles (8192 bytes)


vs. ecb-aes-aesni:

[ 143.482123] test 0 (128 bit key, 16 byte blocks): 1 operation in 1203 cycles (16 bytes)
[ 143.482138] test 1 (128 bit key, 64 byte blocks): 1 operation in 1328 cycles (64 bytes)
[ 143.482148] test 2 (128 bit key, 256 byte blocks): 1 operation in 1922 cycles (256 bytes)
[ 143.482159] test 3 (128 bit key, 1024 byte blocks): 1 operation in 3328 cycles (1024 bytes)
[ 143.482176] test 4 (128 bit key, 8192 byte blocks): 1 operation in 19483 cycles (8192 bytes)


As I use crypto_cipher_encrypt_one, I only send one block at a time to AES-NI.
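The effect of that call pattern can be illustrated in userspace (a hypothetical sketch; SHA-256 replaces AES since the stdlib has none, and the fixed per-call cost is modelled as a call counter rather than cycles):

```python
import hashlib

BLOCK = 16
calls = 0  # counts entries into the (simulated) cipher driver

def driver_encrypt(key: bytes, data: bytes) -> bytes:
    """Simulated driver call: one entry pays the fixed per-call cost
    (for kernel AES-NI: dispatch plus FPU/SIMD state handling), no
    matter how many blocks are passed in. SHA-256 stands in for AES."""
    global calls
    calls += 1
    return b"".join(
        hashlib.sha256(key + data[i:i + BLOCK]).digest()[:BLOCK]
        for i in range(0, len(data), BLOCK)
    )

key = bytes(BLOCK)
msg = bytes(256 * BLOCK)

calls = 0
batched = driver_encrypt(key, msg)          # one driver entry for 256 blocks
batched_calls = calls

calls = 0
single = b"".join(driver_encrypt(key, msg[i:i + BLOCK])   # one entry per
                  for i in range(0, len(msg), BLOCK))     # block, as with
single_calls = calls                                      # crypto_cipher_encrypt_one
```

Same ciphertext either way, but the one-block-at-a-time path pays the fixed cost 256 times, which is where a hardware-backed driver loses to a plain C table lookup.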


Ciao
Stephan

2016-05-26 18:20:20

by Sandy Harris

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

Stephan Mueller <[email protected]> wrote:

> for the DRBG and the LRNG work I am doing, I also test the speed of the DRBG.
> The DRBG can be considered as a form of block chaining mode on top of a raw
> cipher.
>
> What I am wondering about is that when encrypting 256 16-byte blocks, I get
> a speed of about 170 MB/s with the AES-NI driver. When using aes-generic or
> aes-asm, I get up to 180 MB/s with all else being equal. Note that this
> figure includes a copy_to_user of the generated data.

Why are you using AES? Granted, it is a reasonable idea, but when Ted
replaced the non-blocking pool with a DRBG, he used a different cipher
(I think ChaCha, not certain) and I think chose not to use the crypto
library implementation to avoid kernel bloat.

So he has adopted one of your better ideas. Why not follow his
lead on how to implement it?

2016-05-26 18:49:42

by Stephan Müller

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

On Thursday, 26 May 2016 at 14:20:19, Sandy Harris wrote:

Hi Sandy,
>
> Why are you using AES? Granted, it is a reasonable idea, but when Ted
> replaced the non-blocking pool with a DRBG, he used a different cipher
> (I think chacha, not certain) and I think chose not to use the crypto
> library implementation to avoid kernel bloat.
>
> So he has adopted one of your better ideas. Why not follow his
> lead on how to implement it?

Using the kernel crypto API, one can relieve the CPU of the crypto work if a
hardware or assembler implementation is available. This may be of particular
interest for smaller systems. So, for smaller systems (where kernel bloat is
bad, but where nowadays more and more hardware crypto support is added),
we must weigh the kernel bloat (of 3 or 4 additional C files for the basic
kernel crypto API logic) against relieving the CPU of work.

Then, the use of the DRBG offers users the choice between a Hash/HMAC and a
CTR implementation to suit their needs. The DRBG code is agnostic of the
underlying cipher. So, you could even use Blowfish instead of AES or Whirlpool
instead of SHA -- these changes are just one entry in drbg_cores[] away,
without any code change.

Finally, the LRNG code is completely agnostic of the underlying deterministic
RNG. You only need a replacement of two small functions to invoke the seeding
and generate operations of a DRNG. So, if one wants ChaCha20, he can have it.
If one wants X9.31, he can have it. See section 2.8.3 [1] -- note that the
DRNG does not even need to be a kernel crypto API registered implementation.

Bottom line, I want to give folks a full flexibility. That said, the LRNG code
is more of a logic to collect entropy and maintain two DRNG types which are
seeded according to a defined schedule than it is a tightly integrated RNG.

Also, I am not so sure that simply taking a cipher and sprinkling some
backtracking logic on it means you have a good DRNG. As of now, I have not
seen assessments from others of the ChaCha20 DRNG approach. I personally
would think that the ChaCha20 approach from Ted is good. Yet others may have a
more conservative approach of using a DRNG implementation that has been
reviewed by a lot of folks.

[1] http://www.chronox.de/lrng/doc/lrng.pdf

Ciao
Stephan

2016-05-26 19:15:58

by Sandy Harris

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

On Thu, May 26, 2016 at 2:49 PM, Stephan Mueller <[email protected]> wrote:

> Then, the use of the DRBG offers users the choice between a Hash/HMAC and CTR
> implementation to suit their needs. The DRBG code is agnostic of the
> underlying cipher. So, you could even use Blowfish instead of AES or whirlpool
> instead of SHA -- these changes are just one entry in drbg_cores[] away
> without any code change.

Not Blowfish in anything like the code you describe! It has only
64-bit blocks which might or might not be a problem, but it also has
an extremely expensive key schedule which would be awful if you want
to rekey often.

I'd say if you want a block cipher there you can quite safely restrict
the interface to ciphers with the same block & key sizes as AES.
Implement AES and one of the other finalists (I'd pick Serpent) to
test, and others can add the remaining finalists or national standards
like Korean ARIA or the Japanese one if they want them.

2016-05-27 02:14:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

On Thu, May 26, 2016 at 08:49:39PM +0200, Stephan Mueller wrote:
>
> Using the kernel crypto API one can relieve the CPU of the crypto work, if a
> hardware or assembler implementation is available. This may be of particular
> interest for smaller systems. So, for smaller systems (where kernel bloat is
> bad, but where nowadays more and more hardware crypto support is added),
> we must weigh the kernel bloat (of 3 or 4 additional C files for the basic
> kernel crypto API logic) against relieving the CPU of work.

There are a number of caveats with using hardware acceleration; one is
that many hardware accelerators are optimized for bulk data
encryption, and so key scheduling, or switching between key schedules,
can have a higher overhead than a pure software implementation.

There have also been situations where the hardware crypto engine is
actually slower than a highly optimized software implementation. This
has been the case for certain ARM SoCs, for example.

This is not that big of a deal, if you are developing a cryptographic
application (such as file system level encryption, for example) for a
specific hardware platform (such as a specific Nexus device). But if
you are trying to develop a generic service that has to work on a wide
variety of CPU architectures, and specific CPU/SOC implementations,
life is a lot more complicated. I've worked on both problems; let me
assure you the second is way trickier than the first.

> Then, the use of the DRBG offers users the choice between a Hash/HMAC and CTR
> implementation to suit their needs. The DRBG code is agnostic of the
> underlying cipher. So, you could even use Blowfish instead of AES or whirlpool
> instead of SHA -- these changes are just one entry in drbg_cores[] away
> without any code change.

I really question how much this matters in practice. Unless you are a
US Government Agency, where you might be laboring under a Federal
mandate to use DUAL-EC DRBG (for example), most users really don't
care about the details of the algorithm used in their random number
generator. Giving users choice (or lots of knobs) isn't necessarily
always a feature, as the many TLS downgrade attacks have demonstrated.

This is why from my perspective it's more important to implement an
interface which is always there, and which by default is secure, to
minimize the chances that random JV-team kernel developers working for
a Linux distribution or some consumer electronics manufacturer will
actually make things worse. As Debian's attempt to "improve" the
security of OpenSSL demonstrates, it doesn't always end well. :-)

If we implement something which happens to result in a 2 minute stall
in boot times, the danger is that a clueless engineer at Sony, or LGE,
or Motorola, or BMW, or Toyota, etc, will "fix" the problem without
telling anyone about what they did, and we might not notice right away
that the fix was in fact catastrophically bad.

These aren't the standard things which academics tend to worry about; they
tend to assume that attackers will be able to read arbitrary kernel memory,
and that recovering from such an exposure of the entropy pool is _the_ most
important thing to worry about (as opposed to, say, the contents of the
user's private keys in the ssh-agent process). But this will perhaps explain
why worrying about accommodating users who care about whether Blowfish or
AES should be used in their random number generator isn't near the top of my
personal priority list.

Cheers,

- Ted

2016-05-27 07:08:45

by Stephan Müller

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

On Thursday, 26 May 2016 at 22:14:29, Theodore Ts'o wrote:

Hi Theodore,

> On Thu, May 26, 2016 at 08:49:39PM +0200, Stephan Mueller wrote:
> > Using the kernel crypto API one can relieve the CPU of the crypto work, if
> > a hardware or assembler implementation is available. This may be of
> > particular interest for smaller systems. So, for smaller systems (where
> > kernel bloat is bad, but where nowadays more and more hardware
> > crypto support is added), we must weigh the kernel bloat (of 3 or 4
> > additional C files for the basic kernel crypto API logic) against
> > relieving the CPU of work.
>
> There are a number of caveats with using hardware acceleration; one is
> that many hardware accelerators are optimized for bulk data
> encryption, and so key scheduling, or switching between key schedules,
> can have a higher overhead than a pure software implementation.

Squeezing the last drop of speed out of the ciphers for the LRNG is not my
priority, given that the speed is limited by the reseeding. The LRNG should
allow the crypto work to be offloaded from the CPU. For small systems, crypto
is intense work, and those cycles could be spent elsewhere.
>
> There have also been situations where the hardware crypto engine is
> actually slower than a highly optimized software implementation. This
> has been the case for certain ARM SoCs, for example.

And I would be fine with that. Besides, if a user wants to use software
implementations with the LRNG and still offer HW support for all else, all
they need to do is compile the software implementation statically and the
hardware support as a module. As the LRNG initializes before kernel modules
can be loaded, it can only use what it finds in the static kernel.
>
> This is not that big of a deal, if you are developing a cryptographic
> application (such as file system level encryption, for example) for a
> specific hardware platform (such as a specific Nexus device). But if
> you are trying to develop a generic service that has to work on a wide
> variety of CPU architectures, and specific CPU/SOC implementations,
> life is a lot more complicated. I've worked on both problems, let me
> assure you the second is way trickier than the first.
>
> > Then, the use of the DRBG offers users the choice between a Hash/HMAC and
> > CTR implementation to suit their needs. The DRBG code is agnostic of the
> > underlying cipher. So, you could even use Blowfish instead of AES or
> > whirlpool instead of SHA -- these changes are just one entry in
> > drbg_cores[] away without any code change.
>
> I really question how much this matters in practice. Unless you are a
> US Government Agency, where you might be laboring under a Federal
> mandate to use DUAL-EC DRBG (for example), most users really don't

I am not sure such references help the discussion.

> care about the details of the algorithm used in their random number
> generator. Giving users choice (or lots of knobs) isn't necessarily
> always a feature, as the many TLS downgrade attacks have demonstrated.

The options are at compile time, not at runtime.
>
> This is why from my perspective it's more important to implement an
> interface which is always there, and which by default is secure, to
> minimize the chances that random JV-team kernel developers working for
> a Linux distribution or some consumer electronics manufacturer will
> actually make things worse. As Debian's attempt to "improve" the
> security of OpenSSL demonstrates, it doesn't always end well. :-)

Rest assured, the current implementation of /dev/random gives many people many
headaches. And I can tell you that I have seen "random JV-team kernel
developers" doing things you do not want to know about just to make the
behavior better.

And even if they do not change anything, I am still under the impression that
the current implementation has shortcomings in typical deployment scenarios
(mainly VMs and headless server systems).

Hence I want to give a framework where people can safely alter a few things to
suit their needs. But the things they can change should not affect the overall
security.
>
> If we implement something which happens to result in a 2 minute stall
> in boot times, the danger is that a clueless engineer at Sony, or LGE,
> or Motorola, or BMW, or Toyota, etc, will "fix" the problem without
> telling anyone about what they did, and we might not notice right away
> that the fix was in fact catastrophically bad.

Such "fixes" are employed these days already! And they are not employed
because of the crypto in use (the topic this thread started with), but due
to the handling and accounting of the initial entropy.

So, I think that the crypto used on the DRNG side is just the icing (hence I
said I can live with SP800-90A, your ChaCha20, even X9.31, given that the LRNG
ensures proper seeding and reseeding). The real issues are in the entropy
accounting and maintenance and the reseeding of the DRNGs.

Ciao
Stephan

2016-05-27 20:40:38

by Jeffrey Walton

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

> If we implement something which happens to result in a 2 minute stall
> in boot times, the danger is that a clueless engineer at Sony, or LGE,
> or Motorola, or BMW, or Toyota, etc, will "fix" the problem without
> telling anyone about what they did, and we might not notice right away
> that the fix was in fact catastrophically bad.

This is a non-trivial threat. +1 for recognizing it.

I know of one VM hypervisor used in US financial services that was effectively
doing "One thing you should not do is the following..." from
http://lwn.net/Articles/525459/.

Jeff

2016-05-28 00:28:37

by Aaron Zauner

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

Heya,

> On 27 May 2016, at 01:49, Stephan Mueller <[email protected]> wrote:
> Then, the use of the DRBG offers users to choose between a Hash/HMAC and CTR
> implementation to suit their needs. The DRBG code is agnostic of the
> underlying cipher. So, you could even use Blowfish instead of AES or whirlpool
> instead of SHA -- these changes are just one entry in drbg_cores[] away
> without any code change.

That's a really nice change and something I've been thinking about for a couple of months as well. Then I came across tytso's ChaCha patches to urandom and was thinking ISA-dependent switches between ciphers would make sense, i.e. you get AES-NI performance when there's support.

> Finally, the LRNG code is completely agnostic of the underlying deterministic
> RNG. You only need a replacement of two small functions to invoke the seeding
> and generate operation of a DRNG. So, if one wants a Chacha20, he can have it.
> If one wants X9.31, he can have it. See section 2.8.3 [1] -- note, that DRNG
> does not even need to be a kernel crypto API registered implementation.

It's valid criticism that the number of algorithms should be limited. Algorithmic agility is an issue and has caused many real-world security problems in protocols; liberally granting users the choice of crypto primitives isn't a good idea. We should think about algorithms that make sense. E.g. TLS 1.3 and HTTP/2 have been moving in this direction. TLS 1.3 will only allow a couple of cipher-suites, as opposed to the combinatorial explosion of previous iterations of the protocol.

I'd suggest sticking to AES-CTR and ChaCha20 for DRNG designs. That should fit almost all platforms with great performance, keep the code-base small, etc.

There's now heavily optimised assembly in OpenSSL for ChaCha20 if you want to take a look: https://github.com/openssl/openssl/tree/master/crypto/chacha/asm
But as mentioned in the ChaCha/urandom thread: architecture-specific optimisation may be painful and error-prone.

> Bottom line, I want to give folks a full flexibility. That said, the LRNG code
> is more of a logic to collect entropy and maintain two DRNG types which are
> seeded according to a defined schedule than it is a tightly integrated RNG.
>
> Also, I am not so sure that simply taking a cipher, sprinkling some
> backtracking logic on it implies you have a good DRNG. As of now, I have not
> seen assessments from others for the Chacha20 DRNG approach. I personally
> would think that the Chacha20 approach from Ted is good. Yet others may have a
> more conservative approach of using a DRNG implementation that has been
> reviewed by a lot of folks.
>
> [1] http://www.chronox.de/lrng/doc/lrng.pdf

Currently reading that paper, it seems like a solid approach.

I don't like the approach that user-space programs may modify entropy. It's a myth that `haveged` etc. provide more security; EGDs have barely been audited, are usually written as academic work, and are completely unmaintained. I regularly end up in randomness[sic!] discussions with core language maintainers [0] [1] - they seem to have little understanding of what's going on in the kernel and either use /dev/random as a seed or a userspace RNG (most of which aren't particularly safe to begin with -- OpenSSL is not fork-safe [2] [3], a recent paper found weaknesses in the OpenSSL RNG at low entropy state leaking secrets [4], et cetera). This seems to be mostly the case because of the infamous `random(4)` man-page, with end-users (protocol implementers, core language designers, ...) always pointing to upstream, which - of course - is the Linux kernel.

I can't really tell from the paper whether /dev/random would still be blocking in some cases. If so, that's unfortunate.

Thanks for your work on this,
Aaron

[0] https://bugs.ruby-lang.org/issues/9569
[1] https://github.com/nodejs/node/issues/5798
[2] https://emboss.github.io/blog/2013/08/21/openssl-prng-is-not-really-fork-safe/
[3] https://wiki.openssl.org/index.php/Random_fork-safety
[4] https://eprint.iacr.org/2016/367.pdf



2016-05-29 19:52:03

by Stephan Müller

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

On Saturday, 28 May 2016 at 07:28:25, Aaron Zauner wrote:

Hi Aaron,

> Heya,
>
> > On 27 May 2016, at 01:49, Stephan Mueller <[email protected]> wrote:
> > Then, the use of the DRBG offers users the choice between a Hash/HMAC and
> > CTR implementation to suit their needs. The DRBG code is agnostic of the
> > underlying cipher. So, you could even use Blowfish instead of AES or
> > whirlpool instead of SHA -- these changes are just one entry in
> > drbg_cores[] away without any code change.
>
> That's a really nice change and something I've been thinking about for a
> couple of months as well. Then I came across tytso's ChaCha patches to
> urandom and was thinking ISA dependent switches between ciphers would make
> sense, i.e. you get AES-NI performance when there's support.
> > Finally, the LRNG code is completely agnostic of the underlying
> > deterministic RNG. You only need a replacement of two small functions to
> > invoke the seeding and generate operation of a DRNG. So, if one wants a
> > Chacha20, he can have it. If one wants X9.31, he can have it. See section
> > 2.8.3 [1] -- note, that DRNG does not even need to be a kernel crypto API
> > registered implementation.
> It's valid criticism that the number of algorithms should be limited.
> Algorithmic agility is an issue and has caused many real-world security
> problems in protocols; liberally granting users the choice of crypto
> primitives isn't a good idea. We should think about algorithms that make
> sense. E.g. TLS 1.3 and HTTP/2 have been moving in this direction. TLS
> 1.3 will only allow a couple of cipher-suites, as opposed to the
> combinatorial explosion of previous iterations of the protocol.

I cannot agree more with you, if the attacker can choose the algorithm.
However, I would think that a compile-time selection of one specific algorithm
is not prone to this issue. Also, the code of the LRNG provides a pre-defined
set of DRBGs that should not leave any wish open. Hence, I am not sure that
too many folks would change the code here.

Though, if folks really want to, they have the option to do so.
>
> I'd suggest sticking to AES-CTR and ChaCha20 for DRNG designs. That should
> fit almost all platforms with great performance, keep code-base small etc.

Regarding the CTR DRBG: I did not make it the default for two reasons:

- it is not the fastest -- though I just found a drag on the CTR DRBG
performance that I want to push upstream after the merge window closes. With
that patch, the CTR DRBG is now the fastest by orders of magnitude, so this
issue no longer applies.

- the DF/BCC function in the DRBG is critical, as I think it loses entropy.
When you seed the DRBG with, say, 256 or 384 bits of data, the BCC acts akin
to a MAC by taking the 256 or 384 bits and collapsing them into one AES block
of 128 bits. The DF function then expands this one block into the DRBG
internal state, including the AES key of 256/384 bits depending on the type
of AES you use. So, if you have 256 bits of entropy in the seed, you have 128
bits left after the BCC operation.
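Much simplified, the collapse described here can be sketched as follows (note: the real SP 800-90A df invokes BCC more than once with distinct IVs, but each chain still emits a single block; SHA-256 stands in for AES in this userspace illustration):

```python
import hashlib

BLOCK = 16  # AES block size: the 128-bit bottleneck in question

def aes_stand_in(key: bytes, block: bytes) -> bytes:
    # SHA-256 truncated to one block stands in for AES (no AES in the stdlib).
    return hashlib.sha256(key + block).digest()[:BLOCK]

def bcc(key: bytes, data: bytes) -> bytes:
    """CBC-MAC-style chaining as in the SP 800-90A BCC: however much
    entropy `data` carries, the result is a single 16-byte block."""
    chain = bytes(BLOCK)
    for i in range(0, len(data), BLOCK):
        blk = data[i:i + BLOCK].ljust(BLOCK, b"\x00")
        chain = aes_stand_in(key, bytes(a ^ b for a, b in zip(chain, blk)))
    return chain

seed = bytes(range(48))               # 384 bits of seed material in
collapsed = bcc(b"K" * BLOCK, seed)   # 128 bits out: the claimed entropy cap
```

The subsequent DF expansion stretches `collapsed` back into a 256/384-bit key plus V, but expansion cannot recreate entropy lost in the collapse.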

Given that criticism, I am asking whether the CTR DRBG with AES > 128 should
be used as the default. Also, the CTR DRBG is the most complex of all three
DRBGs (while the HMAC DRBG, the current default, is the leanest and cleanest).

But if folks think that the CTR DRBG should be made the default, I would
listen and make it so.

>
> There's now heavily optimised assembly in OpenSSL for ChaCha20 if you want
> to take a look:
> https://github.com/openssl/openssl/tree/master/crypto/chacha/asm But as
> mentioned in the ChaCha/urandom thread: architecture specific optimisation
> may be painful and error-prone.

I personally am not sure that taking some arbitrary cipher and turning it
into a DRNG by simply using a self-feeding loop based on the ideas of X9.31
Appendix A2.4 is good. ChaCha20 is a good cipher, but is it equally good for
a DRNG? I do not know. There are too few assessments from mathematicians on
that topic.

Hence, I would rather stick to DRNG designs that have been analyzed by
different folks.
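For reference, the self-feeding construction in question can be sketched like this (a hypothetical illustration; SHA-256 in counter mode stands in for the cipher keystream, since Python's stdlib offers neither AES nor ChaCha20):

```python
import hashlib

KEYLEN = 32

def keystream(key: bytes, nbytes: int) -> bytes:
    # Stand-in keystream generator; in the design under discussion this
    # would be ChaCha20 output under `key`.
    out = bytearray()
    ctr = 0
    while len(out) < nbytes:
        out += hashlib.sha256(key + ctr.to_bytes(8, "little")).digest()
        ctr += 1
    return bytes(out[:nbytes])

def drng_generate(key: bytes, nbytes: int):
    """Self-feeding loop: the first 32 keystream bytes replace the key
    (backtracking resistance), the remainder is handed out as output."""
    ks = keystream(key, KEYLEN + nbytes)
    return ks[KEYLEN:], ks[:KEYLEN]

out, key = drng_generate(bytes(KEYLEN), 64)  # 64 bytes out, fresh key kept
```

The open question raised above is not whether this loop is implementable, but whether it has seen enough independent cryptographic analysis.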

> > Bottom line, I want to give folks a full flexibility. That said, the LRNG
> > code is more of a logic to collect entropy and maintain two DRNG types
> > which are seeded according to a defined schedule than it is a tightly
> > integrated RNG.
> >
> > Also, I am not so sure that simply taking a cipher, sprinkling some
> > backtracking logic on it implies you have a good DRNG. As of now, I have
> > not seen assessments from others for the Chacha20 DRNG approach. I
> > personally would think that the Chacha20 approach from Ted is good. Yet
> > others may have a more conservative approach of using a DRNG
> > implementation that has been reviewed by a lot of folks.
> >
> > [1] http://www.chronox.de/lrng/doc/lrng.pdf
>
> Currently reading that paper, it seems like a solid approach.

There was criticism of the entropy maintenance. I have now reverted it to the
classical, yet lockless, LFSR approach. Once the merge window closes, I will
release the new version.
>
> I don't like the approach that user-space programs may modify entropy. It's
> a myth that `haveged` etc. provide more security, and EGDs have been barely
> audited, usually written as academic work and have been completely
> unmaintained. I regularly end up in randomness[sic!] discussions with core
> language maintainers [0] [1] - they seem to have little understanding of
> what's going on in the kernel and either use /dev/random as a seed or a
> Userspace RNG (most of which aren't particularly safe to begin with --
> OpenSSL is not fork safe [2] [3], a recent paper found weaknesses in the
> OpenSSL RNG at low entropy state leaking secrets [4] et cetera). This seems
> to be mostly the case because of the infamous `random(4)` man-page. With
> end-users (protocol implementers, core language designers,..) always
> pointing to upstream, which - of course - is the Linux kernel.

Point taken, but I cannot simply change the existing user space interface.
Hence, I modified it such that the data does not end up in some entropy pool,
but is mixed into the DRBGs, which are designed to handle data with and
without entropy.
>
> I can't really tell from the paper if /dev/random would still be blocking in
> some cases? If so that's unfortunate.

It is blocking; that is its nature: one bit of output data from /dev/random
shall be backed by one bit of entropy from the noise sources. As the noise
sources are not fast, it will block.

However, /dev/urandom is now seeded with 256 bits of entropy very fast
during the boot cycle. Hence, if you use getrandom(2), which blocks until the
256 bits of initial seed are reached, you should be good for any standard
cryptographic purpose.
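From userspace the distinction looks like this (Python on Linux; `os.getrandom` wraps the getrandom(2) syscall):

```python
import os

# getrandom(2) blocks only until the kernel RNG has received its initial
# seed; after that it behaves like /dev/urandom and never blocks again.
seed = os.getrandom(32)  # 256 bits, enough for standard cryptographic use

# By contrast, passing os.GRND_RANDOM draws from the blocking /dev/random
# pool, which may stall whenever the entropy estimate runs low:
#   os.getrandom(32, os.GRND_RANDOM)
```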

With the new upcoming release of the LRNG code, I also provide updated
documentation. That documentation contains a test of the entropy contained in
the first 50 interrupt events after boot: I record the timing of the first 50
interrupts for a test batch of 50,000 boots. That test shall demonstrate that
the basic entropy estimate behind the LRNG is sound even during boot. Hence,
I think that when the LRNG declares that the DRBG behind /dev/urandom is
seeded with 256 bits of entropy, it really received that amount of entropy.

>
> Thanks for your work on this,
> Aaron
>
> [0] https://bugs.ruby-lang.org/issues/9569
> [1] https://github.com/nodejs/node/issues/5798
> [2] https://emboss.github.io/blog/2013/08/21/openssl-prng-is-not-really-fork-safe/
> [3] https://wiki.openssl.org/index.php/Random_fork-safety
> [4] https://eprint.iacr.org/2016/367.pdf


Ciao
Stephan

2016-05-30 04:08:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

On Sun, May 29, 2016 at 09:51:59PM +0200, Stephan Mueller wrote:
>
> I personally am not sure that taking some arbitrary cipher and turning it into
> a DRNG by simply using a self-feeding loop based on the ideas of X9.31
> Appendix A2.4 is good. ChaCha20 is a good cipher, but is it equally good for a
> DRNG? I do not know. There are too few assessments from mathematicians out
> there regarding that topic.

If ChaCha20 is a good (stream) cipher, it must be a good DRNG by
definition. In other words, if you can predict the output of a
ChaCha20-based DRNG with any accuracy greater than chance, this can be
used as a wedge to attack the stream cipher.

I will note that OpenBSD's "ARC4" random number generator is currently
using ChaCha20, BTW.

Regards,

- Ted

2016-06-08 12:21:13

by Stephan Müller

[permalink] [raw]
Subject: Re: AES-NI: slower than aes-generic?

On Thursday, 26 May 2016 at 22:14:29, Theodore Ts'o wrote:

Hi Theodore,

> On Thu, May 26, 2016 at 08:49:39PM +0200, Stephan Mueller wrote:
> > Using the kernel crypto API one can relieve the CPU of the crypto work, if
> > a hardware or assembler implementation is available. This may be of
> > particular interest for smaller systems. So, for smaller systems (where
> > kernel bloat is bad, but where nowadays more and more hardware
> > crypto support is added), we must weigh the kernel bloat (of 3 or 4
> > additional C files for the basic kernel crypto API logic) against
> > relieving the CPU of work.
>
> There are a number of caveats with using hardware acceleration; one is
> that many hardware accelerators are optimized for bulk data
> encryption, and so key scheduling, or switching between key schedules,
> can have a higher overhead than a pure software implementation.

As a followup: I tweaked the DRBG side a bit to use the full speed of the
AES-NI implementation. With that tweak, the initial finding no longer
applies.

Depending on the request size, I now get more than 800 MB/s (an increase of
more than 450% compared to the initial implementation) from the AES-NI
implementation. Hence, the frequent key schedule update does not seem to be
too much of an issue.

Ciao
Stephan