Hi,
I was wondering why reading from /dev/urandom is much slower on Ryzen
than on Intel, and did some analysis. It turns out that the RDRAND
instruction, which takes much longer on AMD, is at fault.
If I read this correctly:
--- drivers/char/random.c ---
862 spin_lock_irqsave(&crng->lock, flags);
863 if (arch_get_random_long(&v))
864 crng->state[14] ^= v;
865 chacha20_block(&crng->state[0], out);
one call to RDRAND (with 64-bit operand) is issued per computation of a
chacha20 block. According to the measurements I did, it seems on Ryzen
this dominates the time usage:
On Broadwell E5-2650 v4:
---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
28827451392 bytes (29 GB) copied, 143.290349 s, 201 MB/s
# perf top
49.88% [kernel] [k] chacha20_block
31.22% [kernel] [k] _extract_crng
---
On Ryzen 1800X:
---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
3169845248 bytes (3,2 GB, 3,0 GiB) copied, 42,0106 s, 75,5 MB/s
# perf top
76,40% [kernel] [k] _extract_crng
13,05% [kernel] [k] chacha20_block
---
An easy improvement might be to replace the use of
arch_get_random_long() with arch_get_random_int(), as the state array
contains only 32-bit elements, and (unlike on Intel) 32-bit RDRAND on
Ryzen is reported to be roughly twice as fast.
Best regards,
OM
On Fri, Jul 21, 2017 at 09:12:01AM +0200, Oliver Mangold wrote:
> Hi,
>
> I was wondering why reading from /dev/urandom is much slower on
> Ryzen than on Intel, and did some analysis. It turns out that the
> RDRAND instruction is at fault, which takes much longer on AMD.
>
> if I read this correctly:
>
> --- drivers/char/random.c ---
> 862 spin_lock_irqsave(&crng->lock, flags);
> 863 if (arch_get_random_long(&v))
> 864 crng->state[14] ^= v;
> 865 chacha20_block(&crng->state[0], out);
>
> one call to RDRAND (with 64-bit operand) is issued per computation
> of a chacha20 block. According to the measurements I did, it seems
> on Ryzen this dominates the time usage:
>
> On Broadwell E5-2650 v4:
>
> ---
> # dd if=/dev/urandom of=/dev/null bs=1M status=progress
> 28827451392 bytes (29 GB) copied, 143.290349 s, 201 MB/s
> # perf top
> 49.88% [kernel] [k] chacha20_block
> 31.22% [kernel] [k] _extract_crng
> ---
>
> On Ryzen 1800X:
>
> ---
> # dd if=/dev/urandom of=/dev/null bs=1M status=progress
> 3169845248 bytes (3,2 GB, 3,0 GiB) copied, 42,0106 s, 75,5 MB/s
> # perf top
> 76,40% [kernel] [k] _extract_crng
> 13,05% [kernel] [k] chacha20_block
> ---
>
> An easy improvement might be to replace the usage of
> arch_get_random_long() by arch_get_random_int(), as the state array
> contains just 32-bit elements, and (contrary to Intel) on Ryzen
> 32-bit RDRAND is supposed to be faster by roughly a factor of 2.
Nice catch. How much does the performance improve on Ryzen when you
use arch_get_random_int()?
--Jan
> Best regards,
>
> OM
On 21.07.2017 11:26, Jan Glauber wrote:
>
> Nice catch. How much does the performance improve on Ryzen when you
> use arch_get_random_int()?
Okay, now I have some results for you:
On Ryzen 1800X (using arch_get_random_int()):
---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
8751415296 bytes (8,8 GB, 8,2 GiB) copied, 71,0079 s, 123 MB/s
# perf top
57,37% [kernel] [k] _extract_crng
26,20% [kernel] [k] chacha20_block
---
Better, but obviously there is still much room for improvement by
reducing the number of calls to RDRAND.
On Ryzen 1800X (with nordrand kernel option):
---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
22643998720 bytes (23 GB, 21 GiB) copied, 67,0025 s, 338 MB/s
---
Here is the patch I used:
--- drivers/char/random.c.orig	2017-07-03 01:07:02.000000000 +0200
+++ drivers/char/random.c	2017-07-21 11:57:40.541677118 +0200
@@ -859,13 +859,14 @@
 static void _extract_crng(struct crng_state *crng,
 			  __u8 out[CHACHA20_BLOCK_SIZE])
 {
-	unsigned long v, flags;
+	unsigned int v;
+	unsigned long flags;
 
 	if (crng_init > 1 &&
 	    time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
 		crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
 	spin_lock_irqsave(&crng->lock, flags);
-	if (arch_get_random_long(&v))
+	if (arch_get_random_int(&v))
 		crng->state[14] ^= v;
 	chacha20_block(&crng->state[0], out);
 	if (crng->state[12] == 0)
On Fri, Jul 21, 2017 at 3:12 AM, Oliver Mangold <[email protected]> wrote:
> Hi,
>
> I was wondering why reading from /dev/urandom is much slower on Ryzen than
> on Intel, and did some analysis. It turns out that the RDRAND instruction is
> at fault, which takes much longer on AMD.
>
> if I read this correctly:
>
> --- drivers/char/random.c ---
> 862 spin_lock_irqsave(&crng->lock, flags);
> 863 if (arch_get_random_long(&v))
> 864 crng->state[14] ^= v;
> 865 chacha20_block(&crng->state[0], out);
>
> one call to RDRAND (with 64-bit operand) is issued per computation of a
> chacha20 block. According to the measurements I did, it seems on Ryzen this
> dominates the time usage:
AMD's implementations of RDRAND and RDSEED are simply slow; this dates
back to Bulldozer. While Intel can produce random numbers at about 10
cycles per byte, AMD regularly takes thousands of cycles for one byte.
Bulldozer was measured at 4100 cycles per byte.
It also appears AMD uses the same circuit for random numbers for both
RDRAND and RDSEED, so both are equally fast (or equally slow).
Here are some benchmarks if you are interested:
https://www.cryptopp.com/wiki/RDRAND#Performance .
Jeff
On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
> Better, but obviously there is still much room for improvement by reducing
> the number of calls to RDRAND.
Hmm, is there some way we can easily tell we are running on Ryzen? Or
do we believe this is going to be true for all AMD devices?
I guess we could add some kind of "has_crappy_arch_get_random()" call
which could be defined by arch/x86, and change how aggressively we use
arch_get_random_*() depending on whether has_crappy_arch_get_random()
returns true or not....
- Ted
On 21.07.2017 16:47, Theodore Ts'o wrote:
> On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
>> Better, but obviously there is still much room for improvement by reducing
>> the number of calls to RDRAND.
> Hmm, is there some way we can easily tell we are running on Ryzen? Or
> do we believe this is going to be true for all AMD devices?
I would like to note that my first measurement on Broadwell suggests
that the current frequency of RDRAND calls slows things down on Intel
as well (though not as much as on Ryzen).
On 07/21/2017 09:47 AM, Theodore Ts'o wrote:
> On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
>> Better, but obviously there is still much room for improvement by reducing
>> the number of calls to RDRAND.
>
> Hmm, is there some way we can easily tell we are running on Ryzen? Or
> do we believe this is going to be true for all AMD devices?
>
> I guess we could add some kind of "has_crappy_arch_get_random()" call
> which could be defined by arch/x86, and change how aggressively we use
> arch_get_random_*() depending on whether has_crappy_arch_get_random()
> returns true or not....
Just a quick note to say that we are aware of the issue, and that it is
being addressed.
On Fri, Jul 21, 2017 at 04:55:12PM +0200, Oliver Mangold wrote:
> On 21.07.2017 16:47, Theodore Ts'o wrote:
> > On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
> > > Better, but obviously there is still much room for improvement by reducing
> > > the number of calls to RDRAND.
> > Hmm, is there some way we can easily tell we are running on Ryzen? Or
> > do we believe this is going to be true for all AMD devices?
> I would like to note that my first measurement on Broadwell suggests that
> the current frequency of RDRAND calls seems to slow things down on Intel
> also (but not as much as on Ryzen).
On my T470 laptop (with an Intel mobile core i7 processor), using your
benchmark, I am getting 136 MB/s, versus your 75 MB/s. But so what?
More realistically, if we are generating 256 bit keys (so we're
reading from /dev/urandom 32 bytes at a time), it takes 2.24
microseconds per key generation. What do you get when you run:
dd if=/dev/urandom of=/dev/zero bs=256 count=1000000
Even if on Ryzen it's slower by a factor of two, 5 microseconds per
key generation is pretty fast! The time to do the Diffie-Hellman
exchange and the RSA operations will probably completely swamp the
time to generate the session key.
And if you think 2.24 or 5 microseconds is too slow for the IV
generation --- then use a userspace ChaCha20 CRNG for that purpose.
I'm not really sure I see a real-life operational problem here.
- Ted
On Sat, Jul 22, 2017 at 02:16:41PM -0400, Theodore Ts'o wrote:
> On Fri, Jul 21, 2017 at 04:55:12PM +0200, Oliver Mangold wrote:
> > On 21.07.2017 16:47, Theodore Ts'o wrote:
> > > On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
> > > > Better, but obviously there is still much room for improvement by reducing
> > > > the number of calls to RDRAND.
> > > Hmm, is there some way we can easily tell we are running on Ryzen? Or
> > > do we believe this is going to be true for all AMD devices?
> > I would like to note that my first measurement on Broadwell suggest that the
> > current frequency of RDRAND calls seems to slow things down on Intel also
> > (but not as much as on Ryzen).
>
> On my T470 laptop (with an Intel mobile core i7 processor), using your
> benchmark, I am getting 136 MB/s, versus your 75 MB/s. But so what?
>
> More realistically, if we are generating 256 bit keys (so we're
> reading from /dev/urandom 32 bytes at a time), it takes 2.24
> microseconds per key generation. What do you get when you run:
>
> dd if=/dev/urandom of=/dev/zero bs=256 count=1000000
>
> Even if on Ryzen it's slower by a factor of two, 5 microseconds per
> key generation is pretty fast! The time to do the Diffie-Hellman
> exchange and the RSA operations will probably completely swamp the
> time to generate the session key.
>
> And if you think 2.24 or 5 microseconds is too slow for the IV
> generation --- then use a userspace ChaCha20 CRNG for that purpose.
>
> I'm not really sure I see a real-life operational problem here.
>
> - Ted
While I agree that it is not an issue if the hardware is just slow, I
still wonder why we read 8 bytes with arch_get_random_long() and then
use only half of them, as Oliver pointed out.
If arch_get_random_int() is not slower on Intel, we could use that.
Or am I missing something?
--Jan