2016-06-13 15:48:32

by Theodore Ts'o

Subject: [PATCH-v4 0/7] random: replace urandom pool with a CRNG

I plan to push this set of changes in the next merge window. Last
call for comments....

By using a CRNG to replace the urandom pool, we can more quickly
initialize the random number generator used for getrandom(2) and
/dev/urandom. On bare metal hardware this tends to get initialized
before the devices are finished being probed. We use more aggressive
entropy accounting initially, and then fall back to the original
slower/more conservative entropy accounting scheme.
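
In outline, the initialization state machine this series implements
looks like this (names as in patch 5/7):

	/*
	 * crng_init == 0: uninitialized; add_interrupt_randomness()
	 *	feeds raw fast_pool bytes to the CRNG via
	 *	crng_fast_load().
	 * crng_init == 1: "fast init done" --- CHACHA20_KEY_SIZE (32)
	 *	bytes have been XOR'ed into the key, so getrandom(2)
	 *	and /dev/urandom stop blocking/warning.
	 * crng_init == 2: "crng init done" --- the input pool has
	 *	accumulated 128 bits of credited entropy and
	 *	crng_reseed() has switched us to the conservative
	 *	reseeding regime.
	 */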

We will also use a hardware rng (such as virtio-rng), if available,
to initialize the getrandom(2) / /dev/urandom entropy pool.

In addition, on NUMA systems we make the CRNG state per-NUMA socket, to
address the NUMA locking contention problem which Andi Kleen has been
complaining about. I'm not entirely sure this will work well on the
crazy big SGI systems, but they are rare. Whether they are rarer than
abusive userspace programs that are continuously pounding /dev/urandom
is unclear. If necessary we can make a config option to turn off the
per-NUMA socket hack if it proves to be problematic.

Eric Biggers (1):
random: properly align get_random_int_hash

Stephan Mueller (1):
random: add interrupt callback to VMBus IRQ handler

Theodore Ts'o (5):
random: initialize the non-blocking pool via
add_hwgenerator_randomness()
random: print a warning for the first ten uninitialized random users
random: replace non-blocking pool with a Chacha20-based CRNG
random: make /dev/urandom scalable for silly userspace programs
random: add backtracking protection to the CRNG

Changes since -v3:
* Use a hardware rng (e.g., virtio-rng), if available, to initialize
the getrandom/urandom pool
* Print up to 10 warnings for uninitialized /dev/urandom reads, not just one
* Back out experiment to block /dev/urandom reads, since this will
break too many distributions or other user space setups (including
Python 3.5.2 and Debian Stretch's systemd-crontab-generator)
* Mark bug fixes for stable kernel backports

Changes since -v2:
* Rebased to v4.7-rc1
* Improved/reworked CRNG reseeding and backtracking protection
* Preseed the CRNG state from system data
* Added fix to properly align the get_random_int_hash[] array

crypto/chacha20_generic.c | 61 ------
drivers/char/random.c | 465 +++++++++++++++++++++++++++++++++++++---------
drivers/hv/vmbus_drv.c | 3 +
include/crypto/chacha20.h | 1 +
lib/Makefile | 2 +-
lib/chacha20.c | 79 ++++++++
6 files changed, 457 insertions(+), 154 deletions(-)
create mode 100644 lib/chacha20.c

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random.git 1d6e2eda6f60

--
2.5.0


2016-06-13 15:49:28

by Theodore Ts'o

Subject: [PATCH 7/7] random: add backtracking protection to the CRNG

Signed-off-by: Theodore Ts'o <[email protected]>
---
drivers/char/random.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d640865..963a6a9 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -436,7 +436,8 @@ static int crng_init = 0;
#define crng_ready() (likely(crng_init > 0))
static void _extract_crng(struct crng_state *crng,
__u8 out[CHACHA20_BLOCK_SIZE]);
-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE]);
+static void _crng_backtrack_protect(struct crng_state *crng,
+ __u8 tmp[CHACHA20_BLOCK_SIZE], int used);
static void process_random_ready_list(void);

/**********************************************************************
@@ -826,8 +827,11 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
num = extract_entropy(r, &buf, 32, 16, 0);
if (num == 0)
return;
- } else
+ } else {
_extract_crng(&primary_crng, buf.block);
+ _crng_backtrack_protect(&primary_crng, buf.block,
+ CHACHA20_KEY_SIZE);
+ }
spin_lock_irqsave(&primary_crng.lock, flags);
for (i = 0; i < 8; i++) {
unsigned long rv;
@@ -889,9 +893,46 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
_extract_crng(crng, out);
}

+/*
+ * Use the leftover bytes from the CRNG block output (if there is
+ * enough) to mutate the CRNG key to provide backtracking protection.
+ */
+static void _crng_backtrack_protect(struct crng_state *crng,
+ __u8 tmp[CHACHA20_BLOCK_SIZE], int used)
+{
+ unsigned long flags;
+ __u32 *s, *d;
+ int i;
+
+ used = round_up(used, sizeof(__u32));
+ if (used + CHACHA20_KEY_SIZE > CHACHA20_BLOCK_SIZE) {
+ extract_crng(tmp);
+ used = 0;
+ }
+ spin_lock_irqsave(&crng->lock, flags);
+ s = (__u32 *) &tmp[used];
+ d = &crng->state[4];
+ for (i=0; i < 8; i++)
+ *d++ ^= *s++;
+ spin_unlock_irqrestore(&crng->lock, flags);
+}
+
+static void crng_backtrack_protect(__u8 tmp[CHACHA20_BLOCK_SIZE], int used)
+{
+ struct crng_state *crng = NULL;
+
+#ifdef CONFIG_NUMA
+ if (crng_node_pool)
+ crng = crng_node_pool[numa_node_id()];
+ if (crng == NULL)
+#endif
+ crng = &primary_crng;
+ _crng_backtrack_protect(crng, tmp, used);
+}
+
static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
{
- ssize_t ret = 0, i;
+ ssize_t ret = 0, i = CHACHA20_BLOCK_SIZE;
__u8 tmp[CHACHA20_BLOCK_SIZE];
int large_request = (nbytes > 256);

@@ -916,6 +957,7 @@ static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
buf += i;
ret += i;
}
+ crng_backtrack_protect(tmp, i);

/* Wipe data just written to memory */
memzero_explicit(tmp, sizeof(tmp));
@@ -1473,8 +1515,10 @@ void get_random_bytes(void *buf, int nbytes)
if (nbytes > 0) {
extract_crng(tmp);
memcpy(buf, tmp, nbytes);
- memzero_explicit(tmp, nbytes);
- }
+ crng_backtrack_protect(tmp, nbytes);
+ } else
+ crng_backtrack_protect(tmp, CHACHA20_BLOCK_SIZE);
+ memzero_explicit(tmp, sizeof(tmp));
}
EXPORT_SYMBOL(get_random_bytes);

--
2.5.0

2016-06-13 15:48:33

by Theodore Ts'o

Subject: [PATCH 1/7] random: initialize the non-blocking pool via add_hwgenerator_randomness()

If we have a hardware RNG and are using the in-kernel rngd, we should
use this to initialize the non-blocking pool so that getrandom(2)
doesn't block unnecessarily.

Cc: [email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
---
drivers/char/random.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0158d3b..4e2627a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1849,12 +1849,18 @@ void add_hwgenerator_randomness(const char *buffer, size_t count,
{
struct entropy_store *poolp = &input_pool;

- /* Suspend writing if we're above the trickle threshold.
- * We'll be woken up again once below random_write_wakeup_thresh,
- * or when the calling thread is about to terminate.
- */
- wait_event_interruptible(random_write_wait, kthread_should_stop() ||
+ if (unlikely(nonblocking_pool.initialized == 0))
+ poolp = &nonblocking_pool;
+ else {
+ /* Suspend writing if we're above the trickle
+ * threshold. We'll be woken up again once below
+ * random_write_wakeup_thresh, or when the calling
+ * thread is about to terminate.
+ */
+ wait_event_interruptible(random_write_wait,
+ kthread_should_stop() ||
ENTROPY_BITS(&input_pool) <= random_write_wakeup_bits);
+ }
mix_pool_bytes(poolp, buffer, count);
credit_entropy_bits(poolp, entropy);
}
--
2.5.0

2016-06-13 15:49:29

by Theodore Ts'o

Subject: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

The CRNG is faster, and we don't pretend to track entropy usage in the
CRNG any more.

Signed-off-by: Theodore Ts'o <[email protected]>
---
crypto/chacha20_generic.c | 61 --------
drivers/char/random.c | 374 +++++++++++++++++++++++++++++++++-------------
include/crypto/chacha20.h | 1 +
lib/Makefile | 2 +-
lib/chacha20.c | 79 ++++++++++
5 files changed, 353 insertions(+), 164 deletions(-)
create mode 100644 lib/chacha20.c

diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
index da9c899..1cab831 100644
--- a/crypto/chacha20_generic.c
+++ b/crypto/chacha20_generic.c
@@ -15,72 +15,11 @@
#include <linux/module.h>
#include <crypto/chacha20.h>

-static inline u32 rotl32(u32 v, u8 n)
-{
- return (v << n) | (v >> (sizeof(v) * 8 - n));
-}
-
static inline u32 le32_to_cpuvp(const void *p)
{
return le32_to_cpup(p);
}

-static void chacha20_block(u32 *state, void *stream)
-{
- u32 x[16], *out = stream;
- int i;
-
- for (i = 0; i < ARRAY_SIZE(x); i++)
- x[i] = state[i];
-
- for (i = 0; i < 20; i += 2) {
- x[0] += x[4]; x[12] = rotl32(x[12] ^ x[0], 16);
- x[1] += x[5]; x[13] = rotl32(x[13] ^ x[1], 16);
- x[2] += x[6]; x[14] = rotl32(x[14] ^ x[2], 16);
- x[3] += x[7]; x[15] = rotl32(x[15] ^ x[3], 16);
-
- x[8] += x[12]; x[4] = rotl32(x[4] ^ x[8], 12);
- x[9] += x[13]; x[5] = rotl32(x[5] ^ x[9], 12);
- x[10] += x[14]; x[6] = rotl32(x[6] ^ x[10], 12);
- x[11] += x[15]; x[7] = rotl32(x[7] ^ x[11], 12);
-
- x[0] += x[4]; x[12] = rotl32(x[12] ^ x[0], 8);
- x[1] += x[5]; x[13] = rotl32(x[13] ^ x[1], 8);
- x[2] += x[6]; x[14] = rotl32(x[14] ^ x[2], 8);
- x[3] += x[7]; x[15] = rotl32(x[15] ^ x[3], 8);
-
- x[8] += x[12]; x[4] = rotl32(x[4] ^ x[8], 7);
- x[9] += x[13]; x[5] = rotl32(x[5] ^ x[9], 7);
- x[10] += x[14]; x[6] = rotl32(x[6] ^ x[10], 7);
- x[11] += x[15]; x[7] = rotl32(x[7] ^ x[11], 7);
-
- x[0] += x[5]; x[15] = rotl32(x[15] ^ x[0], 16);
- x[1] += x[6]; x[12] = rotl32(x[12] ^ x[1], 16);
- x[2] += x[7]; x[13] = rotl32(x[13] ^ x[2], 16);
- x[3] += x[4]; x[14] = rotl32(x[14] ^ x[3], 16);
-
- x[10] += x[15]; x[5] = rotl32(x[5] ^ x[10], 12);
- x[11] += x[12]; x[6] = rotl32(x[6] ^ x[11], 12);
- x[8] += x[13]; x[7] = rotl32(x[7] ^ x[8], 12);
- x[9] += x[14]; x[4] = rotl32(x[4] ^ x[9], 12);
-
- x[0] += x[5]; x[15] = rotl32(x[15] ^ x[0], 8);
- x[1] += x[6]; x[12] = rotl32(x[12] ^ x[1], 8);
- x[2] += x[7]; x[13] = rotl32(x[13] ^ x[2], 8);
- x[3] += x[4]; x[14] = rotl32(x[14] ^ x[3], 8);
-
- x[10] += x[15]; x[5] = rotl32(x[5] ^ x[10], 7);
- x[11] += x[12]; x[6] = rotl32(x[6] ^ x[11], 7);
- x[8] += x[13]; x[7] = rotl32(x[7] ^ x[8], 7);
- x[9] += x[14]; x[4] = rotl32(x[4] ^ x[9], 7);
- }
-
- for (i = 0; i < ARRAY_SIZE(x); i++)
- out[i] = cpu_to_le32(x[i] + state[i]);
-
- state[12]++;
-}
-
static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
unsigned int bytes)
{
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 83f5cd0..841f9a8 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -261,6 +261,7 @@
#include <linux/syscalls.h>
#include <linux/completion.h>
#include <linux/uuid.h>
+#include <crypto/chacha20.h>

#include <asm/processor.h>
#include <asm/uaccess.h>
@@ -413,6 +414,29 @@ static struct fasync_struct *fasync;
static DEFINE_SPINLOCK(random_ready_list_lock);
static LIST_HEAD(random_ready_list);

+struct crng_state {
+ __u32 state[16];
+ unsigned long init_time;
+ spinlock_t lock;
+};
+
+struct crng_state primary_crng = {
+ .lock = __SPIN_LOCK_UNLOCKED(primary_crng.lock),
+};
+
+/*
+ * crng_init = 0 --> Uninitialized
+ * 1 --> Initialized
+ * 2 --> Initialized from input_pool
+ *
+ * crng_init is protected by primary_crng->lock, and only increases
+ * its value (from 0->1->2).
+ */
+static int crng_init = 0;
+#define crng_ready() (likely(crng_init > 0))
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE]);
+static void process_random_ready_list(void);
+
/**********************************************************************
*
* OS independent entropy store. Here are the functions which handle
@@ -442,10 +466,15 @@ struct entropy_store {
__u8 last_data[EXTRACT_SIZE];
};

+static ssize_t extract_entropy(struct entropy_store *r, void *buf,
+ size_t nbytes, int min, int rsvd);
+static ssize_t _extract_entropy(struct entropy_store *r, void *buf,
+ size_t nbytes, int fips);
+
+static void crng_reseed(struct crng_state *crng, struct entropy_store *r);
static void push_to_pool(struct work_struct *work);
static __u32 input_pool_data[INPUT_POOL_WORDS];
static __u32 blocking_pool_data[OUTPUT_POOL_WORDS];
-static __u32 nonblocking_pool_data[OUTPUT_POOL_WORDS];

static struct entropy_store input_pool = {
.poolinfo = &poolinfo_table[0],
@@ -466,16 +495,6 @@ static struct entropy_store blocking_pool = {
push_to_pool),
};

-static struct entropy_store nonblocking_pool = {
- .poolinfo = &poolinfo_table[1],
- .name = "nonblocking",
- .pull = &input_pool,
- .lock = __SPIN_LOCK_UNLOCKED(nonblocking_pool.lock),
- .pool = nonblocking_pool_data,
- .push_work = __WORK_INITIALIZER(nonblocking_pool.push_work,
- push_to_pool),
-};
-
static __u32 const twist_table[8] = {
0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
@@ -678,12 +697,6 @@ retry:
if (!r->initialized && r->entropy_total > 128) {
r->initialized = 1;
r->entropy_total = 0;
- if (r == &nonblocking_pool) {
- prandom_reseed_late();
- process_random_ready_list();
- wake_up_all(&urandom_init_wait);
- pr_notice("random: %s pool is initialized\n", r->name);
- }
}

trace_credit_entropy_bits(r->name, nbits,
@@ -693,30 +706,27 @@ retry:
if (r == &input_pool) {
int entropy_bits = entropy_count >> ENTROPY_SHIFT;

+ if (crng_init < 2 && entropy_bits >= 128) {
+ crng_reseed(&primary_crng, r);
+ entropy_bits = r->entropy_count >> ENTROPY_SHIFT;
+ }
+
/* should we wake readers? */
if (entropy_bits >= random_read_wakeup_bits) {
wake_up_interruptible(&random_read_wait);
kill_fasync(&fasync, SIGIO, POLL_IN);
}
/* If the input pool is getting full, send some
- * entropy to the two output pools, flipping back and
- * forth between them, until the output pools are 75%
- * full.
+ * entropy to the blocking pool until it is 75% full.
*/
if (entropy_bits > random_write_wakeup_bits &&
r->initialized &&
r->entropy_total >= 2*random_read_wakeup_bits) {
- static struct entropy_store *last = &blocking_pool;
struct entropy_store *other = &blocking_pool;

- if (last == &blocking_pool)
- other = &nonblocking_pool;
if (other->entropy_count <=
- 3 * other->poolinfo->poolfracbits / 4)
- last = other;
- if (last->entropy_count <=
- 3 * last->poolinfo->poolfracbits / 4) {
- schedule_work(&last->push_work);
+ 3 * other->poolinfo->poolfracbits / 4) {
+ schedule_work(&other->push_work);
r->entropy_total = 0;
}
}
@@ -736,6 +746,154 @@ static void credit_entropy_bits_safe(struct entropy_store *r, int nbits)

/*********************************************************************
*
+ * CRNG using CHACHA20
+ *
+ *********************************************************************/
+
+#define CRNG_RESEED_INTERVAL (300*HZ)
+
+static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);
+
+static void crng_initialize(struct crng_state *crng)
+{
+ int i;
+ unsigned long rv;
+
+ memcpy(&crng->state[0], "expand 32-byte k", 16);
+ if (crng == &primary_crng)
+ _extract_entropy(&input_pool, &crng->state[4],
+ sizeof(__u32) * 12, 0);
+ else
+ get_random_bytes(&crng->state[4], sizeof(__u32) * 12);
+ for (i = 4; i < 16; i++) {
+ if (!arch_get_random_seed_long(&rv) &&
+ !arch_get_random_long(&rv))
+ rv = random_get_entropy();
+ crng->state[i] ^= rv;
+ }
+ crng->init_time = jiffies - CRNG_RESEED_INTERVAL - 1;
+}
+
+static int crng_fast_load(const char *cp, size_t len)
+{
+ unsigned long flags;
+ static size_t c = 0;
+ char *p;
+
+ if (!spin_trylock_irqsave(&primary_crng.lock, flags))
+ return 0;
+ if (crng_ready()) {
+ spin_unlock_irqrestore(&primary_crng.lock, flags);
+ return 0;
+ }
+ p = (unsigned char *) &primary_crng.state[4];
+ p += c;
+ while (len > 0 && c < CHACHA20_KEY_SIZE) {
+ *p ^= *cp;
+ p++; cp++; c++; len--;
+ }
+ if (c >= CHACHA20_KEY_SIZE) {
+ crng_init = 1;
+ wake_up_interruptible(&crng_init_wait);
+ pr_notice("random: fast init done\n");
+ }
+ spin_unlock_irqrestore(&primary_crng.lock, flags);
+ return 1;
+}
+
+static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
+{
+ unsigned long flags;
+ int i, num;
+ union {
+ __u8 block[CHACHA20_BLOCK_SIZE];
+ __u32 key[8];
+ } buf;
+
+ if (r) {
+ num = extract_entropy(r, &buf, 32, 16, 0);
+ if (num == 0)
+ return;
+ } else
+ extract_crng(buf.block);
+ spin_lock_irqsave(&primary_crng.lock, flags);
+ for (i = 0; i < 8; i++) {
+ unsigned long rv;
+ if (!arch_get_random_seed_long(&rv) &&
+ !arch_get_random_long(&rv))
+ rv = random_get_entropy();
+ crng->state[i+4] ^= buf.key[i] ^ rv;
+ }
+ memzero_explicit(&buf, sizeof(buf));
+ crng->init_time = jiffies;
+ if (crng == &primary_crng && crng_init < 2) {
+ crng_init = 2;
+ process_random_ready_list();
+ wake_up_interruptible(&crng_init_wait);
+ pr_notice("random: crng init done\n");
+ }
+ spin_unlock_irqrestore(&primary_crng.lock, flags);
+}
+
+static inline void crng_wait_ready(void)
+{
+ wait_event_interruptible(crng_init_wait, crng_ready());
+}
+
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+ unsigned long v, flags;
+ struct crng_state *crng = &primary_crng;
+
+ if (crng_init > 1 &&
+ time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
+ crng_reseed(crng, &input_pool);
+ spin_lock_irqsave(&crng->lock, flags);
+ if (arch_get_random_long(&v))
+ crng->state[14] ^= v;
+ chacha20_block(&crng->state[0], out);
+ if (crng->state[12] == 0)
+ crng->state[13]++;
+ spin_unlock_irqrestore(&crng->lock, flags);
+}
+
+static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
+{
+ ssize_t ret = 0, i;
+ __u8 tmp[CHACHA20_BLOCK_SIZE];
+ int large_request = (nbytes > 256);
+
+ while (nbytes) {
+ if (large_request && need_resched()) {
+ if (signal_pending(current)) {
+ if (ret == 0)
+ ret = -ERESTARTSYS;
+ break;
+ }
+ schedule();
+ }
+
+ extract_crng(tmp);
+ i = min_t(int, nbytes, CHACHA20_BLOCK_SIZE);
+ if (copy_to_user(buf, tmp, i)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ nbytes -= i;
+ buf += i;
+ ret += i;
+ }
+
+ /* Wipe data just written to memory */
+ memzero_explicit(tmp, sizeof(tmp));
+
+ return ret;
+}
+
+
+/*********************************************************************
+ *
* Entropy input management
*
*********************************************************************/
@@ -750,12 +908,12 @@ struct timer_rand_state {
#define INIT_TIMER_RAND_STATE { INITIAL_JIFFIES, };

/*
- * Add device- or boot-specific data to the input and nonblocking
- * pools to help initialize them to unique values.
+ * Add device- or boot-specific data to the input pool to help
+ * initialize it.
*
- * None of this adds any entropy, it is meant to avoid the
- * problem of the nonblocking pool having similar initial state
- * across largely identical devices.
+ * None of this adds any entropy; it is meant to avoid the problem of
+ * the entropy pool having similar initial state across largely
+ * identical devices.
*/
void add_device_randomness(const void *buf, unsigned int size)
{
@@ -767,11 +925,6 @@ void add_device_randomness(const void *buf, unsigned int size)
_mix_pool_bytes(&input_pool, buf, size);
_mix_pool_bytes(&input_pool, &time, sizeof(time));
spin_unlock_irqrestore(&input_pool.lock, flags);
-
- spin_lock_irqsave(&nonblocking_pool.lock, flags);
- _mix_pool_bytes(&nonblocking_pool, buf, size);
- _mix_pool_bytes(&nonblocking_pool, &time, sizeof(time));
- spin_unlock_irqrestore(&nonblocking_pool.lock, flags);
}
EXPORT_SYMBOL(add_device_randomness);

@@ -802,7 +955,7 @@ static void add_timer_randomness(struct timer_rand_state *state, unsigned num)
sample.jiffies = jiffies;
sample.cycles = random_get_entropy();
sample.num = num;
- r = nonblocking_pool.initialized ? &input_pool : &nonblocking_pool;
+ r = &input_pool;
mix_pool_bytes(r, &sample, sizeof(sample));

/*
@@ -918,11 +1071,21 @@ void add_interrupt_randomness(int irq, int irq_flags)
fast_mix(fast_pool);
add_interrupt_bench(cycles);

+ if (!crng_ready()) {
+ if ((fast_pool->count >= 64) &&
+ crng_fast_load((char *) fast_pool->pool,
+ sizeof(fast_pool->pool))) {
+ fast_pool->count = 0;
+ fast_pool->last = now;
+ }
+ return;
+ }
+
if ((fast_pool->count < 64) &&
!time_after(now, fast_pool->last + HZ))
return;

- r = nonblocking_pool.initialized ? &input_pool : &nonblocking_pool;
+ r = &input_pool;
if (!spin_trylock(&r->lock))
return;

@@ -966,9 +1129,6 @@ EXPORT_SYMBOL_GPL(add_disk_randomness);
*
*********************************************************************/

-static ssize_t extract_entropy(struct entropy_store *r, void *buf,
- size_t nbytes, int min, int rsvd);
-
/*
* This utility inline function is responsible for transferring entropy
* from the primary pool to the secondary extraction pool. We make
@@ -1143,6 +1303,36 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
memzero_explicit(&hash, sizeof(hash));
}

+static ssize_t _extract_entropy(struct entropy_store *r, void *buf,
+ size_t nbytes, int fips)
+{
+ ssize_t ret = 0, i;
+ __u8 tmp[EXTRACT_SIZE];
+ unsigned long flags;
+
+ while (nbytes) {
+ extract_buf(r, tmp);
+
+ if (fips) {
+ spin_lock_irqsave(&r->lock, flags);
+ if (!memcmp(tmp, r->last_data, EXTRACT_SIZE))
+ panic("Hardware RNG duplicated output!\n");
+ memcpy(r->last_data, tmp, EXTRACT_SIZE);
+ spin_unlock_irqrestore(&r->lock, flags);
+ }
+ i = min_t(int, nbytes, EXTRACT_SIZE);
+ memcpy(buf, tmp, i);
+ nbytes -= i;
+ buf += i;
+ ret += i;
+ }
+
+ /* Wipe data just returned from memory */
+ memzero_explicit(tmp, sizeof(tmp));
+
+ return ret;
+}
+
/*
* This function extracts randomness from the "entropy pool", and
* returns it in a buffer.
@@ -1155,7 +1345,6 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
static ssize_t extract_entropy(struct entropy_store *r, void *buf,
size_t nbytes, int min, int reserved)
{
- ssize_t ret = 0, i;
__u8 tmp[EXTRACT_SIZE];
unsigned long flags;

@@ -1179,27 +1368,7 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
xfer_secondary_pool(r, nbytes);
nbytes = account(r, nbytes, min, reserved);

- while (nbytes) {
- extract_buf(r, tmp);
-
- if (fips_enabled) {
- spin_lock_irqsave(&r->lock, flags);
- if (!memcmp(tmp, r->last_data, EXTRACT_SIZE))
- panic("Hardware RNG duplicated output!\n");
- memcpy(r->last_data, tmp, EXTRACT_SIZE);
- spin_unlock_irqrestore(&r->lock, flags);
- }
- i = min_t(int, nbytes, EXTRACT_SIZE);
- memcpy(buf, tmp, i);
- nbytes -= i;
- buf += i;
- ret += i;
- }
-
- /* Wipe data just returned from memory */
- memzero_explicit(tmp, sizeof(tmp));
-
- return ret;
+ return _extract_entropy(r, buf, nbytes, fips_enabled);
}

/*
@@ -1254,15 +1423,26 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
*/
void get_random_bytes(void *buf, int nbytes)
{
+ __u8 tmp[CHACHA20_BLOCK_SIZE];
+
#if DEBUG_RANDOM_BOOT > 0
- if (unlikely(nonblocking_pool.initialized == 0))
+ if (!crng_ready())
printk(KERN_NOTICE "random: %pF get_random_bytes called "
- "with %d bits of entropy available\n",
- (void *) _RET_IP_,
- nonblocking_pool.entropy_total);
+ "with crng_init = %d\n", (void *) _RET_IP_, crng_init);
#endif
trace_get_random_bytes(nbytes, _RET_IP_);
- extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
+
+ while (nbytes >= CHACHA20_BLOCK_SIZE) {
+ extract_crng(buf);
+ buf += CHACHA20_BLOCK_SIZE;
+ nbytes -= CHACHA20_BLOCK_SIZE;
+ }
+
+ if (nbytes > 0) {
+ extract_crng(tmp);
+ memcpy(buf, tmp, nbytes);
+ memzero_explicit(tmp, nbytes);
+ }
}
EXPORT_SYMBOL(get_random_bytes);

@@ -1280,7 +1460,7 @@ int add_random_ready_callback(struct random_ready_callback *rdy)
unsigned long flags;
int err = -EALREADY;

- if (likely(nonblocking_pool.initialized))
+ if (crng_ready())
return err;

owner = rdy->owner;
@@ -1288,7 +1468,7 @@ int add_random_ready_callback(struct random_ready_callback *rdy)
return -ENOENT;

spin_lock_irqsave(&random_ready_list_lock, flags);
- if (nonblocking_pool.initialized)
+ if (crng_ready())
goto out;

owner = NULL;
@@ -1352,7 +1532,7 @@ void get_random_bytes_arch(void *buf, int nbytes)
}

if (nbytes)
- extract_entropy(&nonblocking_pool, p, nbytes, 0, 0);
+ get_random_bytes(p, nbytes);
}
EXPORT_SYMBOL(get_random_bytes_arch);

@@ -1397,7 +1577,7 @@ static int rand_initialize(void)
{
init_std_data(&input_pool);
init_std_data(&blocking_pool);
- init_std_data(&nonblocking_pool);
+ crng_initialize(&primary_crng);
return 0;
}
early_initcall(rand_initialize);
@@ -1462,19 +1642,15 @@ urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
static int maxwarn = 10;
int ret;

- if (unlikely(nonblocking_pool.initialized == 0) &&
- maxwarn > 0) {
+ if (!crng_ready() && maxwarn > 0) {
maxwarn--;
printk(KERN_NOTICE "random: %s: uninitialized urandom read "
- "(%d bytes read, %d bits of entropy available)\n",
- current->comm, nbytes, nonblocking_pool.entropy_total);
+ "(%d bytes read)\n",
+ current->comm, nbytes);
}
-
nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
- ret = extract_entropy_user(&nonblocking_pool, buf, nbytes);
-
- trace_urandom_read(8 * nbytes, ENTROPY_BITS(&nonblocking_pool),
- ENTROPY_BITS(&input_pool));
+ ret = extract_crng_user(buf, nbytes);
+ trace_urandom_read(8 * nbytes, 0, ENTROPY_BITS(&input_pool));
return ret;
}

@@ -1520,10 +1696,7 @@ static ssize_t random_write(struct file *file, const char __user *buffer,
{
size_t ret;

- ret = write_pool(&blocking_pool, buffer, count);
- if (ret)
- return ret;
- ret = write_pool(&nonblocking_pool, buffer, count);
+ ret = write_pool(&input_pool, buffer, count);
if (ret)
return ret;

@@ -1574,7 +1747,6 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
input_pool.entropy_count = 0;
- nonblocking_pool.entropy_count = 0;
blocking_pool.entropy_count = 0;
return 0;
default:
@@ -1616,11 +1788,10 @@ SYSCALL_DEFINE3(getrandom, char __user *, buf, size_t, count,
if (flags & GRND_RANDOM)
return _random_read(flags & GRND_NONBLOCK, buf, count);

- if (unlikely(nonblocking_pool.initialized == 0)) {
+ if (!crng_ready()) {
if (flags & GRND_NONBLOCK)
return -EAGAIN;
- wait_event_interruptible(urandom_init_wait,
- nonblocking_pool.initialized);
+ crng_wait_ready();
if (signal_pending(current))
return -ERESTARTSYS;
}
@@ -1856,18 +2027,17 @@ void add_hwgenerator_randomness(const char *buffer, size_t count,
{
struct entropy_store *poolp = &input_pool;

- if (unlikely(nonblocking_pool.initialized == 0))
- poolp = &nonblocking_pool;
- else {
- /* Suspend writing if we're above the trickle
- * threshold. We'll be woken up again once below
- * random_write_wakeup_thresh, or when the calling
- * thread is about to terminate.
- */
- wait_event_interruptible(random_write_wait,
- kthread_should_stop() ||
- ENTROPY_BITS(&input_pool) <= random_write_wakeup_bits);
+ if (!crng_ready()) {
+ crng_fast_load(buffer, count);
+ return;
}
+
+ /* Suspend writing if we're above the trickle threshold.
+ * We'll be woken up again once below random_write_wakeup_thresh,
+ * or when the calling thread is about to terminate.
+ */
+ wait_event_interruptible(random_write_wait, kthread_should_stop() ||
+ ENTROPY_BITS(&input_pool) <= random_write_wakeup_bits);
mix_pool_bytes(poolp, buffer, count);
credit_entropy_bits(poolp, entropy);
}
diff --git a/include/crypto/chacha20.h b/include/crypto/chacha20.h
index 274bbae..20d20f68 100644
--- a/include/crypto/chacha20.h
+++ b/include/crypto/chacha20.h
@@ -16,6 +16,7 @@ struct chacha20_ctx {
u32 key[8];
};

+void chacha20_block(u32 *state, void *stream);
void crypto_chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv);
int crypto_chacha20_setkey(struct crypto_tfm *tfm, const u8 *key,
unsigned int keysize);
diff --git a/lib/Makefile b/lib/Makefile
index 499fb35..34e205f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -22,7 +22,7 @@ KCOV_INSTRUMENT_hweight.o := n
lib-y := ctype.o string.o vsprintf.o cmdline.o \
rbtree.o radix-tree.o dump_stack.o timerqueue.o\
idr.o int_sqrt.o extable.o \
- sha1.o md5.o irq_regs.o argv_split.o \
+ sha1.o chacha20.o md5.o irq_regs.o argv_split.o \
flex_proportions.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o
diff --git a/lib/chacha20.c b/lib/chacha20.c
new file mode 100644
index 0000000..250ceed
--- /dev/null
+++ b/lib/chacha20.c
@@ -0,0 +1,79 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/export.h>
+#include <linux/bitops.h>
+#include <linux/cryptohash.h>
+#include <asm/unaligned.h>
+#include <crypto/chacha20.h>
+
+static inline u32 rotl32(u32 v, u8 n)
+{
+ return (v << n) | (v >> (sizeof(v) * 8 - n));
+}
+
+extern void chacha20_block(u32 *state, void *stream)
+{
+ u32 x[16], *out = stream;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(x); i++)
+ x[i] = state[i];
+
+ for (i = 0; i < 20; i += 2) {
+ x[0] += x[4]; x[12] = rotl32(x[12] ^ x[0], 16);
+ x[1] += x[5]; x[13] = rotl32(x[13] ^ x[1], 16);
+ x[2] += x[6]; x[14] = rotl32(x[14] ^ x[2], 16);
+ x[3] += x[7]; x[15] = rotl32(x[15] ^ x[3], 16);
+
+ x[8] += x[12]; x[4] = rotl32(x[4] ^ x[8], 12);
+ x[9] += x[13]; x[5] = rotl32(x[5] ^ x[9], 12);
+ x[10] += x[14]; x[6] = rotl32(x[6] ^ x[10], 12);
+ x[11] += x[15]; x[7] = rotl32(x[7] ^ x[11], 12);
+
+ x[0] += x[4]; x[12] = rotl32(x[12] ^ x[0], 8);
+ x[1] += x[5]; x[13] = rotl32(x[13] ^ x[1], 8);
+ x[2] += x[6]; x[14] = rotl32(x[14] ^ x[2], 8);
+ x[3] += x[7]; x[15] = rotl32(x[15] ^ x[3], 8);
+
+ x[8] += x[12]; x[4] = rotl32(x[4] ^ x[8], 7);
+ x[9] += x[13]; x[5] = rotl32(x[5] ^ x[9], 7);
+ x[10] += x[14]; x[6] = rotl32(x[6] ^ x[10], 7);
+ x[11] += x[15]; x[7] = rotl32(x[7] ^ x[11], 7);
+
+ x[0] += x[5]; x[15] = rotl32(x[15] ^ x[0], 16);
+ x[1] += x[6]; x[12] = rotl32(x[12] ^ x[1], 16);
+ x[2] += x[7]; x[13] = rotl32(x[13] ^ x[2], 16);
+ x[3] += x[4]; x[14] = rotl32(x[14] ^ x[3], 16);
+
+ x[10] += x[15]; x[5] = rotl32(x[5] ^ x[10], 12);
+ x[11] += x[12]; x[6] = rotl32(x[6] ^ x[11], 12);
+ x[8] += x[13]; x[7] = rotl32(x[7] ^ x[8], 12);
+ x[9] += x[14]; x[4] = rotl32(x[4] ^ x[9], 12);
+
+ x[0] += x[5]; x[15] = rotl32(x[15] ^ x[0], 8);
+ x[1] += x[6]; x[12] = rotl32(x[12] ^ x[1], 8);
+ x[2] += x[7]; x[13] = rotl32(x[13] ^ x[2], 8);
+ x[3] += x[4]; x[14] = rotl32(x[14] ^ x[3], 8);
+
+ x[10] += x[15]; x[5] = rotl32(x[5] ^ x[10], 7);
+ x[11] += x[12]; x[6] = rotl32(x[6] ^ x[11], 7);
+ x[8] += x[13]; x[7] = rotl32(x[7] ^ x[8], 7);
+ x[9] += x[14]; x[4] = rotl32(x[4] ^ x[9], 7);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(x); i++)
+ out[i] = cpu_to_le32(x[i] + state[i]);
+
+ state[12]++;
+}
+EXPORT_SYMBOL(chacha20_block);
--
2.5.0

2016-06-13 15:49:28

by Theodore Ts'o

Subject: [PATCH 4/7] random: properly align get_random_int_hash

From: Eric Biggers <[email protected]>

get_random_long() reads from the get_random_int_hash array using an
unsigned long pointer. For this code to be guaranteed correct on all
architectures, the array must be aligned to an unsigned long boundary.

Cc: [email protected]
Signed-off-by: Eric Biggers <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
---
drivers/char/random.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 64e35a4..83f5cd0 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1778,13 +1778,15 @@ int random_int_secret_init(void)
return 0;
}

+static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], get_random_int_hash)
+ __aligned(sizeof(unsigned long));
+
/*
* Get a random word for internal kernel use only. Similar to urandom but
* with the goal of minimal entropy pool depletion. As a result, the random
* value is not cryptographically secure but for several uses the cost of
* depleting entropy is too high
*/
-static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], get_random_int_hash);
unsigned int get_random_int(void)
{
__u32 *hash;
--
2.5.0

2016-06-13 15:49:28

by Theodore Ts'o

Subject: [PATCH 3/7] random: add interrupt callback to VMBus IRQ handler

From: Stephan Mueller <[email protected]>

The Hyper-V Linux Integration Services use the VMBus implementation for
communication with the Hypervisor. VMBus registers its own interrupt
handler that completely bypasses the common Linux interrupt handling.
This implies that the interrupt entropy collector is not triggered.

This patch adds the interrupt entropy collection callback into the VMBus
interrupt handler function.

Cc: [email protected]
Signed-off-by: Stephan Mueller <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
---
drivers/char/random.c | 1 +
drivers/hv/vmbus_drv.c | 3 +++
2 files changed, 4 insertions(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 74596d3..64e35a4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -946,6 +946,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
/* award one bit for the contents of the fast pool */
credit_entropy_bits(r, credit + 1);
}
+EXPORT_SYMBOL_GPL(add_interrupt_randomness);

#ifdef CONFIG_BLOCK
void add_disk_randomness(struct gendisk *disk)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 952f20f..e82f7e1 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -42,6 +42,7 @@
#include <linux/screen_info.h>
#include <linux/kdebug.h>
#include <linux/efi.h>
+#include <linux/random.h>
#include "hyperv_vmbus.h"

static struct acpi_device *hv_acpi_dev;
@@ -806,6 +807,8 @@ static void vmbus_isr(void)
else
tasklet_schedule(hv_context.msg_dpc[cpu]);
}
+
+ add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0);
}


--
2.5.0

2016-06-13 15:49:28

by Theodore Ts'o

Subject: [PATCH 2/7] random: print a warning for the first ten uninitialized random users

Since systemd is consistently using /dev/urandom before it is
initialized, we can't see the other potentially dangerous users of
/dev/urandom immediately after boot. So print the first ten such
complaints instead.

Cc: [email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
---
drivers/char/random.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 4e2627a..74596d3 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1458,12 +1458,16 @@ random_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
static ssize_t
urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
{
+ static int maxwarn = 10;
int ret;

- if (unlikely(nonblocking_pool.initialized == 0))
- printk_once(KERN_NOTICE "random: %s urandom read "
- "with %d bits of entropy available\n",
- current->comm, nonblocking_pool.entropy_total);
+ if (unlikely(nonblocking_pool.initialized == 0) &&
+ maxwarn > 0) {
+ maxwarn--;
+ printk(KERN_NOTICE "random: %s: uninitialized urandom read "
+ "(%d bytes read, %d bits of entropy available)\n",
+ current->comm, nbytes, nonblocking_pool.entropy_total);
+ }

nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
ret = extract_entropy_user(&nonblocking_pool, buf, nbytes);
--
2.5.0

2016-06-13 15:48:38

by Theodore Ts'o

Subject: [PATCH 6/7] random: make /dev/urandom scalable for silly userspace programs

On a 4 socket (NUMA) system where a large number of application
threads were all trying to read from /dev/urandom, this can result in
the system spending 80% of its time contending on the global urandom
spinlock. The application should have used its own PRNG, but let's
try to keep it from running, lemming-like, straight over the locking
cliff.

Reported-by: Andi Kleen <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
---
drivers/char/random.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 58 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 841f9a8..d640865 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -434,6 +434,8 @@ struct crng_state primary_crng = {
*/
static int crng_init = 0;
#define crng_ready() (likely(crng_init > 0))
+static void _extract_crng(struct crng_state *crng,
+ __u8 out[CHACHA20_BLOCK_SIZE]);
static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE]);
static void process_random_ready_list(void);

@@ -754,6 +756,16 @@ static void credit_entropy_bits_safe(struct entropy_store *r, int nbits)

static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);

+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace programs when they are all trying
+ * to access /dev/urandom in parallel. The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
static void crng_initialize(struct crng_state *crng)
{
int i;
@@ -815,7 +827,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
if (num == 0)
return;
} else
- extract_crng(buf.block);
+ _extract_crng(&primary_crng, buf.block);
spin_lock_irqsave(&primary_crng.lock, flags);
for (i = 0; i < 8; i++) {
unsigned long rv;
@@ -835,19 +847,26 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
spin_unlock_irqrestore(&primary_crng.lock, flags);
}

+static inline void maybe_reseed_primary_crng(void)
+{
+ if (crng_init > 2 &&
+ time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+ crng_reseed(&primary_crng, &input_pool);
+}
+
static inline void crng_wait_ready(void)
{
wait_event_interruptible(crng_init_wait, crng_ready());
}

-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+ __u8 out[CHACHA20_BLOCK_SIZE])
{
unsigned long v, flags;
- struct crng_state *crng = &primary_crng;

if (crng_init > 1 &&
time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
- crng_reseed(crng, &input_pool);
+ crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
spin_lock_irqsave(&crng->lock, flags);
if (arch_get_random_long(&v))
crng->state[14] ^= v;
@@ -857,6 +876,19 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
spin_unlock_irqrestore(&crng->lock, flags);
}

+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+ struct crng_state *crng = NULL;
+
+#ifdef CONFIG_NUMA
+ if (crng_node_pool)
+ crng = crng_node_pool[numa_node_id()];
+ if (crng == NULL)
+#endif
+ crng = &primary_crng;
+ _extract_crng(crng, out);
+}
+
static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
{
ssize_t ret = 0, i;
@@ -1575,9 +1607,31 @@ static void init_std_data(struct entropy_store *r)
*/
static int rand_initialize(void)
{
+#ifdef CONFIG_NUMA
+ int i;
+ int num_nodes = num_possible_nodes();
+ struct crng_state *crng;
+ struct crng_state **pool;
+#endif
+
init_std_data(&input_pool);
init_std_data(&blocking_pool);
crng_initialize(&primary_crng);
+
+#ifdef CONFIG_NUMA
+ pool = kmalloc(num_nodes * sizeof(void *),
+ GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
+ for (i=0; i < num_nodes; i++) {
+ crng = kmalloc_node(sizeof(struct crng_state),
+ GFP_KERNEL | __GFP_NOFAIL, i);
+ spin_lock_init(&crng->lock);
+ crng_initialize(crng);
+ pool[i] = crng;
+
+ }
+ mb();
+ crng_node_pool = pool;
+#endif
return 0;
}
early_initcall(rand_initialize);
--
2.5.0

2016-06-13 18:00:36

by Stephan Müller

Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On Monday, 13 June 2016 at 11:48:37, Theodore Ts'o wrote:

Hi Theodore,

> /*
> @@ -1254,15 +1423,26 @@ static ssize_t extract_entropy_user(struct
> entropy_store *r, void __user *buf, */
> void get_random_bytes(void *buf, int nbytes)
> {
> + __u8 tmp[CHACHA20_BLOCK_SIZE];
> +
> #if DEBUG_RANDOM_BOOT > 0
> - if (unlikely(nonblocking_pool.initialized == 0))
> + if (!crng_ready())
> printk(KERN_NOTICE "random: %pF get_random_bytes called "
> - "with %d bits of entropy available\n",
> - (void *) _RET_IP_,
> - nonblocking_pool.entropy_total);
> + "with crng_init = %d\n", (void *) _RET_IP_, crng_init);
> #endif
> trace_get_random_bytes(nbytes, _RET_IP_);
> - extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
> +
> + while (nbytes >= CHACHA20_BLOCK_SIZE) {
> + extract_crng(buf);
> + buf += CHACHA20_BLOCK_SIZE;
> + nbytes -= CHACHA20_BLOCK_SIZE;
> + }
> +
> + if (nbytes > 0) {
> + extract_crng(tmp);
> + memcpy(buf, tmp, nbytes);
> + memzero_explicit(tmp, nbytes);
> + }

What is your take on the following issue:

1. The ChaCha20 is seeded with 256 bits (let us assume it is full entropy)

2. The ChaCha20 block operation shuffles the 256 bits of entropy over the 512
bit state -- already here we see that after shuffling, the entropy to bit
ratio fell from (256 bits of entropy / 256 data bits) to (256 bits of entropy
/ 512 data bits).

3. The code above directly returns the output of the ChaCha20 round to the
caller. Considering the discussion in step 2, I would assume that the entropy
content of the output size is cut in half.
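
(Stated as an inequality: for a 256-bit key K and one 512-bit output
block O = ChaCha20-block(K), H(O) <= H(K) = 256 bits, so at most 256
bits of entropy are spread across 512 output bits.)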

Ciao
Stephan

2016-06-13 19:04:12

by Theodore Ts'o

Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On Mon, Jun 13, 2016 at 08:00:33PM +0200, Stephan Mueller wrote:
>
> 1. The ChaCha20 is seeded with 256 bits (let us assume it is full entropy)
>
> 2. The ChaCha20 block operation shuffles the 256 bits of entropy over the 512
> bit state -- already here we see that after shuffling, the entropy to bit
> ratio fell from (256 bits of entropy / 256 data bits) to (256 bits of entropy
> / 512 data bits).
>
> 3. The code above directly returns the output of the ChaCha20 round to the
> caller. Considering the discussion in step 2, I would assume that the entropy
> content of the output size is cut in half.

This is a CSRNG, not an entropy pool. One could make the same
argument about an AES counter-based DRBG. So you could start with a
256-bit AES key, and then after you return ENCRYPT(AES_KEY, COUNTER++),
you've "halved" the entropy in the AES CTR-DRBG state. Oh, noes!!!

The reason why I don't view this as a problem is because we're
assuming that the cipher algorithm is secure. With /dev/urandom we
were always emitting more bytes than we had entropy available, because
not blocking was considered more important. Previously we were
relying on the security of SHA-1. With AES CTR-DRBG, you rely on the
security with AES. With this patch, we're relying on the security of
Chacha20. Given that this is being used in TLS for more and more
mobile connections, I'm comfortable with that choice.

Someone who doesn't trust the security of our underlying algorithms,
and wants to trust solely in the correctness of our entropy estimation
algorithms, can always use /dev/random.


So to be clear about what we're doing, ChaCha20 uses as its internal
state 16 bytes of constant, followed by 32 bytes of key, followed by
16 bytes of counter. It is a stream cipher where each successive
block in the keystream is generated by bumping the counter, and
encryption is done by XOR'ing the keystream with the plaintext.

If you ignore the backtracking protection introduced in patch 7/7 in
this series (which uses unreleased portions of the previous keystream
to mutate the key for future CRNG outputs), what we're doing is using
the keystream as the output for the CRNG/CSRNG, and morally, this is
no different from a CTR-DRBG as defined in SP 800-90A.
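
Concretely, the state layout and per-block extraction in patch 5/7
look like this (a sketch, omitting the locking and reseed logic):

	__u32 state[16];	/* [0..3]   "expand 32-byte k" constant
				 * [4..11]  256-bit key
				 * [12..15] counter */
	__u8 out[CHACHA20_BLOCK_SIZE];

	chacha20_block(&state[0], out);	/* emit one 64-byte keystream
					 * block; the counter is bumped
					 * inside chacha20_block() */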

Cheers,

- Ted

2016-06-15 14:59:17

by Herbert Xu

Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On Mon, Jun 13, 2016 at 11:48:37AM -0400, Theodore Ts'o wrote:
> The CRNG is faster, and we don't pretend to track entropy usage in the
> CRNG any more.
>
> Signed-off-by: Theodore Ts'o <[email protected]>
> ---
> crypto/chacha20_generic.c | 61 --------
> drivers/char/random.c | 374 +++++++++++++++++++++++++++++++++-------------
> include/crypto/chacha20.h | 1 +
> lib/Makefile | 2 +-
> lib/chacha20.c | 79 ++++++++++
> 5 files changed, 353 insertions(+), 164 deletions(-)
> create mode 100644 lib/chacha20.c
>
> diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
> index da9c899..1cab831 100644
> --- a/crypto/chacha20_generic.c
> +++ b/crypto/chacha20_generic.c

I think you should be accessing this through the crypto API rather
than going direct. We already have at least one accelerated
implementation of chacha20 and there may well be more of them
in future. Going through the crypto API means that you will
automatically pick up the best implementation for the platform.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2016-06-19 23:18:51

by Theodore Ts'o

Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On Wed, Jun 15, 2016 at 10:59:08PM +0800, Herbert Xu wrote:
> I think you should be accessing this through the crypto API rather
> than going direct. We already have at least one accelerated
> implementation of chacha20 and there may well be more of them
> in future. Going through the crypto API means that you will
> automatically pick up the best implementation for the platform.

While there are some benefits of going through the crypto API, there
are some downsides as well:

A) Unlike using ChaCha20 in cipher mode, we only need the keystream;
we don't need to XOR the output with plaintext. We could supply a
dummy zero-filled buffer to achieve the same result, but now the
"accelerated" version is having to do an extra memory reference. Even
if the L1 cache is big enough that we're not going all the way out
to DRAM, we're putting additional pressure on the D cache.

B) The anti-backtracking feature involves taking the existing key and
XOR'ing it with unused output from the keystream. We can't do that
using the Crypto API without keeping our own copy of the key, and then
calling setkey --- which means yet more extra memory references.
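
(For reference, the core of that operation in patch 7/7 --- after
handing out "used" bytes of a 64-byte keystream block tmp, the 32
leftover bytes are XOR'ed over the key words, destroying the old key:)

	s = (__u32 *) &tmp[used];
	d = &crng->state[4];
	for (i = 0; i < 8; i++)
		*d++ ^= *s++;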

C) Simply compiling in the Crypto layer and the ChaCha20 generic
handling (all of which is doing extra work which we would then be
undoing in the random layer --- and I haven't included the extra code
in the random driver needed to interface with the crypto layer) costs
an extra 20k. That's roughly the amount of extra kernel bloat that
the Linux kernel's allnoconfig build grew, version to version, from
3.0 to 3.16. I don't have the numbers from the more recent kernels,
but roughly speaking, we would be responsible for **all** of the extra
kernel bloat (and if there was any extra kernel bloat, we would be
helping to double it) in the kernel release where this code would go
in. I suspect the folks involved with the kernel tinification efforts
wouldn't exactly be pleased with this.

Yes, I understand the argument that the networking stack is now
requiring the crypto layer --- but not all IOT devices may necessarily
require the IP stack (they might be using some alternate wireless
communications stack) and I'd much rather not make things worse.


The final thing is that it's not at all clear that the accelerated
implementation is all that important anyway. Consider the following
two results using the unaccelerated ChaCha20:

% dd if=/dev/urandom bs=4M count=32 of=/dev/null
32+0 records in
32+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 1.18647 s, 113 MB/s

% dd if=/dev/urandom bs=32 count=4194304 of=/dev/null
4194304+0 records in
4194304+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 7.08294 s, 18.9 MB/s

So in both cases, we are reading 128M from the CRNG. In the first
case, we see the sort of speed we would get if we were using the CRNG
for some illegitimate purpose, such as "dd if=/dev/urandom of=/dev/sdX bs=4M"
(because they were too lazy to type "apt-get install nwipe").

In the second case, we see the use of /dev/urandom in a much more
reasonable, proper, real-world use case for /dev/urandom, which is some
userspace process needing a 256 bit session key for a TLS connection,
or some such. In this case, we see that the other overheads of
providing the anti-backtracking protection, system call overhead,
etc., completely dominate the speed of the core crypto primitive.
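
(The per-call cost falls out of the second measurement: 134217728
bytes / 7.08294 s is about 18.9 MB/s, i.e. roughly 1.7 microseconds
per 32-byte read --- the source of the 1.68 microsecond figure below.)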

So even if the AVX optimized version is 100% faster than the generic version,
it would change the time needed to create a 256 byte session key from
1.68 microseconds to 1.55 microseconds. And this is ignoring the
extra overhead needed to set up AVX, the fact that this will require
the kernel to do extra work doing the XSAVE and XRESTORE because of
the use of the AVX registers, etc.

The bottom line is that ChaCha20 optimizations might be great for bulk
encryption, but for the purposes of generating 256 byte session keys,
I don't think the benefits outweigh the costs.
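
(For concreteness, the consumer this is tuned for looks something like
the sketch below --- a hypothetical helper, error handling elided; note
that glibc has no getrandom() wrapper yet, hence the raw syscall:)

	#include <unistd.h>
	#include <sys/syscall.h>

	static int get_session_key(unsigned char key[32])
	{
		/* 256-bit key: half of one ChaCha20 output block */
		return syscall(SYS_getrandom, key, 32, 0);
	}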

Cheers,

- Ted

2016-06-20 01:25:28

by Herbert Xu

Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On Sun, Jun 19, 2016 at 07:18:28PM -0400, Theodore Ts'o wrote:
>
> C) Simply compiling in the Crypto layer and the ChaCha20 generic
> handling (all of which is doing extra work which we would then be
> undoing in the random layer --- and I haven't included the extra code
> in the random driver needed interface with the crypto layer) costs an
> extra 20k. That's roughly the amount of extra kernel bloat that the
> Linux kernel grew in its allnoconfig from version to version from 3.0
> to 3.16. I don't have the numbers from the more recent kernels, but
> roughly speaking, we would be responsible for **all** of the extra
> kernel bloat (and if there was any extra kernel bloat, we would
> helping to double it) in the kernel release where this code would go
> in. I suspect the folks involved with the kernel tinificaiton efforts
> wouldn't exactly be pleased with this.
>
> Yes, I understand the argument that the networking stack is now
> requiring the crypto layer --- but not all IOT devices may necessarily
> require the IP stack (they might be using some alternate wireless
> communications stack) and I'd much rather not make things worse.

Sure, but 99% of the kernels out there will have a crypto API.
So why not use it if it's there and use the standalone chacha
code otherwise?

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2016-06-20 05:02:16

by Theodore Ts'o

Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On Mon, Jun 20, 2016 at 09:25:28AM +0800, Herbert Xu wrote:
> > Yes, I understand the argument that the networking stack is now
> > requiring the crypto layer --- but not all IOT devices may necessarily
> > require the IP stack (they might be using some alternate wireless
> > communications stack) and I'd much rather not make things worse.
>
> Sure, but 99% of the kernels out there will have a crypto API.
> So why not use it if it's there and use the standalone chacha
> code otherwise?

It's work that I'm not convinced is worth the gain? Perhaps I
shouldn't have buried the lede, but repeating a paragraph from later
in the message:

So even if the AVX optimized version is 100% faster than the generic version,
it would change the time needed to create a 256 byte session key from
1.68 microseconds to 1.55 microseconds. And this is ignoring the
extra overhead needed to set up AVX, the fact that this will require
the kernel to do extra work doing the XSAVE and XRESTORE because of
the use of the AVX registers, etc.

So in the absolute best case, this improves the time needed to create
a 256 bit session key by 0.13 microseconds. And that assumes that the
extra setup and teardown overhead of an AVX optimized ChaCha20
(including the XSAVE and XRESTORE of the AVX registers, etc.) don't
end up making the CRNG **slower**.

The thing to remember about these optimizations is that they are great
for bulk encryption, but that's not what getrandom(2) and
get_random_bytes() are used for, in general. We don't need to create
multiple megabytes of random numbers at a time. We need to create
them 256 bits at a time, with anti-backtracking protections in
between. Think of this as the random number equivalent of artisanal
beer making, as opposed to Budweiser beer, which ferments the beer
literally in pipelines. :-)

Yes, Budweiser may be made more efficiently using continuous
fermentation --- but would you want to drink it? And if you have to
constantly start and stop the continuous fermentation pipeline, the net
result can actually be less efficient compared to doing it right in
the first place....

- Ted

P.S. I haven't measured this to see, mainly because I really don't
care about the difference between 1.68 vs 1.55 microseconds, but there
is a good chance in the crypto layer that it might be a good idea to
have the system be smart enough to automatically fall back to using
the **non** optimized version if you only need to encrypt a small
amount of data.

2016-06-20 06:03:49

by Herbert Xu

Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On Mon, Jun 20, 2016 at 01:02:03AM -0400, Theodore Ts'o wrote:
>
> It's work that I'm not convinced is worth the gain? Perhaps I
> shouldn't have buried the lede, but repeating a paragraph from later
> in the message:
>
> > So even if the AVX optimized version is 100% faster than the generic version,
> it would change the time needed to create a 256 byte session key from
> 1.68 microseconds to 1.55 microseconds. And this is ignoring the
> extra overhead needed to set up AVX, the fact that this will require
> the kernel to do extra work doing the XSAVE and XRESTORE because of
> the use of the AVX registers, etc.

We do have figures on the efficiency of the accelerated chacha
implementation on 256-byte requests (I've picked the 8-block
version):

testing speed of chacha20 (chacha20-generic) encryption
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)

testing speed of chacha20 (chacha20-simd) encryption
test 2 (256 bit key, 256 byte blocks): 33028112 operations in 10 seconds (8455196672 bytes)

So it is a little bit more than 100%.
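
(In per-request terms, that works out to 10 s / 12702056 = 787 ns per
256-byte block for the generic code, versus 10 s / 33028112 = 303 ns
for the SIMD version.)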

> So in the absolute best case, this improves the time needed to create
> a 256 bit session key by 0.13 microseconds. And that assumes that the
> extra setup and teardown overhead of an AVX optimized ChaCha20
> (including the XSAVE and XRESTORE of the AVX registers, etc.) don't
> end up making the CRNG **slower**.

The figures above include all of these overheads. The overheads
really only show up on 16-byte requests.

> P.S. I haven't measured this to see, mainly because I really don't
> care about the difference between 1.68 vs 1.55 microseconds, but there
> is a good chance in the crypto layer that it might be a good idea to
> have the system be smart enough to automatically fall back to using
> the **non** optimized version if you only need to encrypt a small
> amount of data.

You're right. chacha20-simd should use the generic version on
16-byte requests, which is the only place where it is slower.
Something like this:

---8<---
Subject: crypto: chacha20-simd - Use generic code for small requests

On 16-byte requests the optimised version is actually slower than
the generic code, so we should simply use that instead.

Signed-off-by: Herbert Xu <[email protected]>

diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
index 2d5c2e0b..f910d1d 100644
--- a/arch/x86/crypto/chacha20_glue.c
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -70,7 +70,7 @@ static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
struct blkcipher_walk walk;
int err;

- if (!may_use_simd())
+ if (nbytes <= CHACHA20_BLOCK_SIZE || !may_use_simd())
return crypto_chacha20_crypt(desc, dst, src, nbytes);

state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2016-06-20 15:01:47

by Theodore Ts'o

Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On Mon, Jun 20, 2016 at 01:19:17PM +0800, Herbert Xu wrote:
> On Mon, Jun 20, 2016 at 01:02:03AM -0400, Theodore Ts'o wrote:
> >
> > It's work that I'm not convinced is worth the gain? Perhaps I
> > shouldn't have buried the lede, but repeating a paragraph from later
> > in the message:
> >
> > > So even if the AVX optimized version is 100% faster than the generic version,
> > it would change the time needed to create a 256 byte session key from
> > 1.68 microseconds to 1.55 microseconds. And this is ignoring the
> > extra overhead needed to set up AVX, the fact that this will require
> > the kernel to do extra work doing the XSAVE and XRESTORE because of
> > the use of the AVX registers, etc.
>
> We do have figures on the efficiency of the accelerated chacha
> implementation on 256-byte requests (I've picked the 8-block
> version):

Sorry, I typo'ed this. s/bytes/bits/. 256 bits / 32 bytes is the
much more common amount that someone might be trying to extract, to
get a 256 **bit** session key.

And also note my comments about how we need to permute the key
directly, and not just go through the set_key abstraction. And when
you did your benchmarks, how often was XSAVE / XRESTORE happening ---
in between every single block operation?

Remember, what we're talking about for getrandom(2) in the most common
case is a syscall, extracting 32 bytes' worth of keystream, ***NOT***
XOR'ing it with a plaintext buffer, and then permuting the key.
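
To make that concrete, here is a minimal userspace sketch of the common
case (assuming <sys/syscall.h> defines SYS_getrandom; error handling
trimmed to the essentials):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	unsigned char key[32];	/* one 256-bit session key */

	/* The common case: one syscall, 32 bytes of keystream out,
	 * no XOR with any plaintext buffer. */
	if (syscall(SYS_getrandom, key, sizeof(key), 0) != (long) sizeof(key)) {
		perror("getrandom");
		return 1;
	}
	/* ... hand key to the TLS library ... */
	return 0;
}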

So simply doing chacha20 encryption in a tight loop in the kernel
might not be a good proxy for what would actually happen in real life
when someone calls getrandom(2). (Another good question to ask is
when someone might be needing to generate millions of 256-bit session
keys per second, when the D-H setup, even if you were using ECDH,
would be largely dominating the time for the connection setup anyway.)

Cheers,

- Ted

2016-06-20 16:43:28

by Stephan Müller

[permalink] [raw]
Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

Am Montag, 20. Juni 2016, 11:01:47 schrieb Theodore Ts'o:

Hi Theodore,

>
> So simply doing chacha20 encryption in a tight loop in the kernel
> might not be a good proxy for what would actually happen in real life
> when someone calls getrandom(2). (Another good question to ask is
> when someone might be needing to generate millions of 256-bit session
> keys per second, when the D-H setup, even if you were using ECDH,
> would be largely dominating the time for the connection setup anyway.)

Is speed everything we should care about? What about:

- offloading of crypto operation from the CPU

- potentially additional security features a hardware cipher may provide like
cache coloring attack resistance?

Ciao
Stephan

2016-06-20 18:55:11

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On 06/20/16 08:49, Stephan Mueller wrote:
> Am Montag, 20. Juni 2016, 11:01:47 schrieb Theodore Ts'o:
>
> Hi Theodore,
>
>>
>> So simply doing chacha20 encryption in a tight loop in the kernel
>> might not be a good proxy for what would actually happen in real life
>> when someone calls getrandom(2). (Another good question to ask is
>> when someone might be needing to generate millions of 256-bit session
>> keys per second, when the D-H setup, even if you were using ECDH,
>> would be largely dominating the time for the connection setup anyway.)
>
> Is speed everything we should care about? What about:
>
> - offloading of crypto operation from the CPU
>

This sounds like a speed optimization (and very unlikely to be a win
given the usage).

> - potentially additional security features a hardware cipher may provide like
> cache coloring attack resistance?

How about burning that bridge when and if we get to it? It sounds very
hypothetical.

I guess I could add in some comments here about how a lot of these
problems can be eliminated by offloading an entire DRNG into hardware,
but I don't think it is productive.

-hpa

2016-06-20 23:48:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On Mon, Jun 20, 2016 at 05:49:17PM +0200, Stephan Mueller wrote:
>
> Is speed everything we should care about? What about:
>
> - offloading of crypto operation from the CPU

In practice CPU offload is not helpful, and in fact, in most cases is
harmful, when one is only encrypting a tiny amount of data. That's
because the cost of setup and teardown, not to mention key scheduling,
dominates. This is less true in the case of the SIMD / AVX
optimizations --- but that's because these are CPU instructions, and
there really isn't any CPU offloading going on.

> - potentially additional security features a hardware cipher may provide like
> cache coloring attack resistance?

Um.... have you even taken a *look* at how ChaCha20 is implemented?
*What* cache coloring attack is possible at all, period?

Hint: where are the lookup tables? Where are the data-dependent
memory accesses in the ChaCha20 core?
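
For reference, the whole ChaCha20 core is built out of quarter rounds
like the following (a sketch of the standard ARX construction, not a
copy of the kernel's implementation):

#include <stdint.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* Nothing but 32-bit adds, XORs, and fixed-distance rotates: no
 * S-boxes, no lookup tables, no data-dependent memory accesses or
 * branches anywhere in the core. */
static void chacha20_quarter_round(uint32_t *a, uint32_t *b,
				   uint32_t *c, uint32_t *d)
{
	*a += *b; *d ^= *a; *d = ROTL32(*d, 16);
	*c += *d; *b ^= *c; *b = ROTL32(*b, 12);
	*a += *b; *d ^= *a; *d = ROTL32(*d, 8);
	*c += *d; *b ^= *c; *b = ROTL32(*b, 7);
}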

- Ted

2016-06-26 18:47:48

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

Hi!

> Yes, I understand the argument that the networking stack is now
> requiring the crypto layer --- but not all IOT devices may necessarily
> require the IP stack (they might be using some alternate wireless
> communications stack) and I'd much rather not make things worse.
>
>
> The final thing is that it's not at all clear that the accelerated
> implementation is all that important anyway. Consider the following
> two results using the unaccelerated ChaCha20:
>
> % dd if=/dev/urandom bs=4M count=32 of=/dev/null
> 32+0 records in
> 32+0 records out
> 134217728 bytes (134 MB, 128 MiB) copied, 1.18647 s, 113 MB/s
>
> % dd if=/dev/urandom bs=32 count=4194304 of=/dev/null
> 4194304+0 records in
> 4194304+0 records out
> 134217728 bytes (134 MB, 128 MiB) copied, 7.08294 s, 18.9 MB/s
>
> So in both cases, we are reading 128M from the CRNG. In the first
> case, we see the sort of speed we would get if we were using the CRNG
> for some illegitimate use, such as "dd if=/dev/urandom of=/dev/sdX bs=4M"
> (because they were too lazy to type "apt-get install nwipe").
>
> In the second case, we see the use of /dev/urandom in a much more
> reasonable, proper, real-world use case for /dev/urandom, which is some
> userspace process needing a 256 bit session key for a TLS connection,
> or some such. In this case, we see that the other overheads of
> providing the anti-backtracking protection, system call overhead,
> etc., completely dominate the speed of the core crypto primitive.
>
> So even if the AVX optimized is 100% faster than the generic version,
> it would change the time needed to create a 256 byte session key from
> 1.68 microseconds to 1.55 microseconds. And this is ignoring the

Ok, so let's say I'm writing some TLS server, and I know that traffic
is currently heavy because it was heavy in the last 5 minutes. Would it
make sense for me to request 128M of randomness from /dev/urandom, and
then use that internally, to avoid the syscall overhead?

Ok, maybe 128M is a bit much, because by requesting that much in a single
request I'd turn urandom into a PRNG, but perhaps a 1MB block makes sense?

And I guess even requesting 128M would make sense, as the kernel can
select the best crypto implementation for the CRNG, and I'd prefer to
avoid that code in my application as it is hardware-specific...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2016-06-26 18:47:57

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 7/7] random: add backtracking protection to the CRNG

On Mon 2016-06-13 11:48:39, Theodore Ts'o wrote:
> Signed-off-by: Theodore Ts'o <[email protected]>
> ---
> drivers/char/random.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 49 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index d640865..963a6a9 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -436,7 +436,8 @@ static int crng_init = 0;
> #define crng_ready() (likely(crng_init > 0))
> static void _extract_crng(struct crng_state *crng,
> __u8 out[CHACHA20_BLOCK_SIZE]);
> -static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE]);
> +static void _crng_backtrack_protect(struct crng_state *crng,
> + __u8 tmp[CHACHA20_BLOCK_SIZE], int used);
> static void process_random_ready_list(void);

You can do u8 and u32 in the code, we know we are in kernel.

> +/*
> + * Use the leftover bytes from the CRNG block output (if there is
> + * enough) to mutate the CRNG key to provide backtracking protection.
> + */
> +static void _crng_backtrack_protect(struct crng_state *crng,
> + __u8 tmp[CHACHA20_BLOCK_SIZE], int used)
> +{
> + unsigned long flags;
> + __u32 *s, *d;
> + int i;
> +
> + used = round_up(used, sizeof(__u32));
> + if (used + CHACHA20_KEY_SIZE > CHACHA20_BLOCK_SIZE) {
> + extract_crng(tmp);
> + used = 0;
> + }
> + spin_lock_irqsave(&crng->lock, flags);
> + s = (__u32 *) &tmp[used];
> + d = &crng->state[4];
> + for (i=0; i < 8; i++)
> + *d++ ^= *s++;
> + spin_unlock_irqrestore(&crng->lock, flags);
> +}

You are basically trying to turn CRNG into one way hash function here,
right? Do you have any explanation that it has the required
properties?

Thanks,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2016-06-26 19:10:29

by Stephan Müller

[permalink] [raw]
Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

Am Sonntag, 26. Juni 2016, 20:47:43 schrieb Pavel Machek:

Hi Pavel,

> Hi!
>
> > Yes, I understand the argument that the networking stack is now
> > requiring the crypto layer --- but not all IOT devices may necessarily
> > require the IP stack (they might be using some alternate wireless
> > communications stack) and I'd much rather not make things worse.
> >
> >
> > The final thing is that it's not at all clear that the accelerated
> > implementation is all that important anyway. Consider the following
> > two results using the unaccelerated ChaCha20:
> >
> > % dd if=/dev/urandom bs=4M count=32 of=/dev/null
> > 32+0 records in
> > 32+0 records out
> > 134217728 bytes (134 MB, 128 MiB) copied, 1.18647 s, 113 MB/s
> >
> > % dd if=/dev/urandom bs=32 count=4194304 of=/dev/null
> > 4194304+0 records in
> > 4194304+0 records out
> > 134217728 bytes (134 MB, 128 MiB) copied, 7.08294 s, 18.9 MB/s
> >
> > So in both cases, we are reading 128M from the CRNG. In the first
> > case, we see the sort of speed we would get if we were using the CRNG
> > for some illegitimate use, such as "dd if=/dev/urandom of=/dev/sdX bs=4M"
> > (because they were too lazy to type "apt-get install nwipe").
> >
> > In the second case, we see the use of /dev/urandom in a much more
> > reasonable, proper, real-world use case for /dev/urandom, which is some
> > userspace process needing a 256 bit session key for a TLS connection,
> > or some such. In this case, we see that the other overheads of
> > providing the anti-backtracking protection, system call overhead,
> > etc., completely dominate the speed of the core crypto primitive.
> >
> > So even if the AVX optimized is 100% faster than the generic version,
> > it would change the time needed to create a 256 byte session key from
> > 1.68 microseconds to 1.55 microseconds. And this is ignoring the
>
> Ok, so let's say I'm writing some TLS server, and I know that traffic
> is currently heavy because it was heavy in the last 5 minutes. Would it
> make sense for me to request 128M of randomness from /dev/urandom, and
> then use that internally, to avoid the syscall overhead?
>
> Ok, maybe 128M is a bit much, because by requesting that much in a single
> request I'd turn urandom into a PRNG, but perhaps a 1MB block makes sense?

I would definitely say that this is appropriate when you use the data stream
from urandom directly as keying material.
>
> And I guess even requesting 128M would make sense, as the kernel can
> select the best crypto implementation for the CRNG, and I'd prefer to
> avoid that code in my application as it is hardware-specific...

The code in the server should just make sure it memset(0)s any of the
random data immediately after use. This way, even if the server dumps
core, all you see is *unused* random data, which does not hurt.
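
One caveat: a plain memset() on a buffer that is about to go out of
scope can be legally optimized away by the compiler. A common
workaround is to call memset() through a volatile function pointer (a
sketch; memzero_explicit() in the kernel or explicit_bzero(3) in newer
libcs serve the same purpose):

#include <string.h>

/* The volatile function pointer keeps the compiler from proving the
 * store is dead and deleting it. */
static void *(*const volatile memset_v)(void *, int, size_t) = memset;

static void wipe(void *p, size_t n)
{
	memset_v(p, 0, n);
}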

Ciao
Stephan

2016-06-26 22:51:58

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 5/7] random: replace non-blocking pool with a Chacha20-based CRNG

On Sun, Jun 26, 2016 at 08:47:43PM +0200, Pavel Machek wrote:
> Ok, so let's say I'm writing some TLS server, and I know that traffic
> is currently heavy because it was heavy in the last 5 minutes. Would it
> make sense for me to request 128M of randomness from /dev/urandom, and
> then use that internally, to avoid the syscall overhead?

Probably not. Keep in mind that even requesting a 256-bit key one at
a time, it's still only taking 1.7 microseconds per key. The time to do
the Diffie-Hellman will vary depending on whether you are doing DH in a
discrete-log field or an ECC group, but whether it's around 18-20ms or
3-4ms, it's over *three* orders of magnitude more than the time it
takes to generate a random session key.

So trying to grab 1M of randomness and managing it in your TLS server
is almost certainly optimizing for The Wrong Thing. (Not to mention
the potential for disaster if that randomness gets stolen via some
kind of wild pointer bug, etc., etc.)

This is why I think it's a waste of time talking about trying to use an
AVX or SIMD optimized version of ChaCha20. Even if the cost to do the
XSAVE / XRESTORE doesn't crush the optimization advantages of
ChaCha20, and if you ignore the costs of adding backtracking defenses,
etc., bloating the kernel by adding support for the crypto layer
doesn't make sense when using the generic ChaCha20 already gets you
down into the microsecond arena. You might be able to get the session
key generation down by another .3 or .4 microseconds, but even if you
cut it down by half to .9 microseconds, the public key operations and
the Diffie-Hellman operations are going to make such savings almost
completely unnoticeable.

- Ted

2016-06-26 23:05:55

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 7/7] random: add backtracking protection to the CRNG

On Sun, Jun 26, 2016 at 08:47:53PM +0200, Pavel Machek wrote:
>
> You are basically trying to turn CRNG into one way hash function here,
> right? Do you have any explanation that it has the required
> properties?

Well, not really. A CRNG has the property that if you generate a
series of outputs: O_N-1, O_N, O_N+1, etc., knowledge of O_N does not
give you any special knowledge with respect to O_N+1 or O_N-1.

The anti-backtracking protection means that when we generate O_N, we
use O_N+1 to mutate the state used for the CRNG; specifically, we are
XOR'ing O_N+1 into the state. Now let's suppose that state gets
exposed. Even if you know O_N, that's not going to let you know
O_N+1, so knowledge of the exposed state post XOR with O_N+1 isn't
going to help you get back the original state.

More generally, if we assume ChaCha20 is secure, that means that you
can't derive the key even if you have known plaintext. The output of
the CRNG is basically the keystream --- what you have after you XOR
the ciphertext with the plaintext. If ChaCha20 is secure, knowledge
of large portions of the keystream should not help you determine the
key, which is why knowledge of O_N-1 and O_N won't help you derive
either (a) the state of the CRNG, aka the ChaCha20 key, or (b) O_N+1.
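
In other words, the per-read flow is conceptually something like this
(a sketch of the idea only, assuming a chacha20_block() primitive that
emits one 64-byte keystream block and advances the block counter; the
real kernel code is in the patch being discussed):

#include <stdint.h>
#include <string.h>

#define CHACHA20_BLOCK_SIZE 64

void chacha20_block(uint32_t state[16], uint8_t out[CHACHA20_BLOCK_SIZE]);

void get_random_block(uint32_t state[16], uint8_t out[CHACHA20_BLOCK_SIZE])
{
	uint8_t tmp[CHACHA20_BLOCK_SIZE];
	uint32_t s[8];
	int i;

	chacha20_block(state, out);	/* O_N: handed to the caller */
	chacha20_block(state, tmp);	/* O_N+1: never leaves the kernel */

	/* Fold the unseen block into the key (state words 4..11), so
	 * the key that produced O_N can no longer be reconstructed. */
	memcpy(s, tmp, sizeof(s));
	for (i = 0; i < 8; i++)
		state[4 + i] ^= s[i];

	memset(tmp, 0, sizeof(tmp));	/* don't leave O_N+1 around */
}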

Cheers,

- Ted

2016-08-21 09:53:15

by Jan Varho

[permalink] [raw]
Subject: Re: [PATCH 6/7] random: make /dev/urandom scalable for silly userspace programs

On Mon, Jun 13, 2016 at 6:48 PM, Theodore Ts'o <[email protected]> wrote:
> +static inline void maybe_reseed_primary_crng(void)
> +{
> + if (crng_init > 2 &&
> + time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
> + crng_reseed(&primary_crng, &input_pool);
> +}

Hi,

Is the above function (which is now in 4.8-rc2) supposed to do
something? It seems to have no callers and the maximum value of
crng_init is 2.
--
Jan Varho

2016-08-21 11:37:47

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 6/7] random: make /dev/urandom scalable for silly userspace programs

On Sun, Aug 21, 2016 at 12:53:15PM +0300, Jan Varho wrote:
> On Mon, Jun 13, 2016 at 6:48 PM, Theodore Ts'o <[email protected]> wrote:
> > +static inline void maybe_reseed_primary_crng(void)
> > +{
> > + if (crng_init > 2 &&
> > + time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
> > + crng_reseed(&primary_crng, &input_pool);
> > +}
>
> Is the above function (which is now in 4.8-rc2) supposed to do
> something? It seems to have no callers and the maximum value of
> crng_init is 2.

It's dead code. Its functionality got moved into _extract_crng(), and
you're right, these days crng_init never gets above 2. Thanks for
pointing that out. I'll take it out as a cleanup patch.

- Ted