2014-06-09 00:05:32

by George Spelvin

[permalink] [raw]
Subject: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?

I have a few assorted cleanups in the works, but I was thinking about
the following, and it has led to some thoughts about the (non-)usefulness
of the out[] buffer from _mix_pool_bytes.

The problem I started trying to solve is readers (entropy extractors)
monopolizing the pool lock and stalling writers (entropy mixers), which
are supposed to be fast and low overhead.

So I thought I could rely on the "out" buffer returned from
_mix_pool_bytes to make concurrent extractors produce different output.
That led to the patch below, which drops the lock during the main pool
hash and only locks around the mix-back. (Using the mix_pool_bytes
wrapper rather than explicit locking and __mix_pool_bytes.)
But I'm not sure it's quite right.

Suppose we have a pool that's mostly all-zero, but has a small dense
high-entropy section, and no arch_get_random_long(). Narrowing the
lock lets concurrent readers get to the __mix_pool_bytes() mix-back
(last hunk of the diff) with identical SHA states.

With the original code, which fills out[] with data from just *after*
the add_ptr (i.e. the oldest portion of the pool), it's possible for
multiple readers to obtain all-zero out[] buffers and return identical
hashes.

(Or, in a less extreme case, very low-entropy out[] buffers and
strongly correlated outputs. I'll keep using "all-zero" to make
the examples simpler, and assume people can see the extrapolation.)

Well, okay, I think, easily fixed (included in the patch below):
change it to return the most recently modified section just *before*
the add_ptr, which includes the previous extractor's SHA hash. That
way, I'm guaranteed to get different data on multiple concurrent
calls and everything is okay.

But is it?

Suppose I have three concurrent callers. (I need three because the
hash folding covers up the problem with two.) The pool contains
30 bytes of entropy, so it should be possible to return uncorrelated
outputs to each of them, but as mentioned that's in a small dense
region of the pool and the area around add_ptr is all zeros.

Each hashes the pool to obtain identical 20-byte SHA-1 hashes. Then they
serialize around calls to _mix_pool_bytes. The first gets zeros in out[],
the second gets the first's SHA-1 hash and so generates a different hash,
and the third gets the second's SHA-1 hash.

So they all produce different outputs, but the outputs (totalling 30
bytes) passed through a 20-byte choke point and so have at most 20 bytes
of entropy between them. This violates the /dev/random security goals.


Here are the solutions I've been thinking about:

1) Continue to lock the entire hashing operation.

As long as we do this, the out[] parameter to _mix_pool_bytes is
unnecessary and should be deleted. Since the pool can't change
between the bulk hash and the hash of out[], there's no point to
re-hashing some data that was just hashed. (You are sampling the
add_ptr, which differs between callers, but that's negligible.)

One way to address my concern would be to split r->lock into
r->mix_lock and r->extract_lock. Extractors would obtain the
latter for the entire hash, and only obtain the mix_lock
for the brief mix-back operation.

The downside, of course, is a second lock on the extract path, but
given that the efficiency of the extract path is not a design goal,
perhaps that's fine.
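
To sketch the split-lock variant (untested; hash_whole_pool() and
fold_and_copy() are stand-ins for the existing sha_transform() loop and
folding code in extract_buf()):

struct entropy_store {
        /* ... existing fields ... */
        spinlock_t mix_lock;            /* protects pool contents and add_ptr */
        spinlock_t extract_lock;        /* serializes extractors only */
};

static void extract_buf_sketch(struct entropy_store *r, __u8 *out)
{
        __u32 hash[5];                  /* SHA-1 state, as now */
        unsigned long flags;

        /* Extractors serialize against each other for the slow pool hash. */
        spin_lock_irqsave(&r->extract_lock, flags);
        hash_whole_pool(r, hash);

        /* Mixers only ever contend for this short critical section. */
        spin_lock(&r->mix_lock);
        __mix_pool_bytes(r, hash, sizeof(hash), NULL);
        spin_unlock(&r->mix_lock);
        spin_unlock_irqrestore(&r->extract_lock, flags);

        fold_and_copy(out, hash);       /* the existing folding step */
}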

2) Shrink the lock and add a nonce (like a CPUID) to the *initial* SHA state.

This is fun because of its greater cryptographic cleverness,
but that means it should be discussed carefully first.

Due to the non-linearity of SHA hashing, this would result in
the concurrent hashes sampling different parts of the pool's
entropy, and the 20-byte entropy bottleneck would disappear.

But in this case, the out[] buffer also does not contribute to
the security guarantee and could also disappear.
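
Concretely, the nonce version would change the top of extract_buf() to
something like this (sketch only; using the CPU number is just one
possible nonce):

        union {
                __u32 w[5];
                __u8 b[20];
        } hash;
        __u32 workspace[SHA_WORKSPACE_WORDS];

        sha_init(hash.w);
        /* The nonce: any value that differs between concurrent callers. */
        hash.w[0] ^= raw_smp_processor_id();

        /* Unlocked bulk hash; concurrent extractors diverge immediately. */
        for (i = 0; i < r->poolinfo->poolwords; i += 16)
                sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);

        /* Locked mix-back, as in the patch above. */
        mix_pool_bytes(r, hash.w, sizeof(hash.w), extract);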

Thoughts, anyone?

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 102c50d..d847367 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -499,6 +499,15 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
input_rotate = ACCESS_ONCE(r->input_rotate);
i = ACCESS_ONCE(r->add_ptr);

+ /*
+ * Out gets a copy of the portion of the pool most recently
+ * modified by previous writers. Since add_ptr counts down,
+ * this is at positive offsets from the initial value.
+ */
+ if (out)
+ for (j = 0; j < 16; j++)
+ ((__u32 *)out)[j] = r->pool[(i + j) & wordmask];
+
/* mix one byte at a time to simplify size handling and churn faster */
while (nbytes--) {
w = rol32(*bytes++, input_rotate);
@@ -527,10 +536,6 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
ACCESS_ONCE(r->input_rotate) = input_rotate;
ACCESS_ONCE(r->add_ptr) = i;
smp_wmb();
-
- if (out)
- for (j = 0; j < 16; j++)
- ((__u32 *)out)[j] = r->pool[(i - j) & wordmask];
}

static void __mix_pool_bytes(struct entropy_store *r, const void *in,
@@ -1043,7 +1048,6 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
}

/* Generate a hash across the pool, 16 words (512 bits) at a time */
- spin_lock_irqsave(&r->lock, flags);
for (i = 0; i < r->poolinfo->poolwords; i += 16)
sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);

@@ -1056,8 +1060,7 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
* brute-forcing the feedback as hard as brute-forcing the
* hash.
*/
- __mix_pool_bytes(r, hash.w, sizeof(hash.w), extract);
- spin_unlock_irqrestore(&r->lock, flags);
+ mix_pool_bytes(r, hash.w, sizeof(hash.w), extract);

/*
* To avoid duplicates, we atomically extract a portion of the


2014-06-09 01:36:12

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?

On Sun, Jun 08, 2014 at 08:05:24PM -0400, George Spelvin wrote:
>
> The problem I started trying to solve is readers (entropy extractors)
> monopolizing the pool lock and stalling writers (entropy mixers), which
> are supposed to be fast and low overhead.

Which writer are you worried about, specifically? A userspace write
to /dev/random from rngd?

And have you measured this to be something significant, or is this a
theoretical concern? If you've measured it, what are the conditions
where this is stalling an entropy mixer for a significant amount of time?

- Ted

2014-06-09 02:11:05

by George Spelvin

[permalink] [raw]
Subject: Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?

> Which writer are you worried about, specifically? A userspace write
> to /dev/random from rngd?

Actually, the part I'm worried about is latency in
add_interrupt_randomness.

But I may not have understood how the locking works. There's a per-CPU
fast pool, which isn't locked at all, feeding the input_pool. The add
to the input pool would be slow if it were locked, but it seems hard
for user space to force much locking activity on the input_pool.

A bulk read will drain the secondary pool, but random_min_urandom_seed
will prevent reseeding, so lock contention on input_pool will be
almost nil.

Maybe I need to turn this into a documentation patch explaining all of this.

> And have you measured this to be something significant, or is this a
> theoretical concern. If you've measured it, what's the conditions
> where this is stalling an entropy mixer a significant amount of time?

It's a theoretical extrapolation from timing of user-space writes.
Some time comparisons (on a multicore machine so the two threads should run
on separate processors, and with scaling_governor = performance):

$ dd if=/dev/zero of=/dev/random count=65536
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.356898 s, 94.0 MB/s
$ dd if=/dev/zero of=/dev/random count=65536
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.357693 s, 93.8 MB/s

$ dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/zero of=/dev/random count=65536 ; sleep 4
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.505941 s, 66.3 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.715132 s, 11.7 MB/s
$ dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/zero of=/dev/random count=65536 ; sleep 4
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.509075 s, 65.9 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.734479 s, 11.4 MB/s

$ dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/zero of=/dev/random count=65536
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.66224 s, 50.7 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.894693 s, 9.4 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.895628 s, 9.4 MB/s
$ dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/zero of=/dev/random count=65536 ; sleep 4
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.657055 s, 51.1 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.908407 s, 9.2 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.908989 s, 9.2 MB/s

Summarizing that, time to feed in 32 MiB of zeros (from user-space):

0 concurrent reads: 0.356898 0.357693
1 concurrent read: 0.505941 0.509075 (+42%)
2 concurrent reads: 0.662240 0.657055 (+84%)

... so it seems like there are some noticeable contention effects.


And then I started thinking, and realized that the out[] parameter wasn't
doing anything useful at all, and even if we don't change anything else
at all, maybe it should be deleted to de-clutter the code?

It dates back to the days when entropy adding tried to be lockless,
but that was a long time ago. And once you have locking, it no longer
serves any function.

It's a bit of a meandering stream of consciousness, sorry. The latency
issue is where I started, but the general uselessness of out[] is what
I felt really needed discussing before I proposed a patch to dike it out.

2014-06-09 02:18:29

by George Spelvin

[permalink] [raw]
Subject: Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?

> Summarizing that, time to feed in 32 MiB of zeros (from user-space):
>
> 0 concurrent reads: 0.356898 0.357693
> 1 concurrent read: 0.505941 0.509075 (+42%)
> 2 concurrent reads: 0.662240 0.657055 (+84%)

Er, wait a minute... I'm not sure which kernel (patched or unpatched)
I did that measurement on.

That may be a completely useless number. I need to compile some more
kernels and reboot a few more times. Sorry about that.

2014-06-09 04:04:06

by George Spelvin

[permalink] [raw]
Subject: Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?

Sigh, adventures in "unable to mount root filesystem" currently underway.

Tested on a different computer without patches (half size, since it's
an older machine):

Writing 16 MiB:
16777216 bytes (17 MB) copied, 0.289169 s, 58.0 MB/s
16777216 bytes (17 MB) copied, 0.289378 s, 58.0 MB/s

Writing while reading:
16777216 bytes (17 MB) copied, 0.538839 s, 31.1 MB/s
4194304 bytes (4.2 MB) copied, 0.544769 s, 7.7 MB/s
16777216 bytes (17 MB) copied, 0.537425 s, 31.2 MB/s
4194304 bytes (4.2 MB) copied, 0.544259 s, 7.7 MB/s

16777216 bytes (17 MB) copied, 0.740495 s, 22.7 MB/s
4194304 bytes (4.2 MB) copied, 0.879353 s, 4.8 MB/s
4194304 bytes (4.2 MB) copied, 0.879629 s, 4.8 MB/s
16777216 bytes (17 MB) copied, 0.7262 s, 23.1 MB/s
4194304 bytes (4.2 MB) copied, 0.877035 s, 4.8 MB/s
4194304 bytes (4.2 MB) copied, 0.880627 s, 4.8 MB/s

16777216 bytes (17 MB) copied, 0.996933 s, 16.8 MB/s
4194304 bytes (4.2 MB) copied, 1.24551 s, 3.4 MB/s
4194304 bytes (4.2 MB) copied, 1.26138 s, 3.3 MB/s
4194304 bytes (4.2 MB) copied, 1.2664 s, 3.3 MB/s
16777216 bytes (17 MB) copied, 0.969144 s, 17.3 MB/s
4194304 bytes (4.2 MB) copied, 1.25311 s, 3.3 MB/s
4194304 bytes (4.2 MB) copied, 1.26076 s, 3.3 MB/s
4194304 bytes (4.2 MB) copied, 1.25887 s, 3.3 MB/s

Summarized:
0 readers: 0.289169 0.289378
1 reader: 0.538839 0.537425 (+86%)
2 readers: 0.740495 0.726200 (+153%)
3 readers: 0.996933 0.969144 (+240%)

That seems... noticeable. Causing interrupt latency problems is definitely
a theoretical extrapolation, however.

For comparison, on this system, dd from /dev/zero runs at 1 GB/s per
thread for up to 4 threads with no interference.

*Really* confusingly, dd from /dev/zero to tmpfs runs at 450 MB/s (per
thread) for 2 to 4 threads, but 325 MB/s for 1 thread. NFclue.
(This is writing to separate files; writing to the same file is
slower.)

dd from tmpfs to tmpfs runs at about 380 MB/s, again independent
of the number of threads up to the number of CPUs.

2014-06-09 09:23:21

by George Spelvin

[permalink] [raw]
Subject: Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?

> Sigh, adventures in "unable to mount root filesystem" currently underway.

PEBKAC resolved. (I had CONFIG_ATA=m for some testing, but forgot
to change it back to y when compiling a kernel I expected to boot.)

32 MiB write:

                Without patch           With patch
0 readers:      0.495523 0.494998       0.515995 0.516026
1 reader:       0.842245 0.845308       0.704283 0.704318
2 readers:      1.18658  1.18762        0.904635 0.904844
3 readers:      1.49615  1.49616        1.14311  1.16836

I haven't rebooted back to the without-patch kernel to see if the 20 ms
loss is consistent, but obviously that needs fixing. (There's more than
just that patch I posted in there.)

Anyway, it's noticeable, but maybe not worth fixing.

The interesting question is whether we should get rid of out[].
That's obscurity that contributes no security, and thus a Bad
Thing by definition.

2014-06-09 13:17:43

by George Spelvin

[permalink] [raw]
Subject: drivers/char/random.c: More futzing about

Just as an example of some more ambitious changes I'm playing with...

I really think the polynomial + twist has outlived its usefulness.
In particular, table lookups in infrequently accessed code are called
D-cache misses and are undesirable. And the input_rotate is an ugly
kludge to compensate for inadequate mixing.

Here's an example of a smaller, faster, and better fast_mix() function.
The mix is invertible (thus preserving entropy), but causes each input
bit or pair of bits to avalanche to at least 43 bits after 2 rounds and
120 bits after 3.

For comparison, with the current linear fast_mix function, bits above
the 12th in the data words only affect 4 other bits after one repetition.

With 3 iterations, it runs in 2/3 the time of the current fast_mix
and is half the size: 84 bytes of object code as opposed to 168.

void fast_mix2(struct fast_pool *f, u32 const input[4])
{
        u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1];
        u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3];
        int i;

        for (i = 0; i < 3; i++) {
                /*
                 * Inspired by ChaCha's QuarterRound, but
                 * modified for much greater parallelism.
                 * Surprisingly, rotating a and c seems to work
                 * better than b and d. And it runs faster.
                 */
                a += b; c += d;
                d ^= a; b ^= c;
                a = rol32(a, 15); c = rol32(c, 21);

                a += b; c += d;
                d ^= a; b ^= c;
                a = rol32(a, 3); c = rol32(c, 7);
        }
        f->pool[0] = a; f->pool[1] = b;
        f->pool[2] = c; f->pool[3] = d;
        f->count++;
}

Other good rotate sets:
score = 43/120/121: 23 6 18 11
score = 42/120/123: 17 15 4 24
score = 42/120/122: 4 7 19 17
score = 43/122/122: 24 6 16 12
score = 42/121/122: 25 22 11 8
score = 42/122/121: 16 20 11 23
score = 43/120/122: 8 11 17 19
score = 46/121/123: 15 21 3 7
score = 43/120/122: 7 11 15 21
score = 42/120/122: 7 10 17 13
score = 42/120/123: 11 3 18 23
score = 44/121/122: 20 10 26 24
score = 42/123/122: 10 5 18 21
score = 44/120/122: 26 21 7 9
score = 42/121/122: 21 11 7 8
score = 42/120/122: 11 11 27 19
score = 42/121/122: 6 18 4 27
score = 42/121/122: 13 23 3 4


If you want smaller code, or more flexibility in the number of rounds,
try (24 5 24 5) or (27 8 27 8). The avalanche is only slightly worse,
but unrolling twice shaves a smidgen off the run time, so the extra 2
rotate constants come for free.
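
Spelled out, the (24 5 24 5) variant is just fast_mix2 with a single
rotate pair (the function name here is only for illustration):

void fast_mix2_small(struct fast_pool *f, u32 const input[4])
{
        u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1];
        u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3];
        int i;

        for (i = 0; i < 3; i++) {
                /* Same quarter-round as above, one rotate pair, unrolled x2. */
                a += b; c += d;
                d ^= a; b ^= c;
                a = rol32(a, 24); c = rol32(c, 5);

                a += b; c += d;
                d ^= a; b ^= c;
                a = rol32(a, 24); c = rol32(c, 5);
        }
        f->pool[0] = a; f->pool[1] = b;
        f->pool[2] = c; f->pool[3] = d;
        f->count++;
}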

If you want something linear, there are better ways to get better mixing.
Here's one based on a period-2^128-1 xorshift generator:

void fast_mix3(struct fast_pool *f, u32 const input[4])
{
        u32 a = f->pool[0] ^ input[0];
        u32 b = f->pool[1] ^ input[1];
        u32 c = f->pool[2] ^ input[2];
        u32 d = f->pool[3] ^ input[3];

        /*
         * See Marsaglia, "Xorshift RNGs". Possible shift amounts
         * are [5, 14, 1], [15, 4, 21], [23, 24, 3], [5, 12, 29].
         */
        a ^= a<<15; a ^= a>>4 ^ d ^ d>>21; f->pool[0] = a;
        b ^= b<<15; b ^= b>>4 ^ a ^ a>>21; f->pool[1] = b;
        c ^= c<<15; c ^= c>>4 ^ b ^ b>>21; f->pool[2] = c;
        d ^= d<<15; d ^= d>>4 ^ c ^ c>>21; f->pool[3] = d;

        f->count++;
}

... However this is slower than 2 iterations of fast_mix2, and
doesn't avalanche as much.

2014-06-09 13:34:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?

On Mon, Jun 09, 2014 at 12:03:55AM -0400, George Spelvin wrote:
>
> That seems... noticeable. Causing interrupt latency problems is definitely
> a theoretical extrapolation, however.

Actually, if you look very closely, or take a look at the commit
description for 902c098a3663de, you'll see that we're actually
updating the entropy pool from add_interrupt_randomness() w/o taking
any locks. So that's actually not a problem.

What could be a problem is if we get really unlucky (an interrupt
update happening at the same time as an update from rngd), it's
possible we could end up with an entropy addition getting lost. It's
not such a big deal if a contribution from add_interrupt_randomness()
gets lost, since we only giving one bit worth of entropy credit. But
it's possible that a much larger input from rngd could potentially end
up getting overwritten from the fast_pool contribution.

This isn't a disaster per se, but it probably means that it's worth
taking a closer look at how we do the entropy pool input mixing to see
if we can actually make a fully lockless add_entropy work correctly.
One approach might be to use atomics but of course then we have to
balance the increased overhead of using atomic_t types versus the
locking overhead.

Cheers,

- Ted

2014-06-09 15:04:30

by George Spelvin

[permalink] [raw]
Subject: Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?

> Actually, if you look very closely, or take a look at the commit
> description for 902c098a3663de, you'll see that we're actually
> updating the entropy pool from add_interrupt_randomness() w/o taking
> any locks. So that's actually not a problem.

Arrgh... I could swear I saw a path that locked, but now that I look
again I can't see it. I think you're right and I was hallucinating.

Nice work on credit_entropy_bits, BTW.

> What could be a problem is if we get really unlucky (an interrupt
> update happening at the same time as an update from rngd), it's
> possible we could end up with an entropy addition getting lost. It's
> not such a big deal if a contribution from add_interrupt_randomness()
> gets lost, since we only giving one bit worth of entropy credit. But
> it's possible that a much larger input from rngd could potentially end
> up getting overwritten from the fast_pool contribution.

What I can't see in the code is how that case is detected and the entropy
credit suppressed.

I just want to be sure I understand your use of the word "overwritten".
There's the entropy itself, and the associated credit.

I'm talking about the fact that _mix_pool_bytes can cause the
actual entropy bits to get overwritten in the pool if two writers
race each other.

As you say, it's always possible to fall back to zero credit,
but you have to *detect* the situation, and I'm not sure how
that happens.

While it's easy enough to have a special case for one interrupt handler,
there's one per processor that can be trying to write. The obvious way
is to do a trylock of some sort from the interrupt handler and hold off
spilling the fast_pool if the attempt fails. So there's a lock
of a sort, but no spinning on it.

> This isn't a disaster per se, but it probably means that it's worth
> taking a closer look at how we do the entropy pool mixing input to see
> if we can actually a make a fully lockless add_entropy work correctly.
> One approach might be to use atomics but of course then we have to
> balance the increased overhead of using atomic_t types versus the
> locking overhead.

Doing it really right, including clamping at the endpoints, is
extremely tricky in the face of concurrent access.

Consider the case of a full pool, and a concurrent write and read
of a full pool's worth of entropy.

If the write wins the race, it overfills the pool, so may not give
any credit, and the read then empties it.

If the read wins the race, it empties the pool, and then the write
refills it.

But if there's no enforced serialization, the final condition is
that the amount of entropy in the pool is completely unknown;
it could be anything.


The implementation that seems most obvious to me, if we can have the
readers and the writers cooperating at least, uses monotonically
increasing head and tail counters, with (head - tail) being the current
entropy budget.

To handle concurrent update, the head counter is split in two,
into head_max and head_min. The "max" is incremented before the
pool itself is accessed, and the "min" is incremented after.

When adding, the writer looks at the tail and chooses an amount
to credit that will not overflow the pool. Then increments head_max,
mixes in to the pool, and finally lets head_min catch up.

When extracting, the reader first blocks (if desired) on head_min.

Then the extractor does the extract and advances the tail by the
amount extracted, but no more than head_max after the extract.
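
In code, the accounting half would look something like this (a sketch
with invented names, POOL_BITS standing in for the pool's capacity;
I'm ignoring the ENTROPY_SHIFT fixed-point fractions and all the
wakeup logic):

struct entropy_account {
        atomic_t head_max;      /* credit promised by writers */
        atomic_t head_min;      /* credit actually mixed in */
        atomic_t tail;          /* credit ever extracted */
};

/* Writer side: promise, mix, then let the floor catch up. */
static void credit_sketch(struct entropy_account *a, int nbits)
{
        int room = POOL_BITS - (atomic_read(&a->head_max) -
                                atomic_read(&a->tail));

        nbits = min(nbits, room);       /* never credit past a full pool */
        if (nbits <= 0)
                return;
        atomic_add(nbits, &a->head_max);
        /* ... mix the corresponding bytes into the pool here ... */
        atomic_add(nbits, &a->head_min);
}

/* Reader side: only consume what is known to be fully mixed in. */
static int debit_sketch(struct entropy_account *a, int nbits)
{
        int avail = atomic_read(&a->head_min) - atomic_read(&a->tail);

        nbits = min(nbits, avail);
        if (nbits > 0)
                atomic_add(nbits, &a->tail);
        return nbits;
}

The point is that head_min only ever lags head_max, so a reader can
never be given credit for bytes that haven't been fully mixed in yet.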



Oh! I just realized that concurrent mix_pool_bytes can be handled with
a similar scheme. (To be clear, I've jumped from talking about entropy
accounting back to pool mixing.)

*If* we can bound the total number of bytes to be concurrently
mixed in to the pool to the size of the pool, then you can
assign ranges of bytes to write using a sort of ticket lock.

Each adder bumps the add_ptr atomically by the amount of data they
want to add, masks the pre-increment counter value, and uses that
to determine the area of the pool they will be updating.

What gets nasty is if you have a large SMP system and more writers
than can be handled thus. Because there's no guarantee on the
order that writers will *finish*, there's no simple way to keep
track of adds completed and thereby know if an additional writer
has room to work.

I can't just atomically increment the completed index when I finish adding,
because that could be misleading. If processor A reads the add_ptr
as 10 and increments it to 20, then processor B reads the add_ptr as
20 and increments it to 30, everything is fine so far.

But if processor B finishes first and increments the adds_completed
index to 20, it's lying: the range from 10 to 20 is still busy and
attempting to access it will lead to trouble.

Ah! It turns out that there is a simple technique after all.
I have three add counters, and due to lack of imagination thinking
of good descriptive names, I'll call them add1, add2 and add3.

add1-add2 is the amount of outstanding writes in progress. The location
of the writes is not known, except that it's somewhere between add3 and add1.

The invariant is add3 <= add2 <= add1.

add1 is the ticket lock. A would-be writer checks that it is
not trying to increment add1 to more than add3+POOLSIZE, and
if so does an atomic fetch-and-add on add1.

This tells it the range of the pool it's allowed to write to. All good,
and it does the appropriate update.

When the write completes, atomically bump add2. Then check if add2 ==
add1. If not, return. If they are equal, set add3 to the shared value,
collapsing the "writes in progress somewhere in this range" window.

This would be bad TCP, periodically collapsing the window, but it
should be fine for an entropy pool.
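
In sketch form (add1/add2/add3 as free-running word counts, all names
invented; mix_words_at() stands in for mixing one writer's slice at a
given offset):

/*
 * Invariant: add3 <= add2 <= add1, and add1 - add3 <= poolwords.
 * Pool positions are (counter & wordmask), as with add_ptr today.
 */
static bool pool_add_sketch(struct entropy_store *r,
                            const u32 *in, int nwords)
{
        u32 start;

        /* Ticket grab; a real version would close this check-vs-add
         * race with a cmpxchg loop. */
        if (atomic_read(&r->add1) + nwords >
            atomic_read(&r->add3) + r->poolinfo->poolwords)
                return false;           /* window full: caller backs off */
        start = atomic_add_return(nwords, &r->add1) - nwords;

        mix_words_at(r, start, in, nwords);     /* write our slice */

        atomic_add(nwords, &r->add2);           /* retire our ticket */
        /* If we were the last writer out, collapse the window. */
        if (atomic_read(&r->add2) == atomic_read(&r->add1))
                atomic_set(&r->add3, atomic_read(&r->add2));
        return true;
}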

2014-06-09 15:50:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?

On Mon, Jun 09, 2014 at 11:04:26AM -0400, George Spelvin wrote:
> > What could be a problem is if we get really unlucky (an interrupt
> > update happening at the same time as an update from rngd), it's
> > possible we could end up with an entropy addition getting lost. It's
> > not such a big deal if a contribution from add_interrupt_randomness()
> > gets lost, since we only giving one bit worth of entropy credit. But
> > it's possible that a much larger input from rngd could potentially end
> > up getting overwritten from the fast_pool contribution.
>
> What I can't see in the code is how that case is detected and the entropy
> credit suppressed.

Sorry, I didn't make myself clear. In the common case where we don't
have rngd running, and so we only have add_interrupt_randomness()
racing with itself, what's at stake is only one bit of entropy credit.
In the absolute worse case, where two CPU's run
add_interrupt_randomness() at the exact same time (after servicing
different irq's), what can happen is that both cpu's will get the same
add_ptr, and so both CPU's will end up updating the same portion of
the entropy pool, and so one CPU's worth of add_interrupt_randomness()
will get lost by being overwritten by the other CPU's updates. This
means that we will have given two bits worth of entropy credit, but
only one add_interrupt_randomness()'s worth of updates. Given that
the single bit of entropy credit was already very conservative to
begin with, it's not such a big deal.

However, the case I failed to consider when I made this change almost
two years ago is what happens if you have one updater coming from
add_interrupt_randomness(), and another one coming from rngd calling
the RNDADDENTROPY ioctl. Now if add_interrupt_randomness()
wins the race and all of the entropy pool updates are coming from it,
then the entropy credit coming from RNDADDENTROPY will be credited to
the pool without the contribution from rngd being actually added to
the pool, since those contributions might have gotten overwritten by
add_interrupt_randomness().

So I wasn't saying that we could detect the situation; it's just that
in cases where we aren't adding any credit anyway, or as in the case
of add_interrupt_randomness() racing with each other, it's 50-50 which
CPU will win the race, and so an adversary which can't predict which
CPU's contribution to the entropy pool will actually "win" has to
explore both possibilities, and that mitigates the fact that entropy
count might be bumped without the corresponding entropy having been
added.

But there is the RNDADDENTROPY ioctl case where this reasoning doesn't
hold, and that's the case which I'm now worried about.

> Consider the case of a full pool, and a concurrent write and read
> of a full pool's worth of entropy.
>
> If the write wins the race, it overfills the pool, so may not give
> any credit, and the read then empties it.
>
> If the read wins the race, it empties the pool, and then the write
> refills it.

Yes, but again, if we're only adding a single bit's worth of entropy
credit, then at worst we'll only be off by one. And if the race is a
50-50 proposition (and I know life is not that simple; for example, it
doesn't deal with N-way races), then we might not even be off by one. :-)

One other thing to consider is that we don't really need to care about
how efficient RNDADDENTROPY needs to be. So if necessary, we can put
a mutex around that so that we know that there's only a single
RNDADDENTROPY being processed at a time. So if we need to play
cmpxchg games to make sure the RNDADDENTROPY contribution doesn't get
lost, and retry the input mixing multiple times if necessary, that's
quite acceptable, so long as we can minimize the overhead on the
add_interrupt_randomness() side of the path (and where again, if there
is a 50-50 chance that we lose the contribution from
add_interrupt_randomness(), that's probably acceptable so long as we
don't lose the RNDADDENTROPY contribution).

Come to think of it, something that we

> Ah! It turns out that there is a simple technique after all.
> I have three add counters, and due to lack of imagination thinking
> of good descriptive names, I'll call them add1, add2 and add3.
>
> add1-add2 is the amount of outstanding writes in progress. The location
> of the writes is not known, except that it's somewhere between add3 and add1.
>
> The invariant is add3 <= add2 <= add1.
>
> add1 is the ticket lock. A would-be writer checks that it is
> not trying to increment add1 to more than add3+POOLSIZE, and
> if so does an atomic fetch-and-add on add1.
>
> This tells it the range of the pool it's allowed to write to. All good,
> and it does the appropriate update.
>
> When the write completes, atomically bump add2. Then check if add2 ==
> add1. If not, return. If they are equal, set add3 to the shared value,
> collapsing the "writes in progress somewhere in this range" window.

I need to think about this more, but that's clever! And if we end up
having to throw away a contribution from add_interrupt_entropy(),
that's fine.

Speaking of which, there's an even simpler solution that I've failed
to consider. We could simply use a trylock in
add_interrupt_randomness(), and if we can't get the lock, decrement
fast_pool->count so we try to transfer the entropy from the fast_pool
to the entropy pool on the next interrupt.

Doh! That solves all of the problems, and it's much simpler and
easier to reason about.
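
I.e., something like this in add_interrupt_randomness() (untested
sketch; the credit computation and arch-seed handling are elided):

        r = nonblocking_pool.initialized ? &input_pool : &nonblocking_pool;
        if (!spin_trylock(&r->lock)) {
                /*
                 * Someone else (e.g. RNDADDENTROPY) is mixing into the
                 * pool; back off and let the fast_pool spill on a later
                 * interrupt instead of spinning here.
                 */
                fast_pool->count--;
                return;
        }
        __mix_pool_bytes(r, &fast_pool->pool, sizeof(fast_pool->pool), NULL);
        spin_unlock(&r->lock);
        credit_entropy_bits(r, 1);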

- Ted

2014-06-09 16:11:47

by George Spelvin

[permalink] [raw]
Subject: Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?

> Yes, but again, if we're only adding a single bit's worth of entropy
> credit, then at worst we'll only be off by one. And if the race is a
> 50-50 proposition (and I know life is not that simple; for example, it
> doesn't deal with N-way races), then we might not even be off by one. :-)

Indeed.

> One other thing to consider is that we don't really need to care about
> how efficient RNDADDENTROPY needs to be. So if necessary, we can put
> a mutex around that so that we know that there's only a single
> RNDADDENTROPY being processed at a time. So if we need to play
> cmpxchg games to make sure the RNDADDENTROPY contribution doesn't get
> lost, and retry the input mixing multiple times if necessary, that's
> quite acceptable, so long as we can minimize the overhead on the
> add_interrupt_randomness() side of the path (and where again, if there
> is a 50-50 chance that we lose the contribution from
> add_interrupt_randomness(), that's probably acceptable so long as we
> don't lose the RNDADDENTROPY contribution).

Yes, that was something I was thinking: if you can just *detect*
the collision, you can always retry the mixing in.

But getting the entropy accounting right is tricky. I completely fail
to see how the current RNDADDENTROPY is race-free. There's no attempt
at locking between the write_pool() and credit_entropy_bits_safe().
So a reader could sneak in and drain the pool, only to have us credit
it back up later.

Crediting either before or after the write_pool is unsafe; you
have to either wrap mutual exclusion around the whole thing,
or do a sort of 2-phase commit.

> Speaking of which, there's an even simpler solution that I've failed
> to consider. We could simply use a trylock in
> add_interrupt_randomness(), and if we can't get the lock, decrement
> fast_pool->count so we try to transfer the entropy from the fast_pool
> to the entropy pool on the next interrupt.
>
> Doh! That solves all of the problems, and it's much simpler and
> easier to reason about.

That's exactly what I was talking about when I wrote (two e-mails ago):

>> While it's easy enough to have a special case for one interrupt handler,
>> there's one per processor that can be trying to write. The obvious way
>> is to do a trylock of some sort from the interrupt handler and hold off
>> spilling the fast_pool if the attempt fails. So there's a lock
>> of a sort, but no spinning on it.

We just used different terminology. I used the more abstract "hold off
spilling the fast_pool" and you described a specific implementation technique
with "decrement fast_pool->count".

The whole "multiple concurrent writers" thing is probably overkill, but
it's fun to figure out how to do some of these things anyway.

2014-06-10 00:21:00

by George Spelvin

[permalink] [raw]
Subject: drivers/char/random.c: more ruminations

My original reason for writing turned into a never mind...

I came up with a clever and intricate algorithm for preventing
a bulk writer from DoSing add_interrupt_randomness via the proposed
additional locking on the input_pool.

Then I re-checked the code and saw that only the (trusted) RNDADDENTROPY
call is allowed to write to the input_pool. A plain write(2) is copied
to the two output pools.

And writing to the input_pool can never try to take a lock on an
output pool; spilling is done in a workqueue.


I'm getting to a dangerous place: I think I'm starting to understand this.
I should probably turn that into a doc patch.


I have an idea for a patch to change _xfer_secondary_pool
to use extract_buf rather than extract_entropy; is all that
FIPS stuff needed for purely internal transfers?


Also, shouldn't the r->last_pulled holdoff in xfer_secondary_pool be
really limited to actual transfers? I.e. reorder the conditions as:

static void xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
{
        if (!r->pull ||
            r->entropy_count >= (nbytes << (ENTROPY_SHIFT + 3)) ||
            r->entropy_count >= r->poolinfo->poolfracbits)
                return;

        if (r->limit == 0 && random_min_urandom_seed) {
                unsigned long now = jiffies;

                if (time_before(now,
                                r->last_pulled + random_min_urandom_seed * HZ))
                        return;
                r->last_pulled = now;
        }
        _xfer_secondary_pool(r, nbytes);
}

2014-06-10 01:20:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

On Mon, Jun 09, 2014 at 08:20:57PM -0400, George Spelvin wrote:
>
> I have an idea for a patch to change _xfer_secondary_pool
> to use extract_buf rather than extract_entropy; is all that
> FIPS stuff needed for purely internal transfers?

That's not the part of extract_entropy() which is critical. What's
critical is the control over only transferring entropy when there is at
least a certain minimum amount of entropy. This provides the
Yarrow-like avalanche property which is required to provide recovery
after the internal state of the entropy pools have been exposed.

> Also, shouldn't the r->last_pulled holdoff in xfer_secondary_pool be
> really limited to actual transfers? I.e. reorder the conditions as...

Yes, that makes sense.

Cheers,

- Ted

2014-06-10 03:10:20

by George Spelvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

>> I have an idea for a patch to change _xfer_secondary_pool
>> to use extract_buf rather than extract_entropy; is all that
>> FIPS stuff needed for purely internal transfers?

> That's not the part of extract_entropy() which is critical. What's
> critical is the control over only transfering entropy when there is at
> least a certain minimum amount of entropy. This provides the
> Yarrow-like avalanche property which is required to provide recovery
> after the internal state of the entropy pools have been exposed.

I do understand the importance of catastrophic reseeding, but
I don't see that implemented in extract_entropy().

Extract_entropy() itself contains a call to xfer_secondary_pool, but
that is a no-op in the case I'm considering (when it's called from
_xfer_secondary_pool) because in that case, r is the input_pool,
which doesn't have an r->pull pool.

The number of bytes transferred is adjusted by account(), but
it's only adjusted downward. (If it were adjusted up, that would be
a buffer overrun.)

Normal reads seem to ask for a reseed equal in size to the read itself,
which is "catastrophic" enough.

The only *minimum* reseed transfer size I can see is enforced in
_xfer_secondary_pool and push_to_pool, which use random_read_wakeup_bits
(default 64) as a minimum reseed. (Perhaps increase this? It seems
low by modern standards, IMHO.)


Now, having said all that, I'll try to avoid embarrassing myself
in public again and ask... what am I missing?



On other matters...

If I play around with getting the entropy locking right, do you
have a strong preference for the single-lock or two-lock design?
I can't decide which to start with.

The latter is an extensive core code change, but I can easily convince
myself there are no deadlock issues. That's the one I described
with separate adder and extractor locks and code paths, two "amount
of entropy ever added" variables, and one "amount of entropy ever removed".

The single-lock design is hopefully less code, but (to me, at least) less
obviously deadlock-free.

2014-06-10 15:25:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

On Mon, Jun 09, 2014 at 11:10:17PM -0400, George Spelvin wrote:
>
> Extract_entropy() itself contains a call to xfer_secondary_pool, but
> that is a no-op in the case I'm considering (when it's called from
> _xfer_secondary_pool) because in that case, r is the input_pool,
> which doesn't have an r->pull pool.
>
> The number of bytes transferred is adjusted by account(), but
> it's only adjusted downward. (If it were adjusted up, that would be
> a buffer overrun.)

It's the adjustment downward which is what gives us the catastrophic
reseeding --- specifically, it's adjusting the number of bytes down to
zero until we can transfer enough entropy to make it intractable for
the adversary to test the 2**N possible pool states that might
correspond to the observed /dev/random output.

In general, this is how to mitigate a state compromise scenario; it's
to delay reseeding until we have confidence that there is enough
entropy built up that the bad guys can't maintain their model of the
entropy pool state by observing some or all of the RNG output.
Adjusting the bytes down to zero is basically a way of delaying
reseeding.
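
Schematically, the check that does the work is the "min" test in
account() (very simplified; the real code also handles the reserved
amount and updates entropy_count with a cmpxchg retry loop):

static size_t account_sketch(struct entropy_store *r, size_t nbytes, int min)
{
        int have = ENTROPY_BITS(r) / 8;         /* credited entropy, in bytes */

        /* Below the threshold, transfer nothing: the reseed is delayed. */
        if (have < min)
                return 0;

        /* Otherwise transfer up to the request, bounded by what we have. */
        nbytes = min_t(size_t, nbytes, have);
        r->entropy_count -= nbytes << (ENTROPY_SHIFT + 3);
        return nbytes;
}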

As far as the FIPS wankery is concerned, *all* of FIPS is wankery,
and most serious practitioners believe that the FIPS requirements have
been a net negative for overall security. ("Do not touch a
line of this file, including the comments, else it will cost us
hundreds of thousands of dollars in recertification fees" -- Apple, in
their SSL library code where the "goto fail" bug was found.)

The only reason why the FIPS code is there is because apparently it's
necessary for those foolish US government customers who require FIPS
certification of OpenSSL, and apparently that extends to /dev/random
since it is an input to OpenSSL. (It was also added while Matt Mackall
was the maintainer;
I'm not entirely certain I would have accepted it, because I think
FIPS is actively harmful; OTOH, I was working for IBM at the time, and
IBM does make money from selling to silly customers who require FIPS,
so maybe I would have.)

I suspect that the FIPS check is not necessary for intra-pool
transfers, but quite frankly, I can't be bothered to care about
improving the efficiency of systems put into FIPS mode. And there is
a possibility that changing it might break the FIPS certification.
Not that I care either way, but I'm just not motivated to verify that
a change made to improve FIPS efficiency doesn't break security for the
use cases that I actually care about. :-/


> If I play around with getting the entropy locking right, do you
> have a strong preference for the single-lock or two-lock design?
> I can't decide which to start with.
>
> The latter is an extensive core code change, but I can easily convince
> myself there are no deadlock issues. That's the one I described
> with separate adder and extractor locks and code paths, two "amount
> of entropy ever added" variables, and one "amount of entropy ever removed".
>
> The single-lock design is hopefully less code, but (to me, at least) less
> obviously deadlock-free.

Since, as near as I can tell, there isn't really a real performance issue
here, what's most important to me is that (a) the changes are easily
auditable, and (b) the resulting code is easily auditable and
Obviously Correct.

The null hypothesis that any change would have to compete against is
adding a trylock to add_interrupt_randomness(), since the change is
small, and obviously not going to make things worse.

If we can do better than that in terms of the "the commit is easy to
read/audit" and "the resulting code is easy to read/audit", then
great. But otherwise, it seems that the "add a trylock" is the
simplest way to make things better.

Does that make sense?

- Ted

2014-06-10 20:40:33

by George Spelvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

> It's the adjustment downward which is what gives us the catastrophic
> reseeding --- specifically, it's adjusting the number of bytes down to
> zero until we can transfer enough entropy to make it intractable for
> the adversary to test the 2**N possible pool states that might
> correspond to the observed /dev/random output.

Er, yes, I understand that... as I said, I understand the
purpose and importance of catastrophic reseeding.

Look, maybe it's easier to see in the draft patch appended.
I added a comment pointing out the catastrophic reseeding.

(Your opinions on the issues labelled FIXME in the comments
are definitely solicited.)

An obvious follow-up patch is to delete the last two arguments from
extract_entropy(), since they're always zero in all remaining calls.

> I suspect that the FIPS check is not necessary for intra-pool
> transfers, but quite frankly, I can't be bothered to care about
> improving the efficiency of systems put into FIPS mode. And there is
> a possibility that changing it might break the FIPS certification.
> Not that I care either way, but I'm just not motiviated to verify that
> any change to improve FIPS efficiency might not break security for the
> use cases that I actually care about. :-/

I don't care about the efficiency either; I just wanted to avoid the
stack usage of a "u32 tmp[OUTPUT_POOL_WORDS]" when the actual extraction
is done in EXTRACT_SIZE chunks anyway.

(This stuff is called from fs code in some cases, although I haven't
checked if it's on the page-out path that's always the stack smasher.)

My question was "do I have to preserve that crap when it would be
easier to dike it out?" Look at the example patch below.

(If I did have to preserve it, then I'd move the check into
extract_buf.)

> The null hypothesis that any change would have to compete against is
> adding a trylock to add_interrupt_randomness(), since the change is
> small, and obviously not going to make things worse.

Er, you seem to underestimate the changes. It also involves moving the
existing locks outward to encompass entropy accounting in many places in
the code (after which the cmpxchg in credit_entropy is no longer needed
and may be deleted).

> Does that make sense?

The goals make perfect sense, I'm just saying that "add a trylock"
is not a literal implementation guide; there are a lot of consequent
changes that are being glossed over.

I really think the result will be much clearer, but I'll know for
sure when I get there.


Draft patch to reduce stack usage in _xfer_secondary_pool:

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 102c50d3..5bbe5167 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -910,8 +910,9 @@ void add_disk_randomness(struct gendisk *disk)
*
*********************************************************************/

-static ssize_t extract_entropy(struct entropy_store *r, void *buf,
- size_t nbytes, int min, int rsvd);
+static size_t account(struct entropy_store *r, size_t nbytes, int min,
+ int reserved);
+static void extract_buf(struct entropy_store *r, u8 out[EXTRACT_SIZE]);

/*
* This utility inline function is responsible for transferring entropy
@@ -937,23 +938,51 @@ static void xfer_secondary_pool(struct entropy_store *r, size_t nbytes)

static void _xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
{
- __u32 tmp[OUTPUT_POOL_WORDS];
-
- /* For /dev/random's pool, always leave two wakeups' worth */
- int rsvd_bytes = r->limit ? 0 : random_read_wakeup_bits / 4;
+ u8 tmp[EXTRACT_SIZE];
int bytes = nbytes;

/* pull at least as much as a wakeup */
bytes = max_t(int, bytes, random_read_wakeup_bits / 8);
/* but never more than the buffer size */
- bytes = min_t(int, bytes, sizeof(tmp));
+ bytes = min_t(int, bytes, OUTPUT_POOL_WORDS*sizeof(u32));

+ /*
+ * FIXME: Move this to after account(), so it shows the true amount
+ * transferred?
+ */
trace_xfer_secondary_pool(r->name, bytes * 8, nbytes * 8,
ENTROPY_BITS(r), ENTROPY_BITS(r->pull));
- bytes = extract_entropy(r->pull, tmp, bytes,
- random_read_wakeup_bits / 8, rsvd_bytes);
- mix_pool_bytes(r, tmp, bytes, NULL);
- credit_entropy_bits(r, bytes*8);
+
+ /*
+ * This is the only place we call account() with non-zero
+ * "min" and "reserved" values. The minimum is used to
+ * enforce catastrophic reseeding: if we can't get at least
+ * random_read_wakeup_bits of entropy, don't bother reseeding
+ * at all, but wait until a useful amount is available.
+ *
+ * The "reserved" is used to prevent reads from /dev/urandom
+ * from emptying the input pool; leave two wakeups' worth
+ * for /dev/random.
+ */
+ bytes = account(r->pull, bytes, random_read_wakeup_bits / 8,
+ r->limit ? 0 : random_read_wakeup_bits / 4);
+
+ /* Now to the actual transfer, in EXTRACT_SIZE units */
+ while (bytes) {
+ int i = min_t(int, bytes, EXTRACT_SIZE);
+
+ extract_buf(r->pull, tmp);
+ mix_pool_bytes(r, tmp, i, NULL);
+ credit_entropy_bits(r, i*8);
+ bytes -= i;
+ }
+
+ /*
+ * Wipe data just returned from memory.
+ * FIXME: gcc understands memset and will probably optimize
+ * this out. Use OpenBSD-style explicit_bzero()?
+ */
+ memset(tmp, 0, sizeof(tmp));
}

/*
@@ -1019,7 +1048,7 @@ retry:
*
* Note: we assume that .poolwords is a multiple of 16 words.
*/
-static void extract_buf(struct entropy_store *r, __u8 *out)
+static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
{
int i;
union {

2014-06-10 21:20:40

by Theodore Ts'o

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

On Tue, Jun 10, 2014 at 04:40:28PM -0400, George Spelvin wrote:
>
> Look, maybe it's easier to see in the draft patch appended.
> I added a comment pointing out the catastrophic reseeding.

Yes, I see what you're trying to do.

What I'd suggest is that you do this in a series of small patches.
Each patch should solve one problem at a time, so it's easy to audit
each one. For example, adding a trylock to add_interrupt_randomness()
should do that, and ***only*** that. This fixes the case where
add_interrupt_randomness() races with RNDADDENTROPY's call to
write_pool().

Yes, there are other potential problems, especially relating to
whether we need to atomically update the entropy credit and the input
pool. And actually, it doesn't matter, since we never try to extract
more than the input pool's has limit set to 1, and so we never extract
more entropy than the entropy counter. So if we add the entropy, it's
impossible that it will get used before we update the entropy counter.

This is why changes should be separate commits, so we can review each
one of them separately.

Similarly, your comment about memset vs. explicit_bzero is a valid
one, but that's a problem that should be fixed in a separate commit.
And yes, that's a real problem --- I've confirmed this using gcc -S,
and it's not just a problem in drivers/char/random.c, but also in a
number of use cases under crypto/*.c. But that's a problem that
should be fixed separately, and to be honest, I'd actually consider
this problem more important than some of the other issues we've talked about in this
thread.
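
(The usual workaround is a memset wrapper with a compiler barrier,
something like the sketch below; the helper name is made up.)

/* Zero a buffer in a way the compiler can't optimize away. */
static inline void zero_explicit(void *p, size_t n)
{
        memset(p, 0, n);
        /* The empty asm claims to use p and clobber memory, so the
         * stores above cannot be treated as dead. */
        asm volatile("" : : "r" (p) : "memory");
}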

This is also why I objected to your change to use the stale stack
contents for the input array. That does something different from the
rest of the changes in the patch. I'm sorry if I keep harping on this,
but this is critically important. It also means that we can accept
the small, obviously correct changes, while we discuss the larger,
more invasive changes. It also makes it easier to backport the
smaller and more important changes to stable kernels.

> I don't care about the efficiency either; I just wanted to avoid the
> stack usage of a "u32 tmp[OUTPUT_POOL_WORDS]" when the actual extraction
> is done in EXTRACT_SIZE chunks anyway.

Huh? The FIPS wankery requires:

__u8 tmp[10];

not

u32 tmp[OUTPUT_POOL_WORDS];

and when you replace the former with the latter, but still try to move

The xfer_secondary_pool code does use OUTPUT_POOL_WORDS*sizeof(u32), but
only declares tmp as a 10-byte char array, so it looks like you'll end up
overflowing the stack (have you actually tested your draft patch?)

In any case, this is another reason why I really request people to
send a series of git commits, where each one is small, easy to read,
and Obviously Correct.

When you try to make multiple changes in a single commit, it makes
things harder to review, and that opens up "goto fail" sort of errors.
(Well, hopefully not because I spend a lot of time very carefully
scrutinizing patches, and if I don't have time, I won't apply the
patch until I do have time. So put another way, if you want to
increase the likelihood that I'll process your patches quickly, it's
to your advantage to keep each separate commit small and obviously
correct(tm). Yes, this means that sometimes when you start making
changes, and run into other cleanups, you may need to use commands
like "git stash", create a separate commit that does just the
cleanup", and then "git stash pop" to resume work --- or use quilt or
guilt if that's more convenient for you, but small patches really,
REALLY, helps the review process.)

> > The null hypothesis that any change would have to compete against is
> > adding a trylock to add_interrupt_randomness(), since the change is
> > small, and obviously not going to make things worse.
>
> Er, you seem to underestimate the changes. It also involves moving the
> existing locks outward to encompass entropy accounting in many places in
> the code (after which the cmpxchg in credit_entropy is no longer needed
> and may be deleted).

Sure, but you can put the deletion of the cmpxchg and the out[] array
in separate commits, where the tree remains building and secure at
each step.

I generally have the biggest problems with academics who want to modify
the random number generator, where they dump a several thousand line
diff on my porch, and then wonder why I don't just blindly apply it....

- Ted

2014-06-11 00:10:06

by George Spelvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

>> I don't care about the efficiency either; I just wanted to avoid the
>> stack usage of a "u32 tmp[OUTPUT_POOL_WORDS]" when the actual extraction
>> is done in EXTRACT_SIZE chunks anyway.

> Huh? The FIPS wankery requires:
>
> __u8 tmp[10];
>
> not
>
> u32 tmp[OUTPUT_POOL_WORDS];

Sorry, we're still not communicating right. You seem to think I'm trying
to optimize the FIPS code, and nothing could be further from the truth.

Please consider the merits of my patch as if the FIPS stuff was
#ifdefed out. I don't care about it, I'm not trying to optimize it,
it was just *in my way* when I was trying to do something else, and I
only had to ask about it because it's a legal/regulatory requirement
rather than a technical one, so I couldn't understand it by logic.

What I wanted to do was eliminate that huge tmp buffer from
_xfer_secondary_pool. There's no good reason why it needs to be there,
and several reasons for getting rid of it.

A big one for me is auditability, a.k.a. the current code confused me
when I was trying to understand it. Based on static code analysis,
extract_entropy and xfer_secondary_pool appear to be recursive; the fact
that they're not in practice takes more insight.

extract_entropy_user(&nonblocking_pool)
  -> xfer_secondary_pool(&nonblocking_pool)
    -> _xfer_secondary_pool(&nonblocking_pool)
      -> extract_entropy(&input_pool)
        -> xfer_secondary_pool(&input_pool)
           recursion terminates because input_pool.pull == NULL

> and when you replace the former with the latter, but still try to move
>
> The xfer_secondary_pool code does use OUTPUT_POOL_WORDS*sizeof(u32), but
> only declares tmp as a 10-byte char array, so it looks like you'll end up
> overflowing the stack (have you actually tested your draft patch?)

Please look again; it doesn't overflow.

I hadn't tested the patch when I mailed it to you (I prepared it in
order to reply to your e-mail, and it's annoying to reboot the machine
I'm composing an e-mail on), but I have since. It works.

> This is also why I objected to your change to use the stale stack
> contents for the input array. That does something different from the
> rest of the changes in the patch. I'm sorry if keep harping on this,
> but this is critically important. It also means that we can accept
> the small, obviously correct changes, while we discuss the larger,
> more invasive changes. It also makes it easier to backport the
> smaller and more important changes to stable kernels.
[...]
> In any case, this is another reason why I really request people to
> send a series of git commits, where each one is small, easy to read,
> and Obviously Correct.
[...]
> small patches really, REALLY, helps the review process.)

I'm very aware of this. As I've said before on the git mailing list,
the reason that git bisect is such a killer feature is due to an even
more important, but also more subtle, killer feature of git: it makes
it practical to use lots of small commits, so it can point you at a very
small change.

If you had CVS-style weeks-in-preparation-and-testing "code drop"
commits, it wouldn't be nearly as useful.

Indeed, if you look at some of the 10-patch series I've posted recently
(I'm on a "clear my tree of private patches" binge), I think I'm doing
fairly well. If anything I worry about erring on the side of too minor.
Does this meet with your approval:

http://www.spinics.net/lists/linux-media/msg76435.html

But even I get annoyed when I have a 1-line comment typo fix and wonder
if it really deserves its own commit or if I can just include it with
the other changes I'm making to that file.


Anyway, please stop wearing out your fingers and consider the point Made.

I will do my utmost to divide things as finely as possible.

The one problem is that too small a change can be Obviously Correct
(to the extent that the overall code complexity allows *anything* to
be obvious), but not Obviously Useful.

Still, combining patches is a whole lot less effort than splitting them,
so I'll err far on that side.

> Sure, but you can put the deletion of the cmpxchg and the out[] array
> in separate commits, where the tree remains building and secure at
> each step.

Definitely! I'm going to produce as long a patch series as I possibly
can. But there's always an overall goal to the entire series, and
that's what I'm trying to discuss beforehand.

Looking back, I understand now how my words could lead to confusion.
To me, the race in add_interrupt_randomness is fairly inconsequential
and fixing it is a minor side effect of my overall goal of fixing *all*
the races.

The *fundamental* race, as I see it, is the one between modifying pools
and crediting entropy.

As I noted, you can't safely do the credit either before *or* after modifying
the pool; you will always end up with the wrong answer in some situation.
This is why the RNDADDENTROPY ioctl is required and the RNDADDTOENTCNT
ioctl is old legacy cruft that should never be used. (Note to self:
add a warning if it's ever used, or even disable it entirely.)

When I said "it's a lot bigger job than that" I was thinking about *my*
agenda of the entire project, when I had just asked "which path do you
think I should take?". I wasn't trying to imply it would be one patch.

I have half a dozen patches to random.c already waiting. For example,
one is a bulk conversion of __u8 and __u32 to u8 and u32. The underscore
versions are only intended for public header files where namespace
pollution is a problem.

But I have to think about ordering and how to present them; a patch like
that is annoying to reorder with others.

(Honestly, my personal preference is for the longer names from <stdint.h>
just because they *are* standard and so very widely used. But that's
an issue for a maintainer's esthetic judgement.)

2014-06-11 02:08:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

On Tue, Jun 10, 2014 at 08:10:03PM -0400, George Spelvin wrote:
> What I wanted to do was eliminate that huge tmp buffer from
> _xfer_secondary_pool. There's no good reason why it needs to be there,
> and several reasons for getting rid of it.

So have you actually instrumented the kernel to demonstrate that in
fact we have super deep stack call paths where the 128 bytes worth of
stack actually matters?

Premature optimization being the root of all evil (not to mention
wasting a lot of time of kernel developers) and all that....

> I hadn't tested the patch when I mailed it to you (I prepared it in
> order to reply to your e-mail, and it's annoying to reboot the machine
> I'm composing an e-mail on), but I have since. It works.

As an aside, I'd strongly suggest that you use kvm to do your kernel
testing. It means you can do a lot more testing which is always a
good thing....

> The *fundamental* race, as I see it, is the one between modifying pools
> and crediting entropy.
>
> As I noted, you can't safely do the credit either before *or* after modifying
> the pool; you will always end up with the wrong answer in some situation.

Actually, it's **fine**. That's because RNDADDENTROPY adds the
entropy to the input pool, which has the limit flag set. So we
will never pull more entropy than the pool is credited as having.
This means that race can't happen. It ***is*** safe.

1) Assume the entropy count starts at 10 bytes.

2) Random writer mixes in 20 bytes of entropy into the entropy pool.

3) Random extractor tries to extract 32 bytes of entropy. Since the
entropy count is still 10, it will only get 10 bytes. (And if the
entropy count had started at zero, we wouldn't have extracted any
entropy at all.)

4) Random writer credits the entropy counter with the 20 bytes mixed in
step #2.

See? no problems!

- Ted

2014-06-11 02:21:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

On Tue, Jun 10, 2014 at 08:10:03PM -0400, George Spelvin wrote:
>
> But even I get annoyed when I have a 1-line comment typo fix and wonder
> if it really deserves its own commit or if I can just include it with
> the other changes I'm making to that file.

Unless you're actually modifying that section of code, I usually
recommend that people just not bother. The fact that someone included
a one-line comment fix caused a merge conflict with the ext4 pull in
this merge window that Linus had to fix up. Not that it's a big deal,
but unless it's something that's really going to confuse the reader, I
treat it as white space fixes; something that's only worth fixing if
that particular function is being modified for a "real" change.

> I have half a dozen patches to random.c already waiting. For example,
> one is a bulk conversion of __u8 and __u32 to u8 and u32. The underscore
> versions are only intended for public header files where namespace
> pollution is a problem.

And sorry, I consider this sort of thing to be just code wankery.
We use __u32 in a number of places in the kernel, and it doesn't
really affect code readability. A change like this almost guarantees
that stable patches won't apply cleanly, and will have to be ported
manually. It's just not worth it.

- Ted

2014-06-11 03:58:09

by George Spelvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

> Actually, it's **fine**. That's because RNDADDENTROPY adds the
> entropy to the input pool, which has the limit flag set. So we
> will never pull more entropy than the pool is credited as having.
> This means that race can't happen. It ***is*** safe.
>
> 1) Assume the entropy count starts at 10 bytes.
>
> 2) Random writer mixes in 20 bytes of entropy into the entropy pool.
>
> 3) Random extractor tries to extract 32 bytes of entropy. Since the
> entropy count is still 10, it will only get 10 bytes. (And if the
> entropy count had started at zero, we wouldn't have extracted any
> entropy at all.)
>
> 4) Random writer credits the entropy counter with the 20 bytes mixed in
> step #2.
>
> See? no problems!

You can forbid underflows, but the code doesn't forbid overflows.

1. Assume the entropy count starts at 512 bytes (input pool full)
2. Random writer mixes in 20 bytes of entropy into the input pool.
2a. Input pool entropy is, however, capped at 512 bytes.
3. Random extractor extracts 32 bytes of entropy from the pool.
Succeeds because 32 < 512. Pool is left with 480 bytes of
entropy.
3a. Random extractor decrements pool entropy estimate to 480 bytes.
This is accurate.
4. Random writer credits pool with 20 bytes of entropy.
5. Input pool entropy is now 480 bytes, estimate is 500 bytes.

Problem, no?
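
(The same interleaving as a toy userspace C simulation, purely
illustrative and not kernel code:)

#include <stdio.h>

#define POOL_SIZE 512	/* input pool capacity in bytes, as in the example */

int main(void)
{
	int estimate = POOL_SIZE;	/* 1.    entropy estimate: pool full */
	int actual   = POOL_SIZE;	/*       true entropy content        */

	/* 2/2a. writer mixes in 20 bytes; content is already capped at 512 */
	/* 3/3a. extractor removes 32 bytes and decrements the estimate     */
	estimate -= 32;
	actual   -= 32;
	/* 4.    the writer's deferred credit lands after the extraction    */
	estimate += 20;
	if (estimate > POOL_SIZE)
		estimate = POOL_SIZE;

	/* prints: estimate = 500 bytes, actual <= 480 bytes */
	printf("estimate = %d bytes, actual <= %d bytes\n", estimate, actual);
	return 0;
}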

2014-06-11 04:34:18

by George Spelvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

> So have you actually instrumented the kernel to demonstrate that in
> fact we have super deep stack call paths where the 128 bytes worth of
> stack actually matters?

I haven't got a specific call chain where 128 bytes pushes it
over a limit. But kernel stack usage is a perennial problem.
Wasn't there some discussion about that just recently?
6538b8ea8: "x86_64: expand kernel stack to 16K"

I agree a 128 byte stack frame is not one of the worst offenders,
but it's enough to try to clean up if possible.

You can search LKML for a bunch of discussion of 176 bytes
in __alloc_pages_slowpath().

And in this case, it's so *easy*. extract_buf() works 10 bytes at a
time anyway, and _mix_pool_bytes() works a byte at a time.

>> I hadn't tested the patch when I mailed it to you (I prepared it in
>> order to reply to your e-mail, and it's annoying to reboot the machine
>> I'm composing an e-mail on), but I have since. It works.

> As an aside, I'd strongly suggest that you use kvm to do your kernel
> testing. It means you can do a lot more testing which is always a
> good thing....

H'mmm. I need to learn what KVM *is*. Apparently there's a second
meaning other than "keyboard, video & mouse". :-)

Normally, I just test using modules. Especially when working on a
driver for a hardware device, virtualization makes life difficult.
But /dev/random is (for good reasons) not modularizable.

(I can see how it'd be useful for filesystem development, however.)

2014-06-11 13:09:32

by Theodore Ts'o

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

On Wed, Jun 11, 2014 at 12:34:16AM -0400, George Spelvin wrote:
>
> I haven't got a specific call chain where 128 bytes pushes it
> over a limit. But kernel stack usage is a perennial problem.
> Wasn't there some discussion about that just recently?
> 6538b8ea8: "x86_64: expand kernel stack to 16K"

Yes, but that was a call path involving file systems
writepage/writepages, block devices, and the writeback code paths.
None of which need random numbers...

> Normally, I just test using modules. Especially when working on a
> driver for a hardware device, virtualization makes life difficult.
> But /dev/random is (for good reasons) not modularizable.

Using virtualization is much faster because you don't have to reboot
your system. And if you want to test what happens to the random
driver at boot time, again, booting a guest kernel under KVM is much
faster, especially if you screw up and the system locks up....

Sure, if you are testing an actual hardware device, it's harder to use
virtualization. But if you are doing any kind of core kernel work
(and /dev/random counts as core kernel), virtualization is really
convenient.

Cheers,

- Ted

2014-06-11 13:11:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

On Tue, Jun 10, 2014 at 11:58:06PM -0400, George Spelvin wrote:
> You can forbid underflows, but the code doesn't forbid overflows.
>
> 1. Assume the entropy count starts at 512 bytes (input pool full)
> 2. Random writer mixes in 20 bytes of entropy into the input pool.
> 2a. Input pool entropy is, however, capped at 512 bytes.
> 3. Random extractor extracts 32 bytes of entropy from the pool.
> Succeeds because 32 < 512. Pool is left with 480 bytes of
> entropy.
> 3a. Random extractor decrements pool entropy estimate to 480 bytes.
> This is accurate.
> 4. Random writer credits pool with 20 bytes of entropy.
> 5. Input pool entropy is now 480 bytes, estimate is 500 bytes.

Good point, that's a potential problem, although messing up the
accounting between 480 and 500 bytes is not nearly as bad as messing
up 0 and 20.

It's not something where, if the fix required massive changes, I'd
necessarily feel the need to backport them to stable. It's a
certificational weakness, but it's not a disaster.

- Ted

2014-06-11 16:38:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: drivers/char/random.c: More futzing about

On Mon, Jun 09, 2014 at 09:17:38AM -0400, George Spelvin wrote:
> Here's an example of a smaller, faster, and better fast_mix() function.
> The mix is invertible (thus preserving entropy), but causes each input
> bit or pair of bits to avalanche to at least 43 bits after 2 rounds and
> 120 bits after 3.

I've been looking at your fast_mix2(), and it does look interesting.

> For comparison, with the current linear fast_mix function, bits above
> the 12th in the data words only affect 4 other bits after one repetition.
>
> With 3 iterations, it runs in 2/3 the time of the current fast_mix
> and is half the size: 84 bytes of object code as opposed to 168.

... but how did you measure the "2/3 the time"? I've done some
measurements, using both "time calling fast_mix() and fast_mix2() N
times and divide by N" (where N needs to be quite large) and a single
RDTSC-timed call. Using the first metric, fast_mix2() takes seven times
as long to run.

If I only run the two mixing functions once, and use RDTSC to measure
the time, fast_mix2() takes only three times as long. (That's because
the memory cache effects are much less, which favors fast_mix2).

But either way, fast_mix2() is slower than the current fast_mix(), and
using the measurements that are as advantageous (and most realistic)
as I could come up with, it's still three times slower.

My measurements were done using an Intel 2.8 GHz quad-core i7-4900MQ CPU.

- Ted

2014-06-11 16:48:45

by H. Peter Anvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: More futzing about

On 06/11/2014 09:38 AM, Theodore Ts'o wrote:
> On Mon, Jun 09, 2014 at 09:17:38AM -0400, George Spelvin wrote:
>> Here's an example of a smaller, faster, and better fast_mix() function.
>> The mix is invertible (thus preserving entropy), but causes each input
>> bit or pair of bits to avalanche to at least 43 bits after 2 rounds and
>> 120 bits after 3.
>
> I've been looking at your fast_mix2(), and it does look interesting.
>
>> For comparison, with the current linear fast_mix function, bits above
>> the 12th in the data words only affect 4 other bits after one repetition.
>>
>> With 3 iterations, it runs in 2/3 the time of the current fast_mix
>> and is half the size: 84 bytes of object code as opposed to 168.
>
> ... but how did you measure the "2/3 the time"? I've done some
> measurements, using both "time calling fast_mix() and fast_mix2() N
> times and divide by N" (where N needs to be quite large) and a single
> RDTSC-timed call. Using the first metric, fast_mix2() takes seven times
> as long to run.
>
> If I only run the two mixing functions once, and use RDTSC to measure
> the time, fast_mix2() takes only three times as long. (That's because
> the memory cache effects are much less, which favors fast_mix2).
>
> But either way, fast_mix2() is slower than the current fast_mix(), and
> using the measurements that are as advantageous (and most realistic)
> as I could come up with, it's still three times slower.
>
> My measurements were done using an Intel 2.8 GHz quad-core i7-4900MQ CPU.
>

While talking about performance, I did a quick prototype of random using
Skein instead of SHA-1, and it was measurably faster, in part because
Skein produces more output per hash.

-hpa

2014-06-11 19:26:04

by Theodore Ts'o

[permalink] [raw]
Subject: Re: drivers/char/random.c: More futzing about

On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote:
> While talking about performance, I did a quick prototype of random using
> Skein instead of SHA-1, and it was measurably faster, in part because
> Skein produces more output per hash.

Which Skein parameters did you use, and how much stack space was
required for it? Skein-512 is described as needing 200 bytes of
state, IIRC (most of which, I assume, comes from the Threefish key
schedule).

- Ted

2014-06-11 20:41:47

by H. Peter Anvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: More futzing about

On 06/11/2014 12:25 PM, Theodore Ts'o wrote:
> On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote:
>> While talking about performance, I did a quick prototype of random using
>> Skein instead of SHA-1, and it was measurably faster, in part because
>> Skein produces more output per hash.
>
> Which Skein parameters did you use, and how much stack space was
> required for it? Skein-512 is described as needing 200 bytes of
> state, IIRC (most of which, I assume, comes from the Threefish key
> schedule).
>

I believe I used Skein-256, but I'd have to dig to find it again.

-hpa

2014-06-12 00:32:53

by George Spelvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: More futzing about

> ... but how did you measure the "2/3 the time"? I've done some
> measurements, using both "time calling fast_mix() and fast_mix2() N
> times and divide by N" (where N needs to be quite large) and a single
> RDTSC-timed call. Using the first metric, fast_mix2() takes seven times
> as long to run.

Wow, *massively* different results. Using a large iteration count,
it's definitely faster. Here's with 1,000,000 iterations:

# ./perftest ./random
pool 1 = 6b469698 596ceb8f 70536d2d e262b8ed
pool 2 = 72e2554d fd20e020 a51baf43 19472f63
0: 40587696 19952756 (-20634940)
1: 40569444 19986700 (-20582744)
2: 40529396 19983108 (-20546288)
3: 40492468 19959528 (-20532940)
4: 40461808 19977316 (-20484492)
5: 40421980 19977436 (-20444544)
6: 40495440 19959904 (-20535536)
7: 40436676 19975400 (-20461276)
8: 40454912 19971360 (-20483552)
9: 40463260 19957324 (-20505936)

Here's with 1 iteration (which you're very right, I failed to test,
and instruction counts matter more when code is not hot in cache)

# ./perftest ./random
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 76 84 (+8)
1: 36 36 (+0)
2: 32 36 (+4)
3: 32 40 (+8)
4: 28 40 (+12)
5: 32 36 (+4)
6: 32 36 (+4)
7: 28 40 (+12)
8: 32 40 (+8)
9: 28 40 (+12)

Comparable, but slightly slower. Clearly, I need to do better.
And you can see the first-iteration effects clearly. Still,
nothing *remotely* like 7x!

That seems crazy, as the overall operation counts are
not that dissimilar.


Here's the "perftest" shell wrapper:
#!/bin/sh

old="`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`"

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance > $i
done

"$@"

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo "$old" > $i
done


And "random.c" is appended. Apologies for the comemnts and #ifdefs
I used while experimenting; I wanted to give you *exactly* what
I'm running. Do you see a bug in it anywhere?

I know P4s have slow rotates, so the larger number of them compared
to shifts in the original will definitely cause a hit. But neither
of us are using that.

Compile line and platform:
cc -W -Wall -m64 -O -march=native random.c -o random
model name : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz




#include <stdint.h>
typedef uint32_t u32;

static u32 const twist_table[8] = {
0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };

struct fast_pool {
u32 pool[4];
unsigned long last;
unsigned short count;
unsigned char rotate;
unsigned char last_timer_intr;
};

static inline u32 rol32(u32 word, unsigned int shift)
{
return (word << shift) | (word >> (32 - shift));
}


/*
* This is a fast mixing routine used by the interrupt randomness
* collector. It's hardcoded for an 128 bit pool and assumes that any
* locks that might be needed are taken by the caller.
*/
void fast_mix(struct fast_pool *f, u32 const input[4])
{
u32 w;
unsigned input_rotate = f->rotate;

w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3];
f->pool[0] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 14) & 31;
w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0];
f->pool[1] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;
w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1];
f->pool[2] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;
w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2];
f->pool[3] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;

f->rotate = input_rotate;
f->count++;
}

void fast_mix2(struct fast_pool *f, u32 const input[4])
{
/* Copy pool to registers */
u32 a = f->pool[0] ^ input[0];
u32 b = f->pool[1] ^ input[1];
u32 c = f->pool[2] ^ input[2];
u32 d = f->pool[3] ^ input[3];
#if 1
int i;

for (i = 0; i < 2; i++) {
/*
* Inspired by ChaCha QuarterRound, but
* modified for much greater parallelism.
*/
a += b; c += d;
d ^= a; b ^= c;
// a = rol32(a, 24); c = rol32(c, 5);
a = rol32(a, 27); c = rol32(c, 7);
// d = rol32(d, 27); b = rol32(b, 7);

a += b; c += d;
d ^= a; b ^= c;
a = rol32(a, 17); c = rol32(c, 11);
// d = rol32(d, 17); b = rol32(b, 11);
#if 0
a += b; c += d;
d ^= a; b ^= c;
a = rol32(a, 24); c = rol32(c, 5);
#endif
}
f->pool[0] = a;
f->pool[1] = b;
f->pool[2] = c;
f->pool[3] = d;
#elif 1

a ^= a<<15; a ^= a>>4 ^ d ^ d>>21; f->pool[0] = a;
b ^= b<<15; b ^= b>>4 ^ a ^ a>>21; f->pool[1] = b;
c ^= c<<15; c ^= c>>4 ^ b ^ b>>21; f->pool[2] = c;
d ^= d<<15; d ^= d>>4 ^ c ^ c>>21; f->pool[3] = d;
#else
f->pool[0] = a ^= a<<20 ^ b ^ b>>11 ^ c ^ c<<27 ^ d ^ d >> 6;
f->pool[1] = b ^= b<<20 ^ c ^ c>>11 ^ d ^ d<<27 ^ a ^ a >> 6;
f->pool[2] = c ^= c<<20 ^ d ^ d>>11 ^ a ^ a<<27 ^ b ^ b >> 6;
f->pool[3] = d ^= d<<20 ^ a ^ a>>11 ^ b ^ b<<27 ^ c ^ c >> 6;
#endif

f->count++;
}

static uint64_t
rdtsc(void)
{
uint32_t low, high;
__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
return ((uint64_t)high << 32) | low;
}

#include <stdio.h>
#include <stdlib.h>

#define ITER 1
#define N 10

int
main(void)
{
struct fast_pool fp1 = {
.pool = {0},
.count = 0,
.rotate = 0
};
struct fast_pool fp2 = {
.pool = {0},
.count = 0
};
uint32_t times[N][2];
int i, j;

for (i = 0; i < N; i++) {
uint64_t t1, t2, t3;
uint32_t data[4];

for (j = 0; j < 4; j++)
data[j] = random();

t1 = rdtsc();
for (j = 0; j < ITER; j++)
fast_mix(&fp1, data);
t2 = rdtsc();
for (j = 0; j < ITER; j++)
fast_mix2(&fp2, data);
t3 = rdtsc();
times[i][0] = t2 - t1;
times[i][1] = t3 - t2;
}
printf("pool 1 = %08x %08x %08x %08x\n", fp1.pool[0],
fp1.pool[1], fp1.pool[2], fp1.pool[3]);
printf("pool 2 = %08x %08x %08x %08x\n", fp2.pool[0],
fp2.pool[1], fp2.pool[2], fp2.pool[3]);

for (i = 0; i < N; i++)
printf("%2d: %10u %10u (%+d)\n",
i, times[i][0], times[i][1],
(int)(times[i][1] - times[i][0]));
return 0;
}

2014-06-12 00:43:01

by George Spelvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

> It's not something where, if the fix required massive changes, I'd
> necessarily feel the need to backport them to stable. It's a
> certificational weakness, but it's not a disaster.

Agreed! It's been there for years, and I'm not too worried. It takes
a pretty tight race to cause the problem in the first place.

As you note, it only happens with a full pool (already a very secure
situation), and the magnitude is limited by the size of entropy additions,
which are normally small.

I'm just never happy with bugs in security-critical code. "I don't
think that bug is exploitable" is almost as ominous a phrase as "Y'all
watch this!"

2014-06-12 00:45:03

by H. Peter Anvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: More futzing about

On 06/11/2014 01:41 PM, H. Peter Anvin wrote:
> On 06/11/2014 12:25 PM, Theodore Ts'o wrote:
>> On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote:
>>> While talking about performance, I did a quick prototype of random using
>>> Skein instead of SHA-1, and it was measurably faster, in part because
>>> Skein produces more output per hash.
>>
>> Which Skein parameters did you use, and how much stack space was
>> required for it? Skein-512 is described as needing 200 bytes of
>> state, IIRC (which I assume most of which comes from Threefish key
>> schedule).
>>
>
> I believe I used Skein-256, but I'd have to dig to find it again.
>
> -hpa
>

Sadly I can't find the tree, but I'm 94% sure it was Skein-256
(specifically the SHA3-256 candidate parameter set.)

-hpa

2014-06-12 01:03:55

by H. Peter Anvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: more ruminations

On 06/11/2014 06:11 AM, Theodore Ts'o wrote:
> On Tue, Jun 10, 2014 at 11:58:06PM -0400, George Spelvin wrote:
>> You can forbid underflows, but the code doesn't forbid overflows.
>>
>> 1. Assume the entropy count starts at 512 bytes (input pool full)
>> 2. Random writer mixes in 20 bytes of entropy into the input pool.
>> 2a. Input pool entropy is, however, capped at 512 bytes.
>> 3. Random extractor extracts 32 bytes of entropy from the pool.
>> Succeeds because 32 < 512. Pool is left with 480 bytes of
>> entropy.
>> 3a. Random extractor decrements pool entropy estimate to 480 bytes.
>> This is accurate.
>> 4. Random writer credits pool with 20 bytes of entropy.
>> 5. Input pool entropy is now 480 bytes, estimate is 500 bytes.
>
> Good point, that's a potential problem, although messing up the
> accounting between 480 and 500 bytes is not nearly as bad as messing
> up 0 and 20.
>
> It's not something where, if the fix required massive changes, I'd
> necessarily feel the need to backport them to stable. It's a
> certificational weakness, but it's not a disaster.
>

Actually, with the new accounting code it will be even less serious,
because mixing into a nearly full pool is discounted heavily -- because
it is not like filling a queue; the mixing function will
probabilistically overwrite existing pool entropy.

So it is still a race condition, and still wrong, but it is a lot less
wrong.

-hpa

2014-06-12 01:51:41

by George Spelvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: More futzing about

> Sadly I can't find the tree, but I'm 94% sure it was Skein-256
> (specifically the SHA3-256 candidate parameter set.)

It would be nice to have two hash functions, optimized separately for 32-
and 64-bit processors. As the Skein report says, the algorithm can
be adapted to 32 bits easily enough.

I also did some work a while ago to adapt the Skein parameter search
code to develop a Skein-192 (6x32 bits) that would fit into registers
on x86-32. (It got stalled when I e-mailed Niels Ferguson about it
and never heard back; it fell off the to-do list while I was waiting.)

The intended target was IPv6 address hashing for sequence number
randomization, but it could be used for pool hashing, too.

2014-06-12 03:22:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: drivers/char/random.c: More futzing about

On Wed, Jun 11, 2014 at 08:32:49PM -0400, George Spelvin wrote:
> Comparable, but slightly slower. Clearly, I need to do better.
> And you can see the first-iteration effects clearly. Still,
> nothing *remotely* like 7x!

I redid my numbers, and I can no longer reproduce the 7x slowdown. I
do see that if you compile w/o -O2, fast_mix2 is twice as slow. But
it's not 7x slower.

When compiling w/o -O2:

fast_mix fast_mix2
task-clock 221.3 ms 460.7 ms

When compiling with -O2 -Os:

fast_mix fast_mix2
task-clock 115.4 ms 71.5 ms

And here's the numbers I got with a single iteration using rdtsc:

fast_mix: 164 fast_mix2: 237
fast_mix: 168 fast_mix2: 230
fast_mix: 166 fast_mix2: 228
fast_mix: 164 fast_mix2: 230
fast_mix: 166 fast_mix2: 230
fast_mix: 168 fast_mix2: 232
fast_mix: 166 fast_mix2: 228
fast_mix: 164 fast_mix2: 228
fast_mix: 166 fast_mix2: 234
fast_mix: 166 fast_mix2: 230

- Ted


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef unsigned int __u32;

struct fast_pool {
__u32 pool[4];
unsigned long last;
unsigned short count;
unsigned char rotate;
unsigned char last_timer_intr;
};


/**
* rol32 - rotate a 32-bit value left
* @word: value to rotate
* @shift: bits to roll
*/
static inline __u32 rol32(__u32 word, unsigned int shift)
{
return (word << shift) | (word >> (32 - shift));
}

static __u32 const twist_table[8] = {
0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };

/*
* This is a fast mixing routine used by the interrupt randomness
* collector. It's hardcoded for an 128 bit pool and assumes that any
* locks that might be needed are taken by the caller.
*/
extern void fast_mix(struct fast_pool *f, __u32 input[4])
{
__u32 w;
unsigned input_rotate = f->rotate;

w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3];
f->pool[0] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 14) & 31;
w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0];
f->pool[1] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;
w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1];
f->pool[2] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;
w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2];
f->pool[3] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;

f->rotate = input_rotate;
f->count++;
}

extern fast_mix2(struct fast_pool *f, __u32 const input[4])
{
__u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1];
__u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3];
int i;


for (i = 0; i < 3; i++) {
/*
* Inspired by ChaCha's QuarterRound, but
* modified for much greater parallelism.
* Surprisingly, rotating a and c seems to work
* better than b and d. And it runs faster.
*/
a += b; c += d;
d ^= a; b ^= c;
a = rol32(a, 15); c = rol32(c, 21);

a += b; c += d;
d ^= a; b ^= c;
a = rol32(a, 3); c = rol32(c, 7);
}
f->pool[0] = a; f->pool[1] = b;
f->pool[2] = c; f->pool[3] = d;
f->count++;
}

static __inline__ volatile unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
}

int main(int argc, char **argv)
{
struct fast_pool f;
int i;
__u32 input[4];
unsigned volatile long long start_time, end_time;

memset(&f, 0, sizeof(f));
memset(&input, 0, sizeof(input));
f.pool[0] = 1;

#if !defined(BENCH_FASTMIX) && !defined(BENCH_FASTMIX2)
for (i=0; i < 10; i++) {
usleep(50000);
start_time = rdtsc();
fast_mix(&f, input);
end_time = rdtsc();
printf("fast_mix: %llu\t", end_time - start_time);
usleep(50000);
start_time = rdtsc();
fast_mix2(&f, input);
end_time = rdtsc();
printf("fast_mix2: %llu\n", end_time - start_time);
}

#endif

#ifdef BENCH_FASTMIX
for (i=0; i < 10240000; i++) {
fast_mix(&f, input);
}
#endif

#ifdef BENCH_FASTMIX2
for (i=0; i < 10240000; i++) {
fast_mix2(&f, input);
}
#endif
}

2014-06-12 03:44:00

by George Spelvin

[permalink] [raw]
Subject: Re: drivers/char/random.c: More futzing about

Just to add to my total confusion about the totally disparate performance
numbers we're seeing, I did some benchmarks on other machines.

The speedup isn't as good one-pass as it is iterated, and as I mentioned
it's slower on a P4, but it's not 7 times slower by any stretch.

These are all 1-iteration numbers, run immediately after scp-ing the
binary to the machine, so there's no possibility of anything being cached.

(The "64" and "32" versions are compiled -m32 and -m64,
of course.)

2.5 GHz Phenom 9850:

$ /tmp/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 199 142 (-57)
1: 104 95 (-9)
2: 104 110 (+6)
3: 103 109 (+6)
4: 105 89 (-16)
5: 103 88 (-15)
6: 104 89 (-15)
7: 104 95 (-9)
8: 105 85 (-20)
9: 105 85 (-20)
$ /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 324 147 (-177)
1: 100 86 (-14)
2: 100 99 (-1)
3: 100 88 (-12)
4: 100 86 (-14)
5: 100 86 (-14)
6: 100 89 (-11)
7: 100 111 (+11)
8: 100 111 (+11)
9: 100 88 (-12)
$ /tmp/random64 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 554788 220327 (-334461)
1: 554825 220176 (-334649)
2: 553505 220148 (-333357)
3: 554661 220064 (-334597)
4: 569559 220064 (-349495)
5: 612798 220065 (-392733)
6: 570287 220064 (-350223)
7: 554790 220064 (-334726)
8: 554715 220065 (-334650)
9: 569840 220064 (-349776)
$ /tmp/random32 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 520117 280225 (-239892)
1: 520125 280154 (-239971)
2: 520104 280094 (-240010)
3: 520079 280060 (-240019)
4: 520069 280060 (-240009)
5: 520060 280060 (-240000)
6: 558971 280060 (-278911)
7: 520102 280060 (-240042)
8: 520082 280060 (-240022)
9: 520058 280060 (-239998)


3 GHz i5-3330:
$ /tmp/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 78 75 (-3)
1: 36 33 (-3)
2: 33 39 (+6)
3: 36 30 (-6)
4: 36 33 (-3)
5: 30 33 (+3)
6: 30 54 (+24)
7: 24 48 (+24)
8: 27 33 (+6)
9: 30 33 (+3)
$ /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 66 78 (+12)
1: 39 39 (+0)
2: 36 39 (+3)
3: 45 33 (-12)
4: 42 33 (-9)
5: 33 42 (+9)
6: 45 33 (-12)
7: 39 36 (-3)
8: 105 48 (-57)
9: 42 39 (-3)
$ /tmp/random64 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 406188 218104 (-188084)
1: 402620 246968 (-155652)
2: 402652 239840 (-162812)
3: 402720 200312 (-202408)
4: 402584 200080 (-202504)
5: 447488 200228 (-247260)
6: 402788 200312 (-202476)
7: 402688 200080 (-202608)
8: 427140 224320 (-202820)
9: 402576 200080 (-202496)
$ /tmp/random32 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 406485 266670 (-139815)
1: 392694 266463 (-126231)
2: 392496 266763 (-125733)
3: 426003 266145 (-159858)
4: 392688 266667 (-126021)
5: 432231 266589 (-165642)
6: 392754 298734 (-94020)
7: 392883 284994 (-107889)
8: 392637 266694 (-125943)
9: 392985 267024 (-125961)


3.5 GHz i7-2700:
# /tmp/perftest /tmp/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 82 90 (+8)
1: 38 41 (+3)
2: 46 38 (-8)
3: 35 41 (+6)
4: 46 41 (-5)
5: 38 38 (+0)
6: 41 55 (+14)
7: 41 35 (-6)
8: 46 24 (-22)
9: 35 38 (+3)
# /tmp/perftest /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 82 76 (-6)
1: 32 53 (+21)
2: 49 44 (-5)
3: 35 41 (+6)
4: 46 35 (-11)
5: 35 44 (+9)
6: 49 50 (+1)
7: 41 41 (+0)
8: 32 44 (+12)
9: 49 44 (-5)
# /tmp/perftest /tmp/random64 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 445486 227993 (-217493)
1: 445089 227602 (-217487)
2: 445381 227736 (-217645)
3: 445220 227661 (-217559)
4: 445159 227798 (-217361)
5: 445285 227608 (-217677)
6: 445005 227806 (-217199)
7: 445439 227608 (-217831)
8: 445328 227798 (-217530)
9: 445533 227625 (-217908)
# /tmp/perftest /tmp/random32 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 437001 335959 (-101042)
1: 437004 336120 (-100884)
2: 436683 335936 (-100747)
3: 436806 336452 (-100354)
4: 436996 335988 (-101008)
5: 436916 336210 (-100706)
6: 437042 354722 (-82320)
7: 436896 336146 (-100750)
8: 436981 336741 (-100240)
9: 437016 331920 (-105096)


And on a 1.6 GHz P4:
$ /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 312 176 (-136)
1: 112 112 (+0)
2: 112 112 (+0)
3: 112 112 (+0)
4: 112 112 (+0)
5: 112 112 (+0)
6: 112 112 (+0)
7: 112 112 (+0)
8: 112 112 (+0)
9: 112 112 (+0)
$ /tmp/random32 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 480344 550192 (+69848)
1: 480088 550084 (+69996)
2: 480140 550128 (+69988)
3: 480088 550084 (+69996)
4: 480088 550084 (+69996)
5: 480088 550084 (+69996)
6: 480088 550084 (+69996)
7: 480088 550084 (+69996)
8: 480088 550084 (+69996)
9: 480088 550084 (+69996)


And just for lulz, a 1.533 GHz Athlon XP:
$ /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 91 83 (-8)
1: 58 57 (-1)
2: 58 57 (-1)
3: 58 57 (-1)
4: 58 45 (-13)
5: 58 45 (-13)
6: 58 45 (-13)
7: 58 45 (-13)
8: 58 45 (-13)
9: 58 45 (-13)
$ /tmp/random32 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 570094 380159 (-189935)
1: 550045 380088 (-169957)
2: 550028 380043 (-169985)
3: 568533 380087 (-188446)
4: 550028 380021 (-170007)
5: 550028 380021 (-170007)
6: 550028 380021 (-170007)
7: 550028 380021 (-170007)
8: 550028 382985 (-167043)
9: 550088 380021 (-170067)

2014-06-12 04:13:20

by George Spelvin

[permalink] [raw]
Subject: random: Benchamrking fast_mix2

> I redid my numbers, and I can no longer reproduce the 7x slowdown. I
> do see that if you compile w/o -O2, fast_mix2 is twice as slow. But
> it's not 7x slower.

For my single-round, I needed to drop to 2 loops rather than 3 to match
the speed. That's in the source I posted, but I didn't point it out.

(It wasn't an attempt to be deceptive, that's just how I happened
to have left the file when I was experimenting with various options.
I figured if we were looking for 7x, 1.5x wasn't all that important.)

That explains some of the residual difference between our figures.

When developing, I was using a many-iteration benchmark, and I suspect it
fitted in the Ivy Bridge uop cache, which let it saturate the execution
resources.

Sorry for the premature alarm; I'll go back to work and find something
better.

I still get comparable speed for 2 loops and -O2:
$ cc -W -Wall -m32 -O2 -march=native random.c -o random32
# ./perftest ../spooky/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 148 124 (-24)
1: 48 36 (-12)
2: 40 36 (-4)
3: 44 40 (-4)
4: 44 40 (-4)
5: 36 36 (+0)
6: 52 36 (-16)
7: 44 32 (-12)
8: 44 36 (-8)
9: 48 36 (-12)
$ cc -W -Wall -m64 -O2 -march=native random.c -o random64
# ./perftest ../spooky/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 132 104 (-28)
1: 40 40 (+0)
2: 36 44 (+8)
3: 32 40 (+8)
4: 40 36 (-4)
5: 32 40 (+8)
6: 36 44 (+8)
7: 40 40 (+0)
8: 36 44 (+8)
9: 40 36 (-4)
$ cc -W -Wall -m32 -O3 -march=native random.c -o random32
# ./perftest ./random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 88 48 (-40)
1: 36 40 (+4)
2: 36 44 (+8)
3: 32 40 (+8)
4: 36 40 (+4)
5: 96 40 (-56)
6: 40 40 (+0)
7: 36 40 (+4)
8: 28 48 (+20)
9: 28 40 (+12)
$ cc -W -Wall -m64 -O3 -march=native random.c -o random64
# ./perftest ./random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 72 80 (+8)
1: 36 52 (+16)
2: 32 36 (+4)
3: 32 36 (+4)
4: 28 40 (+12)
5: 32 40 (+8)
6: 32 40 (+8)
7: 32 36 (+4)
8: 28 44 (+16)
9: 36 36 (+0)
$ cc -W -Wall -m32 -Os -march=native random.c -o random32
# ./perftest ./random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 108 132 (+24)
1: 44 44 (+0)
2: 76 40 (-36)
3: 44 48 (+4)
4: 36 40 (+4)
5: 32 44 (+12)
6: 40 56 (+16)
7: 44 36 (-8)
8: 44 40 (-4)
9: 32 40 (+8)
$ cc -W -Wall -m64 -Os -march=native random.c -o random64
# ./perftest ./random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 96 108 (+12)
1: 44 52 (+8)
2: 40 40 (+0)
3: 40 36 (-4)
4: 40 32 (-8)
5: 36 36 (+0)
6: 44 32 (-12)
7: 36 36 (+0)
8: 40 36 (-4)
9: 40 36 (-4)

Yours looks much more careful about the timing.

A few GCC warnings I ended up fixing:
1) "volatile" on rdtsc is meaningless and ignore (with a warning)
2) fast_mix2() needs a void return type; it defaults to int.
3) int main() needs a "return 0"


Here's what I got running *your* program, unmodified except
for the above (meaning 3 inner loop iterations).
Compiled with GCC 4.9.0 (Debian 4.9.0-6), -O2.

i7-4940K# ./perftest ./ted32
fast_mix: 430 fast_mix2: 431
fast_mix: 442 fast_mix2: 464
fast_mix: 442 fast_mix2: 465
fast_mix: 442 fast_mix2: 431
fast_mix: 442 fast_mix2: 465
fast_mix: 431 fast_mix2: 430
fast_mix: 442 fast_mix2: 431
fast_mix: 431 fast_mix2: 465
fast_mix: 431 fast_mix2: 465
fast_mix: 431 fast_mix2: 431
i7-4940K# ./perftest ./ted64
fast_mix: 454 fast_mix2: 465
fast_mix: 453 fast_mix2: 465
fast_mix: 442 fast_mix2: 464
fast_mix: 453 fast_mix2: 464
fast_mix: 454 fast_mix2: 465
fast_mix: 453 fast_mix2: 465
fast_mix: 442 fast_mix2: 464
fast_mix: 453 fast_mix2: 464
fast_mix: 453 fast_mix2: 464
fast_mix: 453 fast_mix2: 465

In other words, pretty damn near the same
speed (with 3 loops).

So we still have some discrepancy to track down.

A few other machines.
i5-3330$ /tmp/ted32
fast_mix: 226 fast_mix2: 277
fast_mix: 561 fast_mix2: 429
fast_mix: 156 fast_mix2: 406
fast_mix: 504 fast_mix2: 534
fast_mix: 579 fast_mix2: 270
fast_mix: 240 fast_mix2: 270
fast_mix: 494 fast_mix2: 270
fast_mix: 240 fast_mix2: 138
fast_mix: 750 fast_mix2: 277
fast_mix: 124 fast_mix2: 270
i5-3330$ /tmp/ted64
fast_mix: 224 fast_mix2: 277
fast_mix: 226 fast_mix2: 312
fast_mix: 646 fast_mix2: 276
fast_mix: 233 fast_mix2: 456
fast_mix: 591 fast_mix2: 570
fast_mix: 413 fast_mix2: 563
fast_mix: 584 fast_mix2: 270
fast_mix: 231 fast_mix2: 261
fast_mix: 233 fast_mix2: 459
fast_mix: 528 fast_mix2: 277

Pentium4$ /tmp/ted32
fast_mix: 912 fast_mix2: 396
fast_mix: 792 fast_mix2: 160
fast_mix: 524 fast_mix2: 160
fast_mix: 1460 fast_mix2: 440
fast_mix: 496 fast_mix2: 160
fast_mix: 672 fast_mix2: 160
fast_mix: 700 fast_mix2: 160
fast_mix: 336 fast_mix2: 540
fast_mix: 896 fast_mix2: 160
fast_mix: 1052 fast_mix2: 156

Phemom9850$ /tmp/ted32
fast_mix: 463 fast_mix2: 158
fast_mix: 276 fast_mix2: 174
fast_mix: 194 fast_mix2: 135
fast_mix: 620 fast_mix2: 424
fast_mix: 584 fast_mix2: 424
fast_mix: 610 fast_mix2: 418
fast_mix: 651 fast_mix2: 1107
fast_mix: 634 fast_mix2: 439
fast_mix: 632 fast_mix2: 456
fast_mix: 534 fast_mix2: 205
Phemom9850$ /tmp/ted64
fast_mix: 783 fast_mix2: 185
fast_mix: 903 fast_mix2: 144
fast_mix: 955 fast_mix2: 178
fast_mix: 515 fast_mix2: 437
fast_mix: 642 fast_mix2: 580
fast_mix: 610 fast_mix2: 525
fast_mix: 523 fast_mix2: 119
fast_mix: 180 fast_mix2: 315
fast_mix: 596 fast_mix2: 570
fast_mix: 598 fast_mix2: 775

AthlonXP$ /tmp/ted32
fast_mix: 119 fast_mix2: 113
fast_mix: 139 fast_mix2: 109
fast_mix: 155 fast_mix2: 123
fast_mix: 134 fast_mix2: 140
fast_mix: 126 fast_mix2: 154
fast_mix: 134 fast_mix2: 113
fast_mix: 176 fast_mix2: 140
fast_mix: 145 fast_mix2: 113
fast_mix: 134 fast_mix2: 144
fast_mix: 155 fast_mix2: 112


So I'm still a bit confused. Would any bystanders like to
chip in? Ted, shall I send you some binaries?

2014-06-12 11:18:52

by George Spelvin

[permalink] [raw]
Subject: Re: random: Benchamrking fast_mix2

Ted, just as an example of a possible tweak, the following change seems
to slightly improve speed without reducing diffusion. (It now looks a
bit more like the Skein mix() function.)

I'll look for even more efficient kernels. It appears that the
initial XOR is costing a lot; 8 memory fetches take a while. Spacing that
out a bit would help, but having input arrive partway through the round
ends up requiring major surgery to my avalanche-measuring code.

These rotation constants aren't final, but tweaking them shouldn't affect
the speed.

IMHO *one* pass of this is as good as the current fast_mix, and two
are considerably better. Three achieve almost complete avalanche,
but aren't really necessary.

Since the pass is made of two identical halves, half-passes are also
possible. Either using only 2 rotate constants, or by unrolling
the last half-pass.

// (29/66353) score = 49/121/123: 6 27 16 14

a += b; c += d;
+ b = rol32(a, 6); d = rol32(c, 27);
d ^= a; b ^= c;
- a = rol32(a, 15); c = rol32(c, 21);

a += b; c += d;
+ b = rol32(a, 16); d = rol32(c, 14);
d ^= a; b ^= c;
- a = rol32(a, 3); c = rol32(c, 7);


Of course, using wider words works fantastically.
These constants give 76 bits of avalanche after 2 rounds,
essentially full after 3:

extern void fast_mix2(struct fast_pool *f, __u32 const input[4])
{

uint64_t a = ((uint64_t *)f->pool)[0] ^ ((uint64_t const *)input)[0];
uint64_t b = ((uint64_t *)f->pool)[1] ^ ((uint64_t const *)input)[1];
int i;

for (i = 0; i < 3; i++) {
a += b; b = rol64(b, 52);
b ^= a; a = rol64(a, 10);
a += b; b = rol64(b, 47);
b ^= a; a = rol64(a, 17);
}

((uint64_t *)f->pool)[0] = a;
((uint64_t *)f->pool)[1] = b;
f->count++;
}
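
(rol64() isn't defined in the userspace test harness above; to drop this
in, the obvious 64-bit analogue of rol32() will do:)

static inline uint64_t rol64(uint64_t word, unsigned int shift)
{
	return (word << shift) | (word >> (64 - shift));
}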

And it measures as faster than fast_mix even with 3 rounds:
# ./perftest ./ted64
fast_mix: 499 fast_mix2: 430
fast_mix: 431 fast_mix2: 430
fast_mix: 510 fast_mix2: 419
fast_mix: 430 fast_mix2: 419
fast_mix: 430 fast_mix2: 419
fast_mix: 431 fast_mix2: 419
fast_mix: 431 fast_mix2: 430
fast_mix: 510 fast_mix2: 419
fast_mix: 510 fast_mix2: 431
fast_mix: 510 fast_mix2: 430

And with 2 it's even faster.
(But so is fast_mix; there appears to be
significant timing instability here.)
# ./perftest ./ted64
fast_mix: 430 fast_mix2: 306
fast_mix: 431 fast_mix2: 306
fast_mix: 442 fast_mix2: 306
fast_mix: 442 fast_mix2: 306
fast_mix: 442 fast_mix2: 306
fast_mix: 442 fast_mix2: 363
fast_mix: 442 fast_mix2: 306
fast_mix: 442 fast_mix2: 329
fast_mix: 408 fast_mix2: 306
fast_mix: 442 fast_mix2: 329

2014-06-12 20:17:16

by Theodore Ts'o

[permalink] [raw]
Subject: Re: random: Benchamrking fast_mix2

One of the reasons for the timing instability is the usleep() calls.
They allow some other process to schedule, and thus disrupt the
I-cache and uop caches. With the usleep() calls:

i7-4900MQ# schedtool -R -p 1 -e /tmp/fast_mix2_49
fast_mix: 213 fast_mix2: 280
fast_mix: 336 fast_mix2: 356
fast_mix: 174 fast_mix2: 392
fast_mix: 202 fast_mix2: 403
fast_mix: 152 fast_mix2: 280
fast_mix: 212 fast_mix2: 403
fast_mix: 213 fast_mix2: 403
fast_mix: 213 fast_mix2: 392
fast_mix: 202 fast_mix2: 403
fast_mix: 191 fast_mix2: 392

... and without the usleep calls:

i7-4900MQ# schedtool -R -p 1 -e /tmp/fast_mix2_49
fast_mix: 146 fast_mix2: 347
fast_mix: 157 fast_mix2: 90
fast_mix: 78 fast_mix2: 90
fast_mix: 78 fast_mix2: 89
fast_mix: 78 fast_mix2: 90
fast_mix: 78 fast_mix2: 90
fast_mix: 90 fast_mix2: 90
fast_mix: 79 fast_mix2: 90
fast_mix: 90 fast_mix2: 89
fast_mix: 79 fast_mix2: 90

I had originally added the usleep() calls in my test infrastructure to
more accurately defeat the uop cache effects, since we are going to
be called from an interrupt handler, not in a loop. But anyway, this
is one of the reasons for the differences between your benchmarking
framework and mine, and why micro-benchmarking can be so hard to get
right. :-)

(BTW, this is your original mixer; I haven't tried playing with your
modified Skein-like core round yet.)

- Ted

2014-06-12 20:46:39

by Theodore Ts'o

[permalink] [raw]
Subject: Re: random: Benchamrking fast_mix2

So I just tried your modified 32-bit mixing function where you moved the
rotation to the middle step instead of the last step. With the
usleep(), it doesn't make any difference:

# schedtool -R -p 1 -e /tmp/fast_mix2_48
fast_mix: 212 fast_mix2: 400 fast_mix3: 400
fast_mix: 208 fast_mix2: 408 fast_mix3: 388
fast_mix: 208 fast_mix2: 396 fast_mix3: 404
fast_mix: 224 fast_mix2: 408 fast_mix3: 392
fast_mix: 200 fast_mix2: 404 fast_mix3: 404
fast_mix: 208 fast_mix2: 412 fast_mix3: 396
fast_mix: 208 fast_mix2: 392 fast_mix3: 392
fast_mix: 212 fast_mix2: 408 fast_mix3: 388
fast_mix: 200 fast_mix2: 716 fast_mix3: 773
fast_mix: 426 fast_mix2: 717 fast_mix3: 728

without the usleep() I get:

692# schedtool -R -p 1 -e /tmp/fast_mix2_48
fast_mix: 104 fast_mix2: 224 fast_mix3: 176
fast_mix: 56 fast_mix2: 112 fast_mix3: 56
fast_mix: 56 fast_mix2: 64 fast_mix3: 64
fast_mix: 64 fast_mix2: 64 fast_mix3: 48
fast_mix: 56 fast_mix2: 64 fast_mix3: 56
fast_mix: 56 fast_mix2: 64 fast_mix3: 64
fast_mix: 56 fast_mix2: 64 fast_mix3: 64
fast_mix: 56 fast_mix2: 72 fast_mix3: 56
fast_mix: 56 fast_mix2: 64 fast_mix3: 56
fast_mix: 64 fast_mix2: 64 fast_mix3: 56

I'm beginning to suspect that some of the differences between your
measurements and mine might be that, in addition to my having a smaller
cache (8M instead of 12M), there are some other caches, perhaps the uop
cache, which are also smaller on the mobile processor, and that explains
why you are seeing some different results.

>
> Of course, using wider words works fantastically.
> These constants give 76 bits of avalanche after 2 rounds,
> essentially full after 3....

And here is my testing using your 64-bit variant:

# schedtool -R -p 1 -e /tmp/fast_mix2_49
fast_mix: 294 fast_mix2: 476 fast_mix4: 442
fast_mix: 286 fast_mix2: 1058 fast_mix4: 448
fast_mix: 958 fast_mix2: 460 fast_mix4: 1002
fast_mix: 940 fast_mix2: 1176 fast_mix4: 826
fast_mix: 476 fast_mix2: 840 fast_mix4: 826
fast_mix: 462 fast_mix2: 840 fast_mix4: 826
fast_mix: 462 fast_mix2: 826 fast_mix4: 826
fast_mix: 462 fast_mix2: 826 fast_mix4: 826
fast_mix: 462 fast_mix2: 826 fast_mix4: 826
fast_mix: 462 fast_mix2: 840 fast_mix4: 826

... and without usleep()

690# schedtool -R -p 1 -e /tmp/fast_mix2_48
fast_mix: 52 fast_mix2: 116 fast_mix4: 96
fast_mix: 32 fast_mix2: 32 fast_mix4: 24
fast_mix: 28 fast_mix2: 36 fast_mix4: 24
fast_mix: 32 fast_mix2: 32 fast_mix4: 24
fast_mix: 32 fast_mix2: 36 fast_mix4: 24
fast_mix: 36 fast_mix2: 32 fast_mix4: 24
fast_mix: 32 fast_mix2: 36 fast_mix4: 28
fast_mix: 28 fast_mix2: 28 fast_mix4: 24
fast_mix: 32 fast_mix2: 36 fast_mix4: 28
fast_mix: 32 fast_mix2: 32 fast_mix4: 24

The bottom line is that what we are primarily measuring here is all
different cache effects. And these are going to be quite different on
different microarchitectures.

That being said, I wouldn't be at all surprised if there are some
CPU's where the extra memory dereference to the twist_table[] would
definitely hurt, since Intel's amazing cache architecture(tm) is no
doubt covering a lot of sins. I wouldn't be at all surprised if some
of these new mixing functions would fare much better if we tried
benchmarking them on an 32-bit ARM processor, for example....

- Ted

2014-06-13 00:23:07

by George Spelvin

[permalink] [raw]
Subject: Re: random: Benchamrking fast_mix2

> So I just tried your modified 32-bit mixing function where you moved the
> rotation to the middle step instead of the last step. With the
> usleep(), it doesn't make any difference:
>
> # schedtool -R -p 1 -e /tmp/fast_mix2_48
> fast_mix: 212 fast_mix2: 400 fast_mix3: 400
> fast_mix: 208 fast_mix2: 408 fast_mix3: 388
> fast_mix: 208 fast_mix2: 396 fast_mix3: 404
> fast_mix: 224 fast_mix2: 408 fast_mix3: 392
> fast_mix: 200 fast_mix2: 404 fast_mix3: 404
> fast_mix: 208 fast_mix2: 412 fast_mix3: 396
> fast_mix: 208 fast_mix2: 392 fast_mix3: 392
> fast_mix: 212 fast_mix2: 408 fast_mix3: 388
> fast_mix: 200 fast_mix2: 716 fast_mix3: 773
> fast_mix: 426 fast_mix2: 717 fast_mix3: 728

> And here is my testing using your 64-bit variant:
>
> # schedtool -R -p 1 -e /tmp/fast_mix2_49
> fast_mix: 294 fast_mix2: 476 fast_mix4: 442
> fast_mix: 286 fast_mix2: 1058 fast_mix4: 448
> fast_mix: 958 fast_mix2: 460 fast_mix4: 1002
> fast_mix: 940 fast_mix2: 1176 fast_mix4: 826
> fast_mix: 476 fast_mix2: 840 fast_mix4: 826
> fast_mix: 462 fast_mix2: 840 fast_mix4: 826
> fast_mix: 462 fast_mix2: 826 fast_mix4: 826
> fast_mix: 462 fast_mix2: 826 fast_mix4: 826
> fast_mix: 462 fast_mix2: 826 fast_mix4: 826
> fast_mix: 462 fast_mix2: 840 fast_mix4: 826

> The bottom line is that what we are primarily measuring here is all
> different cache effects. And these are going to be quite different on
> different microarchitectures.

So adding fast_mix4 doubled the time taken by fast_mix.
Yeah, that's trustworthy timing! :-)

Still, you do seem to observe a pretty consistent factor of about 2x
difference, which confuses me because I can't reproduce it.

But it's hard to reach definite conclusions with this much measurement noise.

Another cache we might be hitting is the branch predictor. Could you try
unrolling fast_mix2 and fast_mix4 and see what difference that makes?
(I'd send you a patch but you could probably do it by hand faster than
applying one.)
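
(To be concrete, by "unrolling" I just mean duplicating the round body by
hand. A sketch for the 2-round 64-bit core from my earlier mail, assuming
a rol64() analogous to rol32(); the name fast_mix2_unrolled is just for
illustration:)

static void fast_mix2_unrolled(struct fast_pool *f, __u32 const input[4])
{
	uint64_t a = ((uint64_t *)f->pool)[0] ^ ((uint64_t const *)input)[0];
	uint64_t b = ((uint64_t *)f->pool)[1] ^ ((uint64_t const *)input)[1];

	/* round 1 */
	a += b; b = rol64(b, 52);
	b ^= a; a = rol64(a, 10);
	a += b; b = rol64(b, 47);
	b ^= a; a = rol64(a, 17);
	/* round 2 */
	a += b; b = rol64(b, 52);
	b ^= a; a = rol64(a, 10);
	a += b; b = rol64(b, 47);
	b ^= a; a = rol64(a, 17);

	((uint64_t *)f->pool)[0] = a;
	((uint64_t *)f->pool)[1] = b;
	f->count++;
}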

It only makes a slight difference on my high-end Intel box, but almost
doubles the speed on the Phenom:

Rolled (64-bit core, 2 rounds):
fast_mix: 293 fast_mix2: 205
fast_mix: 257 fast_mix2: 162
fast_mix: 170 fast_mix2: 137
fast_mix: 283 fast_mix2: 218
fast_mix: 270 fast_mix2: 185
fast_mix: 288 fast_mix2: 199
fast_mix: 423 fast_mix2: 131
fast_mix: 286 fast_mix2: 218
fast_mix: 681 fast_mix2: 165
fast_mix: 268 fast_mix2: 190

Unrolled (64-bit core, 2 rounds):
fast_mix: 394 fast_mix2: 108
fast_mix: 145 fast_mix2: 80
fast_mix: 270 fast_mix2: 112
fast_mix: 145 fast_mix2: 81
fast_mix: 145 fast_mix2: 79
fast_mix: 662 fast_mix2: 107
fast_mix: 145 fast_mix2: 78
fast_mix: 140 fast_mix2: 127
fast_mix: 164 fast_mix2: 182
fast_mix: 205 fast_mix2: 79

Since the original fast_mix is unrolled, a penalty there wouldn't
hit it.

> That being said, I wouldn't be at all surprised if there are some
> CPU's where the extra memory dereference to the twist_table[] would
> definitely hurt, since Intel's amazing cache architecture(tm) is no
> doubt covering a lot of sins. I wouldn't be at all surprised if some
> of these new mixing functions would fare much better if we tried
> benchmarking them on an 32-bit ARM processor, for example....

Yes, Intel's D-caches are quite impressive.

2014-06-13 15:52:49

by Theodore Ts'o

[permalink] [raw]
Subject: Re: random: Benchamrking fast_mix2

On Thu, Jun 12, 2014 at 08:23:04PM -0400, George Spelvin wrote:
> Another cache we might be hitting is the branch predictor. Could you try
> unrolling fast_mix2 and fast_mix4 and see what difference that makes?
> (I'd send you a patch but you could probably do it by hand faster than
> appying one.)

Unrolling doesn't make much difference, which isn't surprising given
that almost all of the differences go away when I commented out the
usleep(). Basically, at this point what we're primarily measuring is
how well various CPUs' caches work, especially across context switches
where other code gets to run in between.

If that's the case, all else being equal, removing the extra memory
reference for twist_table[] does make sense, and something else I've
considered doing is to remove the input[] array entirely, and have
add_interrupt_randomness() xor values directly into the pool, and then
let that fast_mix function stir the pool.

It's harder to benchmark this, but at this point, I think we know
enough to be confident that this will be a win on at least some
platforms, and so long as it's not a massive loss from what we had
before, I'll be fine with it.

I also think that it's going to be worthwhile to do the RDTSC
measurement in vitro, and calculate average and max latencies, since
it's clear that there are real limitations to userspace benchmarking.

Cheers,

- Ted

2014-06-14 02:10:18

by George Spelvin

[permalink] [raw]
Subject: Re: random: Benchamrking fast_mix2

> Unrolling doesn't make much difference, which isn't surprising given
> that almost all of the differences go away when I commented out the
> usleep(). Basically, at this point what we're primarily measuring is
> how well various CPUs' caches work, especially across context switches
> where other code gets to run in between.

Huh. As I was referring to when I talked about the branch
predictor, I was hoping that removing *conditional* branches would
help.

> If that's the case, all else being equal, removing the extra memory
> reference for twist_table[] does make sense, and something else I've
> considered doing is to remove the input[] array entirely, and have
> add_interrupt_randomness() xor values directly into the pool, and then
> let that fast_mix function stir the pool.

Are you trying for an XOR to memory, or is the idea to remain in registers
for the entire operation?

I'm not sure an XOR to memory is that much better; it's 2 pool loads
and 1 pool store either way. Currently, the store is first (to
input[]) and then both it and the fast_pool are fetched in fast_mix.

With an XOR to memory, it's load-store-load, but is that really better?

In case it's useful, below is a small patch I made to
add_interrupt_randomness to take advantage of 64-bit processors and make
it a bit clearer what it's doing. Not submitted officially because:
1) I haven't examined the consequences on 32-bit processors carefully yet.
2) It's more of a "code cleanup", meaning personal style preference,
and you've expressed some pretty strong unhappiness with such churn.

It's also useful preparation for changing to a native 64-bit fast_mix.

In general, I dislike "pre-compressing" the input; if the input hash isn't
fast enough, fix that for all users rather than adding something ad-hoc.

If you want a fast_mix function with different input and state widths,
I can try to come up with one. (Given the iterated nature of the current
fast_mix2, you can also just add additional seed material between the
rounds.)

> I also think that it's going to be worthwhile to do the RDTSC
> measurement in vitro, and calculate average and max latencies, since
> it's clear that there are real limitations to userspace benchmarking.

I'm not sure you're not making a clever joke about the use of silicon
dioxide in real chips, but don't you mean "in vivo"?

(Also, if we're reading the TSC twice, and we know the delta is noisy
as heck, seed with it!)



commit d3c0a185991a45e420925d040f19e764808b354e
Author: George Spelvin <[email protected]>
Date: Sat Jun 7 21:16:45 2014 -0400

random: Simplify add_interrupt_randomness using 64-bit math

The same values (except word-swapped on big-endian machines) are passed
to fast_mix, but the code is simpler, smaller, and uses 64-bit operations
if available.

Signed-off-by: George Spelvin <[email protected]>

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 868760e1..acc9bb1a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -563,7 +563,7 @@ struct fast_pool {
* collector. It's hardcoded for an 128 bit pool and assumes that any
* locks that might be needed are taken by the caller.
*/
-static void fast_mix(struct fast_pool *f, __u32 input[4])
+static void fast_mix(struct fast_pool *f, __u32 const input[4])
{
__u32 w;
unsigned input_rotate = f->rotate;
@@ -839,22 +839,16 @@ void add_interrupt_randomness(int irq, int irq_flags)
struct entropy_store *r;
struct fast_pool *fast_pool = &__get_cpu_var(irq_randomness);
struct pt_regs *regs = get_irq_regs();
- unsigned long now = jiffies;
- cycles_t cycles = random_get_entropy();
- __u32 input[4], c_high, j_high;
- __u64 ip;
unsigned long seed;
int credit;
+ unsigned long now = jiffies;
+ cycles_t cycles = random_get_entropy();
+ __u64 const input[2] = {
+ cycles ^ irq ^ rol64(now, 32),
+ regs ? instruction_pointer(regs) : _RET_IP_
+ };

- c_high = (sizeof(cycles) > 4) ? cycles >> 32 : 0;
- j_high = (sizeof(now) > 4) ? now >> 32 : 0;
- input[0] = cycles ^ j_high ^ irq;
- input[1] = now ^ c_high;
- ip = regs ? instruction_pointer(regs) : _RET_IP_;
- input[2] = ip;
- input[3] = ip >> 32;
-
- fast_mix(fast_pool, input);
+ fast_mix(fast_pool, (__u32 const *)input);

if ((fast_pool->count & 63) && !time_after(now, fast_pool->last + HZ))
return;

2014-06-14 03:06:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: random: Benchamrking fast_mix2

On Fri, Jun 13, 2014 at 10:10:14PM -0400, George Spelvin wrote:
> > Unrolling doesn't make much difference, which isn't surprising given
> > that almost all of the differences go away when I commented out the
> > usleep(). Basically, at this point what we're primarily measuring is
> > how well various CPUs' caches work, especially across context switches
> > where other code gets to run in between.
>
> Huh. As I was referring to when I talked about the branch
> predictor, I was hoping the removing *conditional* branches would
> help.

At least for Intel, between its branch predictor and speculative
execution engine, it doesn't make a difference.

> Are you trying for an XOR to memory, or is the idea to remain in
> registers for the entire operation?
>
> I'm not sure an XOR to memory is that much better; it's 2 pool loads
> and 1 pool store either way. Currently, the store is first (to
> input[]) and then both it and the fast_pool are fetched in fast_mix.
>
> With an XOR to memory, it's load-store-load, but is that really better?

The second load can be optimized away. If the compiler isn't smart
enough, the store means that the data is almost certainly still in the
D-cache. But with a smart compiler (and gcc should be smart enough),
if fast_mix is a static function, gcc will inline fast_mix, and then
it should be able to optimize out the load. In fact, it might be
smart enough to optimize out the first store, since it should be able
to realize that the first store to the pool[] array will get overwritten
by the final store to the pool[] array.

So hopefully it will remain in registers for the entire operation,
and the compiler will be smart enough to make the right
thing happen without the code having to be really ugly.
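
To be concrete about the two shapes under discussion, the "XOR to
memory" form is roughly the following (the name and body are just an
illustration, not a proposal):

static void fast_mix_xor_to_memory(struct fast_pool *f, __u32 const input[4])
{
	int i;

	/* Fold the input straight into the pool in memory ... */
	for (i = 0; i < 4; i++)
		f->pool[i] ^= input[i];

	/*
	 * ... and a register-based mix would then reload f->pool here.
	 * With fast_mix inlined, the reload (and often the store above)
	 * can be optimized away, so everything stays in registers.
	 */
}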

> In case it's useful, below is a small patch I made to
> add_interrupt_randomness to take advantage of 64-bit processors and make
> it a bit clearer what it's doing. Not submitted officially because:
> 1) I haven't examined the consequences on 32-bit processors carefully yet.

When I did a quick comparison of your 64-bit fast_mix2 variant, it's
much slower than either the 32-bit fast_mix2, or the original fast_mix
algorithm. So given that 32-bit processors tend to be slower, I'm
pretty sure if we want to add a 64-bit optimization, we'll have to
conditionalize it on BITS_PER_LONG == 64 and include both the original
code and the 64-bit optimized code.

- Ted

2014-06-14 04:55:26

by George Spelvin

[permalink] [raw]
Subject: [RFC] random: is the IRQF_TIMER test working as intended?

I'm trying to understand the entropy credit computation in
add_interrupt_randomness. A few things confuse me, and I'm
wondering if it's intended to be that way.

1) Since the number of samples between spills to the input pool is
variable (with > 64 samples now possible due to the trylock), wouldn't
it make more sense to accumulate an entropy estimate?
2) Why only deny entropy credit for back-to-back timer interrupts?
If both t2 - x and x - t1 are worth credit, why not t2 - t1?
It seems a lot better (not to mention simpler) to not credit any
timer interrupt, so x - t1 will get credit but not t2 - x.
3) Why only consider the status of the interrupts when spills occur?
This is the most confusing. The whole __IRQF_TIMER and last_timer_intr
logic simply skips over the intermediate samples, so it actually
detects timer interrupts 64 interrupts (or 1 second) apart.
Shouldn't that sort of thing actually be looking at *consecutive*
calls to add_interrupt_randomness?
4) If the above logic denies credit, why deny credit for
arch_get_random_seed_long as well?

For discussion, here's an example of a change that fixes all of the
above, in patch form. (The credit_entropy_frac function is omitted but
hopefully obvious; a rough sketch follows below.)

The amount of entropy credit particularly needs thought. I'm currently
using 1/8 of a bit per sample to keep the patch as simple as possible.
This is 8x the current credit if interrupts are frequent, but less if they
occur at less than 8 Hz. That actually seems on the conservative side
of reasonable to me (1/8 of a bit is odds of 1 in 58.3817), particularly
if there's a cycle timer.
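
For reference, credit_entropy_frac could be sketched roughly as below
(illustrative only; a real version would keep the cmpxchg loop, the
nonlinear pool accounting and the wakeup logic of credit_entropy_bits):

static void credit_entropy_frac(struct entropy_store *r, int nfrac)
{
	int entropy_count = ACCESS_ONCE(r->entropy_count);

	entropy_count += nfrac;
	if (entropy_count < 0)
		entropy_count = 0;
	else if (entropy_count > r->poolinfo->poolfracbits)
		entropy_count = r->poolinfo->poolfracbits;
	r->entropy_count = entropy_count;
}

(The point is just that the argument is already in 1/8-bit units, so it
isn't shifted again.)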


diff --git a/drivers/char/random.c b/drivers/char/random.c
index 03c385f5..c877cb65 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -548,9 +548,8 @@ static void mix_pool_bytes(struct entropy_store *r, const void *in,
struct fast_pool {
__u32 pool[4];
unsigned long last;
- unsigned short count;
+ unsigned short entropy; /* Entropy, in fractional bits */
unsigned char rotate;
- unsigned char last_timer_intr;
};

/*
@@ -577,7 +576,6 @@ static void fast_mix(struct fast_pool *f, __u32 input[4])
input_rotate = (input_rotate + 7) & 31;

f->rotate = input_rotate;
- f->count++;
}

/*
@@ -851,15 +849,33 @@ void add_interrupt_randomness(int irq, int irq_flags)

fast_mix(fast_pool, input);

- if ((fast_pool->count & 63) && !time_after(now, fast_pool->last + HZ))
+ /*
+ * If we don't have a valid cycle counter, don't give credit for
+ * timer interrupts. Otherwise, credit 1/8 bit per interrupt.
+ * (Should there be a difference if there's a cycle counter?)
+ */
+ if (cycles || (irq_flags & IRQF_TIMER) == 0)
+ credit = 1; /* 1/8 bit */
+ else
+ credit = 0;
+
+ credit += fast_pool->entropy;
+
+ if (credit < 8 << ENTROPY_SHIFT &&
+ !time_after(now, fast_pool->last + HZ)) {
+ fast_pool->entropy = credit;
return;
+ }
+
+ credit = min_t(int, credit, 32 << ENTROPY_SHIFT);

r = nonblocking_pool.initialized ? &input_pool : &nonblocking_pool;
if (!spin_trylock(&r->lock)) {
- fast_pool->count--;
+ fast_pool->entropy = credit;
return;
}
fast_pool->last = now;
+ fast_pool->entropy = 0;
__mix_pool_bytes(r, &fast_pool->pool, sizeof(fast_pool->pool));

/*
@@ -867,28 +883,13 @@ void add_interrupt_randomness(int irq, int irq_flags)
* add it to the pool. For the sake of paranoia count it as
* 50% entropic.
*/
- credit = 1;
if (arch_get_random_seed_long(&seed)) {
__mix_pool_bytes(r, &seed, sizeof(seed));
- credit += sizeof(seed) * 4;
+ credit += sizeof(seed) * 4 << ENTROPY_SHIFT;
}
spin_unlock(&r->lock);

- /*
- * If we don't have a valid cycle counter, and we see
- * back-to-back timer interrupts, then skip giving credit for
- * any entropy, otherwise credit 1 bit.
- */
- if (cycles == 0) {
- if (irq_flags & __IRQF_TIMER) {
- if (fast_pool->last_timer_intr)
- credit = 0;
- fast_pool->last_timer_intr = 1;
- } else
- fast_pool->last_timer_intr = 0;
- }
-
- credit_entropy_bits(r, credit);
+ credit_entropy_frac(r, credit);
}

#ifdef CONFIG_BLOCK

2014-06-14 05:25:29

by George Spelvin

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

> At least for Intel, between its branch predictor and speculative
> execution engine, it doesn't make a difference.

*Sigh*. We need live measurement. My testing (in your test
harness!) showed a noticeable (~10%) speedup.

> When I did a quick comparison of your 64-bit fast_mix2 variant, it's
> much slower than either the 32-bit fast_mix2, or the original fast_mix
> algorithm.

That is f***ing *bizarre*. For me, it's *significantly* faster.
You *are* compiling -m64, right? Because I agree with you it'd
be stupid to try to use it on 32-bit machines.

Forcing max-speed CPU:
# ./perftest ./ted64
fast_mix: 419 fast_mix2: 419 fast_mix4: 318
fast_mix: 386 fast_mix2: 419 fast_mix4: 112
fast_mix: 419 fast_mix2: 510 fast_mix4: 328
fast_mix: 420 fast_mix2: 510 fast_mix4: 306
fast_mix: 420 fast_mix2: 510 fast_mix4: 317
fast_mix: 419 fast_mix2: 510 fast_mix4: 318
fast_mix: 362 fast_mix2: 510 fast_mix4: 317
fast_mix: 420 fast_mix2: 510 fast_mix4: 306
fast_mix: 419 fast_mix2: 499 fast_mix4: 318
fast_mix: 420 fast_mix2: 510 fast_mix4: 328

And not:
$ ./ted64
fast_mix: 328 fast_mix2: 430 fast_mix4: 272
fast_mix: 442 fast_mix2: 442 fast_mix4: 272
fast_mix: 442 fast_mix2: 430 fast_mix4: 272
fast_mix: 329 fast_mix2: 442 fast_mix4: 272
fast_mix: 329 fast_mix2: 430 fast_mix4: 272
fast_mix: 328 fast_mix2: 442 fast_mix4: 272
fast_mix: 329 fast_mix2: 431 fast_mix4: 272
fast_mix: 328 fast_mix2: 442 fast_mix4: 272
fast_mix: 328 fast_mix2: 431 fast_mix4: 272
fast_mix: 329 fast_mix2: 442 fast_mix4: 272

And on a Phenom:
$ /tmp/ted64
fast_mix: 250 fast_mix2: 174 fast_mix4: 109
fast_mix: 258 fast_mix2: 170 fast_mix4: 114
fast_mix: 371 fast_mix2: 285 fast_mix4: 109
fast_mix: 516 fast_mix2: 156 fast_mix4: 90
fast_mix: 140 fast_mix2: 184 fast_mix4: 170
fast_mix: 406 fast_mix2: 146 fast_mix4: 88
fast_mix: 185 fast_mix2: 114 fast_mix4: 94
fast_mix: 161 fast_mix2: 116 fast_mix4: 98
fast_mix: 152 fast_mix2: 104 fast_mix4: 94
fast_mix: 352 fast_mix2: 140 fast_mix4: 79

> So given that 32-bit processors tend to be slower, I'm pretty sure
> if we want to add a 64-bit optimization, we'll have to conditionalize
> it on BITS_PER_LONG == 64 and include both the original code and the
> 64-bit optimized code.

Sorry I neglected to say so earlier; that has *always* been my intention.
The 32-bit version is primary; the 64-bit version is a conditional
optimization.

If I can make it faster *and* have more avalanche (and less register
pressure, too), it seems worth the hassle of having two versions.

2014-06-14 06:25:09

by Theodore Ts'o

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

OK, so I normally do my testing using 32-bit kernels under KVM. On an
i386 kernel, running under kvm (again using a Lenovo T540 with an
i7-4900MQ CPU), with the original fast_mix, I get a timestamp count of
166 cycles using a weighted smoothed average. Using your fast_mix2 I
get around 450 cycles.

I'll try a 64-bit kernel build later this weekend.

Want to give this patch a try on your collection of machines?

- Ted

P.S. ADD_INTERRUPT_BENCH works only on x86 architectures because
random_get_entropy() maps to get_cycles(), which is RDTSC. It won't
work on many other architectures, which don't have a timestamp
counter.

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 4bb6e37..90a1c9f 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -267,6 +267,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/random.h>

+#define ADD_INTERRUPT_BENCH
+
/*
* Configuration information
*/
@@ -553,6 +555,7 @@ struct fast_pool {
unsigned char last_timer_intr;
};

+#ifdef ORIG_FAST_MIX
/*
* This is a fast mixing routine used by the interrupt randomness
* collector. It's hardcoded for an 128 bit pool and assumes that any
@@ -579,6 +582,34 @@ static void fast_mix(struct fast_pool *f, __u32 input[4])
f->rotate = input_rotate;
f->count++;
}
+#else
+extern void fast_mix(struct fast_pool *f, __u32 const input[4])
+{
+ __u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1];
+ __u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3];
+ int i;
+
+
+ for (i = 0; i < 3; i++) {
+ /*
+ * Inspired by ChaCha's QuarterRound, but
+ * modified for much greater parallelism.
+ * Surprisingly, rotating a and c seems to work
+ * better than b and d. And it runs faster.
+ */
+ a += b; c += d;
+ d ^= a; b ^= c;
+ a = rol32(a, 15); c = rol32(c, 21);
+
+ a += b; c += d;
+ d ^= a; b ^= c;
+ a = rol32(a, 3); c = rol32(c, 7);
+ }
+ f->pool[0] = a; f->pool[1] = b;
+ f->pool[2] = c; f->pool[3] = d;
+ f->count++;
+}
+#endif

/*
* Credit (or debit) the entropy store with n bits of entropy.
@@ -829,6 +860,27 @@ EXPORT_SYMBOL_GPL(add_input_randomness);

static DEFINE_PER_CPU(struct fast_pool, irq_randomness);

+#ifdef ADD_INTERRUPT_BENCH
+static unsigned long avg_cycles;
+
+#define FSHIFT 11 /* nr of bits of precision */
+#define FIXED_1 (1<<FSHIFT) /* 1.0 as fixed-point */
+#define EXP 2040
+
+static void add_interrupt_bench(cycles_t start)
+{
+ cycles_t duration = random_get_entropy() - start;
+
+ /* use a weighted moving average */
+ avg_cycles *= EXP;
+ avg_cycles += duration * (FIXED_1 - EXP);
+ avg_cycles += 1UL << (FSHIFT - 1);
+ avg_cycles >>= FSHIFT;
+}
+#else
+#define add_interrupt_bench(x)
+#endif
+
void add_interrupt_randomness(int irq, int irq_flags)
{
struct entropy_store *r;
@@ -850,6 +902,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
input[3] = ip >> 32;

fast_mix(fast_pool, input);
+ add_interrupt_bench(cycles);

if ((fast_pool->count & 63) && !time_after(now, fast_pool->last + HZ))
return;
@@ -1644,6 +1697,15 @@ struct ctl_table random_table[] = {
.mode = 0444,
.proc_handler = proc_do_uuid,
},
+#ifdef ADD_INTERRUPT_BENCH
+ {
+ .procname = "add_interrupt_avg_cycles",
+ .data = &avg_cycles,
+ .maxlen = sizeof(avg_cycles),
+ .mode = 0444,
+ .proc_handler = proc_doulongvec_minmax,
+ },
+#endif
{ }
};
#endif /* CONFIG_SYSCTL */

2014-06-14 06:27:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

On Sat, Jun 14, 2014 at 01:25:27AM -0400, George Spelvin wrote:
> > When I did a quick comparison of your 64-bit fast_mix2 variant, it's
> > much slower than either the 32-bit fast_mix2, or the original fast_mix
> algorithm.
>
> That is f***ing *bizarre*. For me, it's *significantly* faster.
> You *are* compiling -m64, right? Because I agree with you it'd
> be stupid to try to use it on 32-bit machines.

Sorry for not being clear. My results are similar to yours with -m64.
It was much slower when compiled with -m32. (Which may have been
obvious, but one of the things I've learned very early on with
benchmarking is you measure, and don't assume, whenever possible....)

- Ted

2014-06-14 06:43:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] random: is the IRQF_TIMER test working as intended?

On Sat, Jun 14, 2014 at 12:55:20AM -0400, George Spelvin wrote:
> I'm trying to understand the entropy credit computation in
> add_interrupt_randomness. A few things confuse me, and I'm
> wondering if it's intended to be that way.

In general, yes. It's intended this way. I'm trying to be extremely
conservative with my entropy measurements, and part of it is because
there is generally a huge amount of interrupts available, at least on
desktop systems, and I'd much rather be very conservative than not.

The only downside to being slow is that it takes longer to generate a
GPG key since that uses /dev/random, but I'm OK with that....

> 1) Since the number of samples between spills to the input pool is
> variable (with > 64 samples now possible due to the trylock), wouldn't
> it make more sense to accumulate an entropy estimate?

In general, we probably will only retry a few times, so it's not
worth it.

> 2) Why only deny entropy credit for back-to-back timer interrupts?
> If both both t2 - x and x - t1 are worth credit, why not for t2 - t1?
> It seems a lot better (not to mention simpler) to not credit any
> timer interrupt, so x - t1 will get credit but not t2 - x.
> 3) Why only consider the status of the interrupts when spills occur?
> This is the most confusing. The whole __IRQF_TIMER and last_timer_intr
> logic simply skips over the intermediate samples, so it actually
> detects timer interrupts 64 interrupt (or 1 second) apart.
> Shouldn't that sort of thing actually be looking at *consecutive*
> calls to add_interrupt_randomness?

What I'd probably do instead is to count the number of timer
interrupts, and if it's more than 50% timer interrupts, give 0 bits of
credit, else give 1 bit of credit each time we push from the fast pool
to the input pool. Yes, that's being super conservative.

Part of this is because on modern machines most of the oscillators are
driven off of a single clock, and because not all architectures have a
timestamp clock. We could probably be more aggressive here on x86
systems, but I wouldn't be comfortable being more aggressive on ARM
systems. And so to keep things simple, I've only given a single
credit per push.
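
Roughly like the sketch below (illustrative only; the timer_count field
and the helper don't exist in the current code):

struct fast_pool_sketch {
	unsigned short count;		/* total samples mixed in */
	unsigned short timer_count;	/* samples with __IRQF_TIMER set */
};

/* At push time: 1 bit of credit, but only if fewer than half of the
 * accumulated samples came from the timer. */
static int push_credit(const struct fast_pool_sketch *f)
{
	return (2 * f->timer_count > f->count) ? 0 : 1;
}

I.e. the decision is made once per push rather than once per sample.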

The other reason why I haven't been in a hurry to try to be more
aggressive about entropy credit is even with the current super
conservative estimates, on my T540 laptop, I get the "nonblocking pool
is initialized" message 2.8 seconds into the boot, which is before all
of my USB devices have been enumerated, and before the root file
system is mounted (4 seconds into the boot). Since this is well
before the SSH host keys get generated in the init scripts after the
first boot, I figured it's quite good enough. :-)

> 4) If the above logic denies credit, why deny credit for
> arch_get_random_seed_long as well?

Whoops, that's a bug, that was caused when I reordered the
arch_get_random_seed_long so it would be done while the spinlock was
still taken. Thanks for pointing that out. I'll get that fixed on
the random.git tree's dev branch.

- Ted

2014-06-14 07:23:14

by George Spelvin

[permalink] [raw]
Subject: Re: [RFC] random: is the IRQF_TIMER test working as intended?

> In general, yes. It's intended this way. I'm trying to be extremely
> conservative with my entropy measurements, and part of it is because
> there is generally a huge amount of interrupts available, at least on
> desktop systems, and I'd much rather be very conservative than not.

To be absolutely clear: being more aggressive is not the point.

Using 1/8 of a bit per sample was simply for convenience, to keep the
patch smaller. It can be easily adapted to be strictly more conservative.

Consider the changes that make it more conservative:
- Allow credit of less than 1 bit
- If we get interrupts very rarely, credit *less* entropy.
- Only allow credit for one side of a timer interrupt, not both. If t2-t1
is too predictable, then x-t1 has all of the entropy that's available.
t2-x provides no new information.

> What I'd probably do instead is to count the number of timer
> interrupts, and if it's more than 50% timer interrupts, give 0 bits of
> credit, else give 1 bit of credit each time we push from the fast pool
> to the input pool. Yes, that's being super conservative.

If we're down in the 0/1 range, I really like the idea of allowing
fractional credit. How about crediting 1/64 of a bit per non-timer
interrupt? Equivalent result, but more linear.

(Sorry if my digression about the sanity of 1/8 bit per sample confused
things. I was just trying to say "it's not totally crazy", not "you should
do this".)

>> 1) Since the number of samples between spills to the input pool is
>> variable (with > 64 samples now possible due to the trylock), wouldn't
>> it make more sense to accumulate an entropy estimate?

> In general, we probably will only retry a few times, so it's not
> worth it.

I'm not actually worrying about the "too many samples" case, but the
"too few". The worrisome case is when someone on an energy-saving quest
succeeds in tuning the kernel (or just this particular processor) so it
gets less than 1 interrupt per second. Every interrupt credits 1 bit
of entropy. Is *that* super-conservative?

I agree that longer delays have more jitter, so it's worth a little
bit more, but shouldn't we try to get a curve the same shape as reality
and *then* apply the safety factors? Surely the presence or absence of
intermediate samples makes *some* difference?

2014-06-14 08:03:36

by George Spelvin

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

> Want to give this patch a try on your collection of machines?

Not until I fix it to

+#if ADD_INTERRUPT_BENCH
+static unsigned long avg_cycles;
+
+#define AVG_SHIFT 8 /* Exponential average factor k=1/256 */
+#define FIXED_1_2 (1 << (AVG_SHIFT-1))
+
+static void add_interrupt_bench(cycles_t start)
+{
+ cycles_t duration = random_get_entropy() - start;
+
+ /* Use a weighted moving average */
+ avg_cycles += ((avg_cycles + FIXED_1_2) >> AVG_SHIFT) - duration;
+}
+#else
+#define add_interrupt_bench(x)
+#endif
+

... because the way you did it was just silly.

See net/ipv4/tcp_input.c:tcp_rtt_estimator().

That also shows you how to add a mean deviation estimator, too.
I'll go do that now...

2014-06-14 11:14:10

by George Spelvin

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

And I have of course embarrassed myself publicly by getting the sign
wrong. That's what I get for posting *before* booting the result.

You may now point and bray like a donkey. :-)


Anyway, the following actually works:

#if ADD_INTERRUPT_BENCH
static unsigned long avg_cycles, avg_deviation;

#define AVG_SHIFT 8 /* Exponential average factor k=1/256 */
#define FIXED_1_2 (1 << (AVG_SHIFT-1))

static void add_interrupt_bench(cycles_t start)
{
long delta = random_get_entropy() - start;

/* Use a weighted moving average */
delta = delta - ((avg_cycles + FIXED_1_2) >> AVG_SHIFT);
avg_cycles += delta;
/* And average deviation */
delta = abs(delta) - ((avg_deviation + FIXED_1_2) >> AVG_SHIFT);
avg_deviation += delta;
}
#else
#define add_interrupt_bench(x)
#endif

And here are some measurements (uncorrected for *256 scaling) on my
primary (Ivy Bridge E) test machine. I've included 10 samples of
each value, taken at 10s intervals. avg_cycles is first, followed
by avg_deviation. The three conditions are idle (1.2 GHz), idle with
performance governor enabled (3.9 GHz), and during a "make -j7" in the
kernel tree (also all processors at maximum).

Rather against my intuition, a busy system greatly *reduces* the
time spent. Just to see what interrupt rate did, on the last kernel
I also tested it while being ping flooded.

They're sorted in increasing order of speed. Unrolling definitely
makes a difference, but it's not faster than the old code until I drop
to 2 iterations in the inner loop (which would be called 4 rounds by
most people). The 64-bit mix is noticeably faster yet.

Idle performance make -j7

ORIG_FAST_MIX=0
74761 22228 78799 20305 46527 24966
71984 23619 78466 20599 50044 25202
71949 23760 77262 21363 48295 25460
72966 23859 76188 21921 47393 25130
73974 23543 76040 22135 42979 24341
74601 23407 75294 22602 50502 26715
75359 23169 71267 24990 45649 25338
75450 22855 71065 25022 48792 25731
76338 22711 71569 25016 48564 26040
76546 22567 71143 24972 48414 27882

ORIG_FAST_MIX=0, unrolled:
54830 20312 60343 21814 29577 16699
55510 20787 60655 22504 40344 24749
56994 21080 60691 22497 41095 27184
57674 21566 60261 22713 39578 26717
57560 22221 60690 22709 41361 26138
58220 22593 59978 22924 36334 24249
58646 22422 58391 23466 37125 25089
59485 21927 58000 23968 24091 11892
60444 21959 58633 24486 28816 15585
60637 22133 58576 24593 25125 13174

ORIG_FAST_MIX=1
50554 13117 54732 13010 24043 12804
51294 13623 53269 14066 35671 25957
51063 13682 52531 14214 34391 22749
49833 13612 51833 14272 24097 13802
49458 13624 49288 15046 31378 18110
50383 13936 48720 15540 25088 17320
51167 14210 49058 15637 26478 13247
51356 14157 48206 15787 30542 19717
51077 14155 48587 15742 27461 15865
52199 14361 48710 15933 27608 14826

ORIG_FAST_MIX=0, unrolled, 2 (double) rounds:
43011 10685 44846 10523 21469 10994
42568 10261 43968 10834 19694 8501
42012 10304 43657 10619 19174 8557
42166 10063 43064 10598 20221 9398
41496 10353 42125 10843 19034 6685
42176 10826 41547 10984 19462 8002
41671 10947 40756 11242 21654 12140
41691 10643 40309 11312 20526 9188
41091 10817 40135 11318 20159 9267
41151 10553 39877 11484 19653 8393

64-bit hash, 2 (double) rounds (which is excellent avalanche):
36117 11269 39285 11171 16953 5664 35107 14735
35391 11035 36564 11600 18143 7216 35322 14176
34728 11280 35278 12085 16815 6245 35479 14453
35552 11606 35627 11863 16876 5841 34717 14505
35553 11633 35145 11892 17825 6166 35241 14555
35468 11406 35773 11857 16834 5094 34814 14719
35301 11390 35357 11771 16750 4987 35248 14566
34841 10821 35785 11531 19170 8296 35627 14103
34818 10942 35045 11592 17004 6814 34948 14399
35113 11158 35469 11343 19344 7969 33859 14035

Idle performance make -j7 ping -f (from outside)

(Again, all numbers must be divided by 256 to get cycles. You
can probably divide by 1000 and multiply by 5 in your head, which
is a pretty good approximation.)

2014-06-14 15:13:19

by George Spelvin

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

Unfortunately, my test suite is not particularly lightweight.
Here are some figures (idle at full speed, and doing a parallel
kernel compile) for a 2.5 GHz Phenom.

Quite different numbers compared to the Ivy Bridge. When the processor is
busy, fast_mix takes longer, instead. And even the rolled up fast_mix2 is
(slightly) faster than the original fast_mix. The 64-bit code is basically
the same speed as the 2x unrolled 32-bit code.

Presented in the same strange order as the Ivy Bridge, even though it's
not fastest-to-slowest this time.


Rolled up, 3x:
30748 5432 49066 24126
30583 6526 50271 23788
28621 4541 52966 25610
28017 3425 54181 25733
30425 5213 54441 25878
28556 3895 48969 26499
27853 3057 50813 25431
27471 2396 54065 27685
30328 5053 55931 27500
28702 3821 50811 24877

Unrolled, 3x:
26785 4278 40386 12695
25617 3302 41349 16593
25154 2357 41166 15904
24816 1886 40571 17449
27537 4927 41716 15723
28128 6657 42085 18098
25640 3410 41453 16486
25087 2199 39964 12193
27750 5384 38960 13672
26584 4699 38280 13697

Original:
35291 10362 82640 37663
32043 7848 81887 44853
32968 8994 80527 44847
31087 7207 84894 47535
29904 5602 81426 45401
34312 9824 54889 29060
36471 12948 65112 34232
31622 8539 63531 33871
31570 7745 57946 29070
30074 6116 62829 34393

Unrolled, 2x:
24491 2411 40335 16992
27644 6721 38981 16068
25473 4347 35260 10750
24784 3372 34211 10360
24326 2689 34178 11200
26404 4896 34498 10483
24905 3517 33793 10409
24107 2150 35321 11330
23968 1903 33818 11301
26254 4671 32728 9547

64-bit:
26877 7949 52258 25946
24862 5940 55765 27634
24528 5032 56568 28142
24844 5223 55007 27671
27115 8018 55754 27873
25504 6504 54286 27471
24772 5673 52717 26609
24427 4889 51362 27109
27219 8276 51431 22619
25640 6718 52835 26329

2014-06-14 16:33:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

OK, using your averaging scheme, on a 32-bit KVM kernel, running at
idle, here are my quick results:

Original

48028 51419
46021 50065
44750 49231

Fastmix 2, 3 iterations

95956 58313
97295 57599
97242 56942

Fastmix 2, 2 iterations

68998 41496
68940 41471
68619 41576

Fastmix 2, 2 iterations, unrolled

48725 39281
48313 38424
47349 37847

So with two iterations we are at least no worse than before (and in fact
the deviation is less, which makes sense since we don't have the
twist_array memory accesses), and I can easily believe there will be
architectures/microarchitectures where it will be better.

I'll need to do a bit more looking to convince myself that 2
iterations is better from a mixing perspective, but this looks like it
might be a promising replacement for the 32-bit mixer.

- Ted

2014-06-15 00:23:37

by George Spelvin

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

> I'll need to do a bit more looking to convince myself that 2
> iterations is better from a mixing perspective, but this looks like it
> might be a promising replacement for the 32-bit mixer.

This was tested with a hacked-up copy of Bob Jenkins' testing code.

It tests a few random starting values, and finds the minimum number of
outputs bits affected by an input delta for:

- Each input bit or pair of bits
- The function run forward or in reverse
- XOR, subtraction and addition delta metrics. (5x2 = 10 in total)

The example I posted:

// (29/66353) score = 49/121/123: 6 27 16 14

a += b; c += d;
b = rol32(a, 6); d = rol32(c, 27);
d ^= a; b ^= c;

a += b; c += d;
b = rol32(a, 16); d = rol32(c, 14);
d ^= a; b ^= c;

has, after 2 rounds, a minimum avalanche of 49 bits, taken over all of
the variables just mentioned. The only thing maximized over is the
different starting values.

That seems adequate for something that's only being asked to preserve
1 bit of entropy.

(BTW, the way the deltas are defined, the maximum possible score is 124.)

2014-06-15 01:17:21

by Theodore Ts'o

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

On Sat, Jun 14, 2014 at 08:23:33PM -0400, George Spelvin wrote:
> The example I posted:
>
> // (29/66353) score = 49/121/123: 6 27 16 14
>
> a += b; c += d;
> b = rol32(a, 6); d = rol32(c, 27);
> d ^= a; b ^= c;
>
> a += b; c += d;
> b = rol32(a, 16); d = rol32(c, 14);
> d ^= a; b ^= c;
>
> has, after 2 rounds, a minimum avalanche of 49 bits, taken over all of
> the variables just mentioned. The only thing maximized over is the
> different starting values.

I'm seeing a minimum delta of 40 bits, actually. Which makes it
slightly better than your original fast_mix2 (which had a minimum
delta of 39) when using 1024 random samples using random(3) to
generate a starting pool and setting a single bit in each possible bit
position in the input array. So it's slightly better, and as I
mentioned, on my CPU, I'm really not seeing that much difference
between fast_mix2() and fast_mix3().

But I'm willing to go with this as being quite sufficient as a mixing
function.

- Ted

(Compile the following with -DANALYZE to see the analysis I did.)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef unsigned int __u32;
typedef unsigned long long __u64;

struct fast_pool {
__u32 pool[4];
unsigned long last;
unsigned short count;
unsigned char rotate;
unsigned char last_timer_intr;
};


/**
* rol32 - rotate a 32-bit value left
* @word: value to rotate
* @shift: bits to roll
*/
static inline __u32 rol32(__u32 word, unsigned int shift)
{
return (word << shift) | (word >> (32 - shift));
}

static inline __u64 rol64(__u64 word, unsigned int shift)
{
return (word << shift) | (word >> (64 - shift));
}

static __u32 const twist_table[8] = {
0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };

extern void fast_mix(struct fast_pool *f, __u32 input[4])
{
__u32 w;
unsigned input_rotate = f->rotate;

w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3];
f->pool[0] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 14) & 31;
w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0];
f->pool[1] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;
w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1];
f->pool[2] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;
w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2];
f->pool[3] = (w >> 3) ^ twist_table[w & 7];
input_rotate = (input_rotate + 7) & 31;

f->rotate = input_rotate;
f->count++;
}

extern void fast_mix2(struct fast_pool *f, __u32 const input[4])
{
__u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1];
__u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3];
int i;

for (i = 0; i < 2; i++) {
/*
* Inspired by ChaCha's QuarterRound, but
* modified for much greater parallelism.
*/
a += b; c += d;
d ^= a; b ^= c;
a = rol32(a, 15); c = rol32(c, 21);

a += b; c += d;
d ^= a; b ^= c;
a = rol32(a, 3); c = rol32(c, 7);
}
f->pool[0] = a; f->pool[1] = b;
f->pool[2] = c; f->pool[3] = d;
f->count++;
}

extern void fast_mix3(struct fast_pool *f, __u32 const input[4])
{
__u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1];
__u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3];
int i;


for (i = 0; i < 2; i++) {
a += b; c += d;
a = rol32(a, 6); c = rol32(c, 27);
d ^= a; b ^= c;

a += b; c += d;
a = rol32(a, 16); c = rol32(c, 14);
d ^= a; b ^= c;
}
f->pool[0] = a; f->pool[1] = b;
f->pool[2] = c; f->pool[3] = d;
f->count++;
}

extern void fast_mix4(struct fast_pool *f, __u32 const input[4])
{
__u64 a = ((__u64 *)f->pool)[0] ^ ((__u64 const *)input)[0];
__u64 b = ((__u64 *)f->pool)[1] ^ ((__u64 const *)input)[1];
int i;

for (i = 0; i < 2; i++) {
a += b; b = rol64(b, 52);
b ^= a; a = rol64(a, 10);
a += b; b = rol64(b, 47);
b ^= a; a = rol64(a, 17);
}

((__u64 *)f->pool)[0] = a;
((__u64 *)f->pool)[1] = b;
f->count++;
}

static void rotate(__u32 a[4])
{
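/* Rotate the 128-bit value in a[0..3] (a[0] least significant) left by
 * one bit, wrapping the carry around; used to step the test delta to
 * the next bit position. */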
int i;
int carry = 0;
__u32 tmp;

for (i=0; i < 4; i++) {
tmp = a[i];
a[i] = (tmp << 1) + carry;
carry = (tmp & 0x80000000) ? 1 : 0;
}
if (carry)
a[0]++;
}

int global_min = 9999;

void analyze(void)
{
struct fast_pool f;
int i, pc;
int sum = 0, max = 0, min=9999;
__u32 input[4];
__u32 start[4];

start[0] = random();
start[1] = random();
start[2] = random();
start[3] = random();
memset(&f, 0, sizeof(f));
memset(&input, 0, sizeof(input));
input[0] = 1;

for (i=0; i < 32; i++) {
memcpy(f.pool, start, sizeof(start));
fast_mix3(&f, input);
pc = (__builtin_popcount(f.pool[0] ^ start[0]) +
__builtin_popcount(f.pool[1] ^ start[1]) +
__builtin_popcount(f.pool[2] ^ start[2]) +
__builtin_popcount(f.pool[3] ^ start[3]));
sum += pc;
if (pc > max)
max = pc;
if (pc < min)
min = pc;
if (pc < global_min)
global_min = pc;
rotate(input);
// printf("%d ", pc);
}
// printf("\n");
// printf("average popcount: %d, max: %d min %d\n", sum / 128, max, min);
}

static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
}

int main(int argc, char **argv)
{
struct fast_pool f;
int i;
__u32 input[4];
unsigned volatile long long start_time, end_time;

#ifdef ANALYZE
for (i=0; i < 1024; i++)
analyze();
printf("Global minimum: %d\n", global_min);
return 0;
#endif

#if !defined(BENCH_FASTMIX) && !defined(BENCH_FASTMIX2)
for (i=0; i < 20; i++) {
usleep(50000);
start_time = rdtsc();
fast_mix2(&f, input);
end_time = rdtsc();
printf("fast_mix2: %llu\t", end_time - start_time);
#if 0
usleep(50000);
start_time = rdtsc();
fast_mix2(&f, input);
end_time = rdtsc();
printf("fast_mix2: %llu\t", end_time - start_time);
usleep(50000);
start_time = rdtsc();
fast_mix3(&f, input);
end_time = rdtsc();
printf("fast_mix3: %llu\t", end_time - start_time);
#endif
fputc('\n', stdout);
}

#endif

#ifdef BENCH_FASTMIX
for (i=0; i < 10240000; i++) {
fast_mix(&f, input);
}
#endif

#ifdef BENCH_FASTMIX2
for (i=0; i < 10240000; i++) {
fast_mix2(&f, input);
}
#endif

return 0;
}

2014-06-15 06:58:10

by George Spelvin

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

Actually, could you hold off merging that for a bit? Or if you really
want to, add a comment that the rotate constants are preliminary and
will be further optimized.

I posted that as WIP, just as evidence that something with better
avalanche could be as fast or faster, and I'd like to go back and do a
careful re-check of the analysis software, then do an exhaustive search
for good rotate constants (there are only 1M possibilities, after all).

I believe the basic structure is solid, and you appear to be persuaded
too, but this is an area littered with embarrassing failures by
cryptographic amateurs and I'd like to be *very* careful before attaching
my S-o-b to it.

Also, credit for inspiration and the analysis techniques: Bob
Jenkins. I actually started with his analysis software for SpookyHash
(http://www.burtleburtle.net/bob/hash/spooky.html) and then converted
it back to "block function" form.

Appended is the software I'm using. It began life as
http://www.burtleburtle.net/bob/c/screen.cpp but was translated into
plain C and heavily hacked up. It's not pretty, but would you either
look at it for mistakes or tell me to pretty it up so you can look at it?


The one thing I notice is that your analyze() function is measuring
something totally different than what I'm measuring.

Considering fast_mix as a 128->128 bit function (i.e. ignoring
the input XOR), you appear to be considering
popcount(input[] ^ output[]), and then *minimizing* over a
variety of inputs.

I'm doing basically the following:

min_score = 128
for (each 1- or 2-bit delta[]) {
avalanche[] = 0
for (5 random arrays input[]) {
output1[] = fast_mix3(input[])
output2[] = fast_mix3(input[] ^ delta[])
avalanche[] |= output1[] ^ output2[]
}
score = popcount(avalanche[])
min_score = min(score, min_score)
}

For each input delta, consider several random inputs and find
all of the output bits that that input delta can cause to toggle.

The addition carries make the operation non-linear and so the output
delta depends on the input state. I'm interested in all the bits that
can *possibly* be toggled.

Bob's code considers output deltas other than xor (minimizing over
all of theose), and both forward and reverse mixing functions.

One of the deltas considered is ~(output1[] ^ output2[]), the set of bits
sometimes *not* toggled by an input delta. If an input delta
*always* changes most of the bits, that's not as interesting either.
(Simple parity and replication can do that.)

More discussion of his analysis ideas here:
http://www.burtleburtle.net/bob/hash/avalanche.html
http://www.burtleburtle.net/bob/hash/evahash.html


The following rotate constants achieve 57 bits of avalanche after
2 iterations:
// (91/238891) score(4/5/6) = 57/87/116: 26 14 19 9
// (113/269149) score(4/5/6) = 57/84/115: 28 29 20 8
// (118/277648) score(4/5/6) = 57/89/121: 4 27 9 24
// (179/462783) score(4/5/6) = 57/93/117: 11 27 14 22

This one does 58 bits after 2 iterations, but later scores (for 2.5
and 3 iterations) are a bit low:
// (191/488173) score(4/5/6) = 58/78/113: 9 23 8 15

This one has high scores for 2, 2.5 and 3 iterations:
// (22/48296) score(4/5/6) = 56/98/122: 8 10 22 15

I might go with that one pending further work (like a proper
exhaustive search; the current code is random).


I'm also experimenting to see if swapping + for - helps, e.g.:
a ^= b; c ^= d;
b = Rotl(b, K1); d = Rotl(d, K2);
d += a; b -= c;
a ^= b; c ^= d;
b = Rotl(b, K3); d = Rotl(d, K4);
d += a; b -= c;

That has rotate sets like
// (13/484210) score(4/5/6) = 57/90/118: 6 25 16 21
// (14/490774) score(4/5/6) = 56/99/122: 25 9 19 7
// (23/804540) score(4/5/6) = 56/98/122: 17 20 10 4
// (35/1014495) score(4/5/6) = 57/94/119: 7 27 4 20
// (46/1254928) score(4/5/6) = 56/100/120: 20 18 5 10
// (47/1260723) score(4/5/6) = 56/97/120: 13 22 8 5
// (62/1627276) score(4/5/6) = 57/94/120: 6 5 10 20
// (74/1943366) score(4/5/6) = 58/92/121: 12 11 18 26
// (77/2097408) score(4/5/6) = 57/96/119: 19 18 5 11




(The following code was recently hacked up to operate in "half-iterations"
where not every rotate constant is used an equal number of times.
You'll still see residue of the old semantics in places.)

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#if 1
#define BITS 32
typedef uint32_t word_t;
#define Count(x) __builtin_popcount(x)

#else
#define BITS 64
typedef uint64_t word_t;
#define Count(x) __builtin_popcountll(x)

#endif

static word_t Rotl(word_t x, int k)
{
return (x << k) | (x >> (BITS-k));
}

static word_t Rotr(word_t x, int k)
{
return (x << (BITS-k)) | (x >> k);
}

static uint64_t Rotl64(uint64_t x, int k)
{
return (x << k) | (x >> (64 - k));
}


//
// try to find an adequate long-message mixing function for SpookyHash
//

struct Random {
uint64_t a, b, c, d;
};

static uint64_t RandValue(struct Random *r)
{
uint64_t e = r->a - Rotl64(r->b, 23);
r->a = r->b ^ Rotl64(r->c, 16);
r->b = r->c + Rotl64(r->d, 11);
r->c = r->d + e;
return r->d = e + r->a;
}

static void RandInit(struct Random *r, uint64_t seed)
{
int i;

r->a = 0xdeadbeef;
r->b = r->c = r->d = seed;
for (i = 0; i < 20; ++i)
(void)RandValue(r);
}

#define VARS 4
#define ROTS 4

struct Sieve {
int sh[ROTS];
struct Random r;
};

void
SieveInit(struct Sieve *s, int seed)
{
RandInit(&s->r, seed);
}

void
SievePrint(struct Sieve const *s, FILE *f)
{
int i;

for (i = 0; i < ROTS; i++)
fprintf(f, " %2d", s->sh[i]);
putc('\n', f);
}

// generate a new function at random
static void SieveGenerate(struct Sieve *s)
{
uint64_t v = RandValue(&s->r);
int i;

/* Fill in the shift amounts */
for (i = 0; i < ROTS; i++) {
s->sh[i] = v % BITS;
v /= BITS;
}
}

#if 1
/* Quick option */
#define DO_ROT(a, b, s) (a = Rotl(a, s))
#else
#define DO_ROT(a, b, s) (b = Rotl(b, s))
#endif

#if VARS == 4
// evaluate the function
static void
Fun(int const sh[VARS], word_t state[VARS], int iter)
{
int i;
word_t a = state[0];
word_t b = state[1];
word_t c = state[2];
word_t d = state[3];

for (i = 0; i < 2*iter; i += 2) {
#if 1
a ^= b; c ^= d;
b = Rotl(b, sh[i%ROTS]); d = Rotl(d, sh[(i+1)%ROTS]);
d += a; b += c;
#elif 1
a += b; c += d; d = Rotl(d, sh[i%ROTS]);
d ^= a; b ^= c; c = Rotl(c, sh[(i+1)%ROTS]);
#elif 1
a += b; b ^= c; c -= d; d = Rotl(d, sh[i%ROTS]);
c += d; d ^= a; a -= b; b = Rotl(b, sh[(i+1)%ROTS]);
#else
a += b; c += d; d = Rotl(d, sh[i%ROTS]);
d ^= a; b ^= c; c = Rotl(c, sh[(2*i+1)%ROTS1]);
#endif
}

state[0] = a;
state[1] = b;
state[2] = c;
state[3] = d;
}

static void
RFun(int const sh[VARS], word_t state[VARS], int iter)
{
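/* Inverse of Fun(): undo the rounds in reverse order, so avalanche can
 * also be measured through the reverse function. */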
int i;
word_t a = state[0];
word_t b = state[1];
word_t c = state[2];
word_t d = state[3];

while (iter) {
i = 2 * --iter;
#if 1
d -= a; b -= c;
b = Rotr(b, sh[i%ROTS]); d = Rotr(d, sh[(i+1)%ROTS]);
a ^= b; c ^= d;
#elif 1
c -= d; d ^= a; a += b; b = Rotr(b, sh[(i+1) % ROTS]);
a -= b; b ^= c; c += d; d = Rotr(d, sh[i % ROTS]);
#else
d ^= a; b ^= c; a = Rotr(a, sh[(i+1)%ROTS]);
a -= b; c -= d; b = Rotr(b, sh[i%ROTS]);
#endif
}

state[0] = a;
state[1] = b;
state[2] = c;
state[3] = d;
}
#elif VARS == 2
// evaluate the function
static void
Fun(int const sh[VARS], word_t state[VARS], int iter)
{
int i;
word_t a = state[0];
word_t b = state[1];

for (i = 0; i < iter; i++) {
a += b; b = Rotl(b, sh[0]);
b ^= a; a = Rotl(a, sh[1]);
#if ROTS > 2
a += b; b = Rotl(b, sh[2]);
b ^= a; a = Rotl(a, sh[3]);
#endif
}

state[0] = a;
state[1] = b;
}

static void
RFun(int const sh[VARS], word_t state[VARS], int iter)
{
int i;
word_t a = state[0];
word_t b = state[1];

for (i = 0; i < iter; i++) {
a = Rotr(a, sh[3]); b ^= a;
b = Rotr(b, sh[2]); a -= b;
#if ROTS > 2
a = Rotr(a, sh[1]); b ^= a;
b = Rotr(b, sh[0]); a -= b;
#endif
}
state[0] = a;
state[1] = b;
}
#endif

#define TRIALS 8
#define MEASURES 10

static int
OneTest(struct Sieve *s, bool forward, int limit, int iter)
{
int minVal = VARS*BITS;
int i, j, k, l;

for (i = 0; i < VARS*BITS; i++) {
for (j = i; j < VARS*BITS; j++) {
word_t total[MEASURES][VARS] = { { 0 } };

for (k = 0; k < TRIALS; k++) {
word_t a[VARS], b[VARS];

for (l = 0; l < VARS; l++)
b[l] = a[l] = RandValue(&s->r);
/* Alter the second one */
b[i/BITS] ^= (word_t)1 << (i % BITS);
if (i != j)
b[j/BITS] ^= (word_t)1 << (j % BITS);
/* Run the function */
if (forward) {
Fun(s->sh, a, iter);
Fun(s->sh, b, iter);
} else {
RFun(s->sh, a, iter);
RFun(s->sh, b, iter);
}
/* Now compute differences */
for (l = 0; l < VARS; l++) {
word_t t;

total[0][l] |= a[l]; total[5][l] |= ~a[l];
total[1][l] |= b[l]; total[6][l] |= ~b[l];

t = a[l] ^ b[l];
total[2][l] |= t; total[7][l] |= ~t;

t = a[l] - b[l];
t ^= t >> 1; /* Gray-code it */
total[3][l] |= t; total[8][l] |= ~t;

t = a[l] + b[l];
t ^= t >> 1; /* Gray-code it */
total[4][l] |= t; total[9][l] |= ~t;
}
}
/* Find minimum weight by all measures */
for (k = 0; k < MEASURES; k++) {
int counter = 0;
for (l = 0; l < VARS; l++)
counter += Count(total[k][l]);
if (counter < minVal) {
minVal = counter;
if (counter < limit)
return counter;
}
}
} /* j = second bit */
} /* i = first bit */

return minVal;
}

static int
TwoTest(struct Sieve *s, int limit, int iter)
{
int score1, score2;
word_t a[VARS], b[VARS];
int i;

/* Quick sanity test */
for (i = 0; i <= iter; i++) {
int j;
for (j = 0; j < VARS; j++)
b[j] = a[j] = RandValue(&s->r);
Fun(s->sh, a, i);
RFun(s->sh, a, i);
for (j = 0; j < VARS; j++)
assert(a[j] == b[j]);
RFun(s->sh, a, i);
Fun(s->sh, a, i);
for (j = 0; j < VARS; j++)
assert(a[j] == b[j]);
}

score1 = OneTest(s, true, limit, iter);
if (score1 < limit)
return score1;
score2 = OneTest(s, false, limit, iter);

return score1 < score2 ? score1 : score2;
}

static void
driver(int seed, FILE *f, unsigned n, int iter, int limit, int limit2)
{
unsigned i, found = 0;
struct Sieve s;
int best = 0;

SieveInit(&s, seed);

for (i = 0; found < n; i++) {
int score0, score, score2;

if (i % 1000 == 0) {
printf("%u\r", i);
fflush(stdout);
}

SieveGenerate(&s);

score0 = TwoTest(&s, limit, iter-1);
if (score0 < limit)
continue;
score = TwoTest(&s, limit2, iter);
if (score < limit2)
continue;
if (score > best)
best = score;

score2 = TwoTest(&s, 0, iter+1);
fprintf(f, "// (%u/%u) score(%d/%d/%d) = %d/%d/%d:", found,
i, iter-1, iter, iter+1, score0, score, score2);
SievePrint(&s, f);
found++;
}
printf("Best score: %d\n", best);
}

int
main(void)
{
// driver(21, stdout, 200, 6, 108);
// driver(21, stdout, 200, 5, 70);
// driver(21, stdout, 200, 2, 42);
driver(21, stdout, 200, 5, 45, 70);
// driver(21, stdout, 200, 4, 72, 90);
return 0;
}

2014-06-15 13:01:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: random: Benchmarking fast_mix2

On Sun, Jun 15, 2014 at 02:58:03AM -0400, George Spelvin wrote:
> Actually, could you hold off merging that for a bit? Or if you really
> want to, add a comment that the rotate constants are preliminary and
> will be further optimized.
>
> I posted that as WIP, just as evidence that something with better
> avalanche could be as fast or faster, and I'd like to go back and do a
> careful re-check of the analysis software, then do an exhaustive search
> for good rotate constants (there are only 1M possibilities, after all).
>
> I believe the basic structure is solid, and you appear to be persuaded
> too, but this is an area littered with embarrassing failures by
> cryptographic amateurs and I'd like to be *very* careful before attaching
> my S-o-b to it.

Keep in mind that we're using this *just* as a mixing function. It's
not like we have to worry about things like a chosen plaintext attack
--- the attacker can't look at the output of /dev/random, force time
to go backwards, and then force the CPU counter to be one bit
different, run time forwards, and then compare the new output of
/dev/random and try to do some kind of differential cryptanalysis, for
example. Or if they could, the ability to run time backwards could be
used in much more entertaining ways. (See also the recent movie,
"Edge of Tomorrow" :-)

And given my quick tests, I'm convinced that it's better than what we
had before, which is why I decided to commit the code. We can apply
further deltas to what we have, so long as each step forward is an
improvement.

That's not to say that more analysis wouldn't be a good thing, but I
have some other things I need to attend to; if you come up with better
rotational values, let me know, and when I do have more time, I can
come back to trying to do a bit more work on this particular front.

> The one thing I notice is that your analyze() function is measuring
> something totally different than what I'm measuring.
>
> Considering fast_mix as a 128->128 bit function (i.e. ignoring
> the input XOR), you appear to be considering
> popcount(input[] ^ output[]), and then *minimizing* over a
> variety of inputs.

Yes, part of this comes from my bias from the previous mixing
function. Since we were using a modified LFSR, it had the property
that if you repeatedly mixed with the input array containing all
zeroes, we would iterate over every possible state in the pool, with
equal probability. So the XOR of the input array simply jumped us to
a different point in the 2**N cycle of possible states, such that even
if there was a bias where (for example) only the bottom 3 bits in the
input array were significant, those three bits of uncertainty could
potentially affect any bit in the entropy pool --- as opposed to only
twiddling the bottom three bits in the pool in the absence of having
the mixing function.

Anyway, for the purposes of the fast_mix, it's quite possible that
simply using the input_rotate and discarding everything else would
have been sufficient. There was a bit of overdesign even at the
beginning.

Cheers,

- Ted