2014-06-10 16:14:10

by Theodore Ts'o

Subject: Re: [PATCH] random: mix all saved registers into entropy pool

On Mon, May 19, 2014 at 05:17:19PM -0400, Jörn Engel wrote:
> +/*
> + * Ratelimit to a steady state of about once per jiffy. A na?ve approach
> + * would be to return 1 every time jiffies changes. But we want to avoid
> + * being closely coupled to the timer interrupt. So instead we increment
> + * a counter on every call and shift it right every time jiffies changes.
> + * If the counter is a power of two, return false;
> + *
> + * Effect is that some time after a jiffies change and cutting the counter
> + * in half we reach another power of two and return false. But the
> + * likelihood of this happening is about the same at any time within a
> + * jiffies interval.
> + */
> +static inline int ratelimited(struct fast_pool *p)
> +{
> + int ret = !(p->regs_count == 0 || is_power_of_2(p->regs_count));
> +
> + p->regs_count++;
> + if (p->last_shift != (u32)jiffies) {
> + p->regs_count >>= min(31u, (u32)jiffies - p->last_shift);
> + p->last_shift = (u32)jiffies;
> + }
> + return ret;
> +}

I wasn't convinced this would do the right thing, so I wrote a quick
test program where the main loop was basically this:

	for (i=0; i < 1024; i++) {
		jiffies = i >> 2;
		r = ratelimited(jiffies);
		printf("%5u %5u %5u %d\n", i, jiffies, regs_count, r);
	}

... which basically simulated a very simple scheme where there were
four interrupts for each clock tick. In the steady state ratelimited
returns true 75% of the time. If this worked as documented, we would
expect it to return true 25% of the time. So I don't think this is
working quite right:

20 5 3 1
21 5 4 1
22 5 5 0
23 5 6 1
24 6 3 1
25 6 4 1
26 6 5 0
27 6 6 1
28 7 3 1
29 7 4 1
30 7 5 0
31 7 6 1
32 8 3 1


> +static void mix_regs(struct pt_regs *regs, struct fast_pool *fast_pool)
> +{
> + struct entropy_store *r;
> + /* Two variables avoid decrementing by two without using atomics */
> + static int boot_count = BOOT_IRQS;
> + int in_boot = boot_count;
> +
> + if (in_boot) {
> + boot_count = in_boot - 1;
> + } else if (ratelimited(fast_pool))
> + return;
> +
> + /* During bootup we alternately feed both pools */
> + r = (in_boot & 1) ? &nonblocking_pool : &input_pool;
> + __mix_pool_bytes(r, regs, sizeof(*regs), NULL);
> +}

I'm also concerned about how much overhead this might eat up. I've
already had someone who was benchmarking a high performance storage
array where the increased interrupt latency before adding something
like this was something he noticed, and kvetched to me about. The
pt_regs structure is going to be larger than the fast_pool (which is
only 16 bytes), and we're going to be doing it once every jiffy, which
means between 100 and 1000 times per second.

If we do this only during boot up, I'd be much less concerned, but
doing it all the time is making me a bit nervous about the overhead.

If we knew which parts of the pt_regs were actually the "best" in
terms of entropy, we could xor that into the input[] array in
add_interrupt_randomness(), perhaps?


> - input[0] = cycles ^ j_high ^ irq;
> - input[1] = now ^ c_high;
> + input[0] ^= cycles ^ j_high ^ irq;
> + input[1] ^= now ^ c_high;

This is an unrelated change, so if you want to do this, let's discuss
this on a separate commit. We used to have something like this, but
it causes static code checkers to complain about our using stack
garbage, and people argued convincingly that it really didn't add that much
unpredictability, especially when correlated with the ip field.

(And yes, I know that instruction_pointer(regs) doesn't work on all
platforms, and _RET_IP_ becomes largely static --- but this is why I'd
much rather have each architecture tell me which parts of regs were
actually the best, and use that as the initialized portions of the
input[] array.)

Cheers,

- Ted


2014-06-11 00:12:13

by Jörn Engel

Subject: Re: [PATCH] random: mix all saved registers into entropy pool

On Tue, 10 June 2014 12:14:01 -0400, Theodore Ts'o wrote:
> On Mon, May 19, 2014 at 05:17:19PM -0400, Jörn Engel wrote:
> > +/*
> > + * Ratelimit to a steady state of about once per jiffy. A naïve approach
> > + * would be to return 1 every time jiffies changes. But we want to avoid
> > + * being closely coupled to the timer interrupt. So instead we increment
> > + * a counter on every call and shift it right every time jiffies changes.
> > + * If the counter is a power of two, return false;
> > + *
> > + * Effect is that some time after a jiffies change and cutting the counter
> > + * in half we reach another power of two and return false. But the
> > + * likelihood of this happening is about the same at any time within a
> > + * jiffies interval.
> > + */
> > +static inline int ratelimited(struct fast_pool *p)
> > +{
> > + int ret = !(p->regs_count == 0 || is_power_of_2(p->regs_count));
> > +
> > + p->regs_count++;
> > + if (p->last_shift != (u32)jiffies) {
> > + p->regs_count >>= min(31u, (u32)jiffies - p->last_shift);
> > + p->last_shift = (u32)jiffies;
> > + }
> > + return ret;
> > +}
>
> I wasn't convinced this would do the right thing, so I wrote a quick
> test program where the main loop was basically this:
>
> 	for (i=0; i < 1024; i++) {
> 		jiffies = i >> 2;
> 		r = ratelimited(jiffies);
> 		printf("%5u %5u %5u %d\n", i, jiffies, regs_count, r);
> 	}
>
> ... which basically simulated a very simple scheme where there were
> four interrupts for each clock tick. In the steady state ratelimited
> returns true 75% of the time. If this worked as documented, we would
> expect it to return true 25% of the time. So I don't think this is
> working quite right:
>
> 20 5 3 1
> 21 5 4 1
> 22 5 5 0
> 23 5 6 1
> 24 6 3 1
> 25 6 4 1
> 26 6 5 0
> 27 6 6 1
> 28 7 3 1
> 29 7 4 1
> 30 7 5 0
> 31 7 6 1
> 32 8 3 1

Doh! Removing the "!" from "!(p->regs_count..." will fix that. Shows
I spent 100x more time testing the bootup entropy collection.

> > +static void mix_regs(struct pt_regs *regs, struct fast_pool *fast_pool)
> > +{
> > + struct entropy_store *r;
> > + /* Two variables avoid decrementing by two without using atomics */
> > + static int boot_count = BOOT_IRQS;
> > + int in_boot = boot_count;
> > +
> > + if (in_boot) {
> > + boot_count = in_boot - 1;
> > + } else if (ratelimited(fast_pool))
> > + return;
> > +
> > + /* During bootup we alternately feed both pools */
> > + r = (in_boot & 1) ? &nonblocking_pool : &input_pool;
> > + __mix_pool_bytes(r, regs, sizeof(*regs), NULL);
> > +}
>
> I'm also concerned about how much overhead this might eat up. I've
> already had someone who was benchmarking a high performance storage
> array where the increased interrupt latency before adding something
> like this was something he noticed, and kvetched to me about. The
> pt_regs structure is going to be larger than the fast_pool (which is
> only 16 bytes), and we're going to be doing it once every jiffy, which
> means between 100 and 1000 times per second.

Would that someone be me? ;)

The reason I prefer doing it once every jiffy is that it doesn't
change with interrupt load. You get 10k interrupts per second?
You pay once per jiffy. 100k interrupts per second? Pay once per
jiffy.

I dislike the fact that HZ is anywhere between 100 and 1000. We could
sample every time jiffies has advanced by HZ/50, which gives a stable
sample rate of 50Hz for every currently possible config.

> If we do this only during boot up, I'd be much less concerned, but
> doing it all the time is making me a bit nervous about the overhead.
>
> If we knew which parts of the pt_regs were actually the "best" in
> terms of entropy, we could xor that into the input[] array in
> add_interrupt_randomness(), perhaps?
>
>
> > - input[0] = cycles ^ j_high ^ irq;
> > - input[1] = now ^ c_high;
> > + input[0] ^= cycles ^ j_high ^ irq;
> > + input[1] ^= now ^ c_high;
>
> This is an unrelated change, so if you want to do this, let's discuss
> this on a separate commit. We used to have something like this, but
> it causes static code checkers to complain about our using stack
> garbage, and people argued convincingly that it really didn't add that much
> unpredictability, especially when correlated with the ip field.

Will remove.

> (And yes, I know that instruction_pointer(regs) doesn't work on all
> platforms, and _RET_IP_ becomes largely static --- but this is why I'd
> much rather have each architecture tell me which parts of regs were
> actually the best, and use that as the initialized portions of the
> input[] array.)

What I like about my approach is quite the opposite. The only thing
an architecture maintainer can mess up is not saving registers on
interrupts. And if they mess that one up, they are very likely to
notice.

Jörn

--
Victory in war is not repetitious.
-- Sun Tzu

2014-06-11 15:27:48

by Theodore Ts'o

Subject: Re: [PATCH] random: mix all saved registers into entropy pool

On Tue, Jun 10, 2014 at 08:10:09PM -0400, Jörn Engel wrote:
> > I'm also concerned about how much overhead this might eat up. I've
> > already had someone who was benchmarking a high performance storage
> > array where the increased interrupt latency before adding something
> > like this was something he noticed, and kvetched to me about. The
> > pt_regs structure is going to be larger than the fast_pool (which is
> > only 16 bytes), and we're going to be doing it once every jiffy, which
> > means between 100 and 1000 times per second.
>
> Would that someone be me? ;)
>
> The reason I prefer doing it once every jiffy is that it doesn't
> change with interrupt load. You get 10k interrupts per second?
> You pay once per jiffy. 100k interrupts per second? Pay once per
> jiffy.

No, actually, it was someone who was worried about tail latency. So
even if the average overhead was small, if once a jiffy we had a much
larger time spent in the interrupt handler, that's something that
would be a big concern for someone who was worried about big storage
array performance.

I'd be happier if we used fast_mix() and mixed the registers into the
fast pool. That's much lighter weight than using mix_pool_bytes(),
which involves many more memory accesses, MUCH higher probability of
cache line bouncing between CPU's, etc.

And there have been some proposals that I've been looking at to make
fast_mix use even less overhead, so if we use fast_mix, any
changes we make to improve that will help here too.

> What I like about my approach is quite the opposite. The only thing
> an architecture maintainer can mess up is not saving registers on
> interrupts. And if they mess that one up, they are very likely to
> notice.

What probably makes sense is to make the defaults use pt_regs, but
make it be overrideable by the architecture maintainer. For example,
if the CPU has RDRAND or RDSEED, it probably doesn't make sense to
worry about mixing in the registers. So we could make the default be
to mix in the entire set of registers, and if someone complains about
the overhead on their system, and they have a better way of harvesting
high quality entropy, then they can use that instead of trying to rely
on the registers.

It's unlikely that someone is going to be attaching a multi-million
dollar RAID array or a PCIe attached flash device on a MIPS or m68k
platform, after all. :-)

- Ted

2014-06-12 20:07:47

by Jörn Engel

Subject: Re: [PATCH] random: mix all saved registers into entropy pool

On Tue, 10 June 2014 20:10:09 -0400, Jörn Engel wrote:
> On Tue, 10 June 2014 12:14:01 -0400, Theodore Ts'o wrote:
> > On Mon, May 19, 2014 at 05:17:19PM -0400, Jörn Engel wrote:
> > > +/*
> > > + * Ratelimit to a steady state of about once per jiffy. A naïve approach
> > > + * would be to return 1 every time jiffies changes. But we want to avoid
> > > + * being closely coupled to the timer interrupt. So instead we increment
> > > + * a counter on every call and shift it right every time jiffies changes.
> > > + * If the counter is a power of two, return false;
> > > + *
> > > + * Effect is that some time after a jiffies change and cutting the counter
> > > + * in half we reach another power of two and return false. But the
> > > + * likelihood of this happening is about the same at any time within a
> > > + * jiffies interval.
> > > + */
> > > +static inline int ratelimited(struct fast_pool *p)
> > > +{
> > > + int ret = !(p->regs_count == 0 || is_power_of_2(p->regs_count));
> > > +
> > > + p->regs_count++;
> > > + if (p->last_shift != (u32)jiffies) {
> > > + p->regs_count >>= min(31u, (u32)jiffies - p->last_shift);
> > > + p->last_shift = (u32)jiffies;
> > > + }
> > > + return ret;
> > > +}
> >
> > I wasn't convinced this would do the right thing, so I wrote a quick
> > test program where the main loop was basically this:
> >
> > 	for (i=0; i < 1024; i++) {
> > 		jiffies = i >> 2;
> > 		r = ratelimited(jiffies);
> > 		printf("%5u %5u %5u %d\n", i, jiffies, regs_count, r);
> > 	}
> >
> > ... which basically simulated a very simple scheme where there were
> > four interrupts for each clock tick. In the steady state ratelimited
> > returns true 75% of the time. If this was as documented, we would
> > expect it to return true 25% of the time. So I don't think this is
> > working quite right:
> >
> > 20 5 3 1
> > 21 5 4 1
> > 22 5 5 0
> > 23 5 6 1
> > 24 6 3 1
> > 25 6 4 1
> > 26 6 5 0
> > 27 6 6 1
> > 28 7 3 1
> > 29 7 4 1
> > 30 7 5 0
> > 31 7 6 1
> > 32 8 3 1
>
> Doh! Removing the "!" from "!(p->regs_count..." will fix that. Shows
> I spent 100x more time testing the bootup entropy collection.

On second look, the function was correct. If ratelimited() returns
true, the interrupt is ratelimited, i.e. ignored. So correct
behaviour is to return false 25% of the time, just like it does.

You confused me for a moment!

Jörn

--
There are two ways of constructing a software design: one way is to make
it so simple that there are obviously no deficiencies, and the other is
to make it so complicated that there are no obvious deficiencies.
-- C. A. R. Hoare

2014-06-12 20:27:33

by Jörn Engel

Subject: Re: [PATCH] random: mix all saved registers into entropy pool

On Wed, 11 June 2014 11:27:42 -0400, Theodore Ts'o wrote:
> On Tue, Jun 10, 2014 at 08:10:09PM -0400, Jörn Engel wrote:
> > > I'm also concerned about how much overhead this might eat up. I've
> > > already had someone who was benchmarking a high performance storage
> > > array where the increased interrupt latency before adding something
> > > like this was something he noticed, and kvetched to me about. The
> > > pt_regs structure is going to be larger than the fast_pool (which is
> > > only 16 bytes), and we're going to be doing it once every jiffy, which
> > > means between 100 and 1000 times per second.
> >
> > Would that someone be me? ;)
> >
> > The reason I prefer doing it once every jiffy is that it doesn't
> > change with interrupt load. You get 10k interrupts per second?
> > You pay once per jiffy. 100k interrupts per second? Pay once per
> > jiffy.
>
> No, actually, it was someone who was worried about tail latency. So
> even if the average overhead was small, if once a jiffy we had a much
> larger time spent in the interrupt handler, that's something that
> would be a big concern for someone who was worried about big storage
> array performance.
>
> I'd be happier if we used fast_mix() and mixed the registers into the
> fast pool. That's much lighter weight than using mix_pool_bytes(),
> which involves many more memory accesses, MUCH higher probability of
> cache line bouncing between CPU's, etc.
>
> And there have been some proposals that I've been looking at to make
> fast_mix use even less overhead, so if we use fast_mix, any
> changes we make to improve that will help here too.

That makes sense to me. It would require replacing the current fast_mix()
with something that doesn't assume __u32 input[4] as the only
possible parameter. In a previous incarnation of my patch I was using
a single round of siphash to condense the registers and added that to
our fast_pool.

While siphash is fairly fast, doing that for every interrupt seemed
too much for me, hence the current ratelimit approach. But if people
care more about maximum latency than amortized cost, we can combine
siphash (or something similar) with the ratelimit.

At any rate, here is a slightly updated patch that is independent of
CONFIG_HZ.

Jörn

--
"Error protection by error detection and correction."
-- from a university class

From 867d62e0165723f9d8dc568d0cb9d8898948ddaa Mon Sep 17 00:00:00 2001
From: Joern Engel <[email protected]>
Date: Sat, 15 Feb 2014 16:29:28 -0800
Subject: [PATCH] random: mix all saved registers into entropy pool

The single biggest entropy source is a high-resolution timer running
asynchronous to the triggering event. That leaves systems without a
useful get_cycles() implementation with significantly less entropy.
Significant means orders of magnitude, not a few percent.

An alternate high-resolution timer is the register content at the time
of an interrupt. While not monotonic, it is guaranteed to change every
few clock cycles and very unlikely to repeat the same pattern. Not
useful for interpretation as time, but we only care about some bits
of the "clock" flipping in an unpredictable fashion.

Experimentation shows this to be an excellent entropy source. Doing 1000
boottests with kvm and dumping a hash of the registers for the first
1024 interrupts each, >40% of all hashes were unique and >80% of all
hashes occurred less than once 1% of the time.

Even assuming that only unique register hashes yield any entropy at all
and only one bit each, we would gather >400 bits of entropy early in
boot and another 40 bits/s from the timer interrupt during runtime. I
would feel confident with this amount of entropy in the absence of any
other source.

Repeating the test on real hardware, albeit with fewer repetitions,
yields roughly the same results, slightly above 40% unique hashes.

Signed-off-by: Joern Engel <[email protected]>
---
drivers/char/random.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 65 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 429b75bb60e8..8e9f6274cf76 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -553,6 +553,8 @@ static void mix_pool_bytes(struct entropy_store *r, const void *in,

struct fast_pool {
__u32 pool[4];
+ u32 last_shift;
+ u32 regs_count;
unsigned long last;
unsigned short count;
unsigned char rotate;
@@ -835,6 +837,64 @@ EXPORT_SYMBOL_GPL(add_input_randomness);

static DEFINE_PER_CPU(struct fast_pool, irq_randomness);

+/*
+ * Ratelimit to a steady state of about 50Hz. A naïve approach would be
+ * to return 1 every time jiffies changes. But we want to avoid being
+ * closely coupled to the timer interrupt. So instead we increment a
+ * counter on every call and shift it right every time jiffies increments
+ * by 20ms.
+ * 20ms or 50Hz is chosen to work well for all options of CONFIG_HZ.
+ * If the counter is a power of two, return false.
+ *
+ * Effect is that some time after a jiffies change and cutting the counter
+ * in half we reach another power of two and return false. But the
+ * likelihood of this happening is about the same at any time within a
+ * jiffies interval.
+ */
+static inline int ratelimited(struct fast_pool *p)
+{
+ int ret = !(p->regs_count == 0 || is_power_of_2(p->regs_count));
+ u32 jiffies_low = (u32)jiffies;
+
+ p->regs_count++;
+ if (p->last_shift > jiffies_low || p->last_shift + HZ/50 <= jiffies_low) {
+ p->regs_count >>= 1;
+ p->last_shift = jiffies_low;
+ }
+ return ret;
+}
+
+#define BOOT_IRQS 1024
+
+/*
+ * The single biggest entropy source is a high-resolution timer running
+ * asynchronous to the triggering event. That leaves systems without a
+ * useful get_cycles() implementation with significantly less entropy.
+ * Significant means orders of magnitude, not a few percent.
+ *
+ * An alternate high-resolution timer is the register content at the time
+ * of an interrupt. While not monotonic, it is guaranteed to change every
+ * few clock cycles and very unlikely to repeat the same pattern. Not
+ * useful for interpretation as time, but we only care about some bits
+ * of the "clock" flipping in an unpredictable fashion.
+ */
+static void mix_regs(struct pt_regs *regs, struct fast_pool *fast_pool)
+{
+ struct entropy_store *r;
+ /* Two variables avoid decrementing by two without using atomics */
+ static int boot_count = BOOT_IRQS;
+ int in_boot = boot_count;
+
+ if (in_boot) {
+ boot_count = in_boot - 1;
+ } else if (ratelimited(fast_pool))
+ return;
+
+ /* During bootup we alternately feed both pools */
+ r = (in_boot & 1) ? &nonblocking_pool : &input_pool;
+ __mix_pool_bytes(r, regs, sizeof(*regs), NULL);
+}
+
void add_interrupt_randomness(int irq, int irq_flags)
{
struct entropy_store *r;
@@ -845,13 +905,14 @@ void add_interrupt_randomness(int irq, int irq_flags)
__u32 input[4], c_high, j_high;
__u64 ip;

+ mix_regs(regs, fast_pool);
c_high = (sizeof(cycles) > 4) ? cycles >> 32 : 0;
j_high = (sizeof(now) > 4) ? now >> 32 : 0;
- input[0] = cycles ^ j_high ^ irq;
- input[1] = now ^ c_high;
+ input[0] ^= cycles ^ j_high ^ irq;
+ input[1] ^= now ^ c_high;
ip = regs ? instruction_pointer(regs) : _RET_IP_;
- input[2] = ip;
- input[3] = ip >> 32;
+ input[2] ^= ip;
+ input[3] ^= ip >> 32;

fast_mix(fast_pool, input);

--
2.0.0.rc0.1.g7b2ba98