2013-09-10 11:31:48

by Stephan Müller

Subject: [PATCH] /dev/random: Insufficient of entropy on many architectures

Hi,

/dev/random uses the get_cycles() function to obtain entropy in addition to jiffies and the event value of hardware events.

Typically the high-resolution timer of get_cycles delivers the majority of entropy, because the event value is quite deterministic and jiffies are very coarse.

However, on the following architectures, get_cycles will return 0:

- MIPS

- User mode Linux

- Sparc 32 bit

- M68K

- M32R

- Hexagon

- H8/300

- FR-V

- CRIS

- AVR32

- ARC

- METAG

- Microblaze

- SCORE

- SH

- Unicore32

That means that on those architectures, /dev/random will not deliver as much entropy as you would hope.

The following patch uses the clocksource clock for a time value in case get_cycles returns 0. As clocksource may not be available during boot time, a flag is introduced which allows random.c to check the availability of clocksource.

The patch was tested on an x86_64 system with the call to get_cycles disabled, to verify that the clocksource path delivers data.

Ciao
Stephan

Signed-off-by: Stephan Mueller <[email protected]>

---
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0d91fe5..d2d14a1 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -255,6 +255,7 @@
#include <linux/fips.h>
#include <linux/ptrace.h>
#include <linux/kmemcheck.h>
+#include <linux/time.h>

#ifdef CONFIG_GENERIC_HARDIRQS
# include <linux/irq.h>
@@ -633,6 +634,23 @@ struct timer_rand_state {
unsigned dont_count_entropy:1;
};

+static inline __u64 get_nstime(void)
+{
+	struct timespec ts;
+	__u64 tmp = 0;
+
+	tmp = get_cycles();
+
+	if ((0 == tmp) &&
+	    timekeeping_initialized() &&
+	    (0 == __getnstimeofday(&ts)))
+	{
+		tmp = ts.tv_sec;
+		tmp = tmp << 32;
+		tmp = tmp | ts.tv_nsec;
+	}
+	return tmp;
+}
/*
* Add device- or boot-specific data to the input and nonblocking
* pools to help initialize them to unique values.
@@ -643,7 +661,7 @@ struct timer_rand_state {
*/
void add_device_randomness(const void *buf, unsigned int size)
{
- unsigned long time = get_cycles() ^ jiffies;
+ unsigned long time = get_nstime() ^ jiffies;

mix_pool_bytes(&input_pool, buf, size, NULL);
mix_pool_bytes(&input_pool, &time, sizeof(time), NULL);
@@ -680,7 +698,7 @@ static void add_timer_randomness(struct timer_rand_state *state, unsigned num)
goto out;

sample.jiffies = jiffies;
- sample.cycles = get_cycles();
+ sample.cycles = get_nstime();
sample.num = num;
mix_pool_bytes(&input_pool, &sample, sizeof(sample), NULL);

@@ -747,7 +765,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
struct fast_pool *fast_pool = &__get_cpu_var(irq_randomness);
struct pt_regs *regs = get_irq_regs();
unsigned long now = jiffies;
- __u32 input[4], cycles = get_cycles();
+ __u32 input[4], cycles = get_nstime();

input[0] = cycles ^ jiffies;
input[1] = irq;
@@ -1486,7 +1504,7 @@ unsigned int get_random_int(void)

hash = get_cpu_var(get_random_int_hash);

- hash[0] += current->pid + jiffies + get_cycles();
+ hash[0] += current->pid + jiffies + get_nstime();
md5_transform(hash, random_int_secret);
ret = hash[0];
put_cpu_var(get_random_int_hash);
diff --git a/include/linux/time.h b/include/linux/time.h
index d5d229b..0922661 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -180,6 +180,7 @@ extern int timekeeping_inject_offset(struct timespec *ts);
extern s32 timekeeping_get_tai_offset(void);
extern void timekeeping_set_tai_offset(s32 tai_offset);
extern void timekeeping_clocktai(struct timespec *ts);
+extern bool timekeeping_initialized(void);

struct tms;
extern void do_sys_times(struct tms *);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 48b9fff..75b1613 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -35,6 +35,7 @@ static struct timekeeper timekeeper;
static DEFINE_RAW_SPINLOCK(timekeeper_lock);
static seqcount_t timekeeper_seq;
static struct timekeeper shadow_timekeeper;
+static bool timekeeper_enabled = 0;

/* flag for if timekeeping is suspended */
int __read_mostly timekeeping_suspended;
@@ -833,8 +834,15 @@ void __init timekeeping_init(void)

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
+ timekeeper_enabled = 1;
}

+bool timekeeping_initialized(void)
+{
+ return timekeeper_enabled;
+}
+EXPORT_SYMBOL(timekeeping_initialized);
+
/* time in seconds when suspend began */
static struct timespec timekeeping_suspend_time;


2013-09-10 11:46:41

by Geert Uytterhoeven

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 1:31 PM, Stephan Mueller <[email protected]> wrote:
> /dev/random uses the get_cycles() function to obtain entropy in addition to jiffies and the event value of hardware events.
>
> Typically the high-resolution timer of get_cycles delivers the majority of entropy, because the event value is quite deterministic and jiffies are very coarse.
>
> However, on the following architectures, get_cycles will return 0:

> - M68K

Thanks, m68k is being worked on, cfr. https://lkml.org/lkml/2013/9/9/441

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2013-09-10 15:04:26

by Theodore Ts'o

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 01:31:41PM +0200, Stephan Mueller wrote:
> Hi,
>
> /dev/random uses the get_cycles() function to obtain entropy in addition to jiffies and the event value of hardware events.
>
> Typically the high-resolution timer of get_cycles delivers the majority of entropy, because the event value is quite deterministic and jiffies are very coarse.
>
> However, on the following architectures, get_cycles will return 0....

I am working on this issue with the MIPS maintainers, and on all of
the platforms where we have some kind of counter which is derived from
the CPU cycle clock, we should use it. So for example there is a
register on MIPS which is incremented on every single clock cycle mod
the number of entries in the TLB. This isn't sufficient for
get_cycles() in general, but what I am thinking about doing is
defining interface random_get_fast_cycles() which can be get_cycles()
on those platforms that have such an interface, but on platforms that
don't we can try to do something else.
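A minimal sketch of what such an interface could default to, purely for illustration (random_get_fast_cycles() is only the name proposed above, nothing here exists in the tree, and architectures would override it with whatever cheap counter they have):

#include <linux/timex.h>	/* get_cycles(), cycles_t */

/*
 * Hypothetical default: fall back to get_cycles(); an architecture with
 * a cheaper or finer-grained counter would provide its own version.
 */
#ifndef random_get_fast_cycles
static inline cycles_t random_get_fast_cycles(void)
{
	return get_cycles();
}
#endif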

> The following patch uses the clocksource clock for a time value in
> case get_cycles returns 0. As clocksource may not be available
> during boot time, a flag is introduced which allows random.c to
> check the availability of clocksource.

I'm a bit concerned about doing things this way because reading the
clocksource clock might be quite heavyweight, and we need something
which is very low overhead, since we call get_cycles() on every single
interrupt. If reading fom the clocksource clock is the equivalent of
a L3 cache miss (or worse) doing this on every single interrupt could
be highly problematic. So I think we will need to implement a
random_get_fast_cycles() for each platform for which get_cycles() is
not available. In some cases we may be able to use the local clock
source (if that's the best we can do), but in others, that may not be
appropriate at all.

Cheers,

- Ted

2013-09-10 16:54:45

by Stephan Müller

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tuesday, 10 September 2013 at 11:04:19, Theodore Ts'o wrote:

Hi Theodore,

>On Tue, Sep 10, 2013 at 01:31:41PM +0200, Stephan Mueller wrote:
>> Hi,
>>
>> /dev/random uses the get_cycles() function to obtain entropy in
>> addition to jiffies and the event value of hardware events.
>>
>> Typically the high-resolution timer of get_cycles delivers the
>> majority of entropy, because the event value is quite deterministic
>> and jiffies are very coarse.
>>
>> However, on the following architectures, get_cycles will return 0....
>
>I am working on this issue with the MIPS maintainers, and on all of
>the platforms where we have some kind of counter which is derived from
>the CPU cycle clock, we should use it. So for example there is a
>register on MIPS which is incremented on every single clock cycle mod
>the number of entries in the TLB. This isn't sufficient for
>get_cycles() in general, but what I am thinking about doing is
>defining interface random_get_fast_cycles() which can be get_cycles()
>on those platforms that have such an interface, but on platforms that
>don't we can try to do something else.

So, MIPS seems to be covered, and m68k too. But what about the many
other platforms?
>
>> The following patch uses the clocksource clock for a time value in
>> case get_cycles returns 0. As clocksource may not be available
>> during boot time, a flag is introduced which allows random.c to
>> check the availability of clocksource.
>
>I'm a bit concerned about doing things this way because reading the
>clocksource clock might be quite heavyweight, and we need something
>which is very low overhead, since we call get_cycles() on every single
>interrupt. If reading from the clocksource clock is the equivalent of
>an L3 cache miss (or worse), doing this on every single interrupt could
>be highly problematic. So I think we will need to implement a
>random_get_fast_cycles() for each platform for which get_cycles() is
>not available. In some cases we may be able to use the local clock
>source (if that's the best we can do), but in others, that may not be
>appropriate at all.

In any case, we need to make sure that get_cycles (or its planned
replacement random_get_fast_cycles) delivers a high-resolution
timer.

However, as the RNG is critical to crypto, we should make sure that we
have a fix sooner rather than later.

Why do you say that clocksource is heavyweight? Yes, there is a bit more
code than for get_cycles, but it all usually ends up in a clock-read
routine just as small as get_cycles.

Moreover, until your proposed real fix is available, wouldn't it make
sense to have an interim patch to ensure we have entropy on the mentioned
platforms? I think /dev/random is critical enough to warrant a cache
miss even per interrupt?
>
>Cheers,
>
> - Ted


Ciao
Stephan
--
| Take the law away - |
| what then is the state but a great band of robbers? |

2013-09-10 18:25:31

by Theodore Ts'o

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 06:54:38PM +0200, Stephan Mueller wrote:
>
> Why do you say that clocksource is heavyweight? Yes, there is a bit more
> code than for get_cycles, but that is all just leading to usually an
> equally small clock read code as get_cycles.

I've had past experiences where a clock source can be *very*
expensive. One such example was the IBM x440 server, where
gettimeofday() ended up requiring a fetch of the clock across the
"scalability cable" (which connected up to 3 other boxes, each box
containing 4 CPU sockets --- this was a big, nasty NUMA system), and
it was expensive enough that a financial firm that was trying to get a
hard timestamp for every single stock transaction had its performance
hit massively. And this was something running in userspace. Remember
what I said, this is being done on *every* *single* *interrupt*.

The bigger problem is that people will scream about the performance
overhead, and then patches to remove the /dev/random hooks from the
core irq code will start showing up on LKML. This is originally how
the SA_SAMPLE_RANDOM sampling got removed from device drivers, and I'm
sensitive to the fact that not everyone on LKML cares as much about
security, and many people are near fanatical about performance. So I
don't want this to be a CONFIG option or a run-time option (since
distributions will be tempted to turn it off to win the performance
benchmarking war). Maybe not right now, when everyone is all worked
up about the NSA. But in the long term, I don't want anyone to have
an excuse to have the entropy sampling removed due to performance
reasons.

> Moreover, until having your proposed real fix, wouldn't it make sense to
> have an interim patch to ensure we have entropy on the mentioned
> platforms? I think /dev/random is critical enough to warrant some cache
> miss even per interrupt?

We are already mixing in the IP from the saved irq registers, on every
single interrupt, so we are mixing in some entropy. We would get more
entropy if we had a good cycle counter to mix in, but it's not the
case that we're completely exposed right now on those platforms which
don't have get_cycles() implemented. If the system is running in such lock
step that the location of the interrupts is utterly predictable, then
it's not clear that using a clock source is going to help you....

Also note that the clocksource is not necessarily the best choice;
it may not be the most fine-grained sampling that we have available
(that is certainly true for MIPS). So doing something hacky
doesn't absolve us from the need to examine every single platform that
has a no-op get_cycles() function. (Well, at least those platforms
that we think really are going to be running security sensitive
systems ---- does anyone think we really have production systems
running m68k? Maybe, but....)

If we believed that /dev/random was actually returning numbers which
are exploitable, because of this, I might agree with the "we must do
SOMETHING" attitude. But I don't believe this to be the case. Also
note that we're talking about embedded platforms, where upgrade cycles
are measured in years --- if you're lucky. There are probably home
routers still stuck on 2.6, at which point they will be far more
susceptible to the problems described at http://factorable.net.

So I don't think we need to rush. I'd much rather make sure we get
this fixed right.

- Ted

2013-09-10 19:15:41

by Stephan Müller

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tuesday, 10 September 2013 at 14:25:24, Theodore Ts'o wrote:

Hi Theodore,

>
>Also note that the clocksource is not necessarily the best choice;
>it may not be the most fine-grained sampling that we have available
>(that is certainly true for MIPS). So doing something hacky
>doesn't absolve us from the need to examine every single platform that
>has a no-op get_cycles() function. (Well, at least those platforms
>that we think really are going to be running security sensitive
>systems ---- does anyone think we really have production systems
>running m68k? Maybe, but....)

That is why I placed the clocksource call behind the check that
get_cycles does not return anything, so it is only used if get_cycles
does not deliver anything.

But then, your experience with clocksource really has a point.
>
>If we believed that /dev/random was actually returning numbers which
>are exploitable, because of this, I might agree with the "we must do
>SOMETHING" attitude. But I don't believe this to be the case. Also
>note that we're talking about embedded platforms, where upgrade cycles
>are measured in years --- if you're lucky. There are probably home
>routers still stuck on 2.6, at which point they will be far more
>susceptible to the problems described at http://factorable.net.

The core problem I guess on this matter is again the strange use of
/dev/random in embedded devices -- no seeding, no spinning disks, no
heads, no nothing.

Here my CPU jitter RNG would help, but there seems to be no interest in
even discussing it... PS: I have my CPU jitter RNG running nicely as an
entropy feeder daemon on my router box, though, which happens to
be a MIPS box...
>
>So I don't think we need to rush. I'd much rather make sure we get
>this fixed right.
>
> - Ted


Ciao
Stephan

2013-09-10 19:38:59

by John Stultz

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 4:31 AM, Stephan Mueller <[email protected]> wrote:
> Hi,
>
> /dev/random uses the get_cycles() function to obtain entropy in addition to jiffies and the event value of hardware events.
>
> Typically the high-resolution timer of get_cycles delivers the majority of entropy, because the event value is quite deterministic and jiffies are very coarse.
>
[snip]
> The following patch uses the clocksource clock for a time value in case get_cycles returns 0. As clocksource may not be available during boot time, a flag is introduced which allows random.c to check the availability of clocksource.
>
[snip]
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 48b9fff..75b1613 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -35,6 +35,7 @@ static struct timekeeper timekeeper;
> static DEFINE_RAW_SPINLOCK(timekeeper_lock);
> static seqcount_t timekeeper_seq;
> static struct timekeeper shadow_timekeeper;
> +static bool timekeeper_enabled = 0;
>
> /* flag for if timekeeping is suspended */
> int __read_mostly timekeeping_suspended;
> @@ -833,8 +834,15 @@ void __init timekeeping_init(void)
>
> write_seqcount_end(&timekeeper_seq);
> raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
> + timekeeper_enabled = 1;
> }
>

So the end of timekeeping_init() may not be what you want here. This
only means we've started up the timekeeping core with only the default
clocksource (with only a few exceptions, this is almost always jiffies).
Then as clocksource drivers are initialized, they are registered and
the timekeeping core will switch over to the best available
clocksource. Also, to avoid the churn at boot of switching to every
clocksource registered, we queue them up and wait until fs_init time
to switch to whatever is the best available then.

So it's likely with this patch that the systems all still end up using
jiffies for their clocksource at least until fs_init time.

thanks
-john

2013-09-10 19:45:00

by John Stultz

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 12:38 PM, John Stultz <[email protected]> wrote:
> On Tue, Sep 10, 2013 at 4:31 AM, Stephan Mueller <[email protected]> wrote:
>> Hi,
>>
>> /dev/random uses the get_cycles() function to obtain entropy in addition to jiffies and the event value of hardware events.
>>
>> Typically the high-resolution timer of get_cycles delivers the majority of entropy, because the event value is quite deterministic and jiffies are very coarse.
>>
> [snip]
>> The following patch uses the clocksource clock for a time value in case get_cycles returns 0. As clocksource may not be available during boot time, a flag is introduced which allows random.c to check the availability of clocksource.

Also, note that at the end of timekeeping_init(), we may not have been
able to access the RTC (as some RTCs require interrupts to access), so
on those systems getnstimeofday() may return something close to 0
until the RTC subsystem initializes and sets the current wall time.

thanks
-john

2013-09-10 19:47:12

by Stephan Müller

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tuesday, 10 September 2013 at 12:38:56, John Stultz wrote:

Hi John,

>On Tue, Sep 10, 2013 at 4:31 AM, Stephan Mueller <[email protected]>
wrote:
>> Hi,
>>
>> /dev/random uses the get_cycles() function to obtain entropy in
>> addition to jiffies and the event value of hardware events.
>>
>> Typically the high-resolution timer of get_cycles delivers the
>> majority of entropy, because the event value is quite deterministic
>> and jiffies are very coarse.
>[snip]
>
>> The following patch uses the clocksource clock for a time value in
>> case get_cycles returns 0. As clocksource may not be available
>> during boot time, a flag is introduced which allows random.c to
>> check the availability of clocksource.
>[snip]
>
>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index 48b9fff..75b1613 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -35,6 +35,7 @@ static struct timekeeper timekeeper;
>>
>> static DEFINE_RAW_SPINLOCK(timekeeper_lock);
>> static seqcount_t timekeeper_seq;
>> static struct timekeeper shadow_timekeeper;
>>
>> +static bool timekeeper_enabled = 0;
>>
>> /* flag for if timekeeping is suspended */
>> int __read_mostly timekeeping_suspended;
>>
>> @@ -833,8 +834,15 @@ void __init timekeeping_init(void)
>>
>> write_seqcount_end(&timekeeper_seq);
>> raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
>>
>> + timekeeper_enabled = 1;
>>
>> }
>
>So the end of timekeeping_init() may not be what you want here. This
>only means we've started up the timekeeping core with only the default
>clocksource (with only a few exceptions, this is almost always jiffies).
>Then as clocksource drivers are initialized, they are registered and
>the timekeeping core will switch over to the best available
>clocksource. Also, to avoid the churn at boot of switching to every
>clocksource registered, we queue them up and wait until fs_init time
>to switch to whatever is the best available then.
>
>So it's likely with this patch that the systems all still end up using
>jiffies for their clocksource at least until fs_init time.

Thank you for the explanation. Is there any trigger that is fired at
fs_init time that one can read?

Thanks
>
>thanks
>-john


Ciao
Stephan

2013-09-10 20:35:05

by John Stultz

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On 09/10/2013 12:47 PM, Stephan Mueller wrote:
> On Tuesday, 10 September 2013 at 12:38:56, John Stultz wrote:
>
> Hi John,
>
>> On Tue, Sep 10, 2013 at 4:31 AM, Stephan Mueller <[email protected]>
> wrote:
>>> Hi,
>>>
>>> /dev/random uses the get_cycles() function to obtain entropy in
>>> addition to jiffies and the event value of hardware events.
>>>
>>> Typically the high-resolution timer of get_cycles delivers the
>>> majority of entropy, because the event value is quite deterministic
>>> and jiffies are very coarse.
>> [snip]
>>
>>> The following patch uses the clocksource clock for a time value in
>>> case get_cycles returns 0. As clocksource may not be available
>>> during boot time, a flag is introduced which allows random.c to
>>> check the availability of clocksource.
>> [snip]
>>
>>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>>> index 48b9fff..75b1613 100644
>>> --- a/kernel/time/timekeeping.c
>>> +++ b/kernel/time/timekeeping.c
>>> @@ -35,6 +35,7 @@ static struct timekeeper timekeeper;
>>>
>>> static DEFINE_RAW_SPINLOCK(timekeeper_lock);
>>> static seqcount_t timekeeper_seq;
>>> static struct timekeeper shadow_timekeeper;
>>>
>>> +static bool timekeeper_enabled = 0;
>>>
>>> /* flag for if timekeeping is suspended */
>>> int __read_mostly timekeeping_suspended;
>>>
>>> @@ -833,8 +834,15 @@ void __init timekeeping_init(void)
>>>
>>> write_seqcount_end(&timekeeper_seq);
>>> raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
>>>
>>> + timekeeper_enabled = 1;
>>>
>>> }
>> So the end of timekeeping_init() may not be what you want here. This
>> only means we've started up the timekeeping core with only the default
>> clocksource (with only a few exceptions, this is almost always jiffies).
>> Then as clocksource drivers are initialized, they are registered and
>> the timekeeping core will switch over to the best available
>> clocksource. Also, to avoid the churn at boot of switching to every
>> clocksource registered, we queue them up and wait until fs_init time
>> to switch to whatever is the best available then.
>>
>> So it's likely with this patch that the systems all still end up using
>> jiffies for their clocksource at least until fs_init time.
> Thank you for the explanation. Is there any trigger that is fired at
> fs_init time that one can read?
Check out clocksource_done_booting(), as that's probably where you'd
want to add something similar. Now this doesn't ensure non-jiffies
clocksources are installed, just that we're far enough into boot that we
assume most of the system's clocksources have been registered.

timekeeping_valid_for_hres() might be a better check for what you're
looking for, since that means some free-running clocksource has been
installed.
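As a sketch, this is how the get_nstime() fallback from the original patch could gate on that check instead of the new timekeeper_enabled flag (assuming timekeeping_valid_for_hres() is reachable from random.c; it lives in kernel/time/timekeeping.c and is not exported to modules):

static inline __u64 get_nstime(void)
{
	struct timespec ts;
	__u64 tmp = get_cycles();

	/* Only fall back to the clock once a hres-capable (non-jiffies)
	 * clocksource has actually been installed. */
	if (tmp == 0 && timekeeping_valid_for_hres() &&
	    __getnstimeofday(&ts) == 0)
		tmp = ((__u64)ts.tv_sec << 32) | ts.tv_nsec;

	return tmp;
}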

thanks
-john

2013-09-10 20:38:59

by Theodore Ts'o

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 12:38:56PM -0700, John Stultz wrote:
> So the end of timekeeping_init() may not be what you want here. This
> only means we've started up the timekeeping core with only the default
> clocksource (with only a few exceptions, this is almost always jiffies).
> Then as clocksource drivers are initialized, they are registered and
> the timekeeping core will switch over to the best available
> clocksource. Also, to avoid the churn at boot of switching to every
> clocksource registered, we queue them up and wait until fs_init time
> to switch to whatever is the best available then.

Is there any indication in the clocksource structures from which we can
determine what the cost (in CPU, time, bus overhead, etc.) of a
particular clock source is, versus the granularity of the clock source?

Also, is it always safe to read from a clock source from an interrupt
handler?

Thanks,

- Ted

2013-09-10 20:46:25

by John Stultz

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On 09/10/2013 01:38 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 12:38:56PM -0700, John Stultz wrote:
>> So the end of timekeeping_init() may not be what you want here. This
>> only means we've started up the timekeeping core with only the default
>> clocksource (with only a few exceptions, this is almost always jiffies).
>> Then as clocksource drivers are initialized, they are registered and
>> the timekeeping core will switch over to the best available
>> clocksource. Also, to avoid the churn at boot of switching to every
>> clocksource registered, we queue them up and wait until fs_init time
>> to switch to whatever is the best available then.
> Is there any indication in the clocksource structures from which we can
> determine what the cost (in CPU, time, bus overhead, etc.) of a
> particular clock source is, versus the granularity of the clock source?

Not really. We only have the rating, which is assigned by the developer
to try to order the clocks in some linear fashion so we can select the best one.

We do read the clocksource fairly frequently though, so rather than
re-reading on each interrupt, could you instead re-use the points at
interrupt time where we already read the clocksource, like in
hrtimer_interrupt()?

Otherwise, it might be good to stick to sched_clock, which is usually
backed by higher-performance counters since its correctness requirements
are less strict.
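For illustration, a sketch of that sched_clock() fallback; the helper name is made up, sched_clock() itself is the existing interface, and as noted it may be jiffies-backed on some systems:

#include <linux/sched.h>	/* sched_clock() */
#include <linux/timex.h>	/* get_cycles(), cycles_t */

/* Hypothetical helper: prefer the cycle counter, else sched_clock(). */
static inline __u64 fast_time_sample(void)
{
	cycles_t c = get_cycles();

	return c ? (__u64)c : sched_clock();
}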

> Also, is it always safe to read from a clock source from an interrupt
> handler?
Yes.

thanks
-john

2013-09-10 20:48:03

by Geert Uytterhoeven

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 5:04 PM, Theodore Ts'o <[email protected]> wrote:
> On Tue, Sep 10, 2013 at 01:31:41PM +0200, Stephan Mueller wrote:
>> /dev/random uses the get_cycles() function to obtain entropy in addition to jiffies and the event value of hardware events.
>>
>> Typically the high-resolution timer of get_cycles delivers the majority of entropy, because the event value is quite deterministic and jiffies are very coarse.
>>
>> However, on the following architectures, get_cycles will return 0....
>
> I am working on this issue with the MIPS maintainers, and on all of
> the platforms where we have some kind of counter which is derived from
> the CPU cycle clock, we should use it. So for example there is a
> register on MIPS which is incremented on every single clock cycle mod
> the number of entries in the TLB. This isn't sufficient for
> get_cycles() in general, but what I am thinking about doing is
> defining interface random_get_fast_cycles() which can be get_cycles()
> on those platforms that have such an interface, but on platforms that
> don't we can try to do something else.
>
>> The following patch uses the clocksource clock for a time value in
>> case get_cycles returns 0. As clocksource may not be available
>> during boot time, a flag is introduced which allows random.c to
>> check the availability of clocksource.
>
> I'm a bit concerned about doing things this way because reading the
> clocksource clock might be quite heavyweight, and we need something
> which is very low overhead, since we call get_cycles() on every single
> interrupt. If reading from the clocksource clock is the equivalent of
> an L3 cache miss (or worse), doing this on every single interrupt could
> be highly problematic. So I think we will need to implement a
> random_get_fast_cycles() for each platform for which get_cycles() is
> not available. In some cases we may be able to use the local clock
> source (if that's the best we can do), but in others, that may not be
> appropriate at all.

Good to know it's called from every interrupt.

So the first requirement for random_get_fast_cycles() is that it needs to
be fast. What's most important next: number of bits or high frequency?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2013-09-10 21:10:15

by Theodore Ts'o

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 01:46:21PM -0700, John Stultz wrote:
>
> We do read the clocksource fairly frequently though, so rather than
> re-reading on each interrupt, could you instead re-use the points at
> interrupt time where we already read the clocksource, like in
> hrtimer_interrupt()?

How often is that time updated? Is there some documentation where I
can understand how this works? I confess to being pretty ignorant about
the details of how our timekeeping systems work inside Linux.

Same question for sched_clock(); what are its guarantees, both in
terms of granularity and cost of overhead. Is there any comprehensive
documentation that I should be reading?

Thanks,

- Ted

2013-09-10 21:15:01

by Theodore Ts'o

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 10:48:00PM +0200, Geert Uytterhoeven wrote:
>
> So the first requirement for random_get_fast_cycles() is that it needs to
> be fast. What's most important next: number of bits or high frequency?

High-frequency. For example MIPS has a register which is bumped at
every clock tick, modulo the number of lines in the TLB. That's what
we're probably going to end up using for MIPS, on the assumption that
the time between interrupts is not likely going to be related to the
number of lines in the TLB. :-)
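A sketch of the MIPS idea for concreteness (read_c0_random() is the existing accessor for that register; wiring it up as random_get_fast_cycles() is hypothetical):

#include <asm/mipsregs.h>	/* read_c0_random() */

/*
 * Hypothetical MIPS variant: the CP0 Random register is bumped every
 * cycle modulo the number of TLB entries, so only its low bits carry
 * information -- which is all the per-interrupt sampling needs.
 */
static inline unsigned long random_get_fast_cycles(void)
{
	return read_c0_random();
}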

Something like jiffies has lots of bits, but since it's updated at a
much slower rate, it's not as useful if we are trying to measure
uncertainty based on the interrupt time. (Worse yet, depending on
how the architecture handles the clock, there might be a very high
correlation between when the jiffies counter gets incremented and the
timer interrupt....)

And yes, we will need to make sure this gets well documented in the
sources when we introduce random_get_fast_cycles()....

Cheers,

- Ted

2013-09-10 22:08:16

by John Stultz

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On 09/10/2013 02:10 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 01:46:21PM -0700, John Stultz wrote:
>> We do read the clocksource fairly frequently though, so rather than
>> re-reading on each interrupt, could you instead re-use the points at
>> interrupt time where we already read the clocksource, like in
>> hrtimer_interrupt()?
> How often is that time updated? Is there some documentation where I
> can understand how this works? I confess to being pretty ignorant about
> the details of how our timekeeping systems work inside Linux.

So it's not how often a time is updated, but re-using where we already
read the time. And yea, there's no real documentation as things are
always in transition: i.e. for hrtimer_interrupt, it depends on config and
hardware, but it's every HZ on every cpu, except on HRT systems,
where it can be more frequent depending on how far away the next hrtimer
is set to expire, and with idle NOHZ it may be longer than HZ
on idle cpus. NOHZ_FULL further complicates this and makes it even less
frequent, but I've not followed that work as closely, so Frederic could
probably be of more help here.

One other area you might look at is using the delta between when the
next hrtimer was scheduled for and when we actually expired it? That's
something we could cheaply calculate on every hrtimer expiration. Though
I probably should be hesitant with my suggestions, as I'm not well
versed in RNG theory.
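Roughly, that delta could look like the following, using existing hrtimer/ktime helpers; where the result would actually be mixed in is left open:

#include <linux/hrtimer.h>

/* Hypothetical helper: how late did this hrtimer actually fire, in ns? */
static inline s64 hrtimer_lateness_ns(struct hrtimer *timer, ktime_t now)
{
	return ktime_to_ns(ktime_sub(now, hrtimer_get_expires(timer)));
}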

> Same question for sched_clock(); what are its guarantees, both in
> terms of granularity and cost of overhead. Is there any comprehensive
> documentation that I should be reading?

No. sched_clock has changed as well over the last few years.
sched_clock should be more performant than any of the timekeeping calls,
since it has lower correctness requirements, but it may just be backed
by jiffies in some cases.


thanks
-john

2013-09-10 22:33:32

by Theodore Ts'o

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 03:08:12PM -0700, John Stultz wrote:
> One other area you might look at is using the delta between when the
> next hrtimer was scheduled for and when we actually expired it? That's
> something we could cheaply calculate on every hrtimer expiration. Though
> I probably should be hesitant with my suggestions, as I'm not well
> versed in RNG theory.

What we need is a time source whose granularity is substantially
finer-grained than the rate at which interrupts are delivered to the
system (preferably by a factor of at least 8 or 16). It also needs to be
fast enough (since we will be calling it on every single interrupt)
that the overhead doesn't cause architecture maintainers to break out
their torches and pitchforks and come looking for me. :-)

Reading from a cycle counter is therefore ideal; it doesn't need to be
synchronized across CPUs, and I don't care if it stops ticking
when the system is suspended, and I don't care if the rate at which it
increments is dependent on CPU clock speed getting jacked up and down
by power management systems. (And in fact, if it's going to cost
extra overhead to correct for the CPU being suspended or running at
1.6 GHz instead of 2.8 GHz, that's a bug, not a feature.) We just
need something fine-grained.

The problem is what do we do on platforms that don't have a cycle
counter. Stephan Mueller has proposed using the "best" clocksource as
a default fallback. Which might be OK, but I still remember that
really, REALLY angry customer who discovered that gettimeofday() was
breathtakingly slow on an IBM X440 (at least, when they were calling
it at super high rates).

I just have no idea what various clock sources might do on different
architectures, and if many of the older ones (i.e., sparc 32,
m32, h8, etc.) are just going to fall back to jiffies, I'm not sure
it's worth it.

- Ted

2013-09-11 00:31:49

by John Stultz

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On 09/10/2013 03:33 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 03:08:12PM -0700, John Stultz wrote:
>> One other area you might look at is using the delta between when the
>> next hrtimer was scheduled for and when we actually expired it? That's
>> something we could cheaply calculate on every hrtimer expiration. Though
>> I probably should be hesitant with my suggestions, as I'm not well
>> versed in RNG theory.
> What we need is a time source whose granularity is substantially
> finer-grained than the rate at which interrupts are delivered to the
> system (preferably by a factor of at least 8 or 16). It also needs to be
> fast enough (since we will be calling it on every single interrupt)
> that the overhead doesn't cause architecture maintainers to break out
> their torches and pitchforks and come looking for me. :-)

Yea, that point about "every single interrupt" vs "every timer
interrupt". I suspect that if its every timer interrupt, this can be
done easily w/ clocksources as we already do that read, but every single
interrupt would have potential problems with various other devices with
high irq frequency.



> Reading from a cycle counter is therefore ideal; it doesn't need to be
> synchronized across CPUs, and I don't care if it stops ticking
> when the system is suspended, and I don't care if the rate at which it
> increments is dependent on CPU clock speed getting jacked up and down
> by power management systems. (And in fact, if it's going to cost
> extra overhead to correct for the CPU being suspended or running at
> 1.6 GHz instead of 2.8 GHz, that's a bug, not a feature.) We just
> need something fine-grained.
Right so get_cycles() really sounds like the right thing here.

Although I wonder if you run into issues with counters that wrap
quickly? Or does that not matter?


> The problem is what do we do on platforms that don't have a cycle
> counter. Stephan Mueller has proposed using the "best" clocksource as
> a default fallback. Which might be OK, but I still remember that
> really, REALLY angry customer who discovered that gettimeofday() was
> breathtakingly slow on an IBM X440 (at least, when they were calling
> it at super high rates).
Yea, I'm familiar with that box. :)


> I just have no idea what various clock sources might do on different
> architectures, and if many of the older ones (i.e., sparc 32,
> m32, h8, etc.) are just going to fall back to jiffies, I'm not sure
> it's worth it.

Right. In many cases jiffies is all that we have. Thus the check for
timekeeping_valid_for_hres() can be used to at least flag that case. Not
that I know what you want to do if it comes up false.

thanks
-john

2013-09-11 00:50:56

by Theodore Ts'o

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, Sep 10, 2013 at 05:31:45PM -0700, John Stultz wrote:
> Yea, that point about "every single interrupt" vs "every timer
> interrupt". I suspect that if its every timer interrupt, this can be
> done easily w/ clocksources as we already do that read, but every single
> interrupt would have potential problems with various other devices with
> high irq frequency.

Yep. The worst case would probably be a serial port running at 230400
bps. Even with a 16550A UART with the 16-byte FIFO, we're talking
about close to 2,000 interrupts per second. A 16450 running at 115200
bps would give us 11,520 interrupts per second....

> Right so get_cycles() really sounds like the right thing here.

For those platforms which have them, yes.

> Although I wonder if you run into issues with counters that wrap
> quickly? Or does that not matter?

No, it doesn't matter. It's really the low bits of the counter which
are really important to us anyway. For example on MIPS there is a
register which is bumped on every CPU cycle, but it's kept mod the
number of entries in the TLB (it's used for random TLB eviction).
That's what we'll probably end up using for MIPS, since its cycle
counter is not guaranteed to be there, and had a bug which could cause
timer interrupts to get lost if you tried reading from it at the wrong
time.

> Right. In many cases jiffies is all that we have. Thus the check for
> timekeeping_valid_for_hres() can be used to at least flag that case. Not
> that I know what you want to do if it comes up false.

Right now, if get_cycles() returns 0, we fall back to only mixing in
the IP register from the interrupt plus the jiffies counter. The
platforms where get_cycles returns 0 are:

- MIPS (solution pending)
- User mode Linux
- Sparc 32 bit
- M68K
- M32R
- Hexagon
- H8/300
- FR-V
- CRIS
- AVR32
- ARC
- METAG
- Microblaze
- SCORE
- SH
- Unicore32

Would I be correct in assuming that most of these probably don't have a
high-res clocksource, and so would fall back to jiffies anyway?

Thanks,

- Ted

2013-09-11 01:15:03

by John Stultz

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On 09/10/2013 05:50 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 05:31:45PM -0700, John Stultz wrote:
>> Yea, that point about "every single interrupt" vs "every timer
>> interrupt". I suspect that if its every timer interrupt, this can be
>> done easily w/ clocksources as we already do that read, but every single
>> interrupt would have potential problems with various other devices with
>> high irq frequency.
> Yep. The worst case would probably be a serial port running at 230400
> bps. Even with a 16550A UART with the 16-byte FIFO, we're talking
> about close to 2,000 interrupts per second. A 16450 running at 115200
> bps would give us 11,520 interrupts per second....
>
>> Right so get_cycles() really sounds like the right thing here.
> For those platforms which have them, yes.
>
>> Although I wonder if you run into issues with counters that wrap
>> quickly? Or does that not matter?
> No, it doesn't matter. It's really the low bits of the counter which
> are really important to us anyway. For example on MIPS there is a
> register which is bumped on every CPU cycle, but it's kept mod the
> number of entries in the TLB (it's used for random TLB eviction).
> That's what we'll probably end up using for MIPS, since its cycle
> counter is not guaranteed to be there, and had a bug which could cause
> timer interrupts to get lost if you tried reading from it at the wrong
> time.
>
>> Right. In many cases jiffies is all that we have. Thus the check for
>> timekeeping_valid_for_hres() can be used to at least flag that case. Not
>> that I know what you want to do if it comes up false.
> Right now, if get_cycles() returns 0, we fall back to only mixing in
> the IP register from the interrupt plus the jiffies counter. The
> platforms where get_cycles returns 0 are:
>
> - MIPS (solution pending)
> - User mode Linux
> - Sparc 32 bit
> - M68K
> - M32R
> - Hexagon
> - H8/300
> - FR-V
> - CRIS
> - AVR32
> - ARC
> - METAG
> - Microblaze
> - SCORE
> - SH
> - Unicore32
>
> Would I be correct in assuming that most of these probably don't have a
> high-res clocksource, and so would fall back to jiffies anyway?

So some of them (on specific hardware, likely not as an architecture
feature) do have clocksources that are reasonably highres (m68k, sh,
hexagon). I have no sense of the cost to read them, however. I suspect
we should sync with the maintainers and see if get_cycles can be enabled
on any of these arches, or add an extra flag to the clocksource drivers
to determine if the clocksource is fast enough for get_cycles/sched_clock
usage.
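A sketch of that extra-flag idea follows; everything here is hypothetical, no such flag exists today, and the value would have to be checked against the real CLOCK_SOURCE_* bits:

#include <linux/clocksource.h>

/*
 * Hypothetical flag: the driver promises that reading this clocksource
 * is cheap enough for hot paths (per-interrupt sampling, sched_clock).
 */
#define CLOCK_SOURCE_FAST_READ	0x200

static inline bool clocksource_is_fast_read(const struct clocksource *cs)
{
	return cs && (cs->flags & CLOCK_SOURCE_FAST_READ);
}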

Otherwise yea, I'd avoid trying to read clocksources on every irq.

thanks
-john


2013-09-11 06:50:00

by Stephan Müller

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tuesday, 10 September 2013 at 17:14:54, Theodore Ts'o wrote:

Hi Theodore,

>On Tue, Sep 10, 2013 at 10:48:00PM +0200, Geert Uytterhoeven wrote:
>> So the first requirement for random_get_fast_cycles() is that it needs
>> to be fast. What's most important next: number of bits or
>> high frequency?
>High-frequency. For example MIPS has a register which is bumped at
>every clock tick, modulo the number of lines in the TLB. That's what
>we're probably going to end up using for MIPS, on the assumption that
>the time between interrupts is not likely going to be related to the
>number of lines in the TLB. :-)
>
>Something like jiffies has lots of bits, but since it's updated at a
>much slower rate, it's not as useful if we are trying to measure
>uncertainty based on the interrupt time. (Worse yet, depending on
>how the architecture handles the clock, there might be a very high
>correlation between when the jiffies counter gets incremented and the
>timer interrupt....)
>
>And yes, we will need to make sure this gets well documented in the
>sources when we introduce random_get_fast_cycles()....

As a general hunch on the speed, I would say a simple test is whether two
adjacent calls to obtain a time stamp show a delta. E.g.:

__u64 tmp = random_get_fast_cycles() - random_get_fast_cycles();
if (0 == tmp)
	return fail;
return pass;


>
>Cheers,
>
> - Ted


Ciao
Stephan

2013-09-12 11:59:08

by Geert Uytterhoeven

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Wed, Sep 11, 2013 at 8:49 AM, Stephan Mueller <[email protected]> wrote:
> __u64 tmp = random_get_fast_cycles() - random_get_fast_cycles();
> if(0 == tmp)
> return fail;
> return pass;

That will fail if the cycle counter runs at less than say 4 times the
CPU clock...

BTW, I prefer a different name than "random_get_fast_cycles()", as it's better
to have something that returns different and unpredictable numbers than an
actual monotonic cycle counter.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2013-09-12 12:08:40

by Stephan Müller

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thursday, 12 September 2013 at 13:59:04, Geert Uytterhoeven wrote:

Hi Geert,

>On Wed, Sep 11, 2013 at 8:49 AM, Stephan Mueller <[email protected]>
wrote:
>> __u64 tmp = random_get_fast_cycles() - random_get_fast_cycles();
>> if (0 == tmp)
>> 	return fail;
>> return pass;
>
>That will fail if the cycle counter runs at less than say 4 times the
>CPU clock...

Well...
>
>BTW, I prefer a different name than "random_get_fast_cycles()", as it's
>better to have something that returns different and unpredictable
>numbers than an actual monotonic cycle counter.

A monotonic counter is fully ok. Note, for /dev/random, the occurrence
of events delivers entropy. Thus, we have to be able to precisely
measure that occurrence. The timer itself does not need to deliver any
entropy as long as it is fast.
>
>Gr{oetje,eeting}s,
>
> Geert
>
>--
>Geert Uytterhoeven -- There's lots of Linux beyond ia32 --
>[email protected]
>
>In personal conversations with technical people, I call myself a
>hacker. But when I'm talking to journalists I just say "programmer" or
>something like that. -- Linus Torvalds


Ciao
Stephan

2013-09-12 12:15:37

by Geert Uytterhoeven

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thu, Sep 12, 2013 at 2:08 PM, Stephan Mueller <[email protected]> wrote:
>>BTW, I prefer a different name than "random_get_fast_cycles()", as it's
>>better to have something that returns different and unpredictable
>>numbers than an actual monotonic cycle counter.
>
> A monotonic counter is fully ok. Note, for /dev/random, the occurrence
> of events delivers entropy. Thus, we have to be able to precisely
> measure that occurrence. The timer itself does not need to deliver any
> entropy as long as it is fast.

Well, in my specific case (m68k/Amiga) I can use:
- a 24-bit counter running at only ca. 15 or 31 kHz (actual
frequency may vary),
- a 16-bit counter running at ca. 700 kHz.

That is, if they have to be monotonic cycle counters.

If not, I can mix the two (e.g. "a << 8 | (b & 0xff)") to get a 32-bit value.
That result would be fine for /dev/random, I guess, but it's not
really "get_cycles()".

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2013-09-12 12:36:03

by Stephan Müller

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thursday, 12 September 2013 at 14:15:35, Geert Uytterhoeven wrote:

Hi Geert,

>On Thu, Sep 12, 2013 at 2:08 PM, Stephan Mueller <[email protected]>
wrote:
>>>BTW, I prefer a different name than "random_get_fast_cycles()", as
>>>it's better to have something that returns different and
>>>unpredictable numbers than an actual monotonic cycle counter.
>>>
>> A monotonic counter is fully ok. Note, for /dev/random, the
>> occurrence
>> of events delivers entropy. Thus, we have to be able to precisely
>> measure that occurrence. The timer itself does not need to deliver
>> any
>> entropy as long as it is fast.
>
>Well, in my specific case (m68k/Amiga) I can use:
> - a 24-bit counter running at only ca. 15 or 31 kHz (actual
>frequency may vary),
> - a 16-bit counter running at ca. 700 kHz.
>
>That is, if they have to be monotonic cycle counters.
>
>If not, I can mix the two (e.g. "a << 8 | (b & 0xff)") to get a 32-bit
>value. That result would be fine for /dev/random, I guess, but it's
>not really "get_cycles()".

Note, get_cycles should return an u64.

Not sure what a and b are here, but if a is the 24 bit value and b the
faster 16 bit value, wouldn't there be a gap?

I.e. wouldn't it be better to use the full 16 bit counter as low value
and OR the 24 bit on bits 48 to 17?

Yet, there is a break in that counter: the 16 low bits rotate several
times (around 10 times) before bit 17 is changed once.


>
>Gr{oetje,eeting}s,
>
> Geert
>
>--
>Geert Uytterhoeven -- There's lots of Linux beyond ia32 --
>[email protected]
>
>In personal conversations with technical people, I call myself a
>hacker. But when I'm talking to journalists I just say "programmer" or
>something like that. -- Linus Torvalds


Ciao
Stephan

2013-09-12 12:47:25

by Geert Uytterhoeven

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thu, Sep 12, 2013 at 2:35 PM, Stephan Mueller <[email protected]> wrote:
>>On Thu, Sep 12, 2013 at 2:08 PM, Stephan Mueller <[email protected]>
> wrote:
>>>>BTW, I prefer a different name than "random_get_fast_cycles()", as
>>>>it's better to have something that returns different and
>>>>unpredictable numbers than an actual monotonic cycle counter.
>>>>
>>> A monotonic counter is fully ok. Note, for /dev/random, the
>>> occurrence
>>> of events delivers entropy. Thus, we have to be able to precisely
>>> measure that occurrence. The timer itself does not need to deliver
>>> any
>>> entropy as long as it is fast.
>>
>>Well, in my specific case (m68k/Amiga) I can use:
>> - a 24-bit counter running at only ca. 15 or 31 kHz (actual
>>frequency may vary),
>> - a 16-bit counter running at ca. 700 kHz.
>>
>>That is, if they have to be monotonic cycle counters.
>>
>>If not, I can mix the two (e.g. "a << 8 | (b & 0xff)") to get a 32-bit
>>value. That result would be fine for /dev/random, I guess, but it's
>>not really "get_cycles()".
>
> Note, get_cycles should return an u64.

Currently cycles_t is 64-bit (defined as a 64-bit type, that still doesn't say
anything about the actual values) on blackfin, c6x, cris, ia64, m32r, parisc64,
ppc64, s390x, tile, x86, and xtensa.
On all other architectures cycles_t is 32 bit.

> Not sure what a and b are here, but if a is the 24 bit value and b the
> faster 16 bit value, wouldn't there be a gap?

No, 24 + 8 = 32 (cycles_t is 32-bit on m68k).

> I.e. wouldn't it be better to use the full 16 bit counter as low value
> and OR the 24 bit on bits 48 to 17?
>
> Yet, there is a break in that counter: the 16 low bits rotate several
> times (around 10 times) before bit 17 is changed once.

Sure, but it's also slower, as this will be called for every interrupt, and the
counters (accessed by byte!) are on a slow bus. Think of Ted's L3-cache-miss
story.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2013-09-12 12:57:13

by Stephan Müller

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thursday, 12 September 2013 at 14:47:23, Geert Uytterhoeven wrote:

Hi Geert,

>On Thu, Sep 12, 2013 at 2:35 PM, Stephan Mueller <[email protected]>
wrote:
>>>On Thu, Sep 12, 2013 at 2:08 PM, Stephan Mueller
>>><[email protected]>
>>>
>> wrote:
>>>>>BTW, I prefer a different name than "random_get_fast_cycles()", as
>>>>>it's better to have something that returns different and
>>>>>unpredictable numbers than an actual monotonic cycle counter.
>>>>>
>>>> A monotonic counter is fully ok. Note, for /dev/random, the
>>>> occurrence
>>>> of events delivers entropy. Thus, we have to be able to precisely
>>>> measure that occurrence. The timer itself does not need to deliver
>>>> any
>>>> entropy as long as it is fast.
>>>
>>>Well, in my specific case (m68k/Amiga) I can use:
>>> - a 24-bit counter running at only ca. 15 or 31 kHz (actual
>>>
>>>frequency may vary),
>>>
>>> - a 16-bit counter running at ca. 700 kHz.
>>>
>>>That is, if they have to be monotonic cycle counters.
>>>
>>>If not, I can mix the two (e.g. "a << 8 | (b & 0xff)") to get a
>>>32-bit
>>>value. That result would be fine for /dev/random, I guess, but it's
>>>not really "get_cycles()".
>>>
>> Note, get_cycles should return an u64.
>
>Currently cycles_t is 64-bit (defined as a 64-bit type, that still
>doesn't say anything about the actual values) on blackfin, c6x, cris,
>ia64, m32r, parisc64, ppc64, s390x, tile, x86, and xtensa.
>On all other architectures cycles_t is 32 bit.
>
>> Not sure what a and b are here, but if a is the 24 bit value and b the
>> faster 16 bit value, wouldn't there be a gap?
>
>No, 24 + 8 = 32 (cycles_t is 32-bit on m68k).

With gap I meant a logical break as discussed below.
>
>> I.e. wouldn't it be better to use the full 16 bit counter as low
>> value
>> and OR the 24 bit on bits 48 to 17?
>>
>> Yet, there is a break in that counter: the 16 low bits rotate several
>> times (around 10 times) before bit 17 is changed once.
>
>Sure, but it's also slower, as this will be called for every interrupt,
>and the counters (accessed by byte!) are on a slow bus. Think of Ted's
>L3-cache-miss story.

I see. Well, I think a fast-running 8 bit counter is too little
considering that the maximum entropy an event may have is assumed by
/dev/random to be 11 bits.

Thus, I would not be comfortable with the proposed 8 bit fast counter.
Of course one may say the 31 or 15 kHz counter is still fast, but is it
sufficiently fast that an observer (i.e. somebody outside the kernel)
cannot measure that event with the (11 - 8) = 3 low bits of precision of
that slow counter in the worst case?


>
>Gr{oetje,eeting}s,
>
> Geert
>
>--
>Geert Uytterhoeven -- There's lots of Linux beyond ia32 --
>[email protected]
>
>In personal conversations with technical people, I call myself a
>hacker. But when I'm talking to journalists I just say "programmer" or
>something like that. -- Linus Torvalds


Ciao
Stephan

2013-09-12 14:25:53

by Theodore Ts'o

Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thu, Sep 12, 2013 at 01:59:04PM +0200, Geert Uytterhoeven wrote:
>
> BTW, I prefer a different name than "random_get_fast_cycles()", as it's better
> to have something that returns different and unpredictable numbers than an
> actual monotonic cycle counter.

I'm open to a different name; the point I was trying to make with that
name proposal was to be clear that it had to be very low overhead, and
that the intent of this interface was that it be time dependent in
some fashion (since that is how we are trying to generate the entropy
source). The problem with "time" is that it implies that it is in
units of seconds. I was trying to come up with something that was
more generic, which is why I suggested "random_get_fast_cycles()".

If the CPU has a source which is specified to be completely
unpredictable, we have arch_get_random_long(), which the random driver
uses as well.

I get that there could be other things, such as some kind of DRAM
refresh counter, or the raster scan line if you had an embedded video
chip on something like an Apple II :-) and these things would work too.

But if someone can suggest a different name, sure, let's talk about
it. It may be, though, that the best we can do is to pick a name
which is not too misleading, even if it is not 100% precise. That can
happen sometimes, alas.

Regards,

- Ted

2013-09-12 20:47:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On 09/10/2013 05:50 PM, Theodore Ts'o wrote:
>
> Right now, if get_cycles() returns 0, we fall back to only mixing in
> the IP register from the interrupt plus the jiffies counter. The
> platforms where get_cycles returns 0 is:
>
> - User mode Linux

UML (on x86) should be able to use RDTSC just fine; it might be
meaningless for anything *other* than random though...

-hpa

2013-09-12 22:42:01

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, 10 September 2013 20:50:47 -0400, Theodore Ts'o wrote:
>
> Yep. The worst case would probably be a serial port running at 230400
> bps. Even with a 16550A UART with the 16-byte FIFO, we're talking
> about close to 2,000 interrupts per second. A 16450 running at 115200
> bps would give us 11,520 interrupts per second....

I happen to have a real-world system with >100k interrupts per second
and - surprise - add_interrupt_randomness() showed up prominently in
the profiles. I was also told twice to just remove that call. I
resisted both times and have done far more work to reduce overhead
while still collecting entropy. Some others would have caved in.

My bottom line: we have to make add_interrupt_randomness() cheaper or
else it will get removed in custom kernels. Quite likely it already
has been removed in a number of places.

One option is to add the "input_pool.entropy_count > trickle_thresh"
condition that all other entropy sources currently have. But instead
I would rather rename fast_mix() to not_too_fast_mix() and implement a
real fast_mix(). Essentially just xor the collected numbers into a
pool and schedule something to shuffle the bits at a later point.
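
To make the shape of that proposal concrete, here is a minimal userspace
sketch (the names cheap_collect, deferred_shuffle and POOL_WORDS are
invented for illustration; this is not code from drivers/char/random.c,
nor the patch posted later in this thread):

/*
 * Illustrative sketch only: XOR new samples into a small pool at near-zero
 * cost on the hot path, and defer the real stirring to a rarely-run pass.
 */
#include <stdint.h>
#include <stdio.h>

#define POOL_WORDS 4

struct cheap_pool {
        uint32_t pool[POOL_WORDS];
        unsigned int count;             /* samples since the last shuffle */
};

/* Hot path: one XOR per word, no rotates, no table lookups. */
static void cheap_collect(struct cheap_pool *p, const uint32_t sample[POOL_WORDS])
{
        int i;

        for (i = 0; i < POOL_WORDS; i++)
                p->pool[i] ^= sample[i];
        p->count++;
}

/* Cold path: run rarely (e.g. once per jiffy or every N samples) and do the
 * heavier stirring there; a stand-in rotate-and-XOR pass is used here. */
static void deferred_shuffle(struct cheap_pool *p)
{
        int i;

        for (i = 0; i < POOL_WORDS; i++)
                p->pool[i] = (p->pool[i] << 7 | p->pool[i] >> 25) ^
                             p->pool[(i + 1) % POOL_WORDS];
        p->count = 0;
}

int main(void)
{
        struct cheap_pool p = { { 0 }, 0 };
        uint32_t sample[POOL_WORDS] = { 0, 2, 3, 4 };
        int i;

        for (i = 0; i < 1000; i++) {
                sample[0] = i;                  /* pretend this is a timestamp */
                cheap_collect(&p, sample);
                if (p.count >= 64)              /* rate-limit the real mixing */
                        deferred_shuffle(&p);
        }
        printf("%08x %08x %08x %08x\n",
               p.pool[0], p.pool[1], p.pool[2], p.pool[3]);
        return 0;
}

The trade-off is exactly the one discussed next: the hot path stays at a
few XORs, while all of the mixing quality moves into the deferred pass.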

The point is to make collection of randomness as cheap as possible.
The cheaper it is to collect, the more of it we will collect. And
collecting lots of bad randomness really cheaply may be a better
tradeoff than collecting a bit of good randomness with medium cost.
Hashing will ensure that any real randomness will overcome all
predictable data we may have collected alongside.

Jörn

--
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall

2013-09-12 22:53:39

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thu, 12 September 2013 14:15:35 +0200, Geert Uytterhoeven wrote:
> On Thu, Sep 12, 2013 at 2:08 PM, Stephan Mueller <[email protected]> wrote:
> >>BTW, I prefer a different name than "random_get_fast_cycles()", as it's
> >>better to have something that returns different and unpredictable
> >>numbers than an actual monotonic cycle counter.
> >
> > A monotonic counter is fully ok. Note, for /dev/random, the occurrence
> > of events delivers entropy. Thus, we have to be able to precisely
> > measure that occurrence. The timer itself does not need to deliver any
> > entropy as long as it is fast.
>
> Well, in my specific case (m68k/Amiga) I can use:
> - a 24-bit counter running at only ca. 15 or 31 kHz (actual
> frequency may vary),
> - a 16-bit counter running at ca. 700 kHz.

Assuming the same cost, please use the 700kHz counter. Or both.

Jiffies is a relatively poor choice, as it can be predicted with high
certainty. Most of the time it will be identical to the last value
for jiffies, most of the remaining time it will be off by exactly one.
So on average you don't even get a single unpredictable bit from
jiffies.

A counter that is fast when compared to interrupt rate will give you
relatively many useful bits. A slow counter - no matter how wide -
will have little useful randomness. Ideal is a counter that cannot be
externally derived even with the most expensive measurement kit. So
an unstable clock is actually a bonus. Think high precision and low
reliability.

And if you have to drop bits from the counter, please drop the high
bits, as they are the easily predictable ones.

Jörn

--
I can say that I spend most of my time fixing bugs even if I have lots
of new features to implement in mind, but I give bugs more priority.
-- Andrea Arcangeli, 2000

2013-09-12 23:06:31

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Tue, 10 September 2013 15:08:12 -0700, John Stultz wrote:
> Though
> I probably should be hesitant with my suggestions, as I'm not well
> versed in RNG theory.

The basic principle of Ted's RNG is very simple and quite sane:
- You collect as much data as possible, some of which is (hopefully)
unpredictable.
- All the data gets dumped into a small buffer.
- When reading from the buffer, you create a crypto-hash of the entire
buffer. Even if most of the buffer is predictable, the few
unpredictable bits will randomly flip every output bit.
- Half of the hash gets returned to the reader, the other half gets
added back into the pool.

It doesn't matter if you collect predictable data - it neither helps
nor hurts. But you should collect as much unpredictable data as
possible and do it as cheaply as possible. If you want to improve the
RNG, you either collect more data, collect better (less predictable)
data or make the collection cheaper.

Jörn

--
People really ought to be forced to read their code aloud over the phone.
That would rapidly improve the choice of identifiers.
-- Al Viro

2013-09-12 23:32:12

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thu, Sep 12, 2013 at 05:07:17PM -0400, Jörn Engel wrote:
>
> I happen to have a real-world system with >100k interrupts per second
> and - surprise - add_interrupt_randomness() showed up prominently in
> the profiles. I was also told twice to just remove that call. I
> resisted both times and have done far more work to reduce overhead
> while still collecting entropy. Some others would have caved in.

Would it be possible for you to send me the perf numbers that you saw?

What platform is this? x86? Some embedded processor?

> One option is to add the "input_pool.entropy_count > trickle_thresh"
> condition that all other entropy sources currently have. But instead
> I would rather rename fast_mix() to not_too_fast_mix() and implement a
> real fast_mix(). Essentially just xor the collected numbers into a
> pool and schedule something to shuffle the bits at a later point.

We can try some different things to make fast_mix() faster, but it
would be good to get some hard numbers before we start deciding we
need to do something more complicated.

One thing that comes to mind is that fast_mix() is only called in
exactly one place, and we always pass in a long. So there are
certainly ways that we could optimize fast_mix even keeping the
current mixing algorithm.

Cheers,

- Ted

2013-09-13 01:10:21

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thu, 12 September 2013 19:31:55 -0400, Theodore Ts'o wrote:
> On Thu, Sep 12, 2013 at 05:07:17PM -0400, Jörn Engel wrote:
> >
> > I happen to have a real-world system with >100k interrupts per second
> > and - surprise - add_interrupt_randomness() showed up prominently in
> > the profiles. I was also told twice to just remove that call. I
> > resisted both times and have done far more work to reduce overhead
> > while still collecting entropy. Some others would have caved in.
>
> Would it be possible for you to send me the perf numbers that you saw?

Eventually. The idiot that was me half a year ago failed to attach
perf numbers to the patch description.

> What platform is this? x86? Some embedded processor?

x86. I suspect NAPI significantly cuts down the number of interrupts
for ethernet cards. My case is with FC interrupts. Quite likely
doing something like NAPI would help far more for performance than
disabling add_interrupt_randomness().

> > One option is to add the "input_pool.entropy_count > trickle_thresh"
> > condition that all other entropy sources currently have. But instead
> > I would rather rename fast_mix() to not_too_fast_mix() and implement a
> > real fast_mix(). Essentially just xor the collected numbers into a
> > pool and schedule something to shuffle the bits at a later point.
>
> We can try some different things to make fast_mix() faster, but it
> would be good to get some hard numbers before we start deciding we
> need to do something more complicated.
>
> One thing that comes to mind is that fast_mix() is only called in
> exactly one place, and we always pass in a long. So there are
> certainly ways that we could optimize fast_mix even keeping the
> current mixing algorithm.

I think the existing code is doing just fine for low interrupt loads.
It makes sense to spend a bit more work to squeeze the last bit of
randomness out. But when you get lots of interrupts, you can be
sloppy and just xor things into the pool.

My patch below is going too far by not even doing the xor. I was
stupid and under time pressure. But in my defence,
add_timer_randomness() makes the same mistake.

Jörn

--
Eighty percent of success is showing up.
-- Woody Allen

From ee197e39b9a6c905db870606f5bacab2a52a8da2 Mon Sep 17 00:00:00 2001
From: Joern Engel <[email protected]>
Date: Wed, 13 Feb 2013 10:34:26 -0800
Subject: [PATCH] random: limit overhead of add_interrupt_randomness

fast_mix is noticeably less fast than the name might imply. Add
rate-limiting to it, so we only run it once per jiffie and cpu for the
painful case of a single interrupt hammering a cpu. Instead we do the
dumbest possible mixing - we xor the input with the pool without any
shifting whatsoever. Gathers some randomness at near-zero cost.

Signed-off-by: Joern Engel <[email protected]>
---
drivers/char/random.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index b86eae9..7b7f64e 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -557,6 +557,7 @@ static void mix_pool_bytes(struct entropy_store *r, const void *in,
struct fast_pool {
__u32 pool[4];
unsigned long last;
+ unsigned long last_jiffies;
unsigned short count;
unsigned char rotate;
unsigned char last_timer_intr;
@@ -760,6 +761,17 @@ void add_interrupt_randomness(int irq, int irq_flags)
input[3] = ip >> 32;
}

+ /*
+ * Even fast_mix is slow when dealing with 6-digit interrupt
+ * rates. Rate-limit this to once per jiffie. If we get lots
+ * of interrupts, this still generates 1.6 bits of entropy per
+ * second and cpu. If we get few interrupts, it shouldn't
+ * substantially change the entropy collection.
+ */
+ if (fast_pool->last_jiffies == jiffies)
+ return;
+ fast_pool->last_jiffies = jiffies;
+
fast_mix(fast_pool, input, sizeof(input));

if ((fast_pool->count & 1023) &&
--
1.7.10.4

2013-09-13 01:34:53

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thu, 12 September 2013 19:35:36 -0400, Jörn Engel wrote:
>
> I think the existing code is doing just fine for low interrupt loads.
> It makes sense to spend a bit more work to squeeze the last bit of
> randomness out. But when you get lots of interrupts, you can be
> sloppy and just xor things into the pool.

Btw, if we make collection cheap enough, we can start collecting
entropy from the scheduler. Computers are fairly deterministic, but
not that much. The exact time when calling schedule(), the kernel
stack pointer and the userspace stack pointer all contain a bit of
entropy. Particularly on machines that lack input and disk
randomness I would expect some benefits from this.

Jörn

--
Audacity augments courage; hesitation, fear.
-- Publilius Syrus

2013-09-13 05:36:34

by Stephan Müller

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thursday, 12 September 2013, 17:31:48, Jörn Engel wrote:

Hi Jörn,

>On Tue, 10 September 2013 15:08:12 -0700, John Stultz wrote:
>> Though
>> I probably should be hesitant with my suggestions, as I'm not well
>> versed in RNG theory.
>
>The basic principle of Ted's RNG is very simple and quite sane:
>- You collect as much data as possible, some of which is (hopefully)
> unpredictable.
>- All the data gets dumped into a small buffer.
>- When reading from the buffer, you create a crypto-hash of the entire
> buffer. Even if most of the buffer is predictable, the few
> unpredictable bits will randomly flip every output bit.

And here the RNG theory breaks down: a whitening function (a crypto
function) like the SHA-1 used here does not add entropy. Thus, the SHA-1
just spreads the entropy evenly over the output buffer. Entropy can be
considered a kind of percentage value: if, say, 10% of your input buffer
holds entropy, then after applying a whitening function your output
buffer still holds only 10% entropy.

That said, simply using a whitening function on a buffer with poor
entropy is NOT going to cut it.

>- Half of the hash gets returned to the reader, the other half gets
> added back into the pool.
>
>It doesn't matter if you collect predictable data - it neither helps

Oh yes, it hurts if you update the entropy estimator on those
predictable bits, because then you get a deterministic RNG like
/dev/urandom in the worst case. Thus you degrade the quality of
/dev/random, which relies on its blocking nature.

>nor hurts. But you should collect as much unpredictable data as
>possible and do it as cheaply as possible. If you want to improve the
>RNG, you either collect more data, collect better (less predictable)
>data or make the collection cheaper.
>
>Jörn
>
>--
>People really ought to be forced to read their code aloud over the
>phone. That would rapidly improve the choice of identifiers.
>-- Al Viro


Ciao
Stephan

2013-09-13 11:34:11

by Thorsten Glaser

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

Stephan Mueller <smueller <at> chronox.de> writes:

> A monotonic counter is fully ok. Note, for /dev/random, the occurrence
> of events delivers entropy. Thus, we have to be able to precisely
> measure that occurrence. The timer itself does not need to deliver any
> entropy as long as it is fast.

Rather the other way round… the get_cycles() value is XORed with jiffies
in drivers/char/random.c in one instance, and used alongside it in the
other, so it should NOT be derived from jiffies or the clocksource.

I think the focus on it being high-frequency enough to return a different
result every call is also too strict: better have a 16-bit 700 kHz counter
than none at all (plus, chances are very good it’s increased between calls
for drivers/char/random.c at least).

Geert’s question was probably about the requirement to be monotonic…
other callers do do things like & 0xFF but the random code doesn’t,
so something like using the 16 bits of the fast counter and filling the
slower counter into the upper bits would work (see the sketch below)…
but this is a trade-off against interrupt speed. I can’t judge whether
it’s better to only use the fast 16-bit counter, considering bus speeds
and interrupt counts. We do have jiffies too and use get_cycles() only
to introduce more uncertainty, so it may very well be enough…
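
As a concrete illustration of that layout -- fast 16-bit counter in the
low bits, slow 24-bit counter above it -- here is a hedged sketch;
read_fast_16(), read_slow_24() and combined_cycles() are hypothetical
placeholders, not real m68k driver calls, and the stub bodies only
simulate counters so the example compiles:

#include <stdint.h>
#include <stdio.h>

static uint16_t read_fast_16(void)      /* stand-in for the ~700 kHz counter */
{
        static uint16_t t;
        return t += 7;
}

static uint32_t read_slow_24(void)      /* stand-in for the ~15-31 kHz counter */
{
        static uint32_t t;
        return ++t & 0xffffff;          /* only 24 bits are significant */
}

/* Fast counter in bits 0-15, slow counter in bits 16-39: the least
 * predictable bits stay in the low positions. */
static uint64_t combined_cycles(void)
{
        uint64_t slow = read_slow_24();
        uint64_t fast = read_fast_16();

        return (slow << 16) | fast;     /* 40 significant bits in total */
}

int main(void)
{
        int i;

        for (i = 0; i < 4; i++)
                printf("0x%010llx\n", (unsigned long long)combined_cycles());
        return 0;
}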

[OT] Oh and for all who have not yet read e.g. Fefe today:
http://people.umass.edu/gbecker/BeckerChes13.pdf
Basically, RDRAND must not be used except to mix it into the pool,
now confirmed. (Kudos to Theodore and Matt for always insisting on this.)

bye,
//mirabilos

2013-09-13 11:55:10

by Thorsten Glaser

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

Stephan Mueller <smueller <at> chronox.de> writes:

> And here the RNG theory breaks: a whitening function (crypto function)
> like the used SHA1 does not add entropy. Thus, the SHA1 just spreads out
> the entropy evenly over the output buffer. As entropy can be considered

That’s why you also use a (faster, less powerful) mixing function
on input.

I was wondering: why not use Jenkins’ one-at-a-time hash there?
I’ve worked with it a bit, and it has good avalanche behaviour;
I didn’t measure any cycles though.

Basically, let h be a uint32_t register (usually loaded from a
memory location and stored to one afterwards); then you do:

(h) += (uint8_t)(b);
#ifdef with_mirabilos_changes
++(h);
#endif
(h) += (h) << 10;
(h) ^= (h) >> 6;

I’m adding 1 after the first step (adding the byte) in order
to make even NUL bytes make a difference.

I’m toying with code that has 32 such uint32_t values, making
for a total of 128 Bytes of state, and when asked to add some
memory content into that pool, just distribute the bytes over
the values array, incrementing a counter.

(One could probably do something with cmpxchg to make that
lockless, but I’m not doing any parallel programming myself.)

Then, every once in a while, you’d run the finish function
on every value, which is:

#ifdef with_mirabilos_changes
(h) += (h) << 10;
(h) ^= (h) >> 6;
#endif
(h) += (h) << 3;
(h) ^= (h) >> 11;
(h) += (h) << 15;

My change here is because Jenkins’ OAAT has good avalanche
except for the last input byte, so I just fake adding NUL
(which doesn’t get avalanched so well, but doesn’t matter)
to get the last actual data byte avalanched.

Then I write those finalised hash values back to the
128-byte buffer, and then I add that into the main pool using
a cryptographic (hash, or stream cipher rekey) function.
(In my case, arc4random, if someone wonders.)
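
For reference, a small self-contained sketch of the scheme described
above: Jenkins' one-at-a-time hash with the two modifications mentioned
(the extra increment per byte and the extra add/shift round before
finalisation). The function names are invented for this example; this is
not code from random.c nor from mirabilos' actual implementation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Per-byte step, with the extra increment so NUL bytes still change state. */
static void oaat_add_byte(uint32_t *h, uint8_t b)
{
        *h += b;
        ++*h;                   /* modification: make NUL bytes count */
        *h += *h << 10;
        *h ^= *h >> 6;
}

/* Finalisation, with one extra add/shift round up front so the last real
 * input byte gets fully avalanched, as described above. */
static uint32_t oaat_finish(uint32_t h)
{
        h += h << 10;           /* modification: fake one trailing NUL byte */
        h ^= h >> 6;
        h += h << 3;
        h ^= h >> 11;
        h += h << 15;
        return h;
}

int main(void)
{
        const char *msg = "add_interrupt_randomness";
        uint32_t h = 0;
        size_t i;

        for (i = 0; i < strlen(msg); i++)
                oaat_add_byte(&h, (uint8_t)msg[i]);
        printf("0x%08x\n", oaat_finish(h));
        return 0;
}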

> >It doesn't matter if you collect predictable data - it neither helps
>
> Oh yes, it hurts, if you update the entropy estimator on those
> predictable bits. Because then you get a deterministic RNG like

I’ve seen Theodore apply exponential backoff to any estimation,
which is probably a good thing. But yes, you probably will want
to guess the entropy of the bytes added and not count things
you’re not too sure of. (In the interrupt case, we have
jiffies, cycles and num, so we’d probably best estimate how
often interrupts are called, base a number of “timing” bits
on that, and account for num; this may very well be less than
8 bit “credited” for a long and two u_int added to the pool,
but as long as you get that right and don’t blindly credit
a byte as 8 bit, you’re safe. To quote the author of RANDOM.SYS
for DOS: “Every bit counts.” It adds uncertainty to the pool,
at least by stirring around the already-existing entropy without
adding any new (creditable) entropy.)

bye,
//mirabilos

2013-09-13 17:01:18

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Fri, 13 September 2013 07:36:20 +0200, Stephan Mueller wrote:
> On Thursday, 12 September 2013, 17:31:48, Jörn Engel wrote:
> >
> >The basic principle of Ted's RNG is very simple and quite sane:
> >- You collect as much data as possible, some of which is (hopefully)
> > unpredictable.
> >- All the data gets dumped into a small buffer.
> >- When reading from the buffer, you create a crypto-hash of the entire
> > buffer. Even if most of the buffer is predictable, the few
> > unpredictable bits will randomly flip every output bit.
>
> And here the RNG theory breaks: a whitening function (crypto function)
> like the used SHA1 does not add entropy. Thus, the SHA1 just spreads out
> the entropy evenly over the output buffer. As entropy can be considered
> as a kind of percentage value, if you have, say, 10% of your input
> buffer holding entropy, applying a whitening function, you output buffer
> still holds 10% of entropy only.

Agreed. An extreme example was the Debian SSH keys. You couldn't
predict any particular bit of an SSH key, but the entire keyspace was
just 15 bits and easily enumerated.

The crucial bit of the RNG is that no one knows the entire pool state
at any time. If you have a rootkit that controls the kernel, you have
lost. If root is not trustworthy, you have lost. If you are running
on a VM and your hypervisor is not trustworthy, you have lost. If you
don't control physical access to the machine, you have lost - at least
in the most paranoid scenario.

Assuming an unknown state of the pool and a strong crypto-hash it is
hard to argue how entropy is ever removed from the pool, no matter how
often you read from it. A strong crypto-hash, by definition, makes it
impossible to reconstruct the input bits from the output by any means
easier than brute force. Brute-forcing a 160-bit hash with 80 unknown
bits is infeasible, so the input bits are completely unguessable by a
consumer of /dev/[u]random. So if there ever were, say, 80 bits of
entropy in the pool, the random data will remain good forever.

The realistic attacks are against the two assumptions. SHA-1 is showing the
first signs of defeat. We likely have a lot of margin left in it and Ted's
decision to only reveal half the bits while folding the other half
back into the pool means the hash remains usable for /dev/[u]random
even after it has been broken for other purposes. But it is
conceivable that we will have to change the hash again.

The most realistic attack by far is to simply read out the state of
the pool. So you have to secure your machine in the usual ways and
not trust random generators on machines you don't trust.

> >- Half of the hash gets returned to the reader, the other half gets
> > added back into the pool.
> >
> >It doesn't matter if you collect predictable data - it neither helps
>
> Oh yes, it hurts, if you update the entropy estimator on those
> predictable bits. Because then you get a deterministic RNG like
> /dev/urandom in the worst case. Thus you degrade the quality of
> /dev/random which relies on the blocking nature.

You have to have a conservative estimate of entropy, agreed. And in
many cases the only sane estimate is zero.

Jörn

--
Fancy algorithms are buggier than simple ones, and they're much harder
to implement. Use simple algorithms as well as simple data structures.
-- Rob Pike

2013-09-13 18:59:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Fri, Sep 13, 2013 at 07:36:20AM +0200, Stephan Mueller wrote:
>
> And here the RNG theory breaks: a whitening function (crypto function)
> like the used SHA1 does not add entropy. Thus, the SHA1 just spreads out
> the entropy evenly over the output buffer. As entropy can be considered
> as a kind of percentage value, if you have, say, 10% of your input
> buffer holding entropy, applying a whitening function, you output buffer
> still holds 10% of entropy only.

Well..... it's a little bit more complicated than that. What you're
saying is absolutely true for normal entropy that we might gather.
For example, if a disk read might take 11, 12, or 13 milliseconds
depending on the chaotic air currents affecting how much time it
takes to do a read, you have at best about 1.5 bits of entropy
(log2(3) ≈ 1.58). So if the
NSA knows that during a standard boot sequence, you will be doing a
certain precise set of reads, it can try all of the different
possibilities of 11/12/13 ms and try to brute force what might be
generated by the RNG.

This is the source of my recommendation that embedded devices should
wait as long as possible before generating things like ssh host keys.
The longer you wait, more uncertainty will be added to the entropy
pool, and the harder it will be for an attacker to try to brute-force
attack the /dev/random output.

However, if you are worried about a malicious entropy source, things
are a little bit different. Suppose RDRAND == AES(i++, NSA_KEY),
where the NSA doesn't know the starting value of i. But if it can
get a raw RDRAND value (say, someone uses it without doing any
whitening as a session key or as a D-H parameter), it can decrypt the
output using the NSA_KEY, and now that it knows i, it can brute
force break the RDRAND output, even if it's not entirely sure how many
times RDRAND has been called between that clean RDRAND value and the
RDRAND output it is trying to break.

In *this* case, smearing out the value of RDRAND across the entropy
pool does help, because it makes it significantly harder to get a
clean RDRAND value to decrypt.


That being said, the much bigger problem that I'm worried about is not
necessarily a trojan'ed RDRAND, but rather on embedded ARM and MIPS
devices where we have insufficient entropy, and on the first boot out
of the box, there is no random seed file that can be mixed in at boot
time. Mixing in personalization information (serial numbers, MAC
addresses) which *hopefully* the NSA wouldn't know in the case of
pervasive, bulk surveillance, is a bit of a help. But it's certainly
no help against a direct, targeted attack.

Regards,

- Ted

2013-09-13 19:29:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Fri, Sep 13, 2013 at 11:54:40AM +0000, Thorsten Glaser wrote:
>
> That’s why you also use a (faster, less powerful) mixing function
> on input.
>
> I was wondering: why not use Jenkins’ one-at-a-time hash there?
> I’ve worked with it a bit, and it has good avalanche behaviour;
> I didn’t measure any cycles though.

One of the primary requirements for the mixing function is that it
should not lose entropy in the course of doing the mixing. For
example, if someone does the equivalent of "dd if=/dev/zero
of=/dev/random", and mixes in a whole series of zero values, we should
not lose any information. That's another way of saying that the
mixing function:

POOL' = MIX(POOL, entropy_input)

must be reversible. This guarantees that:

POOL' = MIX(POOL, 0)

will not lose any information.
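
To see the property concretely, here is a toy one-word mix of the same
general shape (rotate the pool, XOR the input in). Because the step can
be inverted exactly, it is a bijection on the pool state, so mixing in
known -- or all-zero -- input cannot destroy any unknown bits. This is
only an illustration of reversibility, not the actual mix_pool_bytes()
code; the names are invented.

#include <assert.h>
#include <stdint.h>

static uint32_t rol32(uint32_t w, unsigned s) { return (w << s) | (w >> (32 - s)); }
static uint32_t ror32(uint32_t w, unsigned s) { return (w >> s) | (w << (32 - s)); }

/* POOL' = MIX(POOL, input): rotate the pool word, XOR the input in. */
static uint32_t toy_mix(uint32_t pool, uint32_t input)
{
        return rol32(pool, 7) ^ input;
}

/* Exact inverse: given POOL' and the input, the old POOL is recovered, so
 * the mix loses no information -- in particular MIX(POOL, 0) keeps every
 * unknown bit of POOL. */
static uint32_t toy_unmix(uint32_t pool2, uint32_t input)
{
        return ror32(pool2 ^ input, 7);
}

int main(void)
{
        uint32_t pool = 0xdeadbeef;

        assert(toy_unmix(toy_mix(pool, 0), 0) == pool);
        assert(toy_unmix(toy_mix(pool, 0x12345678), 0x12345678) == pool);
        return 0;
}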

The second requirement of the mixing function is that we want to make
sure any entropy we do have is distributed evenly across the pool.
This must be true even if the input is highly structured (e.g., if we
are feeding timestamps into the entropy pool, where the high bits are
always the same and the entropy is in the low bits); successive
timestamps should not just affect certain bits in the pool.

This is why using a simple XOR into the pool is not a good idea, and
why we rotate the input byte by different amounts during each mixing
operation.

However, when we extract information from the pool, this is where we
want to make sure the pool gets perturbed in such a way that if the
attacker has partial knowledge of the pool, that knowledge gets harder
to use very quickly. And here the paper "The Linux Pseudorandom
Number Generator Revisited", by Lacharme, Röck, Strubel, and Videau,
has some good points that I am seriously considering. I am still
thinking about it, but what I am considering is doing more
mixing at each extract operation, so that there is more
avalanche-style stirring happening each time we extract information
from the pool. Right now we mix in a SHA hash at each extract
operation; but if we call mix_pool_bytes() with a zero buffer, then we
will smear that SHA hash across more of the pool, which makes it
harder for an attacker to try to reverse engineer the pool while
having partial knowledge of an earlier state of the pool.

> > Oh yes, it hurts, if you update the entropy estimator on those
> > predictable bits. Because then you get a deterministic RNG like
>
> I’ve seen Theodore apply exponential backoff to any estimation,
> which is probably a good thing. But yes, you probably will want
> to guess the entropy of the bytes added and not count things
> where you’re not too sure of. (In the interrupt case, we have
> jiffies, cycles and num, so we’d probably best estimate how
> often interrupts are called, base a number of “timing” bits
> on that, and account for num; this may very well be less than
> 8 bit “credited” for a long and two u_int added to the pool,
> but as long as you get that right and don’t blindly credit
> a byte as 8 bit, you’re safe. To quote the author of RANDOM.SYS
> for DOS: “Every bit counts.” It adds uncertainty to the pool,
> at least by stirring around the already-existing entropy without
> adding any new (creditable) entropy.)

So let me be blunt. The entropy estimator we have sucks. It's pretty
much impossible to automatically estimate, given an arbitrary input
stream, how much entropy is present. The best example I can give is
the aforementioned:

AES_ENCRYPT(i++, NSA_KEY)

Let's assume the starting value of i is an unknown 32-bit number. The
resulting stream will pass all statistical randomness tests, but no
matter how many times you turn the crank and generate more "random"
numbers, the amount of entropy in that stream is only 32 bits --- at
least to someone who knows the NSA_KEY. For someone who knows that
this is how the input stream was generated, but not the 256-bit
NSA_KEY, the amount of entropy in the stream would be 32 + 256 = 288
bits. And this is true even if you generate several gigabytes' worth
of numbers using the above CRNG.

Now, if you know that the input stream is a series of timestamps, as
we do in add_timer_randomness(), we can do something slightly better
by looking at the deltas between successive timestamps, but if there
is a strong 50 or 60 Hz component to the timestamp values, perhaps due
to how the disk drive motor is powered from the AC mains, a simple
delta estimator is not going to catch this.
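
For reference, a "simple delta estimator" of the kind meant here looks
roughly like the following -- a hedged sketch in the spirit of
add_timer_randomness(), not the exact kernel code, with invented names.
It credits at most 11 bits per event based on the smallest of the first-,
second- and third-order deltas; as noted above, a strong 50/60 Hz
component keeps those deltas large and regular, so it would still be
over-credited.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct delta_state {
        int64_t last_time;
        int64_t last_delta;
        int64_t last_delta2;
};

/* Credit at most 11 bits per event, based on the smallest of the first-,
 * second- and third-order deltas of the timestamp. */
static int estimate_entropy_bits(struct delta_state *s, int64_t now)
{
        int64_t delta  = now - s->last_time;
        int64_t delta2 = delta - s->last_delta;
        int64_t delta3 = delta2 - s->last_delta2;
        int64_t m;
        int bits = 0;

        s->last_time = now;
        s->last_delta = delta;
        s->last_delta2 = delta2;

        m = llabs(delta);
        if (llabs(delta2) < m)
                m = llabs(delta2);
        if (llabs(delta3) < m)
                m = llabs(delta3);

        while (m > 1 && bits < 11) {    /* roughly log2 of the smallest delta */
                m >>= 1;
                bits++;
        }
        return bits;
}

int main(void)
{
        struct delta_state s = { 0, 0, 0 };
        int64_t t = 0;
        int i;

        for (i = 0; i < 5; i++) {
                t += 1000 + (rand() % 13);      /* jittery fake timestamps */
                printf("credited %d bits\n", estimate_entropy_bits(&s, t));
        }
        return 0;
}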

So the only really solid way we can get a strong entropy estimation is
by knowing the details of the entropy source, and how it might fail to
be random (i.e., if it is a microphone hooked up to gather room noise,
there might be some regular frequency patterns generated by the HVAC
equipment; so examining the input in the frequency domain and looking
for any spikes might be a good idea).

The one saving grace in all of this is that there is plenty of
unpredictable information which we mix in but for which we give no
entropy credit whatsoever. But please don't assume that just because
you can read 8192 bits out of /dev/random, you are guaranteed to get
8192 bits of pure "true random" bits. For one thing, the input pool
is only 4096 bits, plus a 1024-bit output pool. Even if both of the
pools are filled with pure unpredictable bits, that's only 5120 bits.
Sure, as you extract those bits, some more entropy will trickle in,
but it won't be a full 8192 bits in all likelihood.

At the end of the day, there is no real replacement for a real HWRNG.
And I've never had any illusions that the random driver could be a
replacement for a real HWRNG. The problem, though, is that most
HWRNGs can't be audited, because they are not open, and most users
aren't going to be able to grab a wirewrap gun and make their own ---
and even if they did, it's likely they would screw up in some
embarrassing way. Really, the best you can do is hopefully have
multiple sources of entropy --- RDRAND, plus the random number generator
in the TPM, etc. --- and hope that mixing all of this plus some OS-level
entropy is enough to frustrate the attacker enough that it's no longer
the easiest way to compromise your security.

Regards,

- Ted

2013-09-15 11:12:26

by Stephan Müller

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Friday, 13 September 2013, 14:59:31, Theodore Ts'o wrote:

Hi Theodore,

>On Fri, Sep 13, 2013 at 07:36:20AM +0200, Stephan Mueller wrote:
>
>However, if you are worried about a malicious entropy source, things
>are a little bit different. Suppose RDRAND == AES(i++, NSA_KEY),
>where the NSA doesn't know the starting value of i. But if it get can
>get a raw RDRAND value (say, someone uses it without doing any
>whitening as a session key or as a D-H parameter), it can decrypt the
>output using the NSA_KEY, and then now that it knows i, it can brute
>force break the RDRAND output, even if it's not entirely sure how many
>times RDRAND has been called between that cleanb RDRAND value and the
>RDRAND output it is trying to break.
>
>In *this* case, smearing out the value of RDRAND across the entropy
>pool does help, becuase it makes it significantly harder to get a
>clean RDRAND value to decrypt.

Agreed. I was only talking about "well-behaved" entropy sources.
>
>
>That being said, the much bigger problem that I'm worried about is not
>necessarily a trojan'ed RDRAND, but rather on embedded ARM and MIPS
>devices where we have unsufficient entropy, and on the first boot out
>of the box, there is no random seed file that can be fixed in at boot

Yes, my local MIPS-based router, which is a very ubiquitous one in
Germany (Fritz Box), does not seed /dev/random, yet starts using
/dev/urandom during the boot cycle.

>time. Mixing in personalization information (serial numbers, MAC
>addresses) which *hopefully* the NSA wouldn't know in the case of
>pervasive, bulk surveillance, is a bit of a help. But it's certainly
>no help against a direct, targetted attack.

That is why I am hoping that the CPU jitter harvested by my RNG will
help, as it delivers entropy on demand at a bandwidth where the initial
seeding may not be needed any more and we have sufficient entropy during
the boot sequence.
>
>Regards,
>
> - Ted


Ciao
Stephan

2013-09-16 17:15:24

by Jörn Engel

[permalink] [raw]
Subject: [PATCH,RFC] random: make fast_mix() honor its name

On Thu, 12 September 2013 19:31:55 -0400, Theodore Ts'o wrote:
> On Thu, Sep 12, 2013 at 05:07:17PM -0400, Jörn Engel wrote:
> >
> > I happen to have a real-world system with >100k interrupts per second
> > and - surprise - add_interrupt_randomness() showed up prominently in
> > the profiles. I was also told twice to just remove that call. I
> > resisted both times and have done far more work to reduce overhead
> > while still collecting entropy. Some others would have caved in.
>
> Would it be possible for you to send me the perf numbers that you saw?

Here is a patch to make add_interrupt_randomness() significantly
cheaper without significantly impacting the quality. The second part
is my personal opinion and others might disagree.

So far this has only seen userspace performance testing, so don't
merge it in a hurry.

Jörn

--
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike


Doing a quick benchmark on my notebook, fast_mix() turned out to be
barely faster than the full mixing operation. Doing 10M in a loop took
2.1s for the old fast_mix(), 2.5s for _mix_pool_bytes() and .12s for
my new fast_mix(). Making fast_mix() 16x less expensive gives people
less incentive to disable entropy collection.

The new fast_mix has a lower quality. It simply rotates the entire pool
left by 39 bits and xor's the new entropy into the pool. I think this
is good enough - if every interrupt collects just one bit of truly
unpredictable data and that bit is always in the same position, we will
accumulate 64 bits of entropy over the 64 rounds. Given two bits in the
likely position - the lowest two bits of any one word - all 128 bits of
the pool are unpredictable. If we wanted to collect more randomness, we
would have to either enlarge the pool or empty it more than once every
64 interrupts.

So I think the mixing quality is good enough. If someone is concerned,
we can add a multiplication, which will fairly quickly mix every input
bit into every pool bit, at a 20% performance cost. But really I would
rather collect from more entropy sources than improve mixing quality.

Signed-off-by: Joern Engel <[email protected]>
---
drivers/char/random.c | 25 ++++++++++---------------
1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 32a6c57..36ef6e1 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -555,7 +555,6 @@ struct fast_pool {
__u32 pool[4];
unsigned long last;
unsigned short count;
- unsigned char rotate;
unsigned char last_timer_intr;
};

@@ -564,21 +563,17 @@ struct fast_pool {
* collector. It's hardcoded for an 128 bit pool and assumes that any
* locks that might be needed are taken by the caller.
*/
-static void fast_mix(struct fast_pool *f, const void *in, int nbytes)
+static void fast_mix(struct fast_pool *f, __u32 input[4])
{
- const char *bytes = in;
- __u32 w;
- unsigned i = f->count;
- unsigned input_rotate = f->rotate;
+ int i;
+ __u32 acc, carry = f->pool[3] >> 25;

- while (nbytes--) {
- w = rol32(*bytes++, input_rotate & 31) ^ f->pool[i & 3] ^
- f->pool[(i + 1) & 3];
- f->pool[i & 3] = (w >> 3) ^ twist_table[w & 7];
- input_rotate += (i++ & 3) ? 7 : 14;
+ for (i = 0; i < 3; i++) {
+ acc = (f->pool[i] << 7) ^ input[i] ^ carry;
+ carry = f->pool[i] >> 25;
+ f->pool[i] = acc;
}
- f->count = i;
- f->rotate = input_rotate;
+ f->count++;
}

/*
@@ -757,9 +752,9 @@ void add_interrupt_randomness(int irq, int irq_flags)
input[3] = ip >> 32;
}

- fast_mix(fast_pool, input, sizeof(input));
+ fast_mix(fast_pool, input);

- if ((fast_pool->count & 1023) &&
+ if ((fast_pool->count & 63) &&
!time_after(now, fast_pool->last + HZ))
return;

--
1.7.10.4

2013-09-21 21:32:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

On Mon, Sep 16, 2013 at 11:40:27AM -0400, Jörn Engel wrote:
>
> Here is a patch to make add_interrupt_randomness() significantly
> cheaper without significantly impacting the quality. The second part
> is my personal opinion and others might disagree.
>
> So far this has only seen userspace performance testing, so don't
> merge it in a hurry.

Performance testing, but your new fast_mix pool has some serious
problems in terms of how well it works.

First of all, I think this is a typo:

> + for (i = 0; i < 3; i++) {

This means that pool[3] is always 0, and the input[3] is never mixed
in. Hence, the pool is now only 96 bits instead of 128 bits, and the
last 4 bytes of the input are not getting mixed in.

Secondly, if you change this so that we actually use the whole pool,
and you mix in a constant set of inputs like this:

for (i = 0; i < 100; i++) {
input[0] = i;
input[1] = i;
input[2] = i;
input[3] = i;
fast_mix(&pool, input);
print_pool(&pool);
}

you see something like this:

pool:
0x00000000
0x00000000
0x00000000
0x00000000
pool:
0x00000001
0x00000001
0x00000001
0x00000001
pool:
0x00000082
0x00000082
0x00000082
0x00000082
pool:
0x00004103
0x00004103
0x00004103
0x00004103
pool:
0x00208184
0x00208184
0x00208184
0x00208184
pool:
0x1040c205
0x1040c205
0x1040c205
0x1040c205

Granted, it's unlikely that we will be mixing numbers like this, but
it's a great demonstration of why I added the input_rotate aspect to
my mixing functions.

See my testing program below. I need to think a bit more about
whether I'm comfortable with the new fast_mix that you've proposed
with the input_rotate restored, but at the minimum I think the rotate
is needed.

Regards,

- Ted


#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

/* #define ORIG_MIX */
#define ADD_ROTATE
/* #define DEBUG_POOL */
#ifdef DEBUG_POOL
#define NUM_LOOPS 32
#else
#define NUM_LOOPS 1000000
#endif

typedef unsigned int __u32;

struct fast_pool {
__u32 pool[4];
unsigned long last;
unsigned short count;
unsigned char rotate;
unsigned char last_timer_intr;
};

static __u32 const twist_table[8] = {
0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };

/**
* rol32 - rotate a 32-bit value left
* @word: value to rotate
* @shift: bits to roll
*/
static inline __u32 rol32(__u32 word, unsigned int shift)
{
return (word << shift) | (word >> (32 - shift));
}

#ifdef ORIG_MIX
/*
* This is a fast mixing routine used by the interrupt randomness
* collector. It's hardcoded for an 128 bit pool and assumes that any
* locks that might be needed are taken by the caller.
*/
static void fast_mix(struct fast_pool *f, const void *in, int nbytes)
{
const char *bytes = in;
__u32 w;
unsigned i = f->count;
unsigned input_rotate = f->rotate;

while (nbytes--) {
w = rol32(*bytes++, input_rotate & 31) ^ f->pool[i & 3] ^
f->pool[(i + 1) & 3];
f->pool[i & 3] = (w >> 3) ^ twist_table[w & 7];
input_rotate += (i++ & 3) ? 7 : 14;
}
f->count = i;
f->rotate = input_rotate;
}
#else
static void fast_mix(struct fast_pool *f, __u32 input[4])
{
int i = f->count;
__u32 acc, carry = f->pool[3] >> 25;
unsigned input_rotate = f->rotate;

for (i = 0; i < 4; i++) {
#ifdef ADD_ROTATE
acc = (f->pool[i] << 7) ^ rol32(input[i], input_rotate & 31) ^
carry;
#else
acc = (f->pool[i] << 7) ^ input[i] ^ carry;
#endif

carry = f->pool[i] >> 25;
f->pool[i] = acc;
input_rotate += 7;
}
f->count++;
f->rotate = input_rotate;
}
#endif

void print_pool(struct fast_pool *pool)
{
int i;

printf("pool:\n");
for (i = 0; i < 4; i++) {
printf("\t0x%08x\n", pool->pool[i]);
}
}


int main(int argc, char **argv)
{
struct fast_pool pool;
int i;
unsigned int input[4];

memset(&pool, 0, sizeof(struct fast_pool));
srandom(42);

for (i = 0; i < NUM_LOOPS; i++) {
input[0] = i;
input[1] = i;
input[2] = i;
input[3] = i;
#ifdef ORIG_MIX
fast_mix(&pool, &input, sizeof(input));
#else
fast_mix(&pool, input);
#endif
#ifdef DEBUG_POOL
print_pool(&pool);
#endif
}
#ifndef DEBUG_POOL
print_pool(&pool);
#endif
return 0;
}



2013-09-21 21:41:33

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

BTW, just to give another example of the difference between the mixing
functions, try compiling the following with and without ORIG_MIX defined...

- Ted

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

/* #define ORIG_MIX */
#define ADD_ROTATE
#define DEBUG_POOL
#ifdef DEBUG_POOL
#define NUM_LOOPS 100
#else
#define NUM_LOOPS 1000000
#endif

typedef unsigned int __u32;

struct fast_pool {
__u32 pool[4];
unsigned long last;
unsigned short count;
unsigned char rotate;
unsigned char last_timer_intr;
};

static __u32 const twist_table[8] = {
0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };

/**
* rol32 - rotate a 32-bit value left
* @word: value to rotate
* @shift: bits to roll
*/
static inline __u32 rol32(__u32 word, unsigned int shift)
{
return (word << shift) | (word >> (32 - shift));
}

#ifdef ORIG_MIX
/*
* This is a fast mixing routine used by the interrupt randomness
* collector. It's hardcoded for an 128 bit pool and assumes that any
* locks that might be needed are taken by the caller.
*/
static void fast_mix(struct fast_pool *f, const void *in, int nbytes)
{
const char *bytes = in;
__u32 w;
unsigned i = f->count;
unsigned input_rotate = f->rotate;

while (nbytes--) {
w = rol32(*bytes++, input_rotate & 31) ^ f->pool[i & 3] ^
f->pool[(i + 1) & 3];
f->pool[i & 3] = (w >> 3) ^ twist_table[w & 7];
input_rotate += (i++ & 3) ? 7 : 14;
}
f->count = i;
f->rotate = input_rotate;
}
#else
static void fast_mix(struct fast_pool *f, __u32 input[4])
{
int i = f->count;
__u32 acc, carry = f->pool[3] >> 25;
unsigned input_rotate = f->rotate;

for (i = 0; i < 4; i++) {
#ifdef ADD_ROTATE
acc = (f->pool[i] << 7) ^ rol32(input[i], input_rotate & 31) ^
carry;
#else
acc = (f->pool[i] << 7) ^ input[i] ^ carry;
#endif

carry = f->pool[i] >> 25;
f->pool[i] = acc;
input_rotate += 7;
}
f->count++;
f->rotate = input_rotate;
}
#endif

void print_pool(struct fast_pool *pool)
{
int i;

printf("pool:\n");
for (i = 0; i < 4; i++) {
printf("\t0x%08x\n", pool->pool[i]);
}
}


int main(int argc, char **argv)
{
struct fast_pool pool;
int i;
unsigned int input[4];

memset(&pool, 0, sizeof(struct fast_pool));
memset(&input, 0, sizeof(input));
pool.pool[0] = 1;

for (i = 0; i < NUM_LOOPS; i++) {
#ifdef ORIG_MIX
fast_mix(&pool, &input, sizeof(input));
#else
fast_mix(&pool, input);
#endif
#ifdef DEBUG_POOL
print_pool(&pool);
#endif
}
#ifndef DEBUG_POOL
print_pool(&pool);
#endif
return 0;
}

2013-09-22 03:06:09

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

The following fast_mix function, with the loop unrolling, is about 70%
slower than your proposed version, but it's still four times faster
than the original byte-based fast_mix function. This is what I'm
considering using as a compromise.

Any comments or objections?

- Ted

static void fast_mix(struct fast_pool *f, __u32 input[4])
{
        __u32 w;
        int i;
        unsigned input_rotate = f->rotate;

#if 0
        for (i = 0; i < 4; i++) {
                w = rol32(input[i], input_rotate) ^ f->pool[i] ^
                        f->pool[(i + 3) & 3];
                f->pool[i] = (w >> 3) ^ twist_table[w & 7];
                input_rotate = (input_rotate + (i ? 7 : 14)) & 31;
        }
#else   /* loop unrolled for speed */
        w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3];
        f->pool[0] = (w >> 3) ^ twist_table[w & 7];
        input_rotate = (input_rotate + 14) & 31;
        w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0];
        f->pool[1] = (w >> 3) ^ twist_table[w & 7];
        input_rotate = (input_rotate + 7) & 31;
        w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1];
        f->pool[2] = (w >> 3) ^ twist_table[w & 7];
        input_rotate = (input_rotate + 7) & 31;
        w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2];
        f->pool[3] = (w >> 3) ^ twist_table[w & 7];
        input_rotate = (input_rotate + 7) & 31;
#endif
        f->count += 16;
        f->rotate = input_rotate;
}

2013-09-22 21:01:54

by Jörg-Volker Peetz

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

Hi Theodore,

Theodore Ts'o wrote, on 09/22/2013 05:05:
> The following fast_mix function, with the loop unrolling, is about 70%
> slower than your proposed version, but it's still four times faster
> than the original byte-based fast_mix function. This is what I'm
> considering using as a compromise.
>
> Any comments or objections?
>
> - Ted
>
> static void fast_mix(struct fast_pool *f, __u32 input[4])
> {
> __u32 w;
> int i;
> unsigned input_rotate = f->rotate;
>
> #if 0
> for (i = 0; i < 4; i++) {
> w = rol32(input[i], input_rotate) ^ f->pool[i] ^
> f->pool[(i + 3) & 3];
> f->pool[i] = (w >> 3) ^ twist_table[w & 7];
> input_rotate = (input_rotate + (i ? 7 : 14)) & 31;
> }
> #else /* loop unrolled for speed */
> w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3];
> f->pool[0] = (w >> 3) ^ twist_table[w & 7];
> input_rotate = (input_rotate + 14) & 31;
> w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0];
> f->pool[1] = (w >> 3) ^ twist_table[w & 7];
> input_rotate = (input_rotate + 7) & 31;
> w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1];
> f->pool[2] = (w >> 3) ^ twist_table[w & 7];
> input_rotate = (input_rotate + 7) & 31;
> w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2];
> f->pool[3] = (w >> 3) ^ twist_table[w & 7];
> input_rotate = (input_rotate + 7) & 31;
> #endif
> f->count += 16;
> f->rotate = input_rotate;
> }
>

just out of interest I would like to ask why this mixing function has to be that
complicated. For example, even if the input is always 0 and the pool is seeded
with pool[0] = 1 (as in your test program) this algorithm generates some
(predictable) pseudo-random numbers in the pool. Is this necessary?

To just mix in some random input filling the whole pool (seeded again with
pool[0] = 1) something as "simple" as

f->pool[0] = rol32(input[0], f->pool[2] & 31) ^ f->pool[1];
f->pool[1] = rol32(input[1], f->pool[3] & 31) ^ f->pool[2];
f->pool[2] = rol32(input[2], f->pool[0] & 31) ^ f->pool[3];
f->pool[3] = rol32(input[3], f->pool[1] & 31) ^ f->pool[0];

would suffice, although I didn't do any statistical tests.

Best regards,
Jörg-Volker.
--

2013-09-22 21:28:08

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

On Sun, Sep 22, 2013 at 11:01:42PM +0200, Jörg-Volker Peetz wrote:
> just out of interest I would like to ask why this mixing function has to be that
> complicated. For example, even if the input is always 0 and the pool is seeded
> with pool[0] = 1 (as in your test program) this algorithm generates some
> (predictable) pseudo-random numbers in the pool. Is this necessary?
>
> To just mix in some random input filling the whole pool (seeded again with
> pool[0] = 1) something as "simple" as
>
> f->pool[0] = rol32(input[0], f->pool[2] & 31) ^ f->pool[1];
> f->pool[1] = rol32(input[1], f->pool[3] & 31) ^ f->pool[2];
> f->pool[2] = rol32(input[2], f->pool[0] & 31) ^ f->pool[3];
> f->pool[3] = rol32(input[3], f->pool[1] & 31) ^ f->pool[0];
>
> would suffice, although I didn't do any statistical tests.

The structure of the mixing functions in /dev/random has been well
studied, and validated in a number of different academic papers. So
I prefer to stick with the basic architecture, even as it is scaled
down for speed reasons and because the pool is smaller.

One of the important things about the mixing function is that if the
attacker knows the input values (of which the simplest example for
analytic purposes is if the input values are all zero), we want there
to be ample mixing across the pool.

If you look at your proposed mixing function, in the case where the
input values are all zero, it devolves into:

f->pool[0] = f->pool[1];
f->pool[1] = f->pool[2];
f->pool[2] = f->pool[3];
f->pool[3] = f->pool[0];

Yes, I know the input values will never be all zero, but in the case
where the attacker knows what the input values are[1], but does not know
the contents of the entropy pool, a very simplistic mixing function
becomes relatively easy to analyze when the attacker knows the inputs
and the outputs and is trying to determine the internal state of the
random driver.

Measuring the speed of the fast_mix function which I put together,
it has already been sped up compared to the original fast_mix function
by a factor of six. On my x86 laptop, I can run 10 million
iterations in 0.14 seconds, so that translates to 14ns per fast_mix
call. (The original fast_mix function runs in 84 nanoseconds.)

So there is a cost-benefit tradeoff that we need to balance here. If
you have a system with 100k interrupts per second, performance is
going to be poor no matter what, and it's not clear how common such
systems really are. Most modern hardware does have some kind of NAPI or
other interrupt mitigation in place --- heck, even back in the 1980s
people had figured out how to improve the 8250 UART with the 16550A
UART, which introduced a FIFO to decrease the interrupt load caused by
serial ports, and things have only gotten better since then.

Out of curiosity, in your profiles, how many nanoseconds total is the
interrupt code path chewing up per interrupt?

Regards,

- Ted

[1] And on systems where we don't have get_cycles() or
random_get_entropy(), it becomes much easier for the attacker to guess
what at least half of the input values will be!

2013-09-22 21:33:20

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

On Sat, 21 September 2013 17:25:10 -0400, Theodore Ts'o wrote:
> On Mon, Sep 16, 2013 at 11:40:27AM -0400, Jörn Engel wrote:
> >
> > Here is a patch to make add_interrupt_randomness() significantly
> > cheaper without significantly impacting the quality. The second part
> > is my personal opinion and others might disagree.
> >
> > So far this has only seen userspace performance testing, so don't
> > merge it in a hurry.
>
> Performance testing, but your new fast_mix pool has some serious
> problems in terms of how well it works.
>
> First of all, I think this is a typo:
>
> > + for (i = 0; i < 3; i++) {
>
> This means that pool[3] is always 0, and the input[3] is never mixed
> in. Hence, the pool is now only 96 bits instead of 128 bits, and the
> last 4 bytes of the input pool are not getting mixed in.

Yes, indeed.

> Secondly, if you change this so that we actually use the whole pool,
> and you mix in a constant set of inputs like this:
>
> for (i = 0; i < 100; i++) {
> input[0] = i;
> input[1] = i;
> input[2] = i;
> input[3] = i;
> fast_mix(&pool, input);
> print_pool(&pool);
> }
>
> you see something like this:
>
> pool:
> 0x00000000
> 0x00000000
> 0x00000000
> 0x00000000
> pool:
> 0x00000001
> 0x00000001
> 0x00000001
> 0x00000001
> pool:
> 0x00000082
> 0x00000082
> 0x00000082
> 0x00000082
> pool:
> 0x00004103
> 0x00004103
> 0x00004103
> 0x00004103
> pool:
> 0x00208184
> 0x00208184
> 0x00208184
> 0x00208184
> pool:
> 0x1040c205
> 0x1040c205
> 0x1040c205
> 0x1040c205
>
> Granted, it's unlikely that we will be mixing numbers like this, but
> it's a great demonstration of why I added the input_rotate aspect to
> my mixing functions.

Honestly, this is a case of garbage in, garbage out. If your
"randomness" is the same number repeated four times, the repetitions
don't add anything. A more complicated mixing function doesn't buy
you much. In the extreme case, you have the AES-encrypted counter
example. No entropy on the input side, something that looks random on
the output side, and - because there is no secret key in the
complicated mixing function - anyone willing to sit down and do the
math can predict the output.

The failure case I was personally concerned about was a commutative
mixing function. Interrupt numbers, for example, are perfectly
predictable. The randomness comes from the order of interrupts. So
mix(a, b) != mix(b, a) is a hard requirement, if you excuse my silly
notation.

One variant that still fails your test above, but ensures every input
bit will eventually influence every output bit (certainly after 64
rounds) would be this:

static void fast_mix(struct fast_pool *f, __u32 input[4])
{
        int i;
        __u32 acc, p, carry = f->pool[3] >> 25;

        for (i = 0; i < 4; i++) {
                p = carry * input[i];
                acc = (f->pool[i] << 7) ^ input[i] ^ carry ^ p;
                carry = f->pool[i] >> 25;
                f->pool[i] = acc;
        }
}

Clearly this is a better mixing function, but I still think we don't
need that much quality. If there is any randomness at all in the
interrupts, we will quickly make every bit in the pool unpredictable.
And if the interrupts are completely predictable in every bit and in
their order, well, we have lost either way.

Jörn

--
I say what I believe in. And give reasons.
-- Pervez Hoodbhoy

2013-09-22 21:49:54

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

On Sat, 21 September 2013 17:41:18 -0400, Theodore Ts'o wrote:
>
> BTW, just to give another example of the difference between the mixing
> funtions, try compiling the following with and without ORIG_MIX defined...

Garbage in, garbage out again. If there is absolutely no randomness
in the input (all bits zero), my mixing function will simply shift the
pool. And because the shifting has a period of 128, there are only
128 possible pool states. Your mixing function is doing slightly
better, it effectively becomes an interrupt counter with a silly
output format.

But who cares? If there is absolutely no randomness in the input, you
have lost. That case isn't worth contemplating. The question is
whether any randomness present in the input will get accumulated.

Without the shifting, a single unpredictable bit on, say, the lowest
position of the timestamp will be xor'ed into the same pool bit every
time. The rest of the pool would always be predictable and the
resulting mixing function would clearly be bad.

With the shifting and using the same example, after 128 rounds every
bit of the pool is unpredictable. Job done, we can go home now. You
cannot achieve anything better than 128 unpredictable bits, no matter
how much you try.

Jörn

--
Those who come seeking peace without a treaty are plotting.
-- Sun Tzu

2013-09-22 22:12:05

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

On Sun, 22 September 2013 17:27:52 -0400, Theodore Ts'o wrote:
>
> The structure of the mixing functions in /dev/random has been well
> studied, and validatetd in a number of different academic papers. So
> I prefer to stick with the basic architecture, even as it is scaled
> down for speed reasons and beause the pool is smaller.

And I want to keep that function. Essentially the point of fast_mix()
is to ratelimit _mix_pool_bytes(). Naïve ratelimiting would simply
discard the input once the ratelimit has been reached. My proposal is
to still use the input bits, but use a really cheap mixing function.

Your version of fast_mix() failed in the "really cheap" department.
As a result, it showed up in profiles and at least one idiot (me)
reverted to naïve ratelimiting. It could have been worse, I was
explicitly asked twice to just remove the call to
add_interrupt_randomness().

So don't think of my patch as weakening the mixing, but as
strengthening the ratelimited mixing. If we have few interrupts,
_mix_pool_bytes() will be run once a second, if we have many it will
be run once every 64 interrupts. And in the latter case, the input
for _mix_pool_bytes() is much better than with naïve ratelimiting.
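
In code, that pattern is roughly the following sketch. The 64-event
and one-second thresholds and the count/last fields are illustrative;
this is not the actual code in drivers/char/random.c.

/* Sketch of ratelimited mixing - illustrative only */
void add_interrupt_randomness_sketch(struct fast_pool *f, __u32 input[4])
{
        fast_mix(f, input);             /* cheap, runs on every interrupt */
        f->count++;

        /* Spill into the (locked, expensive) input pool at most once per
         * 64 interrupts, but at least roughly once per second. */
        if (f->count < 64 && !time_after(jiffies, f->last + HZ))
                return;

        _mix_pool_bytes(&input_pool, f->pool, sizeof(f->pool), NULL);
        f->count = 0;
        f->last = jiffies;
}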

And you should do the same for add_timer_randomness(), where again you
have ratelimiting. Once trickle_thresh is reached your code simply
discards most randomness. Only once in 4096 calls do you use all the
bits you get - most of which will be predictable. Why not use a cheap
mixing function for the other 4095 calls and ensure we have many good
bits on call 4096?

Jörn

--
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike

2013-09-22 23:36:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

On Sun, Sep 22, 2013 at 04:53:34PM -0400, Jörn Engel wrote:
>
> And I want to keep that function. Essentially the point of fast_mix()
> is to ratelimit _mix_pool_bytes(). Naïve ratelimiting would simply
> discard the input once the ratelimit has been reached. My proposal is
> to still use the input bits, but use a really cheap mixing function.

Sure, but how cheap is "cheap"? There's a balance here. I don't buy
the argument that we must weaken the fast_mix() at all costs because
we have to drive the cost of fast_mix() to zero. If we're going to do
that, in the limit fast_mix() should be a no-op, which is ridiculous.

So what I've suggested already makes fast_mix() much faster. It's not
as fast as what you've proposed, but it's pretty clear that my fast_mix()
is better at mixing the fast_mix pool.

> Your version of fast_mix() failed in the "really cheap" department.
> As a result, it showed up in profiles and at least one idiot (me)
> reverted to naïve ratelimiting. It could have been worse, I was
> explicitly asked twice to just remove the call to
> add_interrupt_randomness().

Sure, but that's not _my_ problem. If someone stupid adulterates
/dev/random, that's not my responsibility. Most people aren't going
to be dealing with systems with truly stupid interrupt rates, and most
people aren't going to be tampering with the random driver.

I'm certainly willing to make fast_mix() faster, to reduce the
temptation for idiots to tamper with /dev/random, but we need to
balance the time of fast_mix() with the quality of its mixing, and to
remember what the common case is (and it's not 100k interrupts per
second!)

> And you should do the same for add_timer_randomness(), where again you
> have ratelimiting. Once trickle_thresh is reached your code simply
> discards most randomness.

In practice, we are using randomness in so many places (every single
time we start a process for ASLR, in lots of security-related programs
that use SSL libraries, etc.), that we are actually practically
*never* hitting trickle_thresh.

The trickle_thresh was added originally when add_timer_randomness()
was used for interrupts that used SA_SAMPLE_RANDOM. Given that we
don't use add_timer_randomness() for that, but for things that happen
much more rarely (such as keyboard/mouse input), I'm probably
going to end up removing the trickle_thresh check altogether.

Regards,

- Ted

2013-09-23 01:34:55

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

On Sun, 22 September 2013 19:36:41 -0400, Theodore Ts'o wrote:
> On Sun, Sep 22, 2013 at 04:53:34PM -0400, Jörn Engel wrote:
> >
> > And I want to keep that function. Essentially the point of fast_mix()
> > is to ratelimit _mix_pool_bytes(). Naïve ratelimiting would simply
> > discard the input once the ratelimit has been reached. My proposal is
> > to still use the input bits, but use a really cheap mixing function.
>
> Sure, but how cheap is "cheap"? There's a balance here. I don't buy
> the argument that we must weaken the fast_mix() at all costs because
> we have to drive the cost of fast_mix() to zero. If we're going to do
> that, in the limit fast_mix() should be a no-op, which is ridiculous.

Agreed. We always have a tradeoff between quality and cost. And I
repeat yet again that driving the cost down is important, because it
allows us to collect entropy more often. The scheduler is an entropy
source I would like to tap.

> So what I've suggested already makes fast_mix() much faster. It's not
> as fast as what you've proposed, but it's pretty clear that my fast_mix()
> is better at mixing the fast_mix pool.

Agreed.

> > Your version of fast_mix() failed in the "really cheap" department.
> > As a result, it showed up in profiles and at least one idiot (me)
> > reverted to naïve ratelimiting. It could have been worse, I was
> > explicitly asked twice to just remove the call to
> > add_interrupt_randomness().
>
> Sure, but that's not _my_ problem. If someone stupid adulterates
> /dev/random, that's not my responsibility. Most people aren't going
> to be dealing with systems with truly stupid interrupt rates, and most
> people aren't going to be tampering with the random driver.

This is where I don't agree. I very much care even about the plastic
routers running some variant of Linux modified by some embedded
engineers under insane time pressure. If you leave them an incentive
to cripple the random generator, some of them will. If you find
source code with a maliciously crippled random generator, the author
has plausible deniability.

So this should be _our_ problem. Maybe not yours specifically, but
certainly that of kernel developers in general.

> I'm certainly willing to make fast_mix() faster, to reduce the
> temptation for idiots to tamper with /dev/random, but we need to
> balance the time of fast_mix() with the quality of its mixing, and to
> remember what the common case is (and it's not 100k interrupts per
> second!)

How about we switch between the two mixing functions depending on the
interrupt load? If this CPU has seen fewer than 1000 interrupts in
the last second, use the better one, otherwise use the cheaper one?

I don't really like the idea too much. But it would cover both the
common case and my particular degenerate one.
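
To make the idea concrete, the switch could look roughly like this
(sketch only; the per-CPU counter and the two function names are made
up, and the counter would have to be reset elsewhere once per second):

/* Sketch only - not a proposed patch */
static DEFINE_PER_CPU(unsigned int, irqs_this_second);

static void adaptive_fast_mix(struct fast_pool *f, __u32 input[4])
{
        __this_cpu_inc(irqs_this_second);

        if (__this_cpu_read(irqs_this_second) < 1000)
                fast_mix_strong(f, input);      /* the better-mixing version */
        else
                fast_mix_cheap(f, input);       /* the ~8 ns version */
}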

> In practice, we are using randomness in so many places (every single
> time we start a process for ASLR, in lots of security-related programs
> that use SSL libraries, etc.), that we are actually practically
> *never* hitting trickle_thresh.
>
> The trickle_thresh was added originally when add_timer_randomness()
> was used for interrupts that used SA_SAMPLE_RANDOM. Given that we
> don't use add_timer_randomness() for that, but for things that happen
> much more rarely (such as keyboard/mouse input), I'm probably
> going to end up removing the trickle_thresh check altogether.

Makes sense to me.

Jörn

--
Doubt is not a pleasant condition, but certainty is an absurd one.
-- Voltaire

2013-09-23 02:43:53

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

On Sun, Sep 22, 2013 at 08:16:23PM -0400, Jörn Engel wrote:
> How about we switch between the two mixing functions depending on the
> interrupt load? If this CPU has seen fewer than 1000 interrupts in
> the last second, use the better one, otherwise use the cheaper one?

I guess the question here is whether it's worth it. On a 2.8 GHz
laptop Ivy Bridge chip the numbers are:

Original fast_mix: 84 ns
tytso's fast_mix: 14 ns
joern's fast_mix: 8 ns

In terms of absolute overhead, if you are running at an insane 100k
interrupts per second, it's still only 0.84%, 0.14%, and 0.08%,
respectively. Granted, an embedded CPU will be (much) slower, but then
so will the rest of the interrupt handling code path, plus whatever
overhead the device driver adds. The
real bug is the 100k interrupts per second workload.
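
For concreteness, those percentages are just the interrupt rate
multiplied by the per-call cost:

        100,000 irq/s * 84 ns = 8.4 ms of CPU time per second = 0.84%
        100,000 irq/s * 14 ns = 1.4 ms per second             = 0.14%
        100,000 irq/s *  8 ns = 0.8 ms per second             = 0.08%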

How about this as a compromise? We can add an #ifdef in the random.c
code which provides the alternate fast_mix algorithm, so that in the
very rare case that some embedded software engineer, under time
pressure and suffering under the need to use a pathetically broken
hardware design, starts looking in the random.c code, they will find
an alternate version. That way, we avoid the complexity of a dynamic
switching system, plus the overhead of measuring the number of
interrupts per second.
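
The shape of that would be something like the sketch below;
CONFIG_RANDOM_CHEAP_FAST_MIX is a made-up Kconfig symbol, shown only
to illustrate the idea.

/* Sketch of the compromise above; the config symbol is hypothetical */
#ifdef CONFIG_RANDOM_CHEAP_FAST_MIX
static void fast_mix(struct fast_pool *f, __u32 input[4])
{
        /* body of the cheaper variant, for pathological interrupt rates */
}
#else
static void fast_mix(struct fast_pool *f, __u32 input[4])
{
        /* body of the default, stronger mixing function */
}
#endif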

I am very strongly of the opinion that the number of systems where you
have an embedded system with that kind of inane interrupt rate is the
0.00000000001% case. So IMHO it's not even worth having a dynamic
switching system, especially when it's only going to improve things
slightly.

- Ted

P.S. The real reason for the original fast_mix() function is because
it has a separate pool for each CPU, so there's no spinlock contention
and no cache line bouncing. And that's why the fast_mix pool is so
small --- so that the entire struct fast_pool can fit in a single CPU
cache line. So on normal, common systems --- even mobile handsets
have multiple cores these days --- fast_mix() *is* actually much
faster than the standard entropy pool, even before any optimizations.
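
Concretely, the layout is roughly this (the field names are
illustrative, not the exact definition in drivers/char/random.c):

/* One small pool per CPU, sized so the whole struct fits in a single
 * cache line; no spinlock, no cache line bouncing between CPUs. */
struct fast_pool {
        __u32           pool[4];        /* 16 bytes of mixing state */
        unsigned long   last;           /* when we last spilled into input_pool */
        unsigned short  count;          /* events mixed since the last spill */
};

static DEFINE_PER_CPU(struct fast_pool, irq_randomness);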

2013-09-23 07:39:56

by Jörg-Volker Peetz

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

Thanks for your patience and detailed answer.

Theodore Ts'o wrote, on 09/22/2013 23:27:
> On Sun, Sep 22, 2013 at 11:01:42PM +0200, Jörg-Volker Peetz wrote:
>> just out of interest I would like to ask why this mixing function has to be that
>> complicated. For example, even if the input is always 0 and the pool is seeded
>> with pool[0] = 1 (as in your test program) this algorithm generates some
>> (predictable) pseudo-random numbers in the pool. Is this necessary?
>>
>> To just mix in some random input filling the whole pool (seeded again with
>> pool[0] = 1) something as "simple" as
>>
>> f->pool[0] = rol32(input[0], f->pool[2] & 31) ^ f->pool[1];
>> f->pool[1] = rol32(input[1], f->pool[3] & 31) ^ f->pool[2];
>> f->pool[2] = rol32(input[2], f->pool[0] & 31) ^ f->pool[3];
>> f->pool[3] = rol32(input[3], f->pool[1] & 31) ^ f->pool[0];
>>
>> would suffice, although I didn't do any statistical tests.
>
> The structure of the mixing functions in /dev/random has been well
> studied, and validated in a number of different academic papers. So
> I prefer to stick with the basic architecture, even as it is scaled
> down for speed reasons and because the pool is smaller.

I'm not arguing so much for speed but for simplicity and comprehensibility of
code. As far as I understand, the task of fast_mix() is just to collect random
bits in a small buffer separated from the random pool?
Once in a while these collected bits are then mixed into the main random pool.
For that purpose, justifiably, a strong mixing function is used.

> One of the important things about the mixing function is that if the
> attacker knows the input values (of which the simplest example for
> analytic purposes is if the input values are all zero), we want there
> to be ample mixing across the pool.
>
> If you look at your proposed mixing function, in the case where the
> input values are all zero, it devolves into:
>
> f->pool[0] = f->pool[1];
> f->pool[1] = f->pool[2];
> f->pool[2] = f->pool[3];
> f->pool[3] = f->pool[0];
>
> Yes, I know the input values will never be all zero, but in the case
> where the attacker knows what the input values are[1], but does not know
> the contents of the entropy pool, a very simplistic mixing function
> becomes relatively easy to analyze in the case where the attacker knows
> the inputs and the outputs, and is trying to determine the internal
> state of the random driver.
>
> Measuring the speed of the fast_mix function which I put together,
> it's already been sped up compared to the original fast_mix function
> by a factor of six. On my x86 laptop, I can run 10 million
> iterations in 0.14 seconds, so that translates to 14ns per fast_mix
> call. (The original fast_mix function runs in 84 nanoseconds.)
>
> So there is a cost-benefit tradeoff that we need to balance here. If
> you have a system with 100k interrupts per second, performance is
> going to be poor no matter what, and it's not clear how common such
> systems really are. Most modern hardware does have some kind of NAPI or
> other interrupt mitigation in place --- heck, even back in the 1980's
> people had figured out how to improve the 8250 UART with the 16550A
> UART, which introduced a FIFO to decrease the interrupt load caused by
> serial ports, and things have only gotten better since then.
>
> Out of curiosity, on your profiles, how many nanoseconds total is the
> total interrupt code path chewing up per interrupt?
>
> Regards,
>
> - Ted
>
> [1] And on systems where we don't have get_cycles() or
> random_get_entropy(), it becomes much easier for the attacker to guess
> what at least half of the input values will be!
>

2013-09-23 16:20:55

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH,RFC] random: make fast_mix() honor its name

On Sun, 22 September 2013 22:43:38 -0400, Theodore Ts'o wrote:
> On Sun, Sep 22, 2013 at 08:16:23PM -0400, Jörn Engel wrote:
> > How about we switch between the two mixing functions depending on the
> > interrupt load? If this CPU has seen fewer than 1000 interrupts in
> > the last second, use the better one, otherwise use the cheaper one?
>
> I guess the question here is whether it's worth it. On a 2.8 GHz
> laptop Ivy Bridge chip the numbers are:

Then let us assume for now it is not worth it. When I finally get
around to generating profiles for my insane system we can revisit the
issue.

> I am very strongly of the opinion that the number of systems where you
> have an embedded system with that kind of inane interrupt rate is the
> 0.00000000001% case.

That would be one machine in 10^13? I doubt we have reached 10^13
machines running Linux in all of history, so a single example would
defeat your very strong opinion. ;)

Anyway, let me collect some real numbers before we argue any further.
And thank you for your maintainership. It may not appear that way,
but I have _very_ little to complain about in drivers/char/random.c.

Jörn

--
One of the painful things about our time is that those who feel certainty
are stupid, and those with any imagination and understanding are filled
with doubt and indecision.
-- Bertrand Russell

2013-10-10 06:50:25

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

Hi!

(Sorry to be late, and maybe with a very early date).

On Tue 2013-09-10 14:25:24, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 06:54:38PM +0200, Stephan Mueller wrote:
>
> > Moreover, until we have your proposed real fix, wouldn't it make sense to
> > have an interim patch to ensure we have entropy on the mentioned
> > platforms? I think /dev/random is critical enough to warrant some cache
> > miss even per interrupt?
>
> We are already mixing in the IP from the saved irq registers, on every
> single interrupt, so we are mixing in some entropy. We would get more
> entropy if we had a good cycle counter to mix in, but it's not the
> case that we're completely exposed right now on those platforms which
> don't have get_cycles() implemented. If the system is running so lock
> step that the location of the interrupts is utterly predictable, then
> it's not clear that using a clock source is going to help you....

On a system with good power management and idle CPUs (many of them), I'd
expect most of the IPs to be pretty much equal -- the arch-specific version
of "enter low power state" (mwait on recent Intel machines, for example).

Would it make sense to mix the other registers, too?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2013-10-14 21:13:46

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] /dev/random: Insufficient of entropy on many architectures

On Thu, Oct 10, 2013 at 08:50:17AM +0200, Pavel Machek wrote:
> On a system with good power management and idle CPUs (many of them), I'd
> expect most of the IPs to be pretty much equal -- the arch-specific version
> of "enter low power state" (mwait on recent Intel machines, for example).
>
> Would it make sense to mix the other registers, too?

Hmm.... Is there a useful way that we can do this in an
architecture-independent way?

- Ted