Ted,
last week you proposed an rfc patch to gather entropy from the CPU's
hwrng, and I was pleased - until I discovered one of my stalling
desktop machines does not have a hwrng. At that point I thought that
the problem was only from reading /dev/random, so I went away to look
at persuading the immediate consumer (unbound) to use /dev/urandom.
Did that, no change. Ran strace from the bootscript, confirmed that
only /dev/urandom was being used, and that it seemed to be blocking.
Thought maybe this was the only problematic bootscript, tried moving
it to later, but hit the same problem on chronyd (again, seems to use
urandom). And yes, I probably should have started chronyd first
anyway, but that's irrelevant to this problem.
BUT: I'm not sure if I've correctly understood what is happening.
It seems to me that the fix for CVE-2018-1108 (4.17-rc1, 4.16.4)
means /dev/urandom will now block until fully initialised.
Is that correct and intentional ?
If so, to get the affected desktop machines to boot I seem to have
some choices:
1. Wait for two and a half minutes (timed on the kaveri, the haswell
seemed to take a similar time).
2. Sit at the keyboard and start thumping it once userspace has
started.
3. For the haswell, apply your patch and trust that the CPU has not
been backdoored.
4. Run haveged.
The latter certainly lets it boot in a reasonable time, but people
who understand this seem to regard it as untrustworthy. For users
of /dev/urandom that is no big deal, but does it not mean that the
values from /dev/random will be similarly untrustworthy and
therefore I should not use this machine for generating long-lived
secure keys ?
TIA.
ĸen
On Mon, Jul 23, 2018 at 04:43:01AM +0100, Ken Moffat wrote:
> Ted,
>
> last week you proposed an rfc patch to gather entropy from the CPU's
> hwrng, and I was pleased - until I discovered one of my stalling
> desktop machines does not have a hwrng. At that point I thought that
> the problem was only from reading /dev/random, so I went away to look
> at persuading the immediate consumer (unbound) to use /dev/urandom.
>
> Did that, no change. Ran strace from the bootscript, confirmed that
> only /dev/urandom was being used, and that it seemed to be blocking.
> Thought maybe this was the only problematic bootscript, tried moving
> it to later, but hit the same problem on chronyd (again, seems to use
> urandom). And yes, I probably should have started chronyd first
> anyway, but that's irrelevant to this problem.
Nope, /dev/urandom still doesn't block. Are you sure it isn't caused
by something calling getrandom(2) --- which *will* block?
We intentionally left /dev/urandom non-blocking, because of backwards
compatibility.
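
For example, a small test program along these lines (a rough sketch; it
assumes a glibc new enough to provide the getrandom() wrapper in
<sys/random.h>) shows the difference:

/* Probe whether the pool is initialized without blocking, then show
 * that a /dev/urandom read returns immediately either way. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/random.h>
#include <unistd.h>

int main(void)
{
	unsigned char buf[16];
	ssize_t n;

	/* GRND_NONBLOCK: fail with EAGAIN instead of blocking if the
	 * CRNG is not yet initialized. */
	n = getrandom(buf, sizeof(buf), GRND_NONBLOCK);
	if (n < 0 && errno == EAGAIN)
		printf("getrandom(2): pool not initialized, would block\n");
	else if (n < 0)
		perror("getrandom");
	else
		printf("getrandom(2): pool initialized, got %zd bytes\n", n);

	/* /dev/urandom never blocks, initialized or not. */
	int fd = open("/dev/urandom", O_RDONLY);
	if (fd >= 0) {
		n = read(fd, buf, sizeof(buf));
		printf("/dev/urandom: read returned %zd bytes immediately\n", n);
		close(fd);
	}
	return 0;
}

Run very early in boot (before the "crng init done" message) it would
show which of the two interfaces is actually doing the blocking.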
> BUT: I'm not sure if I've correctly understood what is happening.
> It seems to me that the fix for CVE-2018-1108 (4.17-rc1, 4.16.4)
> means /dev/urandom will now block until fully initialised.
>
> Is that correct and intentional ?
No, that's not right.  What the fix does is make the entropy
accounting more accurate before getrandom(2) becomes non-blocking.
There were a bunch of things we were doing wrong, including assuming
that 100% of the bytes being sent via add_device_randomness() were
random --- when some of the things feeding into it were the (fixed)
information you would get from running dmidecode (e.g., the fixed
results from the BIOS configuration data).
Some of those bytes might not be known to an external adversary (such
as your mainboard's serial number), but they're not exactly *secret*.
> If so, to get the affected desktop machines to boot I seem to have
> some choices...
Well, this probably isn't going to be popular, but the other thing
that might help is that you could switch distros.  I'm guessing you run a
Red Hat distro, probably Fedora, right?
The problem which most people are seeing turns out to be a terrible
interaction between dracut-fips, systemd and a Red Hat specific patch
to libgcrypt for FIPS/FEDRAMP compliance:
https://src.fedoraproject.org/rpms/libgcrypt/blob/master/f/libgcrypt-1.6.2-fips-ctor.patch#_23
Uninstalling dracut-fips and recreating the initramfs might also help.
One of the reasons why I didn't see the problem when I was developing
the remediation patch for CVE-2018-1108 is because I run Debian
testing, which doesn't have this particular Red Hat patch.
> The latter certainly lets it boot in a reasonable time, but people
> who understand this seem to regard it as untrustworthy. For users
> of /dev/urandom that is no big deal, but does it not mean that the
> values from /dev/random will be similarly untrustworthy and
> therefore I should not use this machine for generating long-lived
> secure keys ?
This really depends on how paranoid / careful you are. Remember, your
keyboard controller was almost certainly built in Shenzhen, China, and
Matt Blaze published a paper on the Jitterbug in 2006:
http://www.crypto.com/papers/jbug-Usenix06-final.pdf
In practice, after 30 minutes of operation, especially if you are
using the keyboard, the entropy pool *will* be sufficiently
randomized, whether or not it was sufficiently randomized at boot.  The
real danger of CVE-2018-1108 was always long-term keys generated at
first boot. That was the problem that was discussed in the "Mining
your p's and q's: Detection of Widespread Weak Keys in Network
Devices" (see https://factorable.net).
So generating long-lived keys means (a) you need to be sure you trust
all of the software on the system --- some very paranoid people such
as Bruce Schneier used a freshly installed machine from CD-ROM that
was never attached to the network before examining materials from
Edward Snowden, and (b) making sure the entropy pool is initialized.
Remember we are constantly feeding input from the hardware sources
into the entropy pool; it doesn't stop the moment we think the entropy
pool is initialized.  And you can always mix extra "stuff" into the
entropy pool, say the results of a series of dice rolls, by sending
it via the "cat" or "echo" command into /dev/urandom.
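
In C, the same kind of mixing looks roughly like this (just a sketch;
note that writes to /dev/urandom stir the bytes into the input pool but
do not by themselves credit any entropy):

/* Write some user-supplied material into /dev/urandom; the kernel
 * mixes it into the pool.  The input string is only a hypothetical
 * example. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char stuff[] = "3 6 1 4 4 2 5 6 2 1 3 5";  /* e.g. dice rolls */
	int fd = open("/dev/urandom", O_WRONLY);

	if (fd < 0) {
		perror("open /dev/urandom");
		return 1;
	}
	if (write(fd, stuff, strlen(stuff)) < 0)
		perror("write");
	close(fd);
	return 0;
}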
So it should be possible to use the machine for generating long-lived
keys; you might just need to be a bit more careful before you do it.
It's really keys generated automatically at boot that are most at risk
--- and you can always regenerate the host SSH keys after a fresh
install.  In fact, what I have done in the past with a freshly created
cloud VM is to run a command like "dd if=/dev/urandom count=1 bs=256 |
od -x" on a local machine, then log in to the VM, run "cat >
/dev/urandom", and cut and paste the od -x output into the guest VM,
to better initialize the VM's entropy pool before regenerating the
host SSH keys.
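
A rough C rendering of that two-step procedure (names and structure are
purely illustrative; "dump" plays the role of the dd | od -x step on a
trusted local machine, and "seed" plays the role of cat > /dev/urandom
on the freshly booted VM):

/* Sketch of the two-step seeding described above. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int dump_local_entropy(void)
{
	unsigned char buf[256];
	int fd = open("/dev/urandom", O_RDONLY);

	if (fd < 0 || read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
		perror("/dev/urandom");
		return 1;
	}
	close(fd);
	/* Print the bytes as hex, 16 per line, for cutting and pasting. */
	for (size_t i = 0; i < sizeof(buf); i++)
		printf("%02x%s", buf[i], (i + 1) % 16 ? " " : "\n");
	return 0;
}

static int seed_vm_pool(void)
{
	char line[1024];
	int fd = open("/dev/urandom", O_WRONLY);

	if (fd < 0) {
		perror("/dev/urandom");
		return 1;
	}
	/* Paste the hex dump on stdin; the text gets mixed into the pool. */
	while (fgets(line, sizeof(line), stdin))
		if (write(fd, line, strlen(line)) < 0)
			perror("write");
	close(fd);
	return 0;
}

int main(int argc, char **argv)
{
	if (argc > 1 && !strcmp(argv[1], "dump"))
		return dump_local_entropy();
	if (argc > 1 && !strcmp(argv[1], "seed"))
		return seed_vm_pool();
	fprintf(stderr, "usage: %s dump|seed\n", argv[0]);
	return 2;
}

Usage would be "dump" on the local machine, then "seed" on the VM with
the hex output pasted in, before regenerating the host keys.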
Cheers,
- Ted
On Mon, Jul 23, 2018 at 11:16 AM, Theodore Y. Ts'o <[email protected]> wrote:
> On Mon, Jul 23, 2018 at 04:43:01AM +0100, Ken Moffat wrote:
>> ...
> One of the reasons why I didn't see the problem when I was developing
> the remediation patch for CVE-2018-1108 is because I run Debian
> testing, which doesn't have this particular Red Hat patch.
Off-topic, I'm kind of surprised it took that long to fix it (if I am
parsing things correctly).
I believe Stephan Mueller wrote up the weakness a couple of years ago.
He's the one who explained the interactions to me. Mueller was even
cited at https://github.com/systemd/systemd/issues/4167.
It is too bad Mueller did not receive credit for it in the CVE database.
Jeff
On 23 July 2018 at 16:16, Theodore Y. Ts'o <[email protected]> wrote:
> On Mon, Jul 23, 2018 at 04:43:01AM +0100, Ken Moffat wrote:
>>
>> Did that, no change. Ran strace from the bootscript, confirmed that
>> only /dev/urandom was being used, and that it seemed to be blocking.
>
> Nope, /dev/urandom still doesn't block. Are you sure it isn't caused
> by something calling getrandom(2) --- which *will* block?
I'm not at all sure, which was why I asked.
>
> We intentionally left /dev/urandom non-blocking, because of backwards
> compatibility.
>
>> BUT: I'm not sure if I've correctly understood what is happening.
>> It seems to me that the fix for CVE-2018-1108 (4.17-rc1, 4.16.4)
>> means /dev/urandom will now block until fully initialised.
>>
>> Is that correct and intentional ?
>
> No, that's not right.  What the fix does is make the entropy
> accounting more accurate before getrandom(2) becomes non-blocking.
> There were a bunch of things we were doing wrong, including assuming
> that 100% of the bytes being sent via add_device_randomness() were
> random --- when some of the things feeding into it were the (fixed)
> information you would get from running dmidecode (e.g., the fixed
> results from the BIOS configuration data).
>
> Some of those bytes might not be known to an external adversary (such
> as your mainboard's serial number), but they're not exactly *secret*.
>
>> If so, to get the affected desktop machines to boot I seem to have
>> some choices...
>
> Well, this probably isn't going to be popular, but the other thing
> that might help is that you could switch distros.  I'm guessing you run a
> Red Hat distro, probably Fedora, right?
>
Wrong: Linux From Scratch (the SysV version) and Beyond Linux From
Scratch, plus extras such as chronyd.  The only initrd is on the
haswell, and it is just for Intel microcode.
> The problem which most people are seeing turns out to be a terrible
> interaction between dracut-fips, systemd and a Red Hat specific patch
> to libgcrypt for FIPS/FEDRAMP compliance:
>
> https://src.fedoraproject.org/rpms/libgcrypt/blob/master/f/libgcrypt-1.6.2-fips-ctor.patch#_23
>
> Uninstalling dracut-fips and recreating the initramfs might also help.
>
> One of the reasons why I didn't see the problem when I was developing
> the remediation patch for CVE-2018-1108 is because I run Debian
> testing, which doesn't have this particular Red Hat patch.
>
>> The latter certainly lets it boot in a reasonable time, but people
>> who understand this seem to regard it as untrustworthy. For users
>> of /dev/urandom that is no big deal, but does it not mean that the
>> values from /dev/random will be similarly untrustworthy and
>> therefore I should not use this machine for generating long-lived
>> secure keys ?
>
> This really depends on how paranoid / careful you are. Remember, your
> keyboard controller was almost certainly built in Shenzhen, China, and
> Matt Blaze published a paper on the Jitterbug in 2006:
>
> http://www.crypto.com/papers/jbug-Usenix06-final.pdf
>
> In practice, after 30 minutes of operation, especially if you are
> using the keyboard, the entropy pool *will* be sufficiently
> randomized, whether or not it was sufficiently randomized at boot.  The
> real danger of CVE-2018-1108 was always long-term keys generated at
> first boot. That was the problem that was discussed in the "Mining
> your p's and q's: Detection of Widespread Weak Keys in Network
> Devices" (see https://factorable.net).
>
> So generating long-lived keys means (a) you need to be sure you trust
> all of the software on the system --- some very paranoid people such
> as Bruce Schneier used a freshly installed machine from CD-ROM that
> was never attached to the network before examining materials from
> Edward Snowden, and (b) making sure the entropy pool is initialized.
>
> Remember we are constantly feeding input from the hardware sources
> into the entropy pool; it doesn't stop the moment we think the entropy
> pool is initialized.  And you can always mix extra "stuff" into the
> entropy pool, say the results of a series of dice rolls, by sending
> it via the "cat" or "echo" command into /dev/urandom.
>
> So it should be possible to use the machine for generating long-lived
> keys; you might just need to be a bit more careful before you do it.
> It's really keys generated automatically at boot that are most at risk
> --- and you can always regenerate the host SSH keys after a fresh
> install.  In fact, what I have done in the past with a freshly created
> cloud VM is to run a command like "dd if=/dev/urandom count=1 bs=256 |
> od -x" on a local machine, then log in to the VM, run "cat >
> /dev/urandom", and cut and paste the od -x output into the guest VM,
> to better initialize the VM's entropy pool before regenerating the
> host SSH keys.
>
> Cheers,
>
> - Ted
Thanks. In that case I'll go with the simple fix (haveged).
ĸen
On Mon, Jul 23, 2018 at 12:11:12PM -0400, Jeffrey Walton wrote:
>
> I believe Stephan Mueller wrote up the weakness a couple of years ago.
> He's the one who explained the interactions to me. Mueller was even
> cited at https://github.com/systemd/systemd/issues/4167.
Stephan had a lot of complaints about the existing random driver.
That's because he has a replacement driver that he has been pushing,
and instead of giving explicit complaints with specific patches to fix
those specific issues, he gave a generalized blast of complaints, plus
a "big bang rewrite".
I've reviewed his lrng doc, and this specific issue was not among his
complaints. Quite a while ago, I had gone through his document, and
had specifically addressed each of his complaints. As far as I have
been able to determine, all of the specific technical complaints (as
opposed to personal preference issues) have been addressed.
His complaint is a textbook example of how *not* to file a bug
report. That being said, we try to take bug reports from as many
sources as possible even if they aren't well formed or submitted in
the ideal place.
(I'm reminded of Linux's networking scalability limitations, which
Microsoft reported via the Wall Street Journal 15+ years ago --- and
which only applied if you had four CPUs and four 10-megabit networking
cards; if you had four CPUs and a 100-megabit networking card, Linux
would grind Microsoft into the dust.  Still, it was a bug, and we
appreciated the report and we fixed it, even if it wasn't filed in the
ideal forum. :-)
> It is too bad Mueller did not receive credit for it in the CVE database.
As near as I can tell, he doesn't deserve it for this particular
issue. It's all Jann Horn and Google's Project Zero. (And his
writeup is a textbook example of how to report this sort of issue with
great specificity and analysis.)
- Ted