2019-09-17 08:27:12

by Matthew Garrett

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On 16 September 2019 18:05:57 GMT-07:00, Linus Torvalds <[email protected]> wrote:
>On Mon, Sep 16, 2019 at 4:29 PM Ahmed S. Darwish <[email protected]>
>wrote:
>>
>> Linus, in all honesty, the other case is _not_ a hypothetical .
>
>Oh yes it is.
>
>You're confusing "use" with "breakage".
>
>The _use_ of getrandom(0) for key generation isn't hypothetical.
>
>But the _breakage_ from the suggested patch that makes it time out is.
>
>See the difference?
>
>The thing is, to break, you have to
>
> (a) do that key generation at boot time
>
> (b) do it on an idle machine that doesn't have entropy

Exactly the scenario where you want getrandom() to block, yes.

>in order to basically reproduce the current boot-time hang situation
>with the broken gdm, except with an actual "generate key".
>
>Then you have to ignore the big warning too.

The big warning that's only printed in dmesg?


--
Matthew Garrett | [email protected]


2019-09-17 08:27:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Mon, Sep 16, 2019 at 6:24 PM Matthew Garrett <[email protected]> wrote:
>
> Exactly the scenario where you want getrandom() to block, yes.

It *would* block. Just not forever.

And btw, the whole "generate key at boot when nothing else is going
on" is already broken, so presumably nobody actually does it.

See why I'm saying "hypothetical"? You're doing it again.

> >Then you have to ignore the big warning too.
>
> The big warning that's only printed in dmesg?

Well, the patch actually made getrandom() return an error too, but you
seem more interested in the hypotheticals than in arguing actualities.

Linus

2019-09-17 08:30:56

by Matthew Garrett

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On 16 September 2019 18:41:36 GMT-07:00, Linus Torvalds <[email protected]> wrote:
>On Mon, Sep 16, 2019 at 6:24 PM Matthew Garrett <[email protected]>
>wrote:
>>
>> Exactly the scenario where you want getrandom() to block, yes.
>
>It *would* block. Just not forever.

It's already not forever - there's enough running in the background of that system that it'll unblock eventually.

>And btw, the whole "generate key at boot when nothing else is going
>on" is already broken, so presumably nobody actually does it.

If nothing ever did this, why was getrandom() designed in a way to protect against this situation?

>See why I'm saying "hypothetical"? You're doing it again.
>
>> >Then you have to ignore the big warning too.
>>
>> The big warning that's only printed in dmesg?
>
>Well, the patch actually made getrandom() return an error too, but you
>seem more interested in the hypotheticals than in arguing actualities.

If you want to be safe, terminate the process.


--
Matthew Garrett | [email protected]

2019-09-17 09:37:43

by Willy Tarreau

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Mon, Sep 16, 2019 at 06:46:07PM -0700, Matthew Garrett wrote:
> >Well, the patch actually made getrandom() return an error too, but you
> >seem more interested in the hypotheticals than in arguing actualities.
>
> If you want to be safe, terminate the process.

This is an interesting approach. At least it will cause bug reports in
application using getrandom() in an unreliable way and they will check
for other options. Because one of the issues with systems that do not
finish to boot is that usually the user doesn't know what process is
hanging.

Anyway, regarding the impact on applications relying on getrandom() for
security, I'm in favor of not *silently* changing their behavior, and of
providing a new flag to help others get insecure randoms without waiting.

With your option above, we could then proceed as follows:

- GRND_SECURE: the application wants secure randoms, i.e. like
the current getrandom(0), waiting for entropy.

- GRND_INSECURE: the application never wants to wait, it just
wants a replacement for /dev/urandom.

- GRND_RANDOM: unchanged, or subject to CAP_xxx, or maybe just emit
a "deprecated" warning if called without a certain capability, to
spot potentially harmful applications.

- by default (0), the application continues to wait but when the
timeout strikes (30 seconds?), it gets terminated with a
message in the logs for users to report the issue.

After some time all relevant applications which accidentally misuse
getrandom() will be fixed to either use GRND_INSECURE or GRND_SECURE
and be able to wait longer if they want (likely SECURE|NONBLOCK).
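
A minimal sketch of how an application could use the flags proposed above.
GRND_SECURE and GRND_INSECURE are proposals in this thread rather than
existing kernel flags, so the values below are placeholders and the EINVAL
fallback covers kernels that do not know them:

#include <errno.h>
#include <sys/types.h>
#include <sys/random.h>

#ifndef GRND_INSECURE                   /* proposed flag, placeholder value */
#define GRND_INSECURE 0x0004
#endif
#ifndef GRND_SECURE                     /* proposed flag, placeholder value */
#define GRND_SECURE   0x0008
#endif

/* Hash-table seed and similar: never worth blocking or dying for. */
static int get_seed(void *buf, size_t len)
{
    ssize_t r = getrandom(buf, len, GRND_INSECURE);
    if (r < 0 && errno == EINVAL)       /* older kernel: flag unknown */
        r = getrandom(buf, len, GRND_NONBLOCK);
    return r == (ssize_t)len ? 0 : -1;
}

/* Long-lived key material: explicitly opt in to waiting for real entropy. */
static int get_key_material(void *buf, size_t len)
{
    ssize_t r = getrandom(buf, len, GRND_SECURE);
    if (r < 0 && errno == EINVAL)       /* older kernel: today's getrandom(0) */
        r = getrandom(buf, len, 0);
    return r == (ssize_t)len ? 0 : -1;
}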

Willy

2019-09-17 09:43:07

by Martin Steigerwald

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

Willy Tarreau - 17.09.19, 07:24:38 CEST:
> On Mon, Sep 16, 2019 at 06:46:07PM -0700, Matthew Garrett wrote:
> > >Well, the patch actually made getrandom() return en error too, but
> > >you seem more interested in the hypotheticals than in arguing
> > >actualities.>
> > If you want to be safe, terminate the process.
>
> This is an interesting approach. At least it will cause bug reports in
> application using getrandom() in an unreliable way and they will
> check for other options. Because one of the issues with systems that
> do not finish to boot is that usually the user doesn't know what
> process is hanging.

A userspace process could just poll the kernel by forking a process
that uses getrandom() and waiting until it no longer gets terminated.
And then it would still hang.

So yes, that would make it harder to abuse the API, but not
impossible. Which may still be good, I don't know.
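
A rough sketch of that probing pattern, assuming the proposed behaviour of
killing early getrandom(0) callers; it only illustrates that the kill policy
can be sidestepped, nothing more:

#include <sys/random.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a throwaway child to call getrandom(0); under the proposed policy the
 * child gets SIGKILLed while the CRNG is not ready, so the parent just retries
 * until a probe survives -- and then still ends up having waited for entropy. */
static void wait_for_crng_by_probing(void)
{
    for (;;) {
        pid_t pid = fork();
        if (pid == 0) {
            char buf[16];
            getrandom(buf, sizeof(buf), 0);   /* would be killed if "too early" */
            _exit(0);
        }
        int status;
        if (pid > 0 && waitpid(pid, &status, 0) > 0 && WIFEXITED(status))
            return;                           /* probe survived: CRNG is ready */
        sleep(1);                             /* killed: try again later */
    }
}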

Either the kernel does not reveal at all whether it has seeded the CRNG,
leaving GnuPG, OpenSSH and others in the dark, or it does and risks that
userspace does stupid things, whether it prints a big fat warning or not.

Of course the warning could be worded like:

process blocking on entropy too early on boot without giving the kernel
much chance to gather entropy. This is not a kernel issue; report it to
userspace developers

And probably then kill the process, so at least users will know.

However this again would be burdening users with an issue they should
not have to care about. Unless userspace developers care enough and
manage to take time to fix the issue before updated kernels come to their
systems. Cause again it would be users' systems that would not be
working. Just cause kernel and userspace developers did not agree and
chose to fight with each other instead of talking *with* each other.

At least when gdm is killed, systemd may restart it if configured to do so.
But if it doesn't, the user is again stuck with a non-working system
until they restart gdm themselves.

It may still make sense to make the API harder to use, but it does not
replace talking with userspace developers and it would need some time to
allow for adapting userspace applications and services.

--
Martin


2019-09-17 09:53:10

by Willy Tarreau

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 09:33:40AM +0200, Martin Steigerwald wrote:
> However this again would be burdening users with an issue they should
> not have to care about. Unless userspace developers care enough and
> manage to take time to fix the issue before updated kernels come to their
> systems. Cause again it would be users' systems that would not be
> working. Just cause kernel and userspace developers did not agree and
> chose to fight with each other instead of talking *with* each other.

It has nothing to do with fighting at all, it has to do with offering
what applications *need* without breaking existing assumptions that
make most applications work. And more importantly it involves not
silently breaking applications which need good randomness for long-lived
keys, because the breakage will not be visible initially and can
hit them hard later. Right now most applications which block in the
early stages are only victims of the current situation, and their
developers possibly didn't understand the possible impacts of a lack
of entropy (or how real an issue it was). These applications do need
to be able to get low-quality randomness without blocking forever,
provided it is not accidentally used by those who need security. At
some point, just like for any syscall, the doc makes the difference.

> At least when gdm is killed, systemd may restart it if configured to do so.
> But if it doesn't, the user is again stuck with a non-working system
> until they restart gdm themselves.
>
> It may still make sense to make the API harder to use,

No. What is hard to use is often misused. It must be harder to misuse
it, which means it should be easier to use it correctly. The choice of
flag names and the emission of warnings definitely help during the
development stage.

> but it does not
> replace talking with userspace developers and it would need some time to
> allow for adapting userspace applications and services.

Which is how adding new flags can definitely help even if adoption takes
time. By the way, in this discussion I am a userspace developer and have
been hit several times by libraries switching to getrandom() that silently
failed to respond in the field. As a userspace developer, I really want to see
a solution to this problem. And I'm fine if the kernel decides to kill
haproxy for using getrandom() with the old settings, at least users will
notice, will complain to me and will update.

Willy

2019-09-17 12:14:40

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 09:33:40AM +0200, Martin Steigerwald wrote:
> Willy Tarreau - 17.09.19, 07:24:38 CEST:
> > On Mon, Sep 16, 2019 at 06:46:07PM -0700, Matthew Garrett wrote:
> > > >Well, the patch actually made getrandom() return an error too, but
> > > >you seem more interested in the hypotheticals than in arguing
> > > >actualities.
> > > If you want to be safe, terminate the process.
> >
> > This is an interesting approach. At least it will cause bug reports in
> > application using getrandom() in an unreliable way and they will
> > check for other options. Because one of the issues with systems that
> > do not finish to boot is that usually the user doesn't know what
> > process is hanging.
>

I would be happy with a change which changes getrandom(0) to send a
kill -9 to the process if it is called too early, with a new flag,
getrandom(GRND_BLOCK) which blocks until entropy is available. That
leaves it up to the application developer to decide what behavior they
want.

Userspace applications which want to do something more sophisticated
could set a timer which will cause getrandom(GRND_BLOCK) to return
with EINTR (or the signal handler could use longjmp; whatever) to
abort and do something else, like calling random_r if it's for some
pathetic use of random numbers like MIT-MAGIC-COOKIE.
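
A minimal sketch of that timer approach. GRND_BLOCK is only proposed here, so
this uses today's blocking flags=0; the important detail is installing the
SIGALRM handler without SA_RESTART so that a blocked getrandom() fails with
EINTR when the alarm fires:

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/random.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; }    /* exists only to interrupt */

static ssize_t getrandom_with_timeout(void *buf, size_t len, unsigned int secs)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;                   /* deliberately no SA_RESTART */
    sigaction(SIGALRM, &sa, NULL);

    alarm(secs);
    ssize_t r = getrandom(buf, len, 0);         /* blocks until the CRNG is ready */
    alarm(0);

    if (r < 0 && errno == EINTR)
        return -1;      /* timed out: abort, or fall back to random_r() etc. */
    return r;
}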

> A userspace process could just poll the kernel by forking a process
> that uses getrandom() and waiting until it no longer gets terminated.
> And then it would still hang.

So.... I'm not too worried about that, because if a process is
determined to do something stupid, they can always do something
stupid.

This could potentially be a problem, as would GRND_BLOCK, in that if
an application author decides to use it to wait for real
randomness, because in the good judgement of the application author,
it d*mned needs real security because otherwise an attacker could,
say, force a launch of nuclear weapons and cause world war III, and
then some small 3rd-tier distro decides to repurpose that application
for some other use, and puts it in early boot, it's possible that a
user will report it as a "regression", and we'll be back to the
question of whether we revert a performance optimization patch.

There are only two ways out of this mess. The first option is we take
functionality away from a userspace author who Really Wants A Secure
Random Number Generator. And there are an awful lot of programs who
really want secure crypto, because this is not a hypothetical. The
result in "Mining your P's and Q's" did happen before. If we forget
the history, we are doomed to repeat it.

The only other way is that we need to try to get the CRNG initialized
securely in early boot, before we let userspace start. If we do it
early enough, we can also make the kernel facilities like KASLR and
Stack Canaries more secure. And this is *doable*, at least for most
common platforms. We can leverage UEFI; we can try to use the TPM's
random number generator, etc. It won't help so much for certain
brain-dead architectures, like MIPS and ARM, but if they are used for
embedded use cases, it will be caught before the product is released
for consumer use. And this is where blocking is *way* better than a
big fat warning, or sleeping for 15 seconds, both of which can easily
get missed in the embedded case. If we can fix this for traditional
servers/desktops/laptops, then users won't be complaining to Linus,
and I think we can all be happy.

Regards,

- Ted

2019-09-17 12:30:59

by Ahmed S. Darwish

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 08:11:56AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Sep 17, 2019 at 09:33:40AM +0200, Martin Steigerwald wrote:
> > Willy Tarreau - 17.09.19, 07:24:38 CEST:
> > > On Mon, Sep 16, 2019 at 06:46:07PM -0700, Matthew Garrett wrote:
> > > > >Well, the patch actually made getrandom() return an error too, but
> > > > >you seem more interested in the hypotheticals than in arguing
> > > > >actualities.
> > > > If you want to be safe, terminate the process.
> > >
> > > This is an interesting approach. At least it will cause bug reports in
> > > application using getrandom() in an unreliable way and they will
> > > check for other options. Because one of the issues with systems that
> > > do not finish to boot is that usually the user doesn't know what
> > > process is hanging.
> >
>
> I would be happy with a change which changes getrandom(0) to send a
> kill -9 to the process if it is called too early, with a new flag,
> getrandom(GRND_BLOCK) which blocks until entropy is available. That
> leaves it up to the application developer to decide what behavior they
> want.
>

Yup, I'm convinced that's the sanest option too. I'll send a final RFC
patch tonight implementing the following:

config GETRANDOM_CRNG_ENTROPY_MAX_WAIT_MS
int
default 3000
help
Default max wait in milliseconds, for the getrandom(2) system
call when asking for entropy from the urandom source, until
the Cryptographic Random Number Generator (CRNG) gets
initialized. Any process exceeding this duration for entropy
wait will get killed by the kernel. The maximum wait can be
overridden through the "random.getrandom_max_wait_ms" kernel
boot parameter. Rationale follows.

When the getrandom(2) system call was created, it came with
the clear warning: "Any userspace program which uses this new
functionality must take care to assure that if it is used
during the boot process, that it will not cause the init
scripts or other portions of the system startup to hang
indefinitely."

Unfortunately, due to multiple factors, including not having
this warning written in a scary enough language in the
manpages, and due to glibc since v2.25 implementing a BSD-like
getentropy(3) in terms of getrandom(2), modern user-space is
calling getrandom(2) in the boot path everywhere.

Embedded Linux systems were first hit by this, and reports of
embedded system "getting stuck at boot" began to be
common. Over time, the issue began to even creep into consumer
level x86 laptops: mainstream distributions, like Debian
Buster, began to recommend installing haveged as a workaround,
just to let the system boot.

Filesystem optimizations in EXT4 and XFS exaggerated the
problem, due to aggressive batching of IO requests, and thus
minimizing sources of entropy at boot. This led to large
delays until the kernel's Cryptographic Random Number
Generator (CRNG) got initialized, and thus having reports of
getrandom(2) indefinitely stuck at boot.

Solve this problem by setting a conservative upper bound for
getrandom(2) wait. Kill the process, instead of returning an
error code, because otherwise crypto-sensitive applications
may revert to less secure mechanisms (e.g. /dev/urandom). We
__deeply encourage__ system integrators and distribution
builders not to considerably increase this value: during
system boot, you either have entropy, or you don't. And if you
didn't have entropy, it will stay like this forever, because
if you had, you wouldn't have blocked in the first place. It's
an atomic "either/or" situation, with no middle ground. Please
think twice.

Ideally, systems would be configured with hardware random
number generators, and/or configured to trust the CPU-provided
RNGs (CONFIG_RANDOM_TRUST_CPU) or boot-loader provided ones
(CONFIG_RANDOM_TRUST_BOOTLOADER). In addition, userspace
should generate cryptographic keys only as late as possible,
when they are needed, instead of during early boot. (For
non-cryptographic use cases, such as dictionary seeds or MIT
Magic Cookies, other mechanisms such as /dev/urandom or
random(3) may be more appropriate.)

Sounds good?

thanks,

--
Ahmed Darwish
http://darwish.chasingpointers.com

2019-09-17 12:47:47

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

17.09.2019 17:30, Ahmed S. Darwish wrote:
> On Tue, Sep 17, 2019 at 08:11:56AM -0400, Theodore Y. Ts'o wrote:
>> On Tue, Sep 17, 2019 at 09:33:40AM +0200, Martin Steigerwald wrote:
>>> Willy Tarreau - 17.09.19, 07:24:38 CEST:
>>>> On Mon, Sep 16, 2019 at 06:46:07PM -0700, Matthew Garrett wrote:
>>>>>> Well, the patch actually made getrandom() return an error too, but
>>>>>> you seem more interested in the hypotheticals than in arguing
>>>>>> actualities.
>>>>> If you want to be safe, terminate the process.
>>>>
>>>> This is an interesting approach. At least it will cause bug reports in
>>>> application using getrandom() in an unreliable way and they will
>>>> check for other options. Because one of the issues with systems that
>>>> do not finish to boot is that usually the user doesn't know what
>>>> process is hanging.
>>>
>>
>> I would be happy with a change which changes getrandom(0) to send a
>> kill -9 to the process if it is called too early, with a new flag,
>> getrandom(GRND_BLOCK) which blocks until entropy is available. That
>> leaves it up to the application developer to decide what behavior they
>> want.
>>
>
> Yup, I'm convinced that's the sanest option too. I'll send a final RFC
> patch tonight implementing the following:
>
> config GETRANDOM_CRNG_ENTROPY_MAX_WAIT_MS
> int
> default 3000
> help
> Default max wait in milliseconds, for the getrandom(2) system
> call when asking for entropy from the urandom source, until
> the Cryptographic Random Number Generator (CRNG) gets
> initialized. Any process exceeding this duration for entropy
> wait will get killed by the kernel. The maximum wait can be
> overridden through the "random.getrandom_max_wait_ms" kernel
> boot parameter. Rationale follows.
>
> When the getrandom(2) system call was created, it came with
> the clear warning: "Any userspace program which uses this new
> functionality must take care to assure that if it is used
> during the boot process, that it will not cause the init
> scripts or other portions of the system startup to hang
> indefinitely."
>
> Unfortunately, due to multiple factors, including not having
> this warning written in a scary enough language in the
> manpages, and due to glibc since v2.25 implementing a BSD-like
> getentropy(3) in terms of getrandom(2), modern user-space is
> calling getrandom(2) in the boot path everywhere.
>
> Embedded Linux systems were first hit by this, and reports of
> embedded system "getting stuck at boot" began to be
> common. Over time, the issue began to even creep into consumer
> level x86 laptops: mainstream distributions, like Debian
> Buster, began to recommend installing haveged as a workaround,
> just to let the system boot.
>
> Filesystem optimizations in EXT4 and XFS exaggerated the
> problem, due to aggressive batching of IO requests, and thus
> minimizing sources of entropy at boot. This led to large
> delays until the kernel's Cryptographic Random Number
> Generator (CRNG) got initialized, and thus having reports of
> getrandom(2) indefinitely stuck at boot.
>
> Solve this problem by setting a conservative upper bound for
> getrandom(2) wait. Kill the process, instead of returning an
> error code, because otherwise crypto-sensitive applications
> may revert to less secure mechanisms (e.g. /dev/urandom). We
> __deeply encourage__ system integrators and distribution
> builders not to considerably increase this value: during
> system boot, you either have entropy, or you don't. And if you
> didn't have entropy, it will stay like this forever, because
> if you had, you wouldn't have blocked in the first place. It's
> an atomic "either/or" situation, with no middle ground. Please
> think twice.
>
> Ideally, systems would be configured with hardware random
> number generators, and/or configured to trust the CPU-provided
> RNGs (CONFIG_RANDOM_TRUST_CPU) or boot-loader provided ones
> (CONFIG_RANDOM_TRUST_BOOTLOADER). In addition, userspace
> should generate cryptographic keys only as late as possible,
> when they are needed, instead of during early boot. (For
> non-cryptographic use cases, such as dictionary seeds or MIT
> Magic Cookies, other mechanisms such as /dev/urandom or
> random(3) may be more appropriate.)
>
> Sounds good?
>
> thanks,
>
> --
> Ahmed Darwish
> http://darwish.chasingpointers.com
>

This would fail the litmus test that started this thread, re-explained
below.

0. Linus applies your patch.
1. A kernel release happens, and it boots fine.
2. Ted Ts'o invents yet another brilliant ext4 optimization, and it gets
merged.
3. Somebody discovers that the new kernel kills all his processes, up to
and including gnome-session, and that's obviously a regression.
4. Linus is forced to revert (2), nobody wins.

--
Alexander E. Patrakov


Attachments:
smime.p7s (3.96 kB)
S/MIME cryptographic signature

2019-09-17 12:48:32

by Willy Tarreau

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 12:30:15PM +0000, Ahmed S. Darwish wrote:
> Sounds good?

Sounds good to me except that I'd like to have the option to get
poor randoms. getrandom() is used when /dev/urandom is not accessible
or painful to use. Until we provide applications with a solution to
this fairly common need, the problem will continue to regularly pop
up, in a different way ("my application randomly crashes at boot").
Let's get GRND_INSECURE in addition to your change and I think all
needs will be properly covered.

Thanks,
Willy

2019-09-17 13:11:55

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

17.09.2019 17:11, Theodore Y. Ts'o wrote:
> There are only two ways out of this mess. The first option is we take
> functionality away from a userspace author who Really Wants A Secure
> Random Number Generator. And there are an awful lot of programs who
> really want secure crypto, because this is not a hypothetical. The
> result in "Mining your P's and Q's" did happen before. If we forget
> the history, we are doomed to repeat it.

You cannot take away functionality that does not really exist. Every
time somebody tries to use it, there is big news: "the boot process
is blocked on application FOO", followed by an insecure fallback to
/dev/urandom in the said application or library.

Regarding the "Mining your P's and Q's" paper: I would say it is a
combination of TWO faults, only one of which (poor, or, as explained
below, "marginally poor" entropy) is discussed and the other one (not
really sound crypto when deriving the RSA key from the
presumably-available entropy) is ignored.

The authors of the paper factored the weak keys by applying the
generalized GCD algorithm, thus looking for common factors in the RSA
public keys. For two RSA public keys to be detected as faulty, they must
share exactly one of their prime factors. In other words: repeated keys
were specifically excluded from the study by the paper authors.

Sharing only one of the two primes means that the systems in
question behaved identically when they generated the first prime, but
diverged (possibly due to the extra entropy becoming available) when
they generated the second one. And asking for randomness for p and for q
separately is what I would call the application bug here that nobody
wants to talk about: both p and q should have been derived from a CSPRNG
seeded by a single read from a random source. If that practice were
followed, then it would either result in a duplicate key (which is not
as bad as a factorable one), or in completely unrelated keys.
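
A minimal sketch of that single-read approach (an illustration under stated
assumptions, not code from the paper): one blocking read seeds a simple
hash-counter expander, and candidate bytes for both primes are drawn from it.
It is not a vetted DRBG construction, primality testing is omitted, and
OpenSSL's SHA256() is assumed for the hash:

#include <stdint.h>
#include <string.h>
#include <sys/random.h>
#include <openssl/sha.h>

struct drbg { uint8_t seed[32]; uint64_t counter; };

static int drbg_init(struct drbg *d)
{
    d->counter = 0;
    /* the only read from the entropy source for the whole key generation */
    return getrandom(d->seed, sizeof(d->seed), 0) == (ssize_t)sizeof(d->seed) ? 0 : -1;
}

static void drbg_bytes(struct drbg *d, uint8_t *out, size_t len)
{
    uint8_t block[SHA256_DIGEST_LENGTH];
    uint8_t msg[sizeof(d->seed) + sizeof(d->counter)];

    while (len) {
        size_t n = len < sizeof(block) ? len : sizeof(block);

        memcpy(msg, d->seed, sizeof(d->seed));
        memcpy(msg + sizeof(d->seed), &d->counter, sizeof(d->counter));
        SHA256(msg, sizeof(msg), block);        /* block = H(seed || counter) */
        memcpy(out, block, n);
        out += n;
        len -= n;
        d->counter++;
    }
}

/* Candidate material for *both* primes comes from the same seeded stream:
 *   drbg_bytes(&d, p_candidate, 128);
 *   drbg_bytes(&d, q_candidate, 128);
 */

(As the follow-up message below notes, a duplicated seed then yields fully
duplicate keys, which are still weak; the sketch only shows the seeding
discipline.)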

--
Alexander E. Patrakov


Attachments:
smime.p7s (3.96 kB)
S/MIME cryptographic signature

2019-09-17 13:39:03

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

17.09.2019 18:11, Alexander E. Patrakov wrote:
> 17.09.2019 17:11, Theodore Y. Ts'o wrote:
>> There are only two ways out of this mess.  The first option is we take
>> functionality away from a userspace author who Really Wants A Secure
>> Random Number Generator.  And there are an awful lot of programs who
>> really want secure crypto, because this is not a hypothetical.  The
>> result in "Mining your P's and Q's" did happen before.  If we forget
>> the history, we are doomed to repeat it.
>
> You cannot take away functionality that does not really exist. Every
> time somebody tries to use it, there is big news: "the boot process
> is blocked on application FOO", followed by an insecure fallback to
> /dev/urandom in the said application or library.
>
> Regarding the "Mining your P's and Q's" paper: I would say it is a
> combination of TWO faults, only one of which (poor, or, as explained
> below, "marginally poor" entropy) is discussed and the other one (not
> really sound crypto when deriving the RSA key from the
> presumably-available entropy) is ignored.
>
> The authors of the paper factored the weak keys by applying the
> generalized GCD algorithm, thus looking for common factors in the RSA
> public keys. For two RSA public keys to be detected as faulty, they must
> share exactly one of their prime factors. In other words: repeated keys
> were specifically excluded from the study by the paper authors.
>
> Sharing only one of the two primes means that the systems in
> question behaved identically when they generated the first prime, but
> diverged (possibly due to the extra entropy becoming available) when
> they generated the second one. And asking for randomness for p and for q
> separately is what I would call the application bug here that nobody
> wants to talk about: both p and q should have been derived from a CSPRNG
> seeded by a single read from a random source. If that practice were
> followed, then it would either result in a duplicate key (which is not
> as bad as a factorable one), or in completely unrelated keys.

I take this back. Of course, completely duplicate keys are weak keys,
and they are even more dangerous because they are not distinguishable
from intentionally copied good keys by the method in the paper.

--
Alexander E. Patrakov


Attachments:
smime.p7s (3.96 kB)
S/MIME cryptographic signature

2019-09-17 18:30:51

by Lennart Poettering

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, 17.09.19 08:11, Theodore Y. Ts'o ([email protected]) wrote:

> On Tue, Sep 17, 2019 at 09:33:40AM +0200, Martin Steigerwald wrote:
> > Willy Tarreau - 17.09.19, 07:24:38 CEST:
> > > On Mon, Sep 16, 2019 at 06:46:07PM -0700, Matthew Garrett wrote:
> > > > >Well, the patch actually made getrandom() return an error too, but
> > > > >you seem more interested in the hypotheticals than in arguing
> > > > >actualities.
> > > > If you want to be safe, terminate the process.
> > >
> > > This is an interesting approach. At least it will cause bug reports in
> > > application using getrandom() in an unreliable way and they will
> > > check for other options. Because one of the issues with systems that
> > > do not finish to boot is that usually the user doesn't know what
> > > process is hanging.
> >
>
> I would be happy with a change which changes getrandom(0) to send a
> kill -9 to the process if it is called too early, with a new flag,
> getrandom(GRND_BLOCK) which blocks until entropy is available. That
> leaves it up to the application developer to decide what behavior they
> want.

Note that calling getrandom(0) "too early" is not something people do
on purpose. It happens by accident, i.e. because we live in a world
where SSH or HTTPS or so is run in the initrd already, and in a world
where booting sometimes can be very very fast. So even if you write a
program and you think "this stuff should run late I'll just
getrandom(0)" it might not actually be that case IRL because people
deploy it a slightly bit differently than you initially thought in a
slightly differently equipped system with other runtime behaviour...

Lennart

2019-09-17 18:32:31

by Lennart Poettering

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, 17.09.19 12:30, Ahmed S. Darwish ([email protected]) wrote:

> Ideally, systems would be configured with hardware random
> number generators, and/or configured to trust the CPU-provided
> RNG's (CONFIG_RANDOM_TRUST_CPU) or boot-loader provided ones
> (CONFIG_RANDOM_TRUST_BOOTLOADER). In addition, userspace
> should generate cryptographic keys only as late as possible,
> when they are needed, instead of during early boot. (For
> non-cryptographic use cases, such as dictionary seeds or MIT
> Magic Cookies, other mechanisms such as /dev/urandom or
> random(3) may be more appropropriate.)
>
> Sounds good?

This sounds mean. You make apps pay for something they aren't really
at fault for.

I mean, in the cloud people typically put together images that are
replicated to many systems, and as a first step generate an SSH key on
the individual system. In fact, most big distros tend to ship SSH that
is precisely set up this way: on first boot the SSH key is
generated. They tend to call getrandom(0) for this right now, and
rightfully so. Now suddenly you kill them because they are doing
everything correctly? Those systems aren't going to be more useful if
they have no SSH key at all than they would be if they would hang at
boot: either way you can't log in.

Here's what I'd propose:

1) Add GRND_INSECURE to get those users of getrandom() who do not need
high quality entropy off its use (systemd has uses for this, for
seeding hash tables for example), thus reducing the places where
things might block.

2) Add a kernel log message if a getrandom(0) client hung for 15s or
more, explaining the situation briefly, but not otherwise changing
behaviour.

3) Change systemd-random-seed.service to log to console in the same
case, blocking boot cleanly and discoverably.

I am not a fan of randomly killing userspace processes that just
happened to be the unlucky ones, to call this first... I see no
benefit in killing stuff over letting boot hang in a discoverable way.

Lennart

2019-09-17 18:36:26

by Willy Tarreau

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 05:57:43PM +0200, Lennart Poettering wrote:
> Note that calling getrandom(0) "too early" is not something people do
> on purpose. It happens by accident, i.e. because we live in a world
> where SSH or HTTPS or so is run in the initrd already, and in a world
> where booting sometimes can be very very fast.

It's not an accident, it's a lack of understanding of the impacts
from the people who package the systems. Generating an SSH key from
an initramfs without thinking where the randomness used for this could
come from is not accidental, it's a lack of experience that will be
fixed once they start to collect such reports. And those who absolutely
need their SSH daemon or HTTPS server for a recovery image in initramfs
can very well feed fake entropy by dumping whatever they want into
/dev/random to make it possible to build temporary keys for use within
this single session. At least all supposedly incorrect use will be made
*on purpose* and will still be possible to match what users need.
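
As a sketch of what feeding entropy on purpose looks like in practice: note
that a plain write() to /dev/random only mixes the data in without crediting
it, so it does not unblock getrandom(); crediting requires the RNDADDENTROPY
ioctl and CAP_SYS_ADMIN, which makes the decision an explicit, privileged one:

#include <fcntl.h>
#include <linux/random.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Mix 'data' into the input pool and credit it as len*8 bits of entropy.
 * Requires CAP_SYS_ADMIN; the caller vouches for the quality of 'data'. */
static int credit_entropy(const void *data, int len)
{
    struct rand_pool_info *info;
    int fd, r;

    info = malloc(sizeof(*info) + len);
    if (!info)
        return -1;
    info->entropy_count = len * 8;          /* claimed entropy, in bits */
    info->buf_size = len;
    memcpy(info->buf, data, len);

    fd = open("/dev/random", O_WRONLY);
    if (fd < 0) {
        free(info);
        return -1;
    }
    r = ioctl(fd, RNDADDENTROPY, info);
    close(fd);
    free(info);
    return r;
}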

> So even if you write a
> program and you think "this stuff should run late I'll just
> getrandom(0)" it might not actually be that case IRL because people
> deploy it slightly differently than you initially thought in a
> slightly differently equipped system with other runtime behaviour...

I agree with this, it's precisely because I think we should not restrict
userspace capabilities that I want the issue addressed in a way that lets
users do what they need instead of relying on dangerous workarounds. Just
googling for "mknod /dev/random c 1 9" returns tens, maybe hundreds of
pages all explaining how to fix the problem of non-booting systems. It
simply proves that the kernel is not the place to decide what users are
allowed to do. Let's give them the tools to work correctly and be
responsible for their choices. They just need to be hit by bad choices
to get some feedback from the field other than a new list of well-known
SSH keys.

Willy

2019-09-17 18:36:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 9:08 AM Lennart Poettering <[email protected]> wrote:
>
> Here's what I'd propose:

So I think this is ok, but I have another proposal. Before I post that
one, though, I just wanted to point out:

> 1) Add GRND_INSECURE to get those users of getrandom() who do not need
> high quality entropy off its use (systemd has uses for this, for
> seeding hash tables for example), thus reducing the places where
> things might block.

I really think that the logic should be the other way around.

The getrandom() users that don't need high quality entropy are the
ones that don't really think about this, and so _they_ shouldn't be
the ones that have to explicitly state anything. To those users,
"random is random". By definition they don't much care, and quite
possibly they don't even know what "entropy" really means in that
context.

The ones that *do* want high security randomness should be the ones
that know that "random" means different things to different people,
and that randomness is hard.

So the onus should be on them to say that "yes, I'm one of those
people willing to wait".

That's why I'd like to see GRND_SECURE instead. That's kind of what
GRND_RANDOM is right now, but it went overboard and it's not useful
even to the people who do want secure random numbers.

Besides, the GRND_RANDOM naming doesn't really help the people who
don't know anyway, so it's just bad in so many ways. We should
probably just get rid of that flag entirely and make it imply
GRND_SECURE without the overdone entropy accounting, but that's a
separate issue.

When we do add GRND_SECURE, we should also add the GRND_INSECURE just
to allow people to mark their use, and to avoid the whole existing
confusion about "0".

> 2) Add a kernel log message if a getrandom(0) client hung for 15s or
> more, explaining the situation briefly, but not otherwise changing
> behaviour.

The problem is that when you have some graphical boot, you'll not even
see the kernel messages ;(

I do agree that a message is a good idea regardless, but I don't think
it necessarily solves the problems except for developers.

> 3) Change systemd-random-seed.service to log to console in the same
> case, blocking boot cleanly and discoverably.

So I think systemd-random-seed might as well just use a new
GRND_SECURE, and then not even have to worry about it.

That said, I think I have a suggestion that everybody can live with -
even if they might not be _happy_ about it. See next email.

> I am not a fan of randomly killing userspace processes that just
> happened to be the unlucky ones, to call this first... I see no
> benefit in killing stuff over letting boot hang in a discoverable way.

Absolutely agreed. The point was to not break things.

Linus

2019-09-17 18:37:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 12:33 AM Martin Steigerwald <[email protected]> wrote:
>
>> So yes, that would make it harder to abuse the API, but not
> impossible. Which may still be good, I don't know.

So the real problem is not people abusing the ABI per se. Yes, I was a
bit worried about that too, but it's not the cause of the immediate
issue.

The real problem is that "getrandom(0)" is really _convenient_ for
people who just want random numbers - and not at all the "secure"
kind.

And it's convenient, and during development and testing, it always
"just works", because it doesn't ever block in any normal situation.

And then you deploy it, and on some poor user's machine it *does*
block, because the program now encounters the "oops, no entropy"
situation that it never ever encountered on the development machine,
because the testing there was mainly done not during booting, but the
developer also probably had a much more modern machine that had
rdrand, and that quite possibly also had more services enabled at
bootup etc so even without rdrand it got tons of entropy.

That's why

(a) killing the process is _completely_ silly. It misses the whole
point of the problem in the first place and only makes things much
worse.

(b) we should just change getrandom() and add that GRND_SECURE flag
instead. Because the current API is fundamentally confusing. If you
want secure random numbers, you should really deeply _know_ about it,
and think about it, rather than have it be the "oh, don't even bother
passing any flags, it's secure by default".

(c) the timeout approach isn't wonderful, but it at least helps with
the "this was never tested under those circumstances" kind of problem.

Note that the people who actually *thought* about getrandom() and use
it correctly should already handle error returns (even for the
blocking version), because getrandom() can already return EINTR. So
the argument that we should cater primarily to the secure key people
is not all that strong. We should be able to return EINTR, and the
people who *thought* about blocking and about entropy should be fine.

And gdm and other silly random users that never wanted entropy in the
first place, just "random" random numbers, wouldn't be in the
situation they are now.

That said - looking at some of the problematic traces that Ahmed
posted for his bootup problem, I actually think we can use *another*
heuristic to solve the problem. Namely just looking at how much
randomness the caller wants.

The processes that ask for randomness for an actual secure key have a
very fundamental constraint: they need enough randomness for the key
to be secure in the first place.

But look at what gnome-shell and gnome-session-b does:

https://lore.kernel.org/linux-ext4/20190912034421.GA2085@darwi-home-pc/

and most of them already set GRND_NONBLOCK, but look at the
problematic one that actually causes the boot problem:

gnome-session-b-327 4.400620: getrandom(16 bytes, flags = 0)

and here the big clue is: "Hey, it only asks for 128 bits of randomness".

Does anybody believe that 128 bits of randomness is a good basis for a
long-term secure key? Even if the key itself contains more than that, if
you are generating a long-term secure key in this day and age, you had
better be asking for more than 128 bits of actual unpredictable base
data. So just based on the size of the request we can determine that
this is not hugely important.

Compare that to the case later on for something that seems to ask for
actual interesting randomness, and - just judging by the name -
probably even has a reason for it:

gsd-smartcard-388 51.433924: getrandom(110 bytes, flags = 0)
gsd-smartcard-388 51.433936: getrandom(256 bytes, flags = 0)

big difference.

End result: I would propose the attached patch.
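
For illustration (the attached patch.diff is not reproduced here, so this is
only the size-based idea expressed as a standalone decision function, with a
made-up 32-byte threshold):

#include <stdbool.h>
#include <stddef.h>

enum grnd_wait {
    WAIT_NONE,              /* CRNG ready: just return random bytes */
    WAIT_BOUNDED,           /* small request: wait a bounded time, warn, proceed */
    WAIT_FOR_ENTROPY        /* large request: caller really wants entropy, block */
};

static enum grnd_wait getrandom_wait_policy(size_t nbytes, bool crng_ready)
{
    if (crng_ready)
        return WAIT_NONE;
    if (nbytes <= 32)       /* <= 256 bits: "just give me random numbers" */
        return WAIT_BOUNDED;
    return WAIT_FOR_ENTROPY;
}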

Ahmed, can you just verify that it works for you (obviously with the
ext4 plugging reinstated)? It looks like it should "obviously" fix
things, but still...

Linus


Attachments:
patch.diff (1.70 kB)

2019-09-17 18:40:03

by Reindl Harald

[permalink] [raw]
Subject: Re: Linux 5.3-rc8



On 17.09.19 at 18:23, Linus Torvalds wrote:
> I do agree that a message is a good idea regardless, but I don't think
> it necessarily solves the problems except for developers

sadly in our current world developers and maintainers don't read any logs
and as long as it compiles and boots it works and can be pushed :-(

they even argue instead of fixing a damned line in a textfile which could have
been fixed 8 years in advance and i have written a ton of such reports
for F30 not talking about 15 others where software spits warnings with
the source file and line into the syslog and nobody out there gives a
damn about it

one example of many
https://bugzilla.redhat.com/show_bug.cgi?id=1748322

the only way you can get developers to clean up their mess these days is
to spit it straight into their face in a modal window every time they log in
but how to exclude innocent endusers.....

half of my "rsyslog.conf" is to filter out stuff i can't fix anyways to
have my peace when calling the script below every time i reboot whatever
linux machine

the 'usb_serial_init - returning with error' is BTW what Linux prints when you
boot with 'nousb usbcore.nousb'

------------------

[root@srv-rhsoft:~]$ cat /scripts/system-errors.sh
#!/usr/bin/dash
# shared exclusion list: messages that cannot be fixed locally anyway
filter_noise() {
        grep -v 'Perf event create on CPU' | grep -v 'Hardware RNG Device' |
        grep -v 'TPM RNG Device' | grep -v 'Correctable Errors collector initialized' |
        grep -v 'error=format-security' | grep -v 'MHD_USE_THREAD_PER_CONNECTION' |
        grep -v 'usb_serial_init - returning with error' | grep -v 'systemd-journald.service' |
        grep -v 'usb_serial_init - registering generic driver failed'
}

dmesg -T | grep --color -i warn | filter_noise
grep --color -i warn /var/log/messages | filter_noise
dmesg -T | grep --color -i fail | grep -v 'BAR 13' | filter_noise
grep --color -i fail /var/log/messages | grep -v 'BAR 13' | filter_noise
dmesg -T | grep --color -i error | filter_noise
grep --color -i error /var/log/messages | filter_noise
grep --color -i "scheduling restart" /var/log/messages | grep -v 'systemd-journald.service'
[root@srv-rhsoft:~]$

2019-09-17 18:49:41

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

17.09.2019 21:27, Linus Torvalds wrote:
> On Tue, Sep 17, 2019 at 12:33 AM Martin Steigerwald <[email protected]> wrote:
>>
>> So yes, that would make it harder to abuse the API, but not
>> impossible. Which may still be good, I don't know.
>
> So the real problem is not people abusing the ABI per se. Yes, I was a
> bit worried about that too, but it's not the cause of the immediate
> issue.
>
> The real problem is that "getrandom(0)" is really _convenient_ for
> people who just want random numbers - and not at all the "secure"
> kind.
>
> And it's convenient, and during development and testing, it always
> "just works", because it doesn't ever block in any normal situation.
>
> And then you deploy it, and on some poor user's machine it *does*
> block, because the program now encounters the "oops, no entropy"
> situation that it never ever encountered on the development machine,
> because the testing there was mainly done not during booting, but the
> developer also probably had a much more modern machine that had
> rdrand, and that quite possibly also had more services enabled at
> bootup etc so even without rdrand it got tons of entropy.
>
> That's why
>
> (a) killing the process is _completely_ silly. It misses the whole
> point of the problem in the first place and only makes things much
> worse.
>
> (b) we should just change getrandom() and add that GRND_SECURE flag
> instead. Because the current API is fundamentally confusing. If you
> want secure random numbers, you should really deeply _know_ about it,
> and think about it, rather than have it be the "oh, don't even bother
> passing any flags, it's secure by default".
>
> (c) the timeout approach isn't wonderful, but it at least helps with
> the "this was never tested under those circumstances" kind of problem.
>
> Note that the people who actually *thought* about getrandom() and use
> it correctly should already handle error returns (even for the
> blocking version), because getrandom() can already return EINTR. So
> the argument that we should cater primarily to the secure key people
> is not all that strong. We should be able to return EINTR, and the
> people who *thought* about blocking and about entropy should be fine.
>
> And gdm and other silly random users that never wanted entropy in the
> first place, just "random" random numbers, wouldn't be in the
> situation they are now.
>
> That said - looking at some of the problematic traces that Ahmed
> posted for his bootup problem, I actually think we can use *another*
> heuristic to solve the problem. Namely just looking at how much
> randomness the caller wants.
>
> The processes that ask for randomness for an actual secure key have a
> very fundamental constraint: they need enough randomness for the key
> to be secure in the first place.
>
> But look at what gnome-shell and gnome-session-b does:
>
> https://lore.kernel.org/linux-ext4/20190912034421.GA2085@darwi-home-pc/
>
> and most of them already set GRND_NONBLOCK, but look at the
> problematic one that actually causes the boot problem:
>
> gnome-session-b-327 4.400620: getrandom(16 bytes, flags = 0)
>
> and here the big clue is: "Hey, it only asks for 128 bits of randomness".
>
> Does anybody believe that 128 bits of randomness is a good basis for a
> long-term secure key? Even if the key itself contains more than that, if
> you are generating a long-term secure key in this day and age, you had
> better be asking for more than 128 bits of actual unpredictable base
> data. So just based on the size of the request we can determine that
> this is not hugely important.
>
> Compare that to the case later on for something that seems to ask for
> actual interesting randomness, and - just judging by the name -
> probably even has a reason for it:
>
> gsd-smartcard-388 51.433924: getrandom(110 bytes, flags = 0)
> gsd-smartcard-388 51.433936: getrandom(256 bytes, flags = 0)
>
> big difference.
>
> End result: I would propose the attached patch.
>
> Ahmed, can you just verify that it works for you (obviously with the
> ext4 plugging reinstated)? It looks like it should "obviously" fix
> things, but still...

I have looked at the patch, but have not tested it.

I am worried that the getrandom delays will be serialized, because
processes sometimes run one after another. If there are enough
chained/dependent processes that ask for randomness before it is ready,
the end result is still a too-big delay, essentially a failed boot.

In other words: your approach of adding delays only makes sense for
heavily parallelized boot, which may not be the case, especially for
embedded systems that don't like systemd.

--
Alexander E. Patrakov

2019-09-17 18:50:16

by Matthew Garrett

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 09:27:44AM -0700, Linus Torvalds wrote:

> Does anybody believe that 128 bits of randomness is a good basis for a
> long-term secure key?

Yes, it's exactly what you'd expect for an AES 128 key, which is still
considered to be secure.

--
Matthew Garrett | [email protected]

2019-09-17 18:54:26

by Lennart Poettering

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, 17.09.19 18:21, Willy Tarreau ([email protected]) wrote:

> On Tue, Sep 17, 2019 at 05:57:43PM +0200, Lennart Poettering wrote:
> > Note that calling getrandom(0) "too early" is not something people do
> > on purpose. It happens by accident, i.e. because we live in a world
> > where SSH or HTTPS or so is run in the initrd already, and in a world
> > where booting sometimes can be very very fast.
>
> It's not an accident, it's a lack of understanding of the impacts
> from the people who package the systems. Generating an SSH key from
> an initramfs without thinking where the randomness used for this could
> come from is not accidental, it's a lack of experience that will be
> fixed once they start to collect such reports. And those who absolutely
> need their SSH daemon or HTTPS server for a recovery image in initramfs
> can very well feed fake entropy by dumping whatever they want into
> /dev/random to make it possible to build temporary keys for use within
> this single session. At least all supposedly incorrect use will be made
> *on purpose* and will still be possible to match what users need.

What do you expect these systems to do though?

I mean, think about general purpose distros: they put together live
images that are supposed to work on a myriad of similar (as in: same
arch) but otherwise very different systems (i.e. VMs that might lack
any form of RNG source the same as beefy servers with multiple sources
the same as older netbooks with few and crappy sources, …). They can't
know what the specific hw will provide or won't. It's not their
incompetence that they build the image like that. It's a common, very
common usecase to install a system via SSH, and it's also very common
to have very generic images for a large number of varied systems to run
on.

Lennart

--
Lennart Poettering, Berlin

2019-09-17 18:57:00

by Willy Tarreau

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 05:34:56PM +0100, Matthew Garrett wrote:
> On Tue, Sep 17, 2019 at 09:27:44AM -0700, Linus Torvalds wrote:
>
> > Does anybody believe that 128 bits of randomness is a good basis for a
> > long-term secure key?
>
> Yes, it's exactly what you'd expect for an AES 128 key, which is still
> considered to be secure.

AES keys are for symmetrical encryption and as such are short-lived.
We're back to what Linus was saying about the fact that our urandom is
already very good for such use cases, it should just not be used to
produce long-lived keys (i.e. asymmetrical).

However, regarding this precise patch, I'm worried that the delays
will add up. I think that once we've failed to wait for a first
process, we've broken any hypothetical trust in terms of random quality
so there's no point continuing to wait for future requests.

Willy

2019-09-17 18:58:31

by Matthew Garrett

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 07:16:41PM +0200, Willy Tarreau wrote:
> On Tue, Sep 17, 2019 at 05:34:56PM +0100, Matthew Garrett wrote:
> > On Tue, Sep 17, 2019 at 09:27:44AM -0700, Linus Torvalds wrote:
> >
> > > Does anybody believe that 128 bits of randomness is a good basis for a
> > > long-term secure key?
> >
> > Yes, it's exactly what you'd expect for an AES 128 key, which is still
> > considered to be secure.
>
> AES keys are for symmetrical encryption and as such are short-lived.
> We're back to what Linus was saying about the fact that our urandom is
> already very good for such use cases, it should just not be used to
> produce long-lived keys (i.e. asymmetrical).

AES keys are used for a variety of long-lived purposes (eg, disk
encryption).

--
Matthew Garrett | [email protected]

2019-09-17 19:04:21

by Matthew Garrett

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 06:20:02PM +0100, Matthew Garrett wrote:

> AES keys are used for a variety of long-lived purposes (eg, disk
> encryption).

And as an example of when we'd want to do that during early boot - swap
is frequently encrypted with a random key generated on each boot, but
it's still important for that key to be strong in order to avoid someone
being able to recover the contents of swap.

--
Matthew Garrett | [email protected]

2019-09-17 19:12:39

by Lennart Poettering

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, 17.09.19 09:27, Linus Torvalds ([email protected]) wrote:

> But look at what gnome-shell and gnome-session-b does:
>
> https://lore.kernel.org/linux-ext4/20190912034421.GA2085@darwi-home-pc/
>
> and most of them already set GRND_NONBLOCK, but look at the
> problematic one that actually causes the boot problem:
>
> gnome-session-b-327 4.400620: getrandom(16 bytes, flags = 0)
>
> and here the big clue is: "Hey, it only asks for 128 bits of
> randomness".

I don't think this is a good check to make.

In fact most cryptography folks say taking out more than 256bit is
never going to make sense, that's why BSD getentropy() even returns an
error if you ask for more than 256bit. (and glibc's getentropy()
wrapper around getrandom() enforces the same size limit btw)

On the BSDs the kernel's getentropy() call is primarily used to seed
their libc's arc4random() every now and then, and userspace is
supposed to use only arc4random(). I am pretty sure we should do the
same on Linux in the long run. i.e. the idea that everyone uses the
kernel syscall directly sounds wrong to me, and designing the syscall
so that everyone calls it is hence wrong too.

On the BSDs getentropy() is hence unconditionally blocking, without
any flags or so, which makes sense since it's not supposed to be
user-facing really so much, but more a basic primitive for low-level
userspace infrastructure only, that is supposed to be wrapped
non-trivially to be useful. (that's at least how I understood their
APIs)
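
A minimal sketch of that seed-once pattern using the glibc wrapper;
getentropy() exists in glibc since 2.25 and rejects requests larger than 256
bytes, and the userspace CSPRNG it would feed is left as a comment since that
generator is exactly what Linux userspace still lacks:

#include <sys/random.h>
#include <unistd.h>

/* Pull one bounded seed from the kernel; everything else should come from a
 * userspace arc4random-style CSPRNG reseeded from values like this one. */
static int get_seed_256bits(unsigned char seed[32])
{
    if (getentropy(seed, 32) != 0)      /* fails with EIO for lengths > 256 */
        return -1;
    /* ...hand 'seed' to the userspace CSPRNG here and draw from that... */
    return 0;
}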

> Does anybody believe that 128 bits of randomness is a good basis for a
> long-term secure key? Even if the key itself contains more than that, if
> you are generating a long-term secure key in this day and age, you had
> better be asking for more than 128 bits of actual unpredictable base
> data. So just based on the size of the request we can determine that
> this is not hugely important.

aes128 is very common today. It's what baseline security is.

I have the suspicion crypto folks would argue that 128…256 is the only
sane range for cryptographic keys...

Lennart

--
Lennart Poettering, Berlin

2019-09-17 19:13:42

by Willy Tarreau

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 07:13:28PM +0200, Lennart Poettering wrote:
> On Tue, 17.09.19 18:21, Willy Tarreau ([email protected]) wrote:
>
> > On Tue, Sep 17, 2019 at 05:57:43PM +0200, Lennart Poettering wrote:
> > > Note that calling getrandom(0) "too early" is not something people do
> > > on purpose. It happens by accident, i.e. because we live in a world
> > > where SSH or HTTPS or so is run in the initrd already, and in a world
> > > where booting sometimes can be very very fast.
> >
> > It's not an accident, it's a lack of understanding of the impacts
> > from the people who package the systems. Generating an SSH key from
> > an initramfs without thinking where the randomness used for this could
> > come from is not accidental, it's a lack of experience that will be
> > fixed once they start to collect such reports. And those who absolutely
> > need their SSH daemon or HTTPS server for a recovery image in initramfs
> > can very well feed fake entropy by dumping whatever they want into
> > /dev/random to make it possible to build temporary keys for use within
> > this single session. At least all supposedly incorrect use will be made
> > *on purpose* and will still be possible to match what users need.
>
> What do you expect these systems to do though?
>
> I mean, think about general purpose distros: they put together live
> images that are supposed to work on a myriad of similar (as in: same
> arch) but otherwise very different systems (i.e. VMs that might lack
> > any form of RNG source the same as beefy servers with multiple sources
> the same as older netbooks with few and crappy sources, ...). They can't
> know what the specific hw will provide or won't. It's not their
> incompetence that they build the image like that. It's a common, very
> common usecase to install a system via SSH, and it's also very common
> to have very generic images for a large number of varied systems to run
> on.

I'm totally fine with installing the system via SSH, using a temporary
SSH key. I do make a strong distinction between the installation phase
and the final deployment. The SSH key used *for installation* doesn't
need to be the same as the final one. And very often, by the end of the
installation we'll have produced enough entropy to generate a correct
key.

Just because people got used to doing things the wrong way out of
ignorance of how randomness works, and raised this to an industrial
level, doesn't mean they cannot adapt a little bit. If they insist on
producing an SSH key immediately at boot, you can be sure that many of
the ones that never fail are probably bad, because they probably used
some of the tricks mentioned in this thread (like the fairly common
mknod trick that can make sense in a temporary system installation
image) :-/

I maintain that we don't need the same amount of entropy to run a
regular system as to create a new key, and that creating such a key as
the very first action is therefore not a reasonable thing to do. I'm
not saying that doing things correctly is as easy, but it's not
impossible at all: many of us have already used systems which run
something like dropbear with a temporary key on the install image but
openssh in the final system image.

And even when booting off a pre-configured final image we could
easily imagine that the ssh service detects lack of entropy and
runs with a temporary key that is not saved, and in the background
starts a process trying to produce a final key for later use.
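A minimal sketch of that kind of detection, assuming the service simply
probes the pool with GRND_NONBLOCK and falls back to a throwaway key
when it is not ready yet (the daemon-side decision is of course
hypothetical):

    #include <errno.h>
    #include <stdio.h>
    #include <sys/random.h>

    /* 1 = pool is initialized, 0 = a getrandom(..., 0) call would block now. */
    static int pool_ready(void)
    {
        unsigned char probe[16];

        if (getrandom(probe, sizeof(probe), GRND_NONBLOCK) == (ssize_t)sizeof(probe))
            return 1;
        return (errno == EAGAIN) ? 0 : -1;
    }

    int main(void)
    {
        if (pool_ready() == 1)
            printf("generate the long-term host key now\n");
        else
            printf("start with a temporary key, regenerate it in the background\n");
        return 0;
    }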

Willy

2019-09-17 19:14:32

by Lennart Poettering

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Di, 17.09.19 21:58, Alexander E. Patrakov ([email protected]) wrote:

> I am worried that the getrandom delays will be serialized, because processes
> sometimes run one after another. If there are enough chained/dependent
> processes that ask for randomness before it is ready, the end result is
> still a too-big delay, essentially a failed boot.
>
> In other words: your approach of adding delays only makes sense for heavily
> parallelized boot, which may not be the case, especially for embedded
> systems that don't like systemd.

As mentioned elsewhere: once the pool is initialized it's
initialized. This means any pending getrandom() on the whole system
will unblock at the same time, and from then on all getrandom()s will
be non-blocking.

systemd-random-seed.service is nowadays a synchronization point for
exactly the moment where the pool is considered full.
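For a service that genuinely needs the pool, that synchronization point
can be used with ordinary unit ordering; a drop-in sketch (the unit name
my-keygen.service is hypothetical, only systemd-random-seed.service is
from the discussion):

    # /etc/systemd/system/my-keygen.service.d/wait-for-entropy.conf
    [Unit]
    After=systemd-random-seed.service
    Wants=systemd-random-seed.service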

Lennart

--
Lennart Poettering, Berlin

2019-09-17 19:20:47

by Willy Tarreau

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 07:30:36PM +0200, Lennart Poettering wrote:
> On Di, 17.09.19 21:58, Alexander E. Patrakov ([email protected]) wrote:
>
> > I am worried that the getrandom delays will be serialized, because processes
> > sometimes run one after another. If there are enough chained/dependent
> > processes that ask for randomness before it is ready, the end result is
> > still a too-big delay, essentially a failed boot.
> >
> > In other words: your approach of adding delays only makes sense for heavily
> > parallelized boot, which may not be the case, especially for embedded
> > systems that don't like systemd.
>
> As mentioned elsewhere: once the pool is initialized it's
> initialized. This means any pending getrandom() on the whole system
> will unblock at the same time, and from then on all getrandom()s will
> be non-blocking.

He means that all processes will experience this delay until there's enough
entropy.

Willy

2019-09-17 19:26:49

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

17.09.2019 22:32, Willy Tarreau wrote:
> On Tue, Sep 17, 2019 at 07:30:36PM +0200, Lennart Poettering wrote:
>> On Di, 17.09.19 21:58, Alexander E. Patrakov ([email protected]) wrote:
>>
>>> I am worried that the getrandom delays will be serialized, because processes
>>> sometimes run one after another. If there are enough chained/dependent
>>> processes that ask for randomness before it is ready, the end result is
>>> still a too-big delay, essentially a failed boot.
>>>
>>> In other words: your approach of adding delays only makes sense for heavily
>>> parallelized boot, which may not be the case, especially for embedded
>>> systems that don't like systemd.
>>
>> As mentioned elsewhere: once the pool is initialized it's
>> initialized. This means any pending getrandom() on the whole system
>> will unblock at the same time, and from then on all getrandom()s will
>> be non-blocking.
>
> He means that all processes will experience this delay until there's enough
> entropy.
>
> Willy

Indeed, my wording was not clear enough. Linus' patch has a 5-second
timeout for small entropy requests, after which they get converted to
the equivalent of urandom. However, in the following shell script:

#!/bin/sh
p1
p2

if both p1 and p2 ask for a small amount of entropy before crng is fully
initialized, and do nothing that produces more entropy, the total delay
will be 10 seconds.
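So p1 and p2 only need to be something along these lines to show the
serialized waits (a probe program for illustration, not anything from
the actual patch):

    /* p.c - ask for 16 bytes with flags = 0 and report how long it took */
    #include <stdio.h>
    #include <sys/random.h>
    #include <time.h>

    int main(void)
    {
        unsigned char buf[16];
        struct timespec a, b;
        ssize_t n;

        clock_gettime(CLOCK_MONOTONIC, &a);
        n = getrandom(buf, sizeof(buf), 0);   /* may block until the pool is ready */
        clock_gettime(CLOCK_MONOTONIC, &b);

        printf("getrandom() returned %zd after %ld s\n", n,
               (long)(b.tv_sec - a.tv_sec));
        return 0;
    }

Running two copies of this back to back from a script, with nothing else
generating entropy, is exactly the p1; p2 case above.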

--
Alexander E. Patrakov

2019-09-17 19:26:52

by Lennart Poettering

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Di, 17.09.19 09:23, Linus Torvalds ([email protected]) wrote:

> On Tue, Sep 17, 2019 at 9:08 AM Lennart Poettering <[email protected]> wrote:
> >
> > Here's what I'd propose:
>
> So I think this is ok, but I have another proposal. Before I post that
> one, though, I just wanted to point out:
>
> > 1) Add GRND_INSECURE to get those users of getrandom() who do not need
> > high quality entropy off its use (systemd has uses for this, for
> > seeding hash tables for example), thus reducing the places where
> > things might block.
>
> I really think that the logic should be the other way around.
>
> The getrandom() users that don't need high quality entropy are the
> ones that don't really think about this, and so _they_ shouldn't be
> the ones that have to explicitly state anything. To those users,
> "random is random". By definition they don't much care, and quite
> possibly they don't even know what "entropy" really means in that
> context.

So I think people nowadays prefer getrandom() over /dev/urandom
primarily because of the noisy logging the kernel does when you use
the latter on a non-initialized pool. If that'd be dropped then I am
pretty sure that the porting from /dev/urandom to getrandom() you see
in various projects (such as gdm/x11) would probably not take place.

In fact, speaking for systemd: the noisy logging in the kernel is the
primary (actually: only) reason that we prefer using RDRAND (if
available) over /dev/urandom if we need "medium quality" random
numbers, for example to seed hash tables and such. If the log message
wasn't there we wouldn't be tempted to bother with RDRAND and would
just use /dev/urandom like we used to for that.
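The pattern being described is roughly the following (a sketch only:
_rdrand64_step() needs a CPU with RDRAND and a -mrdrnd build, and real
code would check CPUID first):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    /* "Medium quality" randomness for things like hash table seeds:
     * try RDRAND, fall back to /dev/urandom, never block, never log. */
    static uint64_t hash_seed(void)
    {
        unsigned long long v;
        FILE *f;

        if (_rdrand64_step(&v))
            return v;

        f = fopen("/dev/urandom", "rb");
        if (f) {
            size_t ok = fread(&v, sizeof(v), 1, f);
            fclose(f);
            if (ok == 1)
                return v;
        }
        return 0x9e3779b97f4a7c15ULL;   /* last-resort fixed constant */
    }

    int main(void)
    {
        printf("seed: %016llx\n", (unsigned long long)hash_seed());
        return 0;
    }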

> > 2) Add a kernel log message if a getrandom(0) client hung for 15s or
> > more, explaining the situation briefly, but not otherwise changing
> > behaviour.
>
> The problem is that when you have some graphical boot, you'll not even
> see the kernel messages ;(

Well, but as mentioned, there's infrastructure for this, that's why I
suggested changing systemd-random-seed.service.

We can make boot hang in a "sane", discoverable way.

The reason I think this should also be logged by the kernel is that
people use netconsole and pstore and whatnot, and they should see this
there. If systemd with its infrastructure brings this to the screen via
plymouth, that wouldn't help people who debug at a much lower level.

(I mean, there have been requests to add logic to systemd that
refuses booting — or delays it — if the system has a battery and it is
nearly empty. I am pretty sure adding a clean, discoverable concept of
"uh, I can't boot for a good reason, which is this" wouldn't be the
worst of ideas.)

Lennart

--
Lennart Poettering, Berlin

2019-09-17 19:32:54

by Willy Tarreau

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 06:20:02PM +0100, Matthew Garrett wrote:
> AES keys are used for a variety of long-lived purposes (eg, disk
> encryption).

True, good point.

Willy

2019-09-17 20:37:33

by Martin Steigerwald

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

Willy Tarreau - 17.09.19, 18:21:37 CEST:
> On Tue, Sep 17, 2019 at 05:57:43PM +0200, Lennart Poettering wrote:
> > Note that calling getrandom(0) "too early" is not something people
> > do
> > on purpose. It happens by accident, i.e. because we live in a world
> > where SSH or HTTPS or so is run in the initrd already, and in a
> > world
> > where booting sometimes can be very very fast.
>
> It's not an accident, it's a lack of understanding of the impacts
> from the people who package the systems. Generating an SSH key from
> an initramfs without thinking where the randomness used for this could
> come from is not accidental, it's a lack of experience that will be
> fixed once they start to collect such reports. And those who
> absolutely need their SSH daemon or HTTPS server for a recovery image
> in initramfs can very well feed fake entropy by dumping whatever they
> want into /dev/random to make it possible to build temporary keys for
> use within this single session. At least all supposedly incorrect use
> will be made *on purpose* and will still be possible to match what
> users need.

Well, I wondered before whether SSH key generation for cloud-init or
other automatically individualized systems could happen in the
background, replacing a pre-existing key, so that SSH would be
available *before* the key is regenerated. But then there are those
big fat man-in-the-middle warnings… and I have no clear idea how to
handle this in a way that would be both secure and not scare users off
too much.

Well, probably systems at some point had better have good entropy very
quickly… and that is it. (And then quantum computers may crack those
good keys anyway in the future.)

--
Martin


2019-09-17 20:44:59

by Martin Steigerwald

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

Willy Tarreau - 17.09.19, 19:29:29 CEST:
> On Tue, Sep 17, 2019 at 07:13:28PM +0200, Lennart Poettering wrote:
> > On Di, 17.09.19 18:21, Willy Tarreau ([email protected]) wrote:
> > > On Tue, Sep 17, 2019 at 05:57:43PM +0200, Lennart Poettering
> > > wrote:
> > > > Note that calling getrandom(0) "too early" is not something
> > > > people do
> > > > on purpose. It happens by accident, i.e. because we live in a
> > > > world
> > > > where SSH or HTTPS or so is run in the initrd already, and in a
> > > > world
> > > > where booting sometimes can be very very fast.
> > >
> > > It's not an accident, it's a lack of understanding of the impacts
> > > from the people who package the systems. Generating an SSH key
> > > from
> > > an initramfs without thinking where the randomness used for this
> > > could come from is not accidental, it's a lack of experience that
> > > will be fixed once they start to collect such reports. And those
> > > who absolutely need their SSH daemon or HTTPS server for a
> > > recovery image in initramfs can very well feed fake entropy by
> > > dumping whatever they want into /dev/random to make it possible
> > > to build temporary keys for use within this single session. At
> > > least all supposedly incorrect use will be made *on purpose* and
> > > will still be possible to match what users need.
> > What do you expect these systems to do though?
> >
> > I mean, think about general purpose distros: they put together live
> > images that are supposed to work on a myriad of similar (as in: same
> > arch) but otherwise very different systems (i.e. VMs that might lack
> > any form of RNG source the same as beefy servers with multiple
> > sources
> > the same as older netbooks with few and crappy sources, ...). They
> > can't know what the specific hw will provide or won't. It's not
> > their incompetence that they build the image like that. It's a
> > common, very common usecase to install a system via SSH, and it's
> > also very common to have very generic images for a large number of
> > varied systems to run on.
>
> I'm totally fine with installing the system via SSH, using a temporary
> SSH key. I do make a strong distinction between the installation
> phase and the final deployment. The SSH key used *for installation*
> doesn't need to be the same as the final one. And very often at the
> end of the installation we'll have produced enough entropy to produce
> a correct key.

Well… the systems cloud-init adapts may come from the same template.
Cloud-init thus replaces the key that was there before on their first
boot. There is no "installation".

Cloud-init could replace the key in the background… and then restart
SSH… but that will give those big fat man-in-the-middle warnings, and
all systems would use the same SSH host key initially. I just don't see
a good way to handle this at the moment. Introducing an SSH mode for
"this is still a temporary, not-so-random key", with proper warnings,
might be challenging to get right from both a security and usability
point of view. And it would add complexity.

That said, with Proxmox VE on Fujitsu S8 or Intel NUCs I have never seen
this issue, even when starting 50 VMs in a row; however, for large cloud
providers, starting 50 VMs in a row does not sound like all that much.
And I bet virtio-rng is easily available with Proxmox VE because it uses
KVM.

--
Martin


2019-09-18 13:42:44

by Lennart Poettering

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Di, 17.09.19 19:29, Willy Tarreau ([email protected]) wrote:

> > What do you expect these systems to do though?
> >
> > I mean, think about general purpose distros: they put together live
> > images that are supposed to work on a myriad of similar (as in: same
> > arch) but otherwise very different systems (i.e. VMs that might lack
> > any form of RNG source the same as beefy servers with multiple sources
> > the same as older netbooks with few and crappy sources, ...). They can't
> > know what the specific hw will provide or won't. It's not their
> > incompetence that they build the image like that. It's a common, very
> > common usecase to install a system via SSH, and it's also very common
> > to have very generic images for a large number of varied systems to run
> > on.
>
> I'm totally fine with installing the system via SSH, using a temporary
> SSH key. I do make a strong distinction between the installation phase
> and the final deployment. The SSH key used *for installation* doesn't
> need to be the same as the final one. And very often at the end of the
> installation we'll have produced enough entropy to produce a correct
> key.

That's not how systems are built today though. And I am not sure they
should be. I mean, the majority of systems at this point probably have
some form of hardware (or virtualized) RNG available (even raspi has
one these days!), so generating these keys once at boot is totally
OK. Probably a number of others need just a few seconds to get the
entropy needed, where things are totally OK too. The only problem is
systems that lack any reasonable source of entropy and where
initialization of the pool will take overly long.

I figure we can reduce the number of systems where entropy is scarce
quite a bit if we'd start crediting entropy by default from various hw
rngs we currently don't credit entropy for. For example, the TPM and
older intel/amd chipsets. You currently have to specify
rng_core.default_quality=1000 on the kernel cmdline to make them
credit entropy. I am pretty sure this should be the default now, in a
world where CONFIG_RANDOM_TRUST_CPU=y is set anyway. i.e. why say
RDRAND is fine but those chipsets are not? That makes no sense to me.
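Concretely, the setting being talked about is just this (values straight
from the discussion, not a recommendation):

    # On the kernel command line:
    rng_core.default_quality=1000

    # Or, when rng-core is built as a module, via modprobe configuration:
    # /etc/modprobe.d/hwrng.conf
    options rng_core default_quality=1000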

I am very sure that crediting entropy to chipset hwrngs is a much
better way to solve the issue on those systems than to just hand out
rubbish randomness.

Lennart

--
Lennart Poettering, Berlin

2019-09-18 14:00:59

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

18.09.2019 18:38, Lennart Poettering wrote:
> On Di, 17.09.19 19:29, Willy Tarreau ([email protected]) wrote:
>
>>> What do you expect these systems to do though?
>>>
>>> I mean, think about general purpose distros: they put together live
>>> images that are supposed to work on a myriad of similar (as in: same
>>> arch) but otherwise very different systems (i.e. VMs that might lack
>>> any form of RNG source the same as beefy servers with multiple sources
>>> the same as older netbooks with few and crappy sources, ...). They can't
>>> know what the specific hw will provide or won't. It's not their
>>> incompetence that they build the image like that. It's a common, very
>>> common usecase to install a system via SSH, and it's also very common
>>> to have very generic images for a large number of varied systems to run
>>> on.
>>
>> I'm totally fine with installing the system via SSH, using a temporary
>> SSH key. I do make a strong distinction between the installation phase
>> and the final deployment. The SSH key used *for installation* doesn't
>> need to be the same as the final one. And very often at the end of the
>> installation we'll have produced enough entropy to produce a correct
>> key.
>
> That's not how systems are built today though. And I am not sure they
> should be. I mean, the majority of systems at this point probably have
> some form of hardware (or virtualized) RNG available (even raspi has
> one these days!), so generating these keys once at boot is totally
> OK. Probably a number of others need just a few seconds to get the
> entropy needed, where things are totally OK too. The only problem is
> systems that lack any reasonable source of entropy and where
> initialization of the pool will take overly long.
>
> I figure we can reduce the number of systems where entropy is scarce
> quite a bit if we'd start crediting entropy by default from various hw
> rngs we currently don't credit entropy for. For example, the TPM and
> older intel/amd chipsets. You currently have to specify
> rng_core.default_quality=1000 on the kernel cmdline to make them
> credit entropy. I am pretty sure this should be the default now, in a
> world where CONFIG_RANDOM_TRUST_CPU=y is set anyway. i.e. why say
> RDRAND is fine but those chipsets are not? That makes no sense to me.
>
> I am very sure that crediting entropy to chipset hwrngs is a much
> better way to solve the issue on those systems than to just hand out
> rubbish randomness.

Very well said. However, 1000 is more than the hard-coded quality of
some existing rngs, and so would send a misleading message that they are
somehow worse. I would suggest case-by-case reevaluation of all existing
hwrng drivers by their maintainers, and then setting the default to
something like 899, so that evaluated drivers have priority.

--
Alexander E. Patrakov



2019-09-18 14:52:11

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

18.09.2019 18:59, Alexander E. Patrakov wrote:
> 18.09.2019 18:38, Lennart Poettering wrote:
>> On Di, 17.09.19 19:29, Willy Tarreau ([email protected]) wrote:
>>
>>>> What do you expect these systems to do though?
>>>>
>>>> I mean, think about general purpose distros: they put together live
>>>> images that are supposed to work on a myriad of similar (as in: same
>>>> arch) but otherwise very different systems (i.e. VMs that might lack
>>>> any form of RNG source the same as beefy servers with multiple sources
>>>> the same as older netbooks with few and crappy sources, ...). They
>>>> can't
>>>> know what the specific hw will provide or won't. It's not their
>>>> incompetence that they build the image like that. It's a common, very
>>>> common usecase to install a system via SSH, and it's also very common
>>>> to have very generic images for a large number of varied systems to run
>>>> on.
>>>
>>> I'm totally fine with installing the system via SSH, using a temporary
>>> SSH key. I do make a strong distinction between the installation phase
>>> and the final deployment. The SSH key used *for installation* doesn't
>>> need to be the same as the final one. And very often at the end of the
>>> installation we'll have produced enough entropy to produce a correct
>>> key.
>>
>> That's not how systems are built today though. And I am not sure they
>> should be. I mean, the majority of systems at this point probably have
>> some form of hardware (or virtualized) RNG available (even raspi has
>> one these days!), so generating these keys once at boot is totally
>> OK. Probably a number of others need just a few seconds to get the
>> entropy needed, where things are totally OK too. The only problem is
>> systems that lack any reasonable source of entropy and where
>> initialization of the pool will take overly long.
>>
>> I figure we can reduce the number of systems where entropy is scarce
>> quite a bit if we'd start crediting entropy by default from various hw
>> rngs we currently don't credit entropy for. For example, the TPM and
>> older intel/amd chipsets. You currently have to specify
>> rng_core.default_quality=1000 on the kernel cmdline to make them
>> credit entropy. I am pretty sure this should be the default now, in a
>> world where CONFIG_RANDOM_TRUST_CPU=y is set anyway. i.e. why say
>> RDRAND is fine but those chipsets are not? That makes no sense to me.
>>
>> I am very sure that crediting entropy to chipset hwrngs is a much
>> better way to solve the issue on those systems than to just hand out
>> rubbish randomness.
>
> Very well said. However, 1000 is more than the hard-coded quality of
> some existing rngs, and so would send a misleading message that they are
> somehow worse. I would suggest case-by-case reevaluation of all existing
> hwrng drivers by their maintainers, and then setting the default to
> something like 899, so that evaluated drivers have priority.
>

Well, I have to provide another data point. On Arch Linux, with an MSI
Z87I desktop board:

$ lsmod | grep rng
<nothing>
$ modinfo rng_core
<yes, the module does exist>

So this particular board has no sources of randomness except interrupts
(which are scarce), RDRAND (which is not trusted in Arch Linux by
default) and jitter entropy (which is not collected by the kernel and
needs haveged or equivalent).

--
Alexander E. Patrakov



2019-09-18 19:59:47

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

Lennart Poettering <[email protected]> writes:

> On Di, 17.09.19 09:23, Linus Torvalds ([email protected]) wrote:
>
>> On Tue, Sep 17, 2019 at 9:08 AM Lennart Poettering <[email protected]> wrote:
>> >
>> > Here's what I'd propose:
>>
>> So I think this is ok, but I have another proposal. Before I post that
>> one, though, I just wanted to point out:
>>
>> > 1) Add GRND_INSECURE to get those users of getrandom() who do not need
>> > high quality entropy off its use (systemd has uses for this, for
>> > seeding hash tables for example), thus reducing the places where
>> > things might block.
>>
>> I really think that the logic should be the other way around.
>>
>> The getrandom() users that don't need high quality entropy are the
>> ones that don't really think about this, and so _they_ shouldn't be
>> the ones that have to explicitly state anything. To those users,
>> "random is random". By definition they don't much care, and quite
>> possibly they don't even know what "entropy" really means in that
>> context.
>
> So I think people nowadays prefer getrandom() over /dev/urandom
> primarily because of the noisy logging the kernel does when you use
> the latter on a non-initialized pool. If that'd be dropped then I am
> pretty sure that the porting from /dev/urandom to getrandom() you see
> in various projects (such as gdm/x11) would probably not take place.
>
> In fact, speaking for systemd: the noisy logging in the kernel is the
> primary (actually: only) reason that we prefer using RDRAND (if
> available) over /dev/urandom if we need "medium quality" random
> numbers, for example to seed hash tables and such. If the log message
> wasn't there we wouldn't be tempted to bother with RDRAND and would
> just use /dev/urandom like we used to for that.
>
>> > 2) Add a kernel log message if a getrandom(0) client hung for 15s or
>> > more, explaining the situation briefly, but not otherwise changing
>> > behaviour.
>>
>> The problem is that when you have some graphical boot, you'll not even
>> see the kernel messages ;(
>
> Well, but as mentioned, there's infrastructure for this, that's why I
> suggested changing systemd-random-seed.service.
>
> We can make boot hang in a "sane", discoverable way.
>
> The reason I think this should also be logged by the kernel is that
> people use netconsole and pstore and whatnot, and they should see this
> there. If systemd with its infrastructure brings this to the screen via
> plymouth, that wouldn't help people who debug at a much lower level.
>
> (I mean, there have been requests to add logic to systemd that
> refuses booting — or delays it — if the system has a battery and it is
> nearly empty. I am pretty sure adding a clean, discoverable concept of
> "uh, I can't boot for a good reason, which is this" wouldn't be the
> worst of ideas.)

As I understand it the deep problem is that sometimes we have not
observed enough random activity early in boot.

The cheap solution appears to be copying a random seed from a previous
boot, and I think that will take care of many many cases, and has
already been implemented. Which reduces this to a system first
boot issue.

So for the first system boot, can we take some special actions to make
it possible to see randomness sooner? An unconditional check of the
filesystem, perhaps, or something else that will initiate disk activity
or other hardware activity that will generate interrupts and allow us
to capture randomness.

For many systems we could even have the installer capture some random
data as a final stage of the installation, and use that to seed
randomness on the first boot.
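A minimal sketch of what that seed handling amounts to (the seed file
path is hypothetical; note that writing to /dev/urandom only mixes the
bytes in, it does not credit entropy; crediting would take the
RNDADDENTROPY ioctl on /dev/random):

    #include <stdio.h>

    #define SEED_FILE "/var/lib/random-seed"   /* hypothetical location */
    #define SEED_SIZE 512

    /* Early-boot side: mix the seed saved by the previous boot (or by the
     * installer) back into the pool. The shutdown/installer side is the
     * mirror image: read SEED_SIZE bytes from /dev/urandom into SEED_FILE. */
    int main(void)
    {
        unsigned char buf[SEED_SIZE];
        FILE *in  = fopen(SEED_FILE, "rb");
        FILE *out = fopen("/dev/urandom", "wb");
        size_t n = 0;
        int rc = 1;

        if (in && out) {
            n = fread(buf, 1, sizeof(buf), in);
            if (n > 0 && fwrite(buf, 1, n, out) == n)
                rc = 0;
        }
        if (in)
            fclose(in);
        if (out)
            fclose(out);
        return rc;
    }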

Somewhere in installing the random seed we need to be careful about
people just copying disk images from one system to another; a
replicated seed probably cannot be considered very random.

My sense is that by copying a random seed from one boot to the next
and by initiating system activity to hurry along the process of
having enough randomness we can have systems where we can almost
always have good random numbers available.

And if we almost always have good random numbers available we won't
have to worry about people getting this wrong.

Am I wrong, or can we just solve random number availability in
practically all cases?

Eric

2019-09-18 22:47:35

by Alexander E. Patrakov

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

19.09.2019 00:56, Eric W. Biederman wrote:

> The cheap solution appears to be copying a random seed from a previous
> boot, and I think that will take care of many many cases, and has
> already been implemented. Which reduces this to a system first
> boot issue.

No, this is not the solution, if we take seriously not only getrandom
hangs, but also urandom warnings. In some setups (root on LUKS is one of
them) they happen early in the initramfs. Therefore "restoring" entropy
from the previous boot by a script that runs from the main system is too
late. That's why it is suggested to load at least a part of the random
seed in the boot loader, and that has not been commonly implemented.

--
Alexander E. Patrakov



2019-09-18 22:48:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Wed, Sep 18, 2019 at 12:56 PM Eric W. Biederman
<[email protected]> wrote:
>
> The cheap solution appears to be copying a random seed from a previous
> boot, and I think that will take care of many many cases, and has
> already been implemented. Which reduces this to a system first
> boot issue.

Not really.

Part of the problem is that many people don't _trust_ that "previous
boot entropy".

The lack of trust is sometimes fundamental mistrust ("Who knows where
it came from"), which also tends to cover things like not trusting
rdrand or not trusting the boot loader claimed randomness data.

But the lack of trust has been realistic - if you generated your disk
image by cloning a pre-existing one, you may well have two (or more -
up to any infinite number) of subsequent boots that use the same
"random" data for initialization.

And doing that "boot a pre-existing image" is not as unusual as you'd
think. Some people do it to make bootup faster - there have been
people who work on pre-populating bootup all the way to user mode by
basically making boot be a "resume from disk" kind of event.

So a large part of the problem is that we don't actually trust things
that _should_ be trustworthy, because we've seen (over and over
again) people mis-using them. So then we do mix the data into the
randomness pool (because there's no downside to _that_), but we don't
treat it as entropy (because while it _probably_ is, we don't actually
trust it sufficiently).

A _lot_ of the problems with randomness come from these trust issues.
Our entropy counting is very very conservative indeed.

Linus

2019-09-18 22:52:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Wed, Sep 18, 2019 at 1:15 PM Alexander E. Patrakov
<[email protected]> wrote:
>
> No, this is not the solution, if we take seriously not only getrandom
> hangs, but also urandom warnings. In some setups (root on LUKS is one of
> them) they happen early in the initramfs. Therefore "restoring" entropy
> from the previous boot by a script that runs from the main system is too
> late. That's why it is suggested to load at least a part of the random
> seed in the boot loader, and that has not been commonly implemented.

Honestly, I think the bootloader suggestion is naive and silly too.

Yes, we now support it. And no, I don't think people will trust that
either. And I suspect for good reason: there's really very little
reason to believe that bootloaders would be any better than any other
part of the system.

So right now some people trust bootloaders exactly _because_ there are
basically just one or two that do this, and the people who use them
are usually the people who wrote them or are at least closely
associated with them. That will change, and then people will say "why
would I trust that, when we know of bug Xyz".

And I guarantee that those bugs _will_ happen, and people will quite
reasonably then say "yeah, I don't trust the bootloader". Bootloaders
do some questionable things.

The most likely thing to actually be somewhat useful is, I feel, things
like the kernel just saving the seed by itself in nvram. There's
already an example of this for the EFI random seed thing, but that's
used purely for kexec, I think.

Adding an EFI variable (or other platform nonvolatile thing), and
reading (and writing to it) purely from the kernel ends up being one
of those things where you can then say "ok, if we trust the platform
AT ALL, we can trust that". Since you can't reasonably do things like
add EFI variables to your distro image by mistake.

Of course, even then people will say "I don't trust the platform". But
at some point you just say "you have trust issues" and move on.

Linus

2019-09-18 23:38:07

by Willy Tarreau

[permalink] [raw]
Subject: Re: Linux 5.3-rc8

On Wed, Sep 18, 2019 at 01:26:39PM -0700, Linus Torvalds wrote:
> Of course, even then people will say "I don't trust the platform". But
> at some point you just say "you have trust issues" and move on.

This is where our extreme configurability can hurt. Sometimes we'd
rather avoid providing some of these "I don't trust this or that"
options and impose some choices on users: "you need entropy to boot,
stop being childish and collect the little entropy there is, period".
I'm not certain that the other operating systems, which don't seem to
experience entropy issues, leave as many choices as we do. I can
understand how some choices may be problematic in virtual environments,
but there are so many other attack vectors there that randomness is
probably a detail.

Willy