2022-05-02 23:09:19

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

Hi Alex,

On Mon, May 02, 2022 at 08:57:01PM +0200, Alexander Graf wrote:
> So far I see little that would block your patch? It seems to go into a
> good direction from all I can tell.

With kernel code for the kernel, if it's good enough, it's good enough,
and can be made better later. With kernel code for userspace, it might
not be possible to make it better later, so if we've got a plethora of
options, it makes the incrementalism I was enthusiastic about a little
bit more risky.

Here's where we're at:

- /proc/sys has this easy to use poll() interface [which currently breaks
normal poll() semantics; see patch 1/2]. It'd be nice to use this as
an async notifier interface.

- A race-free mmap()d mechanism needs virtual hardware support to be
useful, so I don't want to add it now. But when we do have it, where
should it be added?

* Atop the same /proc/sys file that this patch has returning -ENODATA
at the moment? /proc/sys doesn't do mmap now so that'd be some
plumbing work, which would be pretty odd for the sysctl abstraction.

* Atop some new file elsewhere? Not hard to do, but also begs the
question of which driver should do it, where the file should be, and
a whole host of bikesheddy questions.

* In the VDSO? This would make the most sense to me if we're ever
going to do a VDSO RNG, and to design it along with the userspace
RNG stuff people think they want, even if it's maybe not the best
idea in the end. This is a huge can of worms. Termites, even, and
they'll eat through the beams of your house.

* Nowhere, because even though this race-free mechanism is attractive,
it's way too hard to program for, and nobody is going to use it, and
not many people care /that/ much about having this mechanism being
100% race-free. Colm said Amazon doesn't, for example, since it uses
quiescent states. And apparently Microsoft doesn't care much either,
and they don't even use quiescent states. Maybe I'm the only one
raising a fuss about the race window (while hypocritically also not
being very eager to write code around an interface that has to check
a variable all the time).

- A file in /proc/sys can return -ENODATA, as this patch has it doing,
or it can return something else:

* A counter. Flawed for reasons discussed elsewhere.

* A UUID, just like /proc/sys/kernel/random/boot_id, except we'd name
it generation_id or something and it'd change on VM fork. An earlier
version of this patch did that.

* -ENODATA, as this patch does, because each caller can get a new unique
value with getrandom(), as there's no point in having some global
identifier like with boot_id. This is what I'm leaning toward. It
seems like getrandom() would handle Lennart's case, Go's case, s2n's
case, and we don't need a particular ID.

* The direct value of the ACPI vmgenid value. Discussed in earlier
emails; I don't want to do this.

- You don't care about race-free, but you do care about mmap(), for
"programming reasons" in s2n. I'm not yet sympathetic to this case
:-P. The point of mmap() should be for the race-free stuff. Otherwise
poll() should do the job, and if you can't poll, you can hack around
it with another process or a thread or whatever other trick. Your
systemd proposal with a file was just punting the idea elsewhere in
userspace (which I think is a bit ugly too...).

This state of affairs inclines me to:

- Have a /proc/sys/kernel/random/fork_event that has a poll() interface
with reads returning -ENODATA (this patch).

- Try to convince whomever that poll()ing on fork_event and then calling
getrandom() for a process-local snapshot ID is better than poll()ing
on and subsequently read()ing a /proc/sys/kernel/random/generation_id
that returns a global UUID like boot_id. [Lennart - I'm specifically
interested in trying to convince you of this. Thoughts?]

- Given that you are okay with an async mechanism, figure out how poll()
can work for s2n rather than mmap().

- Not really pursue the race-free mmap() thing unless Microsoft goes
full steam ahead with it because it's complicated to program for and
maybe (or maybe not? I don't know? data please?) a theoretical
concern.

> You block the thread on poll, so the only option is a thread. I was
> until now always under the working assumption that we can't do this in a
> thread because you don't want your single threaded application to turn
> into a pthreaded one, but you make me wonder. Let me check with Torben
> tomorrow.

Can't you use raw clone() and just have a super dirty thread that shares
a single page mapping and nothing else? Poll on the pidfd of the parent
and the fork_event fd, quit on the former, and ++ on the latter.

> >> 2) A way to notify larger applications (think Java here) that a system
> >> is going to be suspended soon so it can wipe PII before it gets cloned
> >> for example.
> > Suspension, like S3 power notification stuff? Talk to Rafael about that;
>
>
> Whether you go running -> S3 -> clone or you go running -> paused ->
> clone is an implementation detail I'm not terribly worried about. Users
> can do either, because on both cases the VM is in paused state.

Ahh, I see.
https://lore.kernel.org/lkml/[email protected]/
might interest you.

Jason