2022-05-02 22:19:50

by Lennart Poettering

Subject: Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

On Mo, 02.05.22 18:12, Jason A. Donenfeld ([email protected]) wrote:

> > > In order to inform userspace of virtual machine forks, this commit adds
> > > a "fork_event" sysctl, which does not return any data, but allows
> > > userspace processes to poll() on it for notification of VM forks.
> > >
> > > It avoids exposing the actual vmgenid from the hypervisor to userspace,
> > > in case there is any randomness value in keeping it secret. Rather,
> > > userspace is expected to simply use getrandom() if it wants a fresh
> > > value.
> >
> > Wouldn't it make sense to expose a monotonic 64bit counter of detected
> > VM forks since boot through read()? It might be interesting to know
> > for userspace how many forks it missed the fork events for. Moreover it
> > might be interesting to userspace to know if any fork happened so far
> > *at* *all*, by checking if the counter is non-zero.
>
> "Might be interesting" is different from "definitely useful". I'm not
> going to add this without a clear use case. This feature is pretty
> narrowly scoped in its objectives right now, and I intend to keep it
> that way if possible.

Sure, whatever. I mean, if you think it's preferable to have 3 API
abstractions for the same concept, each for its own special usecase, then
that's certainly one way to do things. I personally would try to
figure out a modicum of generalization for things like this. But maybe
that's just me…

I can just tell you, that in systemd we'd have a usecase for consuming
such a generation counter: we try to provide stable MAC addresses for
synthetic network interfaces managed by networkd, so we hash them from
/etc/machine-id, but otoh people also want them to change when they
clone their VMs. We could very nicely solve this if we had a
generation counter easily accessible from userspace, that starts at 0
initially. Because then we can hash as we always did when the counter
is zero, but otherwise use something else, possibly hashed from the
generation counter.

But anyway, I understand you are not interested in
generalization/other usecases, so I'll shut up.

Lennart

--
Lennart Poettering, Berlin


2022-05-02 23:22:30

by Jason A. Donenfeld

Subject: Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

Hey Lennart,

On Mon, May 02, 2022 at 06:51:19PM +0200, Lennart Poettering wrote:
> On Mo, 02.05.22 18:12, Jason A. Donenfeld ([email protected]) wrote:
>
> > > > In order to inform userspace of virtual machine forks, this commit adds
> > > > a "fork_event" sysctl, which does not return any data, but allows
> > > > userspace processes to poll() on it for notification of VM forks.
> > > >
> > > > It avoids exposing the actual vmgenid from the hypervisor to userspace,
> > > > in case there is any randomness value in keeping it secret. Rather,
> > > > userspace is expected to simply use getrandom() if it wants a fresh
> > > > value.
> > >
> > > Wouldn't it make sense to expose a monotonic 64bit counter of detected
> > > VM forks since boot through read()? It might be interesting to know
> > > for userspace how many forks it missed the fork events for. Moreover it
> > > might be interesting to userspace to know if any fork happened so far
> > > *at* *all*, by checking if the counter is non-zero.
> >
> > "Might be interesting" is different from "definitely useful". I'm not
> > going to add this without a clear use case. This feature is pretty
> > narrowly scoped in its objectives right now, and I intend to keep it
> > that way if possible.
>
> Sure, whatever. I mean, if you think it's preferable to have 3 API
> abstractions for the same concept, each for its own special usecase, then
> that's certainly one way to do things. I personally would try to
> figure out a modicum of generalization for things like this. But maybe
> that's just me…
>
> I can just tell you, that in systemd we'd have a usecase for consuming
> such a generation counter: we try to provide stable MAC addresses for
> synthetic network interfaces managed by networkd, so we hash them from
> /etc/machine-id, but otoh people also want them to change when they
> clone their VMs. We could very nicely solve this if we had a
> generation counter easily accessible from userspace, that starts at 0
> initially. Because then we can hash as we always did when the counter
> is zero, but otherwise use something else, possibly hashed from the
> generation counter.

This doesn't work, because you could have memory-A split into memory-A.1
and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with
the same new value "2". The solution is to instead have the hypervisor
pass a unique value and a counter. We currently have a 16 byte unique
value from the hypervisor, which I'm keeping as a kernel space secret
for the RNG; we're waiting on a word-sized monotonic counter interface
from hypervisors in the future. When we have the latter, then we can
start talking about mmapable things. Your use case would probably be
served by exposing that 16-byte unique value (hashed with some constant
for safety I suppose), but I'm hesitant to start going down that route
all at once, especially if we're to have a more useful counter in the
future.

Jason

2022-05-02 23:29:02

by Jason A. Donenfeld

Subject: Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

Err...

On Mon, May 02, 2022 at 08:04:21PM +0200, Jason A. Donenfeld wrote:
> This doesn't work, because you could have memory-A split into memory-A.1
> and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with
> the same new value "2". The solution is to instead have the hypervisor
> pass a unique value and a counter. We currently have a 16 byte unique
> value from the hypervisor, which I'm keeping as a kernel space secret
> for the RNG; we're waiting on a word-sized monotonic counter interface
> from hypervisors in the future. When we have the latter, then we can
> start talking about mmapable things. Your use case would probably be
> served by exposing that 16-byte unique value (hashed with some constant
> for safety I suppose), but I'm hesitant to start going down that route
> all at once, especially if we're to have a more useful counter in the
> future.

I kind of muddled things a bit by conflating two issues.

I'd like the hypervisor to provide a counter so that we can mmap it to
userspace so that userspace programs can do word-sized comparisons on
mmap'd counters, avoiding the race that currently exists from relying on
the async ACPI notification, which arrives after the system is already
up and running. That's one thing, but not what we're talking about here
with the MAC addresses.

The point over here is that neither the guest *nor* the hypervisor can
maintain a counter that actually represents something unique. A.1 and
A.2 will both ++counter to the same value in the example above. The
guest can't do it (neither in systemd nor in the kernel), because it
will always start with the same counter value of A and ++ it to the same
next value. The hypervisor can't do it either, because snapshots can be
shipped around to different computers that aren't coordinated.
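
The collision described above can be reproduced with a toy model. This is
only an illustrative sketch: the `Snapshot` type and `resume_clone()` helper
are hypothetical names, not anything from the kernel or systemd, and the
starting value of 1 is chosen just to match the "2" in the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a saved VM memory image."""
    name: str
    counter: int


def resume_clone(parent: Snapshot, name: str) -> Snapshot:
    # Every clone starts from the parent's saved state and does ++counter,
    # with no knowledge of any sibling clones.
    return replace(parent, name=name, counter=parent.counter + 1)


a = Snapshot("A", counter=1)
a1 = resume_clone(a, "A.1")
a2 = resume_clone(a, "A.2")

# Both clones are distinct machines but carry the same counter value.
print(a1.name, a1.counter)
print(a2.name, a2.counter)
```

Both clones report the same counter, so the value says "a fork happened"
but cannot tell A.1 and A.2 apart.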

So, put that way, the counter thing that I'd like wouldn't be for having
a unique snapshot ID, but just as a mmap-able way of learning when a
snapshot forks. It wouldn't be more useful than that.

If you want a unique ID, we have two options for that: the first is
exposing the vmgenid 16 byte value to userspace (which I don't want to
do). The second is just calling getrandom() after you get a poll()
notification, and that'll be guaranteed to be unique to that VM because
of the vmgenid driver in 5.18.
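
A rough userspace sketch of that second option follows. It rests on several
assumptions: the sysctl path is a guess at where the proposed "fork_event"
file would live, the poll mask mirrors how other pollable sysctls signal
changes (POLLPRI/POLLERR), and all function names are made up for
illustration.

```python
import os
import select

# Assumed location of the sysctl proposed in this patch; not a confirmed path.
FORK_EVENT_PATH = "/proc/sys/kernel/random/fork_event"


def wait_for_event(fd, mask=select.POLLPRI | select.POLLERR, timeout_ms=None):
    """Return True if any event in `mask` fired on fd before the timeout."""
    poller = select.poll()
    poller.register(fd, mask)
    return any(revents & mask for _fd, revents in poller.poll(timeout_ms))


def fresh_unique_id(nbytes=16):
    """Fetch fresh random bytes; uses getrandom() where Python exposes it."""
    if hasattr(os, "getrandom"):
        return os.getrandom(nbytes)
    return os.urandom(nbytes)


def watch_vm_forks(path=FORK_EVENT_PATH):
    """Yield a new random ID each time a fork notification arrives."""
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            if wait_for_event(fd):
                yield fresh_unique_id()
    finally:
        os.close(fd)
```

A consumer would iterate over `watch_vm_forks()` and re-derive whatever
per-VM identifiers it needs from each yielded value.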

This last suggestion is thus what you should do for your MAC addresses.

Jason

2022-05-03 08:06:17

by Lennart Poettering

Subject: Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

On Mo, 02.05.22 20:04, Jason A. Donenfeld ([email protected]) wrote:

> > I can just tell you, that in systemd we'd have a usecase for consuming
> > such a generation counter: we try to provide stable MAC addresses for
> > synthetic network interfaces managed by networkd, so we hash them from
> > /etc/machine-id, but otoh people also want them to change when they
> > clone their VMs. We could very nicely solve this if we had a
> > generation counter easily accessible from userspace, that starts at 0
> > initially. Because then we can hash as we always did when the counter
> > is zero, but otherwise use something else, possibly hashed from the
> > generation counter.
>
> This doesn't work, because you could have memory-A split into memory-A.1
> and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with
> the same new value "2".

Yes, that's why I was vague about what to switch to if the counter is
non-zero, i.e. "something else, *possibly* hashed…".

For this MAC address usecase it's entirely sufficient to be able to
distinguish if the system was cloned at all, i.e. if the counter is
zero or is non-zero. Because that would already be great for a policy
of "hash it in a stable way from /etc/machine-id, if counter == 0" +
"use random MAC once counter > 0".
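
That policy could look roughly like the sketch below. It is not networkd's
actual derivation: SHA-256 over the machine ID and interface name is just a
stand-in hash, and `mac_for_interface()` and its `cloned` flag are
hypothetical, standing in for the single bit of information requested from
the kernel.

```python
import hashlib
import os


def _format_local_unicast(raw6: bytes) -> str:
    # Clear the multicast bit and set the locally-administered bit so the
    # result is a valid MAC for a synthetic interface.
    first = (raw6[0] & 0xFE) | 0x02
    return ":".join(f"{b:02x}" for b in bytes([first]) + raw6[1:6])


def mac_for_interface(machine_id: str, ifname: str, cloned: bool) -> str:
    if cloned:
        # counter > 0: the VM was cloned, so pick a random MAC.
        raw = os.urandom(6)
    else:
        # counter == 0: stable MAC hashed from the machine ID (stand-in hash).
        raw = hashlib.sha256(f"{machine_id}:{ifname}".encode()).digest()[:6]
    return _format_local_unicast(raw)


machine_id = "0123456789abcdef0123456789abcdef"  # example value only
print(mac_for_interface(machine_id, "veth0", cloned=False))
print(mac_for_interface(machine_id, "veth0", cloned=True))
```

The first call always prints the same address for a given machine ID and
interface name; the second changes on every call.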

Such a MAC address policy I think should probably even be the new
default in networkd, if we could implement it. For that we'd need a
single bit of info from the kernel, indicating whether the system was
cloned at all, i.e. whether the vmgenid uuid differs from the one the
system first booted up with.

Lennart

--
Lennart Poettering, Berlin