2022-05-02 21:09:52

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks


On 02.05.22 18:51, Lennart Poettering wrote:
> On Mo, 02.05.22 18:12, Jason A. Donenfeld ([email protected]) wrote:
>
>>>> In order to inform userspace of virtual machine forks, this commit adds
>>>> a "fork_event" sysctl, which does not return any data, but allows
>>>> userspace processes to poll() on it for notification of VM forks.
>>>>
>>>> It avoids exposing the actual vmgenid from the hypervisor to userspace,
>>>> in case there is any randomness value in keeping it secret. Rather,
>>>> userspace is expected to simply use getrandom() if it wants a fresh
>>>> value.
>>> Wouldn't it make sense to expose a monotonic 64bit counter of detected
>>> VM forks since boot through read()? It might be interesting to know
>>> for userspace how many forks it missed the fork events for. Moreover it
>>> might be interesting to userspace to know if any fork happened so far
>>> *at* *all*, by checking if the counter is non-zero.
>> "Might be interesting" is different from "definitely useful". I'm not
>> going to add this without a clear use case. This feature is pretty
>> narrowly scoped in its objectives right now, and I intend to keep it
>> that way if possible.
> Sure, whatever. I mean, if you think it's preferable to have 3 API
> abstractions for the same concept each for it's special usecase, then
> that's certainly one way to do things. I personally would try to
> figure out a modicum of generalization for things like this. But maybe
> that' just me…
>
> I can just tell you, that in systemd we'd have a usecase for consuming
> such a generation counter: we try to provide stable MAC addresses for
> synthetic network interfaces managed by networkd, so we hash them from
> /etc/machine-id, but otoh people also want them to change when they
> clone their VMs. We could very nicely solve this if we had a
> generation counter easily accessible from userspace, that starts at 0
> initially. Because then we can hash as we always did when the counter
> is zero, but otherwise use something else, possibly hashed from the
> generation counter.
>
> But anyway, I understand you are not interested in
> generalization/other usecases, so I'll shut up.


Let's not turn this into a pit fight please :). I think it's a good idea
to collect the use cases we all have and evaluate whether this patch is
a good stepping stone towards the final solution.

At the end of the road, what I would like to see is

1) A way for libraries such as s2n to identify that a clone occurred.
Because it's a deep-down library with no access to its own thread or the
main loop, it can not rely on poll/select. Mmap of a file however would
work great, as you can create transactions on top of a 64bit mmap'ed
value for example.

We can have systemd generate and then provide such a file as long as we
have an event based notification mechanism.

2) A way to notify larger applications (think Java here) that a system
is going to be suspended soon so it can wipe PII before it gets cloned
for example.

3) Notifications after clone so applications know they can regenerate VM
unique data based on randomness.

For 2 and 3, applications should have native support for these events
(think of poll() on the fork file) as well as external script based
support (think "invoke systemctl restart smbd.service on clone").



Lennart, looking at the current sysctl proposal, systemd could poll() on
the fork file. It would then be able to generate a /run/fork-id file
which it can use for the flow above, right?

The sysctl proposal also gives us 3, if we implement the inhibitor
proposal [1] in systemd.

That leaves 2, which we don't have a hardware interface for yet. We can
still get to at least script level automation with inhibitors and
manually triggering a "hey, you'll be suspended soon" event via systemd.

Overall, it sounds to me like the sysctl poll based kernel interface in
this patch in combination with systemd inhibitors gives us an answer to
most of the flows above.

I can see attractiveness in providing the /run/fork-id directly from the
kernel though, to remove the dependency on systemd for poll-less
notification of libraries.

Jason, how much complexity would it add to provide an mmap() and read()
interface to a fork counter value to the sysctl? Read sounds like a
trivial change on top of what you have already, mmap a bit more heavy
lift. If we had both, it would allow us to implement a Linux standard
fork detect path in libraries that does not rely on systemd.



Alex


[1] https://github.com/systemd/systemd/issues/20222




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



2022-05-03 00:32:54

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

Hi Alex,

On Mon, May 02, 2022 at 07:59:08PM +0200, Alexander Graf wrote:
> to collect the use cases we all have and evaluate whether this patch is
> a good stepping stone towards the final solution.

Indeed, I'm all for collecting use cases. What I meant to say is that
we're not going to add something "just 'cuz"; I'd like to have some
concrete things in mind.

To date, I've basically had your s2n case in mind, but as you haven't
responded to this in the last month, I started looking to see if this
was useful elsewhere or if I should abandon it, so I filed this issue
with the Go project: <https://github.com/golang/go/issues/52544>. We're
over halfway through 5.18 now, and only at this point have you arrived
to discuss and finalize things. So in all likelihood we'll wind up
tabling this until 5.20 or never, since what I thought was an easy
consensus before now apparently is not.

> 1) A way for libraries such as s2n to identify that a clone occurred.
> Because it's a deep-down library with no access to its own thread or the
> main loop, it can not rely on poll/select. Mmap of a file however would
> work great, as you can create transactions on top of a 64bit mmap'ed
> value for example.

I didn't realize that s2n can't poll. That's surprising. In the worst
case, can't you just spawn a thread?

> 2) A way to notify larger applications (think Java here) that a system
> is going to be suspended soon so it can wipe PII before it gets cloned
> for example.

Suspension, like S3 power notification stuff? Talk to Rafael about that;
this isn't related to the VM fork issue. I use those PM notifiers
happily in kernel space but AFAICT, there's still no userspace thing for
it. This seems orthogonal to this conversation though, so let's not veer
off into that topic.

If you didn't mean S3 but actually meant notification prior to snapshot
taking, we don't have any virtual hardware for that, so it's a moot
point.

> 3) Notifications after clone so applications know they can regenerate VM
> unique data based on randomness.

You mean this as "the same as (1) but with poll() instead of mmap()",
right?

> Lennart, looking at the current sysctl proposal, systemd could poll() on
> the fork file. It would then be able to generate a /run/fork-id file
> which it can use for the flow above, right?

For the reasons I gave in my last email to Lennart, I don't think
there's a good way for systemd to generate a fork-id file on its own
either. I don't think systemd should really be involved here as a
provider of values, just as a potential consumer of what the kernel
provides.

> The sysctl proposal also gives us 3, if we implement the inhibitor
> proposal [1] in systemd.

These userspace components you're proposing seem like a lot of
overengineering for little gain, which is why this conversation went
nowhere when Amazon attempted all this year. But it sounds like you
agree with me based on your remark below about systemd-less interfaces
provided by the kernel.

> Overall, it sounds to me like the sysctl poll based kernel interface in
> this patch in combination with systemd inhibitors gives us an answer to
> most of the flows above.
>
> I can see attractiveness in providing the /run/fork-id directly from the
> kernel though, to remove the dependency on systemd for poll-less
> notification of libraries.
>
> Jason, how much complexity would it add to provide an mmap() and read()
> interface to a fork counter value to the sysctl? Read sounds like a
> trivial change on top of what you have already, mmap a bit more heavy
> lift. If we had both, it would allow us to implement a Linux standard
> fork detect path in libraries that does not rely on systemd.

mmap() does not give us anything if we're not going to expose the raw
ACPI-mapped ID directly. It will still be a racy mechanism until we do
that. So I think we should wait until there's a proper vmgenid
word-sized counter to expose something mmap()able. If you have the
energy to talk to Microsoft about this and make it happen, please be my
guest. As I wrote at the beginning of this email. I haven't gotten a
response from you at all about this stuff in quite some time, so I'm not
really itching take that on alone now.

Jason

2022-05-03 09:40:03

by Lennart Poettering

[permalink] [raw]
Subject: Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

On Mo, 02.05.22 19:59, Alexander Graf ([email protected]) wrote:

> Lennart, looking at the current sysctl proposal, systemd could poll() on the
> fork file. It would then be able to generate a /run/fork-id file which it
> can use for the flow above, right?

I am not to keen on making sytemd such a proxy. Sounds like something
the kernel could do on its own, better and resulting in an ultimately
simpler system...

If systemd its the proxy this adds in extra raciness. i.e. in a ideal
world, if we have some form of notification fd, then it would be great
if that fd is guaranteed to have POLLIN set and its contents updated
the instant the clone happened. But if we proxy this through
userspace, there's necessarily a latency involved that it it takes
userspace to catch up and effect the POLLIN and updated contents
towards its client apps.

I understand the underlying VM hypervisor APIs currently are designed
to always imply some notification latency. Which sucks, but I think we
should be very careful with replicating this design mistake with any
userspace APIs we add.

i.e. I am pretty sure that even if the underlying VM hypervisor
primitive isn't as good as we wanted, the Linux kernel→userspace API
should be built so that if one day a better VM hypervisor interface
exists it can be plugged behind it without such limitations.

> Overall, it sounds to me like the sysctl poll based kernel interface in this
> patch in combination with systemd inhibitors gives us an answer to most of
> the flows above.

As mentioned earlier, I am not convinced sysctl is the right place for
this. sysctls are understood by most people as being the place for
tweaking kernel settings. This is not a kernel setting, but a
notification concept, and the way Jason defined it there's nothing to
read nor write, which strongly suggests to move it elsewhere, but not
/proc/sys/.

Use /sys/kernel/ or so. Or maybe O_NOTIFICATION_PIPE or whatever, but
/proc/sys/ looks really wrong.

> I can see attractiveness in providing the /run/fork-id directly from the
> kernel though, to remove the dependency on systemd for poll-less
> notification of libraries.

I agree.

Lennart

--
Lennart Poettering, Berlin

2022-05-03 12:45:32

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

On Tue, May 03, 2022 at 10:29:49AM +0200, Lennart Poettering wrote:
> As mentioned earlier, I am not convinced sysctl is the right place for
> this. sysctls are understood by most people as being the place for
> tweaking kernel settings. This is not a kernel setting, but a
> notification concept, and the way Jason defined it there's nothing to
> read nor write, which strongly suggests to move it elsewhere, but not
> /proc/sys/.

I think I'm coming around to this view that having a sysctl return
-ENODATA is weird. It makes `sysctl -a` always complain to stderr, for
example, which seems bad.

> > I can see attractiveness in providing the /run/fork-id directly from the
> > kernel though, to remove the dependency on systemd for poll-less
> > notification of libraries.
>
> I agree.

I'm still not convinced there's value in having a counter or a UUID, but
if you had to choose, would you prefer a counter or a UUID? It sounds
like the former, because you see a use for distinguishing between zero
and non-zero? Or did you finally agree with me that vmgenid isn't
granular enough for that?

Jason