- Background
The VM Generation ID is a feature defined by Microsoft (paper:
http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
multiple hypervisor vendors.
The feature is required in virtualized environments by apps that work
with local copies/caches of world-unique data such as random values,
uuids, monotonically increasing counters, etc.
Such apps can be negatively affected by VM snapshotting when the VM
is either cloned or returned to an earlier point in time.
The VM Generation ID is a simple concept meant to alleviate the issue
by providing a unique ID that changes each time the VM is restored
from a snapshot. The hw provided UUID value can be used to
differentiate between VMs or different generations of the same VM.
- Problem
The VM Generation ID is exposed through an ACPI device by multiple
hypervisor vendors but neither the vendors or upstream Linux have no
default driver for it leaving users to fend for themselves.
Furthermore, simply finding out about a VM generation change is only
the starting point of a process to renew internal states of possibly
multiple applications across the system. This process could benefit
from a driver that provides an interface through which orchestration
can be easily done.
- Solution
This patch is a driver that exposes a monotonic incremental Virtual
Machine Generation u32 counter via a char-dev FS interface that
provides sync and async VmGen counter updates notifications. It also
provides VmGen counter retrieval and confirmation mechanisms.
The hw provided UUID is not exposed to userspace, it is internally
used by the driver to keep accounting for the exposed VmGen counter.
The counter starts from zero when the driver is initialized and
monotonically increments every time the hw UUID changes (the VM
generation changes).
On each hw UUID change, the new hypervisor-provided UUID is also fed
to the kernel RNG.
This patch builds on top of Or Idgar <[email protected]>'s proposal
https://lkml.org/lkml/2018/3/1/498
- Future improvements
Ideally we would want the driver to register itself based on devices'
_CID and not _HID, but unfortunately I couldn't find a way to do that.
The problem is that ACPI device matching is done by
'__acpi_match_device()' which exclusively looks at
'acpi_hardware_id *hwid'.
There is a path for platform devices to match on _CID when _HID is
'PRP0001' - but this is not the case for the Qemu vmgenid device.
Guidance and help here would be greatly appreciated.
- v1 -> v2:
- expose to userspace a monotonically increasing u32 Vm Gen Counter
instead of the hw VmGen UUID
- since the hw/hypervisor-provided 128-bit UUID is not public
anymore, add it to the kernel RNG as device randomness
- insert driver page containing Vm Gen Counter in the user vma in
the driver's mmap handler instead of using a fault handler
- turn driver into a misc device driver to auto-create /dev/vmgenid
- change ioctl arg to avoid leaking kernel structs to userspace
- update documentation
- various nits (license, unnecessary casting, Kconfig, others)
- rebase on top of linus latest
Signed-off-by: Adrian Catangiu <[email protected]>
---
Documentation/virt/vmgenid.rst | 228 ++++++++++++++++++++++++
drivers/virt/Kconfig | 17 ++
drivers/virt/Makefile | 1 +
drivers/virt/vmgenid.c | 390 +++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/vmgenid.h | 13 ++
5 files changed, 649 insertions(+)
create mode 100644 Documentation/virt/vmgenid.rst
create mode 100644 drivers/virt/vmgenid.c
create mode 100644 include/uapi/linux/vmgenid.h
diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
new file mode 100644
index 0000000..603e8a5
--- /dev/null
+++ b/Documentation/virt/vmgenid.rst
@@ -0,0 +1,228 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+VMGENID
+============
+
+The VM Generation ID is a feature defined by Microsoft (paper:
+http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
+multiple hypervisor vendors.
+
+The feature is required in virtualized environments by apps that work
+with local copies/caches of world-unique data such as random values,
+uuids, monotonically increasing counters, etc.
+Such apps can be negatively affected by VM snapshotting when the VM
+is either cloned or returned to an earlier point in time.
+
+The VM Generation ID is a simple concept meant to alleviate the issue
+by providing a unique ID that changes each time the VM is restored
+from a snapshot. The hw provided UUID value can be used to
+differentiate between VMs or different generations of the same VM.
+
+The VM Generation ID is exposed through an ACPI device by multiple
+hypervisor vendors. The driver for it lives at
+``drivers/virt/vmgenid.c``
+
+The driver exposes a monotonic incremental Virtual Machine Generation
+u32 counter via a char-dev FS interface that provides sync and async
+VmGen counter updates notifications. It also provides VmGen counter
+retrieval and confirmation mechanisms.
+
+The hw provided UUID is not exposed to userspace, it is internally
+used by the driver to keep accounting for the exposed VmGen counter.
+The counter starts from zero when the driver is initialized and
+monotonically increments every time the hw UUID changes (the VM
+generation changes).
+
+On each hw UUID change, the new UUID is also fed to the kernel RNG.
+
+Driver interface:
+
+``open()``:
+ When the device is opened, a copy of the current Vm-Gen-Id (counter)
+ is associated with the open file descriptor. The driver now tracks
+ this file as an independent *watcher*. The driver tracks how many
+ watchers are aware of the latest Vm-Gen-Id counter and how many of
+ them are *outdated*; outdated being those that have lived through
+ a Vm-Gen-Id change but not yet confirmed the new generation counter.
+
+``read()``:
+ Read is meant to provide the *new* VM generation counter when a
+ generation change takes place. The read operation blocks until the
+ associated counter is no longer up to date - until HW vm gen id
+ changes - at which point the new counter is provided/returned.
+ Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
+ *new* counter value available. The generation counter is considered
+ *new* for each open file descriptor that hasn't confirmed the new
+ value, following a generation change. Therefore, once a generation
+ change takes place, all ``read()`` calls will immediately return the
+ new generation counter and will continue to do so until the
+ new value is confirmed back to the driver through ``write()``.
+ Partial reads are not allowed - read buffer needs to be at least
+ ``sizeof(unsigned)`` in size.
+
+``write()``:
+ Write is used to confirm the up-to-date Vm Gen counter back to the
+ driver.
+ Following a VM generation change, all existing watchers are marked
+ as *outdated*. Each file descriptor will maintain the *outdated*
+ status until a ``write()`` confirms the up-to-date counter back to
+ the driver.
+ Partial writes are not allowed - write buffer should be exactly
+ ``sizeof(unsigned)`` in size.
+
+``poll()``:
+ Poll is implemented to allow polling for generation counter updates.
+ Such updates result in ``EPOLLIN`` polling status until the new
+ up-to-date counter is confirmed back to the driver through a
+ ``write()``.
+
+``ioctl()``:
+ The driver also adds support for tracking count of open file
+ descriptors that haven't acknowledged a generation counter update.
+ This is exposed through two IOCTLs:
+
+ - VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
+ *outdated* watchers - number of file descriptors that were open
+ during a VM generation change, and which have not yet confirmed the
+ new generation counter.
+ - VMGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
+ watchers, or if a ``timeout`` argument is provided, until the
+ timeout expires.
+
+``mmap()``:
+ The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
+ in size. The first 4 bytes of the mapped page will contain an
+ up-to-date copy of the VM generation counter.
+ The mapped memory can be used as a low-latency generation counter
+ probe mechanism in critical sections - see examples.
+
+``close()``:
+ Removes the file descriptor as a Vm generation counter watcher.
+
+Example application workflows
+-----------------------------
+
+1) Watchdog thread simplified example::
+
+ void watchdog_thread_handler(int *thread_active)
+ {
+ unsigned genid;
+ int fd = open("/dev/vmgenid", O_RDWR | O_CLOEXEC, S_IRUSR | S_IWUSR);
+
+ do {
+ // read new gen ID - blocks until VM generation changes
+ read(fd, &genid, sizeof(genid));
+
+ // because of VM generation change, we need to rebuild world
+ reseed_app_env();
+
+ // confirm we're done handling gen ID update
+ write(fd, &genid, sizeof(genid));
+ } while (atomic_read(thread_active));
+
+ close(fd);
+ }
+
+2) ASYNC simplified example::
+
+ void handle_io_on_vmgenfd(int vmgenfd)
+ {
+ unsigned genid;
+
+ // because of VM generation change, we need to rebuild world
+ reseed_app_env();
+
+ // read new gen ID - we need it to confirm we've handled update
+ read(fd, &genid, sizeof(genid));
+
+ // confirm we're done handling the gen ID update
+ write(fd, &genid, sizeof(genid));
+ }
+
+ int main() {
+ int epfd, vmgenfd;
+ struct epoll_event ev;
+
+ epfd = epoll_create(EPOLL_QUEUE_LEN);
+
+ vmgenfd = open("/dev/vmgenid",
+ O_RDWR | O_CLOEXEC | O_NONBLOCK,
+ S_IRUSR | S_IWUSR);
+
+ // register vmgenid for polling
+ ev.events = EPOLLIN;
+ ev.data.fd = vmgenfd;
+ epoll_ctl(epfd, EPOLL_CTL_ADD, vmgenfd, &ev);
+
+ // register other parts of your app for polling
+ // ...
+
+ while (1) {
+ // wait for something to do...
+ int nfds = epoll_wait(epfd, events,
+ MAX_EPOLL_EVENTS_PER_RUN,
+ EPOLL_RUN_TIMEOUT);
+ if (nfds < 0) die("Error in epoll_wait!");
+
+ // for each ready fd
+ for(int i = 0; i < nfds; i++) {
+ int fd = events[i].data.fd;
+
+ if (fd == vmgenfd)
+ handle_io_on_vmgenfd(vmgenfd);
+ else
+ handle_some_other_part_of_the_app(fd);
+ }
+ }
+
+ return 0;
+ }
+
+3) Mapped memory polling simplified example::
+
+ /*
+ * app/library function that provides cached secrets
+ */
+ char * safe_cached_secret(app_data_t *app)
+ {
+ char *secret;
+ volatile unsigned *const genid_ptr = get_vmgenid_mapping(app);
+ again:
+ secret = __cached_secret(app);
+
+ if (unlikely(*genid_ptr != app->cached_genid)) {
+ // rebuild world then confirm the genid update (thru write)
+ rebuild_caches(app);
+
+ app->cached_genid = *genid_ptr;
+ ack_vmgenid_update(app);
+
+ goto again;
+ }
+
+ return secret;
+ }
+
+4) Orchestrator simplified example::
+
+ /*
+ * orchestrator - manages multiple apps and libraries used by a service
+ * and tries to make sure all sensitive components gracefully handle
+ * VM generation changes.
+ * Following function is called on detection of a VM generation change.
+ */
+ int handle_vmgen_update(int vmgen_fd, unsigned new_gen_id)
+ {
+ // pause until all components have handled event
+ pause_service();
+
+ // confirm *this* watcher as up-to-date
+ write(vmgen_fd, &new_gen_id, sizeof(unsigned));
+
+ // wait for all *others* for at most 5 seconds.
+ ioctl(vmgen_fd, VMGENID_WAIT_WATCHERS, 5000);
+
+ // all apps on the system have rebuilt worlds
+ resume_service();
+ }
diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
index 80c5f9c1..5d5f37b 100644
--- a/drivers/virt/Kconfig
+++ b/drivers/virt/Kconfig
@@ -13,6 +13,23 @@ menuconfig VIRT_DRIVERS
if VIRT_DRIVERS
+config VMGENID
+ tristate "Virtual Machine Generation ID driver"
+ depends on ACPI
+ default N
+ help
+ This is a Virtual Machine Generation ID driver which provides
+ a virtual machine generation counter. The driver exposes FS ops
+ on /dev/vmgenid through which it can provide information and
+ notifications on VM generation changes that happen on snapshots
+ or cloning.
+ This enables applications and libraries that store or cache
+ sensitive information, to know that they need to regenerate it
+ after process memory has been exposed to potential copying.
+
+ To compile this driver as a module, choose M here: the
+ module will be called vmgenid.
+
config FSL_HV_MANAGER
tristate "Freescale hypervisor management driver"
depends on FSL_SOC
diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index f28425c..889be01 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -4,6 +4,7 @@
#
obj-$(CONFIG_FSL_HV_MANAGER) += fsl_hypervisor.o
+obj-$(CONFIG_VMGENID) += vmgenid.o
obj-y += vboxguest/
obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves/
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
new file mode 100644
index 0000000..75a787d
--- /dev/null
+++ b/drivers/virt/vmgenid.c
@@ -0,0 +1,390 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtual Machine Generation ID driver
+ *
+ * Copyright (C) 2018 Red Hat Inc. All rights reserved.
+ *
+ * Copyright (C) 2020 Amazon. All rights reserved.
+ *
+ * Authors:
+ * Adrian Catangiu <[email protected]>
+ * Or Idgar <[email protected]>
+ * Gal Hammer <[email protected]>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/random.h>
+#include <linux/uuid.h>
+#include <linux/vmgenid.h>
+
+#define DEV_NAME "vmgenid"
+ACPI_MODULE_NAME(DEV_NAME);
+
+struct dev_data {
+ struct miscdevice misc_dev;
+ unsigned long map_buf;
+ wait_queue_head_t read_wait;
+ unsigned int generation_counter;
+
+ void *uuid_iomap;
+ uuid_t uuid;
+
+ atomic_t watchers;
+ atomic_t outdated_watchers;
+ wait_queue_head_t outdated_wait;
+};
+
+struct file_data {
+ struct dev_data *dev_data;
+ unsigned int acked_gen_counter;
+};
+
+static void vmgenid_put_outdated_watchers(struct dev_data *priv)
+{
+ if (atomic_dec_and_test(&priv->outdated_watchers))
+ wake_up_interruptible(&priv->outdated_wait);
+}
+
+static int vmgenid_open(struct inode *inode, struct file *file)
+{
+ struct dev_data *priv =
+ container_of(file->private_data, struct dev_data, misc_dev);
+ struct file_data *file_data =
+ kzalloc(sizeof(struct file_data), GFP_KERNEL);
+
+ if (!file_data)
+ return -ENOMEM;
+
+ file_data->acked_gen_counter = priv->generation_counter;
+ file_data->dev_data = priv;
+
+ file->private_data = file_data;
+ atomic_inc(&priv->watchers);
+
+ return 0;
+}
+
+static int vmgenid_close(struct inode *inode, struct file *file)
+{
+ struct file_data *file_data = file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+
+ if (file_data->acked_gen_counter != priv->generation_counter)
+ vmgenid_put_outdated_watchers(priv);
+ atomic_dec(&priv->watchers);
+ kfree(file_data);
+
+ return 0;
+}
+
+static ssize_t
+vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes, loff_t *ppos)
+{
+ struct file_data *file_data = file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+ ssize_t ret;
+
+ if (nbytes == 0)
+ return 0;
+ /* disallow partial reads */
+ if (nbytes < sizeof(priv->generation_counter))
+ return -EINVAL;
+
+ if (file_data->acked_gen_counter == priv->generation_counter) {
+ if (file->f_flags & O_NONBLOCK)
+ return -EAGAIN;
+ ret = wait_event_interruptible(
+ priv->read_wait,
+ file_data->acked_gen_counter != priv->generation_counter
+ );
+ if (ret)
+ return ret;
+ }
+
+ nbytes = sizeof(priv->generation_counter);
+ ret = copy_to_user(ubuf, &priv->generation_counter, nbytes);
+ if (ret)
+ return -EFAULT;
+
+ return nbytes;
+}
+
+static ssize_t vmgenid_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct file_data *file_data = file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+ unsigned int acked_gen_count;
+
+ /* disallow partial writes */
+ if (count != sizeof(acked_gen_count))
+ return -EINVAL;
+ if (copy_from_user(&acked_gen_count, ubuf, count))
+ return -EFAULT;
+ /* wrong gen-counter acknowledged */
+ if (acked_gen_count != priv->generation_counter)
+ return -EINVAL;
+
+ if (file_data->acked_gen_counter != priv->generation_counter) {
+ /* update local view of UUID */
+ file_data->acked_gen_counter = acked_gen_count;
+ vmgenid_put_outdated_watchers(priv);
+ }
+
+ return (ssize_t)count;
+}
+
+static __poll_t
+vmgenid_poll(struct file *file, poll_table *wait)
+{
+ __poll_t mask = 0;
+ struct file_data *file_data = file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+
+ if (file_data->acked_gen_counter != priv->generation_counter)
+ return EPOLLIN | EPOLLRDNORM;
+
+ poll_wait(file, &priv->read_wait, wait);
+
+ if (file_data->acked_gen_counter != priv->generation_counter)
+ mask = EPOLLIN | EPOLLRDNORM;
+
+ return mask;
+}
+
+static long vmgenid_ioctl(struct file *file,
+ unsigned int cmd, unsigned long arg)
+{
+ struct file_data *file_data = file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+ unsigned long timeout_ns = arg * NSEC_PER_MSEC;
+ ktime_t until = ktime_set(0, timeout_ns);
+ int ret;
+
+ switch (cmd) {
+ case VMGENID_GET_OUTDATED_WATCHERS:
+ ret = atomic_read(&priv->outdated_watchers);
+ break;
+ case VMGENID_WAIT_WATCHERS:
+ ret = wait_event_interruptible_hrtimeout(
+ priv->outdated_wait,
+ !atomic_read(&priv->outdated_watchers),
+ until
+ );
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct file_data *file_data = file->private_data;
+ struct dev_data *priv = file_data->dev_data;
+
+ if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_WRITE) != 0)
+ return -EPERM;
+
+ vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_flags &= ~VM_MAYWRITE;
+ vma->vm_private_data = file_data;
+
+ return vm_insert_page(vma, vma->vm_start,
+ virt_to_page(priv->map_buf));
+}
+
+static const struct file_operations fops = {
+ .owner = THIS_MODULE,
+ .mmap = vmgenid_mmap,
+ .open = vmgenid_open,
+ .release = vmgenid_close,
+ .read = vmgenid_read,
+ .write = vmgenid_write,
+ .poll = vmgenid_poll,
+ .unlocked_ioctl = vmgenid_ioctl,
+};
+
+struct miscdevice vmgenid_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "vmgenid",
+ .fops = &fops,
+};
+
+static int vmgenid_acpi_map(struct dev_data *priv, acpi_handle handle)
+{
+ int i;
+ phys_addr_t phys_addr;
+ struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
+ acpi_status status;
+ union acpi_object *pss;
+ union acpi_object *element;
+
+ status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer);
+ if (ACPI_FAILURE(status)) {
+ ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+ return -ENODEV;
+ }
+ pss = buffer.pointer;
+ if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2)
+ return -EINVAL;
+
+ phys_addr = 0;
+ for (i = 0; i < pss->package.count; i++) {
+ element = &(pss->package.elements[i]);
+ if (element->type != ACPI_TYPE_INTEGER)
+ return -EINVAL;
+ phys_addr |= element->integer.value << i * 32;
+ }
+
+ priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t));
+ if (!priv->uuid_iomap) {
+ pr_err("Could not map memory at 0x%llx, size %u\n",
+ phys_addr,
+ (u32)sizeof(uuid_t));
+ return -ENOMEM;
+ }
+
+ memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+
+ return 0;
+}
+
+static int vmgenid_acpi_add(struct acpi_device *device)
+{
+ int ret;
+ struct dev_data *priv;
+
+ if (!device)
+ return -EINVAL;
+
+ priv = kzalloc(sizeof(struct dev_data), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->map_buf = get_zeroed_page(GFP_KERNEL);
+ if (!priv->map_buf) {
+ ret = -ENOMEM;
+ goto free;
+ }
+
+ memcpy(&priv->misc_dev, &vmgenid_misc, sizeof(vmgenid_misc));
+
+ init_waitqueue_head(&priv->read_wait);
+ atomic_set(&priv->watchers, 0);
+ atomic_set(&priv->outdated_watchers, 0);
+ init_waitqueue_head(&priv->outdated_wait);
+
+ device->driver_data = priv;
+
+ ret = vmgenid_acpi_map(priv, device->handle);
+ if (ret < 0)
+ goto err;
+
+ ret = misc_register(&priv->misc_dev);
+ if (ret < 0) {
+ pr_err("misc_register() failed for vmgenid\n");
+ goto err;
+ }
+
+ return 0;
+
+err:
+ if (priv->uuid_iomap)
+ acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t));
+
+ free_pages(priv->map_buf, 0);
+ priv->map_buf = 0;
+
+free:
+ kfree(priv);
+
+ return ret;
+}
+
+static int vmgenid_acpi_remove(struct acpi_device *device)
+{
+ struct dev_data *priv;
+
+ if (!device || !acpi_driver_data(device))
+ return -EINVAL;
+ priv = acpi_driver_data(device);
+
+ misc_deregister(&priv->misc_dev);
+ device->driver_data = NULL;
+
+ if (priv->uuid_iomap)
+ acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t));
+ free_pages(priv->map_buf, 0);
+ kfree(priv);
+
+ return 0;
+}
+
+static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
+{
+ uuid_t old_uuid;
+ struct dev_data *priv;
+
+ if (!device || !acpi_driver_data(device)) {
+ pr_err("VMGENID notify with NULL private data\n");
+ return;
+ }
+ priv = acpi_driver_data(device);
+
+ /* update VM Generation UUID */
+ old_uuid = priv->uuid;
+ memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+
+ if (memcmp(&old_uuid, &priv->uuid, sizeof(uuid_t))) {
+ /* HW uuid updated */
+ priv->generation_counter++;
+ *((unsigned int *) priv->map_buf) = priv->generation_counter;
+ atomic_set(&priv->outdated_watchers, atomic_read(&priv->watchers));
+ wake_up_interruptible(&priv->read_wait);
+ add_device_randomness(&priv->uuid, sizeof(uuid_t));
+ }
+}
+
+static const struct acpi_device_id vmgenid_ids[] = {
+ {"QEMUVGID", 0},
+ {"", 0},
+};
+
+static struct acpi_driver acpi_vmgenid_driver = {
+ .name = "vm_generation_id",
+ .ids = vmgenid_ids,
+ .owner = THIS_MODULE,
+ .ops = {
+ .add = vmgenid_acpi_add,
+ .remove = vmgenid_acpi_remove,
+ .notify = vmgenid_acpi_notify,
+ }
+};
+
+static int __init vmgenid_init(void)
+{
+ return acpi_bus_register_driver(&acpi_vmgenid_driver);
+}
+
+static void __exit vmgenid_exit(void)
+{
+ acpi_bus_unregister_driver(&acpi_vmgenid_driver);
+}
+
+module_init(vmgenid_init);
+module_exit(vmgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("Virtual Machine Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
diff --git a/include/uapi/linux/vmgenid.h b/include/uapi/linux/vmgenid.h
new file mode 100644
index 0000000..6256a55
--- /dev/null
+++ b/include/uapi/linux/vmgenid.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+
+#ifndef _UAPI_LINUX_VMGENID_H
+#define _UAPI_LINUX_VMGENID_H
+
+#include <linux/ioctl.h>
+
+#define VMGENID_IOCTL 0x2d
+#define VMGENID_GET_OUTDATED_WATCHERS _IO(VMGENID_IOCTL, 1)
+#define VMGENID_WAIT_WATCHERS _IO(VMGENID_IOCTL, 2)
+
+#endif /* _UAPI_LINUX_VMGENID_H */
+
--
2.7.4
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
> - Background
>
> The VM Generation ID is a feature defined by Microsoft (paper:
> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> multiple hypervisor vendors.
>
> The feature is required in virtualized environments by apps that work
> with local copies/caches of world-unique data such as random values,
> uuids, monotonically increasing counters, etc.
> Such apps can be negatively affected by VM snapshotting when the VM
> is either cloned or returned to an earlier point in time.
>
> The VM Generation ID is a simple concept meant to alleviate the issue
> by providing a unique ID that changes each time the VM is restored
> from a snapshot. The hw provided UUID value can be used to
> differentiate between VMs or different generations of the same VM.
>
> - Problem
>
> The VM Generation ID is exposed through an ACPI device by multiple
> hypervisor vendors but neither the vendors or upstream Linux have no
> default driver for it leaving users to fend for themselves.
>
> Furthermore, simply finding out about a VM generation change is only
> the starting point of a process to renew internal states of possibly
> multiple applications across the system. This process could benefit
> from a driver that provides an interface through which orchestration
> can be easily done.
>
> - Solution
>
> This patch is a driver that exposes a monotonic incremental Virtual
> Machine Generation u32 counter via a char-dev FS interface that
> provides sync and async VmGen counter updates notifications. It also
> provides VmGen counter retrieval and confirmation mechanisms.
>
> The hw provided UUID is not exposed to userspace, it is internally
> used by the driver to keep accounting for the exposed VmGen counter.
> The counter starts from zero when the driver is initialized and
> monotonically increments every time the hw UUID changes (the VM
> generation changes).
>
> On each hw UUID change, the new hypervisor-provided UUID is also fed
> to the kernel RNG.
>
> This patch builds on top of Or Idgar <[email protected]>'s proposal
> https://lkml.org/lkml/2018/3/1/498
>
> - Future improvements
>
> Ideally we would want the driver to register itself based on devices'
> _CID and not _HID, but unfortunately I couldn't find a way to do that.
> The problem is that ACPI device matching is done by
> '__acpi_match_device()' which exclusively looks at
> 'acpi_hardware_id *hwid'.
>
> There is a path for platform devices to match on _CID when _HID is
> 'PRP0001' - but this is not the case for the Qemu vmgenid device.
>
> Guidance and help here would be greatly appreciated.
That one is pretty important IMHO. How about the following (probably
pretty mangled) patch? That seems to work for me. The ACPI change would
obviously need to be its own stand alone change and needs proper
assessment whether it could possibly break any existing systems.
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 1682f8b454a2..452443d79d87 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -748,7 +748,7 @@ static bool __acpi_match_device(struct acpi_device
*device,
/* First, check the ACPI/PNP IDs provided by the caller. */
if (acpi_ids) {
for (id = acpi_ids; id->id[0] || id->cls; id++) {
- if (id->id[0] && !strcmp((char *)id->id, hwid->id))
+ if (id->id[0] && !strncmp((char *)id->id, hwid->id, ACPI_ID_LEN - 1))
goto out_acpi_match;
if (id->cls && __acpi_match_device_cls(id, hwid))
goto out_acpi_match;
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
index 75a787da8aad..0bfa422cf094 100644
--- a/drivers/virt/vmgenid.c
+++ b/drivers/virt/vmgenid.c
@@ -356,7 +356,8 @@ static void vmgenid_acpi_notify(struct acpi_device
*device, u32 event)
}
static const struct acpi_device_id vmgenid_ids[] = {
- {"QEMUVGID", 0},
+ /* This really is VM_Gen_Counter, but we can only match 8 characters */
+ {"VM_GEN_C", 0},
{"", 0},
};
>
> - v1 -> v2:
>
Please put the change log below your Signed-off-by line and separate it
with a "---" line. That way, git am will ignore the change log on apply.
> - expose to userspace a monotonically increasing u32 Vm Gen Counter
> instead of the hw VmGen UUID
> - since the hw/hypervisor-provided 128-bit UUID is not public
> anymore, add it to the kernel RNG as device randomness
> - insert driver page containing Vm Gen Counter in the user vma in
> the driver's mmap handler instead of using a fault handler
> - turn driver into a misc device driver to auto-create /dev/vmgenid
> - change ioctl arg to avoid leaking kernel structs to userspace
> - update documentation
> - various nits (license, unnecessary casting, Kconfig, others)
> - rebase on top of linus latest
>
> Signed-off-by: Adrian Catangiu <[email protected]>
> ---
> Documentation/virt/vmgenid.rst | 228 ++++++++++++++++++++++++
> drivers/virt/Kconfig | 17 ++
> drivers/virt/Makefile | 1 +
> drivers/virt/vmgenid.c | 390 +++++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/vmgenid.h | 13 ++
> 5 files changed, 649 insertions(+)
> create mode 100644 Documentation/virt/vmgenid.rst
> create mode 100644 drivers/virt/vmgenid.c
> create mode 100644 include/uapi/linux/vmgenid.h
>
> diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
> new file mode 100644
> index 0000000..603e8a5
> --- /dev/null
> +++ b/Documentation/virt/vmgenid.rst
> @@ -0,0 +1,228 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +============
> +VMGENID
> +============
> +
> +The VM Generation ID is a feature defined by Microsoft (paper:
> +http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> +multiple hypervisor vendors.
> +
> +The feature is required in virtualized environments by apps that work
> +with local copies/caches of world-unique data such as random values,
> +uuids, monotonically increasing counters, etc.
> +Such apps can be negatively affected by VM snapshotting when the VM
> +is either cloned or returned to an earlier point in time.
> +
> +The VM Generation ID is a simple concept meant to alleviate the issue
> +by providing a unique ID that changes each time the VM is restored
> +from a snapshot. The hw provided UUID value can be used to
> +differentiate between VMs or different generations of the same VM.
> +
> +The VM Generation ID is exposed through an ACPI device by multiple
> +hypervisor vendors. The driver for it lives at
> +``drivers/virt/vmgenid.c``
> +
> +The driver exposes a monotonic incremental Virtual Machine Generation
> +u32 counter via a char-dev FS interface that provides sync and async
> +VmGen counter updates notifications. It also provides VmGen counter
> +retrieval and confirmation mechanisms.
> +
> +The hw provided UUID is not exposed to userspace, it is internally
> +used by the driver to keep accounting for the exposed VmGen counter.
> +The counter starts from zero when the driver is initialized and
> +monotonically increments every time the hw UUID changes (the VM
> +generation changes).
> +
> +On each hw UUID change, the new UUID is also fed to the kernel RNG.
> +
> +Driver interface:
> +
> +``open()``:
> + When the device is opened, a copy of the current Vm-Gen-Id (counter)
> + is associated with the open file descriptor. The driver now tracks
> + this file as an independent *watcher*. The driver tracks how many
> + watchers are aware of the latest Vm-Gen-Id counter and how many of
> + them are *outdated*; outdated being those that have lived through
> + a Vm-Gen-Id change but not yet confirmed the new generation counter.
> +
> +``read()``:
> + Read is meant to provide the *new* VM generation counter when a
> + generation change takes place. The read operation blocks until the
> + associated counter is no longer up to date - until HW vm gen id
> + changes - at which point the new counter is provided/returned.
> + Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
> + *new* counter value available. The generation counter is considered
> + *new* for each open file descriptor that hasn't confirmed the new
> + value, following a generation change. Therefore, once a generation
> + change takes place, all ``read()`` calls will immediately return the
> + new generation counter and will continue to do so until the
> + new value is confirmed back to the driver through ``write()``.
> + Partial reads are not allowed - read buffer needs to be at least
> + ``sizeof(unsigned)`` in size.
> +
> +``write()``:
> + Write is used to confirm the up-to-date Vm Gen counter back to the
> + driver.
> + Following a VM generation change, all existing watchers are marked
> + as *outdated*. Each file descriptor will maintain the *outdated*
> + status until a ``write()`` confirms the up-to-date counter back to
> + the driver.
> + Partial writes are not allowed - write buffer should be exactly
> + ``sizeof(unsigned)`` in size.
> +
> +``poll()``:
> + Poll is implemented to allow polling for generation counter updates.
> + Such updates result in ``EPOLLIN`` polling status until the new
> + up-to-date counter is confirmed back to the driver through a
> + ``write()``.
> +
> +``ioctl()``:
> + The driver also adds support for tracking count of open file
> + descriptors that haven't acknowledged a generation counter update.
> + This is exposed through two IOCTLs:
> +
> + - VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
> + *outdated* watchers - number of file descriptors that were open
> + during a VM generation change, and which have not yet confirmed the
> + new generation counter.
> + - VMGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
> + watchers, or if a ``timeout`` argument is provided, until the
> + timeout expires.
> +
> +``mmap()``:
> + The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
> + in size. The first 4 bytes of the mapped page will contain an
> + up-to-date copy of the VM generation counter.
> + The mapped memory can be used as a low-latency generation counter
> + probe mechanism in critical sections - see examples.
> +
> +``close()``:
> + Removes the file descriptor as a Vm generation counter watcher.
> +
> +Example application workflows
> +-----------------------------
> +
> +1) Watchdog thread simplified example::
> +
> + void watchdog_thread_handler(int *thread_active)
> + {
> + unsigned genid;
> + int fd = open("/dev/vmgenid", O_RDWR | O_CLOEXEC, S_IRUSR | S_IWUSR);
> +
> + do {
> + // read new gen ID - blocks until VM generation changes
> + read(fd, &genid, sizeof(genid));
> +
> + // because of VM generation change, we need to rebuild world
> + reseed_app_env();
> +
> + // confirm we're done handling gen ID update
> + write(fd, &genid, sizeof(genid));
> + } while (atomic_read(thread_active));
> +
> + close(fd);
> + }
> +
> +2) ASYNC simplified example::
> +
> + void handle_io_on_vmgenfd(int vmgenfd)
> + {
> + unsigned genid;
> +
> + // because of VM generation change, we need to rebuild world
> + reseed_app_env();
> +
> + // read new gen ID - we need it to confirm we've handled update
> + read(fd, &genid, sizeof(genid));
This is racy in case two consecutive snapshots happen. The read needs to
go before the reseed.
> +
> + // confirm we're done handling the gen ID update
> + write(fd, &genid, sizeof(genid));
> + }
> +
> + int main() {
> + int epfd, vmgenfd;
> + struct epoll_event ev;
> +
> + epfd = epoll_create(EPOLL_QUEUE_LEN);
> +
> + vmgenfd = open("/dev/vmgenid",
> + O_RDWR | O_CLOEXEC | O_NONBLOCK,
> + S_IRUSR | S_IWUSR);
> +
> + // register vmgenid for polling
> + ev.events = EPOLLIN;
> + ev.data.fd = vmgenfd;
> + epoll_ctl(epfd, EPOLL_CTL_ADD, vmgenfd, &ev);
> +
> + // register other parts of your app for polling
> + // ...
> +
> + while (1) {
> + // wait for something to do...
> + int nfds = epoll_wait(epfd, events,
> + MAX_EPOLL_EVENTS_PER_RUN,
> + EPOLL_RUN_TIMEOUT);
> + if (nfds < 0) die("Error in epoll_wait!");
> +
> + // for each ready fd
> + for(int i = 0; i < nfds; i++) {
> + int fd = events[i].data.fd;
> +
> + if (fd == vmgenfd)
> + handle_io_on_vmgenfd(vmgenfd);
> + else
> + handle_some_other_part_of_the_app(fd);
> + }
> + }
> +
> + return 0;
> + }
> +
> +3) Mapped memory polling simplified example::
> +
> + /*
> + * app/library function that provides cached secrets
> + */
> + char * safe_cached_secret(app_data_t *app)
> + {
> + char *secret;
> + volatile unsigned *const genid_ptr = get_vmgenid_mapping(app);
> + again:
> + secret = __cached_secret(app);
> +
> + if (unlikely(*genid_ptr != app->cached_genid)) {
> + // rebuild world then confirm the genid update (thru write)
> + rebuild_caches(app);
> +
> + app->cached_genid = *genid_ptr;
This is racy again. You need to read the genid before rebuild and set it
here.
> + ack_vmgenid_update(app);
> +
> + goto again;
> + }
> +
> + return secret;
> + }
> +
> +4) Orchestrator simplified example::
> +
> + /*
> + * orchestrator - manages multiple apps and libraries used by a service
> + * and tries to make sure all sensitive components gracefully handle
> + * VM generation changes.
> + * Following function is called on detection of a VM generation change.
> + */
> + int handle_vmgen_update(int vmgen_fd, unsigned new_gen_id)
> + {
> + // pause until all components have handled event
> + pause_service();
> +
> + // confirm *this* watcher as up-to-date
> + write(vmgen_fd, &new_gen_id, sizeof(unsigned));
> +
> + // wait for all *others* for at most 5 seconds.
> + ioctl(vmgen_fd, VMGENID_WAIT_WATCHERS, 5000);
> +
> + // all apps on the system have rebuilt worlds
> + resume_service();
> + }
> diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
> index 80c5f9c1..5d5f37b 100644
> --- a/drivers/virt/Kconfig
> +++ b/drivers/virt/Kconfig
> @@ -13,6 +13,23 @@ menuconfig VIRT_DRIVERS
>
> if VIRT_DRIVERS
>
> +config VMGENID
> + tristate "Virtual Machine Generation ID driver"
> + depends on ACPI
> + default N
> + help
> + This is a Virtual Machine Generation ID driver which provides
> + a virtual machine generation counter. The driver exposes FS ops
> + on /dev/vmgenid through which it can provide information and
> + notifications on VM generation changes that happen on snapshots
> + or cloning.
> + This enables applications and libraries that store or cache
> + sensitive information, to know that they need to regenerate it
> + after process memory has been exposed to potential copying.
> +
> + To compile this driver as a module, choose M here: the
> + module will be called vmgenid.
> +
> config FSL_HV_MANAGER
> tristate "Freescale hypervisor management driver"
> depends on FSL_SOC
> diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
> index f28425c..889be01 100644
> --- a/drivers/virt/Makefile
> +++ b/drivers/virt/Makefile
> @@ -4,6 +4,7 @@
> #
>
> obj-$(CONFIG_FSL_HV_MANAGER) += fsl_hypervisor.o
> +obj-$(CONFIG_VMGENID) += vmgenid.o
> obj-y += vboxguest/
>
> obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves/
> diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
> new file mode 100644
> index 0000000..75a787d
> --- /dev/null
> +++ b/drivers/virt/vmgenid.c
> @@ -0,0 +1,390 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Virtual Machine Generation ID driver
> + *
> + * Copyright (C) 2018 Red Hat Inc. All rights reserved.
> + *
> + * Copyright (C) 2020 Amazon. All rights reserved.
> + *
> + * Authors:
> + * Adrian Catangiu <[email protected]>
> + * Or Idgar <[email protected]>
> + * Gal Hammer <[email protected]>
> + *
> + */
> +#include <linux/acpi.h>
> +#include <linux/kernel.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/random.h>
> +#include <linux/uuid.h>
> +#include <linux/vmgenid.h>
> +
> +#define DEV_NAME "vmgenid"
> +ACPI_MODULE_NAME(DEV_NAME);
> +
> +struct dev_data {
> + struct miscdevice misc_dev;
> + unsigned long map_buf;
> + wait_queue_head_t read_wait;
> + unsigned int generation_counter;
> +
> + void *uuid_iomap;
> + uuid_t uuid;
> +
> + atomic_t watchers;
> + atomic_t outdated_watchers;
> + wait_queue_head_t outdated_wait;
> +};
> +
> +struct file_data {
> + struct dev_data *dev_data;
> + unsigned int acked_gen_counter;
> +};
> +
> +static void vmgenid_put_outdated_watchers(struct dev_data *priv)
> +{
> + if (atomic_dec_and_test(&priv->outdated_watchers))
> + wake_up_interruptible(&priv->outdated_wait);
> +}
> +
> +static int vmgenid_open(struct inode *inode, struct file *file)
> +{
> + struct dev_data *priv =
> + container_of(file->private_data, struct dev_data, misc_dev);
> + struct file_data *file_data =
> + kzalloc(sizeof(struct file_data), GFP_KERNEL);
> +
> + if (!file_data)
> + return -ENOMEM;
> +
> + file_data->acked_gen_counter = priv->generation_counter;
> + file_data->dev_data = priv;
> +
> + file->private_data = file_data;
> + atomic_inc(&priv->watchers);
> +
> + return 0;
> +}
> +
> +static int vmgenid_close(struct inode *inode, struct file *file)
> +{
> + struct file_data *file_data = file->private_data;
> + struct dev_data *priv = file_data->dev_data;
> +
> + if (file_data->acked_gen_counter != priv->generation_counter)
> + vmgenid_put_outdated_watchers(priv);
Is this racy? Could there be a snapshot notification coming between the
branch and the put?
> + atomic_dec(&priv->watchers);
> + kfree(file_data);
> +
> + return 0;
> +}
> +
> +static ssize_t
> +vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes, loff_t *ppos)
> +{
> + struct file_data *file_data = file->private_data;
> + struct dev_data *priv = file_data->dev_data;
> + ssize_t ret;
> +
> + if (nbytes == 0)
> + return 0;
> + /* disallow partial reads */
> + if (nbytes < sizeof(priv->generation_counter))
> + return -EINVAL;
> +
> + if (file_data->acked_gen_counter == priv->generation_counter) {
> + if (file->f_flags & O_NONBLOCK)
> + return -EAGAIN;
> + ret = wait_event_interruptible(
> + priv->read_wait,
> + file_data->acked_gen_counter != priv->generation_counter
> + );
> + if (ret)
> + return ret;
> + }
> +
> + nbytes = sizeof(priv->generation_counter);
> + ret = copy_to_user(ubuf, &priv->generation_counter, nbytes);
> + if (ret)
> + return -EFAULT;
> +
> + return nbytes;
> +}
> +
> +static ssize_t vmgenid_write(struct file *file, const char __user *ubuf,
> + size_t count, loff_t *ppos)
> +{
> + struct file_data *file_data = file->private_data;
> + struct dev_data *priv = file_data->dev_data;
> + unsigned int acked_gen_count;
> +
> + /* disallow partial writes */
> + if (count != sizeof(acked_gen_count))
> + return -EINVAL;
> + if (copy_from_user(&acked_gen_count, ubuf, count))
> + return -EFAULT;
> + /* wrong gen-counter acknowledged */
> + if (acked_gen_count != priv->generation_counter)
> + return -EINVAL;
> +
> + if (file_data->acked_gen_counter != priv->generation_counter) {
> + /* update local view of UUID */
> + file_data->acked_gen_counter = acked_gen_count;
> + vmgenid_put_outdated_watchers(priv);
Same question here: What if there is a notification between the branch
and the put?
> + }
> +
> + return (ssize_t)count;
> +}
> +
> +static __poll_t
> +vmgenid_poll(struct file *file, poll_table *wait)
> +{
> + __poll_t mask = 0;
> + struct file_data *file_data = file->private_data;
> + struct dev_data *priv = file_data->dev_data;
> +
> + if (file_data->acked_gen_counter != priv->generation_counter)
> + return EPOLLIN | EPOLLRDNORM;
> +
> + poll_wait(file, &priv->read_wait, wait);
> +
> + if (file_data->acked_gen_counter != priv->generation_counter)
> + mask = EPOLLIN | EPOLLRDNORM;
> +
> + return mask;
> +}
> +
> +static long vmgenid_ioctl(struct file *file,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct file_data *file_data = file->private_data;
> + struct dev_data *priv = file_data->dev_data;
> + unsigned long timeout_ns = arg * NSEC_PER_MSEC;
> + ktime_t until = ktime_set(0, timeout_ns);
> + int ret;
> +
> + switch (cmd) {
> + case VMGENID_GET_OUTDATED_WATCHERS:
> + ret = atomic_read(&priv->outdated_watchers);
> + break;
> + case VMGENID_WAIT_WATCHERS:
> + ret = wait_event_interruptible_hrtimeout(
> + priv->outdated_wait,
> + !atomic_read(&priv->outdated_watchers),
> + until
> + );
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct file_data *file_data = file->private_data;
> + struct dev_data *priv = file_data->dev_data;
> +
> + if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
> + return -EINVAL;
> +
> + if ((vma->vm_flags & VM_WRITE) != 0)
> + return -EPERM;
> +
> + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> + vma->vm_flags &= ~VM_MAYWRITE;
> + vma->vm_private_data = file_data;
> +
> + return vm_insert_page(vma, vma->vm_start,
> + virt_to_page(priv->map_buf));
Is this weird white space introduced by my mail client or your patch? :)
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
> - Background
>
> The VM Generation ID is a feature defined by Microsoft (paper:
> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> multiple hypervisor vendors.
>
> The feature is required in virtualized environments by apps that work
> with local copies/caches of world-unique data such as random values,
> uuids, monotonically increasing counters, etc.
> Such apps can be negatively affected by VM snapshotting when the VM
> is either cloned or returned to an earlier point in time.
>
> The VM Generation ID is a simple concept meant to alleviate the issue
> by providing a unique ID that changes each time the VM is restored
> from a snapshot. The hw provided UUID value can be used to
> differentiate between VMs or different generations of the same VM.
>
> - Problem
>
> The VM Generation ID is exposed through an ACPI device by multiple
> hypervisor vendors but neither the vendors or upstream Linux have no
> default driver for it leaving users to fend for themselves.
I see that the qemu implementation is still under discussion. What is
the status of the other existing implementations. Do they already exist?
In other words is ACPI a given?
I think the majority of this driver could be used with just a different
backend for platforms without ACPI so in any case we could factor out
the backend (acpi, virtio, whatever) but if we are open we could maybe
start with something else.
On 19.11.20 13:02, Christian Borntraeger wrote:
>
> On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
>> - Background
>>
>> The VM Generation ID is a feature defined by Microsoft (paper:
>> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
>> multiple hypervisor vendors.
>>
>> The feature is required in virtualized environments by apps that work
>> with local copies/caches of world-unique data such as random values,
>> uuids, monotonically increasing counters, etc.
>> Such apps can be negatively affected by VM snapshotting when the VM
>> is either cloned or returned to an earlier point in time.
>>
>> The VM Generation ID is a simple concept meant to alleviate the issue
>> by providing a unique ID that changes each time the VM is restored
>> from a snapshot. The hw provided UUID value can be used to
>> differentiate between VMs or different generations of the same VM.
>>
>> - Problem
>>
>> The VM Generation ID is exposed through an ACPI device by multiple
>> hypervisor vendors but neither the vendors or upstream Linux have no
>> default driver for it leaving users to fend for themselves.
>
> I see that the qemu implementation is still under discussion. What is
Uh, the ACPI Vmgenid device emulation is in QEMU since 2.9.0 :).
> the status of the other existing implementations. Do they already exist?
> In other words is ACPI a given?
> I think the majority of this driver could be used with just a different
> backend for platforms without ACPI so in any case we could factor out
> the backend (acpi, virtio, whatever) but if we are open we could maybe
> start with something else.
I agree 100%. I don't think we really need a new framework in the kernel
for that. We can just have for example an s390x specific driver that
also provides the same notification mechanism through a device node that
is also named "/dev/vmgenid", no?
Or alternatively we can split the generic part of this driver as soon as
a second one comes along and then have both driver include that generic
logic.
The only piece where I'm unsure is how this will interact with CRIU. Can
containers emulate ioctls and device nodes?
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On 19.11.20 13:51, Alexander Graf wrote:
>
>
> On 19.11.20 13:02, Christian Borntraeger wrote:
>>
>> On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
>>> - Background
>>>
>>> The VM Generation ID is a feature defined by Microsoft (paper:
>>> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
>>> multiple hypervisor vendors.
>>>
>>> The feature is required in virtualized environments by apps that work
>>> with local copies/caches of world-unique data such as random values,
>>> uuids, monotonically increasing counters, etc.
>>> Such apps can be negatively affected by VM snapshotting when the VM
>>> is either cloned or returned to an earlier point in time.
>>>
>>> The VM Generation ID is a simple concept meant to alleviate the issue
>>> by providing a unique ID that changes each time the VM is restored
>>> from a snapshot. The hw provided UUID value can be used to
>>> differentiate between VMs or different generations of the same VM.
>>>
>>> - Problem
>>>
>>> The VM Generation ID is exposed through an ACPI device by multiple
>>> hypervisor vendors but neither the vendors or upstream Linux have no
>>> default driver for it leaving users to fend for themselves.
>>
>> I see that the qemu implementation is still under discussion. What is
>
> Uh, the ACPI Vmgenid device emulation is in QEMU since 2.9.0 :).
Ah right. Found it.
>
>> the status of the other existing implementations. Do they already exist?
>> In other words is ACPI a given?
>> I think the majority of this driver could be used with just a different
>> backend for platforms without ACPI so in any case we could factor out
>> the backend (acpi, virtio, whatever) but if we are open we could maybe
>> start with something else.
>
> I agree 100%. I don't think we really need a new framework in the kernel for that. We can just have for example an s390x specific driver that also provides the same notification mechanism through a device node that is also named "/dev/vmgenid", no?
>
> Or alternatively we can split the generic part of this driver as soon as a second one comes along and then have both driver include that generic logic.
Yes. I think it is probably the best variant to check if we split this into a front end /back end or provide a new driver when we have something.
>
> The only piece where I'm unsure is how this will interact with CRIU. Can containers emulate ioctls and device nodes?
>
>
> Alex
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
>
>
On Thu, Nov 19, 2020 at 01:51:18PM +0100, Alexander Graf wrote:
>
>
> On 19.11.20 13:02, Christian Borntraeger wrote:
> >
> > On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
> > > - Background
> > >
> > > The VM Generation ID is a feature defined by Microsoft (paper:
> > > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> > > multiple hypervisor vendors.
> > >
> > > The feature is required in virtualized environments by apps that work
> > > with local copies/caches of world-unique data such as random values,
> > > uuids, monotonically increasing counters, etc.
> > > Such apps can be negatively affected by VM snapshotting when the VM
> > > is either cloned or returned to an earlier point in time.
> > >
> > > The VM Generation ID is a simple concept meant to alleviate the issue
> > > by providing a unique ID that changes each time the VM is restored
> > > from a snapshot. The hw provided UUID value can be used to
> > > differentiate between VMs or different generations of the same VM.
> > >
> > > - Problem
> > >
> > > The VM Generation ID is exposed through an ACPI device by multiple
> > > hypervisor vendors but neither the vendors or upstream Linux have no
> > > default driver for it leaving users to fend for themselves.
> >
> > I see that the qemu implementation is still under discussion. What is
>
> Uh, the ACPI Vmgenid device emulation is in QEMU since 2.9.0 :).
>
> > the status of the other existing implementations. Do they already exist?
> > In other words is ACPI a given?
> > I think the majority of this driver could be used with just a different
> > backend for platforms without ACPI so in any case we could factor out
> > the backend (acpi, virtio, whatever) but if we are open we could maybe
> > start with something else.
>
> I agree 100%. I don't think we really need a new framework in the kernel for
> that. We can just have for example an s390x specific driver that also
> provides the same notification mechanism through a device node that is also
> named "/dev/vmgenid", no?
>
> Or alternatively we can split the generic part of this driver as soon as a
> second one comes along and then have both driver include that generic logic.
>
> The only piece where I'm unsure is how this will interact with CRIU.
To C/R applications that use /dev/vmgenid CRIU need to be aware of it.
Checkpointing and restoring withing the same "VM generation" shouldn't be
a problem, but IMHO, making restore work after genid bump could be
challenging.
Alex, what scenario involving CRIU did you have in mind?
> Can containers emulate ioctls and device nodes?
Containers do not emulate ioctls but they can have /dev/vmgenid inside
the container, so applications can use it the same way as outside the
container.
> Alex
--
Sincerely yours,
Mike.
On 19.11.20 18:38, Mike Rapoport wrote:
>
> On Thu, Nov 19, 2020 at 01:51:18PM +0100, Alexander Graf wrote:
>>
>>
>> On 19.11.20 13:02, Christian Borntraeger wrote:
>>>
>>> On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
>>>> - Background
>>>>
>>>> The VM Generation ID is a feature defined by Microsoft (paper:
>>>> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
>>>> multiple hypervisor vendors.
>>>>
>>>> The feature is required in virtualized environments by apps that work
>>>> with local copies/caches of world-unique data such as random values,
>>>> uuids, monotonically increasing counters, etc.
>>>> Such apps can be negatively affected by VM snapshotting when the VM
>>>> is either cloned or returned to an earlier point in time.
>>>>
>>>> The VM Generation ID is a simple concept meant to alleviate the issue
>>>> by providing a unique ID that changes each time the VM is restored
>>>> from a snapshot. The hw provided UUID value can be used to
>>>> differentiate between VMs or different generations of the same VM.
>>>>
>>>> - Problem
>>>>
>>>> The VM Generation ID is exposed through an ACPI device by multiple
>>>> hypervisor vendors but neither the vendors or upstream Linux have no
>>>> default driver for it leaving users to fend for themselves.
>>>
>>> I see that the qemu implementation is still under discussion. What is
>>
>> Uh, the ACPI Vmgenid device emulation is in QEMU since 2.9.0 :).
>>
>>> the status of the other existing implementations. Do they already exist?
>>> In other words is ACPI a given?
>>> I think the majority of this driver could be used with just a different
>>> backend for platforms without ACPI so in any case we could factor out
>>> the backend (acpi, virtio, whatever) but if we are open we could maybe
>>> start with something else.
>>
>> I agree 100%. I don't think we really need a new framework in the kernel for
>> that. We can just have for example an s390x specific driver that also
>> provides the same notification mechanism through a device node that is also
>> named "/dev/vmgenid", no?
>>
>> Or alternatively we can split the generic part of this driver as soon as a
>> second one comes along and then have both driver include that generic logic.
>>
>> The only piece where I'm unsure is how this will interact with CRIU.
>
> To C/R applications that use /dev/vmgenid CRIU need to be aware of it.
> Checkpointing and restoring withing the same "VM generation" shouldn't be
> a problem, but IMHO, making restore work after genid bump could be
> challenging.
>
> Alex, what scenario involving CRIU did you have in mind?
You can in theory run into the same situation with containers that this
patch is solving for virtual machines. You could for example do a
snapshot of a prewarmed Java runtime with CRIU to get full JIT speeds
starting from the first request.
That however means you run into the problem of predictable randomness again.
>
>> Can containers emulate ioctls and device nodes?
>
> Containers do not emulate ioctls but they can have /dev/vmgenid inside
> the container, so applications can use it the same way as outside the
> container.
Hm. I suppose we could add a CAP_ADMIN ioctl interface to /dev/vmgenid
(when container people get to the point of needing it) that sets the
generation to "at least X". That way on restore, you could just call
that with "generation at snapshot"+1.
That also means we need to have this interface available without virtual
machines then though, right?
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Hello,
+Cc Eric, Adrian
On 11/19/20 6:36 PM, Alexander Graf wrote:
> On 19.11.20 18:38, Mike Rapoport wrote:
>> On Thu, Nov 19, 2020 at 01:51:18PM +0100, Alexander Graf wrote:
>>> On 19.11.20 13:02, Christian Borntraeger wrote:
>>>> On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
>>>>> - Background
>>>>>
>>>>> The VM Generation ID is a feature defined by Microsoft (paper:
>>>>> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
>>>>> multiple hypervisor vendors.
>>>>>
>>>>> The feature is required in virtualized environments by apps that work
>>>>> with local copies/caches of world-unique data such as random values,
>>>>> uuids, monotonically increasing counters, etc.
>>>>> Such apps can be negatively affected by VM snapshotting when the VM
>>>>> is either cloned or returned to an earlier point in time.
>>>>>
>>>>> The VM Generation ID is a simple concept meant to alleviate the issue
>>>>> by providing a unique ID that changes each time the VM is restored
>>>>> from a snapshot. The hw provided UUID value can be used to
>>>>> differentiate between VMs or different generations of the same VM.
>>>>>
>>>>> - Problem
>>>>>
>>>>> The VM Generation ID is exposed through an ACPI device by multiple
>>>>> hypervisor vendors but neither the vendors or upstream Linux have no
>>>>> default driver for it leaving users to fend for themselves.
[..]
>>> The only piece where I'm unsure is how this will interact with CRIU.
>>
>> To C/R applications that use /dev/vmgenid CRIU need to be aware of it.
>> Checkpointing and restoring withing the same "VM generation" shouldn't be
>> a problem, but IMHO, making restore work after genid bump could be
>> challenging.
>>
>> Alex, what scenario involving CRIU did you have in mind?
>
> You can in theory run into the same situation with containers that this
> patch is solving for virtual machines. You could for example do a
> snapshot of a prewarmed Java runtime with CRIU to get full JIT speeds
> starting from the first request.
>
> That however means you run into the problem of predictable randomness
> again.
>
>>
>>> Can containers emulate ioctls and device nodes?
>>
>> Containers do not emulate ioctls but they can have /dev/vmgenid inside
>> the container, so applications can use it the same way as outside the
>> container.
>
> Hm. I suppose we could add a CAP_ADMIN ioctl interface to /dev/vmgenid
> (when container people get to the point of needing it) that sets the
> generation to "at least X". That way on restore, you could just call
> that with "generation at snapshot"+1.
>
> That also means we need to have this interface available without virtual
> machines then though, right?
Sounds like a good idea.
I guess, genvmid can be global on host, rather than per-userns or
per-process for simplicity. Later if somebody will have a bottleneck on
restore when every process on the machine wakes up from read() it could
be virtualized, but doing it now sounds too early.
ioctl() probably should go under
checkpoint_restore_ns_capable(current_user_ns()), rather than
CAP_SYS_ADMIN (I believe it should be safe from DOS as only CRIU should
run with this capability, but worth to document this).
Thanks,
Dmitry
On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
<[email protected]> wrote:
> This patch is a driver that exposes a monotonic incremental Virtual
> Machine Generation u32 counter via a char-dev FS interface that
> provides sync and async VmGen counter updates notifications. It also
> provides VmGen counter retrieval and confirmation mechanisms.
>
> The hw provided UUID is not exposed to userspace, it is internally
> used by the driver to keep accounting for the exposed VmGen counter.
> The counter starts from zero when the driver is initialized and
> monotonically increments every time the hw UUID changes (the VM
> generation changes).
>
> On each hw UUID change, the new hypervisor-provided UUID is also fed
> to the kernel RNG.
As for v1:
Is there a reasonable usecase for the "confirmation" mechanism? It
doesn't seem very useful to me.
How do you envision integrating this with libraries that have to work
in restrictive seccomp sandboxes? If this was in the vDSO, that would
be much easier.
On 18/11/2020 12:30, Alexander Graf wrote:
>
>
> On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
>> - Future improvements
>>
>> Ideally we would want the driver to register itself based on devices'
>> _CID and not _HID, but unfortunately I couldn't find a way to do that.
>> The problem is that ACPI device matching is done by
>> '__acpi_match_device()' which exclusively looks at
>> 'acpi_hardware_id *hwid'.
>>
>> There is a path for platform devices to match on _CID when _HID is
>> 'PRP0001' - but this is not the case for the Qemu vmgenid device.
>>
>> Guidance and help here would be greatly appreciated.
>
> That one is pretty important IMHO. How about the following (probably
> pretty mangled) patch? That seems to work for me. The ACPI change
> would obviously need to be its own stand alone change and needs proper
> assessment whether it could possibly break any existing systems.
>
> diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
> index 1682f8b454a2..452443d79d87 100644
> --- a/drivers/acpi/bus.c
> +++ b/drivers/acpi/bus.c
> @@ -748,7 +748,7 @@ static bool __acpi_match_device(struct acpi_device
> *device,
> /* First, check the ACPI/PNP IDs provided by the caller. */
> if (acpi_ids) {
> for (id = acpi_ids; id->id[0] || id->cls; id++) {
> - if (id->id[0] && !strcmp((char *)id->id, hwid->id))
> + if (id->id[0] && !strncmp((char *)id->id, hwid->id,
> ACPI_ID_LEN - 1))
> goto out_acpi_match;
> if (id->cls && __acpi_match_device_cls(id, hwid))
> goto out_acpi_match;
> diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
> index 75a787da8aad..0bfa422cf094 100644
> --- a/drivers/virt/vmgenid.c
> +++ b/drivers/virt/vmgenid.c
> @@ -356,7 +356,8 @@ static void vmgenid_acpi_notify(struct acpi_device
> *device, u32 event)
> }
>
> static const struct acpi_device_id vmgenid_ids[] = {
> - {"QEMUVGID", 0},
> + /* This really is VM_Gen_Counter, but we can only match 8
> characters */
> + {"VM_GEN_C", 0},
> {"", 0},
> };
>
Looks legit. I can propose a patch with it, but how do we validate it
doesn't break any devices?
>> +2) ASYNC simplified example::
>> +
>> + void handle_io_on_vmgenfd(int vmgenfd)
>> + {
>> + unsigned genid;
>> +
>> + // because of VM generation change, we need to rebuild world
>> + reseed_app_env();
>> +
>> + // read new gen ID - we need it to confirm we've handled update
>> + read(fd, &genid, sizeof(genid));
>
> This is racy in case two consecutive snapshots happen. The read needs
> to go before the reseed.
>
Switched them around like you suggest to avoid confusion.
But I don't see a problem with this race. The idea here is to trigger
reseed_app_env() which doesn't depend on the generation counter value.
Whether it gets incremented once or N times is irrelevant, we're just
interested that we pause execution and reseed before resuming (in
between these, whether N or M generation changes is the same thing).
>> +3) Mapped memory polling simplified example::
>> +
>> + /*
>> + * app/library function that provides cached secrets
>> + */
>> + char * safe_cached_secret(app_data_t *app)
>> + {
>> + char *secret;
>> + volatile unsigned *const genid_ptr = get_vmgenid_mapping(app);
>> + again:
>> + secret = __cached_secret(app);
>> +
>> + if (unlikely(*genid_ptr != app->cached_genid)) {
>> + // rebuild world then confirm the genid update (thru write)
>> + rebuild_caches(app);
>> +
>> + app->cached_genid = *genid_ptr;
>
> This is racy again. You need to read the genid before rebuild and set
> it here.
>
I don't see the race. Gen counter is read from volatile mapped mem, on
detected change we rebuild world, confirm the update back to the driver
then restart the loop. Loop will break when no more changes happen.
>> + ack_vmgenid_update(app);
>> +
>> + goto again;
>> + }
>> +
>> + return secret;
>> + }
>> +
>> +
>> +static int vmgenid_close(struct inode *inode, struct file *file)
>> +{
>> + struct file_data *file_data = file->private_data;
>> + struct dev_data *priv = file_data->dev_data;
>> +
>> + if (file_data->acked_gen_counter != priv->generation_counter)
>> + vmgenid_put_outdated_watchers(priv);
>
> Is this racy? Could there be a snapshot notification coming between
> the branch and the put?
>
This is indeed racy, will fix it in patch v3.
>> + atomic_dec(&priv->watchers);
>> + kfree(file_data);
>> +
>> + return 0;
>> +}
>> +static ssize_t vmgenid_write(struct file *file, const char __user
>> *ubuf,
>> + size_t count, loff_t *ppos)
>> +{
>> + struct file_data *file_data = file->private_data;
>> + struct dev_data *priv = file_data->dev_data;
>> + unsigned int acked_gen_count;
>> +
>> + /* disallow partial writes */
>> + if (count != sizeof(acked_gen_count))
>> + return -EINVAL;
>> + if (copy_from_user(&acked_gen_count, ubuf, count))
>> + return -EFAULT;
>> + /* wrong gen-counter acknowledged */
>> + if (acked_gen_count != priv->generation_counter)
>> + return -EINVAL;
>> +
>> + if (file_data->acked_gen_counter != priv->generation_counter) {
>> + /* update local view of UUID */
>> + file_data->acked_gen_counter = acked_gen_count;
>> + vmgenid_put_outdated_watchers(priv);
>
> Same question here: What if there is a notification between the branch
> and the put?
>
Right, racy here as well. Will fix in patch v3.
Thanks,
Adrian.
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
[resend in the hope that amazon will accept my mail this time instead
of replying "550 Too many invalid recipients" again]
On Fri, Nov 20, 2020 at 11:29 PM Jann Horn <[email protected]> wrote:
> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
> <[email protected]> wrote:
> > This patch is a driver that exposes a monotonic incremental Virtual
> > Machine Generation u32 counter via a char-dev FS interface that
> > provides sync and async VmGen counter updates notifications. It also
> > provides VmGen counter retrieval and confirmation mechanisms.
> >
> > The hw provided UUID is not exposed to userspace, it is internally
> > used by the driver to keep accounting for the exposed VmGen counter.
> > The counter starts from zero when the driver is initialized and
> > monotonically increments every time the hw UUID changes (the VM
> > generation changes).
> >
> > On each hw UUID change, the new hypervisor-provided UUID is also fed
> > to the kernel RNG.
>
> As for v1:
>
> Is there a reasonable usecase for the "confirmation" mechanism? It
> doesn't seem very useful to me.
>
> How do you envision integrating this with libraries that have to work
> in restrictive seccomp sandboxes? If this was in the vDSO, that would
> be much easier.
- Background
The VM Generation ID is a feature defined by Microsoft (paper:
http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
multiple hypervisor vendors.
The feature is required in virtualized environments by apps that work
with local copies/caches of world-unique data such as random values,
uuids, monotonically increasing counters, etc.
Such apps can be negatively affected by VM snapshotting when the VM
is either cloned or returned to an earlier point in time.
The VM Generation ID is a simple concept meant to alleviate the issue
by providing a unique ID that changes each time the VM is restored
from a snapshot. The hw provided UUID value can be used to
differentiate between VMs or different generations of the same VM.
- Problem
The VM Generation ID is exposed through an ACPI device by multiple
hypervisor vendors but neither the vendors or upstream Linux have no
default driver for it leaving users to fend for themselves.
Furthermore, simply finding out about a VM generation change is only
the starting point of a process to renew internal states of possibly
multiple applications across the system. This process could benefit
from a driver that provides an interface through which orchestration
can be easily done.
- Solution
This patch is a driver that exposes a monotonic incremental Virtual
Machine Generation u32 counter via a char-dev FS interface. The FS
interface provides sync and async VmGen counter updates notifications.
It also provides VmGen counter retrieval and confirmation mechanisms.
The generation counter and the interface through which it is exposed
are available even when there is no acpi device present.
When the device is present, the hw provided UUID is not exposed to
userspace, it is internally used by the driver to keep accounting for
the exposed VmGen counter. The counter starts from zero when the
driver is initialized and monotonically increments every time the hw
UUID changes (the VM generation changes).
On each hw UUID change, the new hypervisor-provided UUID is also fed
to the kernel RNG.
If there is no acpi vmgenid device present, the generation changes are
not driven by hw vmgenid events but can be driven by software through
a dedicated driver ioctl.
This patch builds on top of Or Idgar <[email protected]>'s proposal
https://lkml.org/lkml/2018/3/1/498
- Future improvements
Ideally we would want the driver to register itself based on devices'
_CID and not _HID, but unfortunately I couldn't find a way to do that.
The problem is that ACPI device matching is done by
'__acpi_match_device()' which exclusively looks at
'acpi_hardware_id *hwid'.
There is a path for platform devices to match on _CID when _HID is
'PRP0001' - but this is not the case for the Qemu vmgenid device.
Guidance and help here would be greatly appreciated.
Signed-off-by: Adrian Catangiu <[email protected]>
---
v1 -> v2:
- expose to userspace a monotonically increasing u32 Vm Gen Counter
instead of the hw VmGen UUID
- since the hw/hypervisor-provided 128-bit UUID is not public
anymore, add it to the kernel RNG as device randomness
- insert driver page containing Vm Gen Counter in the user vma in
the driver's mmap handler instead of using a fault handler
- turn driver into a misc device driver to auto-create /dev/vmgenid
- change ioctl arg to avoid leaking kernel structs to userspace
- update documentation
- various nits
- rebase on top of linus latest
v2 -> v3:
- separate the core driver logic and interface, from the ACPI device.
The ACPI vmgenid device is now one possible backend.
- fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
- add locking to avoid races between fs ops handlers and hw irq
driven generation updates
- change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
outdated or a generation change happens while waiting (thus making
current caller outdated), the ioctl returns -EINTR to signal the
user to handle event and retry. Fixes blocking on oneself.
- add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
CAP_CHECKPOINT_RESTORE capability, through which software can force
generation bump.
---
Documentation/virt/vmgenid.rst | 240 +++++++++++++++++++++++
drivers/virt/Kconfig | 17 ++
drivers/virt/Makefile | 1 +
drivers/virt/vmgenid.c | 435
+++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/vmgenid.h | 14 ++
5 files changed, 707 insertions(+)
create mode 100644 Documentation/virt/vmgenid.rst
create mode 100644 drivers/virt/vmgenid.c
create mode 100644 include/uapi/linux/vmgenid.h
diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
new file mode 100644
index 0000000..b6a9f8d
--- /dev/null
+++ b/Documentation/virt/vmgenid.rst
@@ -0,0 +1,240 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+VMGENID
+============
+
+The VM Generation ID is a feature defined by Microsoft (paper:
+http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
+multiple hypervisor vendors.
+
+The feature is required in virtualized environments by apps that work
+with local copies/caches of world-unique data such as random values,
+uuids, monotonically increasing counters, etc.
+Such apps can be negatively affected by VM snapshotting when the VM
+is either cloned or returned to an earlier point in time.
+
+The VM Generation ID is a simple concept meant to alleviate the issue
+by providing a unique ID that changes each time the VM is restored
+from a snapshot. The hw provided UUID value can be used to
+differentiate between VMs or different generations of the same VM.
+
+The VM Generation ID is exposed through an ACPI device by multiple
+hypervisor vendors. The driver for it lives at
+``drivers/virt/vmgenid.c``
+
+The ``vmgenid`` driver exposes a monotonic incremental Virtual
+Machine Generation u32 counter via a char-dev FS interface that
+provides sync and async VmGen counter updates notifications. It also
+provides VmGen counter retrieval and confirmation mechanisms.
+
+This counter and the interface through which it is exposed are
+available even when there is no acpi device present.
+
+When the device is present, the hw provided UUID is not exposed to
+userspace, it is internally used by the driver to keep accounting for
+the exposed VmGen counter. The counter starts from zero when the
+driver is initialized and monotonically increments every time the hw
+UUID changes (the VM generation changes).
+On each hw UUID change, the new UUID is also fed to the kernel RNG.
+
+If there is no acpi vmgenid device present, the generation changes are
+not driven by hw vmgenid events and thus should be driven by software
+through a dedicated driver ioctl.
+
+Driver interface:
+
+``open()``:
+ When the device is opened, a copy of the current Vm-Gen-Id (counter)
+ is associated with the open file descriptor. The driver now tracks
+ this file as an independent *watcher*. The driver tracks how many
+ watchers are aware of the latest Vm-Gen-Id counter and how many of
+ them are *outdated*; outdated being those that have lived through
+ a Vm-Gen-Id change but not yet confirmed the new generation counter.
+
+``read()``:
+ Read is meant to provide the *new* VM generation counter when a
+ generation change takes place. The read operation blocks until the
+ associated counter is no longer up to date - until HW vm gen id
+ changes - at which point the new counter is provided/returned.
+ Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
+ *new* counter value available. The generation counter is considered
+ *new* for each open file descriptor that hasn't confirmed the new
+ value, following a generation change. Therefore, once a generation
+ change takes place, all ``read()`` calls will immediately return the
+ new generation counter and will continue to do so until the
+ new value is confirmed back to the driver through ``write()``.
+ Partial reads are not allowed - read buffer needs to be at least
+ ``sizeof(unsigned)`` in size.
+
+``write()``:
+ Write is used to confirm the up-to-date Vm Gen counter back to the
+ driver.
+ Following a VM generation change, all existing watchers are marked
+ as *outdated*. Each file descriptor will maintain the *outdated*
+ status until a ``write()`` confirms the up-to-date counter back to
+ the driver.
+ Partial writes are not allowed - write buffer should be exactly
+ ``sizeof(unsigned)`` in size.
+
+``poll()``:
+ Poll is implemented to allow polling for generation counter updates.
+ Such updates result in ``EPOLLIN`` polling status until the new
+ up-to-date counter is confirmed back to the driver through a
+ ``write()``.
+
+``ioctl()``:
+ The driver also adds support for tracking count of open file
+ descriptors that haven't acknowledged a generation counter update.
+ This is exposed through two IOCTLs:
+
+ - VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
+ *outdated* watchers - number of file descriptors that were open
+ during a VM generation change, and which have not yet confirmed the
+ new generation counter.
+ - VMGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
+ watchers, or if a ``timeout`` argument is provided, until the
+ timeout expires.
+ If the current caller is *outdated* or a generation change happens
+ while waiting (thus making current caller *outdated*), the ioctl
+ returns ``-EINTR`` to signal the user to handle event and retry.
+ - VMGENID_FORCE_GEN_UPDATE: forces a generation counter bump. Can only
+ be used by processes with CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN
+ capabilities.
+
+``mmap()``:
+ The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
+ in size. The first 4 bytes of the mapped page will contain an
+ up-to-date copy of the VM generation counter.
+ The mapped memory can be used as a low-latency generation counter
+ probe mechanism in critical sections - see examples.
+
+``close()``:
+ Removes the file descriptor as a Vm generation counter watcher.
+
+Example application workflows
+-----------------------------
+
+1) Watchdog thread simplified example::
+
+ void watchdog_thread_handler(int *thread_active)
+ {
+ unsigned genid;
+ int fd = open("/dev/vmgenid", O_RDWR | O_CLOEXEC, S_IRUSR |
S_IWUSR);
+
+ do {
+ // read new gen ID - blocks until VM generation changes
+ read(fd, &genid, sizeof(genid));
+
+ // because of VM generation change, we need to rebuild world
+ reseed_app_env();
+
+ // confirm we're done handling gen ID update
+ write(fd, &genid, sizeof(genid));
+ } while (atomic_read(thread_active));
+
+ close(fd);
+ }
+
+2) ASYNC simplified example::
+
+ void handle_io_on_vmgenfd(int vmgenfd)
+ {
+ unsigned genid;
+
+ // read new gen ID - we need it to confirm we've handled update
+ read(fd, &genid, sizeof(genid));
+
+ // because of VM generation change, we need to rebuild world
+ reseed_app_env();
+
+ // confirm we're done handling the gen ID update
+ write(fd, &genid, sizeof(genid));
+ }
+
+ int main() {
+ int epfd, vmgenfd;
+ struct epoll_event ev;
+
+ epfd = epoll_create(EPOLL_QUEUE_LEN);
+
+ vmgenfd = open("/dev/vmgenid",
+ O_RDWR | O_CLOEXEC | O_NONBLOCK,
+ S_IRUSR | S_IWUSR);
+
+ // register vmgenid for polling
+ ev.events = EPOLLIN;
+ ev.data.fd = vmgenfd;
+ epoll_ctl(epfd, EPOLL_CTL_ADD, vmgenfd, &ev);
+
+ // register other parts of your app for polling
+ // ...
+
+ while (1) {
+ // wait for something to do...
+ int nfds = epoll_wait(epfd, events,
+ MAX_EPOLL_EVENTS_PER_RUN,
+ EPOLL_RUN_TIMEOUT);
+ if (nfds < 0) die("Error in epoll_wait!");
+
+ // for each ready fd
+ for(int i = 0; i < nfds; i++) {
+ int fd = events[i].data.fd;
+
+ if (fd == vmgenfd)
+ handle_io_on_vmgenfd(vmgenfd);
+ else
+ handle_some_other_part_of_the_app(fd);
+ }
+ }
+
+ return 0;
+ }
+
+3) Mapped memory polling simplified example::
+
+ /*
+ * app/library function that provides cached secrets
+ */
+ char * safe_cached_secret(app_data_t *app)
+ {
+ char *secret;
+ volatile unsigned *const genid_ptr = get_vmgenid_mapping(app);
+ again:
+ secret = __cached_secret(app);
+
+ if (unlikely(*genid_ptr != app->cached_genid)) {
+ // rebuild world then confirm the genid update (thru write)
+ rebuild_caches(app);
+
+ app->cached_genid = *genid_ptr;
+ ack_vmgenid_update(app);
+
+ goto again;
+ }
+
+ return secret;
+ }
+
+4) Orchestrator simplified example::
+
+ /*
+ * orchestrator - manages multiple apps and libraries used by a service
+ * and tries to make sure all sensitive components gracefully handle
+ * VM generation changes.
+ * Following function is called on detection of a VM generation change.
+ */
+ int handle_vmgen_update(int vmgen_fd, unsigned new_gen_id)
+ {
+ // pause until all components have handled event
+ pause_service();
+
+ // confirm *this* watcher as up-to-date
+ write(vmgen_fd, &new_gen_id, sizeof(unsigned));
+
+ // wait for all *others* for at most 5 seconds.
+ ioctl(vmgen_fd, VMGENID_WAIT_WATCHERS, 5000);
+
+ // all apps on the system have rebuilt worlds
+ resume_service();
+ }
diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
index 80c5f9c1..5d5f37b 100644
--- a/drivers/virt/Kconfig
+++ b/drivers/virt/Kconfig
@@ -13,6 +13,23 @@ menuconfig VIRT_DRIVERS
if VIRT_DRIVERS
+config VMGENID
+ tristate "Virtual Machine Generation ID driver"
+ depends on ACPI
+ default N
+ help
+ This is a Virtual Machine Generation ID driver which provides
+ a virtual machine generation counter. The driver exposes FS ops
+ on /dev/vmgenid through which it can provide information and
+ notifications on VM generation changes that happen on snapshots
+ or cloning.
+ This enables applications and libraries that store or cache
+ sensitive information, to know that they need to regenerate it
+ after process memory has been exposed to potential copying.
+
+ To compile this driver as a module, choose M here: the
+ module will be called vmgenid.
+
config FSL_HV_MANAGER
tristate "Freescale hypervisor management driver"
depends on FSL_SOC
diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index f28425c..889be01 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -4,6 +4,7 @@
#
obj-$(CONFIG_FSL_HV_MANAGER) += fsl_hypervisor.o
+obj-$(CONFIG_VMGENID) += vmgenid.o
obj-y += vboxguest/
obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves/
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
new file mode 100644
index 0000000..c4d4683
--- /dev/null
+++ b/drivers/virt/vmgenid.c
@@ -0,0 +1,435 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtual Machine Generation ID driver
+ *
+ * Copyright (C) 2018 Red Hat Inc. All rights reserved.
+ *
+ * Copyright (C) 2020 Amazon. All rights reserved.
+ *
+ * Authors:
+ * Adrian Catangiu <[email protected]>
+ * Or Idgar <[email protected]>
+ * Gal Hammer <[email protected]>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/random.h>
+#include <linux/uuid.h>
+#include <linux/vmgenid.h>
+
+#define DEV_NAME "vmgenid"
+ACPI_MODULE_NAME(DEV_NAME);
+
+struct acpi_data {
+ uuid_t uuid;
+ void *uuid_iomap;
+};
+
+struct driver_data {
+ unsigned long map_buf;
+ wait_queue_head_t read_waitq;
+ atomic_t generation_counter;
+
+ unsigned int watchers;
+ atomic_t outdated_watchers;
+ wait_queue_head_t outdated_waitq;
+ spinlock_t lock;
+
+ struct acpi_data *acpi_data;
+};
+struct driver_data driver_data;
+
+struct file_data {
+ unsigned int acked_gen_counter;
+};
+
+static int equals_gen_counter(unsigned int counter)
+{
+ return counter == atomic_read(&driver_data.generation_counter);
+}
+
+static void vmgenid_bump_generation(void)
+{
+ unsigned long flags;
+ int counter;
+
+ spin_lock_irqsave(&driver_data.lock, flags);
+ counter = atomic_inc_return(&driver_data.generation_counter);
+ *((int *) driver_data.map_buf) = counter;
+ atomic_set(&driver_data.outdated_watchers, driver_data.watchers);
+
+ wake_up_interruptible(&driver_data.read_waitq);
+ wake_up_interruptible(&driver_data.outdated_waitq);
+ spin_unlock_irqrestore(&driver_data.lock, flags);
+}
+
+static void vmgenid_put_outdated_watchers(void)
+{
+ if (atomic_dec_and_test(&driver_data.outdated_watchers))
+ wake_up_interruptible(&driver_data.outdated_waitq);
+}
+
+static int vmgenid_open(struct inode *inode, struct file *file)
+{
+ struct file_data *fdata = kzalloc(sizeof(struct file_data),
GFP_KERNEL);
+ unsigned long flags;
+
+ if (!fdata)
+ return -ENOMEM;
+
+ spin_lock_irqsave(&driver_data.lock, flags);
+ fdata->acked_gen_counter =
atomic_read(&driver_data.generation_counter);
+ ++driver_data.watchers;
+ spin_unlock_irqrestore(&driver_data.lock, flags);
+
+ file->private_data = fdata;
+
+ return 0;
+}
+
+static int vmgenid_close(struct inode *inode, struct file *file)
+{
+ struct file_data *fdata = file->private_data;
+ unsigned long flags;
+
+ spin_lock_irqsave(&driver_data.lock, flags);
+ if (!equals_gen_counter(fdata->acked_gen_counter))
+ vmgenid_put_outdated_watchers();
+ --driver_data.watchers;
+ spin_unlock_irqrestore(&driver_data.lock, flags);
+
+ kfree(fdata);
+
+ return 0;
+}
+
+static ssize_t
+vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes,
loff_t *ppos)
+{
+ struct file_data *fdata = file->private_data;
+ ssize_t ret;
+ int gen_counter;
+
+ if (nbytes == 0)
+ return 0;
+ /* disallow partial reads */
+ if (nbytes < sizeof(gen_counter))
+ return -EINVAL;
+
+ if (equals_gen_counter(fdata->acked_gen_counter)) {
+ if (file->f_flags & O_NONBLOCK)
+ return -EAGAIN;
+ ret = wait_event_interruptible(
+ driver_data.read_waitq,
+ !equals_gen_counter(fdata->acked_gen_counter)
+ );
+ if (ret)
+ return ret;
+ }
+
+ gen_counter = atomic_read(&driver_data.generation_counter);
+ ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
+ if (ret)
+ return -EFAULT;
+
+ return sizeof(gen_counter);
+}
+
+static ssize_t vmgenid_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct file_data *fdata = file->private_data;
+ unsigned int new_acked_gen;
+ unsigned long flags;
+
+ /* disallow partial writes */
+ if (count != sizeof(new_acked_gen))
+ return -EINVAL;
+ if (copy_from_user(&new_acked_gen, ubuf, count))
+ return -EFAULT;
+
+ spin_lock_irqsave(&driver_data.lock, flags);
+ /* wrong gen-counter acknowledged */
+ if (!equals_gen_counter(new_acked_gen)) {
+ spin_unlock_irqrestore(&driver_data.lock, flags);
+ return -EINVAL;
+ }
+ if (!equals_gen_counter(fdata->acked_gen_counter)) {
+ fdata->acked_gen_counter = new_acked_gen;
+ vmgenid_put_outdated_watchers();
+ }
+ spin_unlock_irqrestore(&driver_data.lock, flags);
+
+ return (ssize_t)count;
+}
+
+static __poll_t
+vmgenid_poll(struct file *file, poll_table *wait)
+{
+ __poll_t mask = 0;
+ struct file_data *fdata = file->private_data;
+
+ if (!equals_gen_counter(fdata->acked_gen_counter))
+ return EPOLLIN | EPOLLRDNORM;
+
+ poll_wait(file, &driver_data.read_waitq, wait);
+
+ if (!equals_gen_counter(fdata->acked_gen_counter))
+ mask = EPOLLIN | EPOLLRDNORM;
+
+ return mask;
+}
+
+static long vmgenid_ioctl(struct file *file,
+ unsigned int cmd, unsigned long arg)
+{
+ struct file_data *fdata = file->private_data;
+ unsigned long timeout_ns;
+ ktime_t until;
+ int ret = 0;
+
+ switch (cmd) {
+ case VMGENID_GET_OUTDATED_WATCHERS:
+ ret = atomic_read(&driver_data.outdated_watchers);
+ break;
+ case VMGENID_WAIT_WATCHERS:
+ timeout_ns = arg * NSEC_PER_MSEC;
+ until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
+
+ ret = wait_event_interruptible_hrtimeout(
+ driver_data.outdated_waitq,
+ (!atomic_read(&driver_data.outdated_watchers) ||
+ !equals_gen_counter(fdata->acked_gen_counter)),
+ until
+ );
+ if (atomic_read(&driver_data.outdated_watchers))
+ ret = -EINTR;
+ else
+ ret = 0;
+ break;
+ case VMGENID_FORCE_GEN_UPDATE:
+ if (!checkpoint_restore_ns_capable(current_user_ns()))
+ return -EACCES;
+ vmgenid_bump_generation();
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct file_data *fdata = file->private_data;
+
+ if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_WRITE) != 0)
+ return -EPERM;
+
+ vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_flags &= ~VM_MAYWRITE;
+ vma->vm_private_data = fdata;
+
+ return vm_insert_page(vma, vma->vm_start,
+ virt_to_page(driver_data.map_buf));
+}
+
+static const struct file_operations fops = {
+ .owner = THIS_MODULE,
+ .mmap = vmgenid_mmap,
+ .open = vmgenid_open,
+ .release = vmgenid_close,
+ .read = vmgenid_read,
+ .write = vmgenid_write,
+ .poll = vmgenid_poll,
+ .unlocked_ioctl = vmgenid_ioctl,
+};
+
+struct miscdevice vmgenid_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "vmgenid",
+ .fops = &fops,
+};
+
+static int vmgenid_acpi_map(struct acpi_data *priv, acpi_handle handle)
+{
+ int i;
+ phys_addr_t phys_addr;
+ struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
+ acpi_status status;
+ union acpi_object *pss;
+ union acpi_object *element;
+
+ status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer);
+ if (ACPI_FAILURE(status)) {
+ ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+ return -ENODEV;
+ }
+ pss = buffer.pointer;
+ if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2)
+ return -EINVAL;
+
+ phys_addr = 0;
+ for (i = 0; i < pss->package.count; i++) {
+ element = &(pss->package.elements[i]);
+ if (element->type != ACPI_TYPE_INTEGER)
+ return -EINVAL;
+ phys_addr |= element->integer.value << i * 32;
+ }
+
+ priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t));
+ if (!priv->uuid_iomap) {
+ pr_err("Could not map memory at 0x%llx, size %u\n",
+ phys_addr,
+ (u32) sizeof(uuid_t));
+ return -ENOMEM;
+ }
+
+ memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+
+ return 0;
+}
+
+static int vmgenid_acpi_add(struct acpi_device *device)
+{
+ int ret;
+
+ if (!device)
+ return -EINVAL;
+
+ driver_data.acpi_data = kzalloc(sizeof(struct acpi_data), GFP_KERNEL);
+ if (!driver_data.acpi_data) {
+ pr_err("vmgenid: failed to allocate acpi_data\n");
+ return -ENOMEM;
+ }
+ device->driver_data = &driver_data;
+
+ ret = vmgenid_acpi_map(driver_data.acpi_data, device->handle);
+ if (ret < 0) {
+ pr_err("vmgenid: failed to map acpi device\n");
+ goto err;
+ }
+
+ return 0;
+
+err:
+ kfree(driver_data.acpi_data);
+ driver_data.acpi_data = NULL;
+
+ return ret;
+}
+
+static int vmgenid_acpi_remove(struct acpi_device *device)
+{
+ struct acpi_data *priv;
+
+ if (!device || !acpi_driver_data(device))
+ return -EINVAL;
+
+ device->driver_data = NULL;
+ priv = driver_data.acpi_data;
+ driver_data.acpi_data = NULL;
+
+ if (priv && priv->uuid_iomap)
+ acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t));
+ kfree(priv);
+
+ return 0;
+}
+
+static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
+{
+ struct acpi_data *priv;
+ uuid_t old_uuid;
+
+ if (!device || !acpi_driver_data(device)) {
+ pr_err("VMGENID notify with NULL private data\n");
+ return;
+ }
+ priv = driver_data.acpi_data;
+
+ /* update VM Generation UUID */
+ old_uuid = priv->uuid;
+ memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+
+ if (memcmp(&old_uuid, &priv->uuid, sizeof(uuid_t))) {
+ /* HW uuid updated */
+ vmgenid_bump_generation();
+ add_device_randomness(&priv->uuid, sizeof(uuid_t));
+ }
+}
+
+static const struct acpi_device_id vmgenid_ids[] = {
+ {"QEMUVGID", 0},
+ {"", 0},
+};
+
+static struct acpi_driver acpi_vmgenid_driver = {
+ .name = "vm_generation_id",
+ .ids = vmgenid_ids,
+ .owner = THIS_MODULE,
+ .ops = {
+ .add = vmgenid_acpi_add,
+ .remove = vmgenid_acpi_remove,
+ .notify = vmgenid_acpi_notify,
+ }
+};
+
+static int __init vmgenid_init(void)
+{
+ int ret;
+
+ driver_data.map_buf = get_zeroed_page(GFP_KERNEL);
+ if (!driver_data.map_buf)
+ return -ENOMEM;
+
+ atomic_set(&driver_data.generation_counter, 0);
+ atomic_set(&driver_data.outdated_watchers, 0);
+ init_waitqueue_head(&driver_data.read_waitq);
+ init_waitqueue_head(&driver_data.outdated_waitq);
+ spin_lock_init(&driver_data.lock);
+ driver_data.acpi_data = NULL;
+
+ ret = misc_register(&vmgenid_misc);
+ if (ret < 0) {
+ pr_err("misc_register() failed for vmgenid\n");
+ goto err;
+ }
+
+ ret = acpi_bus_register_driver(&acpi_vmgenid_driver);
+ if (ret < 0)
+ pr_warn("No vmgenid acpi device found\n");
+
+ return 0;
+
+err:
+ free_pages(driver_data.map_buf, 0);
+ driver_data.map_buf = 0;
+
+ return ret;
+}
+
+static void __exit vmgenid_exit(void)
+{
+ acpi_bus_unregister_driver(&acpi_vmgenid_driver);
+
+ misc_deregister(&vmgenid_misc);
+ free_pages(driver_data.map_buf, 0);
+ driver_data.map_buf = 0;
+}
+
+module_init(vmgenid_init);
+module_exit(vmgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("Virtual Machine Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
diff --git a/include/uapi/linux/vmgenid.h b/include/uapi/linux/vmgenid.h
new file mode 100644
index 0000000..9316b00
--- /dev/null
+++ b/include/uapi/linux/vmgenid.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+
+#ifndef _UAPI_LINUX_VMGENID_H
+#define _UAPI_LINUX_VMGENID_H
+
+#include <linux/ioctl.h>
+
+#define VMGENID_IOCTL 0x2d
+#define VMGENID_GET_OUTDATED_WATCHERS _IO(VMGENID_IOCTL, 1)
+#define VMGENID_WAIT_WATCHERS _IO(VMGENID_IOCTL, 2)
+#define VMGENID_FORCE_GEN_UPDATE _IO(VMGENID_IOCTL, 3)
+
+#endif /* _UAPI_LINUX_VMGENID_H */
+
--
2.7.4
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Sorry Jann for missing your original email.
On 27/11/2020 20:22, Jann Horn wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> [resend in the hope that amazon will accept my mail this time instead
> of replying "550 Too many invalid recipients" again]
>
> On Fri, Nov 20, 2020 at 11:29 PM Jann Horn <[email protected]> wrote:
>> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
>> <[email protected]> wrote:
>>> This patch is a driver that exposes a monotonic incremental Virtual
>>> Machine Generation u32 counter via a char-dev FS interface that
>>> provides sync and async VmGen counter updates notifications. It also
>>> provides VmGen counter retrieval and confirmation mechanisms.
>>>
>>> The hw provided UUID is not exposed to userspace, it is internally
>>> used by the driver to keep accounting for the exposed VmGen counter.
>>> The counter starts from zero when the driver is initialized and
>>> monotonically increments every time the hw UUID changes (the VM
>>> generation changes).
>>>
>>> On each hw UUID change, the new hypervisor-provided UUID is also fed
>>> to the kernel RNG.
>> As for v1:
>>
>> Is there a reasonable usecase for the "confirmation" mechanism? It
>> doesn't seem very useful to me.
I think it adds value in complex scenarios with multiple users of the
mechanism, potentially at varying layers of the stack, different
processes and/or runtime libraries.
The driver offers a natural place to handle minimal orchestration
support and offer visibility in system-wide status.
A high-level service that trusts all system components to properly use
the confirmation mechanism can actually block and wait patiently for the
system to adjust to the new world. Even if it doesn't trust all
components it can still do a best-effort, timeout block.
>>
>> How do you envision integrating this with libraries that have to work
>> in restrictive seccomp sandboxes? If this was in the vDSO, that would
>> be much easier.
Since this mechanism targets all of userspace stack, the usecase greatly
vary. I doubt we can have a single silver bullet interface.
For example, the mmap interface targets user space RNGs, where as fast
and as race free as possible is key. But there also higher level
applications that don't manage their own memory or don't have access to
low-level primitives so they can't use the mmap or even vDSO interfaces.
That's what the rest of the logic is there for, the read+poll interface
and all of the orchestration logic.
Like you correctly point out, there are also scenarios like tight
seccomp jails where even the FS interfaces is inaccessible. For cases
like this and others, I believe we will have to work incrementally to
build up the interface diversity to cater to all the user scenarios
diversity.
Thanks,
Adrian.
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
On Fri, Nov 27, 2020 at 8:04 PM Catangiu, Adrian Costin
<[email protected]> wrote:
> On 27/11/2020 20:22, Jann Horn wrote:
> > On Fri, Nov 20, 2020 at 11:29 PM Jann Horn <[email protected]> wrote:
> >> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
> >> <[email protected]> wrote:
> >>> This patch is a driver that exposes a monotonic incremental Virtual
> >>> Machine Generation u32 counter via a char-dev FS interface that
> >>> provides sync and async VmGen counter updates notifications. It also
> >>> provides VmGen counter retrieval and confirmation mechanisms.
> >>>
> >>> The hw provided UUID is not exposed to userspace, it is internally
> >>> used by the driver to keep accounting for the exposed VmGen counter.
> >>> The counter starts from zero when the driver is initialized and
> >>> monotonically increments every time the hw UUID changes (the VM
> >>> generation changes).
> >>>
> >>> On each hw UUID change, the new hypervisor-provided UUID is also fed
> >>> to the kernel RNG.
> >> As for v1:
> >>
> >> Is there a reasonable usecase for the "confirmation" mechanism? It
> >> doesn't seem very useful to me.
>
> I think it adds value in complex scenarios with multiple users of the
> mechanism, potentially at varying layers of the stack, different
> processes and/or runtime libraries.
>
> The driver offers a natural place to handle minimal orchestration
> support and offer visibility in system-wide status.
>
> A high-level service that trusts all system components to properly use
> the confirmation mechanism can actually block and wait patiently for the
> system to adjust to the new world. Even if it doesn't trust all
> components it can still do a best-effort, timeout block.
What concrete action would that high-level service be able to take
after waiting for such an event?
My model of the vmgenid mechanism is that RNGs and cryptographic
libraries that use it need to be fundamentally written such that it is
guaranteed that a VM fork can not cause the same random number /
counter / ... to be reused for two different cryptographic operations
in a way visible to an attacker. This means that e.g. TLS libraries
need to, between accepting unencrypted input and sending out encrypted
data, check whether the vmgenid changed since the connection was set
up, and if a vmgenid change occurred, kill the connection.
Can you give a concrete example of a usecase where the vmgenid
mechanism is used securely and the confirmation mechanism is necessary
as part of that?
> >> How do you envision integrating this with libraries that have to work
> >> in restrictive seccomp sandboxes? If this was in the vDSO, that would
> >> be much easier.
>
> Since this mechanism targets all of userspace stack, the usecase greatly
> vary. I doubt we can have a single silver bullet interface.
>
> For example, the mmap interface targets user space RNGs, where as fast
> and as race free as possible is key. But there also higher level
> applications that don't manage their own memory or don't have access to
> low-level primitives so they can't use the mmap or even vDSO interfaces.
> That's what the rest of the logic is there for, the read+poll interface
> and all of the orchestration logic.
Are you saying that, because people might not want to write proper
bindings for this interface while also being unwilling to take the
performance hit of calling read() in every place where they would have
to do so to be fully correct, you want to build a "best-effort"
mechanism that is deliberately designed to allow some cryptographic
state reuse in a limited time window?
> Like you correctly point out, there are also scenarios like tight
> seccomp jails where even the FS interfaces is inaccessible. For cases
> like this and others, I believe we will have to work incrementally to
> build up the interface diversity to cater to all the user scenarios
> diversity.
It would be much nicer if we could have one simple interface that lets
everyone correctly do what they need to, though...
Hi Adrian,
Usually each version of a patch is a separate e-mail thread
On Fri, Nov 27, 2020 at 08:26:02PM +0200, Catangiu, Adrian Costin wrote:
> - Background
>
> The VM Generation ID is a feature defined by Microsoft (paper:
> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> multiple hypervisor vendors.
>
> The feature is required in virtualized environments by apps that work
> with local copies/caches of world-unique data such as random values,
> uuids, monotonically increasing counters, etc.
> Such apps can be negatively affected by VM snapshotting when the VM
> is either cloned or returned to an earlier point in time.
>
> The VM Generation ID is a simple concept meant to alleviate the issue
> by providing a unique ID that changes each time the VM is restored
> from a snapshot. The hw provided UUID value can be used to
> differentiate between VMs or different generations of the same VM.
>
> - Problem
>
> The VM Generation ID is exposed through an ACPI device by multiple
> hypervisor vendors but neither the vendors or upstream Linux have no
> default driver for it leaving users to fend for themselves.
>
> Furthermore, simply finding out about a VM generation change is only
> the starting point of a process to renew internal states of possibly
> multiple applications across the system. This process could benefit
> from a driver that provides an interface through which orchestration
> can be easily done.
>
> - Solution
>
> This patch is a driver that exposes a monotonic incremental Virtual
> Machine Generation u32 counter via a char-dev FS interface. The FS
> interface provides sync and async VmGen counter updates notifications.
> It also provides VmGen counter retrieval and confirmation mechanisms.
>
> The generation counter and the interface through which it is exposed
> are available even when there is no acpi device present.
>
> When the device is present, the hw provided UUID is not exposed to
> userspace, it is internally used by the driver to keep accounting for
> the exposed VmGen counter. The counter starts from zero when the
> driver is initialized and monotonically increments every time the hw
> UUID changes (the VM generation changes).
> On each hw UUID change, the new hypervisor-provided UUID is also fed
> to the kernel RNG.
>
> If there is no acpi vmgenid device present, the generation changes are
> not driven by hw vmgenid events but can be driven by software through
> a dedicated driver ioctl.
>
> This patch builds on top of Or Idgar <[email protected]>'s proposal
> https://lkml.org/lkml/2018/3/1/498
>
> - Future improvements
>
> Ideally we would want the driver to register itself based on devices'
> _CID and not _HID, but unfortunately I couldn't find a way to do that.
> The problem is that ACPI device matching is done by
> '__acpi_match_device()' which exclusively looks at
> 'acpi_hardware_id *hwid'.
>
> There is a path for platform devices to match on _CID when _HID is
> 'PRP0001' - but this is not the case for the Qemu vmgenid device.
>
> Guidance and help here would be greatly appreciated.
>
> Signed-off-by: Adrian Catangiu <[email protected]>
>
> ---
Please put the history in the descending order next time
v2 -> v3:
...
v1 -> v2:
...
> v1 -> v2:
>
> ? - expose to userspace a monotonically increasing u32 Vm Gen Counter
> ??? instead of the hw VmGen UUID
> ? - since the hw/hypervisor-provided 128-bit UUID is not public
> ??? anymore, add it to the kernel RNG as device randomness
> ? - insert driver page containing Vm Gen Counter in the user vma in
> ??? the driver's mmap handler instead of using a fault handler
> ? - turn driver into a misc device driver to auto-create /dev/vmgenid
> ? - change ioctl arg to avoid leaking kernel structs to userspace
> ? - update documentation
> ? - various nits
> ? - rebase on top of linus latest
>
> v2 -> v3:
>
> ? - separate the core driver logic and interface, from the ACPI device.
> ??? The ACPI vmgenid device is now one possible backend.
> ? - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
> ? - add locking to avoid races between fs ops handlers and hw irq
> ??? driven generation updates
> ? - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
> ??? outdated or a generation change happens while waiting (thus making
> ??? current caller outdated), the ioctl returns -EINTR to signal the
> ??? user to handle event and retry. Fixes blocking on oneself.
> ? - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
> ??? CAP_CHECKPOINT_RESTORE capability, through which software can force
> ??? generation bump.
> ---
> ?Documentation/virt/vmgenid.rst | 240 +++++++++++++++++++++++
> ?drivers/virt/Kconfig?????????? |? 17 ++
> ?drivers/virt/Makefile????????? |?? 1 +
> ?drivers/virt/vmgenid.c???????? | 435 +++++++++++++++++++++++++++++++++++++++++
> ?include/uapi/linux/vmgenid.h?? |? 14 ++
> ?5 files changed, 707 insertions(+)
> ?create mode 100644 Documentation/virt/vmgenid.rst
> ?create mode 100644 drivers/virt/vmgenid.c
> ?create mode 100644 include/uapi/linux/vmgenid.h
>
> diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
> new file mode 100644
> index 0000000..b6a9f8d
> --- /dev/null
> +++ b/Documentation/virt/vmgenid.rst
> @@ -0,0 +1,240 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +============
> +VMGENID
> +============
The "==" line should be the same length as the title, I think.
> +
> +The VM Generation ID is a feature defined by Microsoft (paper:
> +http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> +multiple hypervisor vendors.
> +
> +The feature is required in virtualized environments by apps that work
Please spell 'applications' fully
> +with local copies/caches of world-unique data such as random values,
> +uuids, monotonically increasing counters, etc.
UUIDs
> +Such apps can be negatively affected by VM snapshotting when the VM
^applications
> +is either cloned or returned to an earlier point in time.
> +
> +The VM Generation ID is a simple concept meant to alleviate the issue
> +by providing a unique ID that changes each time the VM is restored
> +from a snapshot. The hw provided UUID value can be used to
^hardware (and below as well)
> +differentiate between VMs or different generations of the same VM.
> +
> +The VM Generation ID is exposed through an ACPI device by multiple
> +hypervisor vendors. The driver for it lives at
> +``drivers/virt/vmgenid.c``
> +
> +The ``vmgenid`` driver exposes a monotonic incremental Virtual
> +Machine Generation u32 counter via a char-dev FS interface that
> +provides sync and async VmGen counter updates notifications. It also
> +provides VmGen counter retrieval and confirmation mechanisms.
It would be nice to memntion here the name of the chardev :)
> +This counter and the interface through which it is exposed are
> +available even when there is no acpi device present.
> +
> +When the device is present, the hw provided UUID is not exposed to
> +userspace, it is internally used by the driver to keep accounting for
> +the exposed VmGen counter. The counter starts from zero when the
> +driver is initialized and monotonically increments every time the hw
> +UUID changes (the VM generation changes).
> +On each hw UUID change, the new UUID is also fed to the kernel RNG.
> +
> +If there is no acpi vmgenid device present, the generation changes are
> +not driven by hw vmgenid events and thus should be driven by software
> +through a dedicated driver ioctl.
> +
> +Driver interface:
> +
> +``open()``:
> +? When the device is opened, a copy of the current Vm-Gen-Id (counter)
> +? is associated with the open file descriptor. The driver now tracks
> +? this file as an independent *watcher*. The driver tracks how many
> +? watchers are aware of the latest Vm-Gen-Id counter and how many of
> +? them are *outdated*; outdated being those that have lived through
> +? a Vm-Gen-Id change but not yet confirmed the new generation counter.
> +
> +``read()``:
> +? Read is meant to provide the *new* VM generation counter when a
> +? generation change takes place. The read operation blocks until the
> +? associated counter is no longer up to date - until HW vm gen id
> +? changes - at which point the new counter is provided/returned.
> +? Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
> +? *new* counter value available. The generation counter is considered
> +? *new* for each open file descriptor that hasn't confirmed the new
> +? value, following a generation change. Therefore, once a generation
> +? change takes place, all ``read()`` calls will immediately return the
> +? new generation counter and will continue to do so until the
> +? new value is confirmed back to the driver through ``write()``.
> +? Partial reads are not allowed - read buffer needs to be at least
> +? ``sizeof(unsigned)`` in size.
> +
> +``write()``:
> +? Write is used to confirm the up-to-date Vm Gen counter back to the
> +? driver.
> +? Following a VM generation change, all existing watchers are marked
> +? as *outdated*. Each file descriptor will maintain the *outdated*
> +? status until a ``write()`` confirms the up-to-date counter back to
> +? the driver.
> +? Partial writes are not allowed - write buffer should be exactly
> +? ``sizeof(unsigned)`` in size.
> +
> +``poll()``:
> +? Poll is implemented to allow polling for generation counter updates.
> +? Such updates result in ``EPOLLIN`` polling status until the new
> +? up-to-date counter is confirmed back to the driver through a
> +? ``write()``.
> +
> +``ioctl()``:
> +? The driver also adds support for tracking count of open file
> +? descriptors that haven't acknowledged a generation counter update.
> +? This is exposed through two IOCTLs:
> +
> +? - VMGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
> +??? *outdated* watchers - number of file descriptors that were open
> +??? during a VM generation change, and which have not yet confirmed the
> +??? new generation counter.
> +? - VMGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
> +??? watchers, or if a ``timeout`` argument is provided, until the
> +??? timeout expires.
> +??? If the current caller is *outdated* or a generation change happens
> +??? while waiting (thus making current caller *outdated*), the ioctl
> +??? returns ``-EINTR`` to signal the user to handle event and retry.
> +? - VMGENID_FORCE_GEN_UPDATE: forces a generation counter bump. Can only
> +??? be used by processes with CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN
> +??? capabilities.
> +
> +``mmap()``:
> +? The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
> +? in size. The first 4 bytes of the mapped page will contain an
> +? up-to-date copy of the VM generation counter.
> +? The mapped memory can be used as a low-latency generation counter
> +? probe mechanism in critical sections - see examples.
> +
> +``close()``:
> +? Removes the file descriptor as a Vm generation counter watcher.
> +
> +Example application workflows
> +-----------------------------
> +
> +1) Watchdog thread simplified example::
> +
> +??? void watchdog_thread_handler(int *thread_active)
> +??? {
> +??? ??? unsigned genid;
> +??? ??? int fd = open("/dev/vmgenid", O_RDWR | O_CLOEXEC, S_IRUSR |
> S_IWUSR);
> +
> +??? ??? do {
> +??? ??? ??? // read new gen ID - blocks until VM generation changes
> +??? ??? ??? read(fd, &genid, sizeof(genid));
> +
> +??? ??? ??? // because of VM generation change, we need to rebuild world
> +??? ??? ??? reseed_app_env();
> +
> +??? ??? ??? // confirm we're done handling gen ID update
> +??? ??? ??? write(fd, &genid, sizeof(genid));
> +??? ??? } while (atomic_read(thread_active));
> +
> +??? ??? close(fd);
> +??? }
> +
> +2) ASYNC simplified example::
> +
> +??? void handle_io_on_vmgenfd(int vmgenfd)
> +??? {
> +??? ??? unsigned genid;
> +
> +??? ??? // read new gen ID - we need it to confirm we've handled update
> +??? ??? read(fd, &genid, sizeof(genid));
> +
> +??? ??? // because of VM generation change, we need to rebuild world
> +??? ??? reseed_app_env();
> +
> +??? ??? // confirm we're done handling the gen ID update
> +??? ??? write(fd, &genid, sizeof(genid));
> +??? }
> +
> +??? int main() {
> +??? ??? int epfd, vmgenfd;
> +??? ??? struct epoll_event ev;
> +
> +??? ??? epfd = epoll_create(EPOLL_QUEUE_LEN);
> +
> +??? ??? vmgenfd = open("/dev/vmgenid",
> +??? ??? ?????????????? O_RDWR | O_CLOEXEC | O_NONBLOCK,
> +??? ??? ?????????????? S_IRUSR | S_IWUSR);
> +
> +??? ??? // register vmgenid for polling
> +??? ??? ev.events = EPOLLIN;
> +??? ??? ev.data.fd = vmgenfd;
> +??? ??? epoll_ctl(epfd, EPOLL_CTL_ADD, vmgenfd, &ev);
> +
> +??? ??? // register other parts of your app for polling
> +??? ??? // ...
> +
> +??? ??? while (1) {
> +??? ??? ??? // wait for something to do...
> +??? ??? ??? int nfds = epoll_wait(epfd, events,
> +??? ??? ??? ??? MAX_EPOLL_EVENTS_PER_RUN,
> +??? ??? ??? ??? EPOLL_RUN_TIMEOUT);
> +??? ??? ??? if (nfds < 0) die("Error in epoll_wait!");
> +
> +??? ??? ??? // for each ready fd
> +??? ??? ??? for(int i = 0; i < nfds; i++) {
> +??? ??? ??? ??? int fd = events[i].data.fd;
> +
> +??? ??? ??? ??? if (fd == vmgenfd)
> +??? ??? ??? ??? ??? handle_io_on_vmgenfd(vmgenfd);
> +??? ??? ??? ??? else
> +??? ??? ??? ??? ??? handle_some_other_part_of_the_app(fd);
> +??? ??? ??? }
> +??? ??? }
> +
> +??? ??? return 0;
> +??? }
> +
> +3) Mapped memory polling simplified example::
> +
> +??? /*
> +??? ?* app/library function that provides cached secrets
> +??? ?*/
> +??? char * safe_cached_secret(app_data_t *app)
> +??? {
> +??? ??? char *secret;
> +??? ??? volatile unsigned *const genid_ptr = get_vmgenid_mapping(app);
> +??? again:
> +??? ??? secret = __cached_secret(app);
> +
> +??? ??? if (unlikely(*genid_ptr != app->cached_genid)) {
> +??? ??? ??? // rebuild world then confirm the genid update (thru write)
> +??? ??? ??? rebuild_caches(app);
> +
> +??? ??? ??? app->cached_genid = *genid_ptr;
> +??? ??? ??? ack_vmgenid_update(app);
> +
> +??? ??? ??? goto again;
> +??? ??? }
> +
> +??? ??? return secret;
> +??? }
> +
> +4) Orchestrator simplified example::
> +
> +??? /*
> +??? ?* orchestrator - manages multiple apps and libraries used by a service
> +??? ?* and tries to make sure all sensitive components gracefully handle
> +??? ?* VM generation changes.
> +??? ?* Following function is called on detection of a VM generation change.
> +??? ?*/
> +??? int handle_vmgen_update(int vmgen_fd, unsigned new_gen_id)
> +??? {
> +??? ??? // pause until all components have handled event
> +??? ??? pause_service();
> +
> +??? ??? // confirm *this* watcher as up-to-date
> +??? ??? write(vmgen_fd, &new_gen_id, sizeof(unsigned));
> +
> +??? ??? // wait for all *others* for at most 5 seconds.
> +??? ??? ioctl(vmgen_fd, VMGENID_WAIT_WATCHERS, 5000);
> +
> +??? ??? // all apps on the system have rebuilt worlds
> +??? ??? resume_service();
> +??? }
> diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
> index 80c5f9c1..5d5f37b 100644
> --- a/drivers/virt/Kconfig
> +++ b/drivers/virt/Kconfig
> @@ -13,6 +13,23 @@ menuconfig VIRT_DRIVERS
> ?
> ?if VIRT_DRIVERS
> ?
> +config VMGENID
> +??? tristate "Virtual Machine Generation ID driver"
> +??? depends on ACPI
I think this is not needed. We have /dev/vmgenid regardless of ACPI
device for container usecase and we may have a different HW emulation
for s390 and PowerPC.
> +??? default N
> +??? help
> +??? ? This is a Virtual Machine Generation ID driver which provides
> +??? ? a virtual machine generation counter. The driver exposes FS ops
> +??? ? on /dev/vmgenid through which it can provide information and
> +??? ? notifications on VM generation changes that happen on snapshots
> +??? ? or cloning.
> +??? ? This enables applications and libraries that store or cache
> +??? ? sensitive information, to know that they need to regenerate it
> +??? ? after process memory has been exposed to potential copying.
> +
> +??? ? To compile this driver as a module, choose M here: the
> +??? ? module will be called vmgenid.
> +
> ?config FSL_HV_MANAGER
> ???? tristate "Freescale hypervisor management driver"
> ???? depends on FSL_SOC
> diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
> index f28425c..889be01 100644
> --- a/drivers/virt/Makefile
> +++ b/drivers/virt/Makefile
> @@ -4,6 +4,7 @@
> ?#
> ?
> ?obj-$(CONFIG_FSL_HV_MANAGER)??? += fsl_hypervisor.o
> +obj-$(CONFIG_VMGENID)??? ??? += vmgenid.o
> ?obj-y??? ??? ??? ??? += vboxguest/
> ?
> ?obj-$(CONFIG_NITRO_ENCLAVES)??? += nitro_enclaves/
> diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
> new file mode 100644
> index 0000000..c4d4683
> --- /dev/null
> +++ b/drivers/virt/vmgenid.c
> @@ -0,0 +1,435 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Virtual Machine Generation ID driver
> + *
> + * Copyright (C) 2018 Red Hat Inc. All rights reserved.
> + *
> + * Copyright (C) 2020 Amazon. All rights reserved.
> + *
> + *??? Authors:
> + *??? ? Adrian Catangiu <[email protected]>
> + *??? ? Or Idgar <[email protected]>
> + *??? ? Gal Hammer <[email protected]>
> + *
> + */
> +#include <linux/acpi.h>
> +#include <linux/kernel.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/random.h>
> +#include <linux/uuid.h>
> +#include <linux/vmgenid.h>
> +
> +#define DEV_NAME "vmgenid"
> +ACPI_MODULE_NAME(DEV_NAME);
> +
> +struct acpi_data {
> +??? uuid_t uuid;
> +??? void?? *uuid_iomap;
> +};
> +
> +struct driver_data {
I'd suggest vmgenid_data
> +??? unsigned long???? map_buf;
We use tab=8 for indentation. Please run your patch though
scripts/checkpatch.pl to make sure it conforms the coding style.
> +??? wait_queue_head_t read_waitq;
> +??? atomic_t????????? generation_counter;
> +
> +??? unsigned int????? watchers;
> +??? atomic_t????????? outdated_watchers;
> +??? wait_queue_head_t outdated_waitq;
> +??? spinlock_t??????? lock;
> +
> +??? struct acpi_data? *acpi_data;
> +};
> +struct driver_data driver_data;
static
> +
> +struct file_data {
> +??? unsigned int acked_gen_counter;
> +};
> +
> +static int equals_gen_counter(unsigned int counter)
> +{
> +??? return counter == atomic_read(&driver_data.generation_counter);
> +}
> +
> +static void vmgenid_bump_generation(void)
> +{
> +??? unsigned long flags;
> +??? int counter;
> +
> +??? spin_lock_irqsave(&driver_data.lock, flags);
> +??? counter = atomic_inc_return(&driver_data.generation_counter);
> +??? *((int *) driver_data.map_buf) = counter;
> +??? atomic_set(&driver_data.outdated_watchers, driver_data.watchers);
> +
> +??? wake_up_interruptible(&driver_data.read_waitq);
> +??? wake_up_interruptible(&driver_data.outdated_waitq);
> +??? spin_unlock_irqrestore(&driver_data.lock, flags);
> +}
> +
> +static void vmgenid_put_outdated_watchers(void)
> +{
> +??? if (atomic_dec_and_test(&driver_data.outdated_watchers))
> +??? ??? wake_up_interruptible(&driver_data.outdated_waitq);
> +}
> +
> +static int vmgenid_open(struct inode *inode, struct file *file)
> +{
> +??? struct file_data *fdata = kzalloc(sizeof(struct file_data),
> GFP_KERNEL);
> +??? unsigned long flags;
> +
> +??? if (!fdata)
> +??? ??? return -ENOMEM;
> +
> +??? spin_lock_irqsave(&driver_data.lock, flags);
> +??? fdata->acked_gen_counter =
> atomic_read(&driver_data.generation_counter);
> +??? ++driver_data.watchers;
> +??? spin_unlock_irqrestore(&driver_data.lock, flags);
> +
> +??? file->private_data = fdata;
> +
> +??? return 0;
> +}
> +
> +static int vmgenid_close(struct inode *inode, struct file *file)
> +{
> +??? struct file_data *fdata = file->private_data;
> +??? unsigned long flags;
> +
> +??? spin_lock_irqsave(&driver_data.lock, flags);
> +??? if (!equals_gen_counter(fdata->acked_gen_counter))
> +??? ??? vmgenid_put_outdated_watchers();
> +??? --driver_data.watchers;
> +??? spin_unlock_irqrestore(&driver_data.lock, flags);
> +
> +??? kfree(fdata);
> +
> +??? return 0;
> +}
> +
> +static ssize_t
> +vmgenid_read(struct file *file, char __user *ubuf, size_t nbytes,
Please keep the function name at the same line as return type and wrap
parameters to the next line.
> loff_t *ppos)
> +{
> +??? struct file_data *fdata = file->private_data;
> +??? ssize_t ret;
> +??? int gen_counter;
> +
> +??? if (nbytes == 0)
> +??? ??? return 0;
> +??? /* disallow partial reads */
> +??? if (nbytes < sizeof(gen_counter))
> +??? ??? return -EINVAL;
> +
> +??? if (equals_gen_counter(fdata->acked_gen_counter)) {
> +??? ??? if (file->f_flags & O_NONBLOCK)
> +??? ??? ??? return -EAGAIN;
> +??? ??? ret = wait_event_interruptible(
> +??? ??? ??? driver_data.read_waitq,
> +??? ??? ??? !equals_gen_counter(fdata->acked_gen_counter)
> +??? ??? );
> +??? ??? if (ret)
> +??? ??? ??? return ret;
> +??? }
> +
> +??? gen_counter = atomic_read(&driver_data.generation_counter);
> +??? ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
> +??? if (ret)
> +??? ??? return -EFAULT;
> +
> +??? return sizeof(gen_counter);
> +}
> +
> +static ssize_t vmgenid_write(struct file *file, const char __user *ubuf,
> +??? ??? ??? ??? size_t count, loff_t *ppos)
> +{
> +??? struct file_data *fdata = file->private_data;
> +??? unsigned int new_acked_gen;
> +??? unsigned long flags;
> +
> +??? /* disallow partial writes */
> +??? if (count != sizeof(new_acked_gen))
> +??? ??? return -EINVAL;
> +??? if (copy_from_user(&new_acked_gen, ubuf, count))
> +??? ??? return -EFAULT;
> +
> +??? spin_lock_irqsave(&driver_data.lock, flags);
> +??? /* wrong gen-counter acknowledged */
> +??? if (!equals_gen_counter(new_acked_gen)) {
> +??? ??? spin_unlock_irqrestore(&driver_data.lock, flags);
> +??? ??? return -EINVAL;
> +??? }
> +??? if (!equals_gen_counter(fdata->acked_gen_counter)) {
> +??? ??? fdata->acked_gen_counter = new_acked_gen;
> +??? ??? vmgenid_put_outdated_watchers();
> +??? }
> +??? spin_unlock_irqrestore(&driver_data.lock, flags);
> +
> +??? return (ssize_t)count;
> +}
> +
> +static __poll_t
> +vmgenid_poll(struct file *file, poll_table *wait)
> +{
> +??? __poll_t mask = 0;
> +??? struct file_data *fdata = file->private_data;
> +
> +??? if (!equals_gen_counter(fdata->acked_gen_counter))
> +??? ??? return EPOLLIN | EPOLLRDNORM;
> +
> +??? poll_wait(file, &driver_data.read_waitq, wait);
> +
> +??? if (!equals_gen_counter(fdata->acked_gen_counter))
> +??? ??? mask = EPOLLIN | EPOLLRDNORM;
> +
> +??? return mask;
> +}
> +
> +static long vmgenid_ioctl(struct file *file,
> +??? ??? unsigned int cmd, unsigned long arg)
> +{
> +??? struct file_data *fdata = file->private_data;
> +??? unsigned long timeout_ns;
> +??? ktime_t until;
> +??? int ret = 0;
> +
> +??? switch (cmd) {
> +??? case VMGENID_GET_OUTDATED_WATCHERS:
> +??? ??? ret = atomic_read(&driver_data.outdated_watchers);
> +??? ??? break;
> +??? case VMGENID_WAIT_WATCHERS:
> +??? ??? timeout_ns = arg * NSEC_PER_MSEC;
> +??? ??? until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
> +
> +??? ??? ret = wait_event_interruptible_hrtimeout(
> +??? ??? ??? driver_data.outdated_waitq,
> +??? ??? ??? (!atomic_read(&driver_data.outdated_watchers) ||
> +??? ??? ??? ??? ??? !equals_gen_counter(fdata->acked_gen_counter)),
> +??? ??? ??? until
> +??? ??? );
> +??? ??? if (atomic_read(&driver_data.outdated_watchers))
> +??? ??? ??? ret = -EINTR;
> +??? ??? else
> +??? ??? ??? ret = 0;
> +??? ??? break;
> +??? case VMGENID_FORCE_GEN_UPDATE:
> +??? ??? if (!checkpoint_restore_ns_capable(current_user_ns()))
> +??? ??? ??? return -EACCES;
> +??? ??? vmgenid_bump_generation();
> +??? ??? break;
> +??? default:
> +??? ??? ret = -EINVAL;
> +??? ??? break;
> +??? }
> +??? return ret;
> +}
> +
> +static int vmgenid_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +??? struct file_data *fdata = file->private_data;
> +
> +??? if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
> +??? ??? return -EINVAL;
> +
> +??? if ((vma->vm_flags & VM_WRITE) != 0)
> +??? ??? return -EPERM;
> +
> +??? vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> +??? vma->vm_flags &= ~VM_MAYWRITE;
> +??? vma->vm_private_data = fdata;
> +
> +??? return vm_insert_page(vma, vma->vm_start,
> +??? ??? ??? ??? ??? ??? ? virt_to_page(driver_data.map_buf));
> +}
> +
> +static const struct file_operations fops = {
> +??? .owner????????? = THIS_MODULE,
> +??? .mmap?????????? = vmgenid_mmap,
> +??? .open?????????? = vmgenid_open,
> +??? .release??????? = vmgenid_close,
> +??? .read?????????? = vmgenid_read,
> +??? .write????????? = vmgenid_write,
> +??? .poll?????????? = vmgenid_poll,
> +??? .unlocked_ioctl = vmgenid_ioctl,
> +};
> +
> +struct miscdevice vmgenid_misc = {
static
> +??? .minor = MISC_DYNAMIC_MINOR,
> +??? .name = "vmgenid",
> +??? .fops = &fops,
> +};
> +
> +static int vmgenid_acpi_map(struct acpi_data *priv, acpi_handle handle)
> +{
> +??? int i;
> +??? phys_addr_t phys_addr;
> +??? struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
> +??? acpi_status status;
> +??? union acpi_object *pss;
> +??? union acpi_object *element;
> +
> +??? status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer);
> +??? if (ACPI_FAILURE(status)) {
> +??? ??? ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
> +??? ??? return -ENODEV;
> +??? }
> +??? pss = buffer.pointer;
> +??? if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2)
> +??? ??? return -EINVAL;
> +
> +??? phys_addr = 0;
> +??? for (i = 0; i < pss->package.count; i++) {
> +??? ??? element = &(pss->package.elements[i]);
> +??? ??? if (element->type != ACPI_TYPE_INTEGER)
> +??? ??? ??? return -EINVAL;
> +??? ??? phys_addr |= element->integer.value << i * 32;
> +??? }
> +
> +??? priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t));
> +??? if (!priv->uuid_iomap) {
> +??? ??? pr_err("Could not map memory at 0x%llx, size %u\n",
> +??? ??? ??? ?? phys_addr,
> +??? ??? ??? ?? (u32) sizeof(uuid_t));
> +??? ??? return -ENOMEM;
> +??? }
> +
> +??? memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
> +
> +??? return 0;
> +}
> +
> +static int vmgenid_acpi_add(struct acpi_device *device)
> +{
> +??? int ret;
> +
> +??? if (!device)
> +??? ??? return -EINVAL;
> +
> +??? driver_data.acpi_data = kzalloc(sizeof(struct acpi_data), GFP_KERNEL);
> +??? if (!driver_data.acpi_data) {
> +??? ??? pr_err("vmgenid: failed to allocate acpi_data\n");
> +??? ??? return -ENOMEM;
> +??? }
> +??? device->driver_data = &driver_data;
> +
> +??? ret = vmgenid_acpi_map(driver_data.acpi_data, device->handle);
> +??? if (ret < 0) {
> +??? ??? pr_err("vmgenid: failed to map acpi device\n");
> +??? ??? goto err;
> +??? }
> +
> +??? return 0;
> +
> +err:
> +??? kfree(driver_data.acpi_data);
> +??? driver_data.acpi_data = NULL;
> +
> +??? return ret;
> +}
> +
> +static int vmgenid_acpi_remove(struct acpi_device *device)
> +{
> +??? struct acpi_data *priv;
> +
> +??? if (!device || !acpi_driver_data(device))
> +??? ??? return -EINVAL;
> +
> +??? device->driver_data = NULL;
> +??? priv = driver_data.acpi_data;
> +??? driver_data.acpi_data = NULL;
> +
> +??? if (priv && priv->uuid_iomap)
> +??? ??? acpi_os_unmap_memory(priv->uuid_iomap, sizeof(uuid_t));
> +??? kfree(priv);
> +
> +??? return 0;
> +}
> +
> +static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
> +{
> +??? struct acpi_data *priv;
> +??? uuid_t old_uuid;
> +
> +??? if (!device || !acpi_driver_data(device)) {
> +??? ??? pr_err("VMGENID notify with NULL private data\n");
> +??? ??? return;
> +??? }
> +??? priv = driver_data.acpi_data;
> +
> +??? /* update VM Generation UUID */
> +??? old_uuid = priv->uuid;
> +??? memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
> +
> +??? if (memcmp(&old_uuid, &priv->uuid, sizeof(uuid_t))) {
> +??? ??? /* HW uuid updated */
> +??? ??? vmgenid_bump_generation();
> +??? ??? add_device_randomness(&priv->uuid, sizeof(uuid_t));
> +??? }
> +}
> +
> +static const struct acpi_device_id vmgenid_ids[] = {
> +??? {"QEMUVGID", 0},
> +??? {"", 0},
> +};
> +
> +static struct acpi_driver acpi_vmgenid_driver = {
> +??? .name = "vm_generation_id",
> +??? .ids = vmgenid_ids,
> +??? .owner = THIS_MODULE,
> +??? .ops = {
> +??? ??? .add = vmgenid_acpi_add,
> +??? ??? .remove = vmgenid_acpi_remove,
> +??? ??? .notify = vmgenid_acpi_notify,
> +??? }
> +};
> +
> +static int __init vmgenid_init(void)
> +{
> +??? int ret;
> +
> +??? driver_data.map_buf = get_zeroed_page(GFP_KERNEL);
> +??? if (!driver_data.map_buf)
> +??? ??? return -ENOMEM;
> +
> +??? atomic_set(&driver_data.generation_counter, 0);
> +??? atomic_set(&driver_data.outdated_watchers, 0);
> +??? init_waitqueue_head(&driver_data.read_waitq);
> +??? init_waitqueue_head(&driver_data.outdated_waitq);
> +??? spin_lock_init(&driver_data.lock);
> +??? driver_data.acpi_data = NULL;
> +
> +??? ret = misc_register(&vmgenid_misc);
> +??? if (ret < 0) {
> +??? ??? pr_err("misc_register() failed for vmgenid\n");
> +??? ??? goto err;
> +??? }
> +
> +??? ret = acpi_bus_register_driver(&acpi_vmgenid_driver);
> +??? if (ret < 0)
> +??? ??? pr_warn("No vmgenid acpi device found\n");
I think this needs to be reworked to support no-ACPI version. For
instance we can call here something like
ret = vmgenid_hw_register();
and have
#ifdef CONFIG_ACPI
static int vmgenid_hw_register(void)
{
return acpi_bus_register_driver(&acpi_vmgenid_driver);
}
#else
static int vmgenid_hw_register(void)
{
return 0;
}
#endif
> +
> +??? return 0;
> +
> +err:
> +??? free_pages(driver_data.map_buf, 0);
> +??? driver_data.map_buf = 0;
> +
> +??? return ret;
> +}
> +
> +static void __exit vmgenid_exit(void)
> +{
> +??? acpi_bus_unregister_driver(&acpi_vmgenid_driver);
> +
> +??? misc_deregister(&vmgenid_misc);
> +??? free_pages(driver_data.map_buf, 0);
> +??? driver_data.map_buf = 0;
> +}
> +
> +module_init(vmgenid_init);
> +module_exit(vmgenid_exit);
> +
> +MODULE_AUTHOR("Adrian Catangiu");
> +MODULE_DESCRIPTION("Virtual Machine Generation ID");
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION("0.1");
> diff --git a/include/uapi/linux/vmgenid.h b/include/uapi/linux/vmgenid.h
> new file mode 100644
> index 0000000..9316b00
> --- /dev/null
> +++ b/include/uapi/linux/vmgenid.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_LINUX_VMGENID_H
> +#define _UAPI_LINUX_VMGENID_H
> +
> +#include <linux/ioctl.h>
> +
> +#define VMGENID_IOCTL 0x2d
> +#define VMGENID_GET_OUTDATED_WATCHERS _IO(VMGENID_IOCTL, 1)
> +#define VMGENID_WAIT_WATCHERS???????? _IO(VMGENID_IOCTL, 2)
> +#define VMGENID_FORCE_GEN_UPDATE????? _IO(VMGENID_IOCTL, 3)
> +
> +#endif /* _UAPI_LINUX_VMGENID_H */
> +
> --
> 2.7.4
>
--
Sincerely yours,
Mike.
"Catangiu, Adrian Costin" <[email protected]> writes:
> - Background
>
> The VM Generation ID is a feature defined by Microsoft (paper:
> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> multiple hypervisor vendors.
>
> The feature is required in virtualized environments by apps that work
> with local copies/caches of world-unique data such as random values,
> uuids, monotonically increasing counters, etc.
> Such apps can be negatively affected by VM snapshotting when the VM
> is either cloned or returned to an earlier point in time.
How does this differ from /proc/sys/kernel/random/boot_id?
> The VM Generation ID is a simple concept meant to alleviate the issue
> by providing a unique ID that changes each time the VM is restored
> from a snapshot. The hw provided UUID value can be used to
> differentiate between VMs or different generations of the same VM.
Does the VM generation ID change in a running that effectively things it
is running?
> - Problem
>
> The VM Generation ID is exposed through an ACPI device by multiple
> hypervisor vendors but neither the vendors or upstream Linux have no
> default driver for it leaving users to fend for themselves.
>
> Furthermore, simply finding out about a VM generation change is only
> the starting point of a process to renew internal states of possibly
> multiple applications across the system. This process could benefit
> from a driver that provides an interface through which orchestration
> can be easily done.
>
> - Solution
>
> This patch is a driver that exposes a monotonic incremental Virtual
> Machine Generation u32 counter via a char-dev FS interface.
Earlier it was a UUID now it is 32bit number?
> The FS
> interface provides sync and async VmGen counter updates notifications.
> It also provides VmGen counter retrieval and confirmation mechanisms.
>
> The generation counter and the interface through which it is exposed
> are available even when there is no acpi device present.
>
> When the device is present, the hw provided UUID is not exposed to
> userspace, it is internally used by the driver to keep accounting for
> the exposed VmGen counter. The counter starts from zero when the
> driver is initialized and monotonically increments every time the hw
> UUID changes (the VM generation changes).
> On each hw UUID change, the new hypervisor-provided UUID is also fed
> to the kernel RNG.
Should this be a hotplug even rather than a new character device?
Without plugging into udev and the rest of the hotplug infrastructure
I suspect things will be missed.
> If there is no acpi vmgenid device present, the generation changes are
> not driven by hw vmgenid events but can be driven by software through
> a dedicated driver ioctl.
>
> This patch builds on top of Or Idgar <[email protected]>'s proposal
> https://lkml.org/lkml/2018/3/1/498
Eric
On 27.11.20 19:26, Catangiu, Adrian Costin wrote:
> - Background
>
> The VM Generation ID is a feature defined by Microsoft (paper:
> http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> multiple hypervisor vendors.
>
> The feature is required in virtualized environments by apps that work
> with local copies/caches of world-unique data such as random values,
> uuids, monotonically increasing counters, etc.
> Such apps can be negatively affected by VM snapshotting when the VM
> is either cloned or returned to an earlier point in time.
>
> The VM Generation ID is a simple concept meant to alleviate the issue
> by providing a unique ID that changes each time the VM is restored
> from a snapshot. The hw provided UUID value can be used to
> differentiate between VMs or different generations of the same VM.
>
> - Problem
>
> The VM Generation ID is exposed through an ACPI device by multiple
> hypervisor vendors but neither the vendors or upstream Linux have no
> default driver for it leaving users to fend for themselves.
>
> Furthermore, simply finding out about a VM generation change is only
> the starting point of a process to renew internal states of possibly
> multiple applications across the system. This process could benefit
> from a driver that provides an interface through which orchestration
> can be easily done.
>
> - Solution
>
> This patch is a driver that exposes a monotonic incremental Virtual
> Machine Generation u32 counter via a char-dev FS interface. The FS
> interface provides sync and async VmGen counter updates notifications.
> It also provides VmGen counter retrieval and confirmation mechanisms.
>
> The generation counter and the interface through which it is exposed
> are available even when there is no acpi device present.
>
> When the device is present, the hw provided UUID is not exposed to
> userspace, it is internally used by the driver to keep accounting for
> the exposed VmGen counter. The counter starts from zero when the
> driver is initialized and monotonically increments every time the hw
> UUID changes (the VM generation changes).
> On each hw UUID change, the new hypervisor-provided UUID is also fed
> to the kernel RNG.
>
> If there is no acpi vmgenid device present, the generation changes are
> not driven by hw vmgenid events but can be driven by software through
> a dedicated driver ioctl.
>
> This patch builds on top of Or Idgar <[email protected]>'s proposal
> https://lkml.org/lkml/2018/3/1/498
>
> - Future improvements
>
> Ideally we would want the driver to register itself based on devices'
> _CID and not _HID, but unfortunately I couldn't find a way to do that.
> The problem is that ACPI device matching is done by
> '__acpi_match_device()' which exclusively looks at
> 'acpi_hardware_id *hwid'.
>
> There is a path for platform devices to match on _CID when _HID is
> 'PRP0001' - but this is not the case for the Qemu vmgenid device.
>
> Guidance and help here would be greatly appreciated.
>
> Signed-off-by: Adrian Catangiu <[email protected]>
>
> ---
>
> v1 -> v2:
>
> - expose to userspace a monotonically increasing u32 Vm Gen Counter
> instead of the hw VmGen UUID
> - since the hw/hypervisor-provided 128-bit UUID is not public
> anymore, add it to the kernel RNG as device randomness
> - insert driver page containing Vm Gen Counter in the user vma in
> the driver's mmap handler instead of using a fault handler
> - turn driver into a misc device driver to auto-create /dev/vmgenid
> - change ioctl arg to avoid leaking kernel structs to userspace
> - update documentation
> - various nits
> - rebase on top of linus latest
>
> v2 -> v3:
>
> - separate the core driver logic and interface, from the ACPI device.
> The ACPI vmgenid device is now one possible backend.
> - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
> - add locking to avoid races between fs ops handlers and hw irq
> driven generation updates
> - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
> outdated or a generation change happens while waiting (thus making
> current caller outdated), the ioctl returns -EINTR to signal the
> user to handle event and retry. Fixes blocking on oneself.
> - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
> CAP_CHECKPOINT_RESTORE capability, through which software can force
> generation bump.
> ---
> Documentation/virt/vmgenid.rst | 240 +++++++++++++++++++++++
> drivers/virt/Kconfig | 17 ++
> drivers/virt/Makefile | 1 +
> drivers/virt/vmgenid.c | 435
> +++++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/vmgenid.h | 14 ++
> 5 files changed, 707 insertions(+)
> create mode 100644 Documentation/virt/vmgenid.rst
> create mode 100644 drivers/virt/vmgenid.c
> create mode 100644 include/uapi/linux/vmgenid.h
>
[...]
> diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
> index 80c5f9c1..5d5f37b 100644
> --- a/drivers/virt/Kconfig
> +++ b/drivers/virt/Kconfig
> @@ -13,6 +13,23 @@ menuconfig VIRT_DRIVERS
>
> if VIRT_DRIVERS
>
> +config VMGENID
> + tristate "Virtual Machine Generation ID driver"
> + depends on ACPI
I think you want to split the KConfig bit into two now. One for generic
/dev/vmgenid support and another one for ACPI_VMGENID to automatically
bump revisions when the hypervisor indicates it.
In fact, you can probably make this two separate patches with two
separate files (read: kernel modules) even. The generic code can just
export symbols to bump the system genid.
I'm also not fully convinced that calling the generic mechanism
"vmgenid" is still accurate at this point. Can you think of a better
name? "System Generation ID", so "sysgenid" maybe?
> + default N
> + help
> + This is a Virtual Machine Generation ID driver which provides
> + a virtual machine generation counter. The driver exposes FS ops
> + on /dev/vmgenid through which it can provide information and
> + notifications on VM generation changes that happen on snapshots
> + or cloning.
> + This enables applications and libraries that store or cache
> + sensitive information, to know that they need to regenerate it
> + after process memory has been exposed to potential copying.
> +
> + To compile this driver as a module, choose M here: the
> + module will be called vmgenid.
> +
> config FSL_HV_MANAGER
> tristate "Freescale hypervisor management driver"
> depends on FSL_SOC
[...]
> + case VMGENID_FORCE_GEN_UPDATE:
> + if (!checkpoint_restore_ns_capable(current_user_ns()))
> + return -EACCES;
> + vmgenid_bump_generation();
I think this is racy and needs to be slightly different. Imagine the
following:
- container is running with genid 5
- I take a snapshot of the container
- Target system has genid 4
- I resume the container
- I call the genid update (genid = 5)
Then the container still sees genid 5, so *maybe* it won't adapt to the
new environment. This will depend on whether the container gets enough
time to adjust to genid=4 before we bump it to 5.
How about we pass a "bump, but not to this value" argument to the ioctl?
Then it would look like this:
- container is running with genid 5
- I take a snapshot of the container and its genid (5)
- Target system has genid 4
- I resume the container
- I call the genid update with avoid=5 (so we bump genid to 6)
Now all processes in the system will adapt to genid=6, including the
resumed container.
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On 27.11.20 18:17, Catangiu, Adrian Costin wrote:
>
> On 18/11/2020 12:30, Alexander Graf wrote:
>>
>>
>> On 16.11.20 16:34, Catangiu, Adrian Costin wrote:
>>> - Future improvements
>>>
>>> Ideally we would want the driver to register itself based on devices'
>>> _CID and not _HID, but unfortunately I couldn't find a way to do that.
>>> The problem is that ACPI device matching is done by
>>> '__acpi_match_device()' which exclusively looks at
>>> 'acpi_hardware_id *hwid'.
>>>
>>> There is a path for platform devices to match on _CID when _HID is
>>> 'PRP0001' - but this is not the case for the Qemu vmgenid device.
>>>
>>> Guidance and help here would be greatly appreciated.
>>
>> That one is pretty important IMHO. How about the following (probably
>> pretty mangled) patch? That seems to work for me. The ACPI change
>> would obviously need to be its own stand alone change and needs proper
>> assessment whether it could possibly break any existing systems.
>>
>> diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
>> index 1682f8b454a2..452443d79d87 100644
>> --- a/drivers/acpi/bus.c
>> +++ b/drivers/acpi/bus.c
>> @@ -748,7 +748,7 @@ static bool __acpi_match_device(struct acpi_device
>> *device,
>> /* First, check the ACPI/PNP IDs provided by the caller. */
>> if (acpi_ids) {
>> for (id = acpi_ids; id->id[0] || id->cls; id++) {
>> - if (id->id[0] && !strcmp((char *)id->id, hwid->id))
>> + if (id->id[0] && !strncmp((char *)id->id, hwid->id,
>> ACPI_ID_LEN - 1))
>> goto out_acpi_match;
>> if (id->cls && __acpi_match_device_cls(id, hwid))
>> goto out_acpi_match;
>> diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
>> index 75a787da8aad..0bfa422cf094 100644
>> --- a/drivers/virt/vmgenid.c
>> +++ b/drivers/virt/vmgenid.c
>> @@ -356,7 +356,8 @@ static void vmgenid_acpi_notify(struct acpi_device
>> *device, u32 event)
>> }
>>
>> static const struct acpi_device_id vmgenid_ids[] = {
>> - {"QEMUVGID", 0},
>> + /* This really is VM_Gen_Counter, but we can only match 8
>> characters */
>> + {"VM_GEN_C", 0},
>> {"", 0},
>> };
>>
>
> Looks legit. I can propose a patch with it, but how do we validate it
> doesn't break any devices?
Mainly by proposing it and seeing what the ACPI maintainers say. Maybe
they have a better idea even. At least this explictly nudges them.
>
>
>>> +2) ASYNC simplified example::
>>> +
>>> + void handle_io_on_vmgenfd(int vmgenfd)
>>> + {
>>> + unsigned genid;
>>> +
>>> + // because of VM generation change, we need to rebuild world
>>> + reseed_app_env();
>>> +
>>> + // read new gen ID - we need it to confirm we've handled update
>>> + read(fd, &genid, sizeof(genid));
>>
>> This is racy in case two consecutive snapshots happen. The read needs
>> to go before the reseed.
>>
> Switched them around like you suggest to avoid confusion.
>
> But I don't see a problem with this race. The idea here is to trigger
> reseed_app_env() which doesn't depend on the generation counter value.
> Whether it gets incremented once or N times is irrelevant, we're just
> interested that we pause execution and reseed before resuming (in
> between these, whether N or M generation changes is the same thing).
>
>>> +3) Mapped memory polling simplified example::
>>> +
>>> + /*
>>> + * app/library function that provides cached secrets
>>> + */
>>> + char * safe_cached_secret(app_data_t *app)
>>> + {
>>> + char *secret;
>>> + volatile unsigned *const genid_ptr = get_vmgenid_mapping(app);
>>> + again:
>>> + secret = __cached_secret(app);
>>> +
*genid_ptr = 1
cached_genid = 1
>>> + if (unlikely(*genid_ptr != app->cached_genid)) {
*genid_ptr = 2
cached_genid = 1
>>> + // rebuild world then confirm the genid update (thru write)
>>> + rebuild_caches(app);
hypervisor takes another snapshot during rebuild_caches(). Resume path
bumps genid
>>> + app->cached_genid = *genid_ptr;
*genid_ptr = 3
cached_genid = 3
>>
>> This is racy again. You need to read the genid before rebuild and set
>> it here.
>>
> I don't see the race. Gen counter is read from volatile mapped mem, on
> detected change we rebuild world, confirm the update back to the driver
> then restart the loop. Loop will break when no more changes happen.
See above. After the outlined course of things, the snapshot will
contain data that will be identical between 2 snapshots.
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On 27.11.20 21:20, Jann Horn wrote:
>
> On Fri, Nov 27, 2020 at 8:04 PM Catangiu, Adrian Costin
> <[email protected]> wrote:
>> On 27/11/2020 20:22, Jann Horn wrote:
>>> On Fri, Nov 20, 2020 at 11:29 PM Jann Horn <[email protected]> wrote:
>>>> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
>>>> <[email protected]> wrote:
>>>>> This patch is a driver that exposes a monotonic incremental Virtual
>>>>> Machine Generation u32 counter via a char-dev FS interface that
>>>>> provides sync and async VmGen counter updates notifications. It also
>>>>> provides VmGen counter retrieval and confirmation mechanisms.
>>>>>
>>>>> The hw provided UUID is not exposed to userspace, it is internally
>>>>> used by the driver to keep accounting for the exposed VmGen counter.
>>>>> The counter starts from zero when the driver is initialized and
>>>>> monotonically increments every time the hw UUID changes (the VM
>>>>> generation changes).
>>>>>
>>>>> On each hw UUID change, the new hypervisor-provided UUID is also fed
>>>>> to the kernel RNG.
>>>> As for v1:
>>>>
>>>> Is there a reasonable usecase for the "confirmation" mechanism? It
>>>> doesn't seem very useful to me.
>>
>> I think it adds value in complex scenarios with multiple users of the
>> mechanism, potentially at varying layers of the stack, different
>> processes and/or runtime libraries.
>>
>> The driver offers a natural place to handle minimal orchestration
>> support and offer visibility in system-wide status.
>>
>> A high-level service that trusts all system components to properly use
>> the confirmation mechanism can actually block and wait patiently for the
>> system to adjust to the new world. Even if it doesn't trust all
>> components it can still do a best-effort, timeout block.
>
> What concrete action would that high-level service be able to take
> after waiting for such an event?
For us, it would only allow incoming requests to the target container
after the container has successfully adjusted.
You can think of other models too. Your container orchestration engine
could prevent network traffic to reach the container applications until
the full container is adjusted for example.
> My model of the vmgenid mechanism is that RNGs and cryptographic
> libraries that use it need to be fundamentally written such that it is
> guaranteed that a VM fork can not cause the same random number /
> counter / ... to be reused for two different cryptographic operations
> in a way visible to an attacker. This means that e.g. TLS libraries
> need to, between accepting unencrypted input and sending out encrypted
> data, check whether the vmgenid changed since the connection was set
> up, and if a vmgenid change occurred, kill the connection.
>
> Can you give a concrete example of a usecase where the vmgenid
> mechanism is used securely and the confirmation mechanism is necessary
> as part of that?
The main crux here is that we have 2 fundamental approaches:
1) Transactional
For an RNG, the natural place to adjust yourself to a resumed snapshot
is in the random number retrieval. You just check if your generation is
still identical when you fetch the next random number.
Ideally, you also do the same for anything consuming such a random
number. So random number retrieval would no longer just return ( number
), but instead ( number, generation ). That way you could check at every
consumer side of the random number that it's actually still random. That
would need to cascade down.
So every key you derive from a random number, every uuid you generate,
they all would need to store the generation as well and compare if the
current generation is still the same as when they were generated. That
means you need to convert every data access method to a function call
that checks if the value is still consumable and if not, able to
regenerate it. The same applies for global values, such as a system
global UUID that is shared among multiple processes.
If you slowly move away from super integrated environments like a TLS
library and start thinking of samba system UUIDs or SSH host keys,
you'll quickly see how that approach reaches its limits.
2) Event based
Let's take a look at the complicated things to implement with the
transactional approach: samba system UUIDs, SSH host keys, global
variables in a random Java application that get initialized on
application start.
All of these are very easy to resolve through an event based mechanism.
Based on the "new generation" event, you can just generate a new UUID.
Or a new host key. All you would need to know for this to be non-racy is
that before you actually use the target services, you know they are
self-adjusted. In most container workloads, that can be achieved by not
letting network traffic go in before the event is fully processed.
What this patch set does is provide both: We allow the transactional
approach through mmap() of a shared page to be implemented for stacks
where that's easiest. You can use that when your logic is realistically
convertable to transactional. We also allow for an asynchronous event,
which can be used in environments where the transactional approach is
hard because of design constraints (language, API, system, etc.).
Combining the two, you get the best of both worlds IMHO.
>
>>>> How do you envision integrating this with libraries that have to work
>>>> in restrictive seccomp sandboxes? If this was in the vDSO, that would
>>>> be much easier.
>>
>> Since this mechanism targets all of userspace stack, the usecase greatly
>> vary. I doubt we can have a single silver bullet interface.
>>
>> For example, the mmap interface targets user space RNGs, where as fast
>> and as race free as possible is key. But there also higher level
>> applications that don't manage their own memory or don't have access to
>> low-level primitives so they can't use the mmap or even vDSO interfaces.
>> That's what the rest of the logic is there for, the read+poll interface
>> and all of the orchestration logic.
>
> Are you saying that, because people might not want to write proper
> bindings for this interface while also being unwilling to take the
> performance hit of calling read() in every place where they would have
> to do so to be fully correct, you want to build a "best-effort"
> mechanism that is deliberately designed to allow some cryptographic
> state reuse in a limited time window?
I seriously hope that for crypto, we will always(?) be able to use the
transactional approach. And there we don't even have to resort to read()
- you can just mmap() the generation ID.
What the event based mechanism is there for are the other cases that are
not easily converted to such an approach. As library owner, you always
have the choice.
That said, I don't think the "best-effort" mechanism is as bad as you
describe it above. If you're thinking of a normal VM image, imagine
systemd would implement vmgenid support. It could install a quick BPF
program that just blocks all network traffic altogether from the VM
until its genid is fully synchronized across all processes. Ideally, it
would even be able to kill uncooperative processes eventually, so that
your resumed VM is reachable after a timeout.
>> Like you correctly point out, there are also scenarios like tight
>> seccomp jails where even the FS interfaces is inaccessible. For cases
>> like this and others, I believe we will have to work incrementally to
>> build up the interface diversity to cater to all the user scenarios
>> diversity.
>
> It would be much nicer if we could have one simple interface that lets
> everyone correctly do what they need to, though...
I want a pony too :). We need to do what's best for our users here. I am
not convinced only offering a transaction based interface is going to
find the adoption we're hoping for. That means, we'll end up less secure
than we want to, not more.
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879