2021-01-12 13:06:26

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: [PATCH v4 0/2] System Generation ID driver and VMGENID backend

This feature is aimed at virtualized or containerized environments
where VM or container snapshotting duplicates memory state, which is a
challenge for applications that want to generate unique data such as
request IDs, UUIDs, and cryptographic nonces.

The patch set introduces a mechanism that provides a userspace
interface for applications and libraries to be made aware of uniqueness
breaking events such as VM or container snapshotting, and allow them to
react and adapt to such events.

Solving the uniqueness problem strongly enough for cryptographic
purposes requires a mechanism which can deterministically reseed
userspace PRNGs with new entropy at restore time. This mechanism must
also support the high-throughput and low-latency use-cases that led
programmers to pick a userspace PRNG in the first place; be usable by
both application code and libraries; allow transparent retrofitting
behind existing popular PRNG interfaces without changing application
code; it must be efficient, especially on snapshot restore; and be
simple enough for wide adoption.

The first patch in the set implements a device driver which exposes a
read-only device /dev/sysgenid to userspace, which contains a
monotonically increasing u32 generation counter. Libraries and
applications are expected to open() the device, and then call read()
which blocks until the SysGenId changes. Following an update, read()
calls no longer block until the application acknowledges the new
SysGenId by write()ing it back to the device. Non-blocking read() calls
return EAGAIN when there is no new SysGenId available. Alternatively,
libraries can mmap() the device to get a single shared page which
contains the latest SysGenId at offset 0.

SysGenId also supports a notification mechanism exposed as two IOCTLs
on the device. SYSGENID_GET_OUTDATED_WATCHERS immediately returns the
number of file descriptors to the device that were open during the last
SysGenId change but have not yet acknowledged the new id.
SYSGENID_WAIT_WATCHERS blocks until there are no open file handles on
the device which haven’t acknowledged the new id. These two interfaces
are intended for serverless and container control planes, which want to
confirm that all application code has detected and reacted to the new
SysGenId before sending an invoke to the newly-restored sandbox.

The second patch in the set adds a VmGenId driver which makes use of
the ACPI vmgenid device to drive SysGenId and to reseed kernel entropy
on VM snapshots.

---

v3 -> v4:

- split functionality in two separate kernel modules:
1. drivers/misc/sysgenid.c which provides the generic userspace
interface and mechanisms
2. drivers/virt/vmgenid.c as VMGENID acpi device driver that seeds
kernel entropy and acts as a driving backend for the generic
sysgenid
- renamed /dev/vmgenid -> /dev/sysgenid
- renamed uapi header file vmgenid.h -> sysgenid.h
- renamed ioctls VMGENID_* -> SYSGENID_*
- added ‘min_gen’ parameter to SYSGENID_FORCE_GEN_UPDATE ioctl
- fixed races in documentation examples
- various style nits
- rebased on top of linus latest

v2 -> v3:

- separate the core driver logic and interface, from the ACPI device.
The ACPI vmgenid device is now one possible backend.
- fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
- add locking to avoid races between fs ops handlers and hw irq
driven generation updates
- change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
outdated or a generation change happens while waiting (thus making
current caller outdated), the ioctl returns -EINTR to signal the
user to handle event and retry. Fixes blocking on oneself.
- add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
CAP_CHECKPOINT_RESTORE capability, through which software can force
generation bump.

v1 -> v2:

- expose to userspace a monotonically increasing u32 Vm Gen Counter
instead of the hw VmGen UUID
- since the hw/hypervisor-provided 128-bit UUID is not public
anymore, add it to the kernel RNG as device randomness
- insert driver page containing Vm Gen Counter in the user vma in
the driver's mmap handler instead of using a fault handler
- turn driver into a misc device driver to auto-create /dev/vmgenid
- change ioctl arg to avoid leaking kernel structs to userspace
- update documentation
- various nits
- rebase on top of linus latest

Adrian Catangiu (2):
drivers/misc: sysgenid: add system generation id driver
drivers/virt: vmgenid: add vm generation id driver

Documentation/misc-devices/sysgenid.rst | 240 +++++++++++++++++++++++++
Documentation/virt/vmgenid.rst | 34 ++++
drivers/misc/Kconfig | 16 ++
drivers/misc/Makefile | 1 +
drivers/misc/sysgenid.c | 298 ++++++++++++++++++++++++++++++++
drivers/virt/Kconfig | 14 ++
drivers/virt/Makefile | 1 +
drivers/virt/vmgenid.c | 153 ++++++++++++++++
include/uapi/linux/sysgenid.h | 18 ++
9 files changed, 775 insertions(+)
create mode 100644 Documentation/misc-devices/sysgenid.rst
create mode 100644 Documentation/virt/vmgenid.rst
create mode 100644 drivers/misc/sysgenid.c
create mode 100644 drivers/virt/vmgenid.c
create mode 100644 include/uapi/linux/sysgenid.h

--
2.7.4




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.


2021-01-12 13:07:01

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: [PATCH v4 1/2] drivers/misc: sysgenid: add system generation id driver

- Background and problem

The System Generation ID feature is required in virtualized or
containerized environments by applications that work with local copies
or caches of world-unique data such as random values, uuids,
monotonically increasing counters, etc.
Such applications can be negatively affected by VM or container
snapshotting when the VM or container is either cloned or returned to
an earlier point in time.

Furthermore, simply finding out about a system generation change is
only the starting point of a process to renew internal states of
possibly multiple applications across the system. This process could
benefit from a driver that provides an interface through which
orchestration can be easily done.

- Solution

The System Generation ID is a simple concept meant to alleviate the
issue by providing a monotonically increasing u32 counter that changes
each time the VM or container is restored from a snapshot.

The `sysgenid` driver exposes a monotonic incremental System Generation
u32 counter via a char-dev FS interface that provides sync and async
SysGen counter updates notifications. It also provides SysGen counter
retrieval and confirmation mechanisms.

The counter starts from zero when the driver is initialized and
monotonically increments every time the system generation changes.

The `sysgenid` driver exports the `void sysgenid_bump_generation()`
symbol which can be used by backend drivers to drive system generation
changes based on hardware events.
System generation changes can also be driven by userspace software
through a dedicated driver ioctl.

Userspace applications or libraries can then (a)synchronously consume
the system generation counter through the provided FS interface to make
any necessary internal adjustments following a system generation
update.

Signed-off-by: Adrian Catangiu <[email protected]>
---
Documentation/misc-devices/sysgenid.rst | 240 +++++++++++++++++++++++++
drivers/misc/Kconfig | 16 ++
drivers/misc/Makefile | 1 +
drivers/misc/sysgenid.c | 298 ++++++++++++++++++++++++++++++++
include/uapi/linux/sysgenid.h | 18 ++
5 files changed, 573 insertions(+)
create mode 100644 Documentation/misc-devices/sysgenid.rst
create mode 100644 drivers/misc/sysgenid.c
create mode 100644 include/uapi/linux/sysgenid.h

diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
new file mode 100644
index 0000000..0b31ccf
--- /dev/null
+++ b/Documentation/misc-devices/sysgenid.rst
@@ -0,0 +1,240 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+SYSGENID
+========
+
+The System Generation ID feature is required in virtualized or
+containerized environments by applications that work with local copies
+or caches of world-unique data such as random values, UUIDs,
+monotonically increasing counters, etc.
+Such applications can be negatively affected by VM or container
+snapshotting when the VM or container is either cloned or returned to
+an earlier point in time.
+
+The System Generation ID is a simple concept meant to alleviate the
+issue by providing a monotonically increasing counter that changes
+each time the VM or container is restored from a snapshot.
+The driver for it lives at ``drivers/misc/sysgenid.c``.
+
+The ``sysgenid`` driver exposes a monotonic incremental System
+Generation u32 counter via a char-dev FS interface accessible through
+``/dev/sysgenid`` that provides sync and async SysGen counter updates
+notifications. It also provides SysGen counter retrieval and
+confirmation mechanisms.
+
+The counter starts from zero when the driver is initialized and
+monotonically increments every time the system generation changes.
+
+The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
+symbol which can be used by backend drivers to drive system generation
+changes based on hardware events.
+System generation changes can also be driven by userspace software
+through a dedicated driver ioctl.
+
+Userspace applications or libraries can (a)synchronously consume the
+system generation counter through the provided FS interface, to make
+any necessary internal adjustments following a system generation update.
+
+Driver FS interface:
+
+``open()``:
+ When the device is opened, a copy of the current Sys-Gen-Id (counter)
+ is associated with the open file descriptor. The driver now tracks
+ this file as an independent *watcher*. The driver tracks how many
+ watchers are aware of the latest Sys-Gen-Id counter and how many of
+ them are *outdated*; outdated being those that have lived through
+ a Sys-Gen-Id change but not yet confirmed the new generation counter.
+
+``read()``:
+ Read is meant to provide the *new* system generation counter when a
+ generation change takes place. The read operation blocks until the
+ associated counter is no longer up to date, at which point the new
+ counter is provided/returned.
+ Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
+ *new* counter value available. The generation counter is considered
+ *new* for each open file descriptor that hasn't confirmed the new
+ value following a generation change. Therefore, once a generation
+ change takes place, all ``read()`` calls will immediately return the
+ new generation counter and will continue to do so until the
+ new value is confirmed back to the driver through ``write()``.
+ Partial reads are not allowed - read buffer needs to be at least
+ ``sizeof(unsigned)`` in size.
+
+``write()``:
+ Write is used to confirm the up-to-date Sys Gen counter back to the
+ driver.
+ Following a VM generation change, all existing watchers are marked
+ as *outdated*. Each file descriptor will maintain the *outdated*
+ status until a ``write()`` confirms the up-to-date counter back to
+ the driver.
+ Partial writes are not allowed - write buffer should be exactly
+ ``sizeof(unsigned)`` in size.
+
+``poll()``:
+ Poll is implemented to allow polling for generation counter updates.
+ Such updates result in ``EPOLLIN`` polling status until the new
+ up-to-date counter is confirmed back to the driver through a
+ ``write()``.
+
+``ioctl()``:
+ The driver also adds support for tracking count of open file
+ descriptors that haven't acknowledged a generation counter update,
+ as well as a mechanism for userspace to *force* a generation update:
+
+ - SYSGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
+ *outdated* watchers - number of file descriptors that were open
+ during a system generation change, and which have not yet confirmed
+ the new generation counter.
+ - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
+ watchers, or if a ``timeout`` argument is provided, until the
+ timeout expires.
+ If the current caller is *outdated* or a generation change happens
+ while waiting (thus making current caller *outdated*), the ioctl
+ returns ``-EINTR`` to signal the user to handle event and retry.
+ - SYSGENID_FORCE_GEN_UPDATE: forces a generation counter increment.
+ It takes a ``minimum-generation`` argument which represents the
+ minimum value the generation counter will be incremented to. For
+ example if current generation is ``5`` and ``SYSGENID_FORCE_GEN_UPDATE(8)``
+ is called, the generation counter will increment to ``8``.
+ This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
+ or CAP_SYS_ADMIN capabilities.
+
+``mmap()``:
+ The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
+ in size. The first 4 bytes of the mapped page will contain an
+ up-to-date u32 copy of the system generation counter.
+ The mapped memory can be used as a low-latency generation counter
+ probe mechanism in critical sections - see examples.
+
+``close()``:
+ Removes the file descriptor as a system generation counter *watcher*.
+
+Example application workflows
+-----------------------------
+
+1) Watchdog thread simplified example::
+
+ void watchdog_thread_handler(int *thread_active)
+ {
+ unsigned genid;
+ int fd = open("/dev/sysgenid", O_RDWR | O_CLOEXEC, S_IRUSR | S_IWUSR);
+
+ do {
+ // read new gen ID - blocks until VM generation changes
+ read(fd, &genid, sizeof(genid));
+
+ // because of VM generation change, we need to rebuild world
+ reseed_app_env();
+
+ // confirm we're done handling gen ID update
+ write(fd, &genid, sizeof(genid));
+ } while (atomic_read(thread_active));
+
+ close(fd);
+ }
+
+2) ASYNC simplified example::
+
+ void handle_io_on_sysgenfd(int sysgenfd)
+ {
+ unsigned genid;
+
+ // read new gen ID - we need it to confirm we've handled update
+ read(fd, &genid, sizeof(genid));
+
+ // because of VM generation change, we need to rebuild world
+ reseed_app_env();
+
+ // confirm we're done handling the gen ID update
+ write(fd, &genid, sizeof(genid));
+ }
+
+ int main() {
+ int epfd, sysgenfd;
+ struct epoll_event ev;
+
+ epfd = epoll_create(EPOLL_QUEUE_LEN);
+
+ sysgenfd = open("/dev/sysgenid",
+ O_RDWR | O_CLOEXEC | O_NONBLOCK,
+ S_IRUSR | S_IWUSR);
+
+ // register sysgenid for polling
+ ev.events = EPOLLIN;
+ ev.data.fd = sysgenfd;
+ epoll_ctl(epfd, EPOLL_CTL_ADD, sysgenfd, &ev);
+
+ // register other parts of your app for polling
+ // ...
+
+ while (1) {
+ // wait for something to do...
+ int nfds = epoll_wait(epfd, events,
+ MAX_EPOLL_EVENTS_PER_RUN,
+ EPOLL_RUN_TIMEOUT);
+ if (nfds < 0) die("Error in epoll_wait!");
+
+ // for each ready fd
+ for(int i = 0; i < nfds; i++) {
+ int fd = events[i].data.fd;
+
+ if (fd == sysgenfd)
+ handle_io_on_sysgenfd(sysgenfd);
+ else
+ handle_some_other_part_of_the_app(fd);
+ }
+ }
+
+ return 0;
+ }
+
+3) Mapped memory polling simplified example::
+
+ /*
+ * app/library function that provides cached secrets
+ */
+ char * safe_cached_secret(app_data_t *app)
+ {
+ char *secret;
+ volatile unsigned *const genid_ptr = get_sysgenid_mapping(app);
+ again:
+ secret = __cached_secret(app);
+
+ if (unlikely(*genid_ptr != app->cached_genid)) {
+ app->cached_genid = *genid_ptr;
+ barrier();
+
+ // rebuild world then confirm the genid update (thru write)
+ rebuild_caches(app);
+
+ ack_sysgenid_update(app);
+
+ goto again;
+ }
+
+ return secret;
+ }
+
+4) Orchestrator simplified example::
+
+ /*
+ * orchestrator - manages multiple applications and libraries used by
+ * a service and tries to make sure all sensitive components gracefully
+ * handle VM generation changes.
+ * Following function is called on detection of a VM generation change.
+ */
+ int handle_sysgen_update(int sysgen_fd, unsigned new_gen_id)
+ {
+ // pause until all components have handled event
+ pause_service();
+
+ // confirm *this* watcher as up-to-date
+ write(sysgen_fd, &new_gen_id, sizeof(unsigned));
+
+ // wait for all *others* for at most 5 seconds.
+ ioctl(sysgen_fd, VMGENID_WAIT_WATCHERS, 5000);
+
+ // all applications on the system have rebuilt worlds
+ resume_service();
+ }
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index fafa8b0..931d716 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -456,6 +456,22 @@ config PVPANIC
a paravirtualized device provided by QEMU; it lets a virtual machine
(guest) communicate panic events to the host.

+config SYSGENID
+ tristate "System Generation ID driver"
+ default N
+ help
+ This is a System Generation ID driver which provides a system
+ generation counter. The driver exposes FS ops on /dev/sysgenid
+ through which it can provide information and notifications on system
+ generation changes that happen because of VM or container snapshots
+ or cloning.
+ This enables applications and libraries that store or cache
+ sensitive information, to know that they need to regenerate it
+ after process memory has been exposed to potential copying.
+
+ To compile this driver as a module, choose M here: the
+ module will be called sysgenid.
+
config HISI_HIKEY_USB
tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
depends on (OF && GPIOLIB) || COMPILE_TEST
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index d23231e..4b4933d 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -57,3 +57,4 @@ obj-$(CONFIG_HABANA_AI) += habanalabs/
obj-$(CONFIG_UACCE) += uacce/
obj-$(CONFIG_XILINX_SDFEC) += xilinx_sdfec.o
obj-$(CONFIG_HISI_HIKEY_USB) += hisi_hikey_usb.o
+obj-$(CONFIG_SYSGENID) += sysgenid.o
diff --git a/drivers/misc/sysgenid.c b/drivers/misc/sysgenid.c
new file mode 100644
index 0000000..dd64099
--- /dev/null
+++ b/drivers/misc/sysgenid.c
@@ -0,0 +1,298 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * System Generation ID driver
+ *
+ * Copyright (C) 2020 Amazon. All rights reserved.
+ *
+ * Authors:
+ * Adrian Catangiu <[email protected]>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/minmax.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/random.h>
+#include <linux/uuid.h>
+#include <linux/sysgenid.h>
+
+struct sysgenid_data {
+ unsigned long map_buf;
+ wait_queue_head_t read_waitq;
+ atomic_t generation_counter;
+
+ unsigned int watchers;
+ atomic_t outdated_watchers;
+ wait_queue_head_t outdated_waitq;
+ spinlock_t lock;
+};
+static struct sysgenid_data sysgenid_data;
+
+struct file_data {
+ unsigned int acked_gen_counter;
+};
+
+static int equals_gen_counter(unsigned int counter)
+{
+ return counter == atomic_read(&sysgenid_data.generation_counter);
+}
+
+static void _bump_generation(int min_gen)
+{
+ unsigned long flags;
+ int counter;
+
+ spin_lock_irqsave(&sysgenid_data.lock, flags);
+ counter = max(min_gen, 1 + atomic_read(&sysgenid_data.generation_counter));
+ atomic_set(&sysgenid_data.generation_counter, counter);
+ *((int *) sysgenid_data.map_buf) = counter;
+ atomic_set(&sysgenid_data.outdated_watchers, sysgenid_data.watchers);
+
+ wake_up_interruptible(&sysgenid_data.read_waitq);
+ wake_up_interruptible(&sysgenid_data.outdated_waitq);
+ spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+}
+
+void sysgenid_bump_generation(void)
+{
+ _bump_generation(0);
+}
+EXPORT_SYMBOL(sysgenid_bump_generation);
+
+static void put_outdated_watchers(void)
+{
+ if (atomic_dec_and_test(&sysgenid_data.outdated_watchers))
+ wake_up_interruptible(&sysgenid_data.outdated_waitq);
+}
+
+static int sysgenid_open(struct inode *inode, struct file *file)
+{
+ struct file_data *fdata = kzalloc(sizeof(struct file_data), GFP_KERNEL);
+ unsigned long flags;
+
+ if (!fdata)
+ return -ENOMEM;
+
+ spin_lock_irqsave(&sysgenid_data.lock, flags);
+ fdata->acked_gen_counter = atomic_read(&sysgenid_data.generation_counter);
+ ++sysgenid_data.watchers;
+ spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+
+ file->private_data = fdata;
+
+ return 0;
+}
+
+static int sysgenid_close(struct inode *inode, struct file *file)
+{
+ struct file_data *fdata = file->private_data;
+ unsigned long flags;
+
+ spin_lock_irqsave(&sysgenid_data.lock, flags);
+ if (!equals_gen_counter(fdata->acked_gen_counter))
+ put_outdated_watchers();
+ --sysgenid_data.watchers;
+ spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+
+ kfree(fdata);
+
+ return 0;
+}
+
+static ssize_t sysgenid_read(struct file *file, char __user *ubuf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct file_data *fdata = file->private_data;
+ ssize_t ret;
+ int gen_counter;
+
+ if (nbytes == 0)
+ return 0;
+ /* disallow partial reads */
+ if (nbytes < sizeof(gen_counter))
+ return -EINVAL;
+
+ if (equals_gen_counter(fdata->acked_gen_counter)) {
+ if (file->f_flags & O_NONBLOCK)
+ return -EAGAIN;
+ ret = wait_event_interruptible(
+ sysgenid_data.read_waitq,
+ !equals_gen_counter(fdata->acked_gen_counter)
+ );
+ if (ret)
+ return ret;
+ }
+
+ gen_counter = atomic_read(&sysgenid_data.generation_counter);
+ ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
+ if (ret)
+ return -EFAULT;
+
+ return sizeof(gen_counter);
+}
+
+static ssize_t sysgenid_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct file_data *fdata = file->private_data;
+ unsigned int new_acked_gen;
+ unsigned long flags;
+
+ /* disallow partial writes */
+ if (count != sizeof(new_acked_gen))
+ return -EINVAL;
+ if (copy_from_user(&new_acked_gen, ubuf, count))
+ return -EFAULT;
+
+ spin_lock_irqsave(&sysgenid_data.lock, flags);
+ /* wrong gen-counter acknowledged */
+ if (!equals_gen_counter(new_acked_gen)) {
+ spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+ return -EINVAL;
+ }
+ if (!equals_gen_counter(fdata->acked_gen_counter)) {
+ fdata->acked_gen_counter = new_acked_gen;
+ put_outdated_watchers();
+ }
+ spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+
+ return (ssize_t)count;
+}
+
+static __poll_t sysgenid_poll(struct file *file, poll_table *wait)
+{
+ __poll_t mask = 0;
+ struct file_data *fdata = file->private_data;
+
+ if (!equals_gen_counter(fdata->acked_gen_counter))
+ return EPOLLIN | EPOLLRDNORM;
+
+ poll_wait(file, &sysgenid_data.read_waitq, wait);
+
+ if (!equals_gen_counter(fdata->acked_gen_counter))
+ mask = EPOLLIN | EPOLLRDNORM;
+
+ return mask;
+}
+
+static long sysgenid_ioctl(struct file *file,
+ unsigned int cmd, unsigned long arg)
+{
+ struct file_data *fdata = file->private_data;
+ unsigned long timeout_ns, min_gen;
+ ktime_t until;
+ int ret = 0;
+
+ switch (cmd) {
+ case SYSGENID_GET_OUTDATED_WATCHERS:
+ ret = atomic_read(&sysgenid_data.outdated_watchers);
+ break;
+ case SYSGENID_WAIT_WATCHERS:
+ timeout_ns = arg * NSEC_PER_MSEC;
+ until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
+
+ ret = wait_event_interruptible_hrtimeout(
+ sysgenid_data.outdated_waitq,
+ (!atomic_read(&sysgenid_data.outdated_watchers) ||
+ !equals_gen_counter(fdata->acked_gen_counter)),
+ until
+ );
+ if (atomic_read(&sysgenid_data.outdated_watchers))
+ ret = -EINTR;
+ else
+ ret = 0;
+ break;
+ case SYSGENID_FORCE_GEN_UPDATE:
+ if (!checkpoint_restore_ns_capable(current_user_ns()))
+ return -EACCES;
+ min_gen = arg;
+ _bump_generation(min_gen);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static int sysgenid_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct file_data *fdata = file->private_data;
+
+ if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_WRITE) != 0)
+ return -EPERM;
+
+ vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_flags &= ~VM_MAYWRITE;
+ vma->vm_private_data = fdata;
+
+ return vm_insert_page(vma, vma->vm_start,
+ virt_to_page(sysgenid_data.map_buf));
+}
+
+static const struct file_operations fops = {
+ .owner = THIS_MODULE,
+ .mmap = sysgenid_mmap,
+ .open = sysgenid_open,
+ .release = sysgenid_close,
+ .read = sysgenid_read,
+ .write = sysgenid_write,
+ .poll = sysgenid_poll,
+ .unlocked_ioctl = sysgenid_ioctl,
+};
+
+static struct miscdevice sysgenid_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sysgenid",
+ .fops = &fops,
+};
+
+static int __init sysgenid_init(void)
+{
+ int ret;
+
+ sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
+ if (!sysgenid_data.map_buf)
+ return -ENOMEM;
+
+ atomic_set(&sysgenid_data.generation_counter, 0);
+ atomic_set(&sysgenid_data.outdated_watchers, 0);
+ init_waitqueue_head(&sysgenid_data.read_waitq);
+ init_waitqueue_head(&sysgenid_data.outdated_waitq);
+ spin_lock_init(&sysgenid_data.lock);
+
+ ret = misc_register(&sysgenid_misc);
+ if (ret < 0) {
+ pr_err("misc_register() failed for sysgenid\n");
+ goto err;
+ }
+
+ return 0;
+
+err:
+ free_pages(sysgenid_data.map_buf, 0);
+ sysgenid_data.map_buf = 0;
+
+ return ret;
+}
+
+static void __exit sysgenid_exit(void)
+{
+ misc_deregister(&sysgenid_misc);
+ free_pages(sysgenid_data.map_buf, 0);
+ sysgenid_data.map_buf = 0;
+}
+
+module_init(sysgenid_init);
+module_exit(sysgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("System Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
new file mode 100644
index 0000000..ea38fd3
--- /dev/null
+++ b/include/uapi/linux/sysgenid.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+
+#ifndef _UAPI_LINUX_SYSGENID_H
+#define _UAPI_LINUX_SYSGENID_H
+
+#include <linux/ioctl.h>
+
+#define SYSGENID_IOCTL 0x2d
+#define SYSGENID_GET_OUTDATED_WATCHERS _IO(SYSGENID_IOCTL, 1)
+#define SYSGENID_WAIT_WATCHERS _IO(SYSGENID_IOCTL, 2)
+#define SYSGENID_FORCE_GEN_UPDATE _IO(SYSGENID_IOCTL, 3)
+
+#ifdef __KERNEL__
+void sysgenid_bump_generation(void);
+#endif /* __KERNEL__ */
+
+#endif /* _UAPI_LINUX_SYSGENID_H */
+
--
2.7.4




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-01-12 13:09:37

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] System Generation ID driver and VMGENID backend

On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote:
> This feature is aimed at virtualized or containerized environments
> where VM or container snapshotting duplicates memory state, which is a
> challenge for applications that want to generate unique data such as
> request IDs, UUIDs, and cryptographic nonces.
>
> The patch set introduces a mechanism that provides a userspace
> interface for applications and libraries to be made aware of uniqueness
> breaking events such as VM or container snapshotting, and allow them to
> react and adapt to such events.
>
> Solving the uniqueness problem strongly enough for cryptographic
> purposes requires a mechanism which can deterministically reseed
> userspace PRNGs with new entropy at restore time. This mechanism must
> also support the high-throughput and low-latency use-cases that led
> programmers to pick a userspace PRNG in the first place; be usable by
> both application code and libraries; allow transparent retrofitting
> behind existing popular PRNG interfaces without changing application
> code; it must be efficient, especially on snapshot restore; and be
> simple enough for wide adoption.
>
> The first patch in the set implements a device driver which exposes a
> read-only device /dev/sysgenid to userspace, which contains a
> monotonically increasing u32 generation counter. Libraries and
> applications are expected to open() the device, and then call read()
> which blocks until the SysGenId changes. Following an update, read()
> calls no longer block until the application acknowledges the new
> SysGenId by write()ing it back to the device. Non-blocking read() calls
> return EAGAIN when there is no new SysGenId available. Alternatively,
> libraries can mmap() the device to get a single shared page which
> contains the latest SysGenId at offset 0.

Looking at some specifications, the gen ID might actually be located
at an arbitrary address. How about instead of hard-coding the offset,
we expose it e.g. in sysfs?


> SysGenId also supports a notification mechanism exposed as two IOCTLs
> on the device. SYSGENID_GET_OUTDATED_WATCHERS immediately returns the
> number of file descriptors to the device that were open during the last
> SysGenId change but have not yet acknowledged the new id.
> SYSGENID_WAIT_WATCHERS blocks until there are no open file handles on
> the device which haven’t acknowledged the new id. These two interfaces
> are intended for serverless and container control planes, which want to
> confirm that all application code has detected and reacted to the new
> SysGenId before sending an invoke to the newly-restored sandbox.
>
> The second patch in the set adds a VmGenId driver which makes use of
> the ACPI vmgenid device to drive SysGenId and to reseed kernel entropy
> on VM snapshots.
>
> ---
>
> v3 -> v4:
>
> - split functionality in two separate kernel modules:
> 1. drivers/misc/sysgenid.c which provides the generic userspace
> interface and mechanisms
> 2. drivers/virt/vmgenid.c as VMGENID acpi device driver that seeds
> kernel entropy and acts as a driving backend for the generic
> sysgenid
> - renamed /dev/vmgenid -> /dev/sysgenid
> - renamed uapi header file vmgenid.h -> sysgenid.h
> - renamed ioctls VMGENID_* -> SYSGENID_*
> - added ‘min_gen’ parameter to SYSGENID_FORCE_GEN_UPDATE ioctl
> - fixed races in documentation examples
> - various style nits
> - rebased on top of linus latest
>
> v2 -> v3:
>
> - separate the core driver logic and interface, from the ACPI device.
> The ACPI vmgenid device is now one possible backend.
> - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
> - add locking to avoid races between fs ops handlers and hw irq
> driven generation updates
> - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
> outdated or a generation change happens while waiting (thus making
> current caller outdated), the ioctl returns -EINTR to signal the
> user to handle event and retry. Fixes blocking on oneself.
> - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
> CAP_CHECKPOINT_RESTORE capability, through which software can force
> generation bump.
>
> v1 -> v2:
>
> - expose to userspace a monotonically increasing u32 Vm Gen Counter
> instead of the hw VmGen UUID
> - since the hw/hypervisor-provided 128-bit UUID is not public
> anymore, add it to the kernel RNG as device randomness
> - insert driver page containing Vm Gen Counter in the user vma in
> the driver's mmap handler instead of using a fault handler
> - turn driver into a misc device driver to auto-create /dev/vmgenid
> - change ioctl arg to avoid leaking kernel structs to userspace
> - update documentation
> - various nits
> - rebase on top of linus latest
>
> Adrian Catangiu (2):
> drivers/misc: sysgenid: add system generation id driver
> drivers/virt: vmgenid: add vm generation id driver
>
> Documentation/misc-devices/sysgenid.rst | 240 +++++++++++++++++++++++++
> Documentation/virt/vmgenid.rst | 34 ++++
> drivers/misc/Kconfig | 16 ++
> drivers/misc/Makefile | 1 +
> drivers/misc/sysgenid.c | 298 ++++++++++++++++++++++++++++++++
> drivers/virt/Kconfig | 14 ++
> drivers/virt/Makefile | 1 +
> drivers/virt/vmgenid.c | 153 ++++++++++++++++
> include/uapi/linux/sysgenid.h | 18 ++
> 9 files changed, 775 insertions(+)
> create mode 100644 Documentation/misc-devices/sysgenid.rst
> create mode 100644 Documentation/virt/vmgenid.rst
> create mode 100644 drivers/misc/sysgenid.c
> create mode 100644 drivers/virt/vmgenid.c
> create mode 100644 include/uapi/linux/sysgenid.h
>
> --
> 2.7.4
>
>
>
>
> Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-01-12 13:13:43

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] drivers/misc: sysgenid: add system generation id driver

On Tue, Jan 12, 2021 at 02:15:59PM +0200, Adrian Catangiu wrote:
> +``read()``:
> + Read is meant to provide the *new* system generation counter when a
> + generation change takes place. The read operation blocks until the
> + associated counter is no longer up to date, at which point the new
> + counter is provided/returned.
> + Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
> + *new* counter value available. The generation counter is considered
> + *new* for each open file descriptor that hasn't confirmed the new
> + value following a generation change. Therefore, once a generation
> + change takes place, all ``read()`` calls will immediately return the
> + new generation counter and will continue to do so until the
> + new value is confirmed back to the driver through ``write()``.
> + Partial reads are not allowed - read buffer needs to be at least
> + ``sizeof(unsigned)`` in size.

"sizeof(unsigned)"? How about being specific and making this a real "X
bits big" value please.

"unsigned" does not work well across user/kernel boundries. Ok, that's
on understatement, the correct thing is "does not work at all".

Please be specific in your apis.

This is listed elsewhere also.

> +``write()``:
> + Write is used to confirm the up-to-date Sys Gen counter back to the
> + driver.
> + Following a VM generation change, all existing watchers are marked
> + as *outdated*. Each file descriptor will maintain the *outdated*
> + status until a ``write()`` confirms the up-to-date counter back to
> + the driver.
> + Partial writes are not allowed - write buffer should be exactly
> + ``sizeof(unsigned)`` in size.
> +
> +``poll()``:
> + Poll is implemented to allow polling for generation counter updates.
> + Such updates result in ``EPOLLIN`` polling status until the new
> + up-to-date counter is confirmed back to the driver through a
> + ``write()``.
> +
> +``ioctl()``:
> + The driver also adds support for tracking count of open file
> + descriptors that haven't acknowledged a generation counter update,
> + as well as a mechanism for userspace to *force* a generation update:
> +
> + - SYSGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
> + *outdated* watchers - number of file descriptors that were open
> + during a system generation change, and which have not yet confirmed
> + the new generation counter.

But this number can instantly change after it is read, what good is it?
It should never be relied on, so why is this needed at all?

What can userspace do with this information?

thanks,

greg k-h

2021-01-13 01:35:49

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: [PATCH v4 2/2] drivers/virt: vmgenid: add vm generation id driver

- Background

The VM Generation ID is a feature defined by Microsoft (paper:
http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
multiple hypervisor vendors.

The feature can be used to drive the `sysgenid` mechanism required in
virtualized environments by software that works with local copies and
caches of world-unique data such as random values, uuids, monotonically
increasing counters, etc.

- Solution

The VM Generation ID is a hypervisor/hardware provided 128-bit unique
ID that changes each time the VM is restored from a snapshot. It can be
used to differentiate between VMs or different generations of the same
VM.
This VM Generation ID is exposed through an ACPI device by multiple
hypervisor vendors.

The `vmgenid` driver uses ACPI events to be notified by hardware
changes to the 128-bit Vm Gen Id HW UUID. The UUID is not exposed to
userspace, it is added by the driver as device randomness to improve
kernel entropy following VM snapshot events.

This driver also acts as a backend for the `sysgenid` kernel module
(`drivers/misc/sysgenid.c`, `Documentation/misc-devices/sysgenid.rst`)
to drive changes to the "System Generation Id" which is further exposed
to userspace as a system-wide monotonically increasing counter.

This patch builds on top of Or Idgar <[email protected]>'s proposal
https://lkml.org/lkml/2018/3/1/498

- Future improvements

Ideally we would want the driver to register itself based on devices'
_CID and not _HID, but unfortunately I couldn't find a way to do that.
The problem is that ACPI device matching is done by
'__acpi_match_device()' which exclusively looks at
'acpi_hardware_id *hwid'.

There is a path for platform devices to match on _CID when _HID is
'PRP0001' - but this is not the case for the Qemu vmgenid device.

Guidance and help here would be greatly appreciated.

Signed-off-by: Adrian Catangiu <[email protected]>
---
Documentation/virt/vmgenid.rst | 34 +++++++++
drivers/virt/Kconfig | 14 ++++
drivers/virt/Makefile | 1 +
drivers/virt/vmgenid.c | 153 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 202 insertions(+)
create mode 100644 Documentation/virt/vmgenid.rst
create mode 100644 drivers/virt/vmgenid.c

diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
new file mode 100644
index 0000000..2106354
--- /dev/null
+++ b/Documentation/virt/vmgenid.rst
@@ -0,0 +1,34 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+VMGENID
+=======
+
+The VM Generation ID is a feature defined by Microsoft (paper:
+http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
+multiple hypervisor vendors.
+
+The feature is required in virtualized environments by applications
+that work with local copies/caches of world-unique data such as random
+values, UUIDs, monotonically increasing counters, etc.
+Such applications can be negatively affected by VM snapshotting when
+the VM is either cloned or returned to an earlier point in time.
+
+The VM Generation ID is a simple concept meant to alleviate the issue
+by providing a unique ID that changes each time the VM is restored
+from a snapshot. The hardware provided UUID value can be used to
+differentiate between VMs or different generations of the same VM.
+
+The VM Generation ID is exposed through an ACPI device by multiple
+hypervisor vendors. The driver for it lives at
+``drivers/virt/vmgenid.c``
+
+The ``vmgenid`` driver uses ACPI events to be notified by hardware
+changes to the 128-bit Vm Gen Id UUID. This UUID is not exposed to
+userspace, it is added by the driver as device randomness to improve
+kernel entropy following VM snapshot events.
+
+This driver also acts as a backend for the ``sysgenid`` kernel module
+(``drivers/misc/sysgenid.c``, ``Documentation/misc-devices/sysgenid.rst``)
+to drive changes to the "System Generation Id" which is further exposed
+to userspace as a monotonically increasing counter.
diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
index 80c5f9c1..4771633 100644
--- a/drivers/virt/Kconfig
+++ b/drivers/virt/Kconfig
@@ -13,6 +13,20 @@ menuconfig VIRT_DRIVERS

if VIRT_DRIVERS

+config VMGENID
+ tristate "Virtual Machine Generation ID driver"
+ depends on ACPI && SYSGENID
+ default N
+ help
+ The driver uses the hypervisor provided Virtual Machine Generation ID
+ to drive the system generation counter mechanism exposed by sysgenid.
+ The vmgenid changes on VM snapshots or VM cloning. The hypervisor
+ provided 128-bit vmgenid is also used as device randomness to improve
+ kernel entropy following VM snapshot events.
+
+ To compile this driver as a module, choose M here: the
+ module will be called vmgenid.
+
config FSL_HV_MANAGER
tristate "Freescale hypervisor management driver"
depends on FSL_SOC
diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index f28425c..889be01 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -4,6 +4,7 @@
#

obj-$(CONFIG_FSL_HV_MANAGER) += fsl_hypervisor.o
+obj-$(CONFIG_VMGENID) += vmgenid.o
obj-y += vboxguest/

obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves/
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
new file mode 100644
index 0000000..d9d089a
--- /dev/null
+++ b/drivers/virt/vmgenid.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtual Machine Generation ID driver
+ *
+ * Copyright (C) 2018 Red Hat Inc. All rights reserved.
+ *
+ * Copyright (C) 2020 Amazon. All rights reserved.
+ *
+ * Authors:
+ * Adrian Catangiu <[email protected]>
+ * Or Idgar <[email protected]>
+ * Gal Hammer <[email protected]>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/random.h>
+#include <linux/uuid.h>
+#include <linux/sysgenid.h>
+
+#define DEV_NAME "vmgenid"
+ACPI_MODULE_NAME(DEV_NAME);
+
+struct vmgenid_data {
+ uuid_t uuid;
+ void *uuid_iomap;
+};
+static struct vmgenid_data vmgenid_data;
+
+static int vmgenid_acpi_map(struct vmgenid_data *priv, acpi_handle handle)
+{
+ int i;
+ phys_addr_t phys_addr;
+ struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
+ acpi_status status;
+ union acpi_object *pss;
+ union acpi_object *element;
+
+ status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer);
+ if (ACPI_FAILURE(status)) {
+ ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+ return -ENODEV;
+ }
+ pss = buffer.pointer;
+ if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2)
+ return -EINVAL;
+
+ phys_addr = 0;
+ for (i = 0; i < pss->package.count; i++) {
+ element = &(pss->package.elements[i]);
+ if (element->type != ACPI_TYPE_INTEGER)
+ return -EINVAL;
+ phys_addr |= element->integer.value << i * 32;
+ }
+
+ priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t));
+ if (!priv->uuid_iomap) {
+ pr_err("Could not map memory at 0x%llx, size %u\n",
+ phys_addr,
+ (u32) sizeof(uuid_t));
+ return -ENOMEM;
+ }
+
+ memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+
+ return 0;
+}
+
+static int vmgenid_acpi_add(struct acpi_device *device)
+{
+ int ret;
+
+ if (!device)
+ return -EINVAL;
+ device->driver_data = &vmgenid_data;
+
+ ret = vmgenid_acpi_map(device->driver_data, device->handle);
+ if (ret < 0) {
+ pr_err("vmgenid: failed to map acpi device\n");
+ device->driver_data = NULL;
+ }
+
+ return ret;
+}
+
+static int vmgenid_acpi_remove(struct acpi_device *device)
+{
+ if (!device || acpi_driver_data(device) != &vmgenid_data)
+ return -EINVAL;
+ device->driver_data = NULL;
+
+ if (vmgenid_data.uuid_iomap)
+ acpi_os_unmap_memory(vmgenid_data.uuid_iomap, sizeof(uuid_t));
+ vmgenid_data.uuid_iomap = NULL;
+
+ return 0;
+}
+
+static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
+{
+ uuid_t old_uuid;
+
+ if (!device || acpi_driver_data(device) != &vmgenid_data) {
+ pr_err("VMGENID notify with unexpected driver private data\n");
+ return;
+ }
+
+ /* update VM Generation UUID */
+ old_uuid = vmgenid_data.uuid;
+ memcpy_fromio(&vmgenid_data.uuid, vmgenid_data.uuid_iomap, sizeof(uuid_t));
+
+ if (memcmp(&old_uuid, &vmgenid_data.uuid, sizeof(uuid_t))) {
+ /* HW uuid updated */
+ sysgenid_bump_generation();
+ add_device_randomness(&vmgenid_data.uuid, sizeof(uuid_t));
+ }
+}
+
+static const struct acpi_device_id vmgenid_ids[] = {
+ {"VMGENID", 0},
+ {"QEMUVGID", 0},
+ {"", 0},
+};
+
+static struct acpi_driver acpi_vmgenid_driver = {
+ .name = "vm_generation_id",
+ .ids = vmgenid_ids,
+ .owner = THIS_MODULE,
+ .ops = {
+ .add = vmgenid_acpi_add,
+ .remove = vmgenid_acpi_remove,
+ .notify = vmgenid_acpi_notify,
+ }
+};
+
+static int __init vmgenid_init(void)
+{
+ return acpi_bus_register_driver(&acpi_vmgenid_driver);
+}
+
+static void __exit vmgenid_exit(void)
+{
+ acpi_bus_unregister_driver(&acpi_vmgenid_driver);
+}
+
+module_init(vmgenid_init);
+module_exit(vmgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("Virtual Machine Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
--
2.7.4




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-01-13 01:43:03

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] drivers/misc: sysgenid: add system generation id driver

On Tue, Jan 12, 2021 at 02:15:59PM +0200, Adrian Catangiu wrote:
> - Background and problem
>
> The System Generation ID feature is required in virtualized or
> containerized environments by applications that work with local copies
> or caches of world-unique data such as random values, uuids,
> monotonically increasing counters, etc.
> Such applications can be negatively affected by VM or container
> snapshotting when the VM or container is either cloned or returned to
> an earlier point in time.
>
> Furthermore, simply finding out about a system generation change is
> only the starting point of a process to renew internal states of
> possibly multiple applications across the system. This process could
> benefit from a driver that provides an interface through which
> orchestration can be easily done.
>
> - Solution
>
> The System Generation ID is a simple concept meant to alleviate the
> issue by providing a monotonically increasing u32 counter that changes
> each time the VM or container is restored from a snapshot.
>
> The `sysgenid` driver exposes a monotonic incremental System Generation
> u32 counter via a char-dev FS interface that provides sync and async
> SysGen counter updates notifications. It also provides SysGen counter
> retrieval and confirmation mechanisms.
>
> The counter starts from zero when the driver is initialized and
> monotonically increments every time the system generation changes.
>
> The `sysgenid` driver exports the `void sysgenid_bump_generation()`
> symbol which can be used by backend drivers to drive system generation
> changes based on hardware events.
> System generation changes can also be driven by userspace software
> through a dedicated driver ioctl.
>
> Userspace applications or libraries can then (a)synchronously consume
> the system generation counter through the provided FS interface to make
> any necessary internal adjustments following a system generation
> update.
>
> Signed-off-by: Adrian Catangiu <[email protected]>
> ---
> Documentation/misc-devices/sysgenid.rst | 240 +++++++++++++++++++++++++
> drivers/misc/Kconfig | 16 ++
> drivers/misc/Makefile | 1 +
> drivers/misc/sysgenid.c | 298 ++++++++++++++++++++++++++++++++
> include/uapi/linux/sysgenid.h | 18 ++
> 5 files changed, 573 insertions(+)
> create mode 100644 Documentation/misc-devices/sysgenid.rst
> create mode 100644 drivers/misc/sysgenid.c
> create mode 100644 include/uapi/linux/sysgenid.h
>
> diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
> new file mode 100644
> index 0000000..0b31ccf
> --- /dev/null
> +++ b/Documentation/misc-devices/sysgenid.rst
> @@ -0,0 +1,240 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========
> +SYSGENID
> +========
> +
> +The System Generation ID feature is required in virtualized or
> +containerized environments by applications that work with local copies
> +or caches of world-unique data such as random values, UUIDs,
> +monotonically increasing counters, etc.
> +Such applications can be negatively affected by VM or container
> +snapshotting when the VM or container is either cloned or returned to
> +an earlier point in time.
> +
> +The System Generation ID is a simple concept meant to alleviate the
> +issue by providing a monotonically increasing counter that changes
> +each time the VM or container is restored from a snapshot.
> +The driver for it lives at ``drivers/misc/sysgenid.c``.
> +
> +The ``sysgenid`` driver exposes a monotonic incremental System
> +Generation u32 counter via a char-dev FS interface accessible through
> +``/dev/sysgenid`` that provides sync and async SysGen counter updates
> +notifications. It also provides SysGen counter retrieval and
> +confirmation mechanisms.
> +
> +The counter starts from zero when the driver is initialized and
> +monotonically increments every time the system generation changes.
> +
> +The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
> +symbol which can be used by backend drivers to drive system generation
> +changes based on hardware events.
> +System generation changes can also be driven by userspace software
> +through a dedicated driver ioctl.
> +
> +Userspace applications or libraries can (a)synchronously consume the
> +system generation counter through the provided FS interface, to make
> +any necessary internal adjustments following a system generation update.
> +
> +Driver FS interface:
> +
> +``open()``:
> + When the device is opened, a copy of the current Sys-Gen-Id (counter)
> + is associated with the open file descriptor. The driver now tracks
> + this file as an independent *watcher*. The driver tracks how many
> + watchers are aware of the latest Sys-Gen-Id counter and how many of
> + them are *outdated*; outdated being those that have lived through
> + a Sys-Gen-Id change but not yet confirmed the new generation counter.
> +
> +``read()``:
> + Read is meant to provide the *new* system generation counter when a
> + generation change takes place. The read operation blocks until the
> + associated counter is no longer up to date, at which point the new
> + counter is provided/returned.
> + Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
> + *new* counter value available. The generation counter is considered
> + *new* for each open file descriptor that hasn't confirmed the new
> + value following a generation change. Therefore, once a generation
> + change takes place, all ``read()`` calls will immediately return the
> + new generation counter and will continue to do so until the
> + new value is confirmed back to the driver through ``write()``.
> + Partial reads are not allowed - read buffer needs to be at least
> + ``sizeof(unsigned)`` in size.
> +
> +``write()``:
> + Write is used to confirm the up-to-date Sys Gen counter back to the
> + driver.
> + Following a VM generation change, all existing watchers are marked
> + as *outdated*. Each file descriptor will maintain the *outdated*
> + status until a ``write()`` confirms the up-to-date counter back to
> + the driver.
> + Partial writes are not allowed - write buffer should be exactly
> + ``sizeof(unsigned)`` in size.
> +
> +``poll()``:
> + Poll is implemented to allow polling for generation counter updates.
> + Such updates result in ``EPOLLIN`` polling status until the new
> + up-to-date counter is confirmed back to the driver through a
> + ``write()``.
> +
> +``ioctl()``:
> + The driver also adds support for tracking count of open file
> + descriptors that haven't acknowledged a generation counter update,
> + as well as a mechanism for userspace to *force* a generation update:
> +
> + - SYSGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
> + *outdated* watchers - number of file descriptors that were open
> + during a system generation change, and which have not yet confirmed
> + the new generation counter.
> + - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
> + watchers, or if a ``timeout`` argument is provided, until the
> + timeout expires.
> + If the current caller is *outdated* or a generation change happens
> + while waiting (thus making current caller *outdated*), the ioctl
> + returns ``-EINTR`` to signal the user to handle event and retry.
> + - SYSGENID_FORCE_GEN_UPDATE: forces a generation counter increment.
> + It takes a ``minimum-generation`` argument which represents the
> + minimum value the generation counter will be incremented to. For
> + example if current generation is ``5`` and ``SYSGENID_FORCE_GEN_UPDATE(8)``
> + is called, the generation counter will increment to ``8``.
> + This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
> + or CAP_SYS_ADMIN capabilities.
> +
> +``mmap()``:
> + The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
> + in size. The first 4 bytes of the mapped page will contain an
> + up-to-date u32 copy of the system generation counter.
> + The mapped memory can be used as a low-latency generation counter
> + probe mechanism in critical sections - see examples.
> +
> +``close()``:
> + Removes the file descriptor as a system generation counter *watcher*.
> +
> +Example application workflows
> +-----------------------------
> +
> +1) Watchdog thread simplified example::
> +
> + void watchdog_thread_handler(int *thread_active)
> + {
> + unsigned genid;
> + int fd = open("/dev/sysgenid", O_RDWR | O_CLOEXEC, S_IRUSR | S_IWUSR);
> +
> + do {
> + // read new gen ID - blocks until VM generation changes
> + read(fd, &genid, sizeof(genid));
> +
> + // because of VM generation change, we need to rebuild world
> + reseed_app_env();
> +
> + // confirm we're done handling gen ID update
> + write(fd, &genid, sizeof(genid));
> + } while (atomic_read(thread_active));
> +
> + close(fd);
> + }
> +
> +2) ASYNC simplified example::
> +
> + void handle_io_on_sysgenfd(int sysgenfd)
> + {
> + unsigned genid;
> +
> + // read new gen ID - we need it to confirm we've handled update
> + read(fd, &genid, sizeof(genid));
> +
> + // because of VM generation change, we need to rebuild world
> + reseed_app_env();
> +
> + // confirm we're done handling the gen ID update
> + write(fd, &genid, sizeof(genid));
> + }
> +
> + int main() {
> + int epfd, sysgenfd;
> + struct epoll_event ev;
> +
> + epfd = epoll_create(EPOLL_QUEUE_LEN);
> +
> + sysgenfd = open("/dev/sysgenid",
> + O_RDWR | O_CLOEXEC | O_NONBLOCK,
> + S_IRUSR | S_IWUSR);
> +
> + // register sysgenid for polling
> + ev.events = EPOLLIN;
> + ev.data.fd = sysgenfd;
> + epoll_ctl(epfd, EPOLL_CTL_ADD, sysgenfd, &ev);
> +
> + // register other parts of your app for polling
> + // ...
> +
> + while (1) {
> + // wait for something to do...
> + int nfds = epoll_wait(epfd, events,
> + MAX_EPOLL_EVENTS_PER_RUN,
> + EPOLL_RUN_TIMEOUT);
> + if (nfds < 0) die("Error in epoll_wait!");
> +
> + // for each ready fd
> + for(int i = 0; i < nfds; i++) {
> + int fd = events[i].data.fd;
> +
> + if (fd == sysgenfd)
> + handle_io_on_sysgenfd(sysgenfd);
> + else
> + handle_some_other_part_of_the_app(fd);
> + }
> + }
> +
> + return 0;
> + }
> +
> +3) Mapped memory polling simplified example::
> +
> + /*
> + * app/library function that provides cached secrets
> + */
> + char * safe_cached_secret(app_data_t *app)
> + {
> + char *secret;
> + volatile unsigned *const genid_ptr = get_sysgenid_mapping(app);
> + again:
> + secret = __cached_secret(app);
> +
> + if (unlikely(*genid_ptr != app->cached_genid)) {
> + app->cached_genid = *genid_ptr;
> + barrier();
> +
> + // rebuild world then confirm the genid update (thru write)
> + rebuild_caches(app);
> +
> + ack_sysgenid_update(app);
> +
> + goto again;
> + }
> +
> + return secret;



Hmm. This seems to rely on the assumption that if you have
read the ID and it did not change, then all is well.

Problem is, in the interrupt driven implementation
this is not a safe assumption to make: ID
from hypervisor might have changed but interrupt
could be delayed.


If we map the hypervisor ID to userspace then we won't
have this race ... worth worry about? why not?



> + }
> +
> +4) Orchestrator simplified example::
> +
> + /*
> + * orchestrator - manages multiple applications and libraries used by
> + * a service and tries to make sure all sensitive components gracefully
> + * handle VM generation changes.
> + * Following function is called on detection of a VM generation change.
> + */
> + int handle_sysgen_update(int sysgen_fd, unsigned new_gen_id)
> + {
> + // pause until all components have handled event
> + pause_service();
> +
> + // confirm *this* watcher as up-to-date
> + write(sysgen_fd, &new_gen_id, sizeof(unsigned));
> +
> + // wait for all *others* for at most 5 seconds.
> + ioctl(sysgen_fd, VMGENID_WAIT_WATCHERS, 5000);
> +
> + // all applications on the system have rebuilt worlds
> + resume_service();
> + }
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index fafa8b0..931d716 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -456,6 +456,22 @@ config PVPANIC
> a paravirtualized device provided by QEMU; it lets a virtual machine
> (guest) communicate panic events to the host.
>
> +config SYSGENID
> + tristate "System Generation ID driver"
> + default N
> + help
> + This is a System Generation ID driver which provides a system
> + generation counter. The driver exposes FS ops on /dev/sysgenid
> + through which it can provide information and notifications on system
> + generation changes that happen because of VM or container snapshots
> + or cloning.
> + This enables applications and libraries that store or cache
> + sensitive information, to know that they need to regenerate it
> + after process memory has been exposed to potential copying.
> +
> + To compile this driver as a module, choose M here: the
> + module will be called sysgenid.
> +
> config HISI_HIKEY_USB
> tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
> depends on (OF && GPIOLIB) || COMPILE_TEST
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index d23231e..4b4933d 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -57,3 +57,4 @@ obj-$(CONFIG_HABANA_AI) += habanalabs/
> obj-$(CONFIG_UACCE) += uacce/
> obj-$(CONFIG_XILINX_SDFEC) += xilinx_sdfec.o
> obj-$(CONFIG_HISI_HIKEY_USB) += hisi_hikey_usb.o
> +obj-$(CONFIG_SYSGENID) += sysgenid.o
> diff --git a/drivers/misc/sysgenid.c b/drivers/misc/sysgenid.c
> new file mode 100644
> index 0000000..dd64099
> --- /dev/null
> +++ b/drivers/misc/sysgenid.c
> @@ -0,0 +1,298 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * System Generation ID driver
> + *
> + * Copyright (C) 2020 Amazon. All rights reserved.
> + *
> + * Authors:
> + * Adrian Catangiu <[email protected]>
> + *
> + */
> +#include <linux/acpi.h>
> +#include <linux/kernel.h>
> +#include <linux/minmax.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/random.h>
> +#include <linux/uuid.h>
> +#include <linux/sysgenid.h>
> +
> +struct sysgenid_data {
> + unsigned long map_buf;
> + wait_queue_head_t read_waitq;
> + atomic_t generation_counter;
> +
> + unsigned int watchers;
> + atomic_t outdated_watchers;
> + wait_queue_head_t outdated_waitq;
> + spinlock_t lock;
> +};
> +static struct sysgenid_data sysgenid_data;
> +
> +struct file_data {
> + unsigned int acked_gen_counter;
> +};
> +
> +static int equals_gen_counter(unsigned int counter)
> +{
> + return counter == atomic_read(&sysgenid_data.generation_counter);
> +}
> +
> +static void _bump_generation(int min_gen)
> +{
> + unsigned long flags;
> + int counter;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + counter = max(min_gen, 1 + atomic_read(&sysgenid_data.generation_counter));
> + atomic_set(&sysgenid_data.generation_counter, counter);
> + *((int *) sysgenid_data.map_buf) = counter;
> + atomic_set(&sysgenid_data.outdated_watchers, sysgenid_data.watchers);
> +
> + wake_up_interruptible(&sysgenid_data.read_waitq);
> + wake_up_interruptible(&sysgenid_data.outdated_waitq);
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +}
> +
> +void sysgenid_bump_generation(void)
> +{
> + _bump_generation(0);
> +}
> +EXPORT_SYMBOL(sysgenid_bump_generation);
> +
> +static void put_outdated_watchers(void)
> +{
> + if (atomic_dec_and_test(&sysgenid_data.outdated_watchers))
> + wake_up_interruptible(&sysgenid_data.outdated_waitq);
> +}
> +
> +static int sysgenid_open(struct inode *inode, struct file *file)
> +{
> + struct file_data *fdata = kzalloc(sizeof(struct file_data), GFP_KERNEL);
> + unsigned long flags;
> +
> + if (!fdata)
> + return -ENOMEM;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + fdata->acked_gen_counter = atomic_read(&sysgenid_data.generation_counter);
> + ++sysgenid_data.watchers;
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + file->private_data = fdata;
> +
> + return 0;
> +}
> +
> +static int sysgenid_close(struct inode *inode, struct file *file)
> +{
> + struct file_data *fdata = file->private_data;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + if (!equals_gen_counter(fdata->acked_gen_counter))
> + put_outdated_watchers();
> + --sysgenid_data.watchers;
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + kfree(fdata);
> +
> + return 0;
> +}
> +
> +static ssize_t sysgenid_read(struct file *file, char __user *ubuf,
> + size_t nbytes, loff_t *ppos)
> +{
> + struct file_data *fdata = file->private_data;
> + ssize_t ret;
> + int gen_counter;
> +
> + if (nbytes == 0)
> + return 0;
> + /* disallow partial reads */
> + if (nbytes < sizeof(gen_counter))
> + return -EINVAL;
> +
> + if (equals_gen_counter(fdata->acked_gen_counter)) {
> + if (file->f_flags & O_NONBLOCK)
> + return -EAGAIN;
> + ret = wait_event_interruptible(
> + sysgenid_data.read_waitq,
> + !equals_gen_counter(fdata->acked_gen_counter)
> + );
> + if (ret)
> + return ret;
> + }
> +
> + gen_counter = atomic_read(&sysgenid_data.generation_counter);
> + ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
> + if (ret)
> + return -EFAULT;
> +
> + return sizeof(gen_counter);
> +}
> +
> +static ssize_t sysgenid_write(struct file *file, const char __user *ubuf,
> + size_t count, loff_t *ppos)
> +{
> + struct file_data *fdata = file->private_data;
> + unsigned int new_acked_gen;
> + unsigned long flags;
> +
> + /* disallow partial writes */
> + if (count != sizeof(new_acked_gen))
> + return -EINVAL;
> + if (copy_from_user(&new_acked_gen, ubuf, count))
> + return -EFAULT;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + /* wrong gen-counter acknowledged */
> + if (!equals_gen_counter(new_acked_gen)) {
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> + return -EINVAL;
> + }
> + if (!equals_gen_counter(fdata->acked_gen_counter)) {
> + fdata->acked_gen_counter = new_acked_gen;
> + put_outdated_watchers();
> + }
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + return (ssize_t)count;
> +}
> +
> +static __poll_t sysgenid_poll(struct file *file, poll_table *wait)
> +{
> + __poll_t mask = 0;
> + struct file_data *fdata = file->private_data;
> +
> + if (!equals_gen_counter(fdata->acked_gen_counter))
> + return EPOLLIN | EPOLLRDNORM;
> +
> + poll_wait(file, &sysgenid_data.read_waitq, wait);
> +
> + if (!equals_gen_counter(fdata->acked_gen_counter))
> + mask = EPOLLIN | EPOLLRDNORM;
> +
> + return mask;
> +}
> +
> +static long sysgenid_ioctl(struct file *file,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct file_data *fdata = file->private_data;
> + unsigned long timeout_ns, min_gen;
> + ktime_t until;
> + int ret = 0;
> +
> + switch (cmd) {
> + case SYSGENID_GET_OUTDATED_WATCHERS:
> + ret = atomic_read(&sysgenid_data.outdated_watchers);
> + break;
> + case SYSGENID_WAIT_WATCHERS:
> + timeout_ns = arg * NSEC_PER_MSEC;
> + until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
> +
> + ret = wait_event_interruptible_hrtimeout(
> + sysgenid_data.outdated_waitq,
> + (!atomic_read(&sysgenid_data.outdated_watchers) ||
> + !equals_gen_counter(fdata->acked_gen_counter)),
> + until
> + );
> + if (atomic_read(&sysgenid_data.outdated_watchers))
> + ret = -EINTR;
> + else
> + ret = 0;
> + break;
> + case SYSGENID_FORCE_GEN_UPDATE:
> + if (!checkpoint_restore_ns_capable(current_user_ns()))
> + return -EACCES;
> + min_gen = arg;
> + _bump_generation(min_gen);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static int sysgenid_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct file_data *fdata = file->private_data;
> +
> + if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
> + return -EINVAL;
> +
> + if ((vma->vm_flags & VM_WRITE) != 0)
> + return -EPERM;
> +
> + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> + vma->vm_flags &= ~VM_MAYWRITE;
> + vma->vm_private_data = fdata;
> +
> + return vm_insert_page(vma, vma->vm_start,
> + virt_to_page(sysgenid_data.map_buf));
> +}
> +
> +static const struct file_operations fops = {
> + .owner = THIS_MODULE,
> + .mmap = sysgenid_mmap,
> + .open = sysgenid_open,
> + .release = sysgenid_close,
> + .read = sysgenid_read,
> + .write = sysgenid_write,
> + .poll = sysgenid_poll,
> + .unlocked_ioctl = sysgenid_ioctl,
> +};
> +
> +static struct miscdevice sysgenid_misc = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "sysgenid",
> + .fops = &fops,
> +};
> +
> +static int __init sysgenid_init(void)
> +{
> + int ret;
> +
> + sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
> + if (!sysgenid_data.map_buf)
> + return -ENOMEM;
> +
> + atomic_set(&sysgenid_data.generation_counter, 0);
> + atomic_set(&sysgenid_data.outdated_watchers, 0);
> + init_waitqueue_head(&sysgenid_data.read_waitq);
> + init_waitqueue_head(&sysgenid_data.outdated_waitq);
> + spin_lock_init(&sysgenid_data.lock);
> +
> + ret = misc_register(&sysgenid_misc);
> + if (ret < 0) {
> + pr_err("misc_register() failed for sysgenid\n");
> + goto err;
> + }
> +
> + return 0;
> +
> +err:
> + free_pages(sysgenid_data.map_buf, 0);
> + sysgenid_data.map_buf = 0;
> +
> + return ret;
> +}
> +
> +static void __exit sysgenid_exit(void)
> +{
> + misc_deregister(&sysgenid_misc);
> + free_pages(sysgenid_data.map_buf, 0);
> + sysgenid_data.map_buf = 0;
> +}
> +
> +module_init(sysgenid_init);
> +module_exit(sysgenid_exit);
> +
> +MODULE_AUTHOR("Adrian Catangiu");
> +MODULE_DESCRIPTION("System Generation ID");
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION("0.1");
> diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
> new file mode 100644
> index 0000000..ea38fd3
> --- /dev/null
> +++ b/include/uapi/linux/sysgenid.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_LINUX_SYSGENID_H
> +#define _UAPI_LINUX_SYSGENID_H
> +
> +#include <linux/ioctl.h>
> +
> +#define SYSGENID_IOCTL 0x2d
> +#define SYSGENID_GET_OUTDATED_WATCHERS _IO(SYSGENID_IOCTL, 1)
> +#define SYSGENID_WAIT_WATCHERS _IO(SYSGENID_IOCTL, 2)
> +#define SYSGENID_FORCE_GEN_UPDATE _IO(SYSGENID_IOCTL, 3)
> +
> +#ifdef __KERNEL__
> +void sysgenid_bump_generation(void);
> +#endif /* __KERNEL__ */
> +
> +#endif /* _UAPI_LINUX_SYSGENID_H */
> +
> --
> 2.7.4
>
>
>
>
> Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-01-19 22:39:34

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] drivers/misc: sysgenid: add system generation id driver

Hi--

On 1/12/21 4:15 AM, Adrian Catangiu wrote:
> - Background and problem
>

> ---
> Documentation/misc-devices/sysgenid.rst | 240 +++++++++++++++++++++++++
> drivers/misc/Kconfig | 16 ++
> drivers/misc/Makefile | 1 +
> drivers/misc/sysgenid.c | 298 ++++++++++++++++++++++++++++++++
> include/uapi/linux/sysgenid.h | 18 ++
> 5 files changed, 573 insertions(+)
> create mode 100644 Documentation/misc-devices/sysgenid.rst
> create mode 100644 drivers/misc/sysgenid.c
> create mode 100644 include/uapi/linux/sysgenid.h
>
> diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
> new file mode 100644
> index 0000000..0b31ccf
> --- /dev/null
> +++ b/Documentation/misc-devices/sysgenid.rst
> @@ -0,0 +1,240 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========
> +SYSGENID
> +========
> +
> +The System Generation ID feature is required in virtualized or
> +containerized environments by applications that work with local copies
> +or caches of world-unique data such as random values, UUIDs,
> +monotonically increasing counters, etc.
> +Such applications can be negatively affected by VM or container
> +snapshotting when the VM or container is either cloned or returned to
> +an earlier point in time.
> +
> +The System Generation ID is a simple concept meant to alleviate the
> +issue by providing a monotonically increasing counter that changes
> +each time the VM or container is restored from a snapshot.
> +The driver for it lives at ``drivers/misc/sysgenid.c``.
> +
> +The ``sysgenid`` driver exposes a monotonic incremental System
> +Generation u32 counter via a char-dev FS interface accessible through
> +``/dev/sysgenid`` that provides sync and async SysGen counter updates

update

> +notifications. It also provides SysGen counter retrieval and
> +confirmation mechanisms.
> +
> +The counter starts from zero when the driver is initialized and
> +monotonically increments every time the system generation changes.
> +
> +The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
> +symbol which can be used by backend drivers to drive system generation
> +changes based on hardware events.
> +System generation changes can also be driven by userspace software
> +through a dedicated driver ioctl.
> +
> +Userspace applications or libraries can (a)synchronously consume the
> +system generation counter through the provided FS interface, to make
> +any necessary internal adjustments following a system generation update.
> +
> +Driver FS interface:
> +
> +``open()``:
> + When the device is opened, a copy of the current Sys-Gen-Id (counter)
> + is associated with the open file descriptor. The driver now tracks
> + this file as an independent *watcher*. The driver tracks how many
> + watchers are aware of the latest Sys-Gen-Id counter and how many of
> + them are *outdated*; outdated being those that have lived through
> + a Sys-Gen-Id change but not yet confirmed the new generation counter.
> +
> +``read()``:
> + Read is meant to provide the *new* system generation counter when a
> + generation change takes place. The read operation blocks until the
> + associated counter is no longer up to date, at which point the new
> + counter is provided/returned.
> + Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
> + *new* counter value available. The generation counter is considered
> + *new* for each open file descriptor that hasn't confirmed the new
> + value following a generation change. Therefore, once a generation
> + change takes place, all ``read()`` calls will immediately return the
> + new generation counter and will continue to do so until the
> + new value is confirmed back to the driver through ``write()``.
> + Partial reads are not allowed - read buffer needs to be at least
> + ``sizeof(unsigned)`` in size.

Please use (unsigned int), not just (unsigned).
(Linux style)

> +
> +``write()``:
> + Write is used to confirm the up-to-date Sys Gen counter back to the
> + driver.
> + Following a VM generation change, all existing watchers are marked
> + as *outdated*. Each file descriptor will maintain the *outdated*
> + status until a ``write()`` confirms the up-to-date counter back to
> + the driver.
> + Partial writes are not allowed - write buffer should be exactly
> + ``sizeof(unsigned)`` in size.

ditto.

> +
> +``poll()``:
> + Poll is implemented to allow polling for generation counter updates.
> + Such updates result in ``EPOLLIN`` polling status until the new
> + up-to-date counter is confirmed back to the driver through a
> + ``write()``.
> +
> +``ioctl()``:
> + The driver also adds support for tracking count of open file
> + descriptors that haven't acknowledged a generation counter update,
> + as well as a mechanism for userspace to *force* a generation update:
> +
> + - SYSGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
> + *outdated* watchers - number of file descriptors that were open
> + during a system generation change, and which have not yet confirmed
> + the new generation counter.
> + - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
> + watchers, or if a ``timeout`` argument is provided, until the
> + timeout expires.
> + If the current caller is *outdated* or a generation change happens
> + while waiting (thus making current caller *outdated*), the ioctl
> + returns ``-EINTR`` to signal the user to handle event and retry.
> + - SYSGENID_FORCE_GEN_UPDATE: forces a generation counter increment.
> + It takes a ``minimum-generation`` argument which represents the
> + minimum value the generation counter will be incremented to. For
> + example if current generation is ``5`` and ``SYSGENID_FORCE_GEN_UPDATE(8)``
> + is called, the generation counter will increment to ``8``.
> + This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
> + or CAP_SYS_ADMIN capabilities.
> +
> +``mmap()``:
> + The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
> + in size. The first 4 bytes of the mapped page will contain an
> + up-to-date u32 copy of the system generation counter.
> + The mapped memory can be used as a low-latency generation counter
> + probe mechanism in critical sections - see examples.
> +
> +``close()``:
> + Removes the file descriptor as a system generation counter *watcher*.
> +
> +Example application workflows
> +-----------------------------
> +
> +1) Watchdog thread simplified example::
> +
> + void watchdog_thread_handler(int *thread_active)
> + {
> + unsigned genid;

unsigned int genid;

> + int fd = open("/dev/sysgenid", O_RDWR | O_CLOEXEC, S_IRUSR | S_IWUSR);
> +
> + do {
> + // read new gen ID - blocks until VM generation changes
> + read(fd, &genid, sizeof(genid));
> +
> + // because of VM generation change, we need to rebuild world
> + reseed_app_env();
> +
> + // confirm we're done handling gen ID update
> + write(fd, &genid, sizeof(genid));
> + } while (atomic_read(thread_active));
> +
> + close(fd);
> + }
> +
> +2) ASYNC simplified example::
> +
> + void handle_io_on_sysgenfd(int sysgenfd)
> + {
> + unsigned genid;

unsigned int genid;

> +
> + // read new gen ID - we need it to confirm we've handled update
> + read(fd, &genid, sizeof(genid));
> +
> + // because of VM generation change, we need to rebuild world
> + reseed_app_env();
> +
> + // confirm we're done handling the gen ID update
> + write(fd, &genid, sizeof(genid));
> + }
> +
> + int main() {
> + int epfd, sysgenfd;
> + struct epoll_event ev;
> +
> + epfd = epoll_create(EPOLL_QUEUE_LEN);
> +
> + sysgenfd = open("/dev/sysgenid",
> + O_RDWR | O_CLOEXEC | O_NONBLOCK,
> + S_IRUSR | S_IWUSR);
> +
> + // register sysgenid for polling
> + ev.events = EPOLLIN;
> + ev.data.fd = sysgenfd;
> + epoll_ctl(epfd, EPOLL_CTL_ADD, sysgenfd, &ev);
> +
> + // register other parts of your app for polling
> + // ...
> +
> + while (1) {
> + // wait for something to do...
> + int nfds = epoll_wait(epfd, events,
> + MAX_EPOLL_EVENTS_PER_RUN,
> + EPOLL_RUN_TIMEOUT);
> + if (nfds < 0) die("Error in epoll_wait!");
> +
> + // for each ready fd
> + for(int i = 0; i < nfds; i++) {
> + int fd = events[i].data.fd;
> +
> + if (fd == sysgenfd)
> + handle_io_on_sysgenfd(sysgenfd);
> + else
> + handle_some_other_part_of_the_app(fd);
> + }
> + }
> +
> + return 0;
> + }
> +
> +3) Mapped memory polling simplified example::
> +
> + /*
> + * app/library function that provides cached secrets
> + */
> + char * safe_cached_secret(app_data_t *app)
> + {
> + char *secret;
> + volatile unsigned *const genid_ptr = get_sysgenid_mapping(app);

unsigned int

> + again:
> + secret = __cached_secret(app);
> +
> + if (unlikely(*genid_ptr != app->cached_genid)) {
> + app->cached_genid = *genid_ptr;
> + barrier();
> +
> + // rebuild world then confirm the genid update (thru write)
> + rebuild_caches(app);
> +
> + ack_sysgenid_update(app);
> +
> + goto again;
> + }
> +
> + return secret;
> + }
> +
> +4) Orchestrator simplified example::
> +
> + /*
> + * orchestrator - manages multiple applications and libraries used by
> + * a service and tries to make sure all sensitive components gracefully
> + * handle VM generation changes.
> + * Following function is called on detection of a VM generation change.
> + */
> + int handle_sysgen_update(int sysgen_fd, unsigned new_gen_id)

unsigned int

> + {
> + // pause until all components have handled event
> + pause_service();
> +
> + // confirm *this* watcher as up-to-date
> + write(sysgen_fd, &new_gen_id, sizeof(unsigned));

unsigned int

> +
> + // wait for all *others* for at most 5 seconds.
> + ioctl(sysgen_fd, VMGENID_WAIT_WATCHERS, 5000);
> +
> + // all applications on the system have rebuilt worlds
> + resume_service();
> + }


> diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
> new file mode 100644
> index 0000000..ea38fd3
> --- /dev/null
> +++ b/include/uapi/linux/sysgenid.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_LINUX_SYSGENID_H
> +#define _UAPI_LINUX_SYSGENID_H
> +
> +#include <linux/ioctl.h>
> +
> +#define SYSGENID_IOCTL 0x2d

Please document new IOCTL major/magic values in
Documentation/userspace-api/ioctl/ioctl-number.rst.



thanks.
--
~Randy
"He closes his eyes and drops the goggles. You can't get hurt
by looking at a bitmap. Or can you?"
(Neal Stephenson: Snow Crash)

2021-01-21 09:58:42

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] drivers/misc: sysgenid: add system generation id driver

On 12/01/2021, 15:07, "Greg KH" <[email protected]> wrote:

On Tue, Jan 12, 2021 at 02:15:59PM +0200, Adrian Catangiu wrote:
> + Partial reads are not allowed - read buffer needs to be at least
> + ``sizeof(unsigned)`` in size.

"sizeof(unsigned)"? How about being specific and making this a real "X
bits big" value please.

"unsigned" does not work well across user/kernel boundries. Ok, that's
on understatement, the correct thing is "does not work at all".

Please be specific in your apis.

This is listed elsewhere also.

Right, will do!

> + - SYSGENID_GET_OUTDATED_WATCHERS: immediately returns the number of
> + *outdated* watchers - number of file descriptors that were open
> + during a system generation change, and which have not yet confirmed
> + the new generation counter.

But this number can instantly change after it is read, what good is it?
It should never be relied on, so why is this needed at all?

What can userspace do with this information?

That is true, a userspace process either has to wait for all to adjust to the new generation
or not care about other processes. Intermediate probing doesn't bring real value. Will remove.

thanks,

greg k-h

Thanks for the feedback!
Adrian.




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-01-21 10:36:19

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] System Generation ID driver and VMGENID backend

On 12/01/2021, 14:49, "Michael S. Tsirkin" <[email protected]> wrote:

On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote:
> The first patch in the set implements a device driver which exposes a
> read-only device /dev/sysgenid to userspace, which contains a
> monotonically increasing u32 generation counter. Libraries and
> applications are expected to open() the device, and then call read()
> which blocks until the SysGenId changes. Following an update, read()
> calls no longer block until the application acknowledges the new
> SysGenId by write()ing it back to the device. Non-blocking read() calls
> return EAGAIN when there is no new SysGenId available. Alternatively,
> libraries can mmap() the device to get a single shared page which
> contains the latest SysGenId at offset 0.

Looking at some specifications, the gen ID might actually be located
at an arbitrary address. How about instead of hard-coding the offset,
we expose it e.g. in sysfs?

The functionality is split between SysGenID which exposes an internal u32
counter to userspace, and an (optional) VmGenID backend which drives
SysGenID generation changes based on hw vmgenid updates.

The hw UUID you're referring to (vmgenid) is not mmap-ed to userspace or
otherwise exposed to userspace. It is only used internally by the vmgenid
driver to find out about VM generation changes and drive the more generic
SysGenID.

The SysGenID u32 monotonic increasing counter is the one that is mmaped to
userspace, but it is a software counter. I don't see any value in using a dynamic
offset in the mmaped page. Offset 0 is fast and easy and most importantly it is
static so no need to dynamically calculate or find it at runtime.

Thanks,
Adrian.




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-01-21 12:52:58

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] drivers/misc: sysgenid: add system generation id driver

On 12/01/2021, 15:09, "Michael S. Tsirkin" <[email protected]> wrote:

On Tue, Jan 12, 2021 at 02:15:59PM +0200, Adrian Catangiu wrote:

> +3) Mapped memory polling simplified example::
> +
> + /*
> + * app/library function that provides cached secrets
> + */
> + char * safe_cached_secret(app_data_t *app)
> + {
> + char *secret;
> + volatile unsigned *const genid_ptr = get_sysgenid_mapping(app);
> + again:
> + secret = __cached_secret(app);
> +
> + if (unlikely(*genid_ptr != app->cached_genid)) {
> + app->cached_genid = *genid_ptr;
> + barrier();
> +
> + // rebuild world then confirm the genid update (thru write)
> + rebuild_caches(app);
> +
> + ack_sysgenid_update(app);
> +
> + goto again;
> + }
> +
> + return secret;



Hmm. This seems to rely on the assumption that if you have
read the ID and it did not change, then all is well.

Problem is, in the interrupt driven implementation
this is not a safe assumption to make: ID
from hypervisor might have changed but interrupt
could be delayed.


If we map the hypervisor ID to userspace then we won't
have this race ... worth worry about? why not?

This is a very good point! Unfortunately, there is no immediate solution here.
The current patch-set makes this trade-off in order to gain the genericity of
a system-level generation ID which is not limited to VMs usecases, but can also
be used with things like CRIU.

Directly mapping the vmgenid UUID to userspace was the initial design of this
patch-set (see v1), but it was argued against and evolved into current state.

I would personally rather enhance the HW support (vmgenid device for example)
to also expose a configurable u32 Vm Gen Counter alongside the 128-bit UUID;
and add support in SysGenID to offload the counter to HW when applicable.


The broader view is we need to strike the right balance between a functional,
safe mechanism with today's technology, but also design a future-proof generic
mechanism.

Fixing the race you mentioned in SysGenID only moves the race a bit further up
the stack - you generate the secret race-free but the secret can still be duplicated
in the next layer. To make any mechanism completely safe we need to conceptually
disconnect ourselves from believing that a restored snapshot is immediately available.
There needs to be some entity that moves the restored VM/container/system
from a transient state to "available". That entity can be a process inside the VM,
but it can also be something outside the VM, in the hypervisor or in the stack
surrounding it. That entity is responsible for managing the transition of the VM
or container from transient -> available. It is responsible for not allowing the VM/
container to be used or usable until that transition is complete.

In the first generations of VM clones and CRIU production deployments, I expect
this entity to be a stack-specific component with intimate knowledge of the system
components, transient states, lifecycle, etc. Which this patch-set enables.


Thanks,
Adrian.





Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-01-27 23:57:19

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] System Generation ID driver and VMGENID backend

On Thu, Jan 21, 2021 at 10:28:16AM +0000, Catangiu, Adrian Costin wrote:
> On 12/01/2021, 14:49, "Michael S. Tsirkin" <[email protected]> wrote:
>
> On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote:
> > The first patch in the set implements a device driver which exposes a
> > read-only device /dev/sysgenid to userspace, which contains a
> > monotonically increasing u32 generation counter. Libraries and
> > applications are expected to open() the device, and then call read()
> > which blocks until the SysGenId changes. Following an update, read()
> > calls no longer block until the application acknowledges the new
> > SysGenId by write()ing it back to the device. Non-blocking read() calls
> > return EAGAIN when there is no new SysGenId available. Alternatively,
> > libraries can mmap() the device to get a single shared page which
> > contains the latest SysGenId at offset 0.
>
> Looking at some specifications, the gen ID might actually be located
> at an arbitrary address. How about instead of hard-coding the offset,
> we expose it e.g. in sysfs?
>
> The functionality is split between SysGenID which exposes an internal u32
> counter to userspace, and an (optional) VmGenID backend which drives
> SysGenID generation changes based on hw vmgenid updates.
>
> The hw UUID you're referring to (vmgenid) is not mmap-ed to userspace or
> otherwise exposed to userspace. It is only used internally by the vmgenid
> driver to find out about VM generation changes and drive the more generic
> SysGenID.
>
> The SysGenID u32 monotonic increasing counter is the one that is mmaped to
> userspace, but it is a software counter. I don't see any value in using a dynamic
> offset in the mmaped page. Offset 0 is fast and easy and most importantly it is
> static so no need to dynamically calculate or find it at runtime.

Well you are burning a whole page on it, using an offset the page
can be shared with other functionality.

> Thanks,
> Adrian.
>
>
>
>
> Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-01-28 00:43:51

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] drivers/misc: sysgenid: add system generation id driver

Hi!

> - Solution
>
> The System Generation ID is a simple concept meant to alleviate the
> issue by providing a monotonically increasing u32 counter that changes
> each time the VM or container is restored from a snapshot.

I'd make it u64.

But as people explained, this has race problems that may be impossible
to solve?

Best regards,
Pavel

--
http://www.livejournal.com/~pavelmachek


Attachments:
(No filename) (425.00 B)
signature.asc (188.00 B)
Digital signature
Download all attachments

2021-01-28 13:04:24

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] System Generation ID driver and VMGENID backend

Hey Michael!

On 27.01.21 13:47, Michael S. Tsirkin wrote:
>
> On Thu, Jan 21, 2021 at 10:28:16AM +0000, Catangiu, Adrian Costin wrote:
>> On 12/01/2021, 14:49, "Michael S. Tsirkin" <[email protected]> wrote:
>>
>> On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote:
>> > The first patch in the set implements a device driver which exposes a
>> > read-only device /dev/sysgenid to userspace, which contains a
>> > monotonically increasing u32 generation counter. Libraries and
>> > applications are expected to open() the device, and then call read()
>> > which blocks until the SysGenId changes. Following an update, read()
>> > calls no longer block until the application acknowledges the new
>> > SysGenId by write()ing it back to the device. Non-blocking read() calls
>> > return EAGAIN when there is no new SysGenId available. Alternatively,
>> > libraries can mmap() the device to get a single shared page which
>> > contains the latest SysGenId at offset 0.
>>
>> Looking at some specifications, the gen ID might actually be located
>> at an arbitrary address. How about instead of hard-coding the offset,
>> we expose it e.g. in sysfs?
>>
>> The functionality is split between SysGenID which exposes an internal u32
>> counter to userspace, and an (optional) VmGenID backend which drives
>> SysGenID generation changes based on hw vmgenid updates.
>>
>> The hw UUID you're referring to (vmgenid) is not mmap-ed to userspace or
>> otherwise exposed to userspace. It is only used internally by the vmgenid
>> driver to find out about VM generation changes and drive the more generic
>> SysGenID.
>>
>> The SysGenID u32 monotonic increasing counter is the one that is mmaped to
>> userspace, but it is a software counter. I don't see any value in using a dynamic
>> offset in the mmaped page. Offset 0 is fast and easy and most importantly it is
>> static so no need to dynamically calculate or find it at runtime.
>
> Well you are burning a whole page on it, using an offset the page
> can be shared with other functionality.

Currently, the SysGenID lives is one page owned by Linux that we share
out to multiple user space clients. So yes, we burn a single page of the
system here.

If we put more data in that same page, what data would you put there?
Random other bits from other subsystems? At that point, we'd be
reinventing vdso all over again, no? Probably with the same problems.

Which gets me to the second alternative: Reuse VDSO. The problem there
is that the VDSO is an extremely architecture specific mechanism. Any
new architecture we'd want to support would need multiple layers of
changes in multiple layers of both kernel and libc. I'd like to avoid
that if we can :).

So that leaves us with either wasting a page per system or not having an
mmap() interface in the first place.

The reason we have the mmap() interface is that it's be easier to
consume for libraries, that are not hooked into the main event loop.

So, uh, what are you suggesting? :)


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



2021-02-02 22:26:20

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] drivers/misc: sysgenid: add system generation id driver

On Wed, Jan 27, 2021 at 11:15:05PM +0100, Pavel Machek wrote:
> Hi!
>
> > - Solution
> >
> > The System Generation ID is a simple concept meant to alleviate the
> > issue by providing a monotonically increasing u32 counter that changes
> > each time the VM or container is restored from a snapshot.
>
> I'd make it u64.
>
> But as people explained, this has race problems that may be impossible
> to solve?

Well the read/write interface could be used in a safe way thinkably:

- application checks VM gen id
- application sends a transaction e.g. to database
- application re-checks VM gen id
- if id changed, application checks the database for duplicate
transactions

not sure how can the mmap interface be used safely.
Drop it for now?



> Best regards,
> Pavel
>
> --
> http://www.livejournal.com/~pavelmachek


2021-02-02 22:28:49

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] System Generation ID driver and VMGENID backend

On Thu, Jan 28, 2021 at 01:58:12PM +0100, Alexander Graf wrote:
> Hey Michael!
>
> On 27.01.21 13:47, Michael S. Tsirkin wrote:
> >
> > On Thu, Jan 21, 2021 at 10:28:16AM +0000, Catangiu, Adrian Costin wrote:
> > > On 12/01/2021, 14:49, "Michael S. Tsirkin" <[email protected]> wrote:
> > >
> > > On Tue, Jan 12, 2021 at 02:15:58PM +0200, Adrian Catangiu wrote:
> > > > The first patch in the set implements a device driver which exposes a
> > > > read-only device /dev/sysgenid to userspace, which contains a
> > > > monotonically increasing u32 generation counter. Libraries and
> > > > applications are expected to open() the device, and then call read()
> > > > which blocks until the SysGenId changes. Following an update, read()
> > > > calls no longer block until the application acknowledges the new
> > > > SysGenId by write()ing it back to the device. Non-blocking read() calls
> > > > return EAGAIN when there is no new SysGenId available. Alternatively,
> > > > libraries can mmap() the device to get a single shared page which
> > > > contains the latest SysGenId at offset 0.
> > >
> > > Looking at some specifications, the gen ID might actually be located
> > > at an arbitrary address. How about instead of hard-coding the offset,
> > > we expose it e.g. in sysfs?
> > >
> > > The functionality is split between SysGenID which exposes an internal u32
> > > counter to userspace, and an (optional) VmGenID backend which drives
> > > SysGenID generation changes based on hw vmgenid updates.
> > >
> > > The hw UUID you're referring to (vmgenid) is not mmap-ed to userspace or
> > > otherwise exposed to userspace. It is only used internally by the vmgenid
> > > driver to find out about VM generation changes and drive the more generic
> > > SysGenID.
> > >
> > > The SysGenID u32 monotonic increasing counter is the one that is mmaped to
> > > userspace, but it is a software counter. I don't see any value in using a dynamic
> > > offset in the mmaped page. Offset 0 is fast and easy and most importantly it is
> > > static so no need to dynamically calculate or find it at runtime.
> >
> > Well you are burning a whole page on it, using an offset the page
> > can be shared with other functionality.
>
> Currently, the SysGenID lives is one page owned by Linux that we share out
> to multiple user space clients. So yes, we burn a single page of the system
> here.
>
> If we put more data in that same page, what data would you put there? Random
> other bits from other subsystems? At that point, we'd be reinventing vdso
> all over again, no? Probably with the same problems.
>
> Which gets me to the second alternative: Reuse VDSO. The problem there is
> that the VDSO is an extremely architecture specific mechanism. Any new
> architecture we'd want to support would need multiple layers of changes in
> multiple layers of both kernel and libc. I'd like to avoid that if we can
> :).
>
> So that leaves us with either wasting a page per system or not having an
> mmap() interface in the first place.
>
> The reason we have the mmap() interface is that it's be easier to consume
> for libraries, that are not hooked into the main event loop.
>
> So, uh, what are you suggesting? :)

I'd drop mmap at this point. I haven't seen a way to use it
that isn't racy.

>
> Alex
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
>
>