2021-02-01 17:27:35

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: [PATCH v5 0/2] System Generation ID driver and VMGENID backend

This feature is aimed at virtualized or containerized environments
where VM or container snapshotting duplicates memory state, which is a
challenge for applications that want to generate unique data such as
request IDs, UUIDs, and cryptographic nonces.

The patch set introduces a mechanism that provides a userspace
interface for applications and libraries to be made aware of uniqueness
breaking events such as VM or container snapshotting, and allow them to
react and adapt to such events.

Solving the uniqueness problem strongly enough for cryptographic
purposes requires a mechanism which can deterministically reseed
userspace PRNGs with new entropy at restore time. This mechanism must
also support the high-throughput and low-latency use-cases that led
programmers to pick a userspace PRNG in the first place; be usable by
both application code and libraries; allow transparent retrofitting
behind existing popular PRNG interfaces without changing application
code; it must be efficient, especially on snapshot restore; and be
simple enough for wide adoption.

The first patch in the set implements a device driver which exposes a
read-only device /dev/sysgenid to userspace, which contains a
monotonically increasing u32 generation counter. Libraries and
applications are expected to open() the device, and then call read()
which blocks until the SysGenId changes. Following an update, read()
calls no longer block until the application acknowledges the new
SysGenId by write()ing it back to the device. Non-blocking read() calls
return EAGAIN when there is no new SysGenId available. Alternatively,
libraries can mmap() the device to get a single shared page which
contains the latest SysGenId at offset 0.

SysGenId also supports a waiting mechanism exposed through a IOCTL on
the device. The SYSGENID_WAIT_WATCHERS IOCTL blocks until there are no
open file handles on the device which haven’t acknowledged the new id.
This waiting mechanism is intended for serverless and container control
planes, which want to confirm that all application code has detected
and reacted to the new SysGenId before sending an invoke to the
newly-restored sandbox.

The second patch in the set adds a VmGenId driver which makes use of
the ACPI vmgenid device to drive SysGenId and to reseed kernel entropy
on VM snapshots.

---

v4 -> v5:

- sysgenid: generation changes are also exported through uevents
- remove SYSGENID_GET_OUTDATED_WATCHERS ioctl
- document sysgenid ioctl major/minor numbers
- rebase on linus latest + various nits

v3 -> v4:

- split functionality in two separate kernel modules:
1. drivers/misc/sysgenid.c which provides the generic userspace
interface and mechanisms
2. drivers/virt/vmgenid.c as VMGENID acpi device driver that seeds
kernel entropy and acts as a driving backend for the generic
sysgenid
- renamed /dev/vmgenid -> /dev/sysgenid
- renamed uapi header file vmgenid.h -> sysgenid.h
- renamed ioctls VMGENID_* -> SYSGENID_*
- added ‘min_gen’ parameter to SYSGENID_FORCE_GEN_UPDATE ioctl
- fixed races in documentation examples
- various style nits
- rebased on top of linus latest

v2 -> v3:

- separate the core driver logic and interface, from the ACPI device.
The ACPI vmgenid device is now one possible backend.
- fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
- add locking to avoid races between fs ops handlers and hw irq
driven generation updates
- change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
outdated or a generation change happens while waiting (thus making
current caller outdated), the ioctl returns -EINTR to signal the
user to handle event and retry. Fixes blocking on oneself.
- add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
CAP_CHECKPOINT_RESTORE capability, through which software can force
generation bump.

v1 -> v2:

- expose to userspace a monotonically increasing u32 Vm Gen Counter
instead of the hw VmGen UUID
- since the hw/hypervisor-provided 128-bit UUID is not public
anymore, add it to the kernel RNG as device randomness
- insert driver page containing Vm Gen Counter in the user vma in
the driver's mmap handler instead of using a fault handler
- turn driver into a misc device driver to auto-create /dev/vmgenid
- change ioctl arg to avoid leaking kernel structs to userspace
- update documentation
- various nits
- rebase on top of linus latest

Adrian Catangiu (2):
drivers/misc: sysgenid: add system generation id driver
drivers/virt: vmgenid: add vm generation id driver

Documentation/misc-devices/sysgenid.rst | 236 ++++++++++++++++
Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
Documentation/virt/vmgenid.rst | 34 +++
MAINTAINERS | 15 +
drivers/misc/Kconfig | 16 ++
drivers/misc/Makefile | 1 +
drivers/misc/sysgenid.c | 307 +++++++++++++++++++++
drivers/virt/Kconfig | 14 +
drivers/virt/Makefile | 1 +
drivers/virt/vmgenid.c | 153 ++++++++++
include/uapi/linux/sysgenid.h | 17 ++
11 files changed, 795 insertions(+)
create mode 100644 Documentation/misc-devices/sysgenid.rst
create mode 100644 Documentation/virt/vmgenid.rst
create mode 100644 drivers/misc/sysgenid.c
create mode 100644 drivers/virt/vmgenid.c
create mode 100644 include/uapi/linux/sysgenid.h

--
2.7.4




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.


2021-02-01 17:28:13

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: [PATCH v5 2/2] drivers/virt: vmgenid: add vm generation id driver

- Background

The VM Generation ID is a feature defined by Microsoft (paper:
http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
multiple hypervisor vendors.

The feature can be used to drive the `sysgenid` mechanism required in
virtualized environments by software that works with local copies and
caches of world-unique data such as random values, uuids, monotonically
increasing counters, etc.

- Solution

The VM Generation ID is a hypervisor/hardware provided 128-bit unique
ID that changes each time the VM is restored from a snapshot. It can be
used to differentiate between VMs or different generations of the same
VM.
This VM Generation ID is exposed through an ACPI device by multiple
hypervisor vendors.

The `vmgenid` driver uses ACPI events to be notified by hardware
changes to the 128-bit Vm Gen Id HW UUID. The UUID is not exposed to
userspace, it is added by the driver as device randomness to improve
kernel entropy following VM snapshot events.

This driver also acts as a backend for the `sysgenid` kernel module
(`drivers/misc/sysgenid.c`, `Documentation/misc-devices/sysgenid.rst`)
to drive changes to the "System Generation Id" which is further exposed
to userspace as a system-wide monotonically increasing counter.

This patch builds on top of Or Idgar <[email protected]>'s proposal
https://lkml.org/lkml/2018/3/1/498

Signed-off-by: Adrian Catangiu <[email protected]>
---
Documentation/virt/vmgenid.rst | 34 +++++++++
MAINTAINERS | 7 ++
drivers/virt/Kconfig | 14 ++++
drivers/virt/Makefile | 1 +
drivers/virt/vmgenid.c | 153 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 209 insertions(+)
create mode 100644 Documentation/virt/vmgenid.rst
create mode 100644 drivers/virt/vmgenid.c

diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
new file mode 100644
index 0000000..2106354
--- /dev/null
+++ b/Documentation/virt/vmgenid.rst
@@ -0,0 +1,34 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+VMGENID
+=======
+
+The VM Generation ID is a feature defined by Microsoft (paper:
+http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
+multiple hypervisor vendors.
+
+The feature is required in virtualized environments by applications
+that work with local copies/caches of world-unique data such as random
+values, UUIDs, monotonically increasing counters, etc.
+Such applications can be negatively affected by VM snapshotting when
+the VM is either cloned or returned to an earlier point in time.
+
+The VM Generation ID is a simple concept meant to alleviate the issue
+by providing a unique ID that changes each time the VM is restored
+from a snapshot. The hardware provided UUID value can be used to
+differentiate between VMs or different generations of the same VM.
+
+The VM Generation ID is exposed through an ACPI device by multiple
+hypervisor vendors. The driver for it lives at
+``drivers/virt/vmgenid.c``
+
+The ``vmgenid`` driver uses ACPI events to be notified by hardware
+changes to the 128-bit Vm Gen Id UUID. This UUID is not exposed to
+userspace, it is added by the driver as device randomness to improve
+kernel entropy following VM snapshot events.
+
+This driver also acts as a backend for the ``sysgenid`` kernel module
+(``drivers/misc/sysgenid.c``, ``Documentation/misc-devices/sysgenid.rst``)
+to drive changes to the "System Generation Id" which is further exposed
+to userspace as a monotonically increasing counter.
diff --git a/MAINTAINERS b/MAINTAINERS
index f1158b0..4615612 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19009,6 +19009,13 @@ F: drivers/staging/vme/
F: drivers/vme/
F: include/linux/vme*

+VMGENID
+M: Adrian Catangiu <[email protected]>
+L: [email protected]
+S: Supported
+F: Documentation/virt/vmgenid.rst
+F: drivers/virt/vmgenid.c
+
VMWARE BALLOON DRIVER
M: Nadav Amit <[email protected]>
M: "VMware, Inc." <[email protected]>
diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
index 80c5f9c1..4771633 100644
--- a/drivers/virt/Kconfig
+++ b/drivers/virt/Kconfig
@@ -13,6 +13,20 @@ menuconfig VIRT_DRIVERS

if VIRT_DRIVERS

+config VMGENID
+ tristate "Virtual Machine Generation ID driver"
+ depends on ACPI && SYSGENID
+ default N
+ help
+ The driver uses the hypervisor provided Virtual Machine Generation ID
+ to drive the system generation counter mechanism exposed by sysgenid.
+ The vmgenid changes on VM snapshots or VM cloning. The hypervisor
+ provided 128-bit vmgenid is also used as device randomness to improve
+ kernel entropy following VM snapshot events.
+
+ To compile this driver as a module, choose M here: the
+ module will be called vmgenid.
+
config FSL_HV_MANAGER
tristate "Freescale hypervisor management driver"
depends on FSL_SOC
diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index f28425c..889be01 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -4,6 +4,7 @@
#

obj-$(CONFIG_FSL_HV_MANAGER) += fsl_hypervisor.o
+obj-$(CONFIG_VMGENID) += vmgenid.o
obj-y += vboxguest/

obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves/
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
new file mode 100644
index 0000000..d9d089a
--- /dev/null
+++ b/drivers/virt/vmgenid.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtual Machine Generation ID driver
+ *
+ * Copyright (C) 2018 Red Hat Inc. All rights reserved.
+ *
+ * Copyright (C) 2020 Amazon. All rights reserved.
+ *
+ * Authors:
+ * Adrian Catangiu <[email protected]>
+ * Or Idgar <[email protected]>
+ * Gal Hammer <[email protected]>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/random.h>
+#include <linux/uuid.h>
+#include <linux/sysgenid.h>
+
+#define DEV_NAME "vmgenid"
+ACPI_MODULE_NAME(DEV_NAME);
+
+struct vmgenid_data {
+ uuid_t uuid;
+ void *uuid_iomap;
+};
+static struct vmgenid_data vmgenid_data;
+
+static int vmgenid_acpi_map(struct vmgenid_data *priv, acpi_handle handle)
+{
+ int i;
+ phys_addr_t phys_addr;
+ struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
+ acpi_status status;
+ union acpi_object *pss;
+ union acpi_object *element;
+
+ status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer);
+ if (ACPI_FAILURE(status)) {
+ ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+ return -ENODEV;
+ }
+ pss = buffer.pointer;
+ if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2)
+ return -EINVAL;
+
+ phys_addr = 0;
+ for (i = 0; i < pss->package.count; i++) {
+ element = &(pss->package.elements[i]);
+ if (element->type != ACPI_TYPE_INTEGER)
+ return -EINVAL;
+ phys_addr |= element->integer.value << i * 32;
+ }
+
+ priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t));
+ if (!priv->uuid_iomap) {
+ pr_err("Could not map memory at 0x%llx, size %u\n",
+ phys_addr,
+ (u32) sizeof(uuid_t));
+ return -ENOMEM;
+ }
+
+ memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+
+ return 0;
+}
+
+static int vmgenid_acpi_add(struct acpi_device *device)
+{
+ int ret;
+
+ if (!device)
+ return -EINVAL;
+ device->driver_data = &vmgenid_data;
+
+ ret = vmgenid_acpi_map(device->driver_data, device->handle);
+ if (ret < 0) {
+ pr_err("vmgenid: failed to map acpi device\n");
+ device->driver_data = NULL;
+ }
+
+ return ret;
+}
+
+static int vmgenid_acpi_remove(struct acpi_device *device)
+{
+ if (!device || acpi_driver_data(device) != &vmgenid_data)
+ return -EINVAL;
+ device->driver_data = NULL;
+
+ if (vmgenid_data.uuid_iomap)
+ acpi_os_unmap_memory(vmgenid_data.uuid_iomap, sizeof(uuid_t));
+ vmgenid_data.uuid_iomap = NULL;
+
+ return 0;
+}
+
+static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
+{
+ uuid_t old_uuid;
+
+ if (!device || acpi_driver_data(device) != &vmgenid_data) {
+ pr_err("VMGENID notify with unexpected driver private data\n");
+ return;
+ }
+
+ /* update VM Generation UUID */
+ old_uuid = vmgenid_data.uuid;
+ memcpy_fromio(&vmgenid_data.uuid, vmgenid_data.uuid_iomap, sizeof(uuid_t));
+
+ if (memcmp(&old_uuid, &vmgenid_data.uuid, sizeof(uuid_t))) {
+ /* HW uuid updated */
+ sysgenid_bump_generation();
+ add_device_randomness(&vmgenid_data.uuid, sizeof(uuid_t));
+ }
+}
+
+static const struct acpi_device_id vmgenid_ids[] = {
+ {"VMGENID", 0},
+ {"QEMUVGID", 0},
+ {"", 0},
+};
+
+static struct acpi_driver acpi_vmgenid_driver = {
+ .name = "vm_generation_id",
+ .ids = vmgenid_ids,
+ .owner = THIS_MODULE,
+ .ops = {
+ .add = vmgenid_acpi_add,
+ .remove = vmgenid_acpi_remove,
+ .notify = vmgenid_acpi_notify,
+ }
+};
+
+static int __init vmgenid_init(void)
+{
+ return acpi_bus_register_driver(&acpi_vmgenid_driver);
+}
+
+static void __exit vmgenid_exit(void)
+{
+ acpi_bus_unregister_driver(&acpi_vmgenid_driver);
+}
+
+module_init(vmgenid_init);
+module_exit(vmgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("Virtual Machine Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
--
2.7.4




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-02-01 17:28:24

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver

- Background and problem

The System Generation ID feature is required in virtualized or
containerized environments by applications that work with local copies
or caches of world-unique data such as random values, uuids,
monotonically increasing counters, etc.
Such applications can be negatively affected by VM or container
snapshotting when the VM or container is either cloned or returned to
an earlier point in time.

Furthermore, simply finding out about a system generation change is
only the starting point of a process to renew internal states of
possibly multiple applications across the system. This process could
benefit from a driver that provides an interface through which
orchestration can be easily done.

- Solution

The System Generation ID is a simple concept meant to alleviate the
issue by providing a monotonically increasing u32 counter that changes
each time the VM or container is restored from a snapshot.

The `sysgenid` driver exposes a monotonic incremental System Generation
u32 counter via a char-dev FS interface that provides sync and async
SysGen counter updates notifications. It also provides SysGen counter
retrieval and confirmation mechanisms.

The counter starts from zero when the driver is initialized and
monotonically increments every time the system generation changes.

The `sysgenid` driver exports the `void sysgenid_bump_generation()`
symbol which can be used by backend drivers to drive system generation
changes based on hardware events.
System generation changes can also be driven by userspace software
through a dedicated driver ioctl.

Userspace applications or libraries can then (a)synchronously consume
the system generation counter through the provided FS interface to make
any necessary internal adjustments following a system generation
update.

Signed-off-by: Adrian Catangiu <[email protected]>
---
Documentation/misc-devices/sysgenid.rst | 236 ++++++++++++++++
Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 8 +
drivers/misc/Kconfig | 16 ++
drivers/misc/Makefile | 1 +
drivers/misc/sysgenid.c | 307 +++++++++++++++++++++
include/uapi/linux/sysgenid.h | 17 ++
7 files changed, 586 insertions(+)
create mode 100644 Documentation/misc-devices/sysgenid.rst
create mode 100644 drivers/misc/sysgenid.c
create mode 100644 include/uapi/linux/sysgenid.h

diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
new file mode 100644
index 0000000..4337ca0
--- /dev/null
+++ b/Documentation/misc-devices/sysgenid.rst
@@ -0,0 +1,236 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+SYSGENID
+========
+
+The System Generation ID feature is required in virtualized or
+containerized environments by applications that work with local copies
+or caches of world-unique data such as random values, UUIDs,
+monotonically increasing counters, etc.
+Such applications can be negatively affected by VM or container
+snapshotting when the VM or container is either cloned or returned to
+an earlier point in time.
+
+The System Generation ID is a simple concept meant to alleviate the
+issue by providing a monotonically increasing counter that changes
+each time the VM or container is restored from a snapshot.
+The driver for it lives at ``drivers/misc/sysgenid.c``.
+
+The ``sysgenid`` driver exposes a monotonic incremental System
+Generation u32 counter via a char-dev FS interface accessible through
+``/dev/sysgenid`` that provides sync and async SysGen counter update
+notifications. It also provides SysGen counter retrieval and
+confirmation mechanisms.
+
+The counter starts from zero when the driver is initialized and
+monotonically increments every time the system generation changes.
+
+The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
+symbol which can be used by backend drivers to drive system generation
+changes based on hardware events.
+System generation changes can also be driven by userspace software
+through a dedicated driver ioctl.
+
+Userspace applications or libraries can (a)synchronously consume the
+system generation counter through the provided FS interface, to make
+any necessary internal adjustments following a system generation update.
+
+Driver FS interface:
+
+``open()``:
+ When the device is opened, a copy of the current Sys-Gen-Id (counter)
+ is associated with the open file descriptor. The driver now tracks
+ this file as an independent *watcher*. The driver tracks how many
+ watchers are aware of the latest Sys-Gen-Id counter and how many of
+ them are *outdated*; outdated being those that have lived through
+ a Sys-Gen-Id change but not yet confirmed the new generation counter.
+
+``read()``:
+ Read is meant to provide the *new* system generation counter when a
+ generation change takes place. The read operation blocks until the
+ associated counter is no longer up to date, at which point the new
+ counter is provided/returned.
+ Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
+ *new* counter value available. The generation counter is considered
+ *new* for each open file descriptor that hasn't confirmed the new
+ value following a generation change. Therefore, once a generation
+ change takes place, all ``read()`` calls will immediately return the
+ new generation counter and will continue to do so until the
+ new value is confirmed back to the driver through ``write()``.
+ Partial reads are not allowed - read buffer needs to be at least
+ 32 bits in size.
+
+``write()``:
+ Write is used to confirm the up-to-date Sys Gen counter back to the
+ driver.
+ Following a VM generation change, all existing watchers are marked
+ as *outdated*. Each file descriptor will maintain the *outdated*
+ status until a ``write()`` confirms the up-to-date counter back to
+ the driver.
+ Partial writes are not allowed - write buffer should be exactly
+ 32 bits in size.
+
+``poll()``:
+ Poll is implemented to allow polling for generation counter updates.
+ Such updates result in ``EPOLLIN`` polling status until the new
+ up-to-date counter is confirmed back to the driver through a
+ ``write()``.
+
+``ioctl()``:
+ The driver also adds support for waiting on open file descriptors
+ that haven't acknowledged a generation counter update, as well as a
+ mechanism for userspace to *force* a generation update:
+
+ - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
+ watchers, or if a ``timeout`` argument is provided, until the
+ timeout expires.
+ If the current caller is *outdated* or a generation change happens
+ while waiting (thus making current caller *outdated*), the ioctl
+ returns ``-EINTR`` to signal the user to handle event and retry.
+ - SYSGENID_FORCE_GEN_UPDATE: forces a generation counter increment.
+ It takes a ``minimum-generation`` argument which represents the
+ minimum value the generation counter will be incremented to. For
+ example if current generation is ``5`` and ``SYSGENID_FORCE_GEN_UPDATE(8)``
+ is called, the generation counter will increment to ``8``.
+ This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
+ or CAP_SYS_ADMIN capabilities.
+
+``mmap()``:
+ The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
+ in size. The first 4 bytes of the mapped page will contain an
+ up-to-date u32 copy of the system generation counter.
+ The mapped memory can be used as a low-latency generation counter
+ probe mechanism in critical sections - see examples.
+
+``close()``:
+ Removes the file descriptor as a system generation counter *watcher*.
+
+Example application workflows
+-----------------------------
+
+1) Watchdog thread simplified example::
+
+ void watchdog_thread_handler(int *thread_active)
+ {
+ unsigned int genid;
+ int fd = open("/dev/sysgenid", O_RDWR | O_CLOEXEC, S_IRUSR | S_IWUSR);
+
+ do {
+ // read new gen ID - blocks until VM generation changes
+ read(fd, &genid, sizeof(genid));
+
+ // because of VM generation change, we need to rebuild world
+ reseed_app_env();
+
+ // confirm we're done handling gen ID update
+ write(fd, &genid, sizeof(genid));
+ } while (atomic_read(thread_active));
+
+ close(fd);
+ }
+
+2) ASYNC simplified example::
+
+ void handle_io_on_sysgenfd(int sysgenfd)
+ {
+ unsigned int genid;
+
+ // read new gen ID - we need it to confirm we've handled update
+ read(fd, &genid, sizeof(genid));
+
+ // because of VM generation change, we need to rebuild world
+ reseed_app_env();
+
+ // confirm we're done handling the gen ID update
+ write(fd, &genid, sizeof(genid));
+ }
+
+ int main() {
+ int epfd, sysgenfd;
+ struct epoll_event ev;
+
+ epfd = epoll_create(EPOLL_QUEUE_LEN);
+
+ sysgenfd = open("/dev/sysgenid",
+ O_RDWR | O_CLOEXEC | O_NONBLOCK,
+ S_IRUSR | S_IWUSR);
+
+ // register sysgenid for polling
+ ev.events = EPOLLIN;
+ ev.data.fd = sysgenfd;
+ epoll_ctl(epfd, EPOLL_CTL_ADD, sysgenfd, &ev);
+
+ // register other parts of your app for polling
+ // ...
+
+ while (1) {
+ // wait for something to do...
+ int nfds = epoll_wait(epfd, events,
+ MAX_EPOLL_EVENTS_PER_RUN,
+ EPOLL_RUN_TIMEOUT);
+ if (nfds < 0) die("Error in epoll_wait!");
+
+ // for each ready fd
+ for(int i = 0; i < nfds; i++) {
+ int fd = events[i].data.fd;
+
+ if (fd == sysgenfd)
+ handle_io_on_sysgenfd(sysgenfd);
+ else
+ handle_some_other_part_of_the_app(fd);
+ }
+ }
+
+ return 0;
+ }
+
+3) Mapped memory polling simplified example::
+
+ /*
+ * app/library function that provides cached secrets
+ */
+ char * safe_cached_secret(app_data_t *app)
+ {
+ char *secret;
+ volatile unsigned int *const genid_ptr = get_sysgenid_mapping(app);
+ again:
+ secret = __cached_secret(app);
+
+ if (unlikely(*genid_ptr != app->cached_genid)) {
+ app->cached_genid = *genid_ptr;
+ barrier();
+
+ // rebuild world then confirm the genid update (thru write)
+ rebuild_caches(app);
+
+ ack_sysgenid_update(app);
+
+ goto again;
+ }
+
+ return secret;
+ }
+
+4) Orchestrator simplified example::
+
+ /*
+ * orchestrator - manages multiple applications and libraries used by
+ * a service and tries to make sure all sensitive components gracefully
+ * handle VM generation changes.
+ * Following function is called on detection of a VM generation change.
+ */
+ int handle_sysgen_update(int sysgen_fd, unsigned int new_gen_id)
+ {
+ // pause until all components have handled event
+ pause_service();
+
+ // confirm *this* watcher as up-to-date
+ write(sysgen_fd, &new_gen_id, sizeof(unsigned int));
+
+ // wait for all *others* for at most 5 seconds.
+ ioctl(sysgen_fd, VMGENID_WAIT_WATCHERS, 5000);
+
+ // all applications on the system have rebuilt worlds
+ resume_service();
+ }
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a4c75a2..d74fed4 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -356,6 +356,7 @@ Code Seq# Include File Comments
0xDB 00-0F drivers/char/mwave/mwavepub.h
0xDD 00-3F ZFCP device driver see drivers/s390/scsi/
<mailto:[email protected]>
+0xE4 01-02 uapi/linux/sysgenid.h SysGenID misc driver
0xE5 00-3F linux/fuse.h
0xEC 00-01 drivers/platform/chrome/cros_ec_dev.h ChromeOS EC driver
0xF3 00-3F drivers/usb/misc/sisusbvga/sisusb.h sisfb (in development)
diff --git a/MAINTAINERS b/MAINTAINERS
index d3e847f..f1158b0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17203,6 +17203,14 @@ L: [email protected]
S: Maintained
F: drivers/mmc/host/sdhci-pci-dwc-mshc.c

+SYSGENID
+M: Adrian Catangiu <[email protected]>
+L: [email protected]
+S: Supported
+F: Documentation/misc-devices/sysgenid.rst
+F: drivers/misc/sysgenid.c
+F: include/uapi/linux/sysgenid.h
+
SYSTEM CONFIGURATION (SYSCON)
M: Lee Jones <[email protected]>
M: Arnd Bergmann <[email protected]>
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index fafa8b0..931d716 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -456,6 +456,22 @@ config PVPANIC
a paravirtualized device provided by QEMU; it lets a virtual machine
(guest) communicate panic events to the host.

+config SYSGENID
+ tristate "System Generation ID driver"
+ default N
+ help
+ This is a System Generation ID driver which provides a system
+ generation counter. The driver exposes FS ops on /dev/sysgenid
+ through which it can provide information and notifications on system
+ generation changes that happen because of VM or container snapshots
+ or cloning.
+ This enables applications and libraries that store or cache
+ sensitive information, to know that they need to regenerate it
+ after process memory has been exposed to potential copying.
+
+ To compile this driver as a module, choose M here: the
+ module will be called sysgenid.
+
config HISI_HIKEY_USB
tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
depends on (OF && GPIOLIB) || COMPILE_TEST
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index d23231e..4b4933d 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -57,3 +57,4 @@ obj-$(CONFIG_HABANA_AI) += habanalabs/
obj-$(CONFIG_UACCE) += uacce/
obj-$(CONFIG_XILINX_SDFEC) += xilinx_sdfec.o
obj-$(CONFIG_HISI_HIKEY_USB) += hisi_hikey_usb.o
+obj-$(CONFIG_SYSGENID) += sysgenid.o
diff --git a/drivers/misc/sysgenid.c b/drivers/misc/sysgenid.c
new file mode 100644
index 0000000..95b3421
--- /dev/null
+++ b/drivers/misc/sysgenid.c
@@ -0,0 +1,307 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * System Generation ID driver
+ *
+ * Copyright (C) 2020 Amazon. All rights reserved.
+ *
+ * Authors:
+ * Adrian Catangiu <[email protected]>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/minmax.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/random.h>
+#include <linux/uuid.h>
+#include <linux/sysgenid.h>
+
+struct sysgenid_data {
+ unsigned long map_buf;
+ wait_queue_head_t read_waitq;
+ atomic_t generation_counter;
+
+ unsigned int watchers;
+ atomic_t outdated_watchers;
+ wait_queue_head_t outdated_waitq;
+ spinlock_t lock;
+};
+static struct sysgenid_data sysgenid_data;
+
+struct file_data {
+ unsigned int acked_gen_counter;
+};
+
+static void sysgenid_uevent(unsigned int gen_counter);
+
+static int equals_gen_counter(unsigned int counter)
+{
+ return counter == atomic_read(&sysgenid_data.generation_counter);
+}
+
+static void _bump_generation(int min_gen)
+{
+ unsigned long flags;
+ int counter;
+
+ spin_lock_irqsave(&sysgenid_data.lock, flags);
+ counter = max(min_gen, 1 + atomic_read(&sysgenid_data.generation_counter));
+ atomic_set(&sysgenid_data.generation_counter, counter);
+ *((int *) sysgenid_data.map_buf) = counter;
+ atomic_set(&sysgenid_data.outdated_watchers, sysgenid_data.watchers);
+
+ wake_up_interruptible(&sysgenid_data.read_waitq);
+ wake_up_interruptible(&sysgenid_data.outdated_waitq);
+ spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+
+ sysgenid_uevent(counter);
+}
+
+void sysgenid_bump_generation(void)
+{
+ _bump_generation(0);
+}
+EXPORT_SYMBOL(sysgenid_bump_generation);
+
+static void put_outdated_watchers(void)
+{
+ if (atomic_dec_and_test(&sysgenid_data.outdated_watchers))
+ wake_up_interruptible(&sysgenid_data.outdated_waitq);
+}
+
+static int sysgenid_open(struct inode *inode, struct file *file)
+{
+ struct file_data *fdata = kzalloc(sizeof(struct file_data), GFP_KERNEL);
+ unsigned long flags;
+
+ if (!fdata)
+ return -ENOMEM;
+
+ spin_lock_irqsave(&sysgenid_data.lock, flags);
+ fdata->acked_gen_counter = atomic_read(&sysgenid_data.generation_counter);
+ ++sysgenid_data.watchers;
+ spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+
+ file->private_data = fdata;
+
+ return 0;
+}
+
+static int sysgenid_close(struct inode *inode, struct file *file)
+{
+ struct file_data *fdata = file->private_data;
+ unsigned long flags;
+
+ spin_lock_irqsave(&sysgenid_data.lock, flags);
+ if (!equals_gen_counter(fdata->acked_gen_counter))
+ put_outdated_watchers();
+ --sysgenid_data.watchers;
+ spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+
+ kfree(fdata);
+
+ return 0;
+}
+
+static ssize_t sysgenid_read(struct file *file, char __user *ubuf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct file_data *fdata = file->private_data;
+ ssize_t ret;
+ int gen_counter;
+
+ if (nbytes == 0)
+ return 0;
+ /* disallow partial reads */
+ if (nbytes < sizeof(gen_counter))
+ return -EINVAL;
+
+ if (equals_gen_counter(fdata->acked_gen_counter)) {
+ if (file->f_flags & O_NONBLOCK)
+ return -EAGAIN;
+ ret = wait_event_interruptible(
+ sysgenid_data.read_waitq,
+ !equals_gen_counter(fdata->acked_gen_counter)
+ );
+ if (ret)
+ return ret;
+ }
+
+ gen_counter = atomic_read(&sysgenid_data.generation_counter);
+ ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
+ if (ret)
+ return -EFAULT;
+
+ return sizeof(gen_counter);
+}
+
+static ssize_t sysgenid_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct file_data *fdata = file->private_data;
+ unsigned int new_acked_gen;
+ unsigned long flags;
+
+ /* disallow partial writes */
+ if (count != sizeof(new_acked_gen))
+ return -EINVAL;
+ if (copy_from_user(&new_acked_gen, ubuf, count))
+ return -EFAULT;
+
+ spin_lock_irqsave(&sysgenid_data.lock, flags);
+ /* wrong gen-counter acknowledged */
+ if (!equals_gen_counter(new_acked_gen)) {
+ spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+ return -EINVAL;
+ }
+ if (!equals_gen_counter(fdata->acked_gen_counter)) {
+ fdata->acked_gen_counter = new_acked_gen;
+ put_outdated_watchers();
+ }
+ spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+
+ return (ssize_t)count;
+}
+
+static __poll_t sysgenid_poll(struct file *file, poll_table *wait)
+{
+ __poll_t mask = 0;
+ struct file_data *fdata = file->private_data;
+
+ if (!equals_gen_counter(fdata->acked_gen_counter))
+ return EPOLLIN | EPOLLRDNORM;
+
+ poll_wait(file, &sysgenid_data.read_waitq, wait);
+
+ if (!equals_gen_counter(fdata->acked_gen_counter))
+ mask = EPOLLIN | EPOLLRDNORM;
+
+ return mask;
+}
+
+static long sysgenid_ioctl(struct file *file,
+ unsigned int cmd, unsigned long arg)
+{
+ struct file_data *fdata = file->private_data;
+ unsigned long timeout_ns, min_gen;
+ ktime_t until;
+ int ret = 0;
+
+ switch (cmd) {
+ case SYSGENID_WAIT_WATCHERS:
+ timeout_ns = arg * NSEC_PER_MSEC;
+ until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
+
+ ret = wait_event_interruptible_hrtimeout(
+ sysgenid_data.outdated_waitq,
+ (!atomic_read(&sysgenid_data.outdated_watchers) ||
+ !equals_gen_counter(fdata->acked_gen_counter)),
+ until
+ );
+ break;
+ case SYSGENID_FORCE_GEN_UPDATE:
+ if (!checkpoint_restore_ns_capable(current_user_ns()))
+ return -EACCES;
+ min_gen = arg;
+ _bump_generation(min_gen);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static int sysgenid_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct file_data *fdata = file->private_data;
+
+ if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_WRITE) != 0)
+ return -EPERM;
+
+ vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_flags &= ~VM_MAYWRITE;
+ vma->vm_private_data = fdata;
+
+ return vm_insert_page(vma, vma->vm_start,
+ virt_to_page(sysgenid_data.map_buf));
+}
+
+static const struct file_operations fops = {
+ .owner = THIS_MODULE,
+ .mmap = sysgenid_mmap,
+ .open = sysgenid_open,
+ .release = sysgenid_close,
+ .read = sysgenid_read,
+ .write = sysgenid_write,
+ .poll = sysgenid_poll,
+ .unlocked_ioctl = sysgenid_ioctl,
+};
+
+static struct miscdevice sysgenid_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sysgenid",
+ .fops = &fops,
+};
+
+static void sysgenid_uevent(unsigned int gen_counter)
+{
+ char event_string[22];
+ char *envp[] = { event_string, NULL };
+ struct device *dev = sysgenid_misc.this_device;
+
+ if (dev) {
+ sprintf(event_string, "SYSGENID=%u", gen_counter);
+ kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
+ }
+}
+
+static int __init sysgenid_init(void)
+{
+ int ret;
+
+ sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
+ if (!sysgenid_data.map_buf)
+ return -ENOMEM;
+
+ atomic_set(&sysgenid_data.generation_counter, 0);
+ atomic_set(&sysgenid_data.outdated_watchers, 0);
+ init_waitqueue_head(&sysgenid_data.read_waitq);
+ init_waitqueue_head(&sysgenid_data.outdated_waitq);
+ spin_lock_init(&sysgenid_data.lock);
+
+ ret = misc_register(&sysgenid_misc);
+ if (ret < 0) {
+ pr_err("misc_register() failed for sysgenid\n");
+ goto err;
+ }
+
+ return 0;
+
+err:
+ free_pages(sysgenid_data.map_buf, 0);
+ sysgenid_data.map_buf = 0;
+
+ return ret;
+}
+
+static void __exit sysgenid_exit(void)
+{
+ misc_deregister(&sysgenid_misc);
+ free_pages(sysgenid_data.map_buf, 0);
+ sysgenid_data.map_buf = 0;
+}
+
+module_init(sysgenid_init);
+module_exit(sysgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("System Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
new file mode 100644
index 0000000..ba370c8
--- /dev/null
+++ b/include/uapi/linux/sysgenid.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+
+#ifndef _UAPI_LINUX_SYSGENID_H
+#define _UAPI_LINUX_SYSGENID_H
+
+#include <linux/ioctl.h>
+
+#define SYSGENID_IOCTL 0xE4
+#define SYSGENID_WAIT_WATCHERS _IO(SYSGENID_IOCTL, 1)
+#define SYSGENID_FORCE_GEN_UPDATE _IO(SYSGENID_IOCTL, 2)
+
+#ifdef __KERNEL__
+void sysgenid_bump_generation(void);
+#endif /* __KERNEL__ */
+
+#endif /* _UAPI_LINUX_SYSGENID_H */
+
--
2.7.4




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-02-02 12:08:29

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver

On Mon, Feb 01, 2021 at 07:24:53PM +0200, Adrian Catangiu wrote:
> +config SYSGENID
> + tristate "System Generation ID driver"
> + default N

"N" is always the default, no need to list it again :(

2021-02-02 12:09:46

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver

On Mon, Feb 01, 2021 at 07:24:53PM +0200, Adrian Catangiu wrote:
> +EXPORT_SYMBOL(sysgenid_bump_generation);

EXPORT_SYMBOL_GPL()? I have to ask...

2021-02-02 12:13:34

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver

On Mon, Feb 01, 2021 at 07:24:53PM +0200, Adrian Catangiu wrote:
> +static long sysgenid_ioctl(struct file *file,
> + unsigned int cmd, unsigned long arg)

Very odd indentation style, checkpatch.pl didn't catch this?

2021-02-03 00:59:09

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver

Hi--

On 2/1/21 9:24 AM, Adrian Catangiu wrote:
> - Background and problem
>
> The System Generation ID feature is required in virtualized or
> containerized environments by applications that work with local copies
> or caches of world-unique data such as random values, uuids,
> monotonically increasing counters, etc.

... if those applications want to comply with <some MS spec>.

> Such applications can be negatively affected by VM or container
> snapshotting when the VM or container is either cloned or returned to
> an earlier point in time.


> Signed-off-by: Adrian Catangiu <[email protected]>
> ---
> Documentation/misc-devices/sysgenid.rst | 236 ++++++++++++++++
> Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
> MAINTAINERS | 8 +
> drivers/misc/Kconfig | 16 ++
> drivers/misc/Makefile | 1 +
> drivers/misc/sysgenid.c | 307 +++++++++++++++++++++
> include/uapi/linux/sysgenid.h | 17 ++
> 7 files changed, 586 insertions(+)
> create mode 100644 Documentation/misc-devices/sysgenid.rst
> create mode 100644 drivers/misc/sysgenid.c
> create mode 100644 include/uapi/linux/sysgenid.h
>
> diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
> new file mode 100644
> index 0000000..4337ca0
> --- /dev/null
> +++ b/Documentation/misc-devices/sysgenid.rst
> @@ -0,0 +1,236 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========
> +SYSGENID
> +========
> +
> +The System Generation ID feature is required in virtualized or
> +containerized environments by applications that work with local copies
> +or caches of world-unique data such as random values, UUIDs,
> +monotonically increasing counters, etc.
> +Such applications can be negatively affected by VM or container
> +snapshotting when the VM or container is either cloned or returned to
> +an earlier point in time.
> +
> +The System Generation ID is a simple concept meant to alleviate the
> +issue by providing a monotonically increasing counter that changes
> +each time the VM or container is restored from a snapshot.
> +The driver for it lives at ``drivers/misc/sysgenid.c``.
> +
> +The ``sysgenid`` driver exposes a monotonic incremental System
> +Generation u32 counter via a char-dev FS interface accessible through

s/FS/filesystem/

> +``/dev/sysgenid`` that provides sync and async SysGen counter update
> +notifications. It also provides SysGen counter retrieval and
> +confirmation mechanisms.
> +
> +The counter starts from zero when the driver is initialized and
> +monotonically increments every time the system generation changes.
> +
> +The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
> +symbol which can be used by backend drivers to drive system generation
> +changes based on hardware events.
> +System generation changes can also be driven by userspace software
> +through a dedicated driver ioctl.
> +
> +Userspace applications or libraries can (a)synchronously consume the
> +system generation counter through the provided FS interface, to make

s/FS/filesystem/

> +any necessary internal adjustments following a system generation update.
> +
> +Driver FS interface:
> +
> +``open()``:
> + When the device is opened, a copy of the current Sys-Gen-Id (counter)
> + is associated with the open file descriptor. The driver now tracks
> + this file as an independent *watcher*. The driver tracks how many
> + watchers are aware of the latest Sys-Gen-Id counter and how many of
> + them are *outdated*; outdated being those that have lived through
> + a Sys-Gen-Id change but not yet confirmed the new generation counter.
> +
> +``read()``:
> + Read is meant to provide the *new* system generation counter when a
> + generation change takes place. The read operation blocks until the
> + associated counter is no longer up to date, at which point the new
> + counter is provided/returned.
> + Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
> + *new* counter value available. The generation counter is considered
> + *new* for each open file descriptor that hasn't confirmed the new
> + value following a generation change. Therefore, once a generation
> + change takes place, all ``read()`` calls will immediately return the
> + new generation counter and will continue to do so until the
> + new value is confirmed back to the driver through ``write()``.
> + Partial reads are not allowed - read buffer needs to be at least
> + 32 bits in size.
> +
> +``write()``:
> + Write is used to confirm the up-to-date Sys Gen counter back to the
> + driver.
> + Following a VM generation change, all existing watchers are marked
> + as *outdated*. Each file descriptor will maintain the *outdated*
> + status until a ``write()`` confirms the up-to-date counter back to
> + the driver.
> + Partial writes are not allowed - write buffer should be exactly
> + 32 bits in size.
> +
> +``poll()``:
> + Poll is implemented to allow polling for generation counter updates.
> + Such updates result in ``EPOLLIN`` polling status until the new
> + up-to-date counter is confirmed back to the driver through a
> + ``write()``.
> +
> +``ioctl()``:
> + The driver also adds support for waiting on open file descriptors
> + that haven't acknowledged a generation counter update, as well as a
> + mechanism for userspace to *force* a generation update:
> +
> + - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
> + watchers, or if a ``timeout`` argument is provided, until the
> + timeout expires.
> + If the current caller is *outdated* or a generation change happens
> + while waiting (thus making current caller *outdated*), the ioctl
> + returns ``-EINTR`` to signal the user to handle event and retry.
> + - SYSGENID_FORCE_GEN_UPDATE: forces a generation counter increment.
> + It takes a ``minimum-generation`` argument which represents the
> + minimum value the generation counter will be incremented to. For

will be set to. For
It's not so much an increment as it is a "set to this value or higher".

> + example if current generation is ``5`` and ``SYSGENID_FORCE_GEN_UPDATE(8)``
> + is called, the generation counter will increment to ``8``.
> + This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
> + or CAP_SYS_ADMIN capabilities.
> +
> +``mmap()``:
> + The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
> + in size. The first 4 bytes of the mapped page will contain an
> + up-to-date u32 copy of the system generation counter.
> + The mapped memory can be used as a low-latency generation counter
> + probe mechanism in critical sections - see examples.
> +
> +``close()``:
> + Removes the file descriptor as a system generation counter *watcher*.
> +
> +Example application workflows
> +-----------------------------
> +
[snip]


--
~Randy

2021-02-09 14:47:52

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver

On 02/02/2021, 14:05, "Greg KH" <[email protected]> wrote:

On Mon, Feb 01, 2021 at 07:24:53PM +0200, Adrian Catangiu wrote:
> +EXPORT_SYMBOL(sysgenid_bump_generation);

EXPORT_SYMBOL_GPL()? I have to ask...

Good catch! Will update.




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-02-09 14:51:36

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver

On 02/02/2021, 14:09, "Greg KH" <[email protected]> wrote:

On Mon, Feb 01, 2021 at 07:24:53PM +0200, Adrian Catangiu wrote:
> +static long sysgenid_ioctl(struct file *file,
> + unsigned int cmd, unsigned long arg)

Very odd indentation style, checkpatch.pl didn't catch this?

Checkpatch.pl is happy with this, yes.
Will change it to a single tab nonetheless since it does look weird :)




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-02-09 14:56:22

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver

On Mon, Feb 01, 2021 at 07:24:53PM +0200, Adrian Catangiu wrote:
> - Background and problem
>
> The System Generation ID feature is required in virtualized or
> containerized environments by applications that work with local copies
> or caches of world-unique data such as random values, uuids,
> monotonically increasing counters, etc.
> Such applications can be negatively affected by VM or container
> snapshotting when the VM or container is either cloned or returned to
> an earlier point in time.
>
> Furthermore, simply finding out about a system generation change is
> only the starting point of a process to renew internal states of
> possibly multiple applications across the system. This process could
> benefit from a driver that provides an interface through which
> orchestration can be easily done.
>
> - Solution
>
> The System Generation ID is a simple concept meant to alleviate the
> issue by providing a monotonically increasing u32 counter that changes
> each time the VM or container is restored from a snapshot.
>
> The `sysgenid` driver exposes a monotonic incremental System Generation
> u32 counter via a char-dev FS interface that provides sync and async
> SysGen counter updates notifications. It also provides SysGen counter
> retrieval and confirmation mechanisms.
>
> The counter starts from zero when the driver is initialized and
> monotonically increments every time the system generation changes.
>
> The `sysgenid` driver exports the `void sysgenid_bump_generation()`
> symbol which can be used by backend drivers to drive system generation
> changes based on hardware events.
> System generation changes can also be driven by userspace software
> through a dedicated driver ioctl.
>
> Userspace applications or libraries can then (a)synchronously consume
> the system generation counter through the provided FS interface to make
> any necessary internal adjustments following a system generation
> update.
>
> Signed-off-by: Adrian Catangiu <[email protected]>
> ---
> Documentation/misc-devices/sysgenid.rst | 236 ++++++++++++++++
> Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
> MAINTAINERS | 8 +
> drivers/misc/Kconfig | 16 ++
> drivers/misc/Makefile | 1 +
> drivers/misc/sysgenid.c | 307 +++++++++++++++++++++
> include/uapi/linux/sysgenid.h | 17 ++
> 7 files changed, 586 insertions(+)
> create mode 100644 Documentation/misc-devices/sysgenid.rst
> create mode 100644 drivers/misc/sysgenid.c
> create mode 100644 include/uapi/linux/sysgenid.h
>
> diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
> new file mode 100644
> index 0000000..4337ca0
> --- /dev/null
> +++ b/Documentation/misc-devices/sysgenid.rst
> @@ -0,0 +1,236 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========
> +SYSGENID
> +========
> +
> +The System Generation ID feature is required in virtualized or
> +containerized environments by applications that work with local copies
> +or caches of world-unique data such as random values, UUIDs,
> +monotonically increasing counters, etc.
> +Such applications can be negatively affected by VM or container
> +snapshotting when the VM or container is either cloned or returned to
> +an earlier point in time.
> +
> +The System Generation ID is a simple concept meant to alleviate the
> +issue by providing a monotonically increasing counter that changes
> +each time the VM or container is restored from a snapshot.
> +The driver for it lives at ``drivers/misc/sysgenid.c``.
> +
> +The ``sysgenid`` driver exposes a monotonic incremental System
> +Generation u32 counter via a char-dev FS interface accessible through
> +``/dev/sysgenid`` that provides sync and async SysGen counter update
> +notifications. It also provides SysGen counter retrieval and
> +confirmation mechanisms.
> +
> +The counter starts from zero when the driver is initialized and
> +monotonically increments every time the system generation changes.
> +
> +The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
> +symbol which can be used by backend drivers to drive system generation
> +changes based on hardware events.
> +System generation changes can also be driven by userspace software
> +through a dedicated driver ioctl.
> +
> +Userspace applications or libraries can (a)synchronously consume the
> +system generation counter through the provided FS interface, to make
> +any necessary internal adjustments following a system generation update.
> +
> +Driver FS interface:
> +
> +``open()``:
> + When the device is opened, a copy of the current Sys-Gen-Id (counter)
> + is associated with the open file descriptor. The driver now tracks
> + this file as an independent *watcher*. The driver tracks how many
> + watchers are aware of the latest Sys-Gen-Id counter and how many of
> + them are *outdated*; outdated being those that have lived through
> + a Sys-Gen-Id change but not yet confirmed the new generation counter.
> +
> +``read()``:
> + Read is meant to provide the *new* system generation counter when a
> + generation change takes place. The read operation blocks until the
> + associated counter is no longer up to date, at which point the new
> + counter is provided/returned.
> + Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
> + *new* counter value available. The generation counter is considered
> + *new* for each open file descriptor that hasn't confirmed the new
> + value following a generation change. Therefore, once a generation
> + change takes place, all ``read()`` calls will immediately return the
> + new generation counter and will continue to do so until the
> + new value is confirmed back to the driver through ``write()``.
> + Partial reads are not allowed - read buffer needs to be at least
> + 32 bits in size.
> +
> +``write()``:
> + Write is used to confirm the up-to-date Sys Gen counter back to the
> + driver.
> + Following a VM generation change, all existing watchers are marked
> + as *outdated*. Each file descriptor will maintain the *outdated*
> + status until a ``write()`` confirms the up-to-date counter back to
> + the driver.
> + Partial writes are not allowed - write buffer should be exactly
> + 32 bits in size.
> +
> +``poll()``:
> + Poll is implemented to allow polling for generation counter updates.
> + Such updates result in ``EPOLLIN`` polling status until the new
> + up-to-date counter is confirmed back to the driver through a
> + ``write()``.
> +
> +``ioctl()``:
> + The driver also adds support for waiting on open file descriptors
> + that haven't acknowledged a generation counter update, as well as a
> + mechanism for userspace to *force* a generation update:
> +
> + - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
> + watchers, or if a ``timeout`` argument is provided, until the
> + timeout expires.
> + If the current caller is *outdated* or a generation change happens
> + while waiting (thus making current caller *outdated*), the ioctl
> + returns ``-EINTR`` to signal the user to handle event and retry.
> + - SYSGENID_FORCE_GEN_UPDATE: forces a generation counter increment.
> + It takes a ``minimum-generation`` argument which represents the
> + minimum value the generation counter will be incremented to. For
> + example if current generation is ``5`` and ``SYSGENID_FORCE_GEN_UPDATE(8)``
> + is called, the generation counter will increment to ``8``.
> + This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
> + or CAP_SYS_ADMIN capabilities.
> +
> +``mmap()``:
> + The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
> + in size. The first 4 bytes of the mapped page will contain an
> + up-to-date u32 copy of the system generation counter.
> + The mapped memory can be used as a low-latency generation counter
> + probe mechanism in critical sections - see examples.
>

From what I have seen this mmap interface is still fundamentally racy.
How about splitting it to a separate patch?


> +``close()``:
> + Removes the file descriptor as a system generation counter *watcher*.
> +
> +Example application workflows
> +-----------------------------
> +
> +1) Watchdog thread simplified example::
> +
> + void watchdog_thread_handler(int *thread_active)
> + {
> + unsigned int genid;
> + int fd = open("/dev/sysgenid", O_RDWR | O_CLOEXEC, S_IRUSR | S_IWUSR);
> +
> + do {
> + // read new gen ID - blocks until VM generation changes
> + read(fd, &genid, sizeof(genid));
> +
> + // because of VM generation change, we need to rebuild world
> + reseed_app_env();
> +
> + // confirm we're done handling gen ID update
> + write(fd, &genid, sizeof(genid));
> + } while (atomic_read(thread_active));
> +
> + close(fd);
> + }
> +
> +2) ASYNC simplified example::
> +
> + void handle_io_on_sysgenfd(int sysgenfd)
> + {
> + unsigned int genid;
> +
> + // read new gen ID - we need it to confirm we've handled update
> + read(fd, &genid, sizeof(genid));
> +
> + // because of VM generation change, we need to rebuild world
> + reseed_app_env();
> +
> + // confirm we're done handling the gen ID update
> + write(fd, &genid, sizeof(genid));
> + }
> +
> + int main() {
> + int epfd, sysgenfd;
> + struct epoll_event ev;
> +
> + epfd = epoll_create(EPOLL_QUEUE_LEN);
> +
> + sysgenfd = open("/dev/sysgenid",
> + O_RDWR | O_CLOEXEC | O_NONBLOCK,
> + S_IRUSR | S_IWUSR);
> +
> + // register sysgenid for polling
> + ev.events = EPOLLIN;
> + ev.data.fd = sysgenfd;
> + epoll_ctl(epfd, EPOLL_CTL_ADD, sysgenfd, &ev);
> +
> + // register other parts of your app for polling
> + // ...
> +
> + while (1) {
> + // wait for something to do...
> + int nfds = epoll_wait(epfd, events,
> + MAX_EPOLL_EVENTS_PER_RUN,
> + EPOLL_RUN_TIMEOUT);
> + if (nfds < 0) die("Error in epoll_wait!");
> +
> + // for each ready fd
> + for(int i = 0; i < nfds; i++) {
> + int fd = events[i].data.fd;
> +
> + if (fd == sysgenfd)
> + handle_io_on_sysgenfd(sysgenfd);
> + else
> + handle_some_other_part_of_the_app(fd);
> + }
> + }
> +
> + return 0;
> + }
> +
> +3) Mapped memory polling simplified example::
> +
> + /*
> + * app/library function that provides cached secrets
> + */
> + char * safe_cached_secret(app_data_t *app)
> + {
> + char *secret;
> + volatile unsigned int *const genid_ptr = get_sysgenid_mapping(app);
> + again:
> + secret = __cached_secret(app);
> +
> + if (unlikely(*genid_ptr != app->cached_genid)) {
> + app->cached_genid = *genid_ptr;

And this is racy, isn't it? There is not system call to synchronize
with interrupts, so the value this reads could be out of date.


> + barrier();
> +
> + // rebuild world then confirm the genid update (thru write)
> + rebuild_caches(app);
> +
> + ack_sysgenid_update(app);
> +
> + goto again;
> + }
> +
> + return secret;
> + }
> +
> +4) Orchestrator simplified example::
> +
> + /*
> + * orchestrator - manages multiple applications and libraries used by
> + * a service and tries to make sure all sensitive components gracefully
> + * handle VM generation changes.
> + * Following function is called on detection of a VM generation change.
> + */
> + int handle_sysgen_update(int sysgen_fd, unsigned int new_gen_id)
> + {
> + // pause until all components have handled event
> + pause_service();
> +
> + // confirm *this* watcher as up-to-date
> + write(sysgen_fd, &new_gen_id, sizeof(unsigned int));
> +
> + // wait for all *others* for at most 5 seconds.
> + ioctl(sysgen_fd, VMGENID_WAIT_WATCHERS, 5000);
> +
> + // all applications on the system have rebuilt worlds
> + resume_service();
> + }
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index a4c75a2..d74fed4 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -356,6 +356,7 @@ Code Seq# Include File Comments
> 0xDB 00-0F drivers/char/mwave/mwavepub.h
> 0xDD 00-3F ZFCP device driver see drivers/s390/scsi/
> <mailto:[email protected]>
> +0xE4 01-02 uapi/linux/sysgenid.h SysGenID misc driver
> 0xE5 00-3F linux/fuse.h
> 0xEC 00-01 drivers/platform/chrome/cros_ec_dev.h ChromeOS EC driver
> 0xF3 00-3F drivers/usb/misc/sisusbvga/sisusb.h sisfb (in development)
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d3e847f..f1158b0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -17203,6 +17203,14 @@ L: [email protected]
> S: Maintained
> F: drivers/mmc/host/sdhci-pci-dwc-mshc.c
>
> +SYSGENID
> +M: Adrian Catangiu <[email protected]>
> +L: [email protected]
> +S: Supported
> +F: Documentation/misc-devices/sysgenid.rst
> +F: drivers/misc/sysgenid.c
> +F: include/uapi/linux/sysgenid.h
> +
> SYSTEM CONFIGURATION (SYSCON)
> M: Lee Jones <[email protected]>
> M: Arnd Bergmann <[email protected]>
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index fafa8b0..931d716 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -456,6 +456,22 @@ config PVPANIC
> a paravirtualized device provided by QEMU; it lets a virtual machine
> (guest) communicate panic events to the host.
>
> +config SYSGENID
> + tristate "System Generation ID driver"
> + default N
> + help
> + This is a System Generation ID driver which provides a system
> + generation counter. The driver exposes FS ops on /dev/sysgenid
> + through which it can provide information and notifications on system
> + generation changes that happen because of VM or container snapshots
> + or cloning.
> + This enables applications and libraries that store or cache
> + sensitive information, to know that they need to regenerate it
> + after process memory has been exposed to potential copying.
> +
> + To compile this driver as a module, choose M here: the
> + module will be called sysgenid.
> +
> config HISI_HIKEY_USB
> tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
> depends on (OF && GPIOLIB) || COMPILE_TEST
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index d23231e..4b4933d 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -57,3 +57,4 @@ obj-$(CONFIG_HABANA_AI) += habanalabs/
> obj-$(CONFIG_UACCE) += uacce/
> obj-$(CONFIG_XILINX_SDFEC) += xilinx_sdfec.o
> obj-$(CONFIG_HISI_HIKEY_USB) += hisi_hikey_usb.o
> +obj-$(CONFIG_SYSGENID) += sysgenid.o
> diff --git a/drivers/misc/sysgenid.c b/drivers/misc/sysgenid.c
> new file mode 100644
> index 0000000..95b3421
> --- /dev/null
> +++ b/drivers/misc/sysgenid.c
> @@ -0,0 +1,307 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * System Generation ID driver
> + *
> + * Copyright (C) 2020 Amazon. All rights reserved.
> + *
> + * Authors:
> + * Adrian Catangiu <[email protected]>
> + *
> + */
> +#include <linux/acpi.h>
> +#include <linux/kernel.h>
> +#include <linux/minmax.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/random.h>
> +#include <linux/uuid.h>
> +#include <linux/sysgenid.h>
> +
> +struct sysgenid_data {
> + unsigned long map_buf;
> + wait_queue_head_t read_waitq;
> + atomic_t generation_counter;
> +
> + unsigned int watchers;
> + atomic_t outdated_watchers;
> + wait_queue_head_t outdated_waitq;
> + spinlock_t lock;
> +};
> +static struct sysgenid_data sysgenid_data;
> +
> +struct file_data {
> + unsigned int acked_gen_counter;
> +};
> +
> +static void sysgenid_uevent(unsigned int gen_counter);
> +
> +static int equals_gen_counter(unsigned int counter)
> +{
> + return counter == atomic_read(&sysgenid_data.generation_counter);
> +}
> +
> +static void _bump_generation(int min_gen)
> +{
> + unsigned long flags;
> + int counter;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + counter = max(min_gen, 1 + atomic_read(&sysgenid_data.generation_counter));
> + atomic_set(&sysgenid_data.generation_counter, counter);
> + *((int *) sysgenid_data.map_buf) = counter;
> + atomic_set(&sysgenid_data.outdated_watchers, sysgenid_data.watchers);
> +
> + wake_up_interruptible(&sysgenid_data.read_waitq);
> + wake_up_interruptible(&sysgenid_data.outdated_waitq);
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + sysgenid_uevent(counter);
> +}
> +
> +void sysgenid_bump_generation(void)
> +{
> + _bump_generation(0);
> +}
> +EXPORT_SYMBOL(sysgenid_bump_generation);
> +
> +static void put_outdated_watchers(void)
> +{
> + if (atomic_dec_and_test(&sysgenid_data.outdated_watchers))
> + wake_up_interruptible(&sysgenid_data.outdated_waitq);
> +}
> +
> +static int sysgenid_open(struct inode *inode, struct file *file)
> +{
> + struct file_data *fdata = kzalloc(sizeof(struct file_data), GFP_KERNEL);
> + unsigned long flags;
> +
> + if (!fdata)
> + return -ENOMEM;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + fdata->acked_gen_counter = atomic_read(&sysgenid_data.generation_counter);
> + ++sysgenid_data.watchers;
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + file->private_data = fdata;
> +
> + return 0;
> +}
> +
> +static int sysgenid_close(struct inode *inode, struct file *file)
> +{
> + struct file_data *fdata = file->private_data;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + if (!equals_gen_counter(fdata->acked_gen_counter))
> + put_outdated_watchers();
> + --sysgenid_data.watchers;
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + kfree(fdata);
> +
> + return 0;
> +}
> +
> +static ssize_t sysgenid_read(struct file *file, char __user *ubuf,
> + size_t nbytes, loff_t *ppos)
> +{
> + struct file_data *fdata = file->private_data;
> + ssize_t ret;
> + int gen_counter;
> +
> + if (nbytes == 0)
> + return 0;
> + /* disallow partial reads */
> + if (nbytes < sizeof(gen_counter))
> + return -EINVAL;
> +
> + if (equals_gen_counter(fdata->acked_gen_counter)) {
> + if (file->f_flags & O_NONBLOCK)
> + return -EAGAIN;
> + ret = wait_event_interruptible(
> + sysgenid_data.read_waitq,
> + !equals_gen_counter(fdata->acked_gen_counter)
> + );
> + if (ret)
> + return ret;
> + }
> +
> + gen_counter = atomic_read(&sysgenid_data.generation_counter);
> + ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
> + if (ret)
> + return -EFAULT;
> +
> + return sizeof(gen_counter);

interrupt could be in flight at this point. I think you need
to synch with that IRQ otherwise your read will return stale data.


> +}
> +
> +static ssize_t sysgenid_write(struct file *file, const char __user *ubuf,
> + size_t count, loff_t *ppos)
> +{
> + struct file_data *fdata = file->private_data;
> + unsigned int new_acked_gen;
> + unsigned long flags;
> +
> + /* disallow partial writes */
> + if (count != sizeof(new_acked_gen))
> + return -EINVAL;
> + if (copy_from_user(&new_acked_gen, ubuf, count))
> + return -EFAULT;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + /* wrong gen-counter acknowledged */
> + if (!equals_gen_counter(new_acked_gen)) {
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> + return -EINVAL;
> + }
> + if (!equals_gen_counter(fdata->acked_gen_counter)) {
> + fdata->acked_gen_counter = new_acked_gen;
> + put_outdated_watchers();
> + }
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + return (ssize_t)count;
> +}
> +
> +static __poll_t sysgenid_poll(struct file *file, poll_table *wait)
> +{
> + __poll_t mask = 0;
> + struct file_data *fdata = file->private_data;
> +
> + if (!equals_gen_counter(fdata->acked_gen_counter))
> + return EPOLLIN | EPOLLRDNORM;
> +
> + poll_wait(file, &sysgenid_data.read_waitq, wait);
> +
> + if (!equals_gen_counter(fdata->acked_gen_counter))
> + mask = EPOLLIN | EPOLLRDNORM;
> +
> + return mask;
> +}
> +
> +static long sysgenid_ioctl(struct file *file,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct file_data *fdata = file->private_data;
> + unsigned long timeout_ns, min_gen;
> + ktime_t until;
> + int ret = 0;
> +
> + switch (cmd) {
> + case SYSGENID_WAIT_WATCHERS:
> + timeout_ns = arg * NSEC_PER_MSEC;
> + until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
> +
> + ret = wait_event_interruptible_hrtimeout(
> + sysgenid_data.outdated_waitq,
> + (!atomic_read(&sysgenid_data.outdated_watchers) ||
> + !equals_gen_counter(fdata->acked_gen_counter)),
> + until
> + );
> + break;
> + case SYSGENID_FORCE_GEN_UPDATE:
> + if (!checkpoint_restore_ns_capable(current_user_ns()))
> + return -EACCES;
> + min_gen = arg;
> + _bump_generation(min_gen);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static int sysgenid_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct file_data *fdata = file->private_data;
> +
> + if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
> + return -EINVAL;
> +
> + if ((vma->vm_flags & VM_WRITE) != 0)
> + return -EPERM;
> +
> + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> + vma->vm_flags &= ~VM_MAYWRITE;
> + vma->vm_private_data = fdata;
> +
> + return vm_insert_page(vma, vma->vm_start,
> + virt_to_page(sysgenid_data.map_buf));
> +}
> +
> +static const struct file_operations fops = {
> + .owner = THIS_MODULE,
> + .mmap = sysgenid_mmap,
> + .open = sysgenid_open,
> + .release = sysgenid_close,
> + .read = sysgenid_read,
> + .write = sysgenid_write,
> + .poll = sysgenid_poll,
> + .unlocked_ioctl = sysgenid_ioctl,
> +};
> +
> +static struct miscdevice sysgenid_misc = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "sysgenid",
> + .fops = &fops,
> +};
> +
> +static void sysgenid_uevent(unsigned int gen_counter)
> +{
> + char event_string[22];
> + char *envp[] = { event_string, NULL };
> + struct device *dev = sysgenid_misc.this_device;
> +
> + if (dev) {
> + sprintf(event_string, "SYSGENID=%u", gen_counter);
> + kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
> + }
> +}
> +
> +static int __init sysgenid_init(void)
> +{
> + int ret;
> +
> + sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
> + if (!sysgenid_data.map_buf)
> + return -ENOMEM;
> +
> + atomic_set(&sysgenid_data.generation_counter, 0);
> + atomic_set(&sysgenid_data.outdated_watchers, 0);
> + init_waitqueue_head(&sysgenid_data.read_waitq);
> + init_waitqueue_head(&sysgenid_data.outdated_waitq);
> + spin_lock_init(&sysgenid_data.lock);
> +
> + ret = misc_register(&sysgenid_misc);
> + if (ret < 0) {
> + pr_err("misc_register() failed for sysgenid\n");
> + goto err;
> + }
> +
> + return 0;
> +
> +err:
> + free_pages(sysgenid_data.map_buf, 0);
> + sysgenid_data.map_buf = 0;
> +
> + return ret;
> +}
> +
> +static void __exit sysgenid_exit(void)
> +{
> + misc_deregister(&sysgenid_misc);
> + free_pages(sysgenid_data.map_buf, 0);
> + sysgenid_data.map_buf = 0;
> +}
> +
> +module_init(sysgenid_init);
> +module_exit(sysgenid_exit);
> +
> +MODULE_AUTHOR("Adrian Catangiu");
> +MODULE_DESCRIPTION("System Generation ID");
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION("0.1");
> diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
> new file mode 100644
> index 0000000..ba370c8
> --- /dev/null
> +++ b/include/uapi/linux/sysgenid.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_LINUX_SYSGENID_H
> +#define _UAPI_LINUX_SYSGENID_H
> +
> +#include <linux/ioctl.h>
> +
> +#define SYSGENID_IOCTL 0xE4
> +#define SYSGENID_WAIT_WATCHERS _IO(SYSGENID_IOCTL, 1)
> +#define SYSGENID_FORCE_GEN_UPDATE _IO(SYSGENID_IOCTL, 2)
> +
> +#ifdef __KERNEL__
> +void sysgenid_bump_generation(void);
> +#endif /* __KERNEL__ */
> +
> +#endif /* _UAPI_LINUX_SYSGENID_H */
> +
> --
> 2.7.4
>
>
>
>
> Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-02-09 16:49:21

by Catangiu, Adrian Costin

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] drivers/misc: sysgenid: add system generation id driver

On 09/02/2021, 16:53, "Michael S. Tsirkin" <[email protected]> wrote:

On Mon, Feb 01, 2021 at 07:24:53PM +0200, Adrian Catangiu wrote:
> - Background and problem
>
> The System Generation ID feature is required in virtualized or
> containerized environments by applications that work with local copies
> or caches of world-unique data such as random values, uuids,
> monotonically increasing counters, etc.
> Such applications can be negatively affected by VM or container
> snapshotting when the VM or container is either cloned or returned to
> an earlier point in time.
>
> Furthermore, simply finding out about a system generation change is
> only the starting point of a process to renew internal states of
> possibly multiple applications across the system. This process could
> benefit from a driver that provides an interface through which
> orchestration can be easily done.
>
> - Solution
>
> The System Generation ID is a simple concept meant to alleviate the
> issue by providing a monotonically increasing u32 counter that changes
> each time the VM or container is restored from a snapshot.
>
> The `sysgenid` driver exposes a monotonic incremental System Generation
> u32 counter via a char-dev FS interface that provides sync and async
> SysGen counter updates notifications. It also provides SysGen counter
> retrieval and confirmation mechanisms.
>
> The counter starts from zero when the driver is initialized and
> monotonically increments every time the system generation changes.
>
> The `sysgenid` driver exports the `void sysgenid_bump_generation()`
> symbol which can be used by backend drivers to drive system generation
> changes based on hardware events.
> System generation changes can also be driven by userspace software
> through a dedicated driver ioctl.
>
> Userspace applications or libraries can then (a)synchronously consume
> the system generation counter through the provided FS interface to make
> any necessary internal adjustments following a system generation
> update.
>
> Signed-off-by: Adrian Catangiu <[email protected]>
> ---
> Documentation/misc-devices/sysgenid.rst | 236 ++++++++++++++++
> Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
> MAINTAINERS | 8 +
> drivers/misc/Kconfig | 16 ++
> drivers/misc/Makefile | 1 +
> drivers/misc/sysgenid.c | 307 +++++++++++++++++++++
> include/uapi/linux/sysgenid.h | 17 ++
> 7 files changed, 586 insertions(+)
> create mode 100644 Documentation/misc-devices/sysgenid.rst
> create mode 100644 drivers/misc/sysgenid.c
> create mode 100644 include/uapi/linux/sysgenid.h
>
> diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
> new file mode 100644
> index 0000000..4337ca0
> --- /dev/null
> +++ b/Documentation/misc-devices/sysgenid.rst
> @@ -0,0 +1,236 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========
> +SYSGENID
> +========
> +
> +The System Generation ID feature is required in virtualized or
> +containerized environments by applications that work with local copies
> +or caches of world-unique data such as random values, UUIDs,
> +monotonically increasing counters, etc.
> +Such applications can be negatively affected by VM or container
> +snapshotting when the VM or container is either cloned or returned to
> +an earlier point in time.
> +
> +The System Generation ID is a simple concept meant to alleviate the
> +issue by providing a monotonically increasing counter that changes
> +each time the VM or container is restored from a snapshot.
> +The driver for it lives at ``drivers/misc/sysgenid.c``.
> +
> +The ``sysgenid`` driver exposes a monotonic incremental System
> +Generation u32 counter via a char-dev FS interface accessible through
> +``/dev/sysgenid`` that provides sync and async SysGen counter update
> +notifications. It also provides SysGen counter retrieval and
> +confirmation mechanisms.
> +
> +The counter starts from zero when the driver is initialized and
> +monotonically increments every time the system generation changes.
> +
> +The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
> +symbol which can be used by backend drivers to drive system generation
> +changes based on hardware events.
> +System generation changes can also be driven by userspace software
> +through a dedicated driver ioctl.
> +
> +Userspace applications or libraries can (a)synchronously consume the
> +system generation counter through the provided FS interface, to make
> +any necessary internal adjustments following a system generation update.
> +
> +Driver FS interface:
> +
> +``open()``:
> + When the device is opened, a copy of the current Sys-Gen-Id (counter)
> + is associated with the open file descriptor. The driver now tracks
> + this file as an independent *watcher*. The driver tracks how many
> + watchers are aware of the latest Sys-Gen-Id counter and how many of
> + them are *outdated*; outdated being those that have lived through
> + a Sys-Gen-Id change but not yet confirmed the new generation counter.
> +
> +``read()``:
> + Read is meant to provide the *new* system generation counter when a
> + generation change takes place. The read operation blocks until the
> + associated counter is no longer up to date, at which point the new
> + counter is provided/returned.
> + Nonblocking ``read()`` uses ``EAGAIN`` to signal that there is no
> + *new* counter value available. The generation counter is considered
> + *new* for each open file descriptor that hasn't confirmed the new
> + value following a generation change. Therefore, once a generation
> + change takes place, all ``read()`` calls will immediately return the
> + new generation counter and will continue to do so until the
> + new value is confirmed back to the driver through ``write()``.
> + Partial reads are not allowed - read buffer needs to be at least
> + 32 bits in size.
> +
> +``write()``:
> + Write is used to confirm the up-to-date Sys Gen counter back to the
> + driver.
> + Following a VM generation change, all existing watchers are marked
> + as *outdated*. Each file descriptor will maintain the *outdated*
> + status until a ``write()`` confirms the up-to-date counter back to
> + the driver.
> + Partial writes are not allowed - write buffer should be exactly
> + 32 bits in size.
> +
> +``poll()``:
> + Poll is implemented to allow polling for generation counter updates.
> + Such updates result in ``EPOLLIN`` polling status until the new
> + up-to-date counter is confirmed back to the driver through a
> + ``write()``.
> +
> +``ioctl()``:
> + The driver also adds support for waiting on open file descriptors
> + that haven't acknowledged a generation counter update, as well as a
> + mechanism for userspace to *force* a generation update:
> +
> + - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
> + watchers, or if a ``timeout`` argument is provided, until the
> + timeout expires.
> + If the current caller is *outdated* or a generation change happens
> + while waiting (thus making current caller *outdated*), the ioctl
> + returns ``-EINTR`` to signal the user to handle event and retry.
> + - SYSGENID_FORCE_GEN_UPDATE: forces a generation counter increment.
> + It takes a ``minimum-generation`` argument which represents the
> + minimum value the generation counter will be incremented to. For
> + example if current generation is ``5`` and ``SYSGENID_FORCE_GEN_UPDATE(8)``
> + is called, the generation counter will increment to ``8``.
> + This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
> + or CAP_SYS_ADMIN capabilities.
> +
> +``mmap()``:
> + The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
> + in size. The first 4 bytes of the mapped page will contain an
> + up-to-date u32 copy of the system generation counter.
> + The mapped memory can be used as a low-latency generation counter
> + probe mechanism in critical sections - see examples.
>

From what I have seen this mmap interface is still fundamentally racy.
How about splitting it to a separate patch?

Hi all!
Michael, you're right that this is fundamentally racy on its own. Even if we eliminate the
IRQ race by mmaping HW directly, it is still fundamentally racy when looking one or more
layers above.
The only way to eliminate races is having a system-level overlord entity that quiesces the
system before a snapshot then, on snapshot-load/resume, makes sure all software adjusts
before unquiescing and allowing safe workload continuation.
Without this overlord synchronization there will be races in the callers of the sysgenid API
even if the API syscall itself is race-free.
Let's take an OpenSSL session token for example. OpenSSL is a prime target for the mmap()
interface so it can be fast (checks against memory - read/write syscalls in the hotpath will
likely not work). Even if the library code is made 100% race-free, the session token still lives
in the caller of the library and/or in the OS network stack, etc.

We even have the same problem with the transactional read/write API. It too, is racy if
snapshots can happen at any arbitrary time during an active workload. A library or application
can go through the read/write transaction successfully (race-free) and generate a
"safe/unique" secret, then get snapshotted while that secret is in-flight with no way to catch
it unless we'd check generation at every layer, every step of the way (which looks impossible
to me).

Bottom line is we don't have a way to track all of these in-flight secrets and dynamically scrub
them from existence with snapshot events happening arbitrarily.

But, we can do the next best thing. Simplify the problem space, drop the arbitrary part and
take some control over the snapshot flow. Use an overlord entity that quiesces the system
before snapshot, and un-quiesces post-snapshot-resume _only after_ system components
have readjusted to new environment/generation.
Add sysgenid framework that standardizes an API for system software to find out about
needing to readjust and at the same time provide a mechanism for the overlord entity to wait
for everyone to be done, the system to have readjusted, so it can un-quiesce.

In this context, this is how I see the utility and safe-usage of the sysgenid API:

- The poll(), read(), write() operations are used in event loops so software is notified (while
in quiesced state) that generation has changed and they need to do their specific internal
adjustments. If done transactionally, they can ACK that they are ready to thrive in the new
generation.
- The WAIT_WATCHERS ioctl() is used by the overlord to know when system is ready for
un-quiesce.
- SysGenID needs a (currently missing) IOCTL or some other type of knob for users to
enable/disable tracking, where tracking means sysgenid driver keeping track of that FD as
a *watcher* - keeping track if it is up-to-date or outdated.
- There is a class of software (like PRNG libraries or OpenSSL since we talked about it earlier)
that doesn't fit the event-loop model and also has strict latency requirements. These can
take advantage of the mmap() interface and lazily adjust on-demand whenever they are
called _after un-quiesce_.
For a well-designed service stack, these libraries should not be called while system is
quiesced. When workload is resumed by the overlord, on the first call into these libs, they
will JIT readjust race-free.
Users of mmap() interface would disable watcher tracking since they lazily on-demand
readjust and waiting on them doesn't make sense.

Now I know that there is a lot of subtlety involved here, but this is a hard problem without
an easy simple solution. If you and people here agree on this approach I will document all of
this in great detail with big fat warnings on how to safely use it (in driver documentation).

I'm sorry for the massive reply, but the proposed mechanism and its details are a direct
result of the valuable feedback coming from this upstream review, and I want to make sure
we align on all details so we get it right!

Let me know what you all think!

> +``close()``:
> + Removes the file descriptor as a system generation counter *watcher*.
> +
> +Example application workflows
> +-----------------------------
> +
> +1) Watchdog thread simplified example::
> +
> + void watchdog_thread_handler(int *thread_active)
> + {
> + unsigned int genid;
> + int fd = open("/dev/sysgenid", O_RDWR | O_CLOEXEC, S_IRUSR | S_IWUSR);
> +
> + do {
> + // read new gen ID - blocks until VM generation changes
> + read(fd, &genid, sizeof(genid));
> +
> + // because of VM generation change, we need to rebuild world
> + reseed_app_env();
> +
> + // confirm we're done handling gen ID update
> + write(fd, &genid, sizeof(genid));
> + } while (atomic_read(thread_active));
> +
> + close(fd);
> + }
> +
> +2) ASYNC simplified example::
> +
> + void handle_io_on_sysgenfd(int sysgenfd)
> + {
> + unsigned int genid;
> +
> + // read new gen ID - we need it to confirm we've handled update
> + read(fd, &genid, sizeof(genid));
> +
> + // because of VM generation change, we need to rebuild world
> + reseed_app_env();
> +
> + // confirm we're done handling the gen ID update
> + write(fd, &genid, sizeof(genid));
> + }
> +
> + int main() {
> + int epfd, sysgenfd;
> + struct epoll_event ev;
> +
> + epfd = epoll_create(EPOLL_QUEUE_LEN);
> +
> + sysgenfd = open("/dev/sysgenid",
> + O_RDWR | O_CLOEXEC | O_NONBLOCK,
> + S_IRUSR | S_IWUSR);
> +
> + // register sysgenid for polling
> + ev.events = EPOLLIN;
> + ev.data.fd = sysgenfd;
> + epoll_ctl(epfd, EPOLL_CTL_ADD, sysgenfd, &ev);
> +
> + // register other parts of your app for polling
> + // ...
> +
> + while (1) {
> + // wait for something to do...
> + int nfds = epoll_wait(epfd, events,
> + MAX_EPOLL_EVENTS_PER_RUN,
> + EPOLL_RUN_TIMEOUT);
> + if (nfds < 0) die("Error in epoll_wait!");
> +
> + // for each ready fd
> + for(int i = 0; i < nfds; i++) {
> + int fd = events[i].data.fd;
> +
> + if (fd == sysgenfd)
> + handle_io_on_sysgenfd(sysgenfd);
> + else
> + handle_some_other_part_of_the_app(fd);
> + }
> + }
> +
> + return 0;
> + }
> +
> +3) Mapped memory polling simplified example::
> +
> + /*
> + * app/library function that provides cached secrets
> + */
> + char * safe_cached_secret(app_data_t *app)
> + {
> + char *secret;
> + volatile unsigned int *const genid_ptr = get_sysgenid_mapping(app);
> + again:
> + secret = __cached_secret(app);
> +
> + if (unlikely(*genid_ptr != app->cached_genid)) {
> + app->cached_genid = *genid_ptr;

And this is racy, isn't it? There is not system call to synchronize
with interrupts, so the value this reads could be out of date.

Yes, without system quiescing and controlled snapshot moments, all of these are racy.
I will rewrite documentation to capture all details and pitfalls and how to use sysgenid
safely.

> + barrier();
> +
> + // rebuild world then confirm the genid update (thru write)
> + rebuild_caches(app);
> +
> + ack_sysgenid_update(app);
> +
> + goto again;
> + }
> +
> + return secret;
> + }
> +
> +4) Orchestrator simplified example::
> +
> + /*
> + * orchestrator - manages multiple applications and libraries used by
> + * a service and tries to make sure all sensitive components gracefully
> + * handle VM generation changes.
> + * Following function is called on detection of a VM generation change.
> + */
> + int handle_sysgen_update(int sysgen_fd, unsigned int new_gen_id)
> + {
> + // pause until all components have handled event
> + pause_service();
> +
> + // confirm *this* watcher as up-to-date
> + write(sysgen_fd, &new_gen_id, sizeof(unsigned int));
> +
> + // wait for all *others* for at most 5 seconds.
> + ioctl(sysgen_fd, VMGENID_WAIT_WATCHERS, 5000);
> +
> + // all applications on the system have rebuilt worlds
> + resume_service();
> + }
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index a4c75a2..d74fed4 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -356,6 +356,7 @@ Code Seq# Include File Comments
> 0xDB 00-0F drivers/char/mwave/mwavepub.h
> 0xDD 00-3F ZFCP device driver see drivers/s390/scsi/
> <mailto:[email protected]>
> +0xE4 01-02 uapi/linux/sysgenid.h SysGenID misc driver
> 0xE5 00-3F linux/fuse.h
> 0xEC 00-01 drivers/platform/chrome/cros_ec_dev.h ChromeOS EC driver
> 0xF3 00-3F drivers/usb/misc/sisusbvga/sisusb.h sisfb (in development)
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d3e847f..f1158b0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -17203,6 +17203,14 @@ L: [email protected]
> S: Maintained
> F: drivers/mmc/host/sdhci-pci-dwc-mshc.c
>
> +SYSGENID
> +M: Adrian Catangiu <[email protected]>
> +L: [email protected]
> +S: Supported
> +F: Documentation/misc-devices/sysgenid.rst
> +F: drivers/misc/sysgenid.c
> +F: include/uapi/linux/sysgenid.h
> +
> SYSTEM CONFIGURATION (SYSCON)
> M: Lee Jones <[email protected]>
> M: Arnd Bergmann <[email protected]>
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index fafa8b0..931d716 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -456,6 +456,22 @@ config PVPANIC
> a paravirtualized device provided by QEMU; it lets a virtual machine
> (guest) communicate panic events to the host.
>
> +config SYSGENID
> + tristate "System Generation ID driver"
> + default N
> + help
> + This is a System Generation ID driver which provides a system
> + generation counter. The driver exposes FS ops on /dev/sysgenid
> + through which it can provide information and notifications on system
> + generation changes that happen because of VM or container snapshots
> + or cloning.
> + This enables applications and libraries that store or cache
> + sensitive information, to know that they need to regenerate it
> + after process memory has been exposed to potential copying.
> +
> + To compile this driver as a module, choose M here: the
> + module will be called sysgenid.
> +
> config HISI_HIKEY_USB
> tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
> depends on (OF && GPIOLIB) || COMPILE_TEST
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index d23231e..4b4933d 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -57,3 +57,4 @@ obj-$(CONFIG_HABANA_AI) += habanalabs/
> obj-$(CONFIG_UACCE) += uacce/
> obj-$(CONFIG_XILINX_SDFEC) += xilinx_sdfec.o
> obj-$(CONFIG_HISI_HIKEY_USB) += hisi_hikey_usb.o
> +obj-$(CONFIG_SYSGENID) += sysgenid.o
> diff --git a/drivers/misc/sysgenid.c b/drivers/misc/sysgenid.c
> new file mode 100644
> index 0000000..95b3421
> --- /dev/null
> +++ b/drivers/misc/sysgenid.c
> @@ -0,0 +1,307 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * System Generation ID driver
> + *
> + * Copyright (C) 2020 Amazon. All rights reserved.
> + *
> + * Authors:
> + * Adrian Catangiu <[email protected]>
> + *
> + */
> +#include <linux/acpi.h>
> +#include <linux/kernel.h>
> +#include <linux/minmax.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/random.h>
> +#include <linux/uuid.h>
> +#include <linux/sysgenid.h>
> +
> +struct sysgenid_data {
> + unsigned long map_buf;
> + wait_queue_head_t read_waitq;
> + atomic_t generation_counter;
> +
> + unsigned int watchers;
> + atomic_t outdated_watchers;
> + wait_queue_head_t outdated_waitq;
> + spinlock_t lock;
> +};
> +static struct sysgenid_data sysgenid_data;
> +
> +struct file_data {
> + unsigned int acked_gen_counter;
> +};
> +
> +static void sysgenid_uevent(unsigned int gen_counter);
> +
> +static int equals_gen_counter(unsigned int counter)
> +{
> + return counter == atomic_read(&sysgenid_data.generation_counter);
> +}
> +
> +static void _bump_generation(int min_gen)
> +{
> + unsigned long flags;
> + int counter;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + counter = max(min_gen, 1 + atomic_read(&sysgenid_data.generation_counter));
> + atomic_set(&sysgenid_data.generation_counter, counter);
> + *((int *) sysgenid_data.map_buf) = counter;
> + atomic_set(&sysgenid_data.outdated_watchers, sysgenid_data.watchers);
> +
> + wake_up_interruptible(&sysgenid_data.read_waitq);
> + wake_up_interruptible(&sysgenid_data.outdated_waitq);
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + sysgenid_uevent(counter);
> +}
> +
> +void sysgenid_bump_generation(void)
> +{
> + _bump_generation(0);
> +}
> +EXPORT_SYMBOL(sysgenid_bump_generation);
> +
> +static void put_outdated_watchers(void)
> +{
> + if (atomic_dec_and_test(&sysgenid_data.outdated_watchers))
> + wake_up_interruptible(&sysgenid_data.outdated_waitq);
> +}
> +
> +static int sysgenid_open(struct inode *inode, struct file *file)
> +{
> + struct file_data *fdata = kzalloc(sizeof(struct file_data), GFP_KERNEL);
> + unsigned long flags;
> +
> + if (!fdata)
> + return -ENOMEM;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + fdata->acked_gen_counter = atomic_read(&sysgenid_data.generation_counter);
> + ++sysgenid_data.watchers;
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + file->private_data = fdata;
> +
> + return 0;
> +}
> +
> +static int sysgenid_close(struct inode *inode, struct file *file)
> +{
> + struct file_data *fdata = file->private_data;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + if (!equals_gen_counter(fdata->acked_gen_counter))
> + put_outdated_watchers();
> + --sysgenid_data.watchers;
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + kfree(fdata);
> +
> + return 0;
> +}
> +
> +static ssize_t sysgenid_read(struct file *file, char __user *ubuf,
> + size_t nbytes, loff_t *ppos)
> +{
> + struct file_data *fdata = file->private_data;
> + ssize_t ret;
> + int gen_counter;
> +
> + if (nbytes == 0)
> + return 0;
> + /* disallow partial reads */
> + if (nbytes < sizeof(gen_counter))
> + return -EINVAL;
> +
> + if (equals_gen_counter(fdata->acked_gen_counter)) {
> + if (file->f_flags & O_NONBLOCK)
> + return -EAGAIN;
> + ret = wait_event_interruptible(
> + sysgenid_data.read_waitq,
> + !equals_gen_counter(fdata->acked_gen_counter)
> + );
> + if (ret)
> + return ret;
> + }
> +
> + gen_counter = atomic_read(&sysgenid_data.generation_counter);
> + ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
> + if (ret)
> + return -EFAULT;
> +
> + return sizeof(gen_counter);

interrupt could be in flight at this point. I think you need
to synch with that IRQ otherwise your read will return stale data.

Since the user also needs to ACK the right gen counter, the race is not here
(stale generation will fail ack). But you're on-point, this race does exist but
during write().
Interrupt could be in flight when user ACKs generation.

Alas I don't think we can safely support arbitrarily timed snapshots.
Safely used snapshots should each follow a full
quiesce -> take-snapshot -> load-snapshot -> full-system-adjust -> un-quiesce
cycle before repeating.
In such a scenario, there is no race here.

> +}
> +
> +static ssize_t sysgenid_write(struct file *file, const char __user *ubuf,
> + size_t count, loff_t *ppos)
> +{
> + struct file_data *fdata = file->private_data;
> + unsigned int new_acked_gen;
> + unsigned long flags;
> +
> + /* disallow partial writes */
> + if (count != sizeof(new_acked_gen))
> + return -EINVAL;
> + if (copy_from_user(&new_acked_gen, ubuf, count))
> + return -EFAULT;
> +
> + spin_lock_irqsave(&sysgenid_data.lock, flags);
> + /* wrong gen-counter acknowledged */
> + if (!equals_gen_counter(new_acked_gen)) {
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> + return -EINVAL;
> + }
> + if (!equals_gen_counter(fdata->acked_gen_counter)) {
> + fdata->acked_gen_counter = new_acked_gen;
> + put_outdated_watchers();
> + }
> + spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> + return (ssize_t)count;
> +}
> +
> +static __poll_t sysgenid_poll(struct file *file, poll_table *wait)
> +{
> + __poll_t mask = 0;
> + struct file_data *fdata = file->private_data;
> +
> + if (!equals_gen_counter(fdata->acked_gen_counter))
> + return EPOLLIN | EPOLLRDNORM;
> +
> + poll_wait(file, &sysgenid_data.read_waitq, wait);
> +
> + if (!equals_gen_counter(fdata->acked_gen_counter))
> + mask = EPOLLIN | EPOLLRDNORM;
> +
> + return mask;
> +}
> +
> +static long sysgenid_ioctl(struct file *file,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct file_data *fdata = file->private_data;
> + unsigned long timeout_ns, min_gen;
> + ktime_t until;
> + int ret = 0;
> +
> + switch (cmd) {
> + case SYSGENID_WAIT_WATCHERS:
> + timeout_ns = arg * NSEC_PER_MSEC;
> + until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
> +
> + ret = wait_event_interruptible_hrtimeout(
> + sysgenid_data.outdated_waitq,
> + (!atomic_read(&sysgenid_data.outdated_watchers) ||
> + !equals_gen_counter(fdata->acked_gen_counter)),
> + until
> + );
> + break;
> + case SYSGENID_FORCE_GEN_UPDATE:
> + if (!checkpoint_restore_ns_capable(current_user_ns()))
> + return -EACCES;
> + min_gen = arg;
> + _bump_generation(min_gen);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static int sysgenid_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct file_data *fdata = file->private_data;
> +
> + if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
> + return -EINVAL;
> +
> + if ((vma->vm_flags & VM_WRITE) != 0)
> + return -EPERM;
> +
> + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> + vma->vm_flags &= ~VM_MAYWRITE;
> + vma->vm_private_data = fdata;
> +
> + return vm_insert_page(vma, vma->vm_start,
> + virt_to_page(sysgenid_data.map_buf));
> +}
> +
> +static const struct file_operations fops = {
> + .owner = THIS_MODULE,
> + .mmap = sysgenid_mmap,
> + .open = sysgenid_open,
> + .release = sysgenid_close,
> + .read = sysgenid_read,
> + .write = sysgenid_write,
> + .poll = sysgenid_poll,
> + .unlocked_ioctl = sysgenid_ioctl,
> +};
> +
> +static struct miscdevice sysgenid_misc = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "sysgenid",
> + .fops = &fops,
> +};
> +
> +static void sysgenid_uevent(unsigned int gen_counter)
> +{
> + char event_string[22];
> + char *envp[] = { event_string, NULL };
> + struct device *dev = sysgenid_misc.this_device;
> +
> + if (dev) {
> + sprintf(event_string, "SYSGENID=%u", gen_counter);
> + kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
> + }
> +}
> +
> +static int __init sysgenid_init(void)
> +{
> + int ret;
> +
> + sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
> + if (!sysgenid_data.map_buf)
> + return -ENOMEM;
> +
> + atomic_set(&sysgenid_data.generation_counter, 0);
> + atomic_set(&sysgenid_data.outdated_watchers, 0);
> + init_waitqueue_head(&sysgenid_data.read_waitq);
> + init_waitqueue_head(&sysgenid_data.outdated_waitq);
> + spin_lock_init(&sysgenid_data.lock);
> +
> + ret = misc_register(&sysgenid_misc);
> + if (ret < 0) {
> + pr_err("misc_register() failed for sysgenid\n");
> + goto err;
> + }
> +
> + return 0;
> +
> +err:
> + free_pages(sysgenid_data.map_buf, 0);
> + sysgenid_data.map_buf = 0;
> +
> + return ret;
> +}
> +
> +static void __exit sysgenid_exit(void)
> +{
> + misc_deregister(&sysgenid_misc);
> + free_pages(sysgenid_data.map_buf, 0);
> + sysgenid_data.map_buf = 0;
> +}
> +
> +module_init(sysgenid_init);
> +module_exit(sysgenid_exit);
> +
> +MODULE_AUTHOR("Adrian Catangiu");
> +MODULE_DESCRIPTION("System Generation ID");
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION("0.1");
> diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
> new file mode 100644
> index 0000000..ba370c8
> --- /dev/null
> +++ b/include/uapi/linux/sysgenid.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_LINUX_SYSGENID_H
> +#define _UAPI_LINUX_SYSGENID_H
> +
> +#include <linux/ioctl.h>
> +
> +#define SYSGENID_IOCTL 0xE4
> +#define SYSGENID_WAIT_WATCHERS _IO(SYSGENID_IOCTL, 1)
> +#define SYSGENID_FORCE_GEN_UPDATE _IO(SYSGENID_IOCTL, 2)
> +
> +#ifdef __KERNEL__
> +void sysgenid_bump_generation(void);
> +#endif /* __KERNEL__ */
> +
> +#endif /* _UAPI_LINUX_SYSGENID_H */
> +
> --
> 2.7.4
>
>
>
>
> Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.





Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

2021-02-10 05:22:10

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] drivers/virt: vmgenid: add vm generation id driver

> +The ``vmgenid`` driver uses ACPI events to be notified by hardware
> +changes to the 128-bit Vm Gen Id UUID.

That's ok, problem is ACPI event processing is asynchronous.
What we need is thus to flush out ACPI events whenever userspace
does a read, otherwise the value it gets will be stale.

--
MST