2024-02-05 12:02:45

by Gowans, James

Subject: [RFC 00/18] Pkernfs: Support persistence for live update

This RFC is to solicit feedback on the approach of implementing support for live
update via an in-memory filesystem responsible for storing all live update state
as files in the filesystem.

Hypervisor live update is a mechanism to support updating a hypervisor via kexec
in a way that has limited impact on running virtual machines. This is done by
pausing/serialising running VMs, kexec-ing into a new kernel, starting new VMM
processes and then deserialising/resuming the VMs so that they continue running
from where they were. Virtual machines can have PCI devices passed through and
in order to support live update it’s necessary to persist the IOMMU page tables
so that the devices can continue to do DMA to guest RAM during kexec.

This RFC is a follow-on from a discussion held during LPC 2023 KVM MC
which explored ways in which the live update problem could be tackled;
this was one of them:
https://lpc.events/event/17/contributions/1485/

The approach sketched out in this RFC introduces a new in-memory filesystem,
pkernfs. Pkernfs takes ownership of system RAM which is carved out of the
normal MM allocator at boot and donated to pkernfs, keeping it separate from
the Linux memory management system. Files are created in pkernfs for the state
that needs to be preserved and re-hydrated after kexec:

* Guest memory: to be able to restore the VM its memory must be
preserved. This is achieved by using a regular file in pkernfs for guest RAM.
As this guest RAM is not part of the normal Linux core mm allocator and
has no struct pages, it can be removed from the direct map, which
improves the security posture for guest RAM. Similar to memfd_secret.

* IOMMU root page tables: for the IOMMU to have any ability to do DMA
during kexec it needs root page tables to look up per-domain page
tables. IOMMU root page tables are stored in a special path in pkernfs:
iommu/root-pgtables. The Intel IOMMU driver is modified to hook into
pkernfs to get the chunk of memory that it can use to allocate root
pgtables.

* IOMMU domain page tables: in order for VM-initiated DMA operations to
continue running while kexec is happening, the IOVA to PA address
translations for persisted devices need to continue to work. Similar to
the root pgtables, the per-domain page tables for persisted devices are
allocated from a pkernfs file so that they are also persisted across
kexec. This is done by using pkernfs files for IOMMU domain page
tables. Not all devices are persistent, so VFIO is updated to support
defining persistent page tables on passed-through devices.

* Updates to IOMMU and PCI are needed to make device handover across
kexec work properly. Although not fully complete, some of the changes
needed to avoid device resetting and re-probing are sketched
in this RFC.

Guest RAM and IOMMU state are just the first two things needed for live update.
Pkernfs opens the door for other kernel state to also be persisted as new
files, whether to improve kexec or to add more capabilities to live update.

The main aspect we’re looking for feedback/opinions on here is the concept of
putting all persistent state in a single filesystem: combining guest RAM and
IOMMU pgtables in one store. Also, the question of a hard separation between
persistent memory and ephemeral memory, compared to allowing arbitrary pages to
be persisted. Pkernfs does it via a hard separation defined at boot time, other
approaches could make the carving out of persistent pages dynamic.

Sign-offs are intentionally omitted to make it clear that this is a
concept sketch RFC and not intended for merging.

On CC are folks who have sent RFCs around this problem space before, as
well as filesystem, kexec, IOMMU, MM and KVM lists and maintainers.

== Alternatives ==

There have been other RFCs which cover some aspect of the live update problem
space. So far, all public approaches with KVM neglected device assignment which
introduces a new dimension of problems. Prior art in this space includes:

1) Kexec Hand Over (KHO) [0]: This is a generic mechanism to pass kernel state
across kexec. It also supports specifying persisted memory pages which could be
used to carve out IOMMU pgtable pages from the new kernel’s buddy allocator.

2) PKRAM [1]: Tmpfs-style filesystem which dynamically allocates memory which can
be used for guest RAM and is preserved across kexec by passing a pointer to the
root page.

3) DMEMFS [2]: Similar to pkernfs, DMEMFS is a filesystem on top of a reserved
chunk of memory specified via kernel cmdline parameter. It is not persistent but
aims to remove the need for struct page overhead.

4) Kernel memory pools [3, 4]: These provide a mechanism for kernel modules/drivers
to allocate persistent memory, and restore that memory after kexec. They do
not attempt to provide the ability to store userspace-accessible state or have a
filesystem interface.

== How to use ==

Use the memmap and pkernfs cmdline args to carve memory out of system RAM and
donate it to pkernfs. For example to carve out 1 GiB of RAM starting at physical
offset 1 GiB:
memmap=1G%1G nopat pkernfs=1G!1G

Mount pkernfs somewhere, for example:
mount -t pkernfs pkernfs /mnt/pkernfs

Allocate a file for guest RAM:
touch /mnt/pkernfs/guest-ram
truncate -s 100M /mnt/pkernfs/guest-ram

Add QEMU cmdline option to use this as guest RAM:
-object memory-backend-file,id=pc.ram,size=100M,mem-path=/mnt/pkernfs/guest-ram,share=yes
-M q35,memory-backend=pc.ram

Allocate a file for IOMMU domain page tables:
touch /mnt/pkernfs/iommu/dom-0
truncate -s 2M /mnt/pkernfs/iommu/dom-0

That file must be supplied to VFIO when creating the IOMMU container, via the
VFIO_CONTAINER_SET_PERSISTENT_PGTABLES ioctl. Example: [5]

After kexec, re-mount pkernfs and re-use those files for guest RAM and IOMMU
state. When doing DMA mapping, specify the additional flag
VFIO_DMA_MAP_FLAG_LIVE_UPDATE to indicate that IOVAs are set up already.
Example: [6].

== Limitations ==

This is an RFC design to sketch out the concept so that there can be a discussion
about the general approach. There are many gaps and hacks; the idea is to keep
this RFC as simple as possible. Limitations include:

* Needing to supply the physical memory range for pkernfs as a kernel cmdline
parameter. Better would be to allocate memory for pkernfs dynamically on first
boot and pass that across kexec. Doing so would require additional integration
with memblocks and some ability to pass the dynamically allocated ranges
across. KHO [0] could support this.

* A single filesystem with no support for NUMA awareness. Better would be to
support multiple named pkernfs mounts which can cover different NUMA nodes.

* Skeletal filesystem code. There's just enough functionality to
demonstrate the concept of using files for guest RAM and IOMMU state.

* Use-after-frees for IOMMU mappings. Currently nothing stops the pkernfs guest
RAM files being deleted or resized while IOMMU mappings are set up which would
allow DMA to freed memory. Better integration with guest RAM files and
IOMMU/VFIO is necessary.

* Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
Really we should move the abstraction one level up and make the whole VFIO
container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
container file and all of the DMA mappings inside VFIO would already be set up.

* Inefficient use of filesystem space. Every mapping block is 2 MiB, which is
both wasteful and imposes a hard upper limit on file size.

[0] https://lore.kernel.org/kexec/[email protected]/
[1] https://lore.kernel.org/kexec/[email protected]/
[2] https://lkml.org/lkml/2020/12/7/342
[3] https://lore.kernel.org/all/169645773092.11424.7258549771090599226.stgit@skinsburskii./
[4] https://lore.kernel.org/all/2023082506-enchanted-tripping-d1d5@gregkh/#t
[5] https://github.com/jgowans/qemu/commit/e84cfb8186d71f797ef1f72d57d873222a9b479e
[6] https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0


James Gowans (18):
pkernfs: Introduce filesystem skeleton
pkernfs: Add persistent inodes hooked into directories
pkernfs: Define an allocator for persistent pages
pkernfs: support file truncation
pkernfs: add file mmap callback
init: Add liveupdate cmdline param
pkernfs: Add file type for IOMMU root pgtables
iommu: Add allocator for pgtables from persistent region
intel-iommu: Use pkernfs for root/context pgtable pages
iommu/intel: zap context table entries on kexec
dma-iommu: Always enable deferred attaches for liveupdate
pkernfs: Add IOMMU domain pgtables file
vfio: add ioctl to define persistent pgtables on container
intel-iommu: Allocate domain pgtable pages from pkernfs
pkernfs: register device memory for IOMMU domain pgtables
vfio: support not mapping IOMMU pgtables on live-update
pci: Don't clear bus master if persistence enabled
vfio-pci: Assume device working after liveupdate

drivers/iommu/Makefile | 1 +
drivers/iommu/dma-iommu.c | 2 +-
drivers/iommu/intel/dmar.c | 1 +
drivers/iommu/intel/iommu.c | 93 +++++++++++++---
drivers/iommu/intel/iommu.h | 5 +
drivers/iommu/iommu.c | 22 ++--
drivers/iommu/pgtable_alloc.c | 43 +++++++
drivers/iommu/pgtable_alloc.h | 10 ++
drivers/pci/pci-driver.c | 4 +-
drivers/vfio/container.c | 27 +++++
drivers/vfio/pci/vfio_pci_core.c | 20 ++--
drivers/vfio/vfio.h | 2 +
drivers/vfio/vfio_iommu_type1.c | 51 ++++++---
fs/Kconfig | 1 +
fs/Makefile | 3 +
fs/pkernfs/Kconfig | 9 ++
fs/pkernfs/Makefile | 6 +
fs/pkernfs/allocator.c | 51 +++++++++
fs/pkernfs/dir.c | 43 +++++++
fs/pkernfs/file.c | 93 ++++++++++++++++
fs/pkernfs/inode.c | 185 +++++++++++++++++++++++++++++++
fs/pkernfs/iommu.c | 163 +++++++++++++++++++++++++++
fs/pkernfs/pkernfs.c | 115 +++++++++++++++++++
fs/pkernfs/pkernfs.h | 61 ++++++++++
include/linux/init.h | 1 +
include/linux/iommu.h | 6 +-
include/linux/pkernfs.h | 38 +++++++
include/uapi/linux/vfio.h | 10 ++
init/main.c | 10 ++
29 files changed, 1029 insertions(+), 47 deletions(-)
create mode 100644 drivers/iommu/pgtable_alloc.c
create mode 100644 drivers/iommu/pgtable_alloc.h
create mode 100644 fs/pkernfs/Kconfig
create mode 100644 fs/pkernfs/Makefile
create mode 100644 fs/pkernfs/allocator.c
create mode 100644 fs/pkernfs/dir.c
create mode 100644 fs/pkernfs/file.c
create mode 100644 fs/pkernfs/inode.c
create mode 100644 fs/pkernfs/iommu.c
create mode 100644 fs/pkernfs/pkernfs.c
create mode 100644 fs/pkernfs/pkernfs.h
create mode 100644 include/linux/pkernfs.h

--
2.40.1



2024-02-05 12:03:14

by Gowans, James

Subject: [RFC 01/18] pkernfs: Introduce filesystem skeleton

Add an in-memory filesystem: pkernfs. Memory is donated to pkernfs by
carving it out of the normal System RAM range with the memmap= cmdline
parameter and then giving that same physical range to pkernfs with the
pkernfs= cmdline parameter.

A new filesystem is added; so far it doesn't do much except persist a
super block at the start of the donated memory and allow itself to be
mounted.
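The size!base parsing above follows the memmap= convention. For illustration, a
userspace model of that parsing might look like the following (memparse_sketch()
is a hypothetical, simplified stand-in for the kernel's memparse(), handling
only the K/M/G suffixes):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel's memparse(): parse a number with an
 * optional K/M/G suffix, advancing *endp past what was consumed. */
static uint64_t memparse_sketch(const char *s, const char **endp)
{
	char *end;
	uint64_t val = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': val <<= 30; end++; break;
	case 'M': val <<= 20; end++; break;
	case 'K': val <<= 10; end++; break;
	}
	*endp = end;
	return val;
}

/* Parse "pkernfs=<size>!<base>", e.g. the value "1G!1G". */
static void parse_pkernfs_sketch(const char *p, uint64_t *size, uint64_t *base)
{
	*size = memparse_sketch(p, &p);
	p++;			/* skip over the '!' separator */
	*base = memparse_sketch(p, &p);
}
```

So "pkernfs=1G!1G" yields a 1 GiB region starting at physical offset 1 GiB.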
---
fs/Kconfig | 1 +
fs/Makefile | 3 ++
fs/pkernfs/Kconfig | 9 ++++
fs/pkernfs/Makefile | 6 +++
fs/pkernfs/pkernfs.c | 99 ++++++++++++++++++++++++++++++++++++++++++++
fs/pkernfs/pkernfs.h | 6 +++
6 files changed, 124 insertions(+)
create mode 100644 fs/pkernfs/Kconfig
create mode 100644 fs/pkernfs/Makefile
create mode 100644 fs/pkernfs/pkernfs.c
create mode 100644 fs/pkernfs/pkernfs.h

diff --git a/fs/Kconfig b/fs/Kconfig
index aa7e03cc1941..33a9770ae657 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -331,6 +331,7 @@ source "fs/sysv/Kconfig"
source "fs/ufs/Kconfig"
source "fs/erofs/Kconfig"
source "fs/vboxsf/Kconfig"
+source "fs/pkernfs/Kconfig"

endif # MISC_FILESYSTEMS

diff --git a/fs/Makefile b/fs/Makefile
index f9541f40be4e..1af35b494b5d 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -19,6 +19,9 @@ obj-y := open.o read_write.o file_table.o super.o \

obj-$(CONFIG_BUFFER_HEAD) += buffer.o mpage.o
obj-$(CONFIG_PROC_FS) += proc_namespace.o
+
+obj-y += pkernfs/
+
obj-$(CONFIG_LEGACY_DIRECT_IO) += direct-io.o
obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
diff --git a/fs/pkernfs/Kconfig b/fs/pkernfs/Kconfig
new file mode 100644
index 000000000000..59621a1d9aef
--- /dev/null
+++ b/fs/pkernfs/Kconfig
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+config PKERNFS_FS
+ bool "Persistent Kernel filesystem (pkernfs)"
+ help
+ An in-memory filesystem on top of reserved memory specified via
+ pkernfs= cmdline argument. Used for storing kernel state and
+ userspace memory which is preserved across kexec to support
+ live update.
diff --git a/fs/pkernfs/Makefile b/fs/pkernfs/Makefile
new file mode 100644
index 000000000000..17258cb77f58
--- /dev/null
+++ b/fs/pkernfs/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Makefile for persistent kernel filesystem
+#
+
+obj-$(CONFIG_PKERNFS_FS) += pkernfs.o
diff --git a/fs/pkernfs/pkernfs.c b/fs/pkernfs/pkernfs.c
new file mode 100644
index 000000000000..4c476ddc35b6
--- /dev/null
+++ b/fs/pkernfs/pkernfs.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "pkernfs.h"
+#include <linux/dcache.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/fs_context.h>
+#include <linux/io.h>
+
+static phys_addr_t pkernfs_base, pkernfs_size;
+static void *pkernfs_mem;
+static const struct super_operations pkernfs_super_ops = { };
+
+static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+ struct inode *inode;
+ struct dentry *dentry;
+ struct pkernfs_sb *psb;
+
+ pkernfs_mem = memremap(pkernfs_base, pkernfs_size, MEMREMAP_WB);
+ psb = (struct pkernfs_sb *) pkernfs_mem;
+
+ if (psb->magic_number == PKERNFS_MAGIC_NUMBER) {
+ pr_info("pkernfs: Restoring from super block\n");
+ } else {
+ pr_info("pkernfs: Clean super block; initialising\n");
+ psb->magic_number = PKERNFS_MAGIC_NUMBER;
+ }
+
+ sb->s_op = &pkernfs_super_ops;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return -ENOMEM;
+
+ inode->i_ino = 1;
+ inode->i_mode = S_IFDIR;
+ inode->i_op = &simple_dir_inode_operations;
+ inode->i_fop = &simple_dir_operations;
+ inode->i_atime = inode->i_mtime = current_time(inode);
+ inode_set_ctime_current(inode);
+ /* directory inodes start off with i_nlink == 2 (for "." entry) */
+ inc_nlink(inode);
+
+ dentry = d_make_root(inode);
+ if (!dentry)
+ return -ENOMEM;
+ sb->s_root = dentry;
+
+ return 0;
+}
+
+static int pkernfs_get_tree(struct fs_context *fc)
+{
+ return get_tree_nodev(fc, pkernfs_fill_super);
+}
+
+static const struct fs_context_operations pkernfs_context_ops = {
+ .get_tree = pkernfs_get_tree,
+};
+
+static int pkernfs_init_fs_context(struct fs_context *const fc)
+{
+ fc->ops = &pkernfs_context_ops;
+ return 0;
+}
+
+static struct file_system_type pkernfs_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "pkernfs",
+ .init_fs_context = pkernfs_init_fs_context,
+ .kill_sb = kill_litter_super,
+ .fs_flags = FS_USERNS_MOUNT,
+};
+
+static int __init pkernfs_init(void)
+{
+ int ret;
+
+ ret = register_filesystem(&pkernfs_fs_type);
+ return ret;
+}
+
+/**
+ * Format: pkernfs=<size>!<base>
+ * Just like: memmap=nn[KMG]!ss[KMG]
+ */
+static int __init parse_pkernfs_extents(char *p)
+{
+ pkernfs_size = memparse(p, &p);
+ p++; /* Skip over ! char */
+ pkernfs_base = memparse(p, &p);
+ return 0;
+}
+
+early_param("pkernfs", parse_pkernfs_extents);
+
+MODULE_ALIAS_FS("pkernfs");
+module_init(pkernfs_init);
diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h
new file mode 100644
index 000000000000..bd1e2a6fd336
--- /dev/null
+++ b/fs/pkernfs/pkernfs.h
@@ -0,0 +1,6 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#define PKERNFS_MAGIC_NUMBER 0x706b65726e6673
+struct pkernfs_sb {
+ unsigned long magic_number;
+};
--
2.40.1


2024-02-05 12:03:42

by Gowans, James

Subject: [RFC 02/18] pkernfs: Add persistent inodes hooked into directories

Add the ability to create inodes for files and directories inside
directories. Inodes are persistent in the in-memory filesystem; the
second 2 MiB block is used as an "inode store." The inode store is one big array
of struct pkernfs_inodes and they use a linked list to point to the next
sibling inode or in the case of a directory the child inode which is the
first inode in that directory.

Free inodes are similarly maintained in a linked list, with the first
free inode being pointed to by the super block.

Directory file_operations are added to support iterating through the
content of a directory.

Similarly, inode operations are added to support creating a file inside a
directory. This allocates the next free inode and makes it the head of
the "child inode" linked list for the directory. Unlink is implemented
to remove an inode from the linked list. This is a bit finicky as it is
done differently depending on whether the inode is the first child of a
directory or somewhere later in the linked list.
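The free-list and sibling/child linkage described above can be modelled in a
few lines of userspace C. This is a simplified sketch, not the kernel code:
field and function names merely mirror the patch, and the 1-based inode
numbering follows pkernfs_get_persisted_inode():

```c
#include <assert.h>
#include <string.h>

#define NINODES 8

struct inode_sketch {
	unsigned long sibling_ino;	/* next sibling, or next free inode */
	unsigned long child_ino;	/* first child, for directories */
};

static struct inode_sketch store[NINODES + 1];	/* 1-based inode numbers */
static unsigned long next_free_ino;

static void init_store(void)
{
	memset(store, 0, sizeof(store));
	/* Chain inodes 2..NINODES into the free list; inode 1 is the root dir. */
	for (unsigned long ino = 2; ino < NINODES; ino++)
		store[ino].sibling_ino = ino + 1;
	next_free_ino = 2;
}

static unsigned long alloc_inode(void)
{
	unsigned long ino = next_free_ino;

	if (ino)	/* pop the head of the free list */
		next_free_ino = store[ino].sibling_ino;
	return ino;
}

/* Link a freshly allocated inode at the head of a directory's child list. */
static void link_into_dir(unsigned long dir_ino, unsigned long ino)
{
	store[ino].sibling_ino = store[dir_ino].child_ino;
	store[dir_ino].child_ino = ino;
}
```

Because new files go at the head of the list, unlink of a first child only has
to update the directory's child_ino, while later children need the scan for
the previous sibling described above.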
---
fs/pkernfs/Makefile | 2 +-
fs/pkernfs/dir.c | 43 +++++++++++++
fs/pkernfs/inode.c | 148 +++++++++++++++++++++++++++++++++++++++++++
fs/pkernfs/pkernfs.c | 13 ++--
fs/pkernfs/pkernfs.h | 34 ++++++++++
5 files changed, 234 insertions(+), 6 deletions(-)
create mode 100644 fs/pkernfs/dir.c
create mode 100644 fs/pkernfs/inode.c

diff --git a/fs/pkernfs/Makefile b/fs/pkernfs/Makefile
index 17258cb77f58..0a66e98bda07 100644
--- a/fs/pkernfs/Makefile
+++ b/fs/pkernfs/Makefile
@@ -3,4 +3,4 @@
# Makefile for persistent kernel filesystem
#

-obj-$(CONFIG_PKERNFS_FS) += pkernfs.o
+obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o dir.o
diff --git a/fs/pkernfs/dir.c b/fs/pkernfs/dir.c
new file mode 100644
index 000000000000..b10ce745f19d
--- /dev/null
+++ b/fs/pkernfs/dir.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "pkernfs.h"
+
+static int pkernfs_dir_iterate(struct file *dir, struct dir_context *ctx)
+{
+ struct pkernfs_inode *pkernfs_inode;
+ struct super_block *sb = dir->f_inode->i_sb;
+
+ /* Indication from previous invoke that there's no more to iterate. */
+ if (ctx->pos == -1)
+ return 0;
+
+ if (!dir_emit_dots(dir, ctx))
+ return 0;
+
+ /*
+ * Just emitted this dir; go to dir contents. Use pos to smuggle
+ * the next inode number to emit across iterations.
+ * -1 indicates no valid inode. Can't use 0 because first loop has pos=0
+ */
+ if (ctx->pos == 2) {
+ ctx->pos = pkernfs_get_persisted_inode(sb, dir->f_inode->i_ino)->child_ino;
+ /* Empty dir case. */
+ if (ctx->pos == 0)
+ ctx->pos = -1;
+ }
+
+ while (ctx->pos > 1) {
+ pkernfs_inode = pkernfs_get_persisted_inode(sb, ctx->pos);
+ dir_emit(ctx, pkernfs_inode->filename, PKERNFS_FILENAME_LEN,
+ ctx->pos, DT_UNKNOWN);
+ ctx->pos = pkernfs_inode->sibling_ino;
+ if (!ctx->pos)
+ ctx->pos = -1;
+ }
+ return 0;
+}
+
+const struct file_operations pkernfs_dir_fops = {
+ .owner = THIS_MODULE,
+ .iterate_shared = pkernfs_dir_iterate,
+};
diff --git a/fs/pkernfs/inode.c b/fs/pkernfs/inode.c
new file mode 100644
index 000000000000..f6584c8b8804
--- /dev/null
+++ b/fs/pkernfs/inode.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "pkernfs.h"
+#include <linux/fs.h>
+
+const struct inode_operations pkernfs_dir_inode_operations;
+
+struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino)
+{
+ /*
+ * Inode index starts at 1, so -1 to get memory index.
+ */
+ return ((struct pkernfs_inode *) (pkernfs_mem + PMD_SIZE)) + ino - 1;
+}
+
+struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino)
+{
+ struct inode *inode = iget_locked(sb, ino);
+
+ /* If this inode is cached it is already populated; just return */
+ if (!(inode->i_state & I_NEW))
+ return inode;
+ inode->i_op = &pkernfs_dir_inode_operations;
+ inode->i_sb = sb;
+ inode->i_mode = S_IFREG;
+ unlock_new_inode(inode);
+ return inode;
+}
+
+static unsigned long pkernfs_allocate_inode(struct super_block *sb)
+{
+
+ unsigned long next_free_ino;
+ struct pkernfs_sb *psb = (struct pkernfs_sb *) pkernfs_mem;
+
+ next_free_ino = psb->next_free_ino;
+ if (!next_free_ino)
+ return -ENOMEM;
+ psb->next_free_ino =
+ pkernfs_get_persisted_inode(sb, next_free_ino)->sibling_ino;
+ return next_free_ino;
+}
+
+/*
+ * Zeroes the inode and makes it the head of the free list.
+ */
+static void pkernfs_free_inode(struct super_block *sb, unsigned long ino)
+{
+ struct pkernfs_sb *psb = (struct pkernfs_sb *) pkernfs_mem;
+ struct pkernfs_inode *inode = pkernfs_get_persisted_inode(sb, ino);
+
+ memset(inode, 0, sizeof(struct pkernfs_inode));
+ inode->sibling_ino = psb->next_free_ino;
+ psb->next_free_ino = ino;
+}
+
+void pkernfs_initialise_inode_store(struct super_block *sb)
+{
+ /* Inode store is a PMD sized (ie: 2 MiB) page */
+ memset(pkernfs_get_persisted_inode(sb, 1), 0, PMD_SIZE);
+ /* Point each inode for the next one; linked-list initialisation. */
+ for (unsigned long ino = 2; ino * sizeof(struct pkernfs_inode) < PMD_SIZE; ino++)
+ pkernfs_get_persisted_inode(sb, ino - 1)->sibling_ino = ino;
+}
+
+static int pkernfs_create(struct mnt_idmap *id, struct inode *dir,
+ struct dentry *dentry, umode_t mode, bool excl)
+{
+ unsigned long free_inode;
+ struct pkernfs_inode *pkernfs_inode;
+ struct inode *vfs_inode;
+
+ free_inode = pkernfs_allocate_inode(dir->i_sb);
+ if (free_inode <= 0)
+ return -ENOMEM;
+
+ pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, free_inode);
+ pkernfs_inode->sibling_ino = pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino;
+ pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino = free_inode;
+ strscpy(pkernfs_inode->filename, dentry->d_name.name, PKERNFS_FILENAME_LEN);
+ pkernfs_inode->flags = PKERNFS_INODE_FLAG_FILE;
+
+ vfs_inode = pkernfs_inode_get(dir->i_sb, free_inode);
+ d_instantiate(dentry, vfs_inode);
+ return 0;
+}
+
+static struct dentry *pkernfs_lookup(struct inode *dir,
+ struct dentry *dentry,
+ unsigned int flags)
+{
+ struct pkernfs_inode *pkernfs_inode;
+ unsigned long ino;
+
+ pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino);
+ ino = pkernfs_inode->child_ino;
+ while (ino) {
+ pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, ino);
+ if (!strncmp(pkernfs_inode->filename, dentry->d_name.name, PKERNFS_FILENAME_LEN)) {
+ d_add(dentry, pkernfs_inode_get(dir->i_sb, ino));
+ break;
+ }
+ ino = pkernfs_inode->sibling_ino;
+ }
+ return NULL;
+}
+
+static int pkernfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+ unsigned long ino;
+ struct pkernfs_inode *inode;
+
+ ino = pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino;
+
+ /* Special case for first file in dir */
+ if (ino == dentry->d_inode->i_ino) {
+ pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino =
+ pkernfs_get_persisted_inode(dir->i_sb, dentry->d_inode->i_ino)->sibling_ino;
+ pkernfs_free_inode(dir->i_sb, ino);
+ return 0;
+ }
+
+ /*
+ * Although we know exactly the inode to free, because we maintain only
+ * a singly linked list we need to scan for it to find the previous
+ * element so its "next" pointer can be updated.
+ */
+ while (ino) {
+ inode = pkernfs_get_persisted_inode(dir->i_sb, ino);
+ /* We've found the one pointing to the one we want to delete */
+ if (inode->sibling_ino == dentry->d_inode->i_ino) {
+ inode->sibling_ino =
+ pkernfs_get_persisted_inode(dir->i_sb,
+ dentry->d_inode->i_ino)->sibling_ino;
+ pkernfs_free_inode(dir->i_sb, dentry->d_inode->i_ino);
+ break;
+ }
+ ino = pkernfs_get_persisted_inode(dir->i_sb, ino)->sibling_ino;
+ }
+
+ return 0;
+}
+
+const struct inode_operations pkernfs_dir_inode_operations = {
+ .create = pkernfs_create,
+ .lookup = pkernfs_lookup,
+ .unlink = pkernfs_unlink,
+};
diff --git a/fs/pkernfs/pkernfs.c b/fs/pkernfs/pkernfs.c
index 4c476ddc35b6..518c610e3877 100644
--- a/fs/pkernfs/pkernfs.c
+++ b/fs/pkernfs/pkernfs.c
@@ -8,7 +8,7 @@
#include <linux/io.h>

static phys_addr_t pkernfs_base, pkernfs_size;
-static void *pkernfs_mem;
+void *pkernfs_mem;
static const struct super_operations pkernfs_super_ops = { };

static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc)
@@ -24,23 +24,26 @@ static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc)
pr_info("pkernfs: Restoring from super block\n");
} else {
pr_info("pkernfs: Clean super block; initialising\n");
+ pkernfs_initialise_inode_store(sb);
psb->magic_number = PKERNFS_MAGIC_NUMBER;
+ pkernfs_get_persisted_inode(sb, 1)->flags = PKERNFS_INODE_FLAG_DIR;
+ strscpy(pkernfs_get_persisted_inode(sb, 1)->filename, ".", PKERNFS_FILENAME_LEN);
+ psb->next_free_ino = 2;
}

sb->s_op = &pkernfs_super_ops;

- inode = new_inode(sb);
+ inode = pkernfs_inode_get(sb, 1);
if (!inode)
return -ENOMEM;

- inode->i_ino = 1;
inode->i_mode = S_IFDIR;
- inode->i_op = &simple_dir_inode_operations;
- inode->i_fop = &simple_dir_operations;
+ inode->i_fop = &pkernfs_dir_fops;
inode->i_atime = inode->i_mtime = current_time(inode);
inode_set_ctime_current(inode);
/* directory inodes start off with i_nlink == 2 (for "." entry) */
inc_nlink(inode);
+ inode_init_owner(&nop_mnt_idmap, inode, NULL, inode->i_mode);

dentry = d_make_root(inode);
if (!dentry)
diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h
index bd1e2a6fd336..192e089b3151 100644
--- a/fs/pkernfs/pkernfs.h
+++ b/fs/pkernfs/pkernfs.h
@@ -1,6 +1,40 @@
/* SPDX-License-Identifier: GPL-2.0-only */

+#include <linux/fs.h>
+
#define PKERNFS_MAGIC_NUMBER 0x706b65726e6673
+#define PKERNFS_FILENAME_LEN 255
+
+extern void *pkernfs_mem;
+
struct pkernfs_sb {
unsigned long magic_number;
+ /* Inode number */
+ unsigned long next_free_ino;
};
+
+// If neither of these are set the inode is not in use.
+#define PKERNFS_INODE_FLAG_FILE (1 << 0)
+#define PKERNFS_INODE_FLAG_DIR (1 << 1)
+struct pkernfs_inode {
+ int flags;
+ /*
+ * Points to next inode in the same directory, or
+ * 0 if last file in directory.
+ */
+ unsigned long sibling_ino;
+ /*
+ * If this inode is a directory, this points to the
+ * first inode *in* that directory.
+ */
+ unsigned long child_ino;
+ char filename[PKERNFS_FILENAME_LEN];
+ int mappings_block;
+ int num_mappings;
+};
+
+void pkernfs_initialise_inode_store(struct super_block *sb);
+struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino);
+struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino);
+
+extern const struct file_operations pkernfs_dir_fops;
--
2.40.1


2024-02-05 12:04:24

by Gowans, James

Subject: [RFC 03/18] pkernfs: Define an allocator for persistent pages

This introduces the concept of a bitmap allocator for pages from the
pkernfs filesystem. The allocation bitmap is stored in the second half
of the first page. This imposes an artificial limit on the maximum size
of the filesystem; this needs to be made extensible.

So far the allocation bitmap can only be zeroed; the next commit will
add the ability to allocate blocks and use them.
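A userspace model of the block bitmap may make the layout concrete. This
sketch uses plain byte operations in place of the kernel's set_bit()/test_bit()
helpers; the reserved blocks follow pkernfs_zero_allocations() in the patch,
and alloc_block() is a hypothetical first-fit allocator a later commit would
need:

```c
#include <assert.h>
#include <string.h>

#define BITMAP_BYTES 64		/* tracks 512 blocks in this toy model */

static unsigned char bitmap[BITMAP_BYTES];

static void bitmap_set(int bit)  { bitmap[bit / 8] |= (1u << (bit % 8)); }
static int  bitmap_test(int bit) { return bitmap[bit / 8] & (1u << (bit % 8)); }

/* Mirror pkernfs_zero_allocations(): clear everything, then reserve
 * block 0 (super block + bitmap) and block 1 (inode store). */
static void zero_allocations(void)
{
	memset(bitmap, 0, sizeof(bitmap));
	bitmap_set(0);
	bitmap_set(1);
}

/* First-fit scan for a free block; returns -1 when the filesystem is full. */
static int alloc_block(void)
{
	for (int bit = 0; bit < BITMAP_BYTES * 8; bit++) {
		if (!bitmap_test(bit)) {
			bitmap_set(bit);
			return bit;
		}
	}
	return -1;
}
```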
---
fs/pkernfs/Makefile | 2 +-
fs/pkernfs/allocator.c | 27 +++++++++++++++++++++++++++
fs/pkernfs/pkernfs.c | 1 +
fs/pkernfs/pkernfs.h | 1 +
4 files changed, 30 insertions(+), 1 deletion(-)
create mode 100644 fs/pkernfs/allocator.c

diff --git a/fs/pkernfs/Makefile b/fs/pkernfs/Makefile
index 0a66e98bda07..d8b92a74fbc6 100644
--- a/fs/pkernfs/Makefile
+++ b/fs/pkernfs/Makefile
@@ -3,4 +3,4 @@
# Makefile for persistent kernel filesystem
#

-obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o dir.o
+obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o allocator.o dir.o
diff --git a/fs/pkernfs/allocator.c b/fs/pkernfs/allocator.c
new file mode 100644
index 000000000000..1d4aac9c4545
--- /dev/null
+++ b/fs/pkernfs/allocator.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "pkernfs.h"
+
+/**
+ * For allocating blocks from the pkernfs filesystem.
+ * The first two blocks are special:
+ * - the first block is persitent filesystme metadata and
+ * a bitmap of allocated blocks
+ * - the second block is an array of persisted inodes; the
+ * inode store.
+ */
+
+void *pkernfs_allocations_bitmap(struct super_block *sb)
+{
+ /* Allocations is 2nd half of first block */
+ return pkernfs_mem + (1 << 20);
+}
+
+void pkernfs_zero_allocations(struct super_block *sb)
+{
+ memset(pkernfs_allocations_bitmap(sb), 0, (1 << 20));
+ /* First page is persisted super block and allocator bitmap */
+ set_bit(0, pkernfs_allocations_bitmap(sb));
+ /* Second page is inode store */
+ set_bit(1, pkernfs_allocations_bitmap(sb));
+}
diff --git a/fs/pkernfs/pkernfs.c b/fs/pkernfs/pkernfs.c
index 518c610e3877..199c2c648bca 100644
--- a/fs/pkernfs/pkernfs.c
+++ b/fs/pkernfs/pkernfs.c
@@ -25,6 +25,7 @@ static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc)
} else {
pr_info("pkernfs: Clean super block; initialising\n");
pkernfs_initialise_inode_store(sb);
+ pkernfs_zero_allocations(sb);
psb->magic_number = PKERNFS_MAGIC_NUMBER;
pkernfs_get_persisted_inode(sb, 1)->flags = PKERNFS_INODE_FLAG_DIR;
strscpy(pkernfs_get_persisted_inode(sb, 1)->filename, ".", PKERNFS_FILENAME_LEN);
diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h
index 192e089b3151..4655780f31f2 100644
--- a/fs/pkernfs/pkernfs.h
+++ b/fs/pkernfs/pkernfs.h
@@ -34,6 +34,7 @@ struct pkernfs_inode {
};

void pkernfs_initialise_inode_store(struct super_block *sb);
+void pkernfs_zero_allocations(struct super_block *sb);
struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino);
struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino);

--
2.40.1


2024-02-05 12:05:22

by Gowans, James

Subject: [RFC 05/18] pkernfs: add file mmap callback

Make the file data usable to userspace by adding mmap. That's all that
QEMU needs for guest RAM, so that's all we bother implementing for now.

When mmapping the file the VMA is marked as PFNMAP to indicate that there
are no struct pages for the memory in this VMA. remap_pfn_range() is
used to actually populate the page tables. All PTEs are pre-faulted into
the pgtables at mmap time so that the pgtables are usable when this
virtual address range is given to VFIO's MAP_DMA.
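The per-block arithmetic in the mmap loop can be checked in isolation. This
sketch mirrors the index and PFN computation from the patch; the PAGE_SHIFT
and PMD_SIZE values assume x86-64 with 4 KiB base pages, and the _SK-suffixed
names are local stand-ins, not kernel macros:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_SK	12
#define PMD_SIZE_SK	(1UL << 21)	/* 2 MiB */

/* Which entry of the file's mappings block covers this VMA address? */
static unsigned long block_index(unsigned long addr, unsigned long vm_start)
{
	return (addr - vm_start) / PMD_SIZE_SK;
}

/* PFN handed to remap_pfn_range(): base PFN of the pkernfs region plus
 * 512 4-KiB pages per allocated 2 MiB block. */
static unsigned long block_pfn(uint64_t pkernfs_base, unsigned long mapped_block)
{
	return (unsigned long)(pkernfs_base >> PAGE_SHIFT_SK) + mapped_block * 512;
}
```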
---
fs/pkernfs/file.c | 42 +++++++++++++++++++++++++++++++++++++++++-
fs/pkernfs/pkernfs.c | 2 +-
fs/pkernfs/pkernfs.h | 2 ++
3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/fs/pkernfs/file.c b/fs/pkernfs/file.c
index 27a637423178..844b6cc63840 100644
--- a/fs/pkernfs/file.c
+++ b/fs/pkernfs/file.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0-only

#include "pkernfs.h"
+#include <linux/mm.h>

static int truncate(struct inode *inode, loff_t newsize)
{
@@ -42,6 +43,45 @@ static int inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct
return 0;
}

+/*
+ * To be able to use PFNMAP VMAs for VFIO DMA mapping we need the page tables
+ * populated with mappings. Pre-fault everything.
+ */
+static int mmap(struct file *filp, struct vm_area_struct *vma)
+{
+ int rc;
+ unsigned long *mappings_block;
+ struct pkernfs_inode *pkernfs_inode;
+
+ pkernfs_inode = pkernfs_get_persisted_inode(filp->f_inode->i_sb, filp->f_inode->i_ino);
+
+ mappings_block = (unsigned long *)pkernfs_addr_for_block(filp->f_inode->i_sb,
+ pkernfs_inode->mappings_block);
+
+ /* Remap-pfn-range will mark the range VM_IO */
+ for (unsigned long vma_addr_offset = vma->vm_start;
+ vma_addr_offset < vma->vm_end;
+ vma_addr_offset += PMD_SIZE) {
+ int block, mapped_block;
+
+ block = (vma_addr_offset - vma->vm_start) / PMD_SIZE;
+ mapped_block = *(mappings_block + block);
+ /*
+ * It's wrong to use remap_pfn_range; this will install PTE-level entries.
+ * The whole point of 2 MiB allocs is to improve TLB perf!
+ * We should use something like mm/huge_memory.c#insert_pfn_pmd
+ * but that is currently static.
+ * TODO: figure out the best way to install PMDs.
+ */
+ rc = remap_pfn_range(vma,
+ vma_addr_offset,
+ (pkernfs_base >> PAGE_SHIFT) + (mapped_block * 512),
+ PMD_SIZE,
+ vma->vm_page_prot);
+ }
+ return 0;
+}
+
const struct inode_operations pkernfs_file_inode_operations = {
.setattr = inode_setattr,
.getattr = simple_getattr,
@@ -49,5 +89,5 @@ const struct inode_operations pkernfs_file_inode_operations = {

const struct file_operations pkernfs_file_fops = {
.owner = THIS_MODULE,
- .iterate_shared = NULL,
+ .mmap = mmap,
};
diff --git a/fs/pkernfs/pkernfs.c b/fs/pkernfs/pkernfs.c
index 199c2c648bca..f010c2d76c76 100644
--- a/fs/pkernfs/pkernfs.c
+++ b/fs/pkernfs/pkernfs.c
@@ -7,7 +7,7 @@
#include <linux/fs_context.h>
#include <linux/io.h>

-static phys_addr_t pkernfs_base, pkernfs_size;
+phys_addr_t pkernfs_base, pkernfs_size;
void *pkernfs_mem;
static const struct super_operations pkernfs_super_ops = { };

diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h
index 8b4fee8c5b2e..1a7aa783a9be 100644
--- a/fs/pkernfs/pkernfs.h
+++ b/fs/pkernfs/pkernfs.h
@@ -6,6 +6,8 @@
#define PKERNFS_FILENAME_LEN 255

extern void *pkernfs_mem;
+/* Units of bytes */
+extern phys_addr_t pkernfs_base, pkernfs_size;

struct pkernfs_sb {
unsigned long magic_number;
--
2.40.1


2024-02-05 12:07:06

by Gowans, James

[permalink] [raw]
Subject: [RFC 08/18] iommu: Add allocator for pgtables from persistent region

The specific IOMMU drivers will need the ability to allocate pages from a
pkernfs IOMMU pgtable file for their pgtables. The IOMMU drivers will
also need the ability to consistently get the same page for the root PGD
page - add a specific function to get this PGD "root" page. This is
different from allocating regular pgtable pages because the exact same
page needs to be *restored* after kexec into the pgd pointer on the
IOMMU domain struct.

To support this sort of allocation the pkernfs region is treated as an
array of 512 4 KiB pages, the first of which is an allocation bitmap.
---
drivers/iommu/Makefile | 1 +
drivers/iommu/pgtable_alloc.c | 36 +++++++++++++++++++++++++++++++++++
drivers/iommu/pgtable_alloc.h | 9 +++++++++
3 files changed, 46 insertions(+)
create mode 100644 drivers/iommu/pgtable_alloc.c
create mode 100644 drivers/iommu/pgtable_alloc.h

diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 769e43d780ce..cadebabe9581 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,5 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
obj-y += amd/ intel/ arm/ iommufd/
+obj-y += pgtable_alloc.o
obj-$(CONFIG_IOMMU_API) += iommu.o
obj-$(CONFIG_IOMMU_API) += iommu-traces.o
obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
diff --git a/drivers/iommu/pgtable_alloc.c b/drivers/iommu/pgtable_alloc.c
new file mode 100644
index 000000000000..f0c2e12f8a8b
--- /dev/null
+++ b/drivers/iommu/pgtable_alloc.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "pgtable_alloc.h"
+#include <linux/mm.h>
+
+/*
+ * The first 4 KiB of the region is the allocation bitmap; its first bit
+ * is set to mark the bitmap's own page as allocated. Scan the bitmap to
+ * find the next free bit - that index is the next free page.
+ */
+
+void iommu_alloc_page_from_region(struct pkernfs_region *region, void **vaddr, unsigned long *paddr)
+{
+ int page_idx;
+
+ page_idx = bitmap_find_free_region(region->vaddr, 512, 0);
+ *vaddr = region->vaddr + (page_idx << PAGE_SHIFT);
+ if (paddr)
+ *paddr = region->paddr + (page_idx << PAGE_SHIFT);
+}
+
+
+void *pgtable_get_root_page(struct pkernfs_region *region, bool liveupdate)
+{
+ /*
+ * The page immediately after the bitmap is the root page.
+ * It would be wrong for the page to be allocated if we're
+ * NOT doing a liveupdate, or for a liveupdate to happen
+ * with no allocated page. Detect this mismatch.
+ */
+ if (test_bit(1, region->vaddr) ^ liveupdate) {
+ pr_err("%sdoing a liveupdate but root pg bit incorrect",
+ liveupdate ? "" : "NOT ");
+ }
+ set_bit(1, region->vaddr);
+ return region->vaddr + PAGE_SIZE;
+}
diff --git a/drivers/iommu/pgtable_alloc.h b/drivers/iommu/pgtable_alloc.h
new file mode 100644
index 000000000000..c1666a7be3d3
--- /dev/null
+++ b/drivers/iommu/pgtable_alloc.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <linux/types.h>
+#include <linux/pkernfs.h>
+
+void iommu_alloc_page_from_region(struct pkernfs_region *region,
+ void **vaddr, unsigned long *paddr);
+
+void *pgtable_get_root_page(struct pkernfs_region *region, bool liveupdate);
--
2.40.1


2024-02-05 12:09:04

by Gowans, James

[permalink] [raw]
Subject: [RFC 12/18] pkernfs: Add IOMMU domain pgtables file

Similar to the IOMMU root pgtables file which was added in a previous
commit, now support a file type for IOMMU domain pgtables in the IOMMU
directory. These domain pgtable files only need to be usable after the
system has booted up, for example by QEMU creating one of these files
and using it to back the IOMMU pgtables for a persistent VM. As such the
filesystem abstraction can be better maintained here as the kernel code
doesn't need to reach "behind" the filesystem abstraction like it does
for the root pgtables.

A new inode type is created for domain pgtable files, and the IOMMU
directory gets inode_operation callbacks to support creating and
deleting these files in it.

Note: there is a use-after-free risk here too: if the domain pgtable
file is truncated while it's in use for IOMMU pgtables then freed memory
could still be mapped into the IOMMU. To mitigate this there should be a
mechanism to "freeze" the files once they've been given to the IOMMU.
---
fs/pkernfs/inode.c | 9 +++++--
fs/pkernfs/iommu.c | 55 +++++++++++++++++++++++++++++++++++++++--
fs/pkernfs/pkernfs.h | 4 +++
include/linux/pkernfs.h | 1 +
4 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/fs/pkernfs/inode.c b/fs/pkernfs/inode.c
index 1d712e0a82a1..35842cd61002 100644
--- a/fs/pkernfs/inode.c
+++ b/fs/pkernfs/inode.c
@@ -35,7 +35,11 @@ struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino)
inode->i_op = &pkernfs_iommu_dir_inode_operations;
inode->i_fop = &pkernfs_dir_fops;
inode->i_mode = S_IFDIR;
- } else if (pkernfs_inode->flags | PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES) {
+ } else if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES) {
+ inode->i_fop = &pkernfs_file_fops;
+ inode->i_mode = S_IFREG;
+ } else if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_IOMMU_DOMAIN_PGTABLES) {
+ inode->i_fop = &pkernfs_file_fops;
inode->i_mode = S_IFREG;
}

@@ -175,6 +179,7 @@ const struct inode_operations pkernfs_dir_inode_operations = {
};

const struct inode_operations pkernfs_iommu_dir_inode_operations = {
+ .create = pkernfs_create_iommu_pgtables,
.lookup = pkernfs_lookup,
+ .unlink = pkernfs_unlink,
};
-
diff --git a/fs/pkernfs/iommu.c b/fs/pkernfs/iommu.c
index 5bce8146d7bb..f14e76013e85 100644
--- a/fs/pkernfs/iommu.c
+++ b/fs/pkernfs/iommu.c
@@ -4,6 +4,27 @@
#include <linux/io.h>


+void pkernfs_alloc_iommu_domain_pgtables(struct file *ppts, struct pkernfs_region *pkernfs_region)
+{
+ struct pkernfs_inode *pkernfs_inode;
+ unsigned long *mappings_block_vaddr;
+ unsigned long inode_idx;
+
+ /*
+ * For a pkernfs region block, the "mappings_block" field is still
+ * just a block index, but that block doesn't actually contain mappings;
+ * it contains the pkernfs_region data.
+ */
+
+ inode_idx = ppts->f_inode->i_ino;
+ pkernfs_inode = pkernfs_get_persisted_inode(NULL, inode_idx);
+
+ mappings_block_vaddr = (unsigned long *)pkernfs_addr_for_block(NULL,
+ pkernfs_inode->mappings_block);
+ set_bit(0, mappings_block_vaddr);
+ pkernfs_region->vaddr = mappings_block_vaddr;
+ pkernfs_region->paddr = pkernfs_base + (pkernfs_inode->mappings_block * (2 << 20));
+}
void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region)
{
unsigned long *mappings_block_vaddr;
@@ -63,9 +84,8 @@ void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region)
* just a block index, but that block doesn't actually contain mappings
* it contains the pkernfs_region data
*/
-
mappings_block_vaddr = (unsigned long *)pkernfs_addr_for_block(NULL,
- iommu_pgtables->mappings_block);
+ iommu_pgtables->mappings_block);
set_bit(0, mappings_block_vaddr);
pkernfs_region->vaddr = mappings_block_vaddr;
pkernfs_region->paddr = pkernfs_base + (iommu_pgtables->mappings_block * PMD_SIZE);
@@ -88,6 +108,29 @@ void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region)
(iommu_pgtables->mappings_block * PMD_SIZE);
}

+int pkernfs_create_iommu_pgtables(struct mnt_idmap *id, struct inode *dir,
+ struct dentry *dentry, umode_t mode, bool excl)
+{
+ unsigned long free_inode;
+ struct pkernfs_inode *pkernfs_inode;
+ struct inode *vfs_inode;
+
+ free_inode = pkernfs_allocate_inode(dir->i_sb);
+ if (free_inode <= 0)
+ return -ENOMEM;
+
+ pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, free_inode);
+ pkernfs_inode->sibling_ino = pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino;
+ pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino = free_inode;
+ strscpy(pkernfs_inode->filename, dentry->d_name.name, PKERNFS_FILENAME_LEN);
+ pkernfs_inode->flags = PKERNFS_INODE_FLAG_IOMMU_DOMAIN_PGTABLES;
+ pkernfs_inode->mappings_block = pkernfs_alloc_block(dir->i_sb);
+ memset(pkernfs_addr_for_block(dir->i_sb, pkernfs_inode->mappings_block), 0, (2 << 20));
+ vfs_inode = pkernfs_inode_get(dir->i_sb, free_inode);
+ d_add(dentry, vfs_inode);
+ return 0;
+}
+
void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long paddr)
{
if (WARN_ON(paddr >= region->paddr + region->bytes))
@@ -96,3 +139,11 @@ void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long
return NULL;
return region->vaddr + (paddr - region->paddr);
}
+
+bool pkernfs_is_iommu_domain_pgtables(struct file *f)
+{
+ return f &&
+ pkernfs_get_persisted_inode(f->f_inode->i_sb, f->f_inode->i_ino)->flags &
+ PKERNFS_INODE_FLAG_IOMMU_DOMAIN_PGTABLES;
+}
+
diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h
index e1b7ae3fe7f1..9bea827f8b40 100644
--- a/fs/pkernfs/pkernfs.h
+++ b/fs/pkernfs/pkernfs.h
@@ -21,6 +21,7 @@ struct pkernfs_sb {
#define PKERNFS_INODE_FLAG_DIR (1 << 1)
#define PKERNFS_INODE_FLAG_IOMMU_DIR (1 << 2)
#define PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES (1 << 3)
+#define PKERNFS_INODE_FLAG_IOMMU_DOMAIN_PGTABLES (1 << 4)
struct pkernfs_inode {
int flags;
/*
@@ -50,8 +51,11 @@ void *pkernfs_addr_for_block(struct super_block *sb, int block_idx);
unsigned long pkernfs_allocate_inode(struct super_block *sb);
struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino);

+int pkernfs_create_iommu_pgtables(struct mnt_idmap *id, struct inode *dir,
+ struct dentry *dentry, umode_t mode, bool excl);

extern const struct file_operations pkernfs_dir_fops;
extern const struct file_operations pkernfs_file_fops;
extern const struct inode_operations pkernfs_file_inode_operations;
extern const struct inode_operations pkernfs_iommu_dir_inode_operations;
+extern const struct inode_operations pkernfs_iommu_domain_pgtables_inode_operations;
diff --git a/include/linux/pkernfs.h b/include/linux/pkernfs.h
index 0110e4784109..4ca923ee0d82 100644
--- a/include/linux/pkernfs.h
+++ b/include/linux/pkernfs.h
@@ -33,4 +33,5 @@ void pkernfs_alloc_page_from_region(struct pkernfs_region *pkernfs_region,
void **vaddr, unsigned long *paddr);
void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long paddr);

+bool pkernfs_is_iommu_domain_pgtables(struct file *f);
#endif /* _LINUX_PKERNFS_H */
--
2.40.1


2024-02-05 12:11:32

by Gowans, James

[permalink] [raw]
Subject: [RFC 17/18] pci: Don't clear bus master if persistence enabled

In order for persistent devices to continue to DMA during kexec the bus
mastering capability needs to remain on. Do not disable bus mastering if
pkernfs is enabled, indicating that persistent devices are enabled.

Only persistent devices should have bus mastering left on during kexec
but this serves as a rough approximation of the functionality needed for
this pkernfs RFC.
---
drivers/pci/pci-driver.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 51ec9e7e784f..131127967811 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -9,6 +9,7 @@
#include <linux/init.h>
#include <linux/device.h>
#include <linux/mempolicy.h>
+#include <linux/pkernfs.h>
#include <linux/string.h>
#include <linux/slab.h>
#include <linux/sched.h>
@@ -519,7 +520,8 @@ static void pci_device_shutdown(struct device *dev)
* If it is not a kexec reboot, firmware will hit the PCI
* devices with big hammer and stop their DMA any way.
*/
- if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
+ if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot)
+ && !pkernfs_enabled())
pci_clear_master(pci_dev);
}

--
2.40.1


2024-02-05 12:13:16

by Gowans, James

[permalink] [raw]
Subject: [RFC 18/18] vfio-pci: Assume device working after liveupdate

When re-creating a VFIO device after liveupdate no destructive actions
should be taken on it to avoid interrupting any ongoing DMA.
Specifically, bus mastering should not be cleared and the device should
not be reset. Assume that reset works properly and skip both the bus
mastering clear and the function reset.

Ideally this would only be done for persistent devices but in this rough
RFC there currently is no mechanism at this point to easily tell if a
device is persisted or not.
---
drivers/vfio/pci/vfio_pci_core.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 1929103ee59a..a7f56d43e0a4 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -480,19 +480,25 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
return ret;
}

- /* Don't allow our initial saved state to include busmaster */
- pci_clear_master(pdev);
+ if (!liveupdate) {
+ /* Don't allow our initial saved state to include busmaster */
+ pci_clear_master(pdev);
+ }

ret = pci_enable_device(pdev);
if (ret)
goto out_power;

- /* If reset fails because of the device lock, fail this path entirely */
- ret = pci_try_reset_function(pdev);
- if (ret == -EAGAIN)
- goto out_disable_device;
+ if (!liveupdate) {
+ /* If reset fails because of the device lock, fail this path entirely */
+ ret = pci_try_reset_function(pdev);
+ if (ret == -EAGAIN)
+ goto out_disable_device;

- vdev->reset_works = !ret;
+ vdev->reset_works = !ret;
+ } else {
+ vdev->reset_works = 1;
+ }
pci_save_state(pdev);
vdev->pci_saved_state = pci_store_saved_state(pdev);
if (!vdev->pci_saved_state)
--
2.40.1


2024-02-05 12:16:54

by Gowans, James

[permalink] [raw]
Subject: [RFC 14/18] intel-iommu: Allocate domain pgtable pages from pkernfs

In the previous commit VFIO was updated to be able to define persistent
pgtables on a container. Now the IOMMU driver is updated to accept the
file for persistent pgtables when the domain is allocated and use that
file as the source of pages for the pgtables.

The iommu_ops.domain_alloc callback is extended to take a struct file
for the pkernfs domain pgtables file. Most call sites are updated to
supply NULL here, indicating no persistent pgtables. VFIO's caller is
updated to plumb the pkernfs file through. When this file is supplied
the md_domain_init() function converts the file into a pkernfs region and
uses that region for pgtables.

Similarly to the root pgtables there are use-after-free issues with this
that need sorting out, and the free() functions also need to be updated
to free from the pkernfs region.

It may be better to store the struct file on the dmar_domain and map
file offset to addr every time rather than using a pkernfs region for
this.
---
drivers/iommu/intel/iommu.c | 35 +++++++++++++++++++++++++++--------
drivers/iommu/intel/iommu.h | 1 +
drivers/iommu/iommu.c | 22 ++++++++++++++--------
drivers/iommu/pgtable_alloc.c | 7 +++++++
drivers/iommu/pgtable_alloc.h | 1 +
fs/pkernfs/iommu.c | 2 +-
include/linux/iommu.h | 6 +++++-
include/linux/pkernfs.h | 1 +
8 files changed, 57 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 315c6b7f901c..809ca9e93992 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -946,7 +946,13 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
if (!dma_pte_present(pte)) {
uint64_t pteval;

- tmp_page = alloc_pgtable_page(domain->nid, gfp);
+ if (domain->pgtables_allocator.vaddr)
+ iommu_alloc_page_from_region(
+ &domain->pgtables_allocator,
+ &tmp_page,
+ NULL);
+ else
+ tmp_page = alloc_pgtable_page(domain->nid, gfp);

if (!tmp_page)
return NULL;
@@ -2399,7 +2405,7 @@ static int iommu_domain_identity_map(struct dmar_domain *domain,
DMA_PTE_READ|DMA_PTE_WRITE, GFP_KERNEL);
}

-static int md_domain_init(struct dmar_domain *domain, int guest_width);
+static int md_domain_init(struct dmar_domain *domain, int guest_width, struct file *ppts);

static int __init si_domain_init(int hw)
{
@@ -2411,7 +2417,7 @@ static int __init si_domain_init(int hw)
if (!si_domain)
return -EFAULT;

- if (md_domain_init(si_domain, DEFAULT_DOMAIN_ADDRESS_WIDTH)) {
+ if (md_domain_init(si_domain, DEFAULT_DOMAIN_ADDRESS_WIDTH, NULL)) {
domain_exit(si_domain);
si_domain = NULL;
return -EFAULT;
@@ -4029,7 +4035,7 @@ static void device_block_translation(struct device *dev)
info->domain = NULL;
}

-static int md_domain_init(struct dmar_domain *domain, int guest_width)
+static int md_domain_init(struct dmar_domain *domain, int guest_width, struct file *ppts)
{
int adjust_width;

@@ -4042,8 +4048,21 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
domain->iommu_superpage = 0;
domain->max_addr = 0;

- /* always allocate the top pgd */
- domain->pgd = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
+ if (ppts) {
+ unsigned long pgd_phy;
+
+ pkernfs_get_region_for_ppts(
+ ppts,
+ &domain->pgtables_allocator);
+ iommu_get_pgd_page(
+ &domain->pgtables_allocator,
+ (void **) &domain->pgd,
+ &pgd_phy);
+
+ } else {
+ /* always allocate the top pgd */
+ domain->pgd = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
+ }
if (!domain->pgd)
return -ENOMEM;
domain_flush_cache(domain, domain->pgd, PAGE_SIZE);
@@ -4064,7 +4083,7 @@ static struct iommu_domain blocking_domain = {
}
};

-static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *intel_iommu_domain_alloc(unsigned int type, struct file *ppts)
{
struct dmar_domain *dmar_domain;
struct iommu_domain *domain;
@@ -4079,7 +4098,7 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
pr_err("Can't allocate dmar_domain\n");
return NULL;
}
- if (md_domain_init(dmar_domain, DEFAULT_DOMAIN_ADDRESS_WIDTH)) {
+ if (md_domain_init(dmar_domain, DEFAULT_DOMAIN_ADDRESS_WIDTH, ppts)) {
pr_err("Domain initialization failed\n");
domain_exit(dmar_domain);
return NULL;
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 4a2f163a86f3..f772fdcf3828 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -602,6 +602,7 @@ struct dmar_domain {
struct list_head dev_pasids; /* all attached pasids */
struct list_head domains; /* all struct dmar_domains on this IOMMU */

+ struct pkernfs_region pgtables_allocator;
struct dma_pte *pgd; /* virtual address */
int gaw; /* max guest address width */

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3a67e636287a..f26e83d5b159 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -97,7 +97,7 @@ static int iommu_bus_notifier(struct notifier_block *nb,
unsigned long action, void *data);
static void iommu_release_device(struct device *dev);
static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
- unsigned type);
+ unsigned int type, struct file *ppts);
static int __iommu_attach_device(struct iommu_domain *domain,
struct device *dev);
static int __iommu_attach_group(struct iommu_domain *domain,
@@ -1734,7 +1734,7 @@ __iommu_group_alloc_default_domain(const struct bus_type *bus,
{
if (group->default_domain && group->default_domain->type == req_type)
return group->default_domain;
- return __iommu_domain_alloc(bus, req_type);
+ return __iommu_domain_alloc(bus, req_type, NULL);
}

/*
@@ -1971,7 +1971,7 @@ void iommu_set_fault_handler(struct iommu_domain *domain,
EXPORT_SYMBOL_GPL(iommu_set_fault_handler);

static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
- unsigned type)
+ unsigned int type, struct file *ppts)
{
struct iommu_domain *domain;
unsigned int alloc_type = type & IOMMU_DOMAIN_ALLOC_FLAGS;
@@ -1979,7 +1979,7 @@ static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
if (bus == NULL || bus->iommu_ops == NULL)
return NULL;

- domain = bus->iommu_ops->domain_alloc(alloc_type);
+ domain = bus->iommu_ops->domain_alloc(alloc_type, ppts);
if (!domain)
return NULL;

@@ -2001,9 +2001,15 @@ static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
return domain;
}

+struct iommu_domain *iommu_domain_alloc_persistent(const struct bus_type *bus, struct file *ppts)
+{
+ return __iommu_domain_alloc(bus, IOMMU_DOMAIN_UNMANAGED, ppts);
+}
+EXPORT_SYMBOL_GPL(iommu_domain_alloc_persistent);
+
struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
{
- return __iommu_domain_alloc(bus, IOMMU_DOMAIN_UNMANAGED);
+ return __iommu_domain_alloc(bus, IOMMU_DOMAIN_UNMANAGED, NULL);
}
EXPORT_SYMBOL_GPL(iommu_domain_alloc);

@@ -3198,14 +3204,14 @@ static int __iommu_group_alloc_blocking_domain(struct iommu_group *group)
return 0;

group->blocking_domain =
- __iommu_domain_alloc(dev->dev->bus, IOMMU_DOMAIN_BLOCKED);
+ __iommu_domain_alloc(dev->dev->bus, IOMMU_DOMAIN_BLOCKED, NULL);
if (!group->blocking_domain) {
/*
* For drivers that do not yet understand IOMMU_DOMAIN_BLOCKED
* create an empty domain instead.
*/
group->blocking_domain = __iommu_domain_alloc(
- dev->dev->bus, IOMMU_DOMAIN_UNMANAGED);
+ dev->dev->bus, IOMMU_DOMAIN_UNMANAGED, NULL);
if (!group->blocking_domain)
return -EINVAL;
}
@@ -3500,7 +3506,7 @@ struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
const struct iommu_ops *ops = dev_iommu_ops(dev);
struct iommu_domain *domain;

- domain = ops->domain_alloc(IOMMU_DOMAIN_SVA);
+ domain = ops->domain_alloc(IOMMU_DOMAIN_SVA, NULL);
if (!domain)
return NULL;

diff --git a/drivers/iommu/pgtable_alloc.c b/drivers/iommu/pgtable_alloc.c
index f0c2e12f8a8b..276db15932cc 100644
--- a/drivers/iommu/pgtable_alloc.c
+++ b/drivers/iommu/pgtable_alloc.c
@@ -7,6 +7,13 @@
* The first 4 KiB is the bitmap - set the first bit in the bitmap.
* Scan bitmap to find next free bits - it's next free page.
*/
+void iommu_get_pgd_page(struct pkernfs_region *region, void **vaddr, unsigned long *paddr)
+{
+ set_bit(1, region->vaddr);
+ *vaddr = region->vaddr + (1 << PAGE_SHIFT);
+ if (paddr)
+ *paddr = region->paddr + (1 << PAGE_SHIFT);
+}

void iommu_alloc_page_from_region(struct pkernfs_region *region, void **vaddr, unsigned long *paddr)
{
diff --git a/drivers/iommu/pgtable_alloc.h b/drivers/iommu/pgtable_alloc.h
index c1666a7be3d3..50c3abba922b 100644
--- a/drivers/iommu/pgtable_alloc.h
+++ b/drivers/iommu/pgtable_alloc.h
@@ -3,6 +3,7 @@
#include <linux/types.h>
#include <linux/pkernfs.h>

+void iommu_get_pgd_page(struct pkernfs_region *region, void **vaddr, unsigned long *paddr);
void iommu_alloc_page_from_region(struct pkernfs_region *region,
void **vaddr, unsigned long *paddr);

diff --git a/fs/pkernfs/iommu.c b/fs/pkernfs/iommu.c
index f14e76013e85..5d0b256e7dd8 100644
--- a/fs/pkernfs/iommu.c
+++ b/fs/pkernfs/iommu.c
@@ -4,7 +4,7 @@
#include <linux/io.h>


-void pkernfs_alloc_iommu_domain_pgtables(struct file *ppts, struct pkernfs_region *pkernfs_region)
+void pkernfs_get_region_for_ppts(struct file *ppts, struct pkernfs_region *pkernfs_region)
{
struct pkernfs_inode *pkernfs_inode;
unsigned long *mappings_block_vaddr;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 0225cf7445de..01bb89246ef7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -101,6 +101,7 @@ struct iommu_domain {
enum iommu_page_response_code (*iopf_handler)(struct iommu_fault *fault,
void *data);
void *fault_data;
+ struct file *persistent_pgtables;
union {
struct {
iommu_fault_handler_t handler;
@@ -266,7 +267,8 @@ struct iommu_ops {
void *(*hw_info)(struct device *dev, u32 *length, u32 *type);

/* Domain allocation and freeing by the iommu driver */
- struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+ /* If ppts is not null it is a persistent domain; null is non-persistent */
+ struct iommu_domain *(*domain_alloc)(unsigned int iommu_domain_type, struct file *ppts);

struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
@@ -466,6 +468,8 @@ extern bool iommu_present(const struct bus_type *bus);
extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap);
extern bool iommu_group_has_isolated_msi(struct iommu_group *group);
extern struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus);
+extern struct iommu_domain *iommu_domain_alloc_persistent(const struct bus_type *bus,
+ struct file *ppts);
extern void iommu_domain_free(struct iommu_domain *domain);
extern int iommu_attach_device(struct iommu_domain *domain,
struct device *dev);
diff --git a/include/linux/pkernfs.h b/include/linux/pkernfs.h
index 4ca923ee0d82..8aa69ef5a2d8 100644
--- a/include/linux/pkernfs.h
+++ b/include/linux/pkernfs.h
@@ -31,6 +31,7 @@ struct pkernfs_region {
void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region);
void pkernfs_alloc_page_from_region(struct pkernfs_region *pkernfs_region,
void **vaddr, unsigned long *paddr);
+void pkernfs_get_region_for_ppts(struct file *ppts, struct pkernfs_region *pkernfs_region);
void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long paddr);

bool pkernfs_is_iommu_domain_pgtables(struct file *f);
--
2.40.1


2024-02-05 12:24:46

by Gowans, James

[permalink] [raw]
Subject: [RFC 04/18] pkernfs: support file truncation

In the previous commit a block allocator was added. Now use that block
allocator to allocate blocks for files when ftruncate is run on them.

To do that an inode_operations is added on the file inodes with a setattr
callback handling the ATTR_SIZE attribute. When this is invoked, blocks
are allocated and their indexes are put into a mappings block.
The mappings block is an array with the index being the file offset
block and the value at that index being the pkernfs block backing that
file offset.
---
fs/pkernfs/Makefile | 2 +-
fs/pkernfs/allocator.c | 24 +++++++++++++++++++
fs/pkernfs/file.c | 53 ++++++++++++++++++++++++++++++++++++++++++
fs/pkernfs/inode.c | 27 ++++++++++++++++++---
fs/pkernfs/pkernfs.h | 7 ++++++
5 files changed, 109 insertions(+), 4 deletions(-)
create mode 100644 fs/pkernfs/file.c

diff --git a/fs/pkernfs/Makefile b/fs/pkernfs/Makefile
index d8b92a74fbc6..e41f06cc490f 100644
--- a/fs/pkernfs/Makefile
+++ b/fs/pkernfs/Makefile
@@ -3,4 +3,4 @@
# Makefile for persistent kernel filesystem
#

-obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o allocator.o dir.o
+obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o allocator.o dir.o file.o
diff --git a/fs/pkernfs/allocator.c b/fs/pkernfs/allocator.c
index 1d4aac9c4545..3905ce92b4a9 100644
--- a/fs/pkernfs/allocator.c
+++ b/fs/pkernfs/allocator.c
@@ -25,3 +25,27 @@ void pkernfs_zero_allocations(struct super_block *sb)
/* Second page is inode store */
set_bit(1, pkernfs_allocations_bitmap(sb));
}
+
+/*
+ * Allocs one 2 MiB block, and returns the block index.
+ * Index is 2 MiB chunk index.
+ */
+unsigned long pkernfs_alloc_block(struct super_block *sb)
+{
+ unsigned long free_bit;
+
+ /* Allocation bitmap is the 2nd half of the first page */
+ void *allocations_mem = pkernfs_allocations_bitmap(sb);
+ free_bit = bitmap_find_next_zero_area(allocations_mem,
+ PMD_SIZE / 2, /* Size */
+ 0, /* Start */
+ 1, /* Number of zeroed bits to look for */
+ 0); /* Alignment mask - none required. */
+ bitmap_set(allocations_mem, free_bit, 1);
+ return free_bit;
+}
+
+void *pkernfs_addr_for_block(struct super_block *sb, int block_idx)
+{
+ return pkernfs_mem + (block_idx * PMD_SIZE);
+}
diff --git a/fs/pkernfs/file.c b/fs/pkernfs/file.c
new file mode 100644
index 000000000000..27a637423178
--- /dev/null
+++ b/fs/pkernfs/file.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "pkernfs.h"
+
+static int truncate(struct inode *inode, loff_t newsize)
+{
+ unsigned long free_block;
+ struct pkernfs_inode *pkernfs_inode;
+ unsigned long *mappings;
+
+ pkernfs_inode = pkernfs_get_persisted_inode(inode->i_sb, inode->i_ino);
+ mappings = (unsigned long *)pkernfs_addr_for_block(inode->i_sb,
+ pkernfs_inode->mappings_block);
+ i_size_write(inode, newsize);
+ for (int block_idx = 0; block_idx * PMD_SIZE < newsize; ++block_idx) {
+ free_block = pkernfs_alloc_block(inode->i_sb);
+ if (free_block <= 0)
+ /* TODO: roll back allocations. */
+ return -ENOMEM;
+ *(mappings + block_idx) = free_block;
+ ++pkernfs_inode->num_mappings;
+ }
+ return 0;
+}
+
+static int inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct iattr *iattr)
+{
+ struct inode *inode = dentry->d_inode;
+ int error;
+
+ error = setattr_prepare(idmap, dentry, iattr);
+ if (error)
+ return error;
+
+ if (iattr->ia_valid & ATTR_SIZE) {
+ error = truncate(inode, iattr->ia_size);
+ if (error)
+ return error;
+ }
+ setattr_copy(idmap, inode, iattr);
+ mark_inode_dirty(inode);
+ return 0;
+}
+
+const struct inode_operations pkernfs_file_inode_operations = {
+ .setattr = inode_setattr,
+ .getattr = simple_getattr,
+};
+
+const struct file_operations pkernfs_file_fops = {
+ .owner = THIS_MODULE,
+ .iterate_shared = NULL,
+};
diff --git a/fs/pkernfs/inode.c b/fs/pkernfs/inode.c
index f6584c8b8804..7fe4e7b220cc 100644
--- a/fs/pkernfs/inode.c
+++ b/fs/pkernfs/inode.c
@@ -15,14 +15,28 @@ struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int in

struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino)
{
+ struct pkernfs_inode *pkernfs_inode;
struct inode *inode = iget_locked(sb, ino);

/* If this inode is cached it is already populated; just return */
if (!(inode->i_state & I_NEW))
return inode;
- inode->i_op = &pkernfs_dir_inode_operations;
+ pkernfs_inode = pkernfs_get_persisted_inode(sb, ino);
inode->i_sb = sb;
- inode->i_mode = S_IFREG;
+ if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_DIR) {
+ inode->i_op = &pkernfs_dir_inode_operations;
+ inode->i_mode = S_IFDIR;
+ } else {
+ inode->i_op = &pkernfs_file_inode_operations;
+ inode->i_mode = S_IFREG;
+ inode->i_fop = &pkernfs_file_fops;
+ }
+
+ inode->i_atime = inode->i_mtime = current_time(inode);
+ inode_set_ctime_current(inode);
+ set_nlink(inode, 1);
+
+ /* Switch based on file type */
unlock_new_inode(inode);
return inode;
}
@@ -79,6 +93,8 @@ static int pkernfs_create(struct mnt_idmap *id, struct inode *dir,
pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino = free_inode;
strscpy(pkernfs_inode->filename, dentry->d_name.name, PKERNFS_FILENAME_LEN);
pkernfs_inode->flags = PKERNFS_INODE_FLAG_FILE;
+ pkernfs_inode->mappings_block = pkernfs_alloc_block(dir->i_sb);
+ memset(pkernfs_addr_for_block(dir->i_sb, pkernfs_inode->mappings_block), 0, (2 << 20));

vfs_inode = pkernfs_inode_get(dir->i_sb, free_inode);
d_instantiate(dentry, vfs_inode);
@@ -90,6 +106,7 @@ static struct dentry *pkernfs_lookup(struct inode *dir,
unsigned int flags)
{
struct pkernfs_inode *pkernfs_inode;
+ struct inode *vfs_inode;
unsigned long ino;

pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino);
@@ -97,7 +114,10 @@ static struct dentry *pkernfs_lookup(struct inode *dir,
while (ino) {
pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, ino);
if (!strncmp(pkernfs_inode->filename, dentry->d_name.name, PKERNFS_FILENAME_LEN)) {
- d_add(dentry, pkernfs_inode_get(dir->i_sb, ino));
+ vfs_inode = pkernfs_inode_get(dir->i_sb, ino);
+ mark_inode_dirty(dir);
+ dir->i_atime = current_time(dir);
+ d_add(dentry, vfs_inode);
break;
}
ino = pkernfs_inode->sibling_ino;
@@ -146,3 +166,4 @@ const struct inode_operations pkernfs_dir_inode_operations = {
.lookup = pkernfs_lookup,
.unlink = pkernfs_unlink,
};
+
diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h
index 4655780f31f2..8b4fee8c5b2e 100644
--- a/fs/pkernfs/pkernfs.h
+++ b/fs/pkernfs/pkernfs.h
@@ -34,8 +34,15 @@ struct pkernfs_inode {
};

void pkernfs_initialise_inode_store(struct super_block *sb);
+
void pkernfs_zero_allocations(struct super_block *sb);
+unsigned long pkernfs_alloc_block(struct super_block *sb);
struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino);
+void *pkernfs_addr_for_block(struct super_block *sb, int block_idx);
+
struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino);

+
extern const struct file_operations pkernfs_dir_fops;
+extern const struct file_operations pkernfs_file_fops;
+extern const struct inode_operations pkernfs_file_inode_operations;
--
2.40.1


2024-02-05 12:25:27

by Gowans, James

Subject: [RFC 06/18] init: Add liveupdate cmdline param

This will allow other subsystems to know when we're doing a live update
and hence when they should be restoring rather than reinitialising state.
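For illustration, here is a user-space sketch of the flag's semantics: the bare presence of the token on the kernel command line turns the boolean on, and consumers branch on it. This is a simplified stand-in (all names hypothetical); the real kernel wires the flag up with early_param() as in the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool liveupdate;

/*
 * Simplified stand-in for early_param("liveupdate", ...): the flag is
 * set if a "liveupdate" token appears on the command line.
 */
static void parse_boot_cmdline(const char *cmdline)
{
	const char *p = cmdline;

	liveupdate = false;
	while ((p = strstr(p, "liveupdate"))) {
		bool starts_token = (p == cmdline || p[-1] == ' ');
		char next = p[strlen("liveupdate")];

		if (starts_token && (next == '\0' || next == ' ' || next == '='))
			liveupdate = true;
		p += strlen("liveupdate");
	}
}

/* A hypothetical subsystem would then branch on the flag: */
static const char *subsys_init_mode(void)
{
	return liveupdate ? "restore" : "reinitialise";
}
```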
---
include/linux/init.h | 1 +
init/main.c | 10 ++++++++++
2 files changed, 11 insertions(+)

diff --git a/include/linux/init.h b/include/linux/init.h
index 266c3e1640d4..d7c68c7bfaf0 100644
--- a/include/linux/init.h
+++ b/include/linux/init.h
@@ -146,6 +146,7 @@ extern int do_one_initcall(initcall_t fn);
extern char __initdata boot_command_line[];
extern char *saved_command_line;
extern unsigned int saved_command_line_len;
+extern bool liveupdate;
extern unsigned int reset_devices;

/* used by init/main.c */
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..7807a56c3473 100644
--- a/init/main.c
+++ b/init/main.c
@@ -165,6 +165,16 @@ static char *ramdisk_execute_command = "/init";
bool static_key_initialized __read_mostly;
EXPORT_SYMBOL_GPL(static_key_initialized);

+bool liveupdate __read_mostly;
+EXPORT_SYMBOL(liveupdate);
+
+static int __init set_liveupdate(char *param)
+{
+ liveupdate = true;
+ return 0;
+}
+early_param("liveupdate", set_liveupdate);
+
/*
* If set, this is an indication to the drivers that reset the underlying
* device before going ahead with the initialization otherwise driver might
--
2.40.1


2024-02-05 12:27:05

by Gowans, James

Subject: [RFC 09/18] intel-iommu: Use pkernfs for root/context pgtable pages

The previous commits were preparation for using pkernfs memory for IOMMU
pgtables: a file in the filesystem is available and an allocator to
allocate 4-KiB pages from that file is available.

Now use those to actually back root and context pgtable pages with
pkernfs memory. If pkernfs is enabled then a "region" (physical and
virtual memory chunk) is fetched from pkernfs and used to drive the
allocator. Should this rather just be a pointer to a pkernfs inode? That
abstraction seems leaky, but without the ability to store struct files
at this point it's probably the more accurate representation.

The freeing still needs to be hooked into the allocator...
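As a rough illustration of what "a region driving the allocator" means, here is a self-contained user-space sketch of handing out zeroed 4 KiB pages from a fixed 2 MiB region via a bitmap. All names here are hypothetical; the real allocator lives in the pkernfs/IOMMU glue and tracks allocations inside the persisted region itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REGION_BYTES (2UL << 20)		/* one 2 MiB block */
#define PAGE_SZ 4096UL
#define NPAGES (REGION_BYTES / PAGE_SZ)		/* 512 pages */

struct region {
	uint8_t bitmap[NPAGES / 8];		/* 1 bit per 4 KiB page */
	uint8_t pages[NPAGES][PAGE_SZ];
};

/* Demonstration instance; static so the 2 MiB doesn't live on the stack. */
static struct region test_region;

/* Return the first free page, zeroed, or NULL if the region is exhausted. */
static void *region_alloc_page(struct region *r)
{
	for (size_t i = 0; i < NPAGES; i++) {
		if (!(r->bitmap[i / 8] & (1u << (i % 8)))) {
			r->bitmap[i / 8] |= 1u << (i % 8);
			memset(r->pages[i], 0, PAGE_SZ);
			return r->pages[i];
		}
	}
	return NULL;
}
```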
---
drivers/iommu/intel/iommu.c | 24 ++++++++++++++++++++----
drivers/iommu/intel/iommu.h | 2 ++
2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 744e4e6b8d72..2dd3f055dbce 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -19,6 +19,7 @@
#include <linux/memory.h>
#include <linux/pci.h>
#include <linux/pci-ats.h>
+#include <linux/pkernfs.h>
#include <linux/spinlock.h>
#include <linux/syscore_ops.h>
#include <linux/tboot.h>
@@ -28,6 +29,7 @@
#include "../dma-iommu.h"
#include "../irq_remapping.h"
#include "../iommu-sva.h"
+#include "../pgtable_alloc.h"
#include "pasid.h"
#include "cap_audit.h"
#include "perfmon.h"
@@ -617,7 +619,12 @@ struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
if (!alloc)
return NULL;

- context = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
+ if (pkernfs_enabled())
+ iommu_alloc_page_from_region(
+ &iommu->pkernfs_region,
+ (void **) &context, NULL);
+ else
+ context = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
if (!context)
return NULL;

@@ -1190,7 +1197,15 @@ static int iommu_alloc_root_entry(struct intel_iommu *iommu)
{
struct root_entry *root;

- root = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
+ if (pkernfs_enabled()) {
+ pkernfs_alloc_iommu_root_pgtables(&iommu->pkernfs_region);
+ root = pgtable_get_root_page(
+ &iommu->pkernfs_region,
+ liveupdate);
+ } else {
+ root = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
+ }
+
if (!root) {
pr_err("Allocating root entry for %s failed\n",
iommu->name);
@@ -2790,7 +2805,7 @@ static int __init init_dmars(void)

init_translation_status(iommu);

- if (translation_pre_enabled(iommu) && !is_kdump_kernel()) {
+ if (translation_pre_enabled(iommu) && !is_kdump_kernel() && !liveupdate) {
iommu_disable_translation(iommu);
clear_translation_pre_enabled(iommu);
pr_warn("Translation was enabled for %s but we are not in kdump mode\n",
@@ -2806,7 +2821,8 @@ static int __init init_dmars(void)
if (ret)
goto free_iommu;

- if (translation_pre_enabled(iommu)) {
+ /* For the live update case restore pgtables, don't copy */
+ if (translation_pre_enabled(iommu) && !liveupdate) {
pr_info("Translation already enabled - trying to copy translation structures\n");

ret = copy_translation_tables(iommu);
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index e6a3e7065616..a2338e398ba3 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -22,6 +22,7 @@
#include <linux/bitfield.h>
#include <linux/xarray.h>
#include <linux/perf_event.h>
+#include <linux/pkernfs.h>

#include <asm/cacheflush.h>
#include <asm/iommu.h>
@@ -672,6 +673,7 @@ struct intel_iommu {
unsigned long *copied_tables; /* bitmap of copied tables */
spinlock_t lock; /* protect context, domain ids */
struct root_entry *root_entry; /* virtual address */
+ struct pkernfs_region pkernfs_region;

struct iommu_flush flush;
#endif
--
2.40.1


2024-02-05 12:27:21

by Gowans, James

Subject: [RFC 10/18] iommu/intel: zap context table entries on kexec

In the next commit the IOMMU shutdown function will be modified to not
actually shut down the IOMMU when doing a kexec. To prevent leaving DMA
mappings for non-persistent devices around during kexec we add a
function to the kexec flow which iterates through all IOMMU domains and
zaps the context entries for the devices belonging to those domains.

A list of domains for the IOMMU is added and maintained.
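One concrete detail worth illustrating: the context-cache flush in the zap path addresses a device by its PCI source-id, composed from the bus number and devfn. A tiny user-space sketch (helper name hypothetical, expression matching the one in the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * The context-cache flush addresses a device by its PCI source-id:
 * bus number in the high byte, devfn in the low byte, matching the
 * (((u16)bus) << 8) | devfn expression used in the flush_context call.
 */
static uint16_t pci_source_id(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)((uint16_t)bus << 8) | devfn;
}
```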
---
drivers/iommu/intel/dmar.c | 1 +
drivers/iommu/intel/iommu.c | 34 ++++++++++++++++++++++++++++++----
drivers/iommu/intel/iommu.h | 2 ++
3 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 23cb80d62a9a..00f69f40a4ac 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1097,6 +1097,7 @@ static int alloc_iommu(struct dmar_drhd_unit *drhd)
iommu->segment = drhd->segment;

iommu->node = NUMA_NO_NODE;
+ INIT_LIST_HEAD(&iommu->domains);

ver = readl(iommu->reg + DMAR_VER_REG);
pr_info("%s: reg_base_addr %llx ver %d:%d cap %llx ecap %llx\n",
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 2dd3f055dbce..315c6b7f901c 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1831,6 +1831,7 @@ static int domain_attach_iommu(struct dmar_domain *domain,
goto err_clear;
}
domain_update_iommu_cap(domain);
+ list_add(&domain->domains, &iommu->domains);

spin_unlock(&iommu->lock);
return 0;
@@ -3608,6 +3609,33 @@ static void intel_disable_iommus(void)
iommu_disable_translation(iommu);
}

+void zap_context_table_entries(struct intel_iommu *iommu)
+{
+ struct context_entry *context;
+ struct dmar_domain *domain;
+ struct device_domain_info *device;
+ u16 did_old;
+
+ list_for_each_entry(domain, &iommu->domains, domains) {
+ list_for_each_entry(device, &domain->devices, link) {
+ context = iommu_context_addr(iommu, device->bus, device->devfn, 0);
+ if (!context || !context_present(context))
+ continue;
+ did_old = context_domain_id(context);
+ context_clear_entry(context);
+ __iommu_flush_cache(iommu, context, sizeof(*context));
+ iommu->flush.flush_context(iommu,
+ did_old,
+ (((u16)device->bus) << 8) | device->devfn,
+ DMA_CCMD_MASK_NOBIT,
+ DMA_CCMD_DEVICE_INVL);
+ iommu->flush.flush_iotlb(iommu, did_old, 0, 0,
+ DMA_TLB_DSI_FLUSH);
+ }
+ }
+}
+
void intel_iommu_shutdown(void)
{
struct dmar_drhd_unit *drhd;
@@ -3620,10 +3648,8 @@ void intel_iommu_shutdown(void)

/* Disable PMRs explicitly here. */
for_each_iommu(iommu, drhd)
- iommu_disable_protect_mem_regions(iommu);
-
- /* Make sure the IOMMUs are switched off */
- intel_disable_iommus();
+ zap_context_table_entries(iommu);

up_write(&dmar_global_lock);
}
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index a2338e398ba3..4a2f163a86f3 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -600,6 +600,7 @@ struct dmar_domain {
spinlock_t lock; /* Protect device tracking lists */
struct list_head devices; /* all devices' list */
struct list_head dev_pasids; /* all attached pasids */
+ struct list_head domains; /* link in this IOMMU's domains list */

struct dma_pte *pgd; /* virtual address */
int gaw; /* max guest address width */
@@ -700,6 +701,7 @@ struct intel_iommu {
void *perf_statistic;

struct iommu_pmu *pmu;
+ struct list_head domains; /* all struct dmar_domains on this IOMMU */
};

/* PCI domain-device relationship */
--
2.40.1


2024-02-05 12:27:40

by Gowans, James

Subject: [RFC 13/18] vfio: add ioctl to define persistent pgtables on container

The previous commits added a file type in pkernfs for IOMMU persistent
page tables. Now support actually setting persistent page tables on an
IOMMU domain. This is done via a VFIO ioctl on a VFIO container.

Userspace needs to create and open an IOMMU persistent page tables file
and then supply that fd to the new VFIO_CONTAINER_SET_PERSISTENT_PGTABLES
ioctl. That ioctl sets the supplied struct file on the
struct vfio_container. Later, when the IOMMU domain is allocated by VFIO,
VFIO checks whether persistent pagetables have been defined and, if they
have, uses the iommu_domain_alloc_persistent API introduced in the
previous commit to pass the struct file down to the IOMMU, which will
actually use it for page tables.

After kexec userspace needs to open the same IOMMU page table file and
set it again via the same ioctl so that the IOMMU continues to use the
same memory region for its page tables for that domain.
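For illustration, a user-space style sketch of the checks the new ioctl handler performs. This is not kernel code: `fd_is_pkernfs_pgtables` is a hypothetical stand-in for the kernel's fget() + pkernfs_is_iommu_domain_pgtables() pair, and the struct simply mirrors the new uapi definition in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirror of the new uapi structure added in this patch. */
struct vfio_set_persistent_pgtables {
	uint32_t persistent_pgtables_fd;
};

/* Hypothetical predicate used only for this sketch. */
static int accept_fd7(int fd)
{
	return fd == 7;
}

/*
 * Sketch of the handler's checks: look up the fd and verify it refers
 * to a pkernfs IOMMU-pgtables file before stashing it on the container.
 */
static int set_persistent_pgtables(int fd, int (*fd_is_pkernfs_pgtables)(int),
				   int *stashed_fd)
{
	if (fd < 0 || !fd_is_pkernfs_pgtables(fd))
		return -EBADF;
	*stashed_fd = fd;
	return 0;
}
```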
---
drivers/vfio/container.c | 27 +++++++++++++++++++++++++++
drivers/vfio/vfio.h | 2 ++
drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++++++++++++++++--
include/uapi/linux/vfio.h | 9 +++++++++
4 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/container.c b/drivers/vfio/container.c
index d53d08f16973..b60fcbf7bad0 100644
--- a/drivers/vfio/container.c
+++ b/drivers/vfio/container.c
@@ -11,6 +11,7 @@
#include <linux/iommu.h>
#include <linux/miscdevice.h>
#include <linux/vfio.h>
+#include <linux/pkernfs.h>
#include <uapi/linux/vfio.h>

#include "vfio.h"
@@ -21,6 +22,7 @@ struct vfio_container {
struct rw_semaphore group_lock;
struct vfio_iommu_driver *iommu_driver;
void *iommu_data;
+ struct file *persistent_pgtables;
bool noiommu;
};

@@ -306,6 +308,8 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container,
continue;
}

+ if (driver->ops->set_persistent_pgtables)
+ driver->ops->set_persistent_pgtables(data, container->persistent_pgtables);
+
ret = __vfio_container_attach_groups(container, driver, data);
if (ret) {
driver->ops->release(data);
@@ -324,6 +328,26 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container,
return ret;
}

+static int vfio_ioctl_set_persistent_pgtables(struct vfio_container *container,
+ unsigned long arg)
+{
+ struct vfio_set_persistent_pgtables set_ppts;
+ struct file *ppts;
+
+ if (copy_from_user(&set_ppts, (void __user *)arg, sizeof(set_ppts)))
+ return -EFAULT;
+
+ ppts = fget(set_ppts.persistent_pgtables_fd);
+ if (!ppts)
+ return -EBADF;
+ if (!pkernfs_is_iommu_domain_pgtables(ppts)) {
+ fput(ppts);
+ return -EBADF;
+ }
+ container->persistent_pgtables = ppts;
+ return 0;
+}
+
static long vfio_fops_unl_ioctl(struct file *filep,
unsigned int cmd, unsigned long arg)
{
@@ -345,6 +369,9 @@ static long vfio_fops_unl_ioctl(struct file *filep,
case VFIO_SET_IOMMU:
ret = vfio_ioctl_set_iommu(container, arg);
break;
+ case VFIO_CONTAINER_SET_PERSISTENT_PGTABLES:
+ ret = vfio_ioctl_set_persistent_pgtables(container, arg);
+ break;
default:
driver = container->iommu_driver;
data = container->iommu_data;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 307e3f29b527..6fa301bf6474 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -226,6 +226,8 @@ struct vfio_iommu_driver_ops {
void *data, size_t count, bool write);
struct iommu_domain *(*group_iommu_domain)(void *iommu_data,
struct iommu_group *group);
+ int (*set_persistent_pgtables)(void *iommu_data,
+ struct file *ppts);
};

struct vfio_iommu_driver {
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index eacd6ec04de5..b36edfc5c9ef 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -75,6 +75,7 @@ struct vfio_iommu {
bool nesting;
bool dirty_page_tracking;
struct list_head emulated_iommu_groups;
+ struct file *persistent_pgtables;
};

struct vfio_domain {
@@ -2143,9 +2144,14 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,

static int vfio_iommu_domain_alloc(struct device *dev, void *data)
{
+ /* data is an in pointer to PPTs, and an out to the new domain. */
+ struct file *ppts = *(struct file **) data;
struct iommu_domain **domain = data;

- *domain = iommu_domain_alloc(dev->bus);
+ if (ppts)
+ *domain = iommu_domain_alloc_persistent(dev->bus, ppts);
+ else
+ *domain = iommu_domain_alloc(dev->bus);
return 1; /* Don't iterate */
}

@@ -2156,6 +2162,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
struct vfio_iommu_group *group;
struct vfio_domain *domain, *d;
bool resv_msi;
+ /* In/out ptr to iommu_domain_alloc. */
+ void *domain_alloc_data;
phys_addr_t resv_msi_base = 0;
struct iommu_domain_geometry *geo;
LIST_HEAD(iova_copy);
@@ -2203,8 +2211,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
* want to iterate beyond the first device (if any).
*/
ret = -EIO;
- iommu_group_for_each_dev(iommu_group, &domain->domain,
+ /* Smuggle the PPTs in the data field; it will be clobbered with the new domain */
+ domain_alloc_data = iommu->persistent_pgtables;
+ iommu_group_for_each_dev(iommu_group, &domain_alloc_data,
vfio_iommu_domain_alloc);
+ domain->domain = domain_alloc_data;
+
if (!domain->domain)
goto out_free_domain;

@@ -3165,6 +3177,16 @@ vfio_iommu_type1_group_iommu_domain(void *iommu_data,
return domain;
}

+static int vfio_iommu_type1_set_persistent_pgtables(void *iommu_data,
+ struct file *ppts)
+{
+ struct vfio_iommu *iommu = iommu_data;
+
+ iommu->persistent_pgtables = ppts;
+ return 0;
+}
+
static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
.name = "vfio-iommu-type1",
.owner = THIS_MODULE,
@@ -3179,6 +3201,7 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
.unregister_device = vfio_iommu_type1_unregister_device,
.dma_rw = vfio_iommu_type1_dma_rw,
.group_iommu_domain = vfio_iommu_type1_group_iommu_domain,
+ .set_persistent_pgtables = vfio_iommu_type1_set_persistent_pgtables,
};

static int __init vfio_iommu_type1_init(void)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index afc1369216d9..fa9676bb4b26 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1797,6 +1797,15 @@ struct vfio_iommu_spapr_tce_remove {
};
#define VFIO_IOMMU_SPAPR_TCE_REMOVE _IO(VFIO_TYPE, VFIO_BASE + 20)

+struct vfio_set_persistent_pgtables {
+ /*
+ * File descriptor for a pkernfs IOMMU pgtables
+ * file to be used for persistence.
+ */
+ __u32 persistent_pgtables_fd;
+};
+#define VFIO_CONTAINER_SET_PERSISTENT_PGTABLES _IO(VFIO_TYPE, VFIO_BASE + 21)
+
/* ***************************************************************** */

#endif /* _UAPIVFIO_H */
--
2.40.1


2024-02-05 12:27:48

by Gowans, James

Subject: [RFC 16/18] vfio: support not mapping IOMMU pgtables on live-update

When restoring VMs after live update kexec, the IOVAs for the guest VM
are already present in the persisted page tables. It is unnecessary to
clobber the existing pgtable entries and it may introduce races if
pgtable modifications happen concurrently with DMA.

Provide a new VFIO MAP_DMA flag which userspace can supply to inform
VFIO that the IOVAs are already mapped. In this case VFIO will skip over
the call to the IOMMU driver to do the mapping. VFIO still needs the
MAP_DMA ioctl to set up its internal data structures about the mapping.

It would probably be better to move the persistence one layer up and
persist the VFIO container in pkernfs. That way the whole container
could be picked up and re-used without needing to do any MAP_DMA ioctls
after kexec.
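A small user-space sketch of the flag validation this patch extends: MAP_DMA rejects any flag bits outside the accepted mask, which now grows to include the new LIVE_UPDATE bit. The constants mirror the uapi values; the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_DMA_MAP_FLAG_READ        (1u << 0)
#define VFIO_DMA_MAP_FLAG_WRITE       (1u << 1)
#define VFIO_DMA_MAP_FLAG_VADDR       (1u << 2)
#define VFIO_DMA_MAP_FLAG_LIVE_UPDATE (1u << 3)

/*
 * Mirrors the mask check in vfio_iommu_type1_map_dma(): any flag bit
 * outside the accepted mask fails validation.
 */
static int map_dma_flags_valid(uint32_t flags)
{
	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
			VFIO_DMA_MAP_FLAG_VADDR | VFIO_DMA_MAP_FLAG_LIVE_UPDATE;

	return (flags & ~mask) == 0;
}
```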
---
drivers/vfio/vfio_iommu_type1.c | 24 +++++++++++++-----------
include/uapi/linux/vfio.h | 1 +
2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b36edfc5c9ef..dc2682fbda2e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1456,7 +1456,7 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
}

static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
- size_t map_size)
+ size_t map_size, unsigned int flags)
{
dma_addr_t iova = dma->iova;
unsigned long vaddr = dma->vaddr;
@@ -1479,14 +1479,16 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
break;
}

- /* Map it! */
- ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
- dma->prot);
- if (ret) {
- vfio_unpin_pages_remote(dma, iova + dma->size, pfn,
- npage, true);
- vfio_batch_unpin(&batch, dma);
- break;
+ if (!(flags & VFIO_DMA_MAP_FLAG_LIVE_UPDATE)) {
+ /* Map it! */
+ ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
+ dma->prot);
+ if (ret) {
+ vfio_unpin_pages_remote(dma, iova + dma->size, pfn,
+ npage, true);
+ vfio_batch_unpin(&batch, dma);
+ break;
+ }
}

size -= npage << PAGE_SHIFT;
@@ -1662,7 +1664,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
if (list_empty(&iommu->domain_list))
dma->size = size;
else
- ret = vfio_pin_map_dma(iommu, dma, size);
+ ret = vfio_pin_map_dma(iommu, dma, size, map->flags);

if (!ret && iommu->dirty_page_tracking) {
ret = vfio_dma_bitmap_alloc(dma, pgsize);
@@ -2836,7 +2838,7 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
struct vfio_iommu_type1_dma_map map;
unsigned long minsz;
uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
- VFIO_DMA_MAP_FLAG_VADDR;
+ VFIO_DMA_MAP_FLAG_VADDR | VFIO_DMA_MAP_FLAG_LIVE_UPDATE;

minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index fa9676bb4b26..d04d28e52110 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1536,6 +1536,7 @@ struct vfio_iommu_type1_dma_map {
#define VFIO_DMA_MAP_FLAG_READ (1 << 0) /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */
#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2)
+#define VFIO_DMA_MAP_FLAG_LIVE_UPDATE (1 << 3) /* IOVAs already mapped in IOMMU before LU */
__u64 vaddr; /* Process virtual address */
__u64 iova; /* IO virtual address */
__u64 size; /* Size of mapping (bytes) */
--
2.40.1


2024-02-05 12:28:06

by Gowans, James

Subject: [RFC 11/18] dma-iommu: Always enable deferred attaches for liveupdate

Seeing as translations are pre-enabled, all devices will be set for
deferred attach. The deferred attach actually has to be done when
doing DMA mapping for devices to work.

There may be a better way to do this, for example by consulting the
context entry table and only deferring attach if there is a persisted
context table entry for this device.
---
drivers/iommu/dma-iommu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index e5d087bd6da1..76f916848f48 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1750,7 +1750,7 @@ void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg)

static int iommu_dma_init(void)
{
- if (is_kdump_kernel())
+ if (is_kdump_kernel() || liveupdate)
static_branch_enable(&iommu_deferred_attach_enabled);

return iova_cache_get();
--
2.40.1


2024-02-05 12:32:14

by Gowans, James

Subject: [RFC 07/18] pkernfs: Add file type for IOMMU root pgtables

So far pkernfs is able to hold regular files for userspace to mmap
and in which to store persisted data. Now begin the IOMMU integration
for persistent IOMMU pgtables.

A new type of inode is created for an IOMMU data directory. A new type
of inode is also created for a file which holds the IOMMU root pgtables.
The inode types are specified by flags on the inodes. Different inode
ops are also registered on the IOMMU pgtables file to ensure that
userspace can't access it. These IOMMU directory and data inodes are
created lazily: pkernfs_alloc_iommu_root_pgtables() scans for these and
returns them if they already exist (ie: after kexec) or creates them if
they don't exist (ie: cold boot).

The data in the IOMMU root pgtables file needs to be accessible early in
system boot: before filesystems are initialised and before anything is
mounted. To support this the pkernfs initialisation code is split out
into an pkernfs_init() function which is responsible for making the
pkernfs memory available. Here the filesystem abstraction starts to
creak: the pkernfs functions responsible for the IOMMU pgtables files
manipulate persisted inodes directly. It may be preferable to somehow
get pkernfs mounted early in system boot before it's needed by the IOMMU
so that filesystem paths can be used, but it is unclear if that's
possible.

The need for super blocks in the pkernfs functions has been limited so
far: super blocks are barely used because the pkernfs extents are stored
as global variables in pkernfs.c. Now NULLs are actually supplied to
functions which take a super block. This is also not pretty; this code
should probably rather plumb some sort of wrapper around the persisted
super block, which would allow supporting multiple mount points.

Additionally, the memory backing the IOMMU root pgtable file is mapped
into the direct map by registering it as a device. This is needed
because the IOMMU does phys_to_virt in a few places when traversing the
pgtables so the direct map virtual address should be populated. The
alternative would be to replace all of the phys_to_virt calls in the
IOMMU driver with wrappers which understand if the phys_addr is part of
a pkernfs file.

The next commit will use this pkernfs file for root pgtables.
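The lazy find-or-create walk over the sibling-linked inode list can be sketched in user space as follows. Simplified and with hypothetical names; the real code operates on persisted pkernfs inodes via pkernfs_get_persisted_inode()/pkernfs_allocate_inode(), but the walk-then-prepend pattern is the same one used for the "iommu" and "root-pgtables" inodes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FILENAME_LEN 255
#define MAX_INODES 16

/* Simplified persisted inode: children form a singly linked sibling list. */
struct pinode {
	char filename[FILENAME_LEN];
	unsigned long sibling_ino;
	unsigned long child_ino;
};

static struct pinode inodes[MAX_INODES];	/* ino 0 = unused, 1 = root */
static unsigned long next_free_ino = 2;

/*
 * Find a child of @dir_ino by name, creating it at the head of the
 * sibling list if absent.
 */
static unsigned long find_or_create(unsigned long dir_ino, const char *name)
{
	unsigned long ino = inodes[dir_ino].child_ino;

	for (; ino; ino = inodes[ino].sibling_ino)
		if (!strncmp(inodes[ino].filename, name, FILENAME_LEN))
			return ino;

	/* Not found: allocate and prepend to the directory's child list. */
	ino = next_free_ino++;
	strncpy(inodes[ino].filename, name, FILENAME_LEN - 1);
	inodes[ino].sibling_ino = inodes[dir_ino].child_ino;
	inodes[dir_ino].child_ino = ino;
	return ino;
}
```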
---
fs/pkernfs/Makefile | 2 +-
fs/pkernfs/inode.c | 17 +++++--
fs/pkernfs/iommu.c | 98 +++++++++++++++++++++++++++++++++++++++++
fs/pkernfs/pkernfs.c | 38 ++++++++++------
fs/pkernfs/pkernfs.h | 7 +++
include/linux/pkernfs.h | 36 +++++++++++++++
6 files changed, 181 insertions(+), 17 deletions(-)
create mode 100644 fs/pkernfs/iommu.c
create mode 100644 include/linux/pkernfs.h

diff --git a/fs/pkernfs/Makefile b/fs/pkernfs/Makefile
index e41f06cc490f..7f0f7a4cd3a1 100644
--- a/fs/pkernfs/Makefile
+++ b/fs/pkernfs/Makefile
@@ -3,4 +3,4 @@
# Makefile for persistent kernel filesystem
#

-obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o allocator.o dir.o file.o
+obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o allocator.o dir.o file.o iommu.o
diff --git a/fs/pkernfs/inode.c b/fs/pkernfs/inode.c
index 7fe4e7b220cc..1d712e0a82a1 100644
--- a/fs/pkernfs/inode.c
+++ b/fs/pkernfs/inode.c
@@ -25,11 +25,18 @@ struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino)
inode->i_sb = sb;
if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_DIR) {
inode->i_op = &pkernfs_dir_inode_operations;
+ inode->i_fop = &pkernfs_dir_fops;
inode->i_mode = S_IFDIR;
- } else {
+ } else if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_FILE) {
inode->i_op = &pkernfs_file_inode_operations;
- inode->i_mode = S_IFREG;
inode->i_fop = &pkernfs_file_fops;
+ inode->i_mode = S_IFREG;
+ } else if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_IOMMU_DIR) {
+ inode->i_op = &pkernfs_iommu_dir_inode_operations;
+ inode->i_fop = &pkernfs_dir_fops;
+ inode->i_mode = S_IFDIR;
+ } else if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES) {
+ inode->i_mode = S_IFREG;
}

inode->i_atime = inode->i_mtime = current_time(inode);
@@ -41,7 +48,7 @@ struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino)
return inode;
}

-static unsigned long pkernfs_allocate_inode(struct super_block *sb)
+unsigned long pkernfs_allocate_inode(struct super_block *sb)
{

unsigned long next_free_ino;
@@ -167,3 +174,7 @@ const struct inode_operations pkernfs_dir_inode_operations = {
.unlink = pkernfs_unlink,
};

+const struct inode_operations pkernfs_iommu_dir_inode_operations = {
+ .lookup = pkernfs_lookup,
+};
+
diff --git a/fs/pkernfs/iommu.c b/fs/pkernfs/iommu.c
new file mode 100644
index 000000000000..5bce8146d7bb
--- /dev/null
+++ b/fs/pkernfs/iommu.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "pkernfs.h"
+#include <linux/io.h>
+
+
+void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region)
+{
+ unsigned long *mappings_block_vaddr;
+ unsigned long inode_idx;
+ struct pkernfs_inode *iommu_pgtables, *iommu_dir = NULL;
+ int rc;
+
+ pkernfs_init();
+
+ /* Try find a 'iommu' directory */
+ inode_idx = pkernfs_get_persisted_inode(NULL, 1)->child_ino;
+ while (inode_idx) {
+ if (!strncmp(pkernfs_get_persisted_inode(NULL, inode_idx)->filename,
+ "iommu", PKERNFS_FILENAME_LEN)) {
+ iommu_dir = pkernfs_get_persisted_inode(NULL, inode_idx);
+ break;
+ }
+ inode_idx = pkernfs_get_persisted_inode(NULL, inode_idx)->sibling_ino;
+ }
+
+ if (!iommu_dir) {
+ unsigned long root_pgtables_ino = 0;
+ unsigned long iommu_dir_ino = pkernfs_allocate_inode(NULL);
+
+ iommu_dir = pkernfs_get_persisted_inode(NULL, iommu_dir_ino);
+ strscpy(iommu_dir->filename, "iommu", PKERNFS_FILENAME_LEN);
+ iommu_dir->flags = PKERNFS_INODE_FLAG_IOMMU_DIR;
+
+ /* Make this the head of the list. */
+ iommu_dir->sibling_ino = pkernfs_get_persisted_inode(NULL, 1)->child_ino;
+ pkernfs_get_persisted_inode(NULL, 1)->child_ino = iommu_dir_ino;
+
+ /* Add a child file for pgtables. */
+ root_pgtables_ino = pkernfs_allocate_inode(NULL);
+ iommu_pgtables = pkernfs_get_persisted_inode(NULL, root_pgtables_ino);
+ strscpy(iommu_pgtables->filename, "root-pgtables", PKERNFS_FILENAME_LEN);
+ iommu_pgtables->sibling_ino = iommu_dir->child_ino;
+ iommu_dir->child_ino = root_pgtables_ino;
+ iommu_pgtables->flags = PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES;
+ iommu_pgtables->mappings_block = pkernfs_alloc_block(NULL);
+ /* TODO: make alloc zero. */
+ memset(pkernfs_addr_for_block(NULL, iommu_pgtables->mappings_block), 0, (2 << 20));
+ } else {
+ inode_idx = iommu_dir->child_ino;
+ while (inode_idx) {
+ if (!strncmp(pkernfs_get_persisted_inode(NULL, inode_idx)->filename,
+ "root-pgtables", PKERNFS_FILENAME_LEN)) {
+ iommu_pgtables = pkernfs_get_persisted_inode(NULL, inode_idx);
+ break;
+ }
+ inode_idx = pkernfs_get_persisted_inode(NULL, inode_idx)->sibling_ino;
+ }
+ }
+
+ /*
+ * For a pkernfs region block, the "mappings_block" field is still
+ * just a block index, but that block doesn't actually contain mappings
+ * it contains the pkernfs_region data
+ */
+
+ mappings_block_vaddr = (unsigned long *)pkernfs_addr_for_block(NULL,
+ iommu_pgtables->mappings_block);
+ set_bit(0, mappings_block_vaddr);
+ pkernfs_region->bytes = PMD_SIZE;
+
+ dev_set_name(&pkernfs_region->dev, "iommu_root_pgtables");
+ rc = device_register(&pkernfs_region->dev);
+ if (rc)
+ pr_err("device_register failed: %i\n", rc);
+
+ pkernfs_region->pgmap.range.start = pkernfs_base +
+ (iommu_pgtables->mappings_block * PMD_SIZE);
+ pkernfs_region->pgmap.range.end =
+ pkernfs_region->pgmap.range.start + PMD_SIZE - 1;
+ pkernfs_region->pgmap.nr_range = 1;
+ pkernfs_region->pgmap.type = MEMORY_DEVICE_GENERIC;
+ pkernfs_region->vaddr =
+ devm_memremap_pages(&pkernfs_region->dev, &pkernfs_region->pgmap);
+ pkernfs_region->paddr = pkernfs_base +
+ (iommu_pgtables->mappings_block * PMD_SIZE);
+}
+
+void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long paddr)
+{
+ if (WARN_ON(paddr >= region->paddr + region->bytes))
+ return NULL;
+ if (WARN_ON(paddr < region->paddr))
+ return NULL;
+ return region->vaddr + (paddr - region->paddr);
+}
diff --git a/fs/pkernfs/pkernfs.c b/fs/pkernfs/pkernfs.c
index f010c2d76c76..2e8c4b0a5807 100644
--- a/fs/pkernfs/pkernfs.c
+++ b/fs/pkernfs/pkernfs.c
@@ -11,12 +11,14 @@ phys_addr_t pkernfs_base, pkernfs_size;
void *pkernfs_mem;
static const struct super_operations pkernfs_super_ops = { };

-static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc)
+void pkernfs_init(void)
{
- struct inode *inode;
- struct dentry *dentry;
+ static int inited;
struct pkernfs_sb *psb;

+ if (inited++)
+ return;
+
pkernfs_mem = memremap(pkernfs_base, pkernfs_size, MEMREMAP_WB);
psb = (struct pkernfs_sb *) pkernfs_mem;

@@ -24,13 +26,21 @@ static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc)
pr_info("pkernfs: Restoring from super block\n");
} else {
pr_info("pkernfs: Clean super block; initialising\n");
- pkernfs_initialise_inode_store(sb);
- pkernfs_zero_allocations(sb);
+ pkernfs_initialise_inode_store(NULL);
+ pkernfs_zero_allocations(NULL);
psb->magic_number = PKERNFS_MAGIC_NUMBER;
- pkernfs_get_persisted_inode(sb, 1)->flags = PKERNFS_INODE_FLAG_DIR;
- strscpy(pkernfs_get_persisted_inode(sb, 1)->filename, ".", PKERNFS_FILENAME_LEN);
+ pkernfs_get_persisted_inode(NULL, 1)->flags = PKERNFS_INODE_FLAG_DIR;
+ strscpy(pkernfs_get_persisted_inode(NULL, 1)->filename, ".", PKERNFS_FILENAME_LEN);
psb->next_free_ino = 2;
}
+}
+
+static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+ struct inode *inode;
+ struct dentry *dentry;
+
+ pkernfs_init();

sb->s_op = &pkernfs_super_ops;

@@ -77,12 +87,9 @@ static struct file_system_type pkernfs_fs_type = {
.fs_flags = FS_USERNS_MOUNT,
};

-static int __init pkernfs_init(void)
+static int __init pkernfs_fs_init(void)
{
- int ret;
-
- ret = register_filesystem(&pkernfs_fs_type);
- return ret;
+ return register_filesystem(&pkernfs_fs_type);
}

/**
@@ -97,7 +104,12 @@ static int __init parse_pkernfs_extents(char *p)
return 0;
}

+bool pkernfs_enabled(void)
+{
+ return !!pkernfs_base;
+}
+
early_param("pkernfs", parse_pkernfs_extents);

MODULE_ALIAS_FS("pkernfs");
-module_init(pkernfs_init);
+module_init(pkernfs_fs_init);
diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h
index 1a7aa783a9be..e1b7ae3fe7f1 100644
--- a/fs/pkernfs/pkernfs.h
+++ b/fs/pkernfs/pkernfs.h
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0-only */

#include <linux/fs.h>
+#include <linux/pkernfs.h>

#define PKERNFS_MAGIC_NUMBER 0x706b65726e6673
#define PKERNFS_FILENAME_LEN 255
@@ -18,6 +19,8 @@ struct pkernfs_sb {
// If neither of these are set the inode is not in use.
#define PKERNFS_INODE_FLAG_FILE (1 << 0)
#define PKERNFS_INODE_FLAG_DIR (1 << 1)
+#define PKERNFS_INODE_FLAG_IOMMU_DIR (1 << 2)
+#define PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES (1 << 3)
struct pkernfs_inode {
int flags;
/*
@@ -31,20 +34,24 @@ struct pkernfs_inode {
*/
unsigned long child_ino;
char filename[PKERNFS_FILENAME_LEN];
+ /* Block index for where the mappings live. */
int mappings_block;
int num_mappings;
};

void pkernfs_initialise_inode_store(struct super_block *sb);
+void pkernfs_init(void);

void pkernfs_zero_allocations(struct super_block *sb);
unsigned long pkernfs_alloc_block(struct super_block *sb);
struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino);
void *pkernfs_addr_for_block(struct super_block *sb, int block_idx);

+unsigned long pkernfs_allocate_inode(struct super_block *sb);
struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino);


extern const struct file_operations pkernfs_dir_fops;
extern const struct file_operations pkernfs_file_fops;
extern const struct inode_operations pkernfs_file_inode_operations;
+extern const struct inode_operations pkernfs_iommu_dir_inode_operations;
diff --git a/include/linux/pkernfs.h b/include/linux/pkernfs.h
new file mode 100644
index 000000000000..0110e4784109
--- /dev/null
+++ b/include/linux/pkernfs.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: MIT */
+
+#ifndef _LINUX_PKERNFS_H
+#define _LINUX_PKERNFS_H
+
+#include <linux/memremap.h>
+#include <linux/device.h>
+
+#ifdef CONFIG_PKERNFS_FS
+extern bool pkernfs_enabled(void);
+#else
+static inline bool pkernfs_enabled(void)
+{
+ return false;
+}
+#endif
+
+/*
+ * This is a light wrapper around the data behind a pkernfs
+ * file. Really it should be a file but the filesystem comes
+ * up too late: IOMMU needs root pgtables before fs is up.
+ */
+struct pkernfs_region {
+ void *vaddr;
+ unsigned long paddr;
+ unsigned long bytes;
+ struct dev_pagemap pgmap;
+ struct device dev;
+};
+
+void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region);
+void pkernfs_alloc_page_from_region(struct pkernfs_region *pkernfs_region,
+ void **vaddr, unsigned long *paddr);
+void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long paddr);
+
+#endif /* _LINUX_PKERNFS_H */
--
2.40.1


2024-02-05 12:33:07

by Gowans, James

Subject: [RFC 15/18] pkernfs: register device memory for IOMMU domain pgtables

Similarly to the root/context pgtables, the IOMMU driver also does
phys_to_virt when walking the domain pgtables. To make this work
properly the physical memory needs to be mapped in at the correct place
in the direct map. Register a memory device to support this.

The alternative would be to wrap all of the phys_to_virt functions in
something which is pkernfs aware.
---
fs/pkernfs/iommu.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/pkernfs/iommu.c b/fs/pkernfs/iommu.c
index 5d0b256e7dd8..073b9dd48237 100644
--- a/fs/pkernfs/iommu.c
+++ b/fs/pkernfs/iommu.c
@@ -9,6 +9,7 @@ void pkernfs_get_region_for_ppts(struct file *ppts, struct pkernfs_region *pkern
struct pkernfs_inode *pkernfs_inode;
unsigned long *mappings_block_vaddr;
unsigned long inode_idx;
+ int rc;

/*
* For a pkernfs region block, the "mappings_block" field is still
@@ -22,7 +23,20 @@ void pkernfs_get_region_for_ppts(struct file *ppts, struct pkernfs_region *pkern
mappings_block_vaddr = (unsigned long *)pkernfs_addr_for_block(NULL,
pkernfs_inode->mappings_block);
set_bit(0, mappings_block_vaddr);
- pkernfs_region->vaddr = mappings_block_vaddr;
+
+ dev_set_name(&pkernfs_region->dev, "vfio-ppt-%s", pkernfs_inode->filename);
+ rc = device_register(&pkernfs_region->dev);
+ if (rc)
+ pr_err("device_register failed: %i\n", rc);
+
+ pkernfs_region->pgmap.range.start = pkernfs_base +
+ (pkernfs_inode->mappings_block * PMD_SIZE);
+ pkernfs_region->pgmap.range.end =
+ pkernfs_region->pgmap.range.start + PMD_SIZE - 1;
+ pkernfs_region->pgmap.nr_range = 1;
+ pkernfs_region->pgmap.type = MEMORY_DEVICE_GENERIC;
+ pkernfs_region->vaddr =
+ devm_memremap_pages(&pkernfs_region->dev, &pkernfs_region->pgmap);
pkernfs_region->paddr = pkernfs_base + (pkernfs_inode->mappings_block * (2 << 20));
}
void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region)
--
2.40.1


2024-02-05 17:11:09

by Alex Williamson

Subject: Re: [RFC 00/18] Pkernfs: Support persistence for live update

On Mon, 5 Feb 2024 12:01:45 +0000
James Gowans <[email protected]> wrote:

> This RFC is to solicit feedback on the approach of implementing support for live
> update via an in-memory filesystem responsible for storing all live update state
> as files in the filesystem.
>
> Hypervisor live update is a mechanism to support updating a hypervisor via kexec
> in a way that has limited impact to running virtual machines. This is done by
> pausing/serialising running VMs, kexec-ing into a new kernel, starting new VMM
> processes and then deserialising/resuming the VMs so that they continue running
> from where they were. Virtual machines can have PCI devices passed through and
> in order to support live update it’s necessary to persist the IOMMU page tables
> so that the devices can continue to do DMA to guest RAM during kexec.
>
> This RFC is a follow-on from a discussion held during LPC 2023 KVM MC
> which explored ways in which the live update problem could be tackled;
> this was one of them:
> https://lpc.events/event/17/contributions/1485/
>
> The approach sketched out in this RFC introduces a new in-memory filesystem,
> pkernfs. Pkernfs takes ownership, separate from the Linux memory
> management system, of RAM which is carved out of the normal MM allocator
> and donated to pkernfs. There are a few things that need to be preserved
> and re-hydrated after kexec to support this, each stored as a file in
> pkernfs:
>
> * Guest memory: to be able to restore the VM its memory must be
> preserved. This is achieved by using a regular file in pkernfs for guest RAM.
> As this guest RAM is not part of the normal linux core mm allocator and
> has no struct pages, it can be removed from the direct map which
> improves security posture for guest RAM. Similar to memfd_secret.
>
> * IOMMU root page tables: for the IOMMU to have any ability to do DMA
> during kexec it needs root page tables to look up per-domain page
> tables. IOMMU root page tables are stored in a special path in pkernfs:
> iommu/root-pgtables. The intel IOMMU driver is modified to hook into
> pkernfs to get the chunk of memory that it can use to allocate root
> pgtables.
>
> * IOMMU domain page tables: in order for VM-initiated DMA operations to
> continue running while kexec is happening the IOVA to PA address
> translations for persisted devices needs to continue to work. Similar to
> root pgtables the per-domain page tables for persisted devices are
> allocated from a pkernfs file so that they are also persisted across
> kexec. This is done by using pkernfs files for IOMMU domain page
> tables. Not all devices are persistent, so VFIO is updated to support
> defining persistent page tables on passed through devices.
>
> * Updates to IOMMU and PCI are needed to make device handover across
> kexec work properly. Although not fully complete some of the changed
> needed around avoiding device re-setting and re-probing are sketched
> in this RFC.
>
> Guest RAM and IOMMU state are just the first two things needed for live update.
> Pkernfs opens the door for other kernel state which can improve kexec or add
> more capabilities to live update to also be persisted as new files.
>
> The main aspect we’re looking for feedback/opinions on here is the concept of
> putting all persistent state in a single filesystem: combining guest RAM and
> IOMMU pgtables in one store. Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> be persisted. Pkernfs does it via a hard separation defined at boot time, other
> approaches could make the carving out of persistent pages dynamic.
>
> Sign-offs are intentionally omitted to make it clear that this is a
> concept sketch RFC and not intended for merging.
>
> On CC are folks who have sent RFCs around this problem space before, as
> well as filesystem, kexec, IOMMU, MM and KVM lists and maintainers.
>
> == Alternatives ==
>
> There have been other RFCs which cover some aspect of the live update problem
> space. So far, all public approaches with KVM neglected device assignment which
> introduces a new dimension of problems. Prior art in this space includes:
>
> 1) Kexec Hand Over (KHO) [0]: This is a generic mechanism to pass kernel state
> across kexec. It also supports specifying persisted memory pages which could be
> used to carve out IOMMU pgtable pages from the new kernel’s buddy allocator.
>
> 2) PKRAM [1]: Tmpfs-style filesystem which dynamically allocates memory which can
> be used for guest RAM and is preserved across kexec by passing a pointer to the
> root page.
>
> 3) DMEMFS [2]: Similar to pkernfs, DMEMFS is a filesystem on top of a reserved
> chunk of memory specified via kernel cmdline parameter. It is not persistent but
> aims to remove the need for struct page overhead.
>
> 4) Kernel memory pools [3, 4]: These provide a mechanism for kernel modules/drivers
> to allocate persistent memory, and restore that memory after kexec. They do
> not attempt to provide the ability to store userspace accessible state or have a
> filesystem interface.
>
> == How to use ==
>
> Use the memmap and pkernfs cmdline args to carve memory out of system RAM and
> donate it to pkernfs. For example to carve out 1 GiB of RAM starting at physical
> offset 1 GiB:
> memmap=1G%1G nopat pkernfs=1G!1G
>
> Mount pkernfs somewhere, for example:
> mount -t pkernfs /mnt/pkernfs
>
> Allocate a file for guest RAM:
> touch /mnt/pkernfs/guest-ram
> truncate -s 100M /mnt/pkernfs/guest-ram
>
> Add QEMU cmdline option to use this as guest RAM:
> -object memory-backend-file,id=pc.ram,size=100M,mem-path=/mnt/pkernfs/guest-ram,share=yes
> -M q35,memory-backend=pc.ram
>
> Allocate a file for IOMMU domain page tables:
> touch /mnt/pkernfs/iommu/dom-0
> truncate -s 2M /mnt/pkernfs/iommu/dom-0
>
> That file must be supplied to VFIO when creating the IOMMU container, via the
> VFIO_CONTAINER_SET_PERSISTENT_PGTABLES ioctl. Example: [4]
>
> After kexec, re-mount pkernfs, re-use those files for guest RAM and IOMMU
> state. When doing DMA mapping specify the additional flag
> VFIO_DMA_MAP_FLAG_LIVE_UPDATE to indicate that IOVAs are set up already.
> Example: [5].
>
> == Limitations ==
>
> This is an RFC design to sketch out the concept so that there can be a discussion
> about the general approach. There are many gaps and hacks; the idea is to keep
> this RFC as simple as possible. Limitations include:
>
> * Needing to supply the physical memory range for pkernfs as a kernel cmdline
> parameter. Better would be to allocate memory for pkernfs dynamically on first
> boot and pass that across kexec. Doing so would require additional integration
> with memblocks and some ability to pass the dynamically allocated ranges
> across. KHO [0] could support this.
>
> * A single filesystem with no support for NUMA awareness. Better would be to
> support multiple named pkernfs mounts which can cover different NUMA nodes.
>
> * Skeletal filesystem code. There’s just enough functionality to make it usable to
> demonstrate the concept of using files for guest RAM and IOMMU state.
>
> * Use-after-frees for IOMMU mappings. Currently nothing stops the pkernfs guest
> RAM files being deleted or resized while IOMMU mappings are set up which would
> allow DMA to freed memory. Better integration with guest RAM files and
> IOMMU/VFIO is necessary.
>
> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.

Note that the vfio container is on a path towards deprecation, this
should be refocused on vfio relative to iommufd. There would need to
be a strong argument for a container/type1 extension to support this,
iommufd would need to be the first class implementation. Thanks,

Alex

> * Inefficient use of filesystem space. Every mappings block is 2 MiB which is both
> wasteful and a hard upper limit on file size.
>
> [0] https://lore.kernel.org/kexec/[email protected]/
> [1] https://lore.kernel.org/kexec/[email protected]/
> [2] https://lkml.org/lkml/2020/12/7/342
> [3] https://lore.kernel.org/all/169645773092.11424.7258549771090599226.stgit@skinsburskii./
> [4] https://lore.kernel.org/all/2023082506-enchanted-tripping-d1d5@gregkh/#t
> [5] https://github.com/jgowans/qemu/commit/e84cfb8186d71f797ef1f72d57d873222a9b479e
> [6] https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
>
>
> James Gowans (18):
> pkernfs: Introduce filesystem skeleton
> pkernfs: Add persistent inodes hooked into directories
> pkernfs: Define an allocator for persistent pages
> pkernfs: support file truncation
> pkernfs: add file mmap callback
> init: Add liveupdate cmdline param
> pkernfs: Add file type for IOMMU root pgtables
> iommu: Add allocator for pgtables from persistent region
> intel-iommu: Use pkernfs for root/context pgtable pages
> iommu/intel: zap context table entries on kexec
> dma-iommu: Always enable deferred attaches for liveupdate
> pkernfs: Add IOMMU domain pgtables file
> vfio: add ioctl to define persistent pgtables on container
> intel-iommu: Allocate domain pgtable pages from pkernfs
> pkernfs: register device memory for IOMMU domain pgtables
> vfio: support not mapping IOMMU pgtables on live-update
> pci: Don't clear bus master if persistence enabled
> vfio-pci: Assume device working after liveupdate
>
> drivers/iommu/Makefile | 1 +
> drivers/iommu/dma-iommu.c | 2 +-
> drivers/iommu/intel/dmar.c | 1 +
> drivers/iommu/intel/iommu.c | 93 +++++++++++++---
> drivers/iommu/intel/iommu.h | 5 +
> drivers/iommu/iommu.c | 22 ++--
> drivers/iommu/pgtable_alloc.c | 43 +++++++
> drivers/iommu/pgtable_alloc.h | 10 ++
> drivers/pci/pci-driver.c | 4 +-
> drivers/vfio/container.c | 27 +++++
> drivers/vfio/pci/vfio_pci_core.c | 20 ++--
> drivers/vfio/vfio.h | 2 +
> drivers/vfio/vfio_iommu_type1.c | 51 ++++++---
> fs/Kconfig | 1 +
> fs/Makefile | 3 +
> fs/pkernfs/Kconfig | 9 ++
> fs/pkernfs/Makefile | 6 +
> fs/pkernfs/allocator.c | 51 +++++++++
> fs/pkernfs/dir.c | 43 +++++++
> fs/pkernfs/file.c | 93 ++++++++++++++++
> fs/pkernfs/inode.c | 185 +++++++++++++++++++++++++++++++
> fs/pkernfs/iommu.c | 163 +++++++++++++++++++++++++++
> fs/pkernfs/pkernfs.c | 115 +++++++++++++++++++
> fs/pkernfs/pkernfs.h | 61 ++++++++++
> include/linux/init.h | 1 +
> include/linux/iommu.h | 6 +-
> include/linux/pkernfs.h | 38 +++++++
> include/uapi/linux/vfio.h | 10 ++
> init/main.c | 10 ++
> 29 files changed, 1029 insertions(+), 47 deletions(-)
> create mode 100644 drivers/iommu/pgtable_alloc.c
> create mode 100644 drivers/iommu/pgtable_alloc.h
> create mode 100644 fs/pkernfs/Kconfig
> create mode 100644 fs/pkernfs/Makefile
> create mode 100644 fs/pkernfs/allocator.c
> create mode 100644 fs/pkernfs/dir.c
> create mode 100644 fs/pkernfs/file.c
> create mode 100644 fs/pkernfs/inode.c
> create mode 100644 fs/pkernfs/iommu.c
> create mode 100644 fs/pkernfs/pkernfs.c
> create mode 100644 fs/pkernfs/pkernfs.h
> create mode 100644 include/linux/pkernfs.h
>


2024-02-05 17:15:23

by Jason Gunthorpe

Subject: Re: [RFC 13/18] vfio: add ioctl to define persistent pgtables on container

On Mon, Feb 05, 2024 at 12:01:58PM +0000, James Gowans wrote:
> The previous commits added a file type in pkernfs for IOMMU persistent
> page tables. Now support actually setting persistent page tables on an
> IOMMU domain. This is done via a VFIO ioctl on a VFIO container.

Please no changes to VFIO in the iommu area, new features need to be
implemented on top of iommufd instead.

We already have the infrastructure there to customize iommu_domain
allocations that this can be implemented on top of.

Jason

2024-02-05 17:22:07

by Jason Gunthorpe

Subject: Re: [RFC 14/18] intel-iommu: Allocate domain pgtable pages from pkernfs

On Mon, Feb 05, 2024 at 12:01:59PM +0000, James Gowans wrote:
> @@ -946,7 +946,13 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
> if (!dma_pte_present(pte)) {
> uint64_t pteval;
>
> - tmp_page = alloc_pgtable_page(domain->nid, gfp);
> + if (domain->pgtables_allocator.vaddr)
> + iommu_alloc_page_from_region(
> + &domain->pgtables_allocator,
> + &tmp_page,
> + NULL);

I'm really worried about this change - I plan to redo all of this page
table handling code so it makes use of struct page members for things
like RCU free and more.

Does this end up making the entire struct page owned by the
filesystem?

Jason

2024-02-05 17:42:59

by Jason Gunthorpe

Subject: Re: [RFC 00/18] Pkernfs: Support persistence for live update

On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:

> The main aspect we’re looking for feedback/opinions on here is the concept of
> putting all persistent state in a single filesystem: combining guest RAM and
> IOMMU pgtables in one store. Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> be persisted. Pkernfs does it via a hard separation defined at boot time, other
> approaches could make the carving out of persistent pages dynamic.

I think if you are going to attempt something like this then the end
result must bring things back to having the same data structures fully
restored.

It is fine that the pkernfs holds some persistent memory that
guarantees the IOMMU can remain programmed and the VM pages can become
fixed across the kexec.

But once the VMM starts to restore it self we need to get back to the
original configuration:
- A mmap that points to the VM's physical pages
- An iommufd IOAS that points to the above mmap
- An iommufd HWPT that represents that same mapping
- An iommu_domain programmed into HW that backs the HWPT

Ie you can't just reboot and leave the IOMMU hanging out in some
undefined land - especially in latest kernels!

For vt-d you need to retain the entire root table and all the required
context entries too. The restarting iommu needs to understand that it
has to "restore" a temporary iommu_domain from the pkernfs.

You can later reconstitute a proper iommu_domain from the VMM and
atomic switch.

So, I'm surprised to see this approach where things just live forever
in the kernfs, I don't see how "restore" is going to work very well
like this.

I would think that a save/restore mentality would make more
sense. For instance you could make a special iommu_domain that is fixed
and lives in the pkernfs. The operation would be to copy from the live
iommu_domain to the fixed one and then replace the iommu HW to the
fixed one.

In the post-kexec world the iommu would recreate that special domain
and point the iommu at it. (copying the root and context descriptions
out of the pkernfs). Then somehow that would get into iommufd and VFIO
so that it could take over that special mapping during its startup.

Then you'd build the normal operating ioas and hwpt (with all the
right page refcounts/etc) then switch to it and free the pkernfs
memory.

It seems a lot less invasive to me. The special case is clearly a
special case and doesn't mess up the normal operation of the drivers.

It becomes more like kdump where the iommu driver is running in a
fairly normal mode, just with some stuff copied from the prior kernel.

Your text spent a lot of time talking about the design of how the pages
persist, which is interesting, but it seems like only a small part of
the problem. Actually using that mechanism in a sane way and cover all
the functional issues in the HW drivers is going to be really
challenging.

> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.

I doubt this.. It probably needs to be much finer grained actually,
otherwise you are going to be serializing everything. Somehow I think
you are better to serialize a minimum and try to reconstruct
everything else in userspace. Like conserving iommufd IDs would be a
huge PITA.

There are also going to be lots of security questions here, like we
can't just let userspace feed in any garbage and violate vfio and
iommu invariants.

Jason

2024-02-05 17:45:39

by Jason Gunthorpe

Subject: Re: [RFC 11/18] dma-iommu: Always enable deferred attaches for liveupdate

On Mon, Feb 05, 2024 at 12:01:56PM +0000, James Gowans wrote:
> Seeing as translations are pre-enabled, all devices will be set for
> deferred attach. The deferred attached actually has to be done when
> doing DMA mapping for devices to work.

This shouldn't be part of your solution..

A DMA API using driver should never be attached to a device that is
being persisted across a kexec. If it does then the device should be
fully reset and re-attached to the normal DMA API iommu domain.

You should be striving to have an iommu_domain to represent the
special post-kexec translation regime.

Jason

2024-02-05 23:36:52

by Dave Chinner

Subject: Re: [RFC 05/18] pkernfs: add file mmap callback

On Mon, Feb 05, 2024 at 12:01:50PM +0000, James Gowans wrote:
> Make the file data useable to userspace by adding mmap. That's all that
> QEMU needs for guest RAM, so that's all we bother implementing for now.
>
> When mmaping the file the VMA is marked as PFNMAP to indicate that there
> are no struct pages for the memory in this VMA. Remap_pfn_range() is
> used to actually populate the page tables. All PTEs are pre-faulted into
> the pgtables at mmap time so that the pgtables are useable when this
> virtual address range is given to VFIO's MAP_DMA.

And so what happens when this file is truncated whilst it is mmap()d
by an application? Ain't that just a great big UAF waiting to be
exploited?

-Dave.
--
Dave Chinner
[email protected]

2024-02-06 14:58:25

by Luca Boccassi

Subject: Re: [RFC 00/18] Pkernfs: Support persistence for live update


> Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing
> arbitrary pages to
> be persisted. Pkernfs does it via a hard separation defined at boot
> time, other
> approaches could make the carving out of persistent pages dynamic.

Speaking from experience here - in Azure (Boost) we have been using
hard-carved out memory areas (DAX devices with ranges configured via
DTB) for persisting state across kexec for ~5 years or so. In a
nutshell: don't, it's a mistake.

It's a constant and consistent source of problems, headaches, issues
and workarounds piled upon workarounds, held together with duct tape
and prayers. It's just not flexible enough for any modern system. For
example, unless _all_ the machines are ridiculously overprovisioned in
terms of memory capacity (and guaranteed to remain so, forever), you
end up wasting enormous amounts of memory.

In Azure we are very much interested in a nice, well-abstracted, first-
class replacement for that setup that allows persisting data across
kexec, and in systemd userspace we'd very much want to use it as well,
but it really, really needs to be dynamic, and avoid the pitfall of
hard-configured carved out chunk.

--
Kind regards,
Luca Boccassi

2024-02-07 14:46:18

by Gowans, James

Subject: Re: [RFC 00/18] Pkernfs: Support persistence for live update

Hi Jason,

Thanks for this great feedback on the approach - it's exactly the sort
of thing we were looking for.

On Mon, 2024-02-05 at 13:42 -0400, Jason Gunthorpe wrote:
>
> On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:
>
> > The main aspect we’re looking for feedback/opinions on here is the concept of
> > putting all persistent state in a single filesystem: combining guest RAM and
> > IOMMU pgtables in one store. Also, the question of a hard separation between
> > persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> > be persisted. Pkernfs does it via a hard separation defined at boot time, other
> > approaches could make the carving out of persistent pages dynamic.
>
> I think if you are going to attempt something like this then the end
> result must bring things back to having the same data structures fully
> restored.
>
> It is fine that the pkernfs holds some persistent memory that
> guarantees the IOMMU can remain programmed and the VM pages can become
> fixed across the kexec.
>
> But once the VMM starts to restore it self we need to get back to the
> original configuration:
> - A mmap that points to the VM's physical pages
> - An iommufd IOAS that points to the above mmap
> - An iommufd HWPT that represents that same mapping
> - An iommu_domain programmed into HW that backs the HWPT

(A quick note on iommufd vs VFIO: I'll still keep referring to VFIO for
now because that's what I know, but will explore iommufd more and reply
in more detail about iommufd in the other email thread.)

How much of this do you think should be done automatically, vs how much
should userspace need to drive? With this RFC userspace basically re-
drives everything, including re-injecting the file containing the
persistent page tables into the IOMMU domain via VFIO.

Part of the reason is simplicity, to avoid having auto-deserialise code
paths in the drivers and modules. Another part of the reason so that
userspace can get FD handles on the resources. Typically FDs are
returned by doing actions like creating VFIO containers. If we make all
that automatic then there needs to be some other mechanism for auto-
restored resources to present themselves to userspace so that userspace
can discover and pick them up again.

One possible way to do this would be to populate a bunch of files in
procfs for each persisted IOMMU domain that allows userspace to discover
and pick it up.

Can you expand on what you mean by "A mmap that points to the VM's
physical pages?" Are you suggesting that the QEMU process automatically
gets something appearing in its address space? Part of the live update
process involves potentially changing the userspace binaries: doing
kexec and booting a new system is an opportunity to boot new versions of
the userspace binary. So we shouldn't try to preserve too much of
userspace state; it's better to let it re-create internal data
structures and do fresh mmaps.

What I'm really asking is: do you have a specific suggestion about how
these persistent resources should present themselves to userspace and
how userspace can discover them and pick them up?

>
> Ie you can't just reboot and leave the IOMMU hanging out in some
> undefined land - especially in latest kernels!

Not too sure what you mean by "undefined land" - the idea is that the
IOMMU keeps doing what it was going until userspace comes along re-
creates the handles to the IOMMU at which point it can do modifications
like change mappings or tear the domain down. This is what deferred
attached gives us, I believe, and why I had to change it to be enabled.
Just leave the IOMMU domain alone until userspace re-creates it with the
original tables.
Perhaps I'm missing your point. :-)

>
> For vt-d you need to retain the entire root table and all the required
> context entries too. The restarting iommu needs to understand that it
> has to "restore" a temporary iommu_domain from the pkernfs.
> You can later reconstitute a proper iommu_domain from the VMM and
> atomic switch.

Why does it need to go via a temporary domain? The current approach is
just to leave the IOMMU domain running as-is via deferred attached, and
later when userspace starts up it will create the iommu_domain backed by
the same persistent page tables.
>
> So, I'm surprised to see this approach where things just live forever
> in the kernfs, I don't see how "restore" is going to work very well
> like this.

Can you expand on why the suggested restore path will be problematic? In
summary the idea is to re-create all of the "ephemeral" data structures
by re-doing ioctls like MAP_DMA, but keeping the persistent IOMMU
root/context tables pointed at the original persistent page tables. The
ephemeral data structures are re-created in userspace but the persistent
page tables left alone. This is of course dependent on userspace re-
creating things *correctly* - it can easily do the wrong thing. Perhaps
this is the issue? Or is there a problem even if userspace is sane.

> I would think that a save/restore mentality would make more
> sense. For instance you could make a special iommu_domain that is fixed
> and lives in the pkernfs. The operation would be to copy from the live
> iommu_domain to the fixed one and then replace the iommu HW to the
> fixed one.
>
> In the post-kexec world the iommu would recreate that special domain
> and point the iommu at it. (copying the root and context descriptions
> out of the pkernfs). Then somehow that would get into iommufd and VFIO
> so that it could take over that special mapping during its startup.

The save and restore model is super interesting - I'm keen to discuss
this as an alternative. You're suggesting that the IOMMU driver have a
serialise phase just before kexec where it dumps everything into
persistent memory and then after kexec pulls it back into ephemeral
memory. That's probably do-able, but it may increase the critical
section latency of live update (every millisecond counts!) and I'm also
not too sure what that buys compared to always working with persistent
memory and just always being in a state where persistent data is always
being used and can be picked up as-is.

However, the idea of a serialise and deserialise operation is relevant
to a possible alternative to this RFC. My colleague Alex Graf is working
on a framework called Kexec Hand Over (KHO):
https://lore.kernel.org/all/[email protected]/#r
That allows drivers/modules to mark arbitrary memory pages as persistent
(ie: not allocatable by next kernel) and to pass over some serialised
state across kexec.
An alternative to IOMMU domain persistence could be to use KHO to mark
the IOMMU root, context and domain page table pages as persistent across
kexec.

>
> Then you'd build the normal operating ioas and hwpt (with all the
> right page refcounts/etc) then switch to it and free the pkernfs
> memory.
>
> It seems a lot less invasive to me. The special case is clearly a
> special case and doesn't mess up the normal operation of the drivers.

Yes, a serialise/deserialise approach does have this distinct advantage
of not needing to change the alloc/free code paths. Pkernfs requires a
shim in the allocator to use persistent memory.


>
> > * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> > Really we should move the abstraction one level up and make the whole VFIO
> > container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> > container file and all of the DMA mappings inside VFIO would already be set up.
>
> I doubt this.. It probably needs to be much finer grained actually,
> otherwise you are going to be serializing everything. Somehow I think
> you are better to serialize a minimum and try to reconstruct
> everything else in userspace. Like conserving iommufd IDs would be a
> huge PITA.
>
> There are also going to be lots of security questions here, like we
> can't just let userspace feed in any garbage and violate vfio and
> iommu invariants.

Right! This is definitely one of the big gaps at the moment: this
approach requires that VFIO has the same state re-driven into it from
userspace so that the persistent and ephemeral data match. If userspace
does something dodgy, well, it may cause problems. :-)
That's exactly why I thought we should move the abstraction up to a
level that doesn't depend on userspace re-driving data. It sounds like
you were suggesting similar in the first part of your comment, but I
didn't fully understand how you'd like to see it presented to userspace.

JG

2024-02-07 15:08:28

by Gowans, James

[permalink] [raw]
Subject: Re: [RFC 00/18] Pkernfs: Support persistence for live update

On Mon, 2024-02-05 at 10:10 -0700, Alex Williamson wrote:
> > * Needing to drive and re-hydrate the IOMMU page tables by defining
> > an IOMMU file.
> > Really we should move the abstraction one level up and make the
> > whole VFIO
> > container persistent via a pkernfs file. That way you’d "just" re-
> > open the VFIO
> > container file and all of the DMA mappings inside VFIO would already
> > be set up.
>
> Note that the vfio container is on a path towards deprecation, this
> should be refocused on vfio relative to iommufd.  There would need to
> be a strong argument for a container/type1 extension to support this,
> iommufd would need to be the first class implementation.  Thanks,

Ack! When I first started putting pkernfs together, iommufd wasn't
integrated into QEMU yet, hence I stuck with VFIO for this PoC.
I'm thrilled to see that iommufd now seems to be integrated in QEMU!
Good opportunity to get to grips with it.

The VFIO-specific part of this patch is essentially ioctls on the
*container* to be able to:

1. Define persistent page tables (PPTs) on the containers so that those
PPTs are used by the IOMMU domain and hence by all devices added to that
container.
https://github.com/jgowans/qemu/commit/e84cfb8186d71f797ef1f72d57d873222a9b479e

2. Tell VFIO to avoid mapping the memory in again after live update
because it already exists.
https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
(the above flag should only be set *after* live update...).
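As a rough illustration of what those two container-level operations amount to, here is a toy model of the flow (the operation names, the `skip_map` flag, and the pkernfs path are all hypothetical, invented for illustration; they are not the actual VFIO uAPI):

```python
# Toy model of the two hypothetical container-level operations described
# above: (1) pointing the IOMMU domain at persistent page tables in
# pkernfs, and (2) telling VFIO to skip re-mapping after live update
# because the HW mappings already exist. All names are invented.

class Container:
    def __init__(self):
        self.ppt_file = None      # persistent page tables backing the domain
        self.skip_map = False     # set only *after* live update
        self.mappings = {}        # iova -> size: the ephemeral bookkeeping

    def set_ppt(self, pkernfs_path):
        # Operation 1: use page tables stored in pkernfs for the domain.
        self.ppt_file = pkernfs_path

    def map_dma(self, iova, size):
        # Operation 2: after live update the translation already exists in
        # the PPTs, so only the ephemeral bookkeeping is re-created.
        self.mappings[iova] = size
        return "bookkeeping-only" if self.skip_map else "hw-mapped"

# First boot: mappings are written into the persistent page tables.
c = Container()
c.set_ppt("/mnt/pkernfs/iommu/domain-0")
assert c.map_dma(0x1000, 0x1000) == "hw-mapped"
```

After kexec, userspace would re-open the container with `skip_map` set, so re-driving the same `map_dma` calls rebuilds only the ephemeral state.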

Do you have a rough suggestion about how similar could be done with
iommufd?

JG

2024-02-07 15:29:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 00/18] Pkernfs: Support persistence for live update

On Wed, Feb 07, 2024 at 02:56:33PM +0000, Gowans, James wrote:
> 2. Tell VFIO to avoid mapping the memory in again after live update
> because it already exists.
> https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
> (the above flag should only be set *after* live update...).

Definitely no to that entire idea. It completely breaks how the memory
lifetime model works in iommufd.

iommufd has to re-establish its pins, and has to rebuild all its
mapping data structures. Otherwise it won't work correctly at all.

This is what I was saying in the other thread, you can't just ignore
fully restoring the iommu environment.

The end goal must be to have fully reconstituted iommufd with all its
maps, ioas's, and memory pins back to fully normal operation.

IMHO you need to focus on atomic replace where you go from the frozen
pkernfs environment to a live operating environment by hitlessly
replacing the IO page table in the HW. Ie going from an
IOMMU_DOMAIN_PKERNFS to an IOMMU_DOMAIN_PAGING owned by iommufd that
describes exactly the same translation.
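The hitless-replace idea above can be sketched as a toy state machine. The domain-type names come from this thread (IOMMU_DOMAIN_PKERNFS is a proposal, not an existing kernel constant), and the replace invariant is that both domains describe the identical translation, so DMA never observes a change:

```python
# Toy model of atomically replacing the inherited IOMMU_DOMAIN_PKERNFS
# with a normal iommufd-owned IOMMU_DOMAIN_PAGING. The invariant is that
# the two domains describe the same translation at the switch point.

DOMAIN_PKERNFS = "IOMMU_DOMAIN_PKERNFS"   # frozen, inherited across kexec
DOMAIN_PAGING = "IOMMU_DOMAIN_PAGING"     # normal iommufd-owned domain

class Iommu:
    def __init__(self, inherited_translation):
        self.domain_type = DOMAIN_PKERNFS
        self.translation = dict(inherited_translation)  # iova -> phys

    def replace(self, new_type, new_translation):
        # Hitless replace: refuse unless the new domain is entry-for-entry
        # identical, so in-flight DMA keeps translating the same way.
        if new_translation != self.translation:
            raise ValueError("translations differ: replace would not be hitless")
        self.domain_type = new_type

inherited = {0x1000: 0xabcd000}           # restored from pkernfs after kexec
iommu = Iommu(inherited)
iommu.replace(DOMAIN_PAGING, {0x1000: 0xabcd000})
assert iommu.domain_type == DOMAIN_PAGING
```

A mismatched translation is rejected, which is the point: the pkernfs domain is a temporary bridge, not an operating mode.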

"adopting" an entire io page table with unknown contents, and still
being able to correctly do map/unmap seems way too hard.

Jason

2024-02-07 15:31:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 00/18] Pkernfs: Support persistence for live update

On Wed, Feb 07, 2024 at 02:45:42PM +0000, Gowans, James wrote:
> Hi Jason,
>
> Thanks for this great feedback on the approach - it's exactly the sort
> of thing we were looking for.
>
> On Mon, 2024-02-05 at 13:42 -0400, Jason Gunthorpe wrote:
> >
> > On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:
> >
> > > The main aspect we’re looking for feedback/opinions on here is the concept of
> > > putting all persistent state in a single filesystem: combining guest RAM and
> > > IOMMU pgtables in one store. Also, the question of a hard separation between
> > > persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> > > be persisted. Pkernfs does it via a hard separation defined at boot time, other
> > > approaches could make the carving out of persistent pages dynamic.
> >
> > I think if you are going to attempt something like this then the end
> > result must bring things back to having the same data structures fully
> > restored.
> >
> > It is fine that the pkernfs holds some persistant memory that
> > guarentees the IOMMU can remain programmed and the VM pages can become
> > fixed across the kexec
> >
> > But once the VMM starts to restore it self we need to get back to the
> > original configuration:
> > - A mmap that points to the VM's physical pages
> > - An iommufd IOAS that points to the above mmap
> > - An iommufd HWPT that represents that same mapping
> > - An iommu_domain programmed into HW that the HWPT
>
> (A quick note on iommufd vs VFIO: I'll still keep referring to VFIO for
> now because that's what I know, but will explore iommufd more and reply
> in more detail about iommufd in the other email thread.)
>
> How much of this do you think should be done automatically, vs how much
> should userspace need to drive? With this RFC userspace basically re-
> drives everything, including re-injecting the file containing the
> persistent page tables into the IOMMU domain via VFIO.

My guess is that fully automatically is hard/impossible as there is
lots and lots of related state that has to come back. Like how do you
get all the internal iommufd IOAS-related data structures
automatically? Seems way too hard.

Feels simpler to have userspace redo whatever setup was needed to get
back to the right spot.

> Part of the reason is simplicity, to avoid having auto-deserialise code
> paths in the drivers and modules. Another part of the reason so that
> userspace can get FD handles on the resources. Typically FDs are
> returned by doing actions like creating VFIO containers. If we make all
> that automatic then there needs to be some other mechanism for auto-
> restored resources to present themselves to userspace so that userspace
> can discover and pick them up again.

Right, there is lots of state all over the place that would be hard to
just re-materialize.

> Can you expand on what you mean by "A mmap that points to the VM's
> physical pages?" Are you suggesting that the QEMU process automatically
> gets something appearing in its address space? Part of the live update
> process involves potentially changing the userspace binaries: doing
> kexec and booting a new system is an opportunity to boot new versions of
> the userspace binary. So we shouldn't try to preserve too much of
> userspace state; it's better to let it re-create internal data
> structures and do fresh mmaps.

I expect the basic flow would be like:

Starting kernel
- run the VMM
- Allocate VM memory in the pkernfs
- mmap that VM memory
- Attach the VM memory to KVM
- Attach the VM memory mmap to IOMMUFD
- Operate the VM

Suspending the kernel
- Stop touching iommufd
- Freeze changes to the IOMMU, and move its working memory to pkernfs
- Exit the kernel

New kernel
- Recover the frozen IOMMU back to partially running, like crash
dump. Continue to use some of the working memory in the pkernfs
- run the new VMM. Some IOMMU_DOMAIN_PKERNFS thing to represent
this state
- mmap the VM memory
- Get KVM going again
- Attach the new VMM's VM memory mmap to IOMMUFD
- Replace the iommu partial configuration with a full configuration
- Free the pkernfs iommu related memory
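The ordering constraint buried in that flow is that the pkernfs-backed IOMMU working memory may only be freed once the full iommufd configuration has replaced the inherited one. A toy model of that constraint (all names invented for illustration):

```python
# Toy model of the post-kexec recovery ordering from the flow above:
# the inherited pkernfs IOMMU memory must stay alive until the fully
# rebuilt iommufd configuration has atomically replaced it in HW.

class Recovery:
    def __init__(self):
        self.pkernfs_iommu_mem = True   # inherited working memory, in use by HW
        self.full_config = False        # normal iommufd IOAS/HWPT rebuilt yet?

    def replace_with_full_config(self):
        # mmap VM memory, rebuild IOAS/HWPT, atomically switch the HW over.
        self.full_config = True

    def free_pkernfs_iommu_mem(self):
        # Freeing too early would yank page tables out from under live DMA.
        if not self.full_config:
            raise RuntimeError("freeing pkernfs IOMMU memory while HW still uses it")
        self.pkernfs_iommu_mem = False

r = Recovery()
r.replace_with_full_config()
r.free_pkernfs_iommu_mem()      # only now is the inherited memory droppable
assert not r.pkernfs_iommu_mem
```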

> What I'm really asking is: do you have a specific suggestion about how
> these persistent resources should present themselves to userspace and
> how userspace can discover them and pick them up?

The only tricky bit in the above is having VFIO know it should leave
the iommu and PCI device state alone when the VFIO cdev is first
opened. Otherwise everything else is straightforward.

Presumably vfio would know it inherited a pkernfs blob and would do
the right stuff. May be some uAPI fussing there to handshake that
properly

Once VFIO knows this it can operate iommufd to conserve the
IOMMU_DOMAIN_PKERNFS as well.

> > Ie you can't just reboot and leave the IOMMU hanging out in some
> > undefined land - especially in latest kernels!
>
> Not too sure what you mean by "undefined land" - the idea is that the
IOMMU keeps doing what it was doing until userspace comes along re-

In terms of how the iommu subsystem understands what the iommu is
doing. The iommu subsystem now forces the iommu into defined states as
part of its startup, and you need an explicit defined state which means
"continuing to use the pkernfs saved state" which the iommu driver
deliberately enters.

> creates the handles to the IOMMU at which point it can do modifications
> like change mappings or tear the domain down. This is what deferred
> attach gives us, I believe, and why I had to change it to be
> enabled.

VFIO doesn't trigger deferred attach at all, that patch made no sense.

> > For vt-d you need to retain the entire root table and all the required
> context entries too. The restarting iommu needs to understand that it
> > has to "restore" a temporary iommu_domain from the pkernfs.
> > You can later reconstitute a proper iommu_domain from the VMM and
> > atomic switch.
>
> Why does it need to go via a temporary domain?

Because that is the software model we have now. You must be explicit,
not in some la-la undefined land of "I don't know WTF is going on, but
if I squint this is doing some special thing!" That concept is dead in
the iommu subsystem; you must be explicit.

If the iommu is translating through special page tables stored in a
pkernfs then you need a IOMMU_DOMAIN_PKERNFS to represent that
behavior.

> > So, I'm surprised to see this approach where things just live forever
> > in the kernfs, I don't see how "restore" is going to work very well
> > like this.
>
> Can you expand on why the suggested restore path will be problematic? In
> summary the idea is to re-create all of the "ephemeral" data structures
> by re-doing ioctls like MAP_DMA, but keeping the persistent IOMMU
> root/context tables pointed at the original persistent page tables. The
> ephemeral data structures are re-created in userspace but the persistent
> page tables left alone. This is of course dependent on userspace re-
> creating things *correctly* - it can easily do the wrong thing. Perhaps
> this is the issue? Or is there a problem even if userspace is sane.

Because how do you regain control of the iommu in a fully configured
way with all the right pin counts and so forth? It seems impossible
like this; all that information is washed away during the kexec.

It seems easier if the pkernfs version of the iommu configuration is
temporary and very special. The normal working mode is just exactly as
today.

> > I would think that a save/restore mentalitity would make more
> > sense. For instance you could make a special iommu_domain that is fixed
> > and lives in the pkernfs. The operation would be to copy from the live
> > iommu_domain to the fixed one and then replace the iommu HW to the
> > fixed one.
> >
> > In the post-kexec world the iommu would recreate that special domain
> > and point the iommu at it. (copying the root and context descriptions
> > out of the pkernfs). Then somehow that would get into iommufd and VFIO
> > so that it could take over that special mapping during its startup.
>
> The save and restore model is super interesting - I'm keen to discuss
> this as an alternative. You're suggesting that IOMMU driver have a
> serialise phase just before kexec where it dumps everything into
> persistent memory and then after kexec pulls it back into ephemeral
> memory. That's probably do-able, but it may increase the critical
> section latency of live update (every millisecond counts!)

Suspend preparation can be done before stopping the vCPUs. You have
to commit to freezing the iommu, which only means things like memory
hotplug can't progress. So it isn't critical path.

Same on resume: you can resume KVM and the vCPUs and leave the IOMMU in
its suspended state while you work on returning it to normal
operation. Again, only memory hotplug becomes blocked, so it isn't
critical path.

> and I'm also not too sure what that buys compared to always working
> with persistent memory and just always being in a state where
> persistent data is always being used and can be picked up as-is.

You don't mess up the entire driver and all of its memory management,
and end up with a problem where you can't actually properly restore it
anyhow :)

> However, the idea of a serialise and deserialise operation is relevant
> to a possible alternative to this RFC. My colleague Alex Graf is working
> on a framework called Kexec Hand Over (KHO):
> https://lore.kernel.org/all/[email protected]/#r
> That allows drivers/modules to mark arbitrary memory pages as persistent
> (ie: not allocatable by next kernel) and to pass over some serialised
> state across kexec.
> An alternative to IOMMU domain persistence could be to use KHO to mark
> the IOMMU root, context and domain page table pages as persistent via
> KHO.

IMHO it doesn't matter how you get the memory across the kexec, you
still have to answer all these questions about how does the new kernel
actually keep working with this inherited data, and how does it
transform the inherited data into operating data that is properly
situated in the kernel data structures.

You can't just startup iommufd and point it at a set of io page tables
that something else populated. It is fundamentally wrong and would
lead to corrupting the mm's pin counts.
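The pin-count hazard can be illustrated with a toy refcount model: the io page table entries survive in pkernfs, but the mm's pin bookkeeping does not survive kexec, so unless iommufd re-establishes every pin, a later unmap would underflow the counts (all names here are illustrative, not the real mm API):

```python
# Toy model of why inheriting populated io page tables corrupts pin
# counts: the mm's pin bookkeeping is wiped by kexec, so every mapped
# page must be re-pinned before normal map/unmap can resume.

pin_counts = {}                 # pfn -> pin count: stand-in for mm bookkeeping

def pin(pfn):
    pin_counts[pfn] = pin_counts.get(pfn, 0) + 1

def unpin(pfn):
    # Unmapping a page the mm never saw pinned would underflow the count.
    if pin_counts.get(pfn, 0) == 0:
        raise RuntimeError("pin count underflow: page was never re-pinned")
    pin_counts[pfn] -= 1

# Pre-kexec: iommufd pinned the page when it mapped it.
pin(0xabcd)
# kexec: the io page table entry survives in pkernfs; the bookkeeping does not.
pin_counts.clear()
# Post-kexec: iommufd must re-establish the pin before any unmap is legal.
pin(0xabcd)
unpin(0xabcd)
assert pin_counts[0xabcd] == 0
```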

> > > * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> > > Really we should move the abstraction one level up and make the whole VFIO
> > > container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> > > container file and all of the DMA mappings inside VFIO would already be set up.
> >
> > I doubt this.. It probably needs to be much finer grained actually,
> > otherwise you are going to be serializing everything. Somehow I think
> > you are better to serialize a minimum and try to reconstruct
> > everything else in userspace. Like conserving iommufd IDs would be a
> > huge PITA.
> >
> > There are also going to be lots of security questions here, like we
> > can't just let userspace feed in any garbage and violate vfio and
> > iommu invariants.
>
> Right! This is definitely one of the big gaps at the moment: this
> approach requires that VFIO has the same state re-driven into it from
> userspace so that the persistent and ephemeral data match. If userspace
> does something dodgy, well, it may cause problems. :-)
> That's exactly why I thought we should move the abstraction up to a
> level that doesn't depend on userspace re-driving data. It sounds like
> you were suggesting similar in the first part of your comment, but I
> didn't fully understand how you'd like to see it presented to userspace.

I'd think you end up with some scenario where the pkernfs data has to
be trusted and sealed somehow before vfio would understand it. Ie you
have to feed it into vfio/etc via kexec only.

From a security perspective it does seem horribly wrong to expose such
sensitive data in a filesystem API where there are API surfaces that
would let userspace manipulate it.

At least from the iommu/vfio perspective:

The trusted data should originate inside a signed kernel only.

The signed kernel should prevent userspace from reading or writing it.

The next kernel should trust that the prior kernel put the correct
data in there. There should be no option for the next kernel userspace
to read or write the data.

The next kernel can automatically affiliate things with the trusted
inherited data that it knows was passed from the prior signed kernel.
eg autocreate a IOMMU_DOMAIN_PKERNFS, tweak VFIO, etc.

I understand the appeal of making a pkernfs to hold the VM's memory
pages, but it doesn't seem so secure for kernel internal data
structures.

Jason