This RFC series tries to implement the fd-based KVM guest private memory
proposal described at [1] and the improved 'New Proposal' described at [2].
In general, this patch series introduces fd-based memslots which provide
guest memory through fd[offset, size] instead of hva/size. The fd can be
created from a supported memory filesystem like tmpfs/hugetlbfs, etc.,
which we refer to as the memory backing store. KVM and the backing store
exchange callbacks when such a memslot gets created. At runtime KVM calls
into the callbacks provided by the backing store to get the pfn for a
given fd+offset. The backing store in turn calls into KVM callbacks when
userspace fallocates or punches a hole in the fd, to notify KVM to
map/unmap the secondary MMU page tables.
Compared to the existing hva-based memslots, this new type of memslot
allows guest memory to be unmapped from host userspace (e.g. QEMU) and
even from the kernel itself, reducing the attack surface and bringing
other benefits.
Based on this fd-based memslot, we can build guest private memory for use
in confidential computing environments such as Intel TDX and AMD SEV.
When supported, the backing store can provide additional enforcement on
the fd and KVM can use a single memslot to hold both the private and
shared parts of the guest memory. For a more detailed description please
refer to [2].
Because this design introduces callbacks between the memory backing store
and KVM, and because for private memory KVM relies on the backing store to
do additional enforcement and to tell whether an address is private or
shared, I would appreciate it if KVM/mm/fs people could have a look at
this part.
[1]
https://lkml.kernel.org/kvm/[email protected]/
[2]
https://lkml.kernel.org/linux-fsdevel/[email protected]/
Thanks,
Chao
---
Chao Peng (12):
KVM: Add KVM_EXIT_MEMORY_ERROR exit
KVM: Extend kvm_userspace_memory_region to support fd based memslot
KVM: Add fd-based memslot data structure and utils
KVM: Implement fd-based memory using new memfd interfaces
KVM: Register/unregister memfd backed memslot
KVM: Handle page fault for fd based memslot
KVM: Rename hva memory invalidation code to cover fd-based offset
KVM: Introduce kvm_memfd_invalidate_range
KVM: Match inode for invalidation of fd-based slot
KVM: Add kvm_map_gfn_range
KVM: Introduce kvm_memfd_fallocate_range
KVM: Enable memfd based page invalidation/fallocate
Kirill A. Shutemov (1):
mm/shmem: Introduce F_SEAL_GUEST
arch/arm64/kvm/mmu.c | 14 +--
arch/mips/kvm/mips.c | 14 +--
arch/powerpc/include/asm/kvm_ppc.h | 28 ++---
arch/powerpc/kvm/book3s.c | 14 +--
arch/powerpc/kvm/book3s_hv.c | 14 +--
arch/powerpc/kvm/book3s_pr.c | 14 +--
arch/powerpc/kvm/booke.c | 14 +--
arch/powerpc/kvm/powerpc.c | 14 +--
arch/riscv/kvm/mmu.c | 14 +--
arch/s390/kvm/kvm-s390.c | 14 +--
arch/x86/include/asm/kvm_host.h | 6 +-
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/mmu/mmu.c | 122 ++++++++++++++++++++-
arch/x86/kvm/vmx/main.c | 6 +-
arch/x86/kvm/vmx/tdx.c | 6 +-
arch/x86/kvm/vmx/tdx_stubs.c | 6 +-
arch/x86/kvm/x86.c | 16 +--
include/linux/kvm_host.h | 58 ++++++++--
include/linux/memfd.h | 24 +++++
include/linux/shmem_fs.h | 9 ++
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/kvm.h | 27 +++++
mm/memfd.c | 33 +++++-
mm/shmem.c | 123 ++++++++++++++++++++-
virt/kvm/kvm_main.c | 165 +++++++++++++++++++++++------
virt/kvm/memfd.c | 123 +++++++++++++++++++++
26 files changed, 733 insertions(+), 149 deletions(-)
create mode 100644 virt/kvm/memfd.c
--
2.17.1
From: "Kirill A. Shutemov" <[email protected]>
The new seal type provides the semantics required for KVM guest private
memory support. A file descriptor with the seal set is going to be used
as the source of guest memory in confidential computing environments such
as Intel TDX and AMD SEV.
F_SEAL_GUEST can only be set on an empty memfd. After the seal is set,
userspace cannot read, write or mmap the memfd.
Userspace is in charge of the guest memory lifecycle: it can allocate
memory with fallocate() and free memory back from the guest by punching
holes.
The file descriptor is passed down to KVM as the guest memory backend.
KVM registers itself as the owner of the memfd via memfd_register_guest()
and provides callbacks that need to be called on fallocate and punch hole.
memfd_register_guest() returns the callbacks that are used to request a
new page from the memfd.
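For illustration, a minimal userspace sketch of the intended usage could
look like the one below. F_SEAL_GUEST is taken from this patch; the fd
sizing policy, flags and error handling are assumptions, not part of the
patch.

/*
 * Hypothetical userspace sketch: create a guest-private memfd.
 * F_SEAL_GUEST is defined by this patch; everything else is standard.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SEAL_GUEST
#define F_SEAL_GUEST 0x0020	/* from this patch */
#endif

static int create_guest_memfd(size_t size)
{
	int fd = memfd_create("guest-mem", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;

	/* Must be sealed while still empty; read/write/mmap fail afterwards. */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_GUEST) < 0)
		goto err;

	/* Userspace still drives the lifecycle: allocate backing pages... */
	if (fallocate(fd, 0, 0, size) < 0)
		goto err;

	/*
	 * ...and later free guest memory again with
	 * fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len).
	 */
	return fd;
err:
	close(fd);
	return -1;
}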
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/memfd.h | 24 ++++++++
include/linux/shmem_fs.h | 9 +++
include/uapi/linux/fcntl.h | 1 +
mm/memfd.c | 33 +++++++++-
mm/shmem.c | 123 ++++++++++++++++++++++++++++++++++++-
5 files changed, 186 insertions(+), 4 deletions(-)
diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..ff920ef28688 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -4,13 +4,37 @@
#include <linux/file.h>
+struct guest_ops {
+ void (*invalidate_page_range)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+ void (*fallocate)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+};
+
+struct guest_mem_ops {
+ unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset,
+ bool alloc, int *order);
+ void (*put_unlock_pfn)(unsigned long pfn);
+
+};
+
#ifdef CONFIG_MEMFD_CREATE
extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
+
+extern inline int memfd_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops);
#else
static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
{
return -EINVAL;
}
+static inline int memfd_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops)
+{
+ return -EINVAL;
+}
#endif
#endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 166158b6e917..8280c918775a 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,9 @@
/* inode in-kernel data */
+struct guest_ops;
+struct guest_mem_ops;
+
struct shmem_inode_info {
spinlock_t lock;
unsigned int seals; /* shmem seals */
@@ -25,6 +28,8 @@ struct shmem_inode_info {
struct simple_xattrs xattrs; /* list of xattrs */
atomic_t stop_eviction; /* hold when working on inode */
struct inode vfs_inode;
+ void *guest_owner;
+ const struct guest_ops *guest_ops;
};
struct shmem_sb_info {
@@ -96,6 +101,10 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
pgoff_t start, pgoff_t end);
+extern int shmem_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops);
+
/* Flag allocation requirements to shmem_getpage */
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..c79bc8572721 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
#define F_SEAL_GROW 0x0004 /* prevent file from growing */
#define F_SEAL_WRITE 0x0008 /* prevent writes */
#define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
+#define F_SEAL_GUEST 0x0020
/* (1U << 31) is reserved for signed error codes */
/*
diff --git a/mm/memfd.c b/mm/memfd.c
index 081dd33e6a61..a98b30bcf982 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -130,11 +130,25 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
return NULL;
}
+int memfd_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops)
+{
+ if (shmem_mapping(inode->i_mapping)) {
+ return shmem_register_guest(inode, owner,
+ guest_ops, guest_mem_ops);
+ }
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(memfd_register_guest);
+
#define F_ALL_SEALS (F_SEAL_SEAL | \
F_SEAL_SHRINK | \
F_SEAL_GROW | \
F_SEAL_WRITE | \
- F_SEAL_FUTURE_WRITE)
+ F_SEAL_FUTURE_WRITE | \
+ F_SEAL_GUEST)
static int memfd_add_seals(struct file *file, unsigned int seals)
{
@@ -203,10 +217,27 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
}
}
+ if (seals & F_SEAL_GUEST) {
+ i_mmap_lock_read(inode->i_mapping);
+
+ if (!RB_EMPTY_ROOT(&inode->i_mapping->i_mmap.rb_root)) {
+ error = -EBUSY;
+ goto unlock;
+ }
+
+ if (i_size_read(inode)) {
+ error = -EBUSY;
+ goto unlock;
+ }
+ }
+
*file_seals |= seals;
error = 0;
unlock:
+ if (seals & F_SEAL_GUEST)
+ i_mmap_unlock_read(inode->i_mapping);
+
inode_unlock(inode);
return error;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 23c91a8beb78..38b3b6b9a3a5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -78,6 +78,7 @@ static struct vfsmount *shm_mnt;
#include <linux/userfaultfd_k.h>
#include <linux/rmap.h>
#include <linux/uuid.h>
+#include <linux/memfd.h>
#include <linux/uaccess.h>
@@ -906,6 +907,21 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
return split_huge_page(page) >= 0;
}
+static void guest_invalidate_page(struct inode *inode,
+ struct page *page, pgoff_t start, pgoff_t end)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (!info->guest_ops || !info->guest_ops->invalidate_page_range)
+ return;
+
+ start = max(start, page->index);
+ end = min(end, page->index + thp_nr_pages(page)) - 1;
+
+ info->guest_ops->invalidate_page_range(inode, info->guest_owner,
+ start, end);
+}
+
/*
* Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -949,6 +965,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
index += thp_nr_pages(page) - 1;
+ guest_invalidate_page(inode, page, start, end);
+
if (!unfalloc || !PageUptodate(page))
truncate_inode_page(mapping, page);
unlock_page(page);
@@ -1025,6 +1043,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index--;
break;
}
+
+ guest_invalidate_page(inode, page, start, end);
+
VM_BUG_ON_PAGE(PageWriteback(page), page);
if (shmem_punch_compound(page, start, end))
truncate_inode_page(mapping, page);
@@ -1098,6 +1119,9 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
(newsize > oldsize && (info->seals & F_SEAL_GROW)))
return -EPERM;
+ if ((info->seals & F_SEAL_GUEST) && (newsize & ~PAGE_MASK))
+ return -EINVAL;
+
if (newsize != oldsize) {
error = shmem_reacct_size(SHMEM_I(inode)->flags,
oldsize, newsize);
@@ -1364,6 +1388,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
goto redirty;
if (!total_swap_pages)
goto redirty;
+ if (info->seals & F_SEAL_GUEST)
+ goto redirty;
/*
* Our capabilities prevent regular writeback or sync from ever calling
@@ -2262,6 +2288,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
if (ret)
return ret;
+ if (info->seals & F_SEAL_GUEST)
+ return -EPERM;
+
/* arm64 - allow memory tagging on RAM-based files */
vma->vm_flags |= VM_MTE_ALLOWED;
@@ -2459,12 +2488,14 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
int ret = 0;
/* i_rwsem is held by caller */
- if (unlikely(info->seals & (F_SEAL_GROW |
- F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+ if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE |
+ F_SEAL_FUTURE_WRITE | F_SEAL_GUEST))) {
if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
return -EPERM;
if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
return -EPERM;
+ if (info->seals & F_SEAL_GUEST)
+ return -EPERM;
}
ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
@@ -2546,6 +2577,20 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
end_index = i_size >> PAGE_SHIFT;
if (index > end_index)
break;
+
+ /*
+ * inode_lock protects setting up seals as well as write to
+ * i_size. Setting F_SEAL_GUEST only allowed with i_size == 0.
+ *
+ * Check F_SEAL_GUEST after i_size. It effectively serialize
+ * read vs. setting F_SEAL_GUEST without taking inode_lock in
+ * read path.
+ */
+ if (SHMEM_I(inode)->seals & F_SEAL_GUEST) {
+ error = -EPERM;
+ break;
+ }
+
if (index == end_index) {
nr = i_size & ~PAGE_MASK;
if (nr <= offset)
@@ -2677,6 +2722,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
goto out;
}
+ if ((info->seals & F_SEAL_GUEST) &&
+ (offset & ~PAGE_MASK || len & ~PAGE_MASK)) {
+ error = -EINVAL;
+ goto out;
+ }
+
shmem_falloc.waitq = &shmem_falloc_waitq;
shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
@@ -2796,6 +2847,8 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
i_size_write(inode, offset + len);
inode->i_ctime = current_time(inode);
+ if (info->guest_ops && info->guest_ops->fallocate)
+ info->guest_ops->fallocate(inode, info->guest_owner, start, end);
undone:
spin_lock(&inode->i_lock);
inode->i_private = NULL;
@@ -3800,6 +3853,20 @@ static int shmem_error_remove_page(struct address_space *mapping,
return 0;
}
+#ifdef CONFIG_MIGRATION
+int shmem_migrate_page(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
+{
+ struct inode *inode = mapping->host;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (info->seals & F_SEAL_GUEST)
+ return -ENOTSUPP;
+ return migrate_page(mapping, newpage, page, mode);
+}
+#endif
+
const struct address_space_operations shmem_aops = {
.writepage = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
@@ -3808,12 +3875,62 @@ const struct address_space_operations shmem_aops = {
.write_end = shmem_write_end,
#endif
#ifdef CONFIG_MIGRATION
- .migratepage = migrate_page,
+ .migratepage = shmem_migrate_page,
#endif
.error_remove_page = shmem_error_remove_page,
};
EXPORT_SYMBOL(shmem_aops);
+static unsigned long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset,
+ bool alloc, int *order)
+{
+ struct page *page;
+ int ret;
+ enum sgp_type sgp = alloc ? SGP_WRITE : SGP_READ;
+
+ ret = shmem_getpage(inode, offset, &page, sgp);
+ if (ret)
+ return ret;
+
+ *order = thp_order(compound_head(page));
+
+ return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+ set_page_dirty(page);
+ unlock_page(page);
+ put_page(page);
+}
+
+static const struct guest_mem_ops shmem_guest_ops = {
+ .get_lock_pfn = shmem_get_lock_pfn,
+ .put_unlock_pfn = shmem_put_unlock_pfn,
+};
+
+int shmem_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (!owner)
+ return -EINVAL;
+
+ if (info->guest_owner && info->guest_owner != owner)
+ return -EPERM;
+
+ info->guest_owner = owner;
+ info->guest_ops = guest_ops;
+ *guest_mem_ops = &shmem_guest_ops;
+ return 0;
+}
+
static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
.get_unmapped_area = shmem_get_unmapped_area,
--
2.17.1
This patch introduces two fds into the memslot, guarded by the KVM_MEM_FD
flag. The userspace_addr field is repurposed as the offset into the two
fds, for the shared and private views respectively. If private_fd == -1,
the memory slot has only a shared view.
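For reference, a hedged userspace sketch of registering such a slot is
shown below. The struct layout and KVM_MEM_FD come from this patch; the
header providing them and the already-opened vm_fd/shared_fd/private_fd
are assumptions.

/*
 * Hypothetical sketch: register an fd-based memslot with a shared and an
 * optional private view. Assumes a linux/kvm.h that carries this series.
 */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_fd_memslot(int vm_fd, uint32_t slot, uint64_t gpa,
			  uint64_t size, int shared_fd, int private_fd)
{
	struct kvm_userspace_memory_region_ext region;

	memset(&region, 0, sizeof(region));
	region.slot = slot;
	region.flags = KVM_MEM_FD;	/* fd-based instead of hva-based */
	region.guest_phys_addr = gpa;
	region.memory_size = size;
	region.userspace_addr = 0;	/* offset into fd and private_fd */
	region.fd = shared_fd;		/* shared view */
	region.private_fd = private_fd;	/* -1: shared view only */

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}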
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/arm64/kvm/mmu.c | 14 +++++++-------
arch/mips/kvm/mips.c | 14 +++++++-------
arch/powerpc/include/asm/kvm_ppc.h | 28 ++++++++++++++--------------
arch/powerpc/kvm/book3s.c | 14 +++++++-------
arch/powerpc/kvm/book3s_hv.c | 14 +++++++-------
arch/powerpc/kvm/book3s_pr.c | 14 +++++++-------
arch/powerpc/kvm/booke.c | 14 +++++++-------
arch/powerpc/kvm/powerpc.c | 14 +++++++-------
arch/riscv/kvm/mmu.c | 14 +++++++-------
arch/s390/kvm/kvm-s390.c | 14 +++++++-------
arch/x86/include/asm/kvm_host.h | 6 +++---
arch/x86/kvm/vmx/main.c | 6 +++---
arch/x86/kvm/vmx/tdx.c | 6 +++---
arch/x86/kvm/vmx/tdx_stubs.c | 6 +++---
arch/x86/kvm/x86.c | 16 ++++++++--------
include/linux/kvm_host.h | 18 +++++++++---------
include/uapi/linux/kvm.h | 12 ++++++++++++
virt/kvm/kvm_main.c | 23 +++++++++++++++--------
18 files changed, 133 insertions(+), 114 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 326cdfec74a1..395e52314834 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1463,10 +1463,10 @@ int kvm_mmu_init(u32 *hyp_va_bits)
}
void kvm_arch_commit_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change)
+ const struct kvm_userspace_memory_region_ext *mem,
+ struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
{
/*
* At this point memslot has been committed and there is an
@@ -1486,9 +1486,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
}
int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
hva_t hva = mem->userspace_addr;
hva_t reg_end = hva + mem->memory_size;
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 562aa878b266..ef71146809d5 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -233,18 +233,18 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
}
int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
return 0;
}
void kvm_arch_commit_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change)
+ const struct kvm_userspace_memory_region_ext *mem,
+ struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
{
int needs_flush;
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 671fbd1a765e..7cdc756a94a0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -200,14 +200,14 @@ extern void kvmppc_core_destroy_vm(struct kvm *kvm);
extern void kvmppc_core_free_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot);
extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change);
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change);
extern void kvmppc_core_commit_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- const struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change);
+ const struct kvm_userspace_memory_region_ext *mem,
+ const struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change);
extern int kvm_vm_ioctl_get_smmu_info(struct kvm *kvm,
struct kvm_ppc_smmu_info *info);
extern void kvmppc_core_flush_memslot(struct kvm *kvm,
@@ -274,14 +274,14 @@ struct kvmppc_ops {
int (*get_dirty_log)(struct kvm *kvm, struct kvm_dirty_log *log);
void (*flush_memslot)(struct kvm *kvm, struct kvm_memory_slot *memslot);
int (*prepare_memory_region)(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change);
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change);
void (*commit_memory_region)(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- const struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change);
+ const struct kvm_userspace_memory_region_ext *mem,
+ const struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change);
bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index b785f6772391..6b4bf08e7c8b 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -847,19 +847,19 @@ void kvmppc_core_flush_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot)
}
int kvmppc_core_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
return kvm->arch.kvm_ops->prepare_memory_region(kvm, memslot, mem,
change);
}
void kvmppc_core_commit_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- const struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change)
+ const struct kvm_userspace_memory_region_ext *mem,
+ const struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
{
kvm->arch.kvm_ops->commit_memory_region(kvm, mem, old, new, change);
}
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7b74fc0a986b..3b7be7894c48 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4854,9 +4854,9 @@ static void kvmppc_core_free_memslot_hv(struct kvm_memory_slot *slot)
}
static int kvmppc_core_prepare_memory_region_hv(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *slot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
unsigned long npages = mem->memory_size >> PAGE_SHIFT;
@@ -4871,10 +4871,10 @@ static int kvmppc_core_prepare_memory_region_hv(struct kvm *kvm,
}
static void kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- const struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change)
+ const struct kvm_userspace_memory_region_ext *mem,
+ const struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
{
unsigned long npages = mem->memory_size >> PAGE_SHIFT;
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 6bc9425acb32..4dd06b24c1b6 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1899,18 +1899,18 @@ static void kvmppc_core_flush_memslot_pr(struct kvm *kvm,
}
static int kvmppc_core_prepare_memory_region_pr(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
return 0;
}
static void kvmppc_core_commit_memory_region_pr(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- const struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change)
+ const struct kvm_userspace_memory_region_ext *mem,
+ const struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
{
return;
}
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 8c15c90dd3a9..f2d1acd782bf 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1821,18 +1821,18 @@ void kvmppc_core_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
}
int kvmppc_core_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
return 0;
}
void kvmppc_core_commit_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- const struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change)
+ const struct kvm_userspace_memory_region_ext *mem,
+ const struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
{
}
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 35e9cccdeef9..4aa5ef921710 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -706,18 +706,18 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
}
int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
return kvmppc_core_prepare_memory_region(kvm, memslot, mem, change);
}
void kvm_arch_commit_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change)
+ const struct kvm_userspace_memory_region_ext *mem,
+ struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
{
kvmppc_core_commit_memory_region(kvm, mem, old, new, change);
}
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index d81bae8eb55e..a7f25b0da391 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -456,10 +456,10 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
}
void kvm_arch_commit_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change)
+ const struct kvm_userspace_memory_region_ext *mem,
+ struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
{
/*
* At this point memslot has been committed and there is an
@@ -471,9 +471,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
}
int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
hva_t hva = mem->userspace_addr;
hva_t reg_end = hva + mem->memory_size;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index c6257f625929..dc9d1ec3d337 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -5018,9 +5018,9 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
/* Section: memory related */
int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
/* A few sanity checks. We can have memory slots which have to be
located/ended at a segment boundary (1MB). The memory in userland is
@@ -5043,10 +5043,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
}
void kvm_arch_commit_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change)
+ const struct kvm_userspace_memory_region_ext *mem,
+ struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
{
int rc = 0;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9ab707646ed1..86a17a23d6be 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1556,9 +1556,9 @@ struct kvm_x86_ops {
void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
int (*prepare_memory_region)(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change);
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change);
#ifdef CONFIG_KVM_TDX_SEAM_BACKDOOR
void (*do_seamcall)(struct kvm_seamcall *call);
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 1473fe8ce5a6..0a8bedaf9c1b 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -992,9 +992,9 @@ static void vt_setup_mce(struct kvm_vcpu *vcpu)
}
static int vt_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
if (is_td(kvm))
tdx_prepare_memory_region(kvm, memslot, mem, change);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4992750b6db0..839740a98d47 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2679,9 +2679,9 @@ static void tdx_flush_gprs(struct kvm_vcpu *vcpu)
}
static int tdx_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
/* TDX Secure-EPT allows only RWX. */
if (mem->flags & KVM_MEM_READONLY)
diff --git a/arch/x86/kvm/vmx/tdx_stubs.c b/arch/x86/kvm/vmx/tdx_stubs.c
index 9c6023d18afd..490a5faeb411 100644
--- a/arch/x86/kvm/vmx/tdx_stubs.c
+++ b/arch/x86/kvm/vmx/tdx_stubs.c
@@ -28,9 +28,9 @@ static int tdx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) { ret
static void tdx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2,
u32 *intr_info, u32 *error_code) {}
static int tdx_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change) { return 0; }
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change) { return 0; }
static void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static int __init tdx_check_processor_compatibility(void) { return 0; }
static void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a02920b49b26..1558f6375949 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11635,7 +11635,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
}
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- struct kvm_userspace_memory_region m;
+ struct kvm_userspace_memory_region_ext m;
m.slot = id | (i << 16);
m.flags = 0;
@@ -11841,9 +11841,9 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
}
int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change)
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change)
{
int err;
@@ -11948,10 +11948,10 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
}
void kvm_arch_commit_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change)
+ const struct kvm_userspace_memory_region_ext *mem,
+ struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change)
{
if (!kvm->arch.n_requested_mmu_pages)
kvm_mmu_change_mmu_pages(kvm,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3dd5c349f52e..99e9f9969703 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -824,20 +824,20 @@ enum kvm_mr_change {
};
int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region_ext *mem);
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region_ext *mem);
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
- const struct kvm_userspace_memory_region *mem,
- enum kvm_mr_change change);
+ struct kvm_memory_slot *memslot,
+ const struct kvm_userspace_memory_region_ext *mem,
+ enum kvm_mr_change change);
void kvm_arch_commit_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- struct kvm_memory_slot *old,
- const struct kvm_memory_slot *new,
- enum kvm_mr_change change);
+ const struct kvm_userspace_memory_region_ext *mem,
+ struct kvm_memory_slot *old,
+ const struct kvm_memory_slot *new,
+ enum kvm_mr_change change);
/* flush all memory translations */
void kvm_arch_flush_shadow_all(struct kvm *kvm);
/* flush memory translations pointing to 'slot' */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7e3a8935534b..374da6767ef6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,17 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
};
+struct kvm_userspace_memory_region_ext {
+ __u32 slot;
+ __u32 flags;
+ __u64 guest_phys_addr;
+ __u64 memory_size; /* bytes */
+ __u64 userspace_addr; /* offset into fd/private_fd */
+ __s32 fd;
+ __s32 private_fd; /* valid if guest private memory is supported */
+ __u32 padding[6];
+};
+
/*
* The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
* other bits are reserved for kvm internal use which are defined in
@@ -110,6 +121,7 @@ struct kvm_userspace_memory_region {
*/
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+#define KVM_MEM_FD (1UL << 2)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1578be8e4441..271cef8d1cd0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1424,7 +1424,7 @@ static void update_memslots(struct kvm_memslots *slots,
}
static int check_memory_region_flags(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_userspace_memory_region_ext *mem)
{
u32 valid_flags = 0;
@@ -1537,7 +1537,7 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
}
static int kvm_set_memslot(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
+ const struct kvm_userspace_memory_region_ext *mem,
struct kvm_memory_slot *old,
struct kvm_memory_slot *new, int as_id,
enum kvm_mr_change change)
@@ -1629,7 +1629,7 @@ static int kvm_set_memslot(struct kvm *kvm,
}
static int kvm_delete_memslot(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
+ const struct kvm_userspace_memory_region_ext *mem,
struct kvm_memory_slot *old, int as_id)
{
struct kvm_memory_slot new;
@@ -1663,7 +1663,7 @@ static int kvm_delete_memslot(struct kvm *kvm,
* Must be called holding kvm->slots_lock for write.
*/
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_userspace_memory_region_ext *mem)
{
struct kvm_memory_slot old, new;
struct kvm_memory_slot *tmp;
@@ -1783,7 +1783,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_userspace_memory_region_ext *mem)
{
int r;
@@ -1795,7 +1795,7 @@ int kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(kvm_set_memory_region);
static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
- struct kvm_userspace_memory_region *mem)
+ struct kvm_userspace_memory_region_ext *mem)
{
if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
return -EINVAL;
@@ -4368,12 +4368,19 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
case KVM_SET_USER_MEMORY_REGION: {
- struct kvm_userspace_memory_region kvm_userspace_mem;
+ struct kvm_userspace_memory_region_ext kvm_userspace_mem;
r = -EFAULT;
if (copy_from_user(&kvm_userspace_mem, argp,
- sizeof(kvm_userspace_mem)))
+ sizeof(struct kvm_userspace_memory_region)))
goto out;
+ if (kvm_userspace_mem.flags & KVM_MEM_FD) {
+ int offset = offsetof(
+ struct kvm_userspace_memory_region_ext, fd);
+ if (copy_from_user(&kvm_userspace_mem.fd, argp + offset,
+ sizeof(kvm_userspace_mem) - offset))
+ goto out;
+ }
r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
break;
--
2.17.1
This new exit allows userspace to handle memory-related errors. It will
be used for shared <-> private memory conversion.
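A hedged sketch of how a VMM run loop might consume this exit is shown
below. The constants and layout are from this patch;
convert_to_private()/convert_to_shared() are assumed helpers that would
fallocate or punch holes on the corresponding fds.

/*
 * Hypothetical VMM-side handler for KVM_EXIT_MEMORY_ERROR. On success the
 * VMM simply re-enters the guest with KVM_RUN.
 */
#include <errno.h>
#include <stdint.h>
#include <linux/kvm.h>

int convert_to_private(uint64_t gpa, uint64_t size);	/* assumed helper */
int convert_to_shared(uint64_t gpa, uint64_t size);	/* assumed helper */

static int handle_memory_error_exit(struct kvm_run *run)
{
	uint64_t gpa  = run->mem.u.map.gpa;
	uint64_t size = run->mem.u.map.size;

	switch (run->mem.type) {
	case KVM_EXIT_MEM_MAP_PRIVATE:
		/* Guest accessed a GPA it expects to be private. */
		return convert_to_private(gpa, size);
	case KVM_EXIT_MEM_MAP_SHARED:
		/* Guest accessed a GPA it expects to be shared. */
		return convert_to_shared(gpa, size);
	default:
		return -EINVAL;
	}
}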
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/uapi/linux/kvm.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7c93d61cb19e..7e3a8935534b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -285,6 +285,18 @@ struct kvm_tdx_exit {
} u;
};
+struct kvm_memory_exit {
+#define KVM_EXIT_MEM_MAP_SHARED 1
+#define KVM_EXIT_MEM_MAP_PRIVATE 2
+ __u32 type;
+ union {
+ struct {
+ __u64 gpa;
+ __u64 size;
+ } map;
+ } u;
+};
+
#define KVM_S390_GET_SKEYS_NONE 1
#define KVM_S390_SKEYS_MAX 1048576
@@ -324,6 +336,7 @@ struct kvm_tdx_exit {
#define KVM_EXIT_X86_BUS_LOCK 33
#define KVM_EXIT_XEN 34
#define KVM_EXIT_RISCV_SBI 35
+#define KVM_EXIT_MEMORY_ERROR 36
#define KVM_EXIT_TDX 50 /* dump number to avoid conflict. */
/* For KVM_EXIT_INTERNAL_ERROR */
@@ -542,6 +555,8 @@ struct kvm_run {
unsigned long args[6];
unsigned long ret[2];
} riscv_sbi;
+ /* KVM_EXIT_MEMORY_ERROR */
+ struct kvm_memory_exit mem;
/* KVM_EXIT_TDX_VMCALL */
struct kvm_tdx_exit tdx;
/* Fix the size of the union. */
--
2.17.1
This patch pairs an fd-based memslot with a memory backing store. The two
sides handshake to exchange the callbacks that will be called later.
KVM->memfd:
  - get_pfn: get or allocate (when alloc is true) the page at the
    specified offset in the fd; the page is returned locked
  - put_pfn: put and unlock the pfn
memfd->KVM:
  - invalidate_page_range: called when userspace punches a hole in the fd;
    KVM should unmap the related pages in the secondary MMU
  - fallocate: called when userspace fallocates space in the fd; KVM can
    map the related pages in the secondary MMU
Currently only tmpfs behind the memfd interface is supported.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/Makefile | 3 +-
include/linux/kvm_host.h | 6 +++
virt/kvm/memfd.c | 101 +++++++++++++++++++++++++++++++++++++++
3 files changed, 109 insertions(+), 1 deletion(-)
create mode 100644 virt/kvm/memfd.c
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index f919df73e5e3..5d7f289b1ca0 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -11,7 +11,8 @@ KVM := ../../../virt/kvm
kvm-y += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
- $(KVM)/dirty_ring.o $(KVM)/binary_stats.o
+ $(KVM)/dirty_ring.o $(KVM)/binary_stats.o \
+ $(KVM)/memfd.o
kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1d4ac0c9b63b..e8646103356b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -769,6 +769,12 @@ static inline void kvm_irqfd_exit(void)
{
}
#endif
+
+int kvm_memfd_register(struct kvm *kvm,
+ const struct kvm_userspace_memory_region_ext *mem,
+ struct kvm_memory_slot *slot);
+void kvm_memfd_unregister(struct kvm *kvm, struct kvm_memory_slot *slot);
+
int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
struct module *module);
void kvm_exit(void);
diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
new file mode 100644
index 000000000000..bd930dcb455f
--- /dev/null
+++ b/virt/kvm/memfd.c
@@ -0,0 +1,101 @@
+
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * memfd.c: routines for fd based guest memory backing store
+ * Copyright (c) 2021, Intel Corporation.
+ *
+ * Author:
+ * Chao Peng <[email protected]>
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/memfd.h>
+const static struct guest_mem_ops *memfd_ops;
+
+static void memfd_invalidate_page_range(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end)
+{
+ //!!!We can get here after the owner no longer exists
+}
+
+static void memfd_fallocate(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end)
+{
+ //!!!We can get here after the owner no longer exists
+}
+
+static const struct guest_ops memfd_notifier = {
+ .invalidate_page_range = memfd_invalidate_page_range,
+ .fallocate = memfd_fallocate,
+};
+
+static kvm_pfn_t kvm_memfd_get_pfn(struct kvm_memory_slot *slot,
+ struct file *file, gfn_t gfn,
+ bool alloc, int *order)
+{
+ pgoff_t index = gfn - slot->base_gfn +
+ (slot->userspace_addr >> PAGE_SHIFT);
+
+ return memfd_ops->get_lock_pfn(file->f_inode, index, alloc, order);
+}
+
+static void kvm_memfd_put_pfn(kvm_pfn_t pfn)
+{
+ memfd_ops->put_unlock_pfn(pfn);
+}
+
+static struct kvm_memfd_ops kvm_memfd_ops = {
+ .get_pfn = kvm_memfd_get_pfn,
+ .put_pfn = kvm_memfd_put_pfn,
+};
+
+int kvm_memfd_register(struct kvm *kvm,
+ const struct kvm_userspace_memory_region_ext *mem,
+ struct kvm_memory_slot *slot)
+{
+ int ret;
+ struct fd fd = fdget(mem->fd);
+
+ if (!fd.file)
+ return -EINVAL;
+
+ ret = memfd_register_guest(fd.file->f_inode, kvm,
+ &memfd_notifier, &memfd_ops);
+ if (ret)
+ return ret;
+ slot->file = fd.file;
+
+ if (mem->private_fd >= 0) {
+ fd = fdget(mem->private_fd);
+ if (!fd.file) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ ret = memfd_register_guest(fd.file->f_inode, kvm,
+ &memfd_notifier, &memfd_ops);
+ if (ret)
+ goto err;
+ slot->priv_file = fd.file;
+ }
+
+ slot->memfd_ops = &kvm_memfd_ops;
+ return 0;
+err:
+ kvm_memfd_unregister(kvm, slot);
+ return ret;
+}
+
+void kvm_memfd_unregister(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ if (slot->file) {
+ fput(slot->file);
+ slot->file = NULL;
+ }
+
+ if (slot->priv_file) {
+ fput(slot->priv_file);
+ slot->priv_file = NULL;
+ }
+ slot->memfd_ops = NULL;
+}
--
2.17.1
For fd-based memslots, store the file references for the shared fd and
the private fd (if any) in the memslot structure. Since there is no 'hva'
concept, we cannot call hva_to_pfn() to get a pfn; instead, kvm_memfd_ops
is added to get/put pfns from the memory backing stores that provide
these fds.
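As a usage illustration, a consumer in a fault path might look roughly
like the sketch below. memslot_is_memfd()/memslot_has_private() and
kvm_memfd_ops are from this patch; the enclosing fault handler and the
error handling are simplified assumptions.

/*
 * Hypothetical sketch of a consumer: resolve a gfn in an fd-based slot.
 * Meant to live in KVM code with linux/kvm_host.h available.
 */
static kvm_pfn_t memfd_slot_gfn_to_pfn(struct kvm_memory_slot *slot,
					gfn_t gfn, bool private, int *order)
{
	struct file *file;
	bool alloc;

	if (!memslot_is_memfd(slot))
		return KVM_PFN_ERR_FAULT;	/* fall back to the hva path */

	/* Pick the view; the private view exists only if priv_file is set. */
	if (private && memslot_has_private(slot)) {
		file = slot->priv_file;
		alloc = false;	/* private pages are converted explicitly */
	} else {
		file = slot->file;
		alloc = true;	/* shared pages can be allocated on demand */
	}

	/* get_pfn() returns the page locked; release it with put_pfn(). */
	return slot->memfd_ops->get_pfn(slot, file, gfn, alloc, order);
}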
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 99e9f9969703..1d4ac0c9b63b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -424,6 +424,12 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
*/
#define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
+struct kvm_memfd_ops {
+ kvm_pfn_t (*get_pfn)(struct kvm_memory_slot *slot, struct file *file,
+ gfn_t gfn, bool alloc, int *order);
+ void (*put_pfn)(kvm_pfn_t pfn);
+};
+
struct kvm_memory_slot {
gfn_t base_gfn;
unsigned long npages;
@@ -433,6 +439,9 @@ struct kvm_memory_slot {
u32 flags;
short id;
u16 as_id;
+ struct file *file;
+ struct file *priv_file;
+ struct kvm_memfd_ops *memfd_ops;
};
static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
@@ -1310,6 +1319,20 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
return gfn_to_memslot(kvm, gfn)->id;
}
+static inline bool memslot_is_memfd(const struct kvm_memory_slot *slot)
+{
+ if (slot && slot->memfd_ops)
+ return true;
+ return false;
+}
+
+static inline bool memslot_has_private(const struct kvm_memory_slot *slot)
+{
+ if (slot && slot->priv_file)
+ return true;
+ return false;
+}
+
static inline gfn_t
hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
{
--
2.17.1
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
virt/kvm/kvm_main.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 271cef8d1cd0..b8673490d301 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1426,7 +1426,7 @@ static void update_memslots(struct kvm_memslots *slots,
static int check_memory_region_flags(struct kvm *kvm,
const struct kvm_userspace_memory_region_ext *mem)
{
- u32 valid_flags = 0;
+ u32 valid_flags = KVM_MEM_FD;
if (!kvm->dirty_log_unsupported)
valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;
@@ -1604,10 +1604,20 @@ static int kvm_set_memslot(struct kvm *kvm,
kvm_copy_memslots(slots, __kvm_memslots(kvm, as_id));
}
+ if (mem->flags & KVM_MEM_FD && change == KVM_MR_CREATE) {
+ r = kvm_memfd_register(kvm, mem, new);
+ if (r)
+ goto out_slots;
+ }
+
r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
if (r)
goto out_slots;
+ if (mem->flags & KVM_MEM_FD && (r || change == KVM_MR_DELETE)) {
+ kvm_memfd_unregister(kvm, new);
+ }
+
update_memslots(slots, new, change);
slots = install_new_memslots(kvm, as_id, slots);
@@ -1683,10 +1693,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
return -EINVAL;
if (mem->guest_phys_addr & (PAGE_SIZE - 1))
return -EINVAL;
- /* We can read the guest memory with __xxx_user() later on. */
if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
- (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
- !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+ (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
+ return -EINVAL;
+ /* We can read the guest memory with __xxx_user() later on. */
+ if (!(mem->flags & KVM_MEM_FD) &&
+ !access_ok((void __user *)(unsigned long)mem->userspace_addr,
mem->memory_size))
return -EINVAL;
if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
@@ -1727,6 +1739,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
new.dirty_bitmap = NULL;
memset(&new.arch, 0, sizeof(new.arch));
} else { /* Modify an existing slot. */
+ /* Private memslots are immutable, they can only be deleted. */
+ if (mem->flags & KVM_MEM_FD && mem->private_fd >= 0)
+ return -EINVAL;
if ((new.userspace_addr != old.userspace_addr) ||
(new.npages != old.npages) ||
((new.flags ^ old.flags) & KVM_MEM_READONLY))
--
2.17.1
The current code assumes that private memory is persistent, so KVM can
check with the backing store whether private memory exists at a given
address by calling get_pfn(alloc=false).
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 75 ++++++++++++++++++++++++++++++++++++++++--
1 file changed, 73 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 40377901598b..cd5d1f923694 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3277,6 +3277,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
+ if (memslot_is_memfd(slot))
+ return max_level;
+
host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
return min(host_level, max_level);
}
@@ -4555,6 +4558,65 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
}
+static bool kvm_faultin_pfn_memfd(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault, int *r)
+{
+ int order;
+ kvm_pfn_t pfn;
+ struct kvm_memory_slot *slot = fault->slot;
+ bool priv_gfn = kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT);
+ bool priv_slot_exists = memslot_has_private(slot);
+ bool priv_gfn_exists = false;
+ int mem_convert_type;
+
+ if (priv_gfn && !priv_slot_exists) {
+ *r = RET_PF_INVALID;
+ return true;
+ }
+
+ if (priv_slot_exists) {
+ pfn = slot->memfd_ops->get_pfn(slot, slot->priv_file,
+ fault->gfn, false, &order);
+ if (pfn >= 0)
+ priv_gfn_exists = true;
+ }
+
+ if (priv_gfn && !priv_gfn_exists) {
+ mem_convert_type = KVM_EXIT_MEM_MAP_PRIVATE;
+ goto out_convert;
+ }
+
+ if (!priv_gfn && priv_gfn_exists) {
+ slot->memfd_ops->put_pfn(pfn);
+ mem_convert_type = KVM_EXIT_MEM_MAP_SHARED;
+ goto out_convert;
+ }
+
+ if (!priv_gfn) {
+ pfn = slot->memfd_ops->get_pfn(slot, slot->file,
+ fault->gfn, true, &order);
+ if (fault->pfn < 0) {
+ *r = RET_PF_INVALID;
+ return true;
+ }
+ }
+
+ if (slot->flags & KVM_MEM_READONLY)
+ fault->map_writable = false;
+ if (order == 0)
+ fault->max_level = PG_LEVEL_4K;
+
+ return false;
+
+out_convert:
+ vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
+ vcpu->run->mem.type = mem_convert_type;
+ vcpu->run->mem.u.map.gpa = fault->gfn << PAGE_SHIFT;
+ vcpu->run->mem.u.map.size = PAGE_SIZE;
+ fault->pfn = -1;
+ *r = -1;
+ return true;
+}
+
static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
{
struct kvm_memory_slot *slot = fault->slot;
@@ -4596,6 +4658,9 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
}
}
+ if (memslot_is_memfd(slot))
+ return kvm_faultin_pfn_memfd(vcpu, fault, r);
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
fault->write, &fault->map_writable,
@@ -4660,7 +4725,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
else
write_lock(&vcpu->kvm->mmu_lock);
- if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
+ if (fault->slot && !memslot_is_memfd(fault->slot) &&
+ mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
goto out_unlock;
r = make_mmu_pages_available(vcpu);
if (r)
@@ -4676,7 +4742,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
read_unlock(&vcpu->kvm->mmu_lock);
else
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+
+ if (memslot_is_memfd(fault->slot))
+ fault->slot->memfd_ops->put_pfn(fault->pfn);
+ else
+ kvm_release_pfn_clean(fault->pfn);
+
return r;
}
--
2.17.1
The purpose is to let fd-based memslots reuse the same code for memory
invalidation. The code can be reused as-is, except that 'hva' is renamed
to the more neutral 'useraddr'.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 4 ++--
virt/kvm/kvm_main.c | 44 ++++++++++++++++++++--------------------
2 files changed, 24 insertions(+), 24 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e8646103356b..925c4d9f0a31 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1340,9 +1340,9 @@ static inline bool memslot_has_private(const struct kvm_memory_slot *slot)
}
static inline gfn_t
-hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
+useraddr_to_gfn_memslot(unsigned long useraddr, struct kvm_memory_slot *slot)
{
- gfn_t gfn_offset = (hva - slot->userspace_addr) >> PAGE_SHIFT;
+ gfn_t gfn_offset = (useraddr - slot->userspace_addr) >> PAGE_SHIFT;
return slot->base_gfn + gfn_offset;
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b8673490d301..d9a6890dd18a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -471,16 +471,16 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
srcu_read_unlock(&kvm->srcu, idx);
}
-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
unsigned long end);
-struct kvm_hva_range {
+struct kvm_useraddr_range {
unsigned long start;
unsigned long end;
pte_t pte;
- hva_handler_t handler;
+ gfn_handler_t handler;
on_lock_fn_t on_lock;
bool flush_on_ret;
bool may_block;
@@ -499,8 +499,8 @@ static void kvm_null_fn(void)
}
#define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
-static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
- const struct kvm_hva_range *range)
+static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
+ const struct kvm_useraddr_range *range)
{
bool ret = false, locked = false;
struct kvm_gfn_range gfn_range;
@@ -518,12 +518,12 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
slots = __kvm_memslots(kvm, i);
kvm_for_each_memslot(slot, slots) {
- unsigned long hva_start, hva_end;
+ unsigned long useraddr_start, useraddr_end;
- hva_start = max(range->start, slot->userspace_addr);
- hva_end = min(range->end, slot->userspace_addr +
+ useraddr_start = max(range->start, slot->userspace_addr);
+ useraddr_end = min(range->end, slot->userspace_addr +
(slot->npages << PAGE_SHIFT));
- if (hva_start >= hva_end)
+ if (useraddr_start >= useraddr_end)
continue;
/*
@@ -536,11 +536,11 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.may_block = range->may_block;
/*
- * {gfn(page) | page intersects with [hva_start, hva_end)} =
+ * {gfn(page) | page intersects with [useraddr_start, useraddr_end)} =
* {gfn_start, gfn_start+1, ..., gfn_end-1}.
*/
- gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
- gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
+ gfn_range.start = useraddr_to_gfn_memslot(useraddr_start, slot);
+ gfn_range.end = useraddr_to_gfn_memslot(useraddr_end + PAGE_SIZE - 1, slot);
gfn_range.slot = slot;
if (!locked) {
@@ -571,10 +571,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
pte_t pte,
- hva_handler_t handler)
+ gfn_handler_t handler)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range range = {
+ const struct kvm_useraddr_range range = {
.start = start,
.end = end,
.pte = pte,
@@ -584,16 +584,16 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
.may_block = false,
};
- return __kvm_handle_hva_range(kvm, &range);
+ return __kvm_handle_useraddr_range(kvm, &range);
}
static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
- hva_handler_t handler)
+ gfn_handler_t handler)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range range = {
+ const struct kvm_useraddr_range range = {
.start = start,
.end = end,
.pte = __pte(0),
@@ -603,7 +603,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.may_block = false,
};
- return __kvm_handle_hva_range(kvm, &range);
+ return __kvm_handle_useraddr_range(kvm, &range);
}
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
@@ -661,7 +661,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range hva_range = {
+ const struct kvm_useraddr_range useraddr_range = {
.start = range->start,
.end = range->end,
.pte = __pte(0),
@@ -685,7 +685,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
kvm->mn_active_invalidate_count++;
spin_unlock(&kvm->mn_invalidate_lock);
- __kvm_handle_hva_range(kvm, &hva_range);
+ __kvm_handle_useraddr_range(kvm, &useraddr_range);
return 0;
}
@@ -712,7 +712,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range hva_range = {
+ const struct kvm_useraddr_range useraddr_range = {
.start = range->start,
.end = range->end,
.pte = __pte(0),
@@ -723,7 +723,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
};
bool wake;
- __kvm_handle_hva_range(kvm, &hva_range);
+ __kvm_handle_useraddr_range(kvm, &useraddr_range);
/* Pairs with the increment in range_start(). */
spin_lock(&kvm->mn_invalidate_lock);
--
2.17.1
Invalidation of fd-based memslots can reuse the code from the existing
MMU notifier path.
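As an illustration of the intended caller, the backing-store notification
in virt/kvm/memfd.c (still a stub in this series) could forward a
punched-hole range roughly as sketched below. The pgoff-to-byte-offset
conversion, the inclusive-end convention (matching guest_invalidate_page()
in the shmem patch) and the lifetime handling of 'owner' are assumptions.

/*
 * Hypothetical sketch: forward a punch-hole notification from the backing
 * store to KVM. 'owner' is the struct kvm pointer registered via
 * memfd_register_guest(); start/end are pgoff_t (end inclusive), while
 * kvm_memfd_invalidate_range() works on byte offsets into the fd.
 */
static void memfd_invalidate_page_range(struct inode *inode, void *owner,
					pgoff_t start, pgoff_t end)
{
	struct kvm *kvm = owner;

	/* NB: the series still has to guarantee 'owner' is alive here. */
	kvm_memfd_invalidate_range(kvm, inode,
				   start << PAGE_SHIFT,
				   (end + 1) << PAGE_SHIFT);
}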
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 3 +++
virt/kvm/kvm_main.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 925c4d9f0a31..f0fd32f6eab3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1883,4 +1883,7 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536
+int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end);
+
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9a6890dd18a..090afbadb03f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -811,6 +811,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
}
+int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end)
+{
+ int ret;
+ const struct kvm_useraddr_range useraddr_range = {
+ .start = start,
+ .end = end,
+ .pte = __pte(0),
+ .handler = kvm_unmap_gfn_range,
+ .on_lock = (void *)kvm_null_fn,
+ .flush_on_ret = true,
+ .may_block = false,
+ };
+
+
+ /* Prevent memslot modification */
+ spin_lock(&kvm->mn_invalidate_lock);
+ kvm->mn_active_invalidate_count++;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
+ ret = __kvm_handle_useraddr_range(kvm, &useraddr_range);
+
+ spin_lock(&kvm->mn_invalidate_lock);
+ kvm->mn_active_invalidate_count--;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
+ return ret;
+}
+
#else /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
static int kvm_init_mmu_notifier(struct kvm *kvm)
@@ -818,6 +847,12 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
return 0;
}
+int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end)
+{
+ return 0;
+}
+
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
--
2.17.1
Different fd/priv_fd can have the same userspace_addr, so start/end are
meaningful only when used together with fd/priv_fd.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
virt/kvm/kvm_main.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 090afbadb03f..65055ac460eb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -479,6 +479,7 @@ typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
struct kvm_useraddr_range {
unsigned long start;
unsigned long end;
+ struct inode *inode;
pte_t pte;
gfn_handler_t handler;
on_lock_fn_t on_lock;
@@ -520,6 +521,17 @@ static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
kvm_for_each_memslot(slot, slots) {
unsigned long useraddr_start, useraddr_end;
+ /*
+ * Skip the slot if range->inode is not the same as
+ * that in slot->file or slot->priv_file.
+ */
+ if (range->inode &&
+ (!slot->file ||
+ slot->file->f_inode != range->inode) &&
+ (!slot->priv_file ||
+ slot->priv_file->f_inode != range->inode))
+ continue;
+
useraddr_start = max(range->start, slot->userspace_addr);
useraddr_end = min(range->end, slot->userspace_addr +
(slot->npages << PAGE_SHIFT));
@@ -818,6 +830,7 @@ int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
const struct kvm_useraddr_range useraddr_range = {
.start = start,
.end = end,
+ .inode = inode,
.pte = __pte(0),
.handler = kvm_unmap_gfn_range,
.on_lock = (void *)kvm_null_fn,
--
2.17.1
This may be used in the fallocate callback for memfd-based memory to set up
the mapping in the KVM secondary MMU when pages are allocated in the memory
backing store.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 47 ++++++++++++++++++++++++++++++++++++++++
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 5 +++++
3 files changed, 54 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cd5d1f923694..5c475a161a3c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1951,6 +1951,53 @@ static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
return ret;
}
+bool kvm_map_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ struct kvm_vcpu *vcpu;
+ kvm_pfn_t pfn;
+ gfn_t gfn;
+ int idx;
+ bool ret = true;
+
+ /* Need vcpu context for kvm_mmu_do_page_fault. */
+ vcpu = kvm_get_vcpu(kvm, 0);
+ if (mutex_lock_killable(&vcpu->mutex))
+ return false;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&kvm->srcu);
+
+ kvm_mmu_reload(vcpu);
+
+ gfn = range->start;
+ while (gfn < range->end) {
+ if (signal_pending(current)) {
+ ret = false;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+ pfn = kvm_mmu_do_page_fault(vcpu, gfn << PAGE_SHIFT,
+ PFERR_WRITE_MASK | PFERR_USER_MASK,
+ false);
+ if (is_error_noslot_pfn(pfn) || kvm->vm_bugged) {
+ ret = false;
+ break;
+ }
+
+ gfn++;
+ }
+
+ srcu_read_unlock(&kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ mutex_unlock(&vcpu->mutex);
+
+ return ret;
+}
+
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f0fd32f6eab3..d841ed877b4b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -237,6 +237,8 @@ struct kvm_gfn_range {
pte_t pte;
bool may_block;
};
+
+bool kvm_map_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 65055ac460eb..492c1a99ec63 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -471,6 +471,11 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
srcu_read_unlock(&kvm->srcu, idx);
}
+bool __weak kvm_map_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return false;
+}
+
typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
--
2.17.1
It reuses the same code as kvm_memfd_invalidate_range, except that
kvm_map_gfn_range is used as the handler.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 28 +++++++++++++++++++++++++---
2 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d841ed877b4b..f1d7856be05b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1887,5 +1887,7 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
unsigned long start, unsigned long end);
+int kvm_memfd_fallocate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end);
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 492c1a99ec63..7eaafc0ae6ab 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -828,8 +828,10 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
}
-int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
- unsigned long start, unsigned long end)
+int kvm_memfd_handle_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end,
+ gfn_handler_t handler)
+
{
int ret;
const struct kvm_useraddr_range useraddr_range = {
@@ -837,7 +839,7 @@ int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
.end = end,
.inode = inode,
.pte = __pte(0),
- .handler = kvm_unmap_gfn_range,
+ .handler = handler,
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = true,
.may_block = false,
@@ -858,6 +860,20 @@ int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
return ret;
}
+int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end)
+{
+ return kvm_memfd_handle_range(kvm, inode, start, end,
+ kvm_unmap_gfn_range);
+}
+
+int kvm_memfd_fallocate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end)
+{
+ return kvm_memfd_handle_range(kvm, inode, start, end,
+ kvm_map_gfn_range);
+}
+
#else /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
static int kvm_init_mmu_notifier(struct kvm *kvm)
@@ -871,6 +887,12 @@ int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
return 0;
}
+int kvm_memfd_fallocate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end)
+{
+ return 0;
+}
+
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
--
2.17.1
Since the memory backing store does not get notified when the VM is
destroyed, we need to check whether the VM is still alive in these callbacks.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
virt/kvm/memfd.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
index bd930dcb455f..bcfdc685ce22 100644
--- a/virt/kvm/memfd.c
+++ b/virt/kvm/memfd.c
@@ -12,16 +12,38 @@
#include <linux/memfd.h>
const static struct guest_mem_ops *memfd_ops;
+static bool vm_is_dead(struct kvm *vm)
+{
+ struct kvm *kvm;
+
+ list_for_each_entry(kvm, &vm_list, vm_list) {
+ if (kvm == vm)
+ return false;
+ }
+
+ return true;
+}
+
static void memfd_invalidate_page_range(struct inode *inode, void *owner,
pgoff_t start, pgoff_t end)
{
//!!!We can get here after the owner no longer exists
+ if (vm_is_dead(owner))
+ return;
+
+ kvm_memfd_invalidate_range(owner, inode, start >> PAGE_SHIFT,
+ end >> PAGE_SHIFT);
}
static void memfd_fallocate(struct inode *inode, void *owner,
pgoff_t start, pgoff_t end)
{
//!!!We can get here after the owner no longer exists
+ if (vm_is_dead(owner))
+ return;
+
+ kvm_memfd_fallocate_range(owner, inode, start >> PAGE_SHIFT,
+ end >> PAGE_SHIFT);
}
static const struct guest_ops memfd_notifier = {
--
2.17.1
On 19.11.21 14:47, Chao Peng wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> The new seal type provides semantics required for KVM guest private
> memory support. A file descriptor with the seal set is going to be used
> as source of guest memory in confidential computing environments such as
> Intel TDX and AMD SEV.
>
> F_SEAL_GUEST can only be set on empty memfd. After the seal is set
> userspace cannot read, write or mmap the memfd.
>
> Userspace is in charge of guest memory lifecycle: it can allocate the
> memory with falloc or punch hole to free memory from the guest.
>
> The file descriptor passed down to KVM as guest memory backend. KVM
> register itself as the owner of the memfd via memfd_register_guest().
>
> KVM provides callback that needed to be called on fallocate and punch
> hole.
>
> memfd_register_guest() returns callbacks that need be used for
> requesting a new page from memfd.
>
Repeating the feedback I already shared in a private mail thread:
As long as page migration / swapping is not supported, these pages
behave like any longterm pinned pages (e.g., VFIO) or secretmem pages.
1. These pages are not MOVABLE. They must not end up on ZONE_MOVABLE or
MIGRATE_CMA.
That should be easy to handle, you have to adjust the gfp_mask to
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
just as mm/secretmem.c:secretmem_file_create() does.
2. These pages behave like mlocked pages and should be accounted as such.
This is probably where the accounting "fun" starts, but maybe it's
easier than I think to handle.
See mm/secretmem.c:secretmem_mmap(), where we account the pages as
VM_LOCKED and will consequently check per-process mlock limits. As we
don't mmap(), the same approach cannot be reused.
See drivers/vfio/vfio_iommu_type1.c:vfio_pin_map_dma() and
vfio_pin_pages_remote() on how to manually account via mm->locked_vm .
But it's a bit hairy because these pages are not actually mapped into
the page tables of the MM, so it might need some thought. Similarly,
these pages actually behave like "pinned" (as in mm->pinned_vm), but we
just don't increase the refcount AFAIR. Again, accounting really is a
bit hairy ...
--
Thanks,
David / dhildenb
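To make the two points above concrete, a minimal sketch (not from the patch
set; the helper names are invented) could pin the mapping's allocation policy
at registration time and charge pages against RLIMIT_MEMLOCK with
account_locked_vm(), assuming the backing store knows which mm_struct to
charge:

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Point 1: keep guest pages off ZONE_MOVABLE/MIGRATE_CMA and off the LRU. */
static void guest_mem_setup_mapping(struct inode *inode)
{
        mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
        mapping_set_unevictable(inode->i_mapping);
}

/*
 * Point 2: account the pages like mlocked memory against the owning mm.
 * account_locked_vm() checks RLIMIT_MEMLOCK and adjusts mm->locked_vm.
 */
static int guest_mem_charge(struct mm_struct *mm, unsigned long npages)
{
        return account_locked_vm(mm, npages, true);
}

static void guest_mem_uncharge(struct mm_struct *mm, unsigned long npages)
{
        account_locked_vm(mm, npages, false);
}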
On Fri, Nov 19, 2021 at 09:47:27PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> The new seal type provides semantics required for KVM guest private
> memory support. A file descriptor with the seal set is going to be used
> as source of guest memory in confidential computing environments such as
> Intel TDX and AMD SEV.
>
> F_SEAL_GUEST can only be set on empty memfd. After the seal is set
> userspace cannot read, write or mmap the memfd.
>
> Userspace is in charge of guest memory lifecycle: it can allocate the
> memory with falloc or punch hole to free memory from the guest.
>
> The file descriptor passed down to KVM as guest memory backend. KVM
> register itself as the owner of the memfd via memfd_register_guest().
>
> KVM provides callback that needed to be called on fallocate and punch
> hole.
>
> memfd_register_guest() returns callbacks that need be used for
> requesting a new page from memfd.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> include/linux/memfd.h | 24 ++++++++
> include/linux/shmem_fs.h | 9 +++
> include/uapi/linux/fcntl.h | 1 +
> mm/memfd.c | 33 +++++++++-
> mm/shmem.c | 123 ++++++++++++++++++++++++++++++++++++-
> 5 files changed, 186 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..ff920ef28688 100644
> +++ b/include/linux/memfd.h
> @@ -4,13 +4,37 @@
>
> #include <linux/file.h>
>
> +struct guest_ops {
> + void (*invalidate_page_range)(struct inode *inode, void *owner,
> + pgoff_t start, pgoff_t end);
> + void (*fallocate)(struct inode *inode, void *owner,
> + pgoff_t start, pgoff_t end);
> +};
> +
> +struct guest_mem_ops {
> + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset,
> + bool alloc, int *order);
> + void (*put_unlock_pfn)(unsigned long pfn);
> +
> +};
Ignoring confidential compute for a moment
If qemu can put all the guest memory in a memfd and not map it, then
I'd also like to see that the IOMMU can use this interface too so we
can have VFIO working in this configuration.
As designed the above looks useful to import a memfd to a VFIO
container but could you consider some more generic naming than calling
this 'guest' ?
Along the same lines, to support fast migration, we'd want to be able
to send these things to the RDMA subsystem as well so we can do data
xfer. Very similar to VFIO.
Also, shouldn't this be two patches? F_SEAL is not really related to
these accessors, is it?
> +extern inline int memfd_register_guest(struct inode *inode, void *owner,
> + const struct guest_ops *guest_ops,
> + const struct guest_mem_ops **guest_mem_ops);
Why does this take an inode and not a file *?
> +int shmem_register_guest(struct inode *inode, void *owner,
> + const struct guest_ops *guest_ops,
> + const struct guest_mem_ops **guest_mem_ops)
> +{
> + struct shmem_inode_info *info = SHMEM_I(inode);
> +
> + if (!owner)
> + return -EINVAL;
> +
> + if (info->guest_owner && info->guest_owner != owner)
> + return -EPERM;
And this looks like it means only a single subsystem can use this API
at once, not so nice...
Jason
On 19.11.21 16:19, Jason Gunthorpe wrote:
> On Fri, Nov 19, 2021 at 09:47:27PM +0800, Chao Peng wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> The new seal type provides semantics required for KVM guest private
>> memory support. A file descriptor with the seal set is going to be used
>> as source of guest memory in confidential computing environments such as
>> Intel TDX and AMD SEV.
>>
>> F_SEAL_GUEST can only be set on empty memfd. After the seal is set
>> userspace cannot read, write or mmap the memfd.
>>
>> Userspace is in charge of guest memory lifecycle: it can allocate the
>> memory with falloc or punch hole to free memory from the guest.
>>
>> The file descriptor passed down to KVM as guest memory backend. KVM
>> register itself as the owner of the memfd via memfd_register_guest().
>>
>> KVM provides callback that needed to be called on fallocate and punch
>> hole.
>>
>> memfd_register_guest() returns callbacks that need be used for
>> requesting a new page from memfd.
>>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> Signed-off-by: Chao Peng <[email protected]>
>> include/linux/memfd.h | 24 ++++++++
>> include/linux/shmem_fs.h | 9 +++
>> include/uapi/linux/fcntl.h | 1 +
>> mm/memfd.c | 33 +++++++++-
>> mm/shmem.c | 123 ++++++++++++++++++++++++++++++++++++-
>> 5 files changed, 186 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
>> index 4f1600413f91..ff920ef28688 100644
>> +++ b/include/linux/memfd.h
>> @@ -4,13 +4,37 @@
>>
>> #include <linux/file.h>
>>
>> +struct guest_ops {
>> + void (*invalidate_page_range)(struct inode *inode, void *owner,
>> + pgoff_t start, pgoff_t end);
>> + void (*fallocate)(struct inode *inode, void *owner,
>> + pgoff_t start, pgoff_t end);
>> +};
>> +
>> +struct guest_mem_ops {
>> + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset,
>> + bool alloc, int *order);
>> + void (*put_unlock_pfn)(unsigned long pfn);
>> +
>> +};
>
> Ignoring confidential compute for a moment
>
> If qmeu can put all the guest memory in a memfd and not map it, then
> I'd also like to see that the IOMMU can use this interface too so we
> can have VFIO working in this configuration.
In QEMU we usually want to (and must) be able to access guest memory
from user space, with the current design we wouldn't even be able to
temporarily mmap it -- which makes sense for encrypted memory only. The
corner case really is encrypted memory. So I don't think we'll see a
broad use of this feature outside of encrypted VMs in QEMU. I might be
wrong, most probably I am :)
>
> As designed the above looks useful to import a memfd to a VFIO
> container but could you consider some more generic naming than calling
> this 'guest' ?
+1 the guest terminology is somewhat sub-optimal.
>
> Along the same lines, to support fast migration, we'd want to be able
> to send these things to the RDMA subsytem as well so we can do data
> xfer. Very similar to VFIO.
>
> Also, shouldn't this be two patches? F_SEAL is not really related to
> these acessors, is it?
Apart from the special "encrypted memory" semantics, I assume nothing
speaks against allowing for mmaping these memfds, for example, for any
other VFIO use cases.
--
Thanks,
David / dhildenb
On Fri, Nov 19, 2021 at 04:39:15PM +0100, David Hildenbrand wrote:
> > If qmeu can put all the guest memory in a memfd and not map it, then
> > I'd also like to see that the IOMMU can use this interface too so we
> > can have VFIO working in this configuration.
>
> In QEMU we usually want to (and must) be able to access guest memory
> from user space, with the current design we wouldn't even be able to
> temporarily mmap it -- which makes sense for encrypted memory only. The
> corner case really is encrypted memory. So I don't think we'll see a
> broad use of this feature outside of encrypted VMs in QEMU. I might be
> wrong, most probably I am :)
Interesting..
The non-encrypted case I had in mind is the horrible flow in VFIO to
support qemu re-execing itself (VFIO_DMA_UNMAP_FLAG_VADDR).
Here VFIO is connected to a VA in a mm_struct that will become invalid
during the kexec period, but VFIO needs to continue to access it. For
IOMMU cases this is OK because the memory is already pinned, but for
the 'emulated iommu' used by mdevs pages are pinned dynamically. qemu
needs to ensure that VFIO can continue to access the pages across the
kexec, even though there is nothing to pin_user_pages() on.
This flow would work a lot better if VFIO was connected to the memfd
that is storing the guest memory. Then it naturally doesn't get
disrupted by exec() and we don't need the mess in the kernel..
I was wondering if we could get here using the direct_io APIs but this
would do the job too.
> Apart from the special "encrypted memory" semantics, I assume nothing
> speaks against allowing for mmaping these memfds, for example, for any
> other VFIO use cases.
We will eventually have VFIO with "encrypted memory". There was a talk
in LPC about the enabling work for this.
So, if the plan is to put fully encrypted memory inside a memfd, then
we will still eventually need a way to pull the pfns into the
IOMMU, presumably along with the access control parameters needed to
pass to the secure monitor to join a PCI device to the secure memory.
Jason
On Fri, Nov 19, 2021, David Hildenbrand wrote:
> On 19.11.21 16:19, Jason Gunthorpe wrote:
> > As designed the above looks useful to import a memfd to a VFIO
> > container but could you consider some more generic naming than calling
> > this 'guest' ?
>
> +1 the guest terminology is somewhat sob-optimal.
For the F_SEAL part, maybe F_SEAL_UNMAPPABLE?
No ideas for the kernel API, but that's also less concerning since it's not set
in stone. I'm also not sure that dedicated APIs for each high-ish level use case
would be a bad thing, as the semantics are unlikely to be different to some extent.
E.g. for the KVM use case, there can be at most one guest associated with the fd,
but there can be any number of VFIO devices attached to the fd.
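Whatever the seal ends up being named, the userspace flow described in the
commit message would look roughly like the sketch below (error handling
omitted; F_SEAL_GUEST is the series' current name and is assumed to be
visible via the updated uapi headers, and the fd is then handed to KVM
through the extended memslot API):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

int create_guest_memfd(size_t size)
{
        /* The seal can only be applied while the memfd is still empty. */
        int fd = memfd_create("guest-mem", MFD_ALLOW_SEALING);

        /* After this, userspace can no longer read, write or mmap the fd. */
        fcntl(fd, F_ADD_SEALS, F_SEAL_GUEST);

        /* Memory is populated with fallocate() and freed by punching holes. */
        fallocate(fd, 0, 0, size);

        return fd;
}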
On Fri, Nov 19, 2021 at 07:18:00PM +0000, Sean Christopherson wrote:
> On Fri, Nov 19, 2021, David Hildenbrand wrote:
> > On 19.11.21 16:19, Jason Gunthorpe wrote:
> > > As designed the above looks useful to import a memfd to a VFIO
> > > container but could you consider some more generic naming than calling
> > > this 'guest' ?
> >
> > +1 the guest terminology is somewhat sob-optimal.
>
> For the F_SEAL part, maybe F_SEAL_UNMAPPABLE?
Perhaps INACCESSIBLE?
> No ideas for the kernel API, but that's also less concerning since
> it's not set in stone. I'm also not sure that dedicated APIs for
> each high-ish level use case would be a bad thing, as the semantics
> are unlikely to be different to some extent. E.g. for the KVM use
> case, there can be at most one guest associated with the fd, but
> there can be any number of VFIO devices attached to the fd.
Even the kvm thing is not a hard restriction when you take away
confidential compute.
Why can't we have multiple KVMs linked to the same FD if the memory
isn't encrypted? Sure it isn't actually useful but it should work
fine.
Supporting only one thing is just a way to avoid having a linked list
of clients to broadcast invalidations to - for instance by using a
standard notifier block...
Also, how does dirty tracking work on this memory?
Jason
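The "linked list of clients" idea above could indeed be a standard notifier
chain; a rough sketch follows (illustrative only, none of these names exist
in the series, and a real version would likely keep one chain per inode
rather than a single global one):

#include <linux/fs.h>
#include <linux/notifier.h>

struct guest_mem_invalidate {
        struct inode *inode;
        pgoff_t start, end;
};

static BLOCKING_NOTIFIER_HEAD(guest_mem_chain);

/* KVM, an IOMMU driver, RDMA, ... would each register their own notifier. */
int guest_mem_register_client(struct notifier_block *nb)
{
        return blocking_notifier_chain_register(&guest_mem_chain, nb);
}

/* Called by the backing store on hole punch / truncate. */
static void guest_mem_notify_invalidate(struct inode *inode,
                                        pgoff_t start, pgoff_t end)
{
        struct guest_mem_invalidate range = {
                .inode = inode,
                .start = start,
                .end   = end,
        };

        blocking_notifier_call_chain(&guest_mem_chain, 0, &range);
}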
On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> On Fri, Nov 19, 2021 at 07:18:00PM +0000, Sean Christopherson wrote:
> > No ideas for the kernel API, but that's also less concerning since
> > it's not set in stone. I'm also not sure that dedicated APIs for
> > each high-ish level use case would be a bad thing, as the semantics
> > are unlikely to be different to some extent. E.g. for the KVM use
> > case, there can be at most one guest associated with the fd, but
> > there can be any number of VFIO devices attached to the fd.
>
> Even the kvm thing is not a hard restriction when you take away
> confidential compute.
>
> Why can't we have multiple KVMs linked to the same FD if the memory
> isn't encrypted? Sure it isn't actually useful but it should work
> fine.
Hmm, true, but I want the KVM semantics to be 1:1 even if memory isn't encrypted.
Encrypting memory with a key that isn't available to the host is necessary to
(mostly) remove the host kernel from the guest's TCB, but it's not necessary to
remove host userspace from the TCB. KVM absolutely can and should be able to do
that without relying on additional hardware/firmware. Ignoring attestation and
whether or not the guest fully trusts the host kernel, there's value in preventing
a buggy or compromised userspace from attacking/corrupting the guest by remapping
guest memory or by mapping the same memory into multiple guests.
> Supporting only one thing is just a way to avoid having a linked list
> of clients to broadcast invalidations too - for instance by using a
> standard notifier block...
It's not just avoiding the linked list, there's a trust element as well. E.g. in
the scenario where a device can access a confidential VM's encrypted private memory,
the guest is still the "owner" of the memory and needs to explicitly grant access to
a third party, e.g. the device or perhaps another VM.
That said, I'm certainly not dead set on having "guest" in the name, nor am I
opposed to implementing multi-consumer support from the get-go so we don't end
up with a mess later on.
> Also, how does dirty tracking work on this memory?
For KVM usage, KVM would provide the dirty bit info. No idea how VFIO or other
use cases would work.
On Fri, Nov 19, 2021 at 10:21:39PM +0000, Sean Christopherson wrote:
> On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > On Fri, Nov 19, 2021 at 07:18:00PM +0000, Sean Christopherson wrote:
> > > No ideas for the kernel API, but that's also less concerning since
> > > it's not set in stone. I'm also not sure that dedicated APIs for
> > > each high-ish level use case would be a bad thing, as the semantics
> > > are unlikely to be different to some extent. E.g. for the KVM use
> > > case, there can be at most one guest associated with the fd, but
> > > there can be any number of VFIO devices attached to the fd.
> >
> > Even the kvm thing is not a hard restriction when you take away
> > confidential compute.
> >
> > Why can't we have multiple KVMs linked to the same FD if the memory
> > isn't encrypted? Sure it isn't actually useful but it should work
> > fine.
>
> Hmm, true, but I want the KVM semantics to be 1:1 even if memory
> isn't encrypted.
That is policy and it doesn't belong hardwired into the kernel.
Your explanation makes me think that the F_SEAL_XX isn't defined
properly. It should be a userspace trap door to prevent any new
external accesses, including establishing new kvms, iommu's, rdmas,
mmaps, read/write, etc.
> It's not just avoiding the linked list, there's a trust element as
> well. E.g. in the scenario where a device can access a confidential
> VM's encrypted private memory, the guest is still the "owner" of the
> memory and needs to explicitly grant access to a third party,
> e.g. the device or perhaps another VM.
Authorization is some other issue - the internal kAPI should be able
to indicate it is secured memory and the API user should do whatever
dance to gain access to it. Eg for VFIO ask the realm manager to
associate the pci_device with the owner realm.
Jason
On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> On Fri, Nov 19, 2021 at 10:21:39PM +0000, Sean Christopherson wrote:
> > On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > > On Fri, Nov 19, 2021 at 07:18:00PM +0000, Sean Christopherson wrote:
> > > > No ideas for the kernel API, but that's also less concerning since
> > > > it's not set in stone. I'm also not sure that dedicated APIs for
> > > > each high-ish level use case would be a bad thing, as the semantics
> > > > are unlikely to be different to some extent. E.g. for the KVM use
> > > > case, there can be at most one guest associated with the fd, but
> > > > there can be any number of VFIO devices attached to the fd.
> > >
> > > Even the kvm thing is not a hard restriction when you take away
> > > confidential compute.
> > >
> > > Why can't we have multiple KVMs linked to the same FD if the memory
> > > isn't encrypted? Sure it isn't actually useful but it should work
> > > fine.
> >
> > Hmm, true, but I want the KVM semantics to be 1:1 even if memory
> > isn't encrypted.
>
> That is policy and it doesn't belong hardwired into the kernel.
Agreed. I had a blurb typed up about that policy just being an "exclusive" flag
in the kernel API that KVM would set when creating a confidential VM, but deleted
it and forgot to restore it when I went down the tangent of removing userspace
from the TCB without an assist from hardware/firmware.
> Your explanation makes me think that the F_SEAL_XX isn't defined
> properly. It should be a userspace trap door to prevent any new
> external accesses, including establishing new kvms, iommu's, rdmas,
> mmaps, read/write, etc.
Hmm, the way I was thinking of it is that it the F_SEAL_XX itself would prevent
mapping/accessing it from userspace, and that any policy beyond that would be
done via kernel APIs and thus handled by whatever in-kernel agent can access the
memory. E.g. in the confidential VM case, without support for trusted devices,
KVM would require that it be the sole owner of the file.
On Fri, Nov 19, 2021 at 09:47:33PM +0800, Chao Peng wrote:
> Current code assume the private memory is persistent and KVM can check
> with backing store to see if private memory exists at the same address
> by calling get_pfn(alloc=false).
>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 75 ++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 73 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 40377901598b..cd5d1f923694 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3277,6 +3277,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> if (max_level == PG_LEVEL_4K)
> return PG_LEVEL_4K;
>
> + if (memslot_is_memfd(slot))
> + return max_level;
> +
> host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> return min(host_level, max_level);
> }
> @@ -4555,6 +4558,65 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> }
>
> +static bool kvm_faultin_pfn_memfd(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault, int *r)
> +{ int order;
> + kvm_pfn_t pfn;
> + struct kvm_memory_slot *slot = fault->slot;
> + bool priv_gfn = kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT);
> + bool priv_slot_exists = memslot_has_private(slot);
> + bool priv_gfn_exists = false;
> + int mem_convert_type;
> +
> + if (priv_gfn && !priv_slot_exists) {
> + *r = RET_PF_INVALID;
> + return true;
> + }
> +
> + if (priv_slot_exists) {
> + pfn = slot->memfd_ops->get_pfn(slot, slot->priv_file,
> + fault->gfn, false, &order);
> + if (pfn >= 0)
> + priv_gfn_exists = true;
Need "fault->pfn = pfn" here if actual pfn is returned in
get_pfn(alloc=false) case for private page case.
> + }
> +
> + if (priv_gfn && !priv_gfn_exists) {
> + mem_convert_type = KVM_EXIT_MEM_MAP_PRIVATE;
> + goto out_convert;
> + }
> +
> + if (!priv_gfn && priv_gfn_exists) {
> + slot->memfd_ops->put_pfn(pfn);
> + mem_convert_type = KVM_EXIT_MEM_MAP_SHARED;
> + goto out_convert;
> + }
> +
> + if (!priv_gfn) {
> + pfn = slot->memfd_ops->get_pfn(slot, slot->file,
> + fault->gfn, true, &order);
Need "fault->pfn = pfn" here, because he pfn for
share page is getted here only.
> + if (fault->pfn < 0) {
> + *r = RET_PF_INVALID;
> + return true;
> + }
> + }
> +
> + if (slot->flags & KVM_MEM_READONLY)
> + fault->map_writable = false;
> + if (order == 0)
> + fault->max_level = PG_LEVEL_4K;
> +
> + return false;
> +
> +out_convert:
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
> + vcpu->run->mem.type = mem_convert_type;
> + vcpu->run->mem.u.map.gpa = fault->gfn << PAGE_SHIFT;
> + vcpu->run->mem.u.map.size = PAGE_SIZE;
> + fault->pfn = -1;
> + *r = -1;
> + return true;
> +}
> +
> static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> {
> struct kvm_memory_slot *slot = fault->slot;
> @@ -4596,6 +4658,9 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> }
> }
>
> + if (memslot_is_memfd(slot))
> + return kvm_faultin_pfn_memfd(vcpu, fault, r);
> +
> async = false;
> fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> fault->write, &fault->map_writable,
> @@ -4660,7 +4725,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> else
> write_lock(&vcpu->kvm->mmu_lock);
>
> - if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
> + if (fault->slot && !memslot_is_memfd(fault->slot) &&
> + mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
> goto out_unlock;
> r = make_mmu_pages_available(vcpu);
> if (r)
> @@ -4676,7 +4742,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> read_unlock(&vcpu->kvm->mmu_lock);
> else
> write_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);
> +
> + if (memslot_is_memfd(fault->slot))
> + fault->slot->memfd_ops->put_pfn(fault->pfn);
> + else
> + kvm_release_pfn_clean(fault->pfn);
> +
> return r;
> }
>
> --
> 2.17.1
>
On Sat, Nov 20, 2021 at 01:23:16AM +0000, Sean Christopherson wrote:
> On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > On Fri, Nov 19, 2021 at 10:21:39PM +0000, Sean Christopherson wrote:
> > > On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > > > On Fri, Nov 19, 2021 at 07:18:00PM +0000, Sean Christopherson wrote:
> > > > > No ideas for the kernel API, but that's also less concerning since
> > > > > it's not set in stone. I'm also not sure that dedicated APIs for
> > > > > each high-ish level use case would be a bad thing, as the semantics
> > > > > are unlikely to be different to some extent. E.g. for the KVM use
> > > > > case, there can be at most one guest associated with the fd, but
> > > > > there can be any number of VFIO devices attached to the fd.
> > > >
> > > > Even the kvm thing is not a hard restriction when you take away
> > > > confidential compute.
> > > >
> > > > Why can't we have multiple KVMs linked to the same FD if the memory
> > > > isn't encrypted? Sure it isn't actually useful but it should work
> > > > fine.
> > >
> > > Hmm, true, but I want the KVM semantics to be 1:1 even if memory
> > > isn't encrypted.
> >
> > That is policy and it doesn't belong hardwired into the kernel.
>
> Agreed. I had a blurb typed up about that policy just being an "exclusive" flag
> in the kernel API that KVM would set when creating a confidential
> VM,
I still think that is policy in the kernel, what is wrong with
userspace doing it?
> > Your explanation makes me think that the F_SEAL_XX isn't defined
> > properly. It should be a userspace trap door to prevent any new
> > external accesses, including establishing new kvms, iommu's, rdmas,
> > mmaps, read/write, etc.
>
> Hmm, the way I was thinking of it is that it the F_SEAL_XX itself would prevent
> mapping/accessing it from userspace, and that any policy beyond that would be
> done via kernel APIs and thus handled by whatever in-kernel agent can access the
> memory. E.g. in the confidential VM case, without support for trusted devices,
> KVM would require that it be the sole owner of the file.
And how would kvm know if there is support for trusted devices?
Again seems like policy choices that should be left in userspace.
Especially for what could be a general in-kernel mechanism with many
users and not tightly linked to KVM as imagined here.
Jason
On Sat, Nov 20, 2021 at 09:55:29AM +0800, Yao Yuan wrote:
> On Fri, Nov 19, 2021 at 09:47:33PM +0800, Chao Peng wrote:
> > Current code assume the private memory is persistent and KVM can check
> > with backing store to see if private memory exists at the same address
> > by calling get_pfn(alloc=false).
> >
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 75 ++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 73 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 40377901598b..cd5d1f923694 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3277,6 +3277,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > if (max_level == PG_LEVEL_4K)
> > return PG_LEVEL_4K;
> >
> > + if (memslot_is_memfd(slot))
> > + return max_level;
> > +
> > host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> > return min(host_level, max_level);
> > }
> > @@ -4555,6 +4558,65 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> > }
> >
> > +static bool kvm_faultin_pfn_memfd(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault, int *r)
> > +{ int order;
> > + kvm_pfn_t pfn;
> > + struct kvm_memory_slot *slot = fault->slot;
> > + bool priv_gfn = kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT);
> > + bool priv_slot_exists = memslot_has_private(slot);
> > + bool priv_gfn_exists = false;
> > + int mem_convert_type;
> > +
> > + if (priv_gfn && !priv_slot_exists) {
> > + *r = RET_PF_INVALID;
> > + return true;
> > + }
> > +
> > + if (priv_slot_exists) {
> > + pfn = slot->memfd_ops->get_pfn(slot, slot->priv_file,
> > + fault->gfn, false, &order);
> > + if (pfn >= 0)
> > + priv_gfn_exists = true;
>
> Need "fault->pfn = pfn" here if actual pfn is returned in
> get_pfn(alloc=false) case for private page case.
>
> > + }
> > +
> > + if (priv_gfn && !priv_gfn_exists) {
> > + mem_convert_type = KVM_EXIT_MEM_MAP_PRIVATE;
> > + goto out_convert;
> > + }
> > +
> > + if (!priv_gfn && priv_gfn_exists) {
> > + slot->memfd_ops->put_pfn(pfn);
> > + mem_convert_type = KVM_EXIT_MEM_MAP_SHARED;
> > + goto out_convert;
> > + }
> > +
> > + if (!priv_gfn) {
> > + pfn = slot->memfd_ops->get_pfn(slot, slot->file,
> > + fault->gfn, true, &order);
>
> Need "fault->pfn = pfn" here, because he pfn for
> share page is getted here only.
>
> > + if (fault->pfn < 0) {
> > + *r = RET_PF_INVALID;
> > + return true;
> > + }
> > + }
Right, I actually had "fault->pfn = pfn" here but accidentally deleted it
during a code refactoring.
Chao
> > +
> > + if (slot->flags & KVM_MEM_READONLY)
> > + fault->map_writable = false;
> > + if (order == 0)
> > + fault->max_level = PG_LEVEL_4K;
> > +
> > + return false;
> > +
> > +out_convert:
> > + vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
> > + vcpu->run->mem.type = mem_convert_type;
> > + vcpu->run->mem.u.map.gpa = fault->gfn << PAGE_SHIFT;
> > + vcpu->run->mem.u.map.size = PAGE_SIZE;
> > + fault->pfn = -1;
> > + *r = -1;
> > + return true;
> > +}
> > +
> > static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> > {
> > struct kvm_memory_slot *slot = fault->slot;
> > @@ -4596,6 +4658,9 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > }
> > }
> >
> > + if (memslot_is_memfd(slot))
> > + return kvm_faultin_pfn_memfd(vcpu, fault, r);
> > +
> > async = false;
> > fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> > fault->write, &fault->map_writable,
> > @@ -4660,7 +4725,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > else
> > write_lock(&vcpu->kvm->mmu_lock);
> >
> > - if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
> > + if (fault->slot && !memslot_is_memfd(fault->slot) &&
> > + mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva))
> > goto out_unlock;
> > r = make_mmu_pages_available(vcpu);
> > if (r)
> > @@ -4676,7 +4742,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > read_unlock(&vcpu->kvm->mmu_lock);
> > else
> > write_unlock(&vcpu->kvm->mmu_lock);
> > - kvm_release_pfn_clean(fault->pfn);
> > +
> > + if (memslot_is_memfd(fault->slot))
> > + fault->slot->memfd_ops->put_pfn(fault->pfn);
> > + else
> > + kvm_release_pfn_clean(fault->pfn);
> > +
> > return r;
> > }
> >
> > --
> > 2.17.1
> >
On 19.11.21 17:00, Jason Gunthorpe wrote:
> On Fri, Nov 19, 2021 at 04:39:15PM +0100, David Hildenbrand wrote:
>
>>> If qmeu can put all the guest memory in a memfd and not map it, then
>>> I'd also like to see that the IOMMU can use this interface too so we
>>> can have VFIO working in this configuration.
>>
>> In QEMU we usually want to (and must) be able to access guest memory
>> from user space, with the current design we wouldn't even be able to
>> temporarily mmap it -- which makes sense for encrypted memory only. The
>> corner case really is encrypted memory. So I don't think we'll see a
>> broad use of this feature outside of encrypted VMs in QEMU. I might be
>> wrong, most probably I am :)
>
> Interesting..
>
> The non-encrypted case I had in mind is the horrible flow in VFIO to
> support qemu re-execing itself (VFIO_DMA_UNMAP_FLAG_VADDR).
Thanks for sharing!
>
> Here VFIO is connected to a VA in a mm_struct that will become invalid
> during the kexec period, but VFIO needs to continue to access it. For
> IOMMU cases this is OK because the memory is already pinned, but for
> the 'emulated iommu' used by mdevs pages are pinned dynamically. qemu
> needs to ensure that VFIO can continue to access the pages across the
> kexec, even though there is nothing to pin_user_pages() on.
>
> This flow would work a lot better if VFIO was connected to the memfd
> that is storing the guest memory. Then it naturally doesn't get
> disrupted by exec() and we don't need the mess in the kernel..
I do wonder if we want to support sharing such memfds between processes
in all cases ... we most certainly don't want to be able to share
encrypted memory between VMs (I heard that the kernel has to forbid
that). It would make sense in the use case you describe, though.
>
> I was wondering if we could get here using the direct_io APIs but this
> would do the job too.
>
>> Apart from the special "encrypted memory" semantics, I assume nothing
>> speaks against allowing for mmaping these memfds, for example, for any
>> other VFIO use cases.
>
> We will eventually have VFIO with "encrypted memory". There was a talk
> in LPC about the enabling work for this.
Yes, I heard about that as well. In the foreseeable future, we'll have
shared memory only visible for VFIO devices.
>
> So, if the plan is to put fully encrpyted memory inside a memfd, then
> we still will eventually need a way to pull the pfns it into the
> IOMMU, presumably along with the access control parameters needed to
> pass to the secure monitor to join a PCI device to the secure memory.
Long-term, agreed.
--
Thanks,
David / dhildenb
On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
> I do wonder if we want to support sharing such memfds between processes
> in all cases ... we most certainly don't want to be able to share
> encrypted memory between VMs (I heard that the kernel has to forbid
> that). It would make sense in the use case you describe, though.
If there is a F_SEAL_XX that blocks every kind of new access, who
cares if userspace passes the FD around or not?
Jason
On 22.11.21 14:31, Jason Gunthorpe wrote:
> On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
>
>> I do wonder if we want to support sharing such memfds between processes
>> in all cases ... we most certainly don't want to be able to share
>> encrypted memory between VMs (I heard that the kernel has to forbid
>> that). It would make sense in the use case you describe, though.
>
> If there is a F_SEAL_XX that blocks every kind of new access, who
> cares if userspace passes the FD around or not?
I was imagining that you actually would want to do some kind of "change
ownership". But yeah, the intended semantics and all use cases we have
in mind are not fully clear to me yet. If it's really "no new access"
(side note: is "access" the right word?) then sure, we can pass the fd
around.
--
Thanks,
David / dhildenb
On Fri, Nov 19, 2021 at 02:51:11PM +0100, David Hildenbrand wrote:
> On 19.11.21 14:47, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > The new seal type provides semantics required for KVM guest private
> > memory support. A file descriptor with the seal set is going to be used
> > as source of guest memory in confidential computing environments such as
> > Intel TDX and AMD SEV.
> >
> > F_SEAL_GUEST can only be set on empty memfd. After the seal is set
> > userspace cannot read, write or mmap the memfd.
> >
> > Userspace is in charge of guest memory lifecycle: it can allocate the
> > memory with falloc or punch hole to free memory from the guest.
> >
> > The file descriptor passed down to KVM as guest memory backend. KVM
> > register itself as the owner of the memfd via memfd_register_guest().
> >
> > KVM provides callback that needed to be called on fallocate and punch
> > hole.
> >
> > memfd_register_guest() returns callbacks that need be used for
> > requesting a new page from memfd.
> >
>
> Repeating the feedback I already shared in a private mail thread:
>
>
> As long as page migration / swapping is not supported, these pages
> behave like any longterm pinned pages (e.g., VFIO) or secretmem pages.
>
> 1. These pages are not MOVABLE. They must not end up on ZONE_MOVABLE or
> MIGRATE_CMA.
>
> That should be easy to handle, you have to adjust the gfp_mask to
> mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> just as mm/secretmem.c:secretmem_file_create() does.
Okay, fair enough. mapping_set_unevictable() also makes sense.
> 2. These pages behave like mlocked pages and should be accounted as such.
>
> This is probably where the accounting "fun" starts, but maybe it's
> easier than I think to handle.
>
> See mm/secretmem.c:secretmem_mmap(), where we account the pages as
> VM_LOCKED and will consequently check per-process mlock limits. As we
> don't mmap(), the same approach cannot be reused.
>
> See drivers/vfio/vfio_iommu_type1.c:vfio_pin_map_dma() and
> vfio_pin_pages_remote() on how to manually account via mm->locked_vm .
>
> But it's a bit hairy because these pages are not actually mapped into
> the page tables of the MM, so it might need some thought. Similarly,
> these pages actually behave like "pinned" (as in mm->pinned_vm), but we
> just don't increase the refcount AFAIR. Again, accounting really is a
> bit hairy ...
Accounting is fun indeed. Non-mapped mlocked memory is going to be
confusing. Hm...
I will look closer.
--
Kirill A. Shutemov
On Mon, Nov 22, 2021 at 02:35:49PM +0100, David Hildenbrand wrote:
> On 22.11.21 14:31, Jason Gunthorpe wrote:
> > On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
> >
> >> I do wonder if we want to support sharing such memfds between processes
> >> in all cases ... we most certainly don't want to be able to share
> >> encrypted memory between VMs (I heard that the kernel has to forbid
> >> that). It would make sense in the use case you describe, though.
> >
> > If there is a F_SEAL_XX that blocks every kind of new access, who
> > cares if userspace passes the FD around or not?
> I was imagining that you actually would want to do some kind of "change
> ownership". But yeah, the intended semantics and all use cases we have
> in mind are not fully clear to me yet. If it's really "no new access"
> (side note: is "access" the right word?) then sure, we can pass the fd
> around.
What is "ownership" in a world with kvm and iommu are reading pages
out of the same fd?
"no new access" makes sense to me, we have access through
read/write/mmap/splice/etc and access to pages through the private in
kernel interface (kvm, iommu)
Jason
On Fri, Nov 19, 2021 at 09:47:39PM +0800, Chao Peng wrote:
> Since the memory backing store does not get notified when VM is
> destroyed so need check if VM is still live in these callbacks.
>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> virt/kvm/memfd.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
> index bd930dcb455f..bcfdc685ce22 100644
> --- a/virt/kvm/memfd.c
> +++ b/virt/kvm/memfd.c
> @@ -12,16 +12,38 @@
> #include <linux/memfd.h>
> const static struct guest_mem_ops *memfd_ops;
>
> +static bool vm_is_dead(struct kvm *vm)
> +{
> + struct kvm *kvm;
> +
> + list_for_each_entry(kvm, &vm_list, vm_list) {
> + if (kvm == vm)
> + return false;
> + }
I don't think this is enough. The struct kvm can be freed and re-allocated
from the slab and this function will give a false negative.
Maybe the kvm has to be tagged with a sequential id that is incremented on
every allocation. This id can be checked here.
> +
> + return true;
> +}
--
Kirill A. Shutemov
On 22.11.21 15:01, Jason Gunthorpe wrote:
> On Mon, Nov 22, 2021 at 02:35:49PM +0100, David Hildenbrand wrote:
>> On 22.11.21 14:31, Jason Gunthorpe wrote:
>>> On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
>>>
>>>> I do wonder if we want to support sharing such memfds between processes
>>>> in all cases ... we most certainly don't want to be able to share
>>>> encrypted memory between VMs (I heard that the kernel has to forbid
>>>> that). It would make sense in the use case you describe, though.
>>>
>>> If there is a F_SEAL_XX that blocks every kind of new access, who
>>> cares if userspace passes the FD around or not?
>> I was imagining that you actually would want to do some kind of "change
>> ownership". But yeah, the intended semantics and all use cases we have
>> in mind are not fully clear to me yet. If it's really "no new access"
>> (side note: is "access" the right word?) then sure, we can pass the fd
>> around.
>
> What is "ownership" in a world with kvm and iommu are reading pages
> out of the same fd?
In the world of encrypted memory / TDX, KVM somewhat "owns" that memory
IMHO (for example, only it can migrate or swap out these pages; it
might be debatable whether the TDX module or KVM actually "owns" these pages).
--
Thanks,
David / dhildenb
On Mon, Nov 22, 2021 at 03:57:17PM +0100, David Hildenbrand wrote:
> On 22.11.21 15:01, Jason Gunthorpe wrote:
> > On Mon, Nov 22, 2021 at 02:35:49PM +0100, David Hildenbrand wrote:
> >> On 22.11.21 14:31, Jason Gunthorpe wrote:
> >>> On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
> >>>
> >>>> I do wonder if we want to support sharing such memfds between processes
> >>>> in all cases ... we most certainly don't want to be able to share
> >>>> encrypted memory between VMs (I heard that the kernel has to forbid
> >>>> that). It would make sense in the use case you describe, though.
> >>>
> >>> If there is a F_SEAL_XX that blocks every kind of new access, who
> >>> cares if userspace passes the FD around or not?
> >> I was imagining that you actually would want to do some kind of "change
> >> ownership". But yeah, the intended semantics and all use cases we have
> >> in mind are not fully clear to me yet. If it's really "no new access"
> >> (side note: is "access" the right word?) then sure, we can pass the fd
> >> around.
> >
> > What is "ownership" in a world with kvm and iommu are reading pages
> > out of the same fd?
>
> In the world of encrypted memory / TDX, KVM somewhat "owns" that memory
> IMHO (for example, only it can migrate or swap out these pages; it's
> might be debatable if the TDX module or KVM actually "own" these pages ).
Sounds like it is a swap provider more than an owner?
Jason
On 22.11.21 16:09, Jason Gunthorpe wrote:
> On Mon, Nov 22, 2021 at 03:57:17PM +0100, David Hildenbrand wrote:
>> On 22.11.21 15:01, Jason Gunthorpe wrote:
>>> On Mon, Nov 22, 2021 at 02:35:49PM +0100, David Hildenbrand wrote:
>>>> On 22.11.21 14:31, Jason Gunthorpe wrote:
>>>>> On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
>>>>>
>>>>>> I do wonder if we want to support sharing such memfds between processes
>>>>>> in all cases ... we most certainly don't want to be able to share
>>>>>> encrypted memory between VMs (I heard that the kernel has to forbid
>>>>>> that). It would make sense in the use case you describe, though.
>>>>>
>>>>> If there is a F_SEAL_XX that blocks every kind of new access, who
>>>>> cares if userspace passes the FD around or not?
>>>> I was imagining that you actually would want to do some kind of "change
>>>> ownership". But yeah, the intended semantics and all use cases we have
>>>> in mind are not fully clear to me yet. If it's really "no new access"
>>>> (side note: is "access" the right word?) then sure, we can pass the fd
>>>> around.
>>>
>>> What is "ownership" in a world with kvm and iommu are reading pages
>>> out of the same fd?
>>
>> In the world of encrypted memory / TDX, KVM somewhat "owns" that memory
>> IMHO (for example, only it can migrate or swap out these pages; it's
>> might be debatable if the TDX module or KVM actually "own" these pages ).
>
> Sounds like it is a swap provider more than an owner?
Yes, I think we can phrase it that way, + "migrate provider"
--
Thanks,
David / dhildenb
On Mon, Nov 22, 2021 at 05:16:47PM +0300, Kirill A. Shutemov wrote:
> On Fri, Nov 19, 2021 at 09:47:39PM +0800, Chao Peng wrote:
> > Since the memory backing store does not get notified when VM is
> > destroyed so need check if VM is still live in these callbacks.
> >
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > virt/kvm/memfd.c | 22 ++++++++++++++++++++++
> > 1 file changed, 22 insertions(+)
> >
> > diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
> > index bd930dcb455f..bcfdc685ce22 100644
> > --- a/virt/kvm/memfd.c
> > +++ b/virt/kvm/memfd.c
> > @@ -12,16 +12,38 @@
> > #include <linux/memfd.h>
> > const static struct guest_mem_ops *memfd_ops;
> >
> > +static bool vm_is_dead(struct kvm *vm)
> > +{
> > + struct kvm *kvm;
> > +
> > + list_for_each_entry(kvm, &vm_list, vm_list) {
> > + if (kvm == vm)
> > + return false;
> > + }
>
> I don't think this is enough. The struct kvm can be freed and re-allocated
> from the slab and this function will give false-negetive.
Right.
>
> Maybe the kvm has to be tagged with a sequential id that incremented every
> allocation. This id can be checked here.
Sounds like a sequential id will be needed; no existing field in struct
kvm can work for this.
>
> > +
> > + return true;
> > +}
>
> --
> Kirill A. Shutemov
On 11/19/21 14:47, Chao Peng wrote:
> For fd-based memslot store the file references for shared fd and the
> private fd (if any) in the memslot structure. Since there is no 'hva'
> concept we cannot call hva_to_pfn() to get a pfn, instead kvm_memfd_ops
> is added to get_pfn/put_pfn from the memory backing stores that provide
> these fds.
>
> Signed-off-by: Yu Zhang<[email protected]>
> Signed-off-by: Chao Peng<[email protected]>
> ---
What about kvm_read/write_guest? Maybe the proposal which kept
userspace_addr for the shared fd is more doable (it would be great to
ultimately remove the mandatory userspace mapping for the shared fd, but
I think KVM is not quite ready for that).
Paolo
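To get a feel for what a memfd-backed kvm_read_guest_page() would involve, a
hypothetical sketch (not part of the series) could fetch the page through the
series' memfd_ops and kmap it instead of copying through an hva; the error
convention of get_pfn() is assumed here to follow KVM's error-pfn encoding:

static int kvm_memfd_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
                                     void *data, int offset, int len)
{
        struct page *page;
        void *kaddr;
        kvm_pfn_t pfn;
        int order;

        pfn = slot->memfd_ops->get_pfn(slot, slot->file, gfn, true, &order);
        if (is_error_noslot_pfn(pfn))
                return -EFAULT;

        page = pfn_to_page(pfn);
        kaddr = kmap(page);
        memcpy(data, kaddr + offset, len);
        kunmap(page);

        slot->memfd_ops->put_pfn(pfn);
        return 0;
}

The write side would be symmetric, and dirty tracking would need the same
treatment as the existing hva-based helpers.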
On 11/19/21 14:47, Chao Peng wrote:
> +
> + /* Prevent memslot modification */
> + spin_lock(&kvm->mn_invalidate_lock);
> + kvm->mn_active_invalidate_count++;
> + spin_unlock(&kvm->mn_invalidate_lock);
> +
> + ret = __kvm_handle_useraddr_range(kvm, &useraddr_range);
> +
> + spin_lock(&kvm->mn_invalidate_lock);
> + kvm->mn_active_invalidate_count--;
> + spin_unlock(&kvm->mn_invalidate_lock);
> +
You need to follow this with a rcuwait_wake_up as in
kvm_mmu_notifier_invalidate_range_end.
It's probably best if you move the manipulations of
mn_active_invalidate_count from kvm_mmu_notifier_invalidate_range_* to
two separate functions.
Paolo
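For illustration, the two helpers could look roughly like this (a sketch
only; the helper names are made up, and the rcuwait is assumed to be the
kvm->mn_memslots_update_rcuwait used by kvm_mmu_notifier_invalidate_range_end):

static void kvm_mn_invalidate_begin(struct kvm *kvm)
{
        spin_lock(&kvm->mn_invalidate_lock);
        kvm->mn_active_invalidate_count++;
        spin_unlock(&kvm->mn_invalidate_lock);
}

static void kvm_mn_invalidate_end(struct kvm *kvm)
{
        bool wake;

        spin_lock(&kvm->mn_invalidate_lock);
        wake = (--kvm->mn_active_invalidate_count == 0);
        spin_unlock(&kvm->mn_invalidate_lock);

        /* Wake up any memslot update waiting for invalidations to finish. */
        if (wake)
                rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
}

Both kvm_mmu_notifier_invalidate_range_start/end() and
kvm_memfd_handle_range() could then share them.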
On 11/19/21 14:47, Chao Peng wrote:
> + list_for_each_entry(kvm, &vm_list, vm_list) {
> + if (kvm == vm)
> + return false;
> + }
> +
> + return true;
This would have to take the kvm_lock, but see my reply to patch 1.
Paolo
On 11/19/21 14:47, Chao Peng wrote:
> +static void guest_invalidate_page(struct inode *inode,
> + struct page *page, pgoff_t start, pgoff_t end)
> +{
> + struct shmem_inode_info *info = SHMEM_I(inode);
> +
> + if (!info->guest_ops || !info->guest_ops->invalidate_page_range)
> + return;
> +
> + start = max(start, page->index);
> + end = min(end, page->index + thp_nr_pages(page)) - 1;
> +
> + info->guest_ops->invalidate_page_range(inode, info->guest_owner,
> + start, end);
> +}
The lack of protection makes the API quite awkward to use;
the usual way to do this is with refcount_inc_not_zero (aka
kvm_get_kvm_safe).
Can you use the shmem_inode_info spinlock to protect against this? If
register/unregister take the spinlock, the invalidate and fallocate can
take a reference under the same spinlock, like this:
if (!info->guest_ops)
return;
spin_lock(&info->lock);
ops = info->guest_ops;
if (!ops) {
spin_unlock(&info->lock);
return;
}
/* Calls kvm_get_kvm_safe. */
r = ops->get_guest_owner(info->guest_owner);
spin_unlock(&info->lock);
if (r < 0)
return;
start = max(start, page->index);
end = min(end, page->index + thp_nr_pages(page)) - 1;
ops->invalidate_page_range(inode, info->guest_owner,
start, end);
ops->put_guest_owner(info->guest_owner);
Considering that you have to take a mutex anyway in patch 13, and that
the critical section here is very small, the extra indirect calls are
cheaper than walking the vm_list; and it makes the API clearer.
Paolo
On 11/19/21 16:39, David Hildenbrand wrote:
>> If qmeu can put all the guest memory in a memfd and not map it, then
>> I'd also like to see that the IOMMU can use this interface too so we
>> can have VFIO working in this configuration.
>
> In QEMU we usually want to (and must) be able to access guest memory
> from user space, with the current design we wouldn't even be able to
> temporarily mmap it -- which makes sense for encrypted memory only. The
> corner case really is encrypted memory. So I don't think we'll see a
> broad use of this feature outside of encrypted VMs in QEMU. I might be
> wrong, most probably I am:)
It's not _that_ crazy an idea, but it's going to be some work to teach
KVM that it has to kmap/kunmap around all memory accesses.
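Roughly, every helper that today does __copy_from_user() on the hva would
instead have to map the page itself, along the lines of the sketch below
(the get_pfn/put_pfn callbacks and the way they hang off the memslot are
placeholders here, not the interface from this series):

static int kvm_read_guest_page_memfd(struct kvm_memory_slot *slot, gfn_t gfn,
                                     void *data, int offset, int len)
{
        pgoff_t index = gfn - slot->base_gfn;
        struct page *page;
        kvm_pfn_t pfn;
        void *kaddr;

        /* Placeholder callback: ask the backing store for the page. */
        pfn = slot->memfd_ops->get_pfn(slot, index);
        if (is_error_noslot_pfn(pfn))
                return -EFAULT;

        page = pfn_to_page(pfn);
        kaddr = kmap(page);
        memcpy(data, kaddr + offset, len);
        kunmap(page);

        /* Placeholder callback: drop the reference taken by get_pfn(). */
        slot->memfd_ops->put_pfn(pfn);
        return 0;
}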
I think it's great that memfd hooks are usable by more than one
subsystem, OTOH it's fair that whoever needs it does the work---and VFIO
does not need it for confidential VMs, yet, so it should be fine for now
to have a single user.
On the other hand, as I commented already, the lack of locking in the
register/unregister functions has to be fixed even with a single user.
Another thing we can do already is change the guest_ops/guest_mem_ops to
something like memfd_falloc_notifier_ops/memfd_pfn_ops, and the
register/unregister functions to memfd_register/unregister_falloc_notifier.
Chao, can you also put this under a new CONFIG such as "bool MEMFD_OPS",
and select it from KVM?
Thanks,
Paolo
On 11/23/21 02:06, Chao Peng wrote:
>> Maybe the kvm has to be tagged with a sequential id that is incremented on
>> every allocation. This id can be checked here.
> Sounds like a sequential id will be needed; no existing fields in struct
> kvm can work for this.
There's no need for new concepts when there's a perfectly usable
reference count. :)
Paolo
On Tue, Nov 23, 2021 at 09:46:34AM +0100, Paolo Bonzini wrote:
> On 11/19/21 14:47, Chao Peng wrote:
> > +
> > + /* Prevent memslot modification */
> > + spin_lock(&kvm->mn_invalidate_lock);
> > + kvm->mn_active_invalidate_count++;
> > + spin_unlock(&kvm->mn_invalidate_lock);
> > +
> > + ret = __kvm_handle_useraddr_range(kvm, &useraddr_range);
> > +
> > + spin_lock(&kvm->mn_invalidate_lock);
> > + kvm->mn_active_invalidate_count--;
> > + spin_unlock(&kvm->mn_invalidate_lock);
> > +
>
>
> You need to follow this with a rcuwait_wake_up as in
> kvm_mmu_notifier_invalidate_range_end.
Oh right.
>
> It's probably best if you move the manipulations of
> mn_active_invalidate_count from kvm_mmu_notifier_invalidate_range_* to two
> separate functions.
Will do.
>
> Paolo
On Tue, Nov 23, 2021 at 09:41:34AM +0100, Paolo Bonzini wrote:
> On 11/19/21 14:47, Chao Peng wrote:
> > For fd-based memslots, store the file references for the shared fd and
> > the private fd (if any) in the memslot structure. Since there is no
> > 'hva' concept, we cannot call hva_to_pfn() to get a pfn; instead,
> > kvm_memfd_ops is added to get_pfn/put_pfn from the memory backing
> > stores that provide these fds.
> >
> > Signed-off-by: Yu Zhang<[email protected]>
> > Signed-off-by: Chao Peng<[email protected]>
> > ---
>
> What about kvm_read/write_guest?
Hmm, that would be another area KVM needs to change. Not totally
undoable.
> Maybe the proposal which kept
> userspace_addr for the shared fd is more doable (it would be great to
> ultimately remove the mandatory userspace mapping for the shared fd, but I
> think KVM is not quite ready for that).
Agreed, for the short term keeping the shared part unchanged would make
the work easier :) Let me try that and see if there are any blockers.
>
> Paolo
On Tue, Nov 23, 2021 at 10:06:02AM +0100, Paolo Bonzini wrote:
> On 11/19/21 16:39, David Hildenbrand wrote:
> > > If QEMU can put all the guest memory in a memfd and not map it, then
> > > I'd also like to see that the IOMMU can use this interface too so we
> > > can have VFIO working in this configuration.
> >
> > In QEMU we usually want to (and must) be able to access guest memory
> > from user space, with the current design we wouldn't even be able to
> > temporarily mmap it -- which makes sense for encrypted memory only. The
> > corner case really is encrypted memory. So I don't think we'll see a
> > broad use of this feature outside of encrypted VMs in QEMU. I might be
> > wrong, most probably I am:)
>
> It's not _that_ crazy an idea, but it's going to be some work to teach KVM
> that it has to kmap/kunmap around all memory accesses.
>
> I think it's great that memfd hooks are usable by more than one subsystem,
> OTOH it's fair that whoever needs it does the work---and VFIO does not need
> it for confidential VMs, yet, so it should be fine for now to have a single
> user.
>
> On the other hand, as I commented already, the lack of locking in the
> register/unregister functions has to be fixed even with a single user.
> Another thing we can do already is change the guest_ops/guest_mem_ops to
> something like memfd_falloc_notifier_ops/memfd_pfn_ops, and the
> register/unregister functions to memfd_register/unregister_falloc_notifier.
I'm satisfied with this naming ;)
>
> Chao, can you also put this under a new CONFIG such as "bool MEMFD_OPS", and
> select it from KVM?
Yes, reasonable.
>
> Thanks,
>
> Paolo
On Tue, Nov 23, 2021 at 10:09:28AM +0100, Paolo Bonzini wrote:
> On 11/23/21 02:06, Chao Peng wrote:
> > > Maybe the kvm has to be tagged with a sequential id that is incremented on
> > > every allocation. This id can be checked here.
> > Sounds like a sequential id will be needed; no existing fields in struct
> > kvm can work for this.
>
> There's no need for new concepts when there's a perfectly usable reference
> count. :)
Indeed, thanks.
>
> Paolo
On 23.11.21 10:06, Paolo Bonzini wrote:
> On 11/19/21 16:39, David Hildenbrand wrote:
>>> If QEMU can put all the guest memory in a memfd and not map it, then
>>> I'd also like to see that the IOMMU can use this interface too so we
>>> can have VFIO working in this configuration.
>>
>> In QEMU we usually want to (and must) be able to access guest memory
>> from user space, with the current design we wouldn't even be able to
>> temporarily mmap it -- which makes sense for encrypted memory only. The
>> corner case really is encrypted memory. So I don't think we'll see a
>> broad use of this feature outside of encrypted VMs in QEMU. I might be
>> wrong, most probably I am:)
>
> It's not _that_ crazy an idea, but it's going to be some work to teach
> KVM that it has to kmap/kunmap around all memory accesses.
I'm also concerned about userspace access. But you sound like you have a
plan :)
--
Thanks,
David / dhildenb
On Tue, Nov 23, 2021 at 10:06:02AM +0100, Paolo Bonzini wrote:
> I think it's great that memfd hooks are usable by more than one subsystem,
> OTOH it's fair that whoever needs it does the work---and VFIO does not need
> it for confidential VMs, yet, so it should be fine for now to have a single
> user.
I think adding a new interface to a core kernel subsystem should come
with a greater requirement to work out something generally useful and
not be overly wedded to a single use case (e.g. F_SEAL_GUEST).
Especially if something like 'single user' is not just a small
implementation artifact but a key design tenet of the whole eventual
solution.
Jason
On 19/11/2021 13:47, Chao Peng wrote:
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> virt/kvm/kvm_main.c | 23 +++++++++++++++++++----
> 1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 271cef8d1cd0..b8673490d301 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1426,7 +1426,7 @@ static void update_memslots(struct kvm_memslots *slots,
> static int check_memory_region_flags(struct kvm *kvm,
> const struct kvm_userspace_memory_region_ext *mem)
> {
> - u32 valid_flags = 0;
> + u32 valid_flags = KVM_MEM_FD;
>
> if (!kvm->dirty_log_unsupported)
> valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;
> @@ -1604,10 +1604,20 @@ static int kvm_set_memslot(struct kvm *kvm,
> kvm_copy_memslots(slots, __kvm_memslots(kvm, as_id));
> }
>
> + if (mem->flags & KVM_MEM_FD && change == KVM_MR_CREATE) {
> + r = kvm_memfd_register(kvm, mem, new);
> + if (r)
> + goto out_slots;
> + }
> +
> r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
> if (r)
> goto out_slots;
>
> + if (mem->flags & KVM_MEM_FD && (r || change == KVM_MR_DELETE)) {
^
r will never be non-zero as the 'if' above will catch that case and jump
to out_slots.
I *think* the intention was that the "if (r)" code should be after this
check to clean up in the case of error from
kvm_arch_prepare_memory_region() (as well as an explicit MR_DELETE).
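Something along these lines is presumably what was intended (untested
sketch):

        if (mem->flags & KVM_MEM_FD && change == KVM_MR_CREATE) {
                r = kvm_memfd_register(kvm, mem, new);
                if (r)
                        goto out_slots;
        }

        r = kvm_arch_prepare_memory_region(kvm, new, mem, change);

        /* Undo the registration on prepare failure or on deletion. */
        if (mem->flags & KVM_MEM_FD && (r || change == KVM_MR_DELETE))
                kvm_memfd_unregister(kvm, new);

        if (r)
                goto out_slots;

        update_memslots(slots, new, change);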
Steve
> + kvm_memfd_unregister(kvm, new);
> + }
> +
> update_memslots(slots, new, change);
> slots = install_new_memslots(kvm, as_id, slots);
>
> @@ -1683,10 +1693,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
> return -EINVAL;
> if (mem->guest_phys_addr & (PAGE_SIZE - 1))
> return -EINVAL;
> - /* We can read the guest memory with __xxx_user() later on. */
> if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
> - (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
> - !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> + (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
> + return -EINVAL;
> + /* We can read the guest memory with __xxx_user() later on. */
> + if (!(mem->flags & KVM_MEM_FD) &&
> + !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> mem->memory_size))
> return -EINVAL;
> if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> @@ -1727,6 +1739,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> new.dirty_bitmap = NULL;
> memset(&new.arch, 0, sizeof(new.arch));
> } else { /* Modify an existing slot. */
> + /* Private memslots are immutable, they can only be deleted. */
> + if (mem->flags & KVM_MEM_FD && mem->private_fd >= 0)
> + return -EINVAL;
> if ((new.userspace_addr != old.userspace_addr) ||
> (new.npages != old.npages) ||
> ((new.flags ^ old.flags) & KVM_MEM_READONLY))
>
On 11/19/21 05:47, Chao Peng wrote:
> This RFC series try to implement the fd-based KVM guest private memory
> proposal described at [1] and an improved 'New Proposal' described at [2].
I generally like this. Thanks!
On 11/19/21 05:47, Chao Peng wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> The new seal type provides semantics required for KVM guest private
> memory support. A file descriptor with the seal set is going to be used
> as source of guest memory in confidential computing environments such as
> Intel TDX and AMD SEV.
>
> F_SEAL_GUEST can only be set on empty memfd. After the seal is set
> userspace cannot read, write or mmap the memfd.
I don't have a strong objection here, but, given that you're only
supporting it for memfd, would a memfd_create() flag be more
straightforward? If nothing else, it would avoid any possible locking
issue.
I'm also very very slightly nervous about a situation in which one
program sends a memfd to an untrusted other process and that process
truncates the memfd and then F_SEAL_GUESTs it. This could be mostly
mitigated by also requiring that no other seals be set when F_SEAL_GUEST
happens, but the alternative MFD_GUEST would eliminate this issue too.
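For comparison, the two userspace flows being discussed would look roughly
like the sketch below (F_SEAL_GUEST is the seal added by this series and
would have to come from updated uapi headers; MFD_GUEST is only the
hypothetical alternative and does not exist today):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* As in this series: create a memfd, then seal it while still empty. */
static int guest_memfd_via_seal(void)
{
        int fd = memfd_create("kvm-private", MFD_CLOEXEC | MFD_ALLOW_SEALING);

        if (fd < 0)
                return -1;
        /* The memfd must still be empty and unmapped at this point. */
        if (fcntl(fd, F_ADD_SEALS, F_SEAL_GUEST) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}

/* Hypothetical alternative: request guest-only semantics at creation. */
static int guest_memfd_via_flag(void)
{
        return memfd_create("kvm-private", MFD_CLOEXEC | MFD_GUEST);
}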