This is the third version of this series, which implements fd-based KVM
guest private memory. Earlier this week I sent another v3 version at:
https://lore.kernel.org/linux-mm/[email protected]/T/
That version is based on the latest TDX codebase. The one you are
reading is the same code rebased onto the latest kvm/queue branch at
commit:
c34c87a69727 KVM: x86: Update vPMCs when retiring branch instructions
Some changes were made to fit the kvm/queue branch, but logically the
two versions are the same code.
Testing also differs. The previous version was tested with the new
private memory feature on TDX; this rebased version could not be, for
lack of a TDX environment, so I only ran a simple regression test on it.
Introduction
------------
In general, this patch series introduces fd-based memslots, which
provide guest memory through a memfd file descriptor, i.e.
fd[offset, size], instead of hva/size. The fd can be created from a
supported memory filesystem such as tmpfs/hugetlbfs, which we refer to
as the memory backend. KVM and the memory backend exchange callbacks
when such a memslot gets created. At runtime KVM calls into the
callbacks provided by the backend to get the pfn for a given fd+offset.
The memory backend in turn calls into KVM callbacks when userspace
fallocate()s or punches a hole in the fd, to notify KVM to map/unmap
the secondary MMU page tables.
Compared to the existing hva-based memslots, this new type of memslot
allows guest memory to be unmapped from host userspace (e.g. QEMU) and
even from the kernel itself, which reduces the attack surface and
prevents a class of userspace bugs.
Based on this fd-based memslot, we can build guest private memory for
use in confidential computing environments such as Intel TDX and AMD
SEV. When supported, the memory backend can provide stronger
enforcement on the fd, and KVM can use a single memslot to hold both
the private and shared parts of the guest memory.
Memfd/shmem extension
---------------------
Introduce a new MFD_INACCESSIBLE flag for memfd_create(); a file
created with this flag cannot be accessed via read(), write(), mmap(),
etc.
In addition, two sets of callbacks are introduced as the new MEMFD_OPS:
- memfd_falloc_notifier: memfd -> KVM notifications when memory gets
  allocated/invalidated through fallocate().
- memfd_pfn_ops: KVM -> memfd calls to get a pfn for a given fd+offset.
Memslot extension
-----------------
Add the private fd and the offset into that fd to the existing 'shared'
memslot so that both private and shared guest memory can live in one
single memslot. A page in the memslot is either private or shared. A
page is private only when it is already allocated in the backend fd; in
all other cases it is treated as shared, which includes pages already
mapped as shared as well as pages that have not been mapped at all. In
other words, the memory backend is the source of truth for which pages
are private.
Private memory map/unmap and conversion
---------------------------------------
Userspace's map/unmap operations are done via the fallocate() syscall
on the backend fd (see the sketch below).
- map: plain fallocate() with mode=0.
- unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
The map/unmap triggers the memfd_falloc_notifier callbacks above so
that KVM can map/unmap the secondary MMU page tables.
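For illustration only, a userspace sketch of the two operations; the
helper names are hypothetical, and FALLOC_FL_KEEP_SIZE is assumed to be
required together with FALLOC_FL_PUNCH_HOLE, as fallocate() generally
demands:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* "map": back [offset, offset + len) of the private fd with memory. */
static int private_mem_map(int fd, off_t offset, off_t len)
{
	return fallocate(fd, 0, offset, len);
}

/* "unmap": free [offset, offset + len) of the private fd. */
static int private_mem_unmap(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}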
Test
----
NOTE: the test below is for the previous TDX-based version. For this
version I only tested regular VM booting.
This code has been tested with the latest TDX code patches hosted at
(https://github.com/intel/tdx/tree/kvm-upstream) with minimal TDX
adaptation and QEMU support.
Example QEMU command line:
-object tdx-guest,id=tdx \
-object memory-backend-memfd-private,id=ram1,size=2G \
-machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
Changelog
----------
v3:
- Added locking protection when calling
invalidate_page_range/fallocate callbacks.
- Changed the memslot structure to keep using useraddr for shared memory.
- Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
- Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
- Commit message improvements.
- Many small fixes for comments from the last version.
Links of previous discussions
-----------------------------
[1] Original design proposal:
https://lkml.kernel.org/kvm/[email protected]/
[2] Updated proposal and RFC patch v1:
https://lkml.kernel.org/linux-fsdevel/[email protected]/
[3] RFC patch v2:
https://x-lore.kernel.org/qemu-devel/[email protected]/
Chao Peng (14):
mm/memfd: Introduce MFD_INACCESSIBLE flag
KVM: Extend the memslot to support fd-based private memory
KVM: Maintain ofs_tree for fast memslot lookup by file offset
KVM: Implement fd-based memory using MEMFD_OPS interfaces
KVM: Refactor hva based memory invalidation code
KVM: Special handling for fd-based memory invalidation
KVM: Split out common memory invalidation code
KVM: Implement fd-based memory invalidation
KVM: Add kvm_map_gfn_range
KVM: Implement fd-based memory fallocation
KVM: Add KVM_EXIT_MEMORY_ERROR exit
KVM: Handle page fault for private memory
KVM: Use kvm_userspace_memory_region_ext
KVM: Register/unregister private memory slot to memfd
Kirill A. Shutemov (2):
mm/shmem: Introduce F_SEAL_INACCESSIBLE
mm/memfd: Introduce MEMFD_OPS
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu/mmu.c | 120 ++++++++++++++-
arch/x86/kvm/mmu/paging_tmpl.h | 11 +-
arch/x86/kvm/x86.c | 2 +-
include/linux/kvm_host.h | 43 +++++-
include/linux/memfd.h | 22 +++
include/linux/shmem_fs.h | 16 ++
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/kvm.h | 27 ++++
include/uapi/linux/memfd.h | 1 +
mm/Kconfig | 4 +
mm/memfd.c | 33 ++++-
mm/shmem.c | 195 +++++++++++++++++++++++-
virt/kvm/Makefile.kvm | 2 +-
virt/kvm/kvm_main.c | 262 +++++++++++++++++++++++++--------
virt/kvm/memfd.c | 95 ++++++++++++
16 files changed, 753 insertions(+), 82 deletions(-)
create mode 100644 virt/kvm/memfd.c
--
2.17.1
From: "Kirill A. Shutemov" <[email protected]>
Introduce a new seal, F_SEAL_INACCESSIBLE, indicating that the content
of the file is inaccessible from userspace in any possible way, e.g.
via read(), write() or mmap().
It provides the semantics required for KVM guest private memory
support: a file descriptor with this seal set is going to be used as
the source of guest memory in confidential computing environments such
as Intel TDX and AMD SEV, but may not be accessible from host
userspace.
At this time only shmem implements this seal.
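For illustration only (not part of this patch): assuming the new seal
is accepted by the usual F_ADD_SEALS fcntl, and noting that the shmem
comments below only allow setting it while i_size == 0, userspace could
apply it roughly like this; all names besides the uapi ones are
hypothetical:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SEAL_INACCESSIBLE
#define F_SEAL_INACCESSIBLE 0x0020
#endif

/* Create a sealable memfd and seal off all userspace access to it. */
int create_inaccessible_memfd(void)
{
	int fd = memfd_create("guest-mem", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;

	/* Must be done before any content exists (i_size == 0). */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_INACCESSIBLE) < 0) {
		close(fd);
		return -1;
	}

	/* From here on, read()/write()/mmap() on fd fail with -EPERM. */
	return fd;
}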
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/uapi/linux/fcntl.h | 1 +
mm/shmem.c | 37 +++++++++++++++++++++++++++++++++++--
2 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..e2bad051936f 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
#define F_SEAL_GROW 0x0004 /* prevent file from growing */
#define F_SEAL_WRITE 0x0008 /* prevent writes */
#define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
+#define F_SEAL_INACCESSIBLE 0x0020 /* prevent file from accessing */
/* (1U << 31) is reserved for signed error codes */
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 18f93c2d68f1..faa7e9b1b9bc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1098,6 +1098,10 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
(newsize > oldsize && (info->seals & F_SEAL_GROW)))
return -EPERM;
+ if ((info->seals & F_SEAL_INACCESSIBLE) &&
+ (newsize & ~PAGE_MASK))
+ return -EINVAL;
+
if (newsize != oldsize) {
error = shmem_reacct_size(SHMEM_I(inode)->flags,
oldsize, newsize);
@@ -1364,6 +1368,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
goto redirty;
if (!total_swap_pages)
goto redirty;
+ if (info->seals & F_SEAL_INACCESSIBLE)
+ goto redirty;
/*
* Our capabilities prevent regular writeback or sync from ever calling
@@ -2262,6 +2268,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
if (ret)
return ret;
+ if (info->seals & F_SEAL_INACCESSIBLE)
+ return -EPERM;
+
/* arm64 - allow memory tagging on RAM-based files */
vma->vm_flags |= VM_MTE_ALLOWED;
@@ -2459,12 +2468,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
pgoff_t index = pos >> PAGE_SHIFT;
/* i_rwsem is held by caller */
- if (unlikely(info->seals & (F_SEAL_GROW |
- F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+ if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE |
+ F_SEAL_FUTURE_WRITE |
+ F_SEAL_INACCESSIBLE))) {
if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
return -EPERM;
if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
return -EPERM;
+ if (info->seals & F_SEAL_INACCESSIBLE)
+ return -EPERM;
}
return shmem_getpage(inode, index, pagep, SGP_WRITE);
@@ -2538,6 +2550,21 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
end_index = i_size >> PAGE_SHIFT;
if (index > end_index)
break;
+
+ /*
+ * inode_lock protects setting up seals as well as write to
+ * i_size. Setting F_SEAL_INACCESSIBLE only allowed with
+ * i_size == 0.
+ *
+ * Check F_SEAL_INACCESSIBLE after i_size. It effectively
+ * serialize read vs. setting F_SEAL_INACCESSIBLE without
+ * taking inode_lock in read path.
+ */
+ if (SHMEM_I(inode)->seals & F_SEAL_INACCESSIBLE) {
+ error = -EPERM;
+ break;
+ }
+
if (index == end_index) {
nr = i_size & ~PAGE_MASK;
if (nr <= offset)
@@ -2663,6 +2690,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
goto out;
}
+ if ((info->seals & F_SEAL_INACCESSIBLE) &&
+ (offset & ~PAGE_MASK || len & ~PAGE_MASK)) {
+ error = -EINVAL;
+ goto out;
+ }
+
shmem_falloc.waitq = &shmem_falloc_waitq;
shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
--
2.17.1
Introduce a new memfd_create() flag indicating that the content of the
created memfd is inaccessible from userspace. It does this by force
setting the F_SEAL_INACCESSIBLE seal when the file is created. It also
sets F_SEAL_SEAL to prevent future sealing, which means it cannot
coexist with MFD_ALLOW_SEALING.
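For illustration only, a userspace sketch of creating and sizing such a
backend fd; the helper name is hypothetical, and resizing via
ftruncate() is assumed to remain allowed for page-aligned sizes as the
shmem_setattr() change implies:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE 0x0008U
#endif

/* Create a guest private memory backend of @size bytes. */
int create_private_backend(size_t size)
{
	/* The fd cannot be read(), written or mmap()ed by userspace. */
	int fd = memfd_create("kvm-private-mem",
			      MFD_CLOEXEC | MFD_INACCESSIBLE);

	if (fd < 0)
		return -1;

	/* Size must stay page aligned with F_SEAL_INACCESSIBLE set. */
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}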
Signed-off-by: Chao Peng <[email protected]>
---
include/uapi/linux/memfd.h | 1 +
mm/memfd.c | 12 +++++++++++-
2 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
#define MFD_CLOEXEC 0x0001U
#define MFD_ALLOW_SEALING 0x0002U
#define MFD_HUGETLB 0x0004U
+#define MFD_INACCESSIBLE 0x0008U
/*
* Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index 9f80f162791a..c898a007fb76 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -245,7 +245,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+ MFD_INACCESSIBLE)
SYSCALL_DEFINE2(memfd_create,
const char __user *, uname,
@@ -267,6 +268,10 @@ SYSCALL_DEFINE2(memfd_create,
return -EINVAL;
}
+ /* Disallow sealing when MFD_INACCESSIBLE is set. */
+ if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
+ return -EINVAL;
+
/* length includes terminating zero */
len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
if (len <= 0)
@@ -315,6 +320,11 @@ SYSCALL_DEFINE2(memfd_create,
*file_seals &= ~F_SEAL_SEAL;
}
+ if (flags & MFD_INACCESSIBLE) {
+ file_seals = memfd_file_seals_ptr(file);
+ *file_seals |= F_SEAL_SEAL | F_SEAL_INACCESSIBLE;
+ }
+
fd_install(fd, file);
kfree(name);
return fd;
--
2.17.1
From: "Kirill A. Shutemov" <[email protected]>
The patch introduces a new MEMFD_OPS facility around files created by
memfd_create() to allow a third kernel component to make use of memory
bookmarked in a memfd and get notified when the memory in the file is
allocated/invalidated. It will be used by KVM, which can then use a
memfd file descriptor as the guest memory backend and use MEMFD_OPS to
interact with the memfd subsystem. In the future there might be other
consumers (e.g. VFIO with encrypted device memory).
It consists of two sets of callbacks:
- memfd_falloc_notifier: callbacks provided by KVM and called by memfd
  when memory gets allocated/invalidated through the fallocate()
  syscall.
- memfd_pfn_ops: callbacks provided by memfd and called by KVM to
  request memory pages from the memfd.
Locking is needed for the above callbacks to prevent race conditions:
- get_owner/put_owner are used to ensure the owner is still alive in
  the invalidate_page_range/fallocate callback handlers, using a
  reference mechanism.
- the page is locked between get_lock_pfn/put_unlock_pfn to ensure the
  pfn is still valid when it is used (e.g. when the KVM page fault
  handler uses it to establish the mapping in the secondary MMU page
  tables).
Userspace is in charge of the guest memory lifecycle: it can allocate
memory with fallocate() or punch a hole to free memory from the guest.
The file descriptor is passed down to KVM as the guest memory backend.
KVM registers itself as the owner of the memfd via
memfd_register_falloc_notifier() and provides the memfd_falloc_notifier
callbacks that need to be called on fallocate() and hole punching.
memfd_register_falloc_notifier() returns the memfd_pfn_ops callbacks
that KVM then uses to request pages from the memfd (a registration
sketch follows below).
At this time only shmem is supported.
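For illustration only (not part of this patch), a minimal in-kernel
consumer of the interface might look roughly like the sketch below; the
owner structure and callback bodies are hypothetical, only the
MEMFD_OPS types and registration call come from this patch:

#include <linux/memfd.h>
#include <linux/fs.h>

/* Hypothetical consumer state; a real user would refcount this. */
struct my_owner {
	const struct memfd_pfn_ops *pfn_ops;
};

static void my_invalidate(struct inode *inode, void *owner,
			  pgoff_t start, pgoff_t end)
{
	/* Tear down any mappings built for [start, end) of this inode. */
}

static void my_fallocate(struct inode *inode, void *owner,
			 pgoff_t start, pgoff_t end)
{
	/* Optionally pre-build mappings for the newly allocated range. */
}

static bool my_get_owner(void *owner) { return true; }
static void my_put_owner(void *owner) { }

static const struct memfd_falloc_notifier my_notifier = {
	.invalidate_page_range	= my_invalidate,
	.fallocate		= my_fallocate,
	.get_owner		= my_get_owner,
	.put_owner		= my_put_owner,
};

/* Pair @owner with the memfd behind @file; saves the pfn_ops for later. */
static int my_attach(struct my_owner *owner, struct file *file)
{
	return memfd_register_falloc_notifier(file_inode(file), owner,
					      &my_notifier, &owner->pfn_ops);
}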
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/memfd.h | 22 ++++++
include/linux/shmem_fs.h | 16 ++++
mm/Kconfig | 4 +
mm/memfd.c | 21 ++++++
mm/shmem.c | 158 +++++++++++++++++++++++++++++++++++++++
5 files changed, 221 insertions(+)
diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..0007073b53dc 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -13,4 +13,26 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
}
#endif
+#ifdef CONFIG_MEMFD_OPS
+struct memfd_falloc_notifier {
+ void (*invalidate_page_range)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+ void (*fallocate)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+ bool (*get_owner)(void *owner);
+ void (*put_owner)(void *owner);
+};
+
+struct memfd_pfn_ops {
+ long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
+ void (*put_unlock_pfn)(unsigned long pfn);
+
+};
+
+extern int memfd_register_falloc_notifier(struct inode *inode, void *owner,
+ const struct memfd_falloc_notifier *notifier,
+ const struct memfd_pfn_ops **pfn_ops);
+extern void memfd_unregister_falloc_notifier(struct inode *inode);
+#endif
+
#endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 166158b6e917..503adc63728c 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,11 @@
/* inode in-kernel data */
+#ifdef CONFIG_MEMFD_OPS
+struct memfd_falloc_notifier;
+struct memfd_pfn_ops;
+#endif
+
struct shmem_inode_info {
spinlock_t lock;
unsigned int seals; /* shmem seals */
@@ -24,6 +29,10 @@ struct shmem_inode_info {
struct shared_policy policy; /* NUMA memory alloc policy */
struct simple_xattrs xattrs; /* list of xattrs */
atomic_t stop_eviction; /* hold when working on inode */
+#ifdef CONFIG_MEMFD_OPS
+ void *owner;
+ const struct memfd_falloc_notifier *falloc_notifier;
+#endif
struct inode vfs_inode;
};
@@ -96,6 +105,13 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
pgoff_t start, pgoff_t end);
+#ifdef CONFIG_MEMFD_OPS
+extern int shmem_register_falloc_notifier(struct inode *inode, void *owner,
+ const struct memfd_falloc_notifier *notifier,
+ const struct memfd_pfn_ops **pfn_ops);
+extern void shmem_unregister_falloc_notifier(struct inode *inode);
+#endif
+
/* Flag allocation requirements to shmem_getpage */
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
diff --git a/mm/Kconfig b/mm/Kconfig
index 28edafc820ad..9989904d1b56 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -900,6 +900,10 @@ config IO_MAPPING
config SECRETMEM
def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+config MEMFD_OPS
+ bool
+ depends on MEMFD_CREATE
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/memfd.c b/mm/memfd.c
index c898a007fb76..41861870fc21 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -130,6 +130,27 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
return NULL;
}
+#ifdef CONFIG_MEMFD_OPS
+int memfd_register_falloc_notifier(struct inode *inode, void *owner,
+ const struct memfd_falloc_notifier *notifier,
+ const struct memfd_pfn_ops **pfn_ops)
+{
+ if (shmem_mapping(inode->i_mapping))
+ return shmem_register_falloc_notifier(inode, owner,
+ notifier, pfn_ops);
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(memfd_register_falloc_notifier);
+
+void memfd_unregister_falloc_notifier(struct inode *inode)
+{
+ if (shmem_mapping(inode->i_mapping))
+ shmem_unregister_falloc_notifier(inode);
+}
+EXPORT_SYMBOL_GPL(memfd_unregister_falloc_notifier);
+#endif
+
#define F_ALL_SEALS (F_SEAL_SEAL | \
F_SEAL_SHRINK | \
F_SEAL_GROW | \
diff --git a/mm/shmem.c b/mm/shmem.c
index faa7e9b1b9bc..4d8a75c4d037 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -78,6 +78,7 @@ static struct vfsmount *shm_mnt;
#include <linux/userfaultfd_k.h>
#include <linux/rmap.h>
#include <linux/uuid.h>
+#include <linux/memfd.h>
#include <linux/uaccess.h>
@@ -906,6 +907,68 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
return split_huge_page(page) >= 0;
}
+static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFD_OPS
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ const struct memfd_falloc_notifier *notifier;
+ void *owner;
+ bool ret;
+
+ if (!info->falloc_notifier)
+ return;
+
+ spin_lock(&info->lock);
+ notifier = info->falloc_notifier;
+ if (!notifier) {
+ spin_unlock(&info->lock);
+ return;
+ }
+
+ owner = info->owner;
+ ret = notifier->get_owner(owner);
+ spin_unlock(&info->lock);
+ if (!ret)
+ return;
+
+ notifier->fallocate(inode, owner, start, end);
+ notifier->put_owner(owner);
+#endif
+}
+
+static void notify_invalidate_page(struct inode *inode, struct page *page,
+ pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFD_OPS
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ const struct memfd_falloc_notifier *notifier;
+ void *owner;
+ bool ret;
+
+ if (!info->falloc_notifier)
+ return;
+
+ spin_lock(&info->lock);
+ notifier = info->falloc_notifier;
+ if (!notifier) {
+ spin_unlock(&info->lock);
+ return;
+ }
+
+ owner = info->owner;
+ ret = notifier->get_owner(owner);
+ spin_unlock(&info->lock);
+ if (!ret)
+ return;
+
+ start = max(start, page->index);
+ end = min(end, page->index + thp_nr_pages(page));
+
+ notifier->invalidate_page_range(inode, owner, start, end);
+ notifier->put_owner(owner);
+#endif
+}
+
/*
* Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -949,6 +1012,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
index += thp_nr_pages(page) - 1;
+ notify_invalidate_page(inode, page, start, end);
+
if (!unfalloc || !PageUptodate(page))
truncate_inode_page(mapping, page);
unlock_page(page);
@@ -1025,6 +1090,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index--;
break;
}
+
+ notify_invalidate_page(inode, page, start, end);
+
VM_BUG_ON_PAGE(PageWriteback(page), page);
if (shmem_punch_compound(page, start, end))
truncate_inode_page(mapping, page);
@@ -2815,6 +2883,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
i_size_write(inode, offset + len);
inode->i_ctime = current_time(inode);
+ notify_fallocate(inode, start, end);
undone:
spin_lock(&inode->i_lock);
inode->i_private = NULL;
@@ -3784,6 +3853,20 @@ static void shmem_destroy_inodecache(void)
kmem_cache_destroy(shmem_inode_cachep);
}
+#ifdef CONFIG_MIGRATION
+int shmem_migrate_page(struct address_space *mapping, struct page *newpage,
+ struct page *page, enum migrate_mode mode)
+{
+#ifdef CONFIG_MEMFD_OPS
+ struct inode *inode = mapping->host;
+
+ if (SHMEM_I(inode)->owner)
+ return -EOPNOTSUPP;
+#endif
+ return migrate_page(mapping, newpage, page, mode);
+}
+#endif
+
const struct address_space_operations shmem_aops = {
.writepage = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
@@ -3798,6 +3881,81 @@ const struct address_space_operations shmem_aops = {
};
EXPORT_SYMBOL(shmem_aops);
+#ifdef CONFIG_MEMFD_OPS
+static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order)
+{
+ struct page *page;
+ int ret;
+
+ ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC);
+ if (ret)
+ return ret;
+
+ *order = thp_order(compound_head(page));
+
+ return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+ set_page_dirty(page);
+ unlock_page(page);
+ put_page(page);
+}
+
+static const struct memfd_pfn_ops shmem_pfn_ops = {
+ .get_lock_pfn = shmem_get_lock_pfn,
+ .put_unlock_pfn = shmem_put_unlock_pfn,
+};
+
+int shmem_register_falloc_notifier(struct inode *inode, void *owner,
+ const struct memfd_falloc_notifier *notifier,
+ const struct memfd_pfn_ops **pfn_ops)
+{
+ gfp_t gfp;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (!inode || !owner || !notifier || !pfn_ops ||
+ !notifier->invalidate_page_range ||
+ !notifier->fallocate ||
+ !notifier->get_owner ||
+ !notifier->put_owner)
+ return -EINVAL;
+
+ spin_lock(&info->lock);
+ if (info->owner && info->owner != owner) {
+ spin_unlock(&info->lock);
+ return -EPERM;
+ }
+
+ info->owner = owner;
+ info->falloc_notifier = notifier;
+ spin_unlock(&info->lock);
+
+ gfp = mapping_gfp_mask(inode->i_mapping);
+ gfp &= ~__GFP_MOVABLE;
+ mapping_set_gfp_mask(inode->i_mapping, gfp);
+ mapping_set_unevictable(inode->i_mapping);
+
+ *pfn_ops = &shmem_pfn_ops;
+ return 0;
+}
+
+void shmem_unregister_falloc_notifier(struct inode *inode)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ spin_lock(&info->lock);
+ info->owner = NULL;
+ info->falloc_notifier = NULL;
+ spin_unlock(&info->lock);
+}
+#endif
+
static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
.get_unmapped_area = shmem_get_unmapped_area,
--
2.17.1
Extend the memslot definition to provide fd-based private memory
support by adding two new fields (fd/ofs). A memslot can then maintain
memory for both shared and private pages. Shared pages are provided in
the existing way, using the userspace_addr (hva) field and
get_user_pages(), while private pages are provided through the new
fields (fd/ofs). Since there is no 'hva' concept anymore for private
memory, we cannot call get_user_pages() to get a pfn; instead we rely
on the newly introduced MEMFD_OPS callbacks to do the same job.
This new extension is indicated by a new flag, KVM_MEM_PRIVATE (a
userspace sketch of setting up such a memslot follows below).
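For illustration only: assuming the existing KVM_SET_USER_MEMORY_REGION
ioctl accepts the extended structure when KVM_MEM_PRIVATE is set (as is
done later in this series), userspace might register a private memslot
roughly like this; all names other than the uapi ones are hypothetical:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Register one memslot whose private part is backed by memfd_fd at
 * memfd_ofs and whose shared part is backed by a normal hva.
 */
static int set_private_memslot(int vm_fd, __u64 gpa, __u64 size,
			       void *shared_hva, int memfd_fd,
			       __u64 memfd_ofs)
{
	struct kvm_userspace_memory_region_ext region;

	memset(&region, 0, sizeof(region));
	region.slot = 0;
	region.flags = KVM_MEM_PRIVATE;
	region.guest_phys_addr = gpa;
	region.memory_size = size;
	region.userspace_addr = (__u64)(unsigned long)shared_hva;
	region.fd = memfd_fd;
	region.ofs = memfd_ofs;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}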
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 10 ++++++++++
include/uapi/linux/kvm.h | 12 ++++++++++++
2 files changed, 22 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f8ed799e8674..2cd35560c44b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -460,8 +460,18 @@ struct kvm_memory_slot {
u32 flags;
short id;
u16 as_id;
+ u32 fd;
+ struct file *file;
+ u64 ofs;
};
+static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+{
+ if (slot && (slot->flags & KVM_MEM_PRIVATE))
+ return true;
+ return false;
+}
+
static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
{
return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1daa45268de2..41434322fa23 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,17 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
};
+struct kvm_userspace_memory_region_ext {
+ __u32 slot;
+ __u32 flags;
+ __u64 guest_phys_addr;
+ __u64 memory_size; /* bytes */
+ __u64 userspace_addr; /* hva */
+ __u64 ofs; /* offset into fd */
+ __u32 fd;
+ __u32 padding[5];
+};
+
/*
* The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
* other bits are reserved for kvm internal use which are defined in
@@ -110,6 +121,7 @@ struct kvm_userspace_memory_region {
*/
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+#define KVM_MEM_PRIVATE (1UL << 2)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
--
2.17.1
This patch adds the new memfd facility in KVM, using MEMFD_OPS to
provide guest memory from a file descriptor created in userspace with
memfd_create() instead of the traditional userspace hva. It mainly
provides two kinds of functions:
- Pair/unpair an fd-based memslot with the memory backend that owns the
  file descriptor when such a memslot gets created/deleted.
- Get/put a pfn, to be used in the KVM page fault handler, from/to the
  paired memory backend.
At pairing time, KVM and the memfd subsystem exchange callbacks that
each can call into the other side. These callbacks are the major places
to implement fd-based guest memory provisioning.
KVM->memfd:
- get_pfn: get and lock a page at the specified offset in the fd.
- put_pfn: put and unlock the pfn.
  Note: the page needs to be locked between get_pfn/put_pfn to ensure
  the pfn is valid when KVM uses it to establish the mapping in the
  secondary MMU page table (see the fault-path sketch below).
memfd->KVM:
- invalidate_page_range: called when userspace punches a hole in the
  fd; KVM should unmap the related pages in the secondary MMU.
- fallocate: called when userspace fallocate()s space in the fd; KVM
  can map the related pages in the secondary MMU.
- get/put_owner: used to ensure the guest is still alive, via a
  reference mechanism, when calling the above invalidate/fallocate
  callbacks.
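For illustration only, the intended fault-path usage of the
get_pfn/put_pfn pair, based on the helpers declared in this patch; the
surrounding fault-handler context is hypothetical and the real
integration happens in the later "Handle page fault for private memory"
patch:

#include <linux/kvm_host.h>

/*
 * Sketch of resolving and mapping a private gfn during a page fault,
 * assuming @slot is an fd-based (KVM_MEM_PRIVATE) memslot.
 */
static int map_private_gfn(struct kvm_memory_slot *slot, gfn_t gfn)
{
	int order;
	long pfn;

	/* The backing page comes back locked, so the pfn stays valid. */
	pfn = kvm_memfd_get_pfn(slot, gfn, &order);
	if (pfn < 0)
		return (int)pfn;

	/*
	 * ... install pfn (covering up to 1 << order pages) into the
	 * secondary MMU page tables for gfn here ...
	 */

	/* Unlock and release the page once the mapping is established. */
	kvm_memfd_put_pfn(pfn);
	return 0;
}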
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/Kconfig | 1 +
include/linux/kvm_host.h | 6 +++
virt/kvm/Makefile.kvm | 2 +-
virt/kvm/memfd.c | 91 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 99 insertions(+), 1 deletion(-)
create mode 100644 virt/kvm/memfd.c
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 03b2ce34e7f4..86655cd660ca 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -46,6 +46,7 @@ config KVM
select SRCU
select INTERVAL_TREE
select HAVE_KVM_PM_NOTIFIER if PM
+ select MEMFD_OPS
help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3bd875f9669f..21f8b1880723 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -806,6 +806,12 @@ static inline void kvm_irqfd_exit(void)
{
}
#endif
+
+int kvm_memfd_register(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_memfd_unregister(struct kvm_memory_slot *slot);
+long kvm_memfd_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, int *order);
+void kvm_memfd_put_pfn(kvm_pfn_t pfn);
+
int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
struct module *module);
void kvm_exit(void);
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index ffdcad3cc97a..8842128d8429 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -5,7 +5,7 @@
KVM ?= ../../../virt/kvm
-kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
+kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o $(KVM)/memfd.o
kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
new file mode 100644
index 000000000000..662393a76782
--- /dev/null
+++ b/virt/kvm/memfd.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * memfd.c: routines for fd based guest memory
+ * Copyright (c) 2021, Intel Corporation.
+ *
+ * Author:
+ * Chao Peng <[email protected]>
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/memfd.h>
+
+#ifdef CONFIG_MEMFD_OPS
+static const struct memfd_pfn_ops *memfd_ops;
+
+static void memfd_invalidate_page_range(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end)
+{
+}
+
+static void memfd_fallocate(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end)
+{
+}
+
+static bool memfd_get_owner(void *owner)
+{
+ return kvm_get_kvm_safe(owner);
+}
+
+static void memfd_put_owner(void *owner)
+{
+ kvm_put_kvm(owner);
+}
+
+static const struct memfd_falloc_notifier memfd_notifier = {
+ .invalidate_page_range = memfd_invalidate_page_range,
+ .fallocate = memfd_fallocate,
+ .get_owner = memfd_get_owner,
+ .put_owner = memfd_put_owner,
+};
+#endif
+
+long kvm_memfd_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, int *order)
+{
+#ifdef CONFIG_MEMFD_OPS
+ pgoff_t index = gfn - slot->base_gfn + (slot->ofs >> PAGE_SHIFT);
+
+ return memfd_ops->get_lock_pfn(slot->file->f_inode, index, order);
+#else
+ return -EOPNOTSUPP;
+#endif
+}
+
+void kvm_memfd_put_pfn(kvm_pfn_t pfn)
+{
+#ifdef CONFIG_MEMFD_OPS
+ memfd_ops->put_unlock_pfn(pfn);
+#endif
+}
+
+int kvm_memfd_register(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+#ifdef CONFIG_MEMFD_OPS
+ int ret;
+ struct fd fd = fdget(slot->fd);
+
+ if (!fd.file)
+ return -EINVAL;
+
+ ret = memfd_register_falloc_notifier(fd.file->f_inode, kvm,
+ &memfd_notifier, &memfd_ops);
+ if (ret)
+ return ret;
+
+ slot->file = fd.file;
+ return 0;
+#else
+ return -EOPNOTSUPP;
+#endif
+}
+
+void kvm_memfd_unregister(struct kvm_memory_slot *slot)
+{
+#ifdef CONFIG_MEMFD_OPS
+ if (slot->file) {
+ fput(slot->file);
+ slot->file = NULL;
+ }
+#endif
+}
--
2.17.1
For fd-based guest memory, the memory backend (i.e. the fd provider)
should notify KVM to unmap/invalidate the private memory from the KVM
secondary MMU when userspace punches a hole in the fd (e.g. when
userspace converts private memory to shared memory).
To support fd-based memory invalidation, the existing hva-based memory
invalidation needs to be extended. A new 'inode' for the fd is passed
in from the memfd_falloc_notifier, and 'start/end' then represent the
start/end offset in the fd instead of an hva range. During the
invalidation, KVM needs to check this inode against the one in the
memslot; only when the inode in the memslot equals the passed-in inode
should KVM invalidate the mapping.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
virt/kvm/kvm_main.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b7a1c4d7eaaa..19736a0013a0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -494,6 +494,7 @@ typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
struct kvm_useraddr_range {
unsigned long start;
unsigned long end;
+ struct inode *inode;
pte_t pte;
gfn_handler_t handler;
on_lock_fn_t on_lock;
@@ -544,14 +545,27 @@ static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
struct interval_tree_node *node;
slots = __kvm_memslots(kvm, i);
- useraddr_tree = &slots->hva_tree;
+ useraddr_tree = range->inode ? &slots->ofs_tree : &slots->hva_tree;
kvm_for_each_memslot_in_useraddr_range(node, useraddr_tree,
range->start, range->end - 1) {
unsigned long useraddr_start, useraddr_end;
+ unsigned long useraddr_base;
+
+ if (range->inode) {
+ slot = container_of(node, struct kvm_memory_slot,
+ ofs_node[slots->node_idx]);
+ if (!slot->file ||
+ slot->file->f_inode != range->inode)
+ continue;
+ useraddr_base = slot->ofs;
+ } else {
+ slot = container_of(node, struct kvm_memory_slot,
+ hva_node[slots->node_idx]);
+ useraddr_base = slot->userspace_addr;
+ }
- slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
- useraddr_start = max(range->start, slot->userspace_addr);
- useraddr_end = min(range->end, slot->userspace_addr +
+ useraddr_start = max(range->start, useraddr_base);
+ useraddr_end = min(range->end, useraddr_base +
(slot->npages << PAGE_SHIFT));
/*
@@ -568,10 +582,10 @@ static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
* {gfn_start, gfn_start+1, ..., gfn_end-1}.
*/
gfn_range.start = useraddr_to_gfn_memslot(useraddr_start,
- slot, true);
+ slot, !range->inode);
gfn_range.end = useraddr_to_gfn_memslot(
useraddr_end + PAGE_SIZE - 1,
- slot, true);
+ slot, !range->inode);
gfn_range.slot = slot;
if (!locked) {
@@ -613,6 +627,7 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = true,
.may_block = false,
+ .inode = NULL,
};
return __kvm_handle_useraddr_range(kvm, &range);
@@ -632,6 +647,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = false,
.may_block = false,
+ .inode = NULL,
};
return __kvm_handle_useraddr_range(kvm, &range);
@@ -700,6 +716,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
.on_lock = kvm_inc_notifier_count,
.flush_on_ret = true,
.may_block = mmu_notifier_range_blockable(range),
+ .inode = NULL,
};
trace_kvm_unmap_hva_range(range->start, range->end);
@@ -751,6 +768,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
.on_lock = kvm_dec_notifier_count,
.flush_on_ret = false,
.may_block = mmu_notifier_range_blockable(range),
+ .inode = NULL,
};
bool wake;
--
2.17.1
Similar to the hva_tree for hva ranges, maintain an interval tree
ofs_tree for the offset ranges of fd-based memslots so that lookup by
offset range can be faster when the memslot count is high.
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 17 +++++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2cd35560c44b..3bd875f9669f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -451,6 +451,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
struct kvm_memory_slot {
struct hlist_node id_node[2];
struct interval_tree_node hva_node[2];
+ struct interval_tree_node ofs_node[2];
struct rb_node gfn_node[2];
gfn_t base_gfn;
unsigned long npages;
@@ -560,6 +561,7 @@ struct kvm_memslots {
u64 generation;
atomic_long_t last_used_slot;
struct rb_root_cached hva_tree;
+ struct rb_root_cached ofs_tree;
struct rb_root gfn_tree;
/*
* The mapping table from slot id to memslot.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b0f7e6eb00ff..47e96d1eb233 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1087,6 +1087,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
atomic_long_set(&slots->last_used_slot, (unsigned long)NULL);
slots->hva_tree = RB_ROOT_CACHED;
+ slots->ofs_tree = RB_ROOT_CACHED;
slots->gfn_tree = RB_ROOT;
hash_init(slots->id_hash);
slots->node_idx = j;
@@ -1363,7 +1364,7 @@ static void kvm_replace_gfn_node(struct kvm_memslots *slots,
* With NULL @old this simply adds @new.
* With NULL @new this simply removes @old.
*
- * If @new is non-NULL its hva_node[slots_idx] range has to be set
+ * If @new is non-NULL its hva/ofs_node[slots_idx] range has to be set
* appropriately.
*/
static void kvm_replace_memslot(struct kvm *kvm,
@@ -1377,6 +1378,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
if (old) {
hash_del(&old->id_node[idx]);
interval_tree_remove(&old->hva_node[idx], &slots->hva_tree);
+ interval_tree_remove(&old->ofs_node[idx], &slots->ofs_tree);
if ((long)old == atomic_long_read(&slots->last_used_slot))
atomic_long_set(&slots->last_used_slot, (long)new);
@@ -1388,20 +1390,27 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
/*
- * Initialize @new's hva range. Do this even when replacing an @old
+ * Initialize @new's hva/ofs range. Do this even when replacing an @old
* slot, kvm_copy_memslot() deliberately does not touch node data.
*/
new->hva_node[idx].start = new->userspace_addr;
new->hva_node[idx].last = new->userspace_addr +
(new->npages << PAGE_SHIFT) - 1;
+ if (kvm_slot_is_private(new)) {
+ new->ofs_node[idx].start = new->ofs;
+ new->ofs_node[idx].last = new->ofs +
+ (new->npages << PAGE_SHIFT) - 1;
+ }
/*
* (Re)Add the new memslot. There is no O(1) interval_tree_replace(),
- * hva_node needs to be swapped with remove+insert even though hva can't
- * change when replacing an existing slot.
+ * hva_node/ofs_node needs to be swapped with remove+insert even though
+ * hva/ofs can't change when replacing an existing slot.
*/
hash_add(slots->id_hash, &new->id_node[idx], new->id);
interval_tree_insert(&new->hva_node[idx], &slots->hva_tree);
+ if (kvm_slot_is_private(new))
+ interval_tree_insert(&new->ofs_node[idx], &slots->ofs_tree);
/*
* If the memslot gfn is unchanged, rb_replace_node() can be used to
--
2.17.1
The purpose of this patch is to let fd-based memslots reuse the same
mmu_notifier-based guest memory invalidation code for private pages.
No functional changes except renaming 'hva' to the more neutral
'useraddr' so that it can also cover the 'offset' in a fd where private
pages live.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 8 ++++--
virt/kvm/kvm_main.c | 55 ++++++++++++++++++++++------------------
2 files changed, 36 insertions(+), 27 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 21f8b1880723..07863ff855cd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1464,9 +1464,13 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
}
static inline gfn_t
-hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
+useraddr_to_gfn_memslot(unsigned long useraddr, struct kvm_memory_slot *slot,
+ bool addr_is_hva)
{
- gfn_t gfn_offset = (hva - slot->userspace_addr) >> PAGE_SHIFT;
+ unsigned long useraddr_base = addr_is_hva ? slot->userspace_addr
+ : slot->ofs;
+
+ gfn_t gfn_offset = (useraddr - useraddr_base) >> PAGE_SHIFT;
return slot->base_gfn + gfn_offset;
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 47e96d1eb233..b7a1c4d7eaaa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -486,16 +486,16 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
srcu_read_unlock(&kvm->srcu, idx);
}
-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
unsigned long end);
-struct kvm_hva_range {
+struct kvm_useraddr_range {
unsigned long start;
unsigned long end;
pte_t pte;
- hva_handler_t handler;
+ gfn_handler_t handler;
on_lock_fn_t on_lock;
bool flush_on_ret;
bool may_block;
@@ -515,13 +515,13 @@ static void kvm_null_fn(void)
#define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
/* Iterate over each memslot intersecting [start, last] (inclusive) range */
-#define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \
- for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
+#define kvm_for_each_memslot_in_useraddr_range(node, tree, start, last) \
+ for (node = interval_tree_iter_first(tree, start, last); \
node; \
node = interval_tree_iter_next(node, start, last)) \
-static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
- const struct kvm_hva_range *range)
+static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
+ const struct kvm_useraddr_range *range)
{
bool ret = false, locked = false;
struct kvm_gfn_range gfn_range;
@@ -540,17 +540,19 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
idx = srcu_read_lock(&kvm->srcu);
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ struct rb_root_cached *useraddr_tree;
struct interval_tree_node *node;
slots = __kvm_memslots(kvm, i);
- kvm_for_each_memslot_in_hva_range(node, slots,
+ useraddr_tree = &slots->hva_tree;
+ kvm_for_each_memslot_in_useraddr_range(node, useraddr_tree,
range->start, range->end - 1) {
- unsigned long hva_start, hva_end;
+ unsigned long useraddr_start, useraddr_end;
slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
- hva_start = max(range->start, slot->userspace_addr);
- hva_end = min(range->end, slot->userspace_addr +
- (slot->npages << PAGE_SHIFT));
+ useraddr_start = max(range->start, slot->userspace_addr);
+ useraddr_end = min(range->end, slot->userspace_addr +
+ (slot->npages << PAGE_SHIFT));
/*
* To optimize for the likely case where the address
@@ -562,11 +564,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.may_block = range->may_block;
/*
- * {gfn(page) | page intersects with [hva_start, hva_end)} =
+ * {gfn(page) | page intersects with [useraddr_start, useraddr_end)} =
* {gfn_start, gfn_start+1, ..., gfn_end-1}.
*/
- gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
- gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
+ gfn_range.start = useraddr_to_gfn_memslot(useraddr_start,
+ slot, true);
+ gfn_range.end = useraddr_to_gfn_memslot(
+ useraddr_end + PAGE_SIZE - 1,
+ slot, true);
gfn_range.slot = slot;
if (!locked) {
@@ -597,10 +602,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
pte_t pte,
- hva_handler_t handler)
+ gfn_handler_t handler)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range range = {
+ const struct kvm_useraddr_range range = {
.start = start,
.end = end,
.pte = pte,
@@ -610,16 +615,16 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
.may_block = false,
};
- return __kvm_handle_hva_range(kvm, &range);
+ return __kvm_handle_useraddr_range(kvm, &range);
}
static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
- hva_handler_t handler)
+ gfn_handler_t handler)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range range = {
+ const struct kvm_useraddr_range range = {
.start = start,
.end = end,
.pte = __pte(0),
@@ -629,7 +634,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.may_block = false,
};
- return __kvm_handle_hva_range(kvm, &range);
+ return __kvm_handle_useraddr_range(kvm, &range);
}
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
@@ -687,7 +692,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range hva_range = {
+ const struct kvm_useraddr_range useraddr_range = {
.start = range->start,
.end = range->end,
.pte = __pte(0),
@@ -711,7 +716,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
kvm->mn_active_invalidate_count++;
spin_unlock(&kvm->mn_invalidate_lock);
- __kvm_handle_hva_range(kvm, &hva_range);
+ __kvm_handle_useraddr_range(kvm, &useraddr_range);
return 0;
}
@@ -738,7 +743,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range hva_range = {
+ const struct kvm_useraddr_range useraddr_range = {
.start = range->start,
.end = range->end,
.pte = __pte(0),
@@ -749,7 +754,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
};
bool wake;
- __kvm_handle_hva_range(kvm, &hva_range);
+ __kvm_handle_useraddr_range(kvm, &useraddr_range);
/* Pairs with the increment in range_start(). */
spin_lock(&kvm->mn_invalidate_lock);
--
2.17.1
When fd-based memory is enabled, there will be two types of memory
invalidation:
- memory invalidation from the native MMU through the mmu_notifier
  callback, for hva-based memory, and,
- memory invalidation from memfd through the memfd notifier callback,
  for fd-based memory.
Some code can be shared between these two types of memory invalidation.
This patch moves that shared code into one place so that it can be used
under both CONFIG_MMU_NOTIFIER and CONFIG_MEMFD_OPS.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
virt/kvm/kvm_main.c | 35 +++++++++++++++++++----------------
1 file changed, 19 insertions(+), 16 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 19736a0013a0..7b7530b1ea1e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -469,22 +469,6 @@ void kvm_destroy_vcpus(struct kvm *kvm)
EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
-{
- return container_of(mn, struct kvm, mmu_notifier);
-}
-
-static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start, unsigned long end)
-{
- struct kvm *kvm = mmu_notifier_to_kvm(mn);
- int idx;
-
- idx = srcu_read_lock(&kvm->srcu);
- kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
- srcu_read_unlock(&kvm->srcu, idx);
-}
typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
@@ -611,6 +595,25 @@ static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
/* The notifiers are averse to booleans. :-( */
return (int)ret;
}
+#endif
+
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
+{
+ return container_of(mn, struct kvm, mmu_notifier);
+}
+
+static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+ srcu_read_unlock(&kvm->srcu, idx);
+}
static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
unsigned long start,
--
2.17.1
KVM gets notified when userspace punches a hole in a fd which is used
for guest memory; KVM should then invalidate the mapping in the
secondary MMU page tables. This is the same logic as MMU notifier
invalidation, except that the fd-related information is carried around
to indicate the memory range. KVM can hence reuse most of the existing
MMU notifier invalidation code, including looping through the memslots
and then calling into kvm_unmap_gfn_range(), which should do whatever
is needed for fd-based memory unmapping (e.g. for private memory
managed by TDX it may need to call into the SEAM module).
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 8 ++++-
virt/kvm/kvm_main.c | 69 +++++++++++++++++++++++++++++++---------
virt/kvm/memfd.c | 2 ++
3 files changed, 63 insertions(+), 16 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 07863ff855cd..be567925831b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -233,7 +233,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#endif
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFD_OPS)
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
@@ -2012,4 +2012,10 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536
+#ifdef CONFIG_MEMFD_OPS
+int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end);
+#endif /* CONFIG_MEMFD_OPS */
+
+
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7b7530b1ea1e..f495c1a313bd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -468,7 +468,8 @@ void kvm_destroy_vcpus(struct kvm *kvm)
}
EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#if defined(CONFIG_MEMFD_OPS) ||\
+ (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER))
typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
@@ -595,6 +596,30 @@ static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
/* The notifiers are averse to booleans. :-( */
return (int)ret;
}
+
+static void mn_active_invalidate_count_inc(struct kvm *kvm)
+{
+ spin_lock(&kvm->mn_invalidate_lock);
+ kvm->mn_active_invalidate_count++;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
+}
+
+static void mn_active_invalidate_count_dec(struct kvm *kvm)
+{
+ bool wake;
+
+ spin_lock(&kvm->mn_invalidate_lock);
+ wake = (--kvm->mn_active_invalidate_count == 0);
+ spin_unlock(&kvm->mn_invalidate_lock);
+
+ /*
+ * There can only be one waiter, since the wait happens under
+ * slots_lock.
+ */
+ if (wake)
+ rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
+}
#endif
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
@@ -732,9 +757,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
*
* Pairs with the decrement in range_end().
*/
- spin_lock(&kvm->mn_invalidate_lock);
- kvm->mn_active_invalidate_count++;
- spin_unlock(&kvm->mn_invalidate_lock);
+ mn_active_invalidate_count_inc(kvm);
__kvm_handle_useraddr_range(kvm, &useraddr_range);
@@ -773,21 +796,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
.may_block = mmu_notifier_range_blockable(range),
.inode = NULL,
};
- bool wake;
__kvm_handle_useraddr_range(kvm, &useraddr_range);
/* Pairs with the increment in range_start(). */
- spin_lock(&kvm->mn_invalidate_lock);
- wake = (--kvm->mn_active_invalidate_count == 0);
- spin_unlock(&kvm->mn_invalidate_lock);
-
- /*
- * There can only be one waiter, since the wait happens under
- * slots_lock.
- */
- if (wake)
- rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
+ mn_active_invalidate_count_dec(kvm);
BUG_ON(kvm->mmu_notifier_count < 0);
}
@@ -872,6 +885,32 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
+#ifdef CONFIG_MEMFD_OPS
+int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end)
+{
+ int ret;
+ const struct kvm_useraddr_range useraddr_range = {
+ .start = start,
+ .end = end,
+ .pte = __pte(0),
+ .handler = kvm_unmap_gfn_range,
+ .on_lock = (void *)kvm_null_fn,
+ .flush_on_ret = true,
+ .may_block = false,
+ .inode = inode,
+ };
+
+
+ /* Prevent memslot modification */
+ mn_active_invalidate_count_inc(kvm);
+ ret = __kvm_handle_useraddr_range(kvm, &useraddr_range);
+ mn_active_invalidate_count_dec(kvm);
+
+ return ret;
+}
+#endif /* CONFIG_MEMFD_OPS */
+
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
static int kvm_pm_notifier_call(struct notifier_block *bl,
unsigned long state,
diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
index 662393a76782..547f65f5a187 100644
--- a/virt/kvm/memfd.c
+++ b/virt/kvm/memfd.c
@@ -16,6 +16,8 @@ static const struct memfd_pfn_ops *memfd_ops;
static void memfd_invalidate_page_range(struct inode *inode, void *owner,
pgoff_t start, pgoff_t end)
{
+ kvm_memfd_invalidate_range(owner, inode, start >> PAGE_SHIFT,
+ end >> PAGE_SHIFT);
}
static void memfd_fallocate(struct inode *inode, void *owner,
--
2.17.1
This new function establishes the mapping in the KVM page tables for a
given gfn range. It can be used in the memfd fallocate callback for
memfd-based memory to establish the mapping for the KVM secondary MMU
when the pages are allocated in the memory backend.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 47 ++++++++++++++++++++++++++++++++++++++++
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 5 +++++
3 files changed, 54 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1d275e9d76b5..2856eb662a21 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1568,6 +1568,53 @@ static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
return ret;
}
+bool kvm_map_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ struct kvm_vcpu *vcpu;
+ kvm_pfn_t pfn;
+ gfn_t gfn;
+ int idx;
+ bool ret = true;
+
+ /* Need vcpu context for kvm_mmu_do_page_fault. */
+ vcpu = kvm_get_vcpu(kvm, 0);
+ if (mutex_lock_killable(&vcpu->mutex))
+ return false;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&kvm->srcu);
+
+ kvm_mmu_reload(vcpu);
+
+ gfn = range->start;
+ while (gfn < range->end) {
+ if (signal_pending(current)) {
+ ret = false;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+ pfn = kvm_mmu_do_page_fault(vcpu, gfn << PAGE_SHIFT,
+ PFERR_WRITE_MASK | PFERR_USER_MASK,
+ false);
+ if (is_error_noslot_pfn(pfn) || kvm->vm_bugged) {
+ ret = false;
+ break;
+ }
+
+ gfn++;
+ }
+
+ srcu_read_unlock(&kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ mutex_unlock(&vcpu->mutex);
+
+ return ret;
+}
+
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index be567925831b..8c2359175509 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -241,6 +241,8 @@ struct kvm_gfn_range {
pte_t pte;
bool may_block;
};
+
+bool kvm_map_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f495c1a313bd..660ce15973ad 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -471,6 +471,11 @@ EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
#if defined(CONFIG_MEMFD_OPS) ||\
(defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER))
+bool __weak kvm_map_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return false;
+}
+
typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
--
2.17.1
KVM gets notified through the memfd notifier when userspace allocates
space via fallocate() on the fd which is used for guest memory, and KVM
can set up the mapping in the secondary MMU page tables at this time.
This patch adds a function in KVM to map pfns to gfns when the pages
are allocated in the memory backend.
While it is possible to postpone the secondary MMU mapping to the KVM
page fault handler, we can reduce some VM exits by also mapping the
secondary page tables when a page is mapped in the primary MMU.
It reuses the same code as kvm_memfd_invalidate_range, except that
kvm_map_gfn_range is used as the handler.
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 22 +++++++++++++++++++---
virt/kvm/memfd.c | 2 ++
3 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8c2359175509..ad89a0e8bf6b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2017,6 +2017,8 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
#ifdef CONFIG_MEMFD_OPS
int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
unsigned long start, unsigned long end);
+int kvm_memfd_fallocate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end);
#endif /* CONFIG_MEMFD_OPS */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 660ce15973ad..36dd2adcd7fc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -891,15 +891,17 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
#ifdef CONFIG_MEMFD_OPS
-int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
- unsigned long start, unsigned long end)
+int kvm_memfd_handle_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end,
+ gfn_handler_t handler)
+
{
int ret;
const struct kvm_useraddr_range useraddr_range = {
.start = start,
.end = end,
.pte = __pte(0),
- .handler = kvm_unmap_gfn_range,
+ .handler = handler,
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = true,
.may_block = false,
@@ -914,6 +916,20 @@ int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
return ret;
}
+
+int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end)
+{
+ return kvm_memfd_handle_range(kvm, inode, start, end,
+ kvm_unmap_gfn_range);
+}
+
+int kvm_memfd_fallocate_range(struct kvm *kvm, struct inode *inode,
+ unsigned long start, unsigned long end)
+{
+ return kvm_memfd_handle_range(kvm, inode, start, end,
+ kvm_map_gfn_range);
+}
#endif /* CONFIG_MEMFD_OPS */
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
index 547f65f5a187..91a17c9fbc49 100644
--- a/virt/kvm/memfd.c
+++ b/virt/kvm/memfd.c
@@ -23,6 +23,8 @@ static void memfd_invalidate_page_range(struct inode *inode, void *owner,
static void memfd_fallocate(struct inode *inode, void *owner,
pgoff_t start, pgoff_t end)
{
+ kvm_memfd_fallocate_range(owner, inode, start >> PAGE_SHIFT,
+ end >> PAGE_SHIFT);
}
static bool memfd_get_owner(void *owner)
--
2.17.1
This new exit allows user space to handle memory-related errors.
Currently it supports two types (KVM_EXIT_MEM_MAP_SHARED/PRIVATE) of
errors which are used for shared memory <-> private memory conversion
in memory encryption usage.
After private memory is enabled, there are two places in KVM that can
exit to userspace to trigger private <-> shared conversion:
- explicit conversion: happens when the guest explicitly calls into KVM
to map a range (as private or shared), KVM then exits to userspace to
do the map/unmap operations.
- implicit conversion: happens in the KVM page fault handler.
* If the fault is due to a private memory access, it causes a
userspace exit for a shared->private conversion request when the
page has not been allocated in the private memory backend.
* If the fault is due to a shared memory access, it causes a
userspace exit for a private->shared conversion request when the
page has already been allocated in the private memory backend.
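Purely as an illustration (not part of this patch), a VMM could handle
this exit against the uAPI above roughly as follows; the private_fd and
the slot gpa/offset bookkeeping are assumptions about the VMM, not
something this series defines:

#define _GNU_SOURCE		/* for fallocate() */
#include <fcntl.h>
#include <linux/falloc.h>
#include <linux/kvm.h>
#include <stdint.h>

/* Hypothetical VMM helper: 'private_fd' backs the private half of the
 * memslot that starts at 'slot_gpa' with file offset 'slot_offset'. */
static int handle_memory_error_exit(struct kvm_run *run, int private_fd,
				    uint64_t slot_gpa, uint64_t slot_offset)
{
	off_t offset = run->mem.u.map.gpa - slot_gpa + slot_offset;
	off_t size = run->mem.u.map.size;

	if (run->mem.type == KVM_EXIT_MEM_MAP_PRIVATE)
		/* shared -> private: allocate backing store in the fd. */
		return fallocate(private_fd, 0, offset, size);

	/* private -> shared: punch a hole so the range turns shared again. */
	return fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, size);
}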
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/uapi/linux/kvm.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 41434322fa23..d68db3b2eeec 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -243,6 +243,18 @@ struct kvm_xen_exit {
} u;
};
+struct kvm_memory_exit {
+#define KVM_EXIT_MEM_MAP_SHARED 1
+#define KVM_EXIT_MEM_MAP_PRIVATE 2
+ __u32 type;
+ union {
+ struct {
+ __u64 gpa;
+ __u64 size;
+ } map;
+ } u;
+};
+
#define KVM_S390_GET_SKEYS_NONE 1
#define KVM_S390_SKEYS_MAX 1048576
@@ -282,6 +294,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_X86_BUS_LOCK 33
#define KVM_EXIT_XEN 34
#define KVM_EXIT_RISCV_SBI 35
+#define KVM_EXIT_MEMORY_ERROR 36
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -499,6 +512,8 @@ struct kvm_run {
unsigned long args[6];
unsigned long ret[2];
} riscv_sbi;
+ /* KVM_EXIT_MEMORY_ERROR */
+ struct kvm_memory_exit mem;
/* Fix the size of the union. */
char padding[256];
};
--
2.17.1
Use the new extended memslot structure kvm_userspace_memory_region_ext
which includes two additional fd/ofs fields compared to the current
kvm_userspace_memory_region. The fields fd/ofs will be copied from
userspace only when KVM_MEM_PRIVATE is set.
Internally in KVM we change all existing uses of
kvm_userspace_memory_region to kvm_userspace_memory_region_ext since the
new extended structure covers all the existing fields in
kvm_userspace_memory_region.
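For illustration only, a userspace caller would then fill the extended
structure along these lines; vm_fd, private_fd and shared_hva are
assumed to already exist, and the addresses/sizes are made up:

	struct kvm_userspace_memory_region_ext region = {
		.slot = 0,
		.flags = KVM_MEM_PRIVATE,
		.guest_phys_addr = 0x100000000ULL,
		.memory_size = 0x40000000ULL,		/* 1G */
		.userspace_addr = (__u64)(unsigned long)shared_hva,
		.fd  = private_fd,	/* fd of the private memory backend */
		.ofs = 0,		/* offset into that fd */
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
		perror("KVM_SET_USER_MEMORY_REGION");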
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/x86.c | 2 +-
include/linux/kvm_host.h | 4 ++--
virt/kvm/kvm_main.c | 19 +++++++++++++------
3 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 42bde45a1bc2..52942195def3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11551,7 +11551,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
}
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- struct kvm_userspace_memory_region m;
+ struct kvm_userspace_memory_region_ext m;
m.slot = id | (i << 16);
m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ad89a0e8bf6b..fabab3b77d57 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -981,9 +981,9 @@ enum kvm_mr_change {
};
int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region_ext *mem);
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region_ext *mem);
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 36dd2adcd7fc..cf8dcb3b8c7f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1514,7 +1514,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
}
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_userspace_memory_region_ext *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -1907,7 +1907,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
* Must be called holding kvm->slots_lock for write.
*/
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_userspace_memory_region_ext *mem)
{
struct kvm_memory_slot *old, *new;
struct kvm_memslots *slots;
@@ -2011,7 +2011,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_userspace_memory_region_ext *mem)
{
int r;
@@ -2023,7 +2023,7 @@ int kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(kvm_set_memory_region);
static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
- struct kvm_userspace_memory_region *mem)
+ struct kvm_userspace_memory_region_ext *mem)
{
if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
return -EINVAL;
@@ -4569,12 +4569,19 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
case KVM_SET_USER_MEMORY_REGION: {
- struct kvm_userspace_memory_region kvm_userspace_mem;
+ struct kvm_userspace_memory_region_ext kvm_userspace_mem;
r = -EFAULT;
if (copy_from_user(&kvm_userspace_mem, argp,
- sizeof(kvm_userspace_mem)))
+ sizeof(struct kvm_userspace_memory_region)))
goto out;
+ if (kvm_userspace_mem.flags & KVM_MEM_PRIVATE) {
+ int offset = offsetof(
+ struct kvm_userspace_memory_region_ext, ofs);
+ if (copy_from_user(&kvm_userspace_mem.ofs, argp + offset,
+ sizeof(kvm_userspace_mem) - offset))
+ goto out;
+ }
r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
break;
--
2.17.1
When a page fault on the secondary page table happens in a memslot with
KVM_MEM_PRIVATE while the guest is running, we need to go down
different paths for private access and shared access.
- For private access, KVM checks if the page is already allocated in
the memory backend: if yes, KVM establishes the mapping, otherwise it
exits to userspace to convert a shared page to a private one.
- For shared access, KVM also checks if the page is already allocated
in the memory backend: if yes, it exits to userspace to convert a
private page to a shared one, otherwise the access is treated as
traditional hva-based shared memory and KVM lets the existing code
obtain a pfn with get_user_pages() and establish the mapping.
The above code assumes private memory is persistent and pre-allocated in
the memory backend, so KVM can use this information as an indicator of
whether a page is private or shared. The above check is then performed
by calling kvm_memfd_get_pfn(), which currently is implemented as a
pagecache search but in theory can be implemented differently (e.g. when
the page is not even mapped into the host pagecache there should be a
different implementation).
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 73 ++++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
2 files changed, 77 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2856eb662a21..fbcdf62f8281 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2920,6 +2920,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
+ if (kvm_slot_is_private(slot))
+ return max_level;
+
host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
return min(host_level, max_level);
}
@@ -3950,7 +3953,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
}
-static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
+static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+ /*
+ * At this time private gfn has not been supported yet. Other patch
+ * that enables it should change this.
+ */
+ return false;
+}
+
+static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
+ bool *is_private_pfn, int *r)
+{
+ int order;
+ int mem_convert_type;
+ struct kvm_memory_slot *slot = fault->slot;
+ long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
+
+ if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) {
+ if (pfn < 0)
+ mem_convert_type = KVM_EXIT_MEM_MAP_PRIVATE;
+ else {
+ fault->pfn = pfn;
+ if (slot->flags & KVM_MEM_READONLY)
+ fault->map_writable = false;
+ else
+ fault->map_writable = true;
+
+ if (order == 0)
+ fault->max_level = PG_LEVEL_4K;
+ *is_private_pfn = true;
+ *r = RET_PF_FIXED;
+ return true;
+ }
+ } else {
+ if (pfn < 0)
+ return false;
+
+ kvm_memfd_put_pfn(pfn);
+ mem_convert_type = KVM_EXIT_MEM_MAP_SHARED;
+ }
+
+ vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
+ vcpu->run->mem.type = mem_convert_type;
+ vcpu->run->mem.u.map.gpa = fault->gfn << PAGE_SHIFT;
+ vcpu->run->mem.u.map.size = PAGE_SIZE;
+ fault->pfn = -1;
+ *r = -1;
+ return true;
+}
+
+static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ bool *is_private_pfn, int *r)
{
struct kvm_memory_slot *slot = fault->slot;
bool async;
@@ -3984,6 +4039,10 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
}
}
+ if (kvm_slot_is_private(slot) &&
+ kvm_faultin_pfn_private(vcpu, fault, is_private_pfn, r))
+ return *r == RET_PF_FIXED ? false : true;
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
fault->write, &fault->map_writable,
@@ -4044,6 +4103,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
unsigned long mmu_seq;
+ bool is_private_pfn = false;
int r;
fault->gfn = fault->addr >> PAGE_SHIFT;
@@ -4063,7 +4123,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();
- if (kvm_faultin_pfn(vcpu, fault, &r))
+ if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
return r;
if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r))
@@ -4076,7 +4136,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
else
write_lock(&vcpu->kvm->mmu_lock);
- if (is_page_fault_stale(vcpu, fault, mmu_seq))
+ if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
goto out_unlock;
r = make_mmu_pages_available(vcpu);
@@ -4093,7 +4153,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
read_unlock(&vcpu->kvm->mmu_lock);
else
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+
+ if (is_private_pfn)
+ kvm_memfd_put_pfn(fault->pfn);
+ else
+ kvm_release_pfn_clean(fault->pfn);
+
return r;
}
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5b5bdac97c7b..640fd1e2fe4c 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -825,6 +825,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
int r;
unsigned long mmu_seq;
bool is_self_change_mapping;
+ bool is_private_pfn = false;
+
pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code);
WARN_ON_ONCE(fault->is_tdp);
@@ -873,7 +875,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();
- if (kvm_faultin_pfn(vcpu, fault, &r))
+ if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
return r;
if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r))
@@ -901,7 +903,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
r = RET_PF_RETRY;
write_lock(&vcpu->kvm->mmu_lock);
- if (is_page_fault_stale(vcpu, fault, mmu_seq))
+ if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
goto out_unlock;
kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
@@ -913,7 +915,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ if (is_private_pfn)
+ kvm_memfd_put_pfn(fault->pfn);
+ else
+ kvm_release_pfn_clean(fault->pfn);
return r;
}
--
2.17.1
Expose the KVM_MEM_PRIVATE flag and register/unregister the private
memory slot to memfd when userspace sets the flag.
KVM_MEM_PRIVATE is disallowed by default but architecture code can
turn it on by implementing kvm_arch_private_memory_supported().
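A minimal sketch of an architecture opt-in could look like the
following; the vm_type field and KVM_X86_PROTECTED_VM constant are
hypothetical placeholders, not something this series adds:

bool kvm_arch_private_memory_supported(struct kvm *kvm)
{
	/* Hypothetical condition: only protected VM types opt in. */
	return kvm->arch.vm_type == KVM_X86_PROTECTED_VM;
}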
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 34 ++++++++++++++++++++++++++++++++--
2 files changed, 33 insertions(+), 2 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fabab3b77d57..5173c52e70d4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1229,6 +1229,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
int kvm_arch_post_init_vm(struct kvm *kvm);
void kvm_arch_pre_destroy_vm(struct kvm *kvm);
int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_private_memory_supported(struct kvm *kvm);
#ifndef __KVM_HAVE_ARCH_VM_ALLOC
/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf8dcb3b8c7f..1caebded52c4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1514,10 +1514,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
}
-static int check_memory_region_flags(const struct kvm_userspace_memory_region_ext *mem)
+bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
+{
+ return false;
+}
+
+static int check_memory_region_flags(struct kvm *kvm,
+ const struct kvm_userspace_memory_region_ext *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+ if (kvm_arch_private_memory_supported(kvm))
+ valid_flags |= KVM_MEM_PRIVATE;
+
#ifdef __KVM_HAVE_READONLY_MEM
valid_flags |= KVM_MEM_READONLY;
#endif
@@ -1756,6 +1765,8 @@ static void kvm_delete_memslot(struct kvm *kvm,
struct kvm_memory_slot *old,
struct kvm_memory_slot *invalid_slot)
{
+ if (old->flags & KVM_MEM_PRIVATE)
+ kvm_memfd_unregister(old);
/*
* Remove the old memslot (in the inactive memslots) by passing NULL as
* the "new" slot, and for the invalid version in the active slots.
@@ -1836,6 +1847,14 @@ static int kvm_set_memslot(struct kvm *kvm,
kvm_invalidate_memslot(kvm, old, invalid_slot);
}
+ if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) {
+ r = kvm_memfd_register(kvm, new);
+ if (r) {
+ mutex_unlock(&kvm->slots_arch_lock);
+ return r;
+ }
+ }
+
r = kvm_prepare_memory_region(kvm, old, new, change);
if (r) {
/*
@@ -1850,6 +1869,10 @@ static int kvm_set_memslot(struct kvm *kvm,
} else {
mutex_unlock(&kvm->slots_arch_lock);
}
+
+ if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE)
+ kvm_memfd_unregister(new);
+
return r;
}
@@ -1917,7 +1940,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
int as_id, id;
int r;
- r = check_memory_region_flags(mem);
+ r = check_memory_region_flags(kvm, mem);
if (r)
return r;
@@ -1974,6 +1997,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
return -EINVAL;
} else { /* Modify an existing slot. */
+ /* Private memslots are immutable, they can only be deleted. */
+ if (mem->flags & KVM_MEM_PRIVATE)
+ return -EINVAL;
+
if ((mem->userspace_addr != old->userspace_addr) ||
(npages != old->npages) ||
((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2002,6 +2029,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
new->npages = npages;
new->flags = mem->flags;
new->userspace_addr = mem->userspace_addr;
+ new->fd = mem->fd;
+ new->file = NULL;
+ new->ofs = mem->ofs;
r = kvm_set_memslot(kvm, old, new, change);
if (r)
--
2.17.1
On Thu, Dec 23, 2021, Chao Peng wrote:
> Extend the memslot definition to provide fd-based private memory support
> by adding two new fields(fd/ofs). The memslot then can maintain memory
> for both shared and private pages in a single memslot. Shared pages are
> provided in the existing way by using userspace_addr(hva) field and
> get_user_pages() while private pages are provided through the new
> fields(fd/ofs). Since there is no 'hva' concept anymore for private
> memory we cannot call get_user_pages() to get a pfn, instead we rely on
> the newly introduced MEMFD_OPS callbacks to do the same job.
>
> This new extension is indicated by a new flag KVM_MEM_PRIVATE.
>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/linux/kvm_host.h | 10 ++++++++++
> include/uapi/linux/kvm.h | 12 ++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f8ed799e8674..2cd35560c44b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -460,8 +460,18 @@ struct kvm_memory_slot {
> u32 flags;
> short id;
> u16 as_id;
> + u32 fd;
There should be no need to store the fd in the memslot, the fd should be unneeded
outside of __kvm_set_memory_region(), e.g.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1caebded52c4..4e43262887a3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2029,10 +2029,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
new->npages = npages;
new->flags = mem->flags;
new->userspace_addr = mem->userspace_addr;
- new->fd = mem->fd;
- new->file = NULL;
- new->ofs = mem->ofs;
-
+ if (mem->flags & KVM_MEM_PRIVATE) {
+ new->private_file = fget(mem->private_fd);
+ new->private_offset = mem->private_offset;
+ }
r = kvm_set_memslot(kvm, old, new, change);
if (r)
kfree(new);
> + struct file *file;
Please use more descriptive names, shaving characters is not at all a priority.
> + u64 ofs;
I believe this should be loff_t.
struct file *private_file;
	loff_t private_offset;
> };
>
> +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> +{
> + if (slot && (slot->flags & KVM_MEM_PRIVATE))
> + return true;
> + return false;
return slot && (slot->flags & KVM_MEM_PRIVATE);
> +}
> +
> static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> {
> return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 1daa45268de2..41434322fa23 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,17 @@ struct kvm_userspace_memory_region {
> __u64 userspace_addr; /* start of the userspace allocated memory */
> };
>
> +struct kvm_userspace_memory_region_ext {
> + __u32 slot;
> + __u32 flags;
> + __u64 guest_phys_addr;
> + __u64 memory_size; /* bytes */
> + __u64 userspace_addr; /* hva */
Would it make sense to embed "struct kvm_userspace_memory_region"?
> + __u64 ofs; /* offset into fd */
> + __u32 fd;
Again, use descriptive names, then comments like "offset into fd" are unnecessary.
__u64 private_offset;
__u32 private_fd;
> + __u32 padding[5];
> +};
> +
> /*
> * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> * other bits are reserved for kvm internal use which are defined in
> @@ -110,6 +121,7 @@ struct kvm_userspace_memory_region {
> */
> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> #define KVM_MEM_READONLY (1UL << 1)
> +#define KVM_MEM_PRIVATE (1UL << 2)
>
> /* for KVM_IRQ_LINE */
> struct kvm_irq_level {
> --
> 2.17.1
>
On Thu, Dec 23, 2021, Chao Peng wrote:
> Similar to hva_tree for hva range, maintain interval tree ofs_tree for
> offset range of a fd-based memslot so the lookup by offset range can be
> faster when memslot count is high.
This won't work. The hva_tree relies on there being exactly one virtual address
space, whereas with private memory, userspace can map multiple files into the
guest at different gfns, but with overlapping offsets.
I also dislike hijacking __kvm_handle_hva_range() in patch 07.
KVM also needs to disallow mapping the same file+offset into multiple gfns, which
I don't see anywhere in this series.
In other words, there needs to be a 1:1 gfn:file+offset mapping. Since userspace
likely wants to allocate a single file for guest private memory and map it into
multiple discontiguous slots, e.g. to skip the PCI hole, the best idea off the top
of my head would be to register the notifier on a per-slot basis, not a per-VM
basis. It would require a 'struct kvm *' in 'struct kvm_memory_slot', but that's
not a huge deal.
That way, KVM's notifier callback already knows the memslot and can compute overlap
between the memslot and the range by reversing the math done by kvm_memfd_get_pfn().
Then, armed with the gfn and slot, invalidation is just a matter of constructing
a struct kvm_gfn_range and invoking kvm_unmap_gfn_range().
On Thu, Dec 23, 2021, Chao Peng wrote:
> This new function establishes the mapping in KVM page tables for a
> given gfn range. It can be used in the memory fallocate callback for
> memfd based memory to establish the mapping for KVM secondary MMU when
> the pages are allocated in the memory backend.
NAK, under no circumstance should KVM install SPTEs in response to allocating
memory in a file. The correct thing to do is to invalidate the gfn range
associated with the newly mapped range, i.e. wipe out any shared SPTEs associated
with the memslot.
On Thu, Dec 23, 2021, Chao Peng wrote:
> This new exit allows user space to handle memory-related errors.
> Currently it supports two types (KVM_EXIT_MEM_MAP_SHARED/PRIVATE) of
> errors which are used for shared memory <-> private memory conversion
> in memory encryption usage.
>
> After private memory is enabled, there are two places in KVM that can
> exit to userspace to trigger private <-> shared conversion:
> - explicit conversion: happens when guest explicitly calls into KVM to
> map a range (as private or shared), KVM then exits to userspace to
> do the map/unmap operations.
> - implicit conversion: happens in KVM page fault handler.
> * if the fault is due to a private memory access then causes a
> userspace exit for a shared->private conversion request when the
> page has not been allocated in the private memory backend.
> * If the fault is due to a shared memory access then causes a
> userspace exit for a private->shared conversion request when the
> page has already been allocated in the private memory backend.
>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/uapi/linux/kvm.h | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 41434322fa23..d68db3b2eeec 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -243,6 +243,18 @@ struct kvm_xen_exit {
> } u;
> };
>
> +struct kvm_memory_exit {
> +#define KVM_EXIT_MEM_MAP_SHARED 1
> +#define KVM_EXIT_MEM_MAP_PRIVATE 2
I don't think the exit should explicitly say "map", it's userspace's decision on
what to do in response to the exit, i.e. describe the properties of the exit,
not what userspace should do in response to the exit.
> + __u32 type;
Hmm, I think private vs. shared should be a flag, not a type, and should be split
out from the type of error, i.e. !KVM_EXIT_MEMORY_PRIVATE == KVM_EXIT_MEMORY_SHARED.
By error type I mean page fault vs. KVM access vs. ???. And we'll probably want to
communicate read/write/execute information as well.
To get this uABI right the first time, I think we should implement this support
in advance of this series and wire up all architectures to use the new exit reason
instead of -EFAULT. It's mostly just page fault handlers, but KVM access to guest
memory, e.g. when reading/writing steal_time, also needs to use this new exit
reason.
Having two __u32s for "error" and "flags" would also pad things so that the __u64s
are correctly aligned.
> + union {
> + struct {
> + __u64 gpa;
> + __u64 size;
> + } map;
> + } u;
I'd strongly prefer to avoid another union, those get nasty when userspace just
wants to dump the info because then the meaning of each field is conditional on
the flags/error. I don't think we'll get _that_ many fields, certainly far fewer than
256 bytes total, so the footprint really isn't an issue.
> +};
> +
> #define KVM_S390_GET_SKEYS_NONE 1
> #define KVM_S390_SKEYS_MAX 1048576
>
> @@ -282,6 +294,7 @@ struct kvm_xen_exit {
> #define KVM_EXIT_X86_BUS_LOCK 33
> #define KVM_EXIT_XEN 34
> #define KVM_EXIT_RISCV_SBI 35
> +#define KVM_EXIT_MEMORY_ERROR 36
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -499,6 +512,8 @@ struct kvm_run {
> unsigned long args[6];
> unsigned long ret[2];
> } riscv_sbi;
> + /* KVM_EXIT_MEMORY_ERROR */
> + struct kvm_memory_exit mem;
As gross as it is to make struct kvm_run super long, I actually prefer the inline
definitions, e.g.
struct {
__u32 flags;
__u32 padding;
__u64 gpa;
__u64 size;
} memory;
> /* Fix the size of the union. */
> char padding[256];
> };
> --
> 2.17.1
>
On Thu, Dec 23, 2021, Chao Peng wrote:
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 03b2ce34e7f4..86655cd660ca 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -46,6 +46,7 @@ config KVM
> select SRCU
> select INTERVAL_TREE
> select HAVE_KVM_PM_NOTIFIER if PM
> + select MEMFD_OPS
MEMFD_OPS is a weird Kconfig name given that it's not just memfd() that can
implement the ops.
> help
> Support hosting fully virtualized guest machines using hardware
> virtualization extensions. You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3bd875f9669f..21f8b1880723 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -806,6 +806,12 @@ static inline void kvm_irqfd_exit(void)
> {
> }
> #endif
> +
> +int kvm_memfd_register(struct kvm *kvm, struct kvm_memory_slot *slot);
> +void kvm_memfd_unregister(struct kvm_memory_slot *slot);
> +long kvm_memfd_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, int *order);
> +void kvm_memfd_put_pfn(kvm_pfn_t pfn);
> +
> int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
> struct module *module);
> void kvm_exit(void);
> diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
> index ffdcad3cc97a..8842128d8429 100644
> --- a/virt/kvm/Makefile.kvm
> +++ b/virt/kvm/Makefile.kvm
> @@ -5,7 +5,7 @@
>
> KVM ?= ../../../virt/kvm
>
> -kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
> +kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o $(KVM)/memfd.o
This should be
kvm-$(CONFIG_MEMFD_OPS) += $(KVM)/memfd.o
with stubs provided in a header file as needed. I also really dislike naming KVM's
file memfd.c, though I don't have a good alternative off the top of my head.
> kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
> kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
> kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> +#ifdef CONFIG_MEMFD_OPS
> +static const struct memfd_pfn_ops *memfd_ops;
memfd_ops needs to be associated with the slot, e.g. userspace should be able to
map multiple types of backing stores into a single VM. This doesn't even allow
that for multiple VMs, and there are all kinds of ordering issues.
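E.g. a per-slot copy of the ops would let kvm_memfd_get_pfn() do the
offset math directly off the slot; the pfn_ops/private_* field names
below are guesses for illustration, not what this series currently
implements:

long kvm_memfd_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, int *order)
{
	pgoff_t index = gfn - slot->base_gfn +
			(slot->private_offset >> PAGE_SHIFT);

	/* slot->pfn_ops would be captured at kvm_memfd_register() time. */
	return slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file),
					   index, order);
}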
> +void kvm_memfd_unregister(struct kvm_memory_slot *slot)
> +{
> +#ifdef CONFIG_MEMFD_OPS
> + if (slot->file) {
> + fput(slot->file);
Needs to actually unregister.
> + slot->file = NULL;
> + }
> +#endif
> +}
> --
> 2.17.1
>
On 12/23/21 19:34, Sean Christopherson wrote:
>> select HAVE_KVM_PM_NOTIFIER if PM
>> + select MEMFD_OPS
> MEMFD_OPS is a weird Kconfig name given that it's not just memfd() that can
> implement the ops.
>
Or, it's kvm that implements them to talk to memfd?
Paolo
On Thu, 2021-12-23 at 20:29 +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> The patch introduces a new MEMFD_OPS facility around files created by
> memfd_create() to allow a third kernel component to make use of memory
> bookmarked in a memfd and get notified when the memory in the file
> is allocated/invalidated. It will be used for KVM to use a memfd file
> descriptor as the guest memory backend and KVM will use MEMFD_OPS to
> interact with memfd subsystem. In the future there might be other
> consumers (e.g. VFIO with encrypted device memory).
>
> It consists of two sets of callbacks:
> - memfd_falloc_notifier: callbacks which are provided by KVM and
> called by memfd when memory gets allocated/invalidated through the
> fallocate() ioctl.
> - memfd_pfn_ops: callbacks which are provided by memfd and called by
> KVM to request a memory page from memfd.
>
> Locking is needed for above callbacks to prevent race condition.
> - get_owner/put_owner is used to ensure the owner is still alive in
> the invalidate_page_range/fallocate callback handlers using a
> reference mechanism.
> - page is locked between get_lock_pfn/put_unlock_pfn to ensure the pfn
> is still valid when it's used (e.g. when the KVM page fault handler
> uses it to establish the mapping in the secondary MMU page tables).
>
> Userspace is in charge of guest memory lifecycle: it can allocate the
> memory with fallocate() or punch hole to free memory from the guest.
>
> The file descriptor is passed down to KVM as the guest memory backend. KVM
> registers itself as the owner of the memfd via
> memfd_register_falloc_notifier() and provides memfd_falloc_notifier
> callbacks that need to be called on fallocate() and punching hole.
>
> memfd_register_falloc_notifier() returns memfd_pfn_ops callbacks that
> KVM needs to use for requesting a new page from memfd.
>
> At this time only shmem is supported.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/linux/memfd.h | 22 ++++++
> include/linux/shmem_fs.h | 16 ++++
> mm/Kconfig | 4 +
> mm/memfd.c | 21 ++++++
> mm/shmem.c | 158
> +++++++++++++++++++++++++++++++++++++++
> 5 files changed, 221 insertions(+)
>
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..0007073b53dc 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -13,4 +13,26 @@ static inline long memfd_fcntl(struct file *f,
> unsigned int c, unsigned long a)
> }
> #endif
>
> +#ifdef CONFIG_MEMFD_OPS
> +struct memfd_falloc_notifier {
> + void (*invalidate_page_range)(struct inode *inode, void *owner,
> + pgoff_t start, pgoff_t end);
> + void (*fallocate)(struct inode *inode, void *owner,
> + pgoff_t start, pgoff_t end);
> + bool (*get_owner)(void *owner);
> + void (*put_owner)(void *owner);
> +};
> +
> +struct memfd_pfn_ops {
> + long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int
> *order);
> + void (*put_unlock_pfn)(unsigned long pfn);
> +
> +};
> +
> +extern int memfd_register_falloc_notifier(struct inode *inode, void
> *owner,
> + const struct memfd_falloc_notifier
> *notifier,
> + const struct memfd_pfn_ops **pfn_ops);
> +extern void memfd_unregister_falloc_notifier(struct inode *inode);
> +#endif
> +
> #endif /* __LINUX_MEMFD_H */
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 166158b6e917..503adc63728c 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -12,6 +12,11 @@
>
> /* inode in-kernel data */
>
> +#ifdef CONFIG_MEMFD_OPS
> +struct memfd_falloc_notifier;
> +struct memfd_pfn_ops;
> +#endif
> +
> struct shmem_inode_info {
> spinlock_t lock;
> unsigned int seals; /* shmem seals */
> @@ -24,6 +29,10 @@ struct shmem_inode_info {
> struct shared_policy policy; /* NUMA memory alloc
> policy */
> struct simple_xattrs xattrs; /* list of xattrs */
> atomic_t stop_eviction; /* hold when working on inode
> */
> +#ifdef CONFIG_MEMFD_OPS
> + void *owner;
> + const struct memfd_falloc_notifier *falloc_notifier;
> +#endif
> struct inode vfs_inode;
> };
>
> @@ -96,6 +105,13 @@ extern unsigned long shmem_swap_usage(struct
> vm_area_struct *vma);
> extern unsigned long shmem_partial_swap_usage(struct address_space
> *mapping,
> pgoff_t start, pgoff_t
> end);
>
> +#ifdef CONFIG_MEMFD_OPS
> +extern int shmem_register_falloc_notifier(struct inode *inode, void
> *owner,
> + const struct memfd_falloc_notifier
> *notifier,
> + const struct memfd_pfn_ops **pfn_ops);
> +extern void shmem_unregister_falloc_notifier(struct inode *inode);
> +#endif
> +
> /* Flag allocation requirements to shmem_getpage */
> enum sgp_type {
> SGP_READ, /* don't exceed i_size, don't allocate page */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 28edafc820ad..9989904d1b56 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -900,6 +900,10 @@ config IO_MAPPING
> config SECRETMEM
> def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>
> +config MEMFD_OPS
> + bool
> + depends on MEMFD_CREATE
> +
> source "mm/damon/Kconfig"
>
> endmenu
> diff --git a/mm/memfd.c b/mm/memfd.c
> index c898a007fb76..41861870fc21 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -130,6 +130,27 @@ static unsigned int *memfd_file_seals_ptr(struct
> file *file)
> return NULL;
> }
>
> +#ifdef CONFIG_MEMFD_OPS
> +int memfd_register_falloc_notifier(struct inode *inode, void *owner,
> + const struct memfd_falloc_notifier
> *notifier,
> + const struct memfd_pfn_ops
> **pfn_ops)
> +{
> + if (shmem_mapping(inode->i_mapping))
> + return shmem_register_falloc_notifier(inode, owner,
> + notifier,
> pfn_ops);
> +
> + return -EINVAL;
> +}
> +EXPORT_SYMBOL_GPL(memfd_register_falloc_notifier);
> +
> +void memfd_unregister_falloc_notifier(struct inode *inode)
> +{
> + if (shmem_mapping(inode->i_mapping))
> + shmem_unregister_falloc_notifier(inode);
> +}
> +EXPORT_SYMBOL_GPL(memfd_unregister_falloc_notifier);
> +#endif
> +
> #define F_ALL_SEALS (F_SEAL_SEAL | \
> F_SEAL_SHRINK | \
> F_SEAL_GROW | \
> diff --git a/mm/shmem.c b/mm/shmem.c
> index faa7e9b1b9bc..4d8a75c4d037 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -78,6 +78,7 @@ static struct vfsmount *shm_mnt;
> #include <linux/userfaultfd_k.h>
> #include <linux/rmap.h>
> #include <linux/uuid.h>
> +#include <linux/memfd.h>
>
> #include <linux/uaccess.h>
>
> @@ -906,6 +907,68 @@ static bool shmem_punch_compound(struct page
> *page, pgoff_t start, pgoff_t end)
> return split_huge_page(page) >= 0;
> }
>
> +static void notify_fallocate(struct inode *inode, pgoff_t start,
> pgoff_t end)
> +{
> +#ifdef CONFIG_MEMFD_OPS
> + struct shmem_inode_info *info = SHMEM_I(inode);
> + const struct memfd_falloc_notifier *notifier;
> + void *owner;
> + bool ret;
> +
> + if (!info->falloc_notifier)
> + return;
> +
> + spin_lock(&info->lock);
> + notifier = info->falloc_notifier;
> + if (!notifier) {
> + spin_unlock(&info->lock);
> + return;
> + }
> +
> + owner = info->owner;
> + ret = notifier->get_owner(owner);
> + spin_unlock(&info->lock);
> + if (!ret)
> + return;
> +
> + notifier->fallocate(inode, owner, start, end);
I see notifier->fallocate(), i.e. memfd_fallocate(), discards
kvm_memfd_fallocate_range()'s return value. Should it be checked?
> + notifier->put_owner(owner);
> +#endif
> +}
> +
> +static void notify_invalidate_page(struct inode *inode, struct page
> *page,
> + pgoff_t start, pgoff_t end)
> +{
> +#ifdef CONFIG_MEMFD_OPS
> + struct shmem_inode_info *info = SHMEM_I(inode);
> + const struct memfd_falloc_notifier *notifier;
> + void *owner;
> + bool ret;
> +
> + if (!info->falloc_notifier)
> + return;
> +
> + spin_lock(&info->lock);
> + notifier = info->falloc_notifier;
> + if (!notifier) {
> + spin_unlock(&info->lock);
> + return;
> + }
> +
> + owner = info->owner;
> + ret = notifier->get_owner(owner);
> + spin_unlock(&info->lock);
> + if (!ret)
> + return;
> +
> + start = max(start, page->index);
> + end = min(end, page->index + thp_nr_pages(page));
> +
> + notifier->invalidate_page_range(inode, owner, start, end);
> + notifier->put_owner(owner);
> +#endif
> +}
> +
> /*
> * Remove range of pages and swap entries from page cache, and free
> them.
> * If !unfalloc, truncate or punch hole; if unfalloc, undo failed
> fallocate.
> @@ -949,6 +1012,8 @@ static void shmem_undo_range(struct inode
> *inode, loff_t lstart, loff_t lend,
> }
> index += thp_nr_pages(page) - 1;
>
> + notify_invalidate_page(inode, page, start,
> end);
> +
> if (!unfalloc || !PageUptodate(page))
> truncate_inode_page(mapping, page);
> unlock_page(page);
> @@ -1025,6 +1090,9 @@ static void shmem_undo_range(struct inode
> *inode, loff_t lstart, loff_t lend,
> index--;
> break;
> }
> +
> + notify_invalidate_page(inode, page,
> start, end);
> +
> VM_BUG_ON_PAGE(PageWriteback(page),
> page);
> if (shmem_punch_compound(page, start,
> end))
> truncate_inode_page(mapping,
> page);
> @@ -2815,6 +2883,7 @@ static long shmem_fallocate(struct file *file,
> int mode, loff_t offset,
> if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode-
> >i_size)
> i_size_write(inode, offset + len);
> inode->i_ctime = current_time(inode);
> + notify_fallocate(inode, start, end);
> undone:
> spin_lock(&inode->i_lock);
> inode->i_private = NULL;
> @@ -3784,6 +3853,20 @@ static void shmem_destroy_inodecache(void)
> kmem_cache_destroy(shmem_inode_cachep);
> }
>
> +#ifdef CONFIG_MIGRATION
> +int shmem_migrate_page(struct address_space *mapping, struct page
> *newpage,
> + struct page *page, enum migrate_mode mode)
> +{
> +#ifdef CONFIG_MEMFD_OPS
> + struct inode *inode = mapping->host;
> +
> + if (SHMEM_I(inode)->owner)
> + return -EOPNOTSUPP;
> +#endif
> + return migrate_page(mapping, newpage, page, mode);
> +}
> +#endif
> +
> const struct address_space_operations shmem_aops = {
> .writepage = shmem_writepage,
> .set_page_dirty = __set_page_dirty_no_writeback,
> @@ -3798,6 +3881,81 @@ const struct address_space_operations
> shmem_aops = {
> };
> EXPORT_SYMBOL(shmem_aops);
>
> +#ifdef CONFIG_MEMFD_OPS
> +static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset,
> int *order)
> +{
> + struct page *page;
> + int ret;
> +
> + ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC);
> + if (ret)
> + return ret;
> +
> + *order = thp_order(compound_head(page));
> +
> + return page_to_pfn(page);
> +}
> +
> +static void shmem_put_unlock_pfn(unsigned long pfn)
> +{
> + struct page *page = pfn_to_page(pfn);
> +
> + VM_BUG_ON_PAGE(!PageLocked(page), page);
> +
> + set_page_dirty(page);
> + unlock_page(page);
> + put_page(page);
> +}
> +
> +static const struct memfd_pfn_ops shmem_pfn_ops = {
> + .get_lock_pfn = shmem_get_lock_pfn,
> + .put_unlock_pfn = shmem_put_unlock_pfn,
> +};
> +
> +int shmem_register_falloc_notifier(struct inode *inode, void *owner,
> + const struct memfd_falloc_notifier
> *notifier,
> + const struct memfd_pfn_ops **pfn_ops)
> +{
> + gfp_t gfp;
> + struct shmem_inode_info *info = SHMEM_I(inode);
> +
> + if (!inode || !owner || !notifier || !pfn_ops ||
> + !notifier->invalidate_page_range ||
> + !notifier->fallocate ||
> + !notifier->get_owner ||
> + !notifier->put_owner)
> + return -EINVAL;
> +
> + spin_lock(&info->lock);
> + if (info->owner && info->owner != owner) {
> + spin_unlock(&info->lock);
> + return -EPERM;
> + }
> +
> + info->owner = owner;
> + info->falloc_notifier = notifier;
> + spin_unlock(&info->lock);
> +
> + gfp = mapping_gfp_mask(inode->i_mapping);
> + gfp &= ~__GFP_MOVABLE;
> + mapping_set_gfp_mask(inode->i_mapping, gfp);
> + mapping_set_unevictable(inode->i_mapping);
> +
> + *pfn_ops = &shmem_pfn_ops;
> + return 0;
> +}
> +
> +void shmem_unregister_falloc_notifier(struct inode *inode)
> +{
> + struct shmem_inode_info *info = SHMEM_I(inode);
> +
> + spin_lock(&info->lock);
> + info->owner = NULL;
> + info->falloc_notifier = NULL;
> + spin_unlock(&info->lock);
> +}
> +#endif
> +
> static const struct file_operations shmem_file_operations = {
> .mmap = shmem_mmap,
> .get_unmapped_area = shmem_get_unmapped_area,
On Thu, Dec 23, 2021 at 06:02:33PM +0000, Sean Christopherson wrote:
> On Thu, Dec 23, 2021, Chao Peng wrote:
> > Similar to hva_tree for hva range, maintain interval tree ofs_tree for
> > offset range of a fd-based memslot so the lookup by offset range can be
> > faster when memslot count is high.
>
> This won't work. The hva_tree relies on there being exactly one virtual address
> space, whereas with private memory, userspace can map multiple files into the
> guest at different gfns, but with overlapping offsets.
OK, that's the point.
>
> I also dislike hijacking __kvm_handle_hva_range() in patch 07.
>
> KVM also needs to disallow mapping the same file+offset into multiple gfns, which
> I don't see anywhere in this series.
This can be checked against file+offset overlapping with existing slots
when registering a new one.
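Roughly like the following, purely illustrative; the
kvm_for_each_private_memslot() helper and the private_* field names are
placeholders:

static bool kvm_private_range_overlaps(struct kvm *kvm,
				       struct kvm_memory_slot *new)
{
	loff_t new_start = new->private_offset;
	loff_t new_end = new_start + ((loff_t)new->npages << PAGE_SHIFT);
	struct kvm_memory_slot *slot;

	/* placeholder: walk every existing private slot in the VM */
	kvm_for_each_private_memslot(kvm, slot) {
		loff_t start = slot->private_offset;
		loff_t end = start + ((loff_t)slot->npages << PAGE_SHIFT);

		if (file_inode(slot->private_file) !=
		    file_inode(new->private_file))
			continue;

		if (new_start < end && start < new_end)
			return true;
	}
	return false;
}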
>
> In other words, there needs to be a 1:1 gfn:file+offset mapping. Since userspace
> likely wants to allocate a single file for guest private memory and map it into
> multiple discontiguous slots, e.g. to skip the PCI hole, the best idea off the top
> of my head would be to register the notifier on a per-slot basis, not a per-VM
> basis. It would require a 'struct kvm *' in 'struct kvm_memory_slot', but that's
> not a huge deal.
>
> That way, KVM's notifier callback already knows the memslot and can compute overlap
> between the memslot and the range by reversing the math done by kvm_memfd_get_pfn().
> Then, armed with the gfn and slot, invalidation is just a matter of constructing
> a struct kvm_gfn_range and invoking kvm_unmap_gfn_range().
KVM is easy but the kernel bits would be difficult: they have to maintain
an fd+offset to memslot mapping because one fd can have multiple memslots,
and they need to decide which memslot needs to be notified.
Thanks,
Chao
On Thu, Dec 23, 2021 at 06:34:22PM +0000, Sean Christopherson wrote:
> On Thu, Dec 23, 2021, Chao Peng wrote:
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index 03b2ce34e7f4..86655cd660ca 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -46,6 +46,7 @@ config KVM
> > select SRCU
> > select INTERVAL_TREE
> > select HAVE_KVM_PM_NOTIFIER if PM
> > + select MEMFD_OPS
>
> MEMFD_OPS is a weird Kconfig name given that it's not just memfd() that can
> implement the ops.
>
> > help
> > Support hosting fully virtualized guest machines using hardware
> > virtualization extensions. You will need a fairly recent
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3bd875f9669f..21f8b1880723 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -806,6 +806,12 @@ static inline void kvm_irqfd_exit(void)
> > {
> > }
> > #endif
> > +
> > +int kvm_memfd_register(struct kvm *kvm, struct kvm_memory_slot *slot);
> > +void kvm_memfd_unregister(struct kvm_memory_slot *slot);
> > +long kvm_memfd_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, int *order);
> > +void kvm_memfd_put_pfn(kvm_pfn_t pfn);
> > +
> > int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
> > struct module *module);
> > void kvm_exit(void);
> > diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
> > index ffdcad3cc97a..8842128d8429 100644
> > --- a/virt/kvm/Makefile.kvm
> > +++ b/virt/kvm/Makefile.kvm
> > @@ -5,7 +5,7 @@
> >
> > KVM ?= ../../../virt/kvm
> >
> > -kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
> > +kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o $(KVM)/memfd.o
>
> This should be
>
> kvm-$(CONFIG_MEMFD_OPS) += $(KVM)/memfd.o
>
> with stubs provided in a header file as needed. I also really dislike naming KVM's
> file memfd.c, though I don't have a good alternative off the top of my head.
>
> > kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
> > kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
> > kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>
>
> > +#ifdef CONFIG_MEMFD_OPS
> > +static const struct memfd_pfn_ops *memfd_ops;
>
> memfd_ops needs to be associated with the slot, e.g. userspace should be able to
> map multiple types of a backing stores into a single VM.
I considered this but gave up as I'm not so confident that we will
support memory backends other than memfd in the foreseeable future.
>This doesn't even allow
> that for multiple VMs, and there are all kinds of ordering issues.
The current memfd kAPI actually returns the same set of callback pointers
for all the VMs. It supports multiple VMs via the callback parameter
'inode', assuming one inode can be associated with only one VM.
>
> > +void kvm_memfd_unregister(struct kvm_memory_slot *slot)
> > +{
> > +#ifdef CONFIG_MEMFD_OPS
> > + if (slot->file) {
> > + fput(slot->file);
>
> Needs to actually unregister.
Good catch, thanks.
>
> > + slot->file = NULL;
> > + }
> > +#endif
> > +}
> > --
> > 2.17.1
> >
On Thu, Dec 23, 2021 at 06:06:19PM +0000, Sean Christopherson wrote:
> On Thu, Dec 23, 2021, Chao Peng wrote:
> > This new function establishes the mapping in KVM page tables for a
> > given gfn range. It can be used in the memory fallocate callback for
> > memfd based memory to establish the mapping for KVM secondary MMU when
> > the pages are allocated in the memory backend.
>
> NAK, under no circumstance should KVM install SPTEs in response to allocating
> memory in a file. The correct thing to do is to invalidate the gfn range
> associated with the newly mapped range, i.e. wipe out any shared SPTEs associated
> with the memslot.
Right, thanks.
On Thu, Dec 23, 2021 at 06:34:22PM +0000, Sean Christopherson wrote:
> On Thu, Dec 23, 2021, Chao Peng wrote:
> >
> > -kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
> > +kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o $(KVM)/memfd.o
>
> This should be
>
> kvm-$(CONFIG_MEMFD_OPS) += $(KVM)/memfd.o
>
> with stubs provided in a header file as needed. I also really dislike naming KVM's
> file memfd.c, though I don't have a good alternative off the top of my head.
Is memory-backend.c better? If we end up introducing the callback
definition in KVM we can call it CONFIG_KVM_MEMORY_BACKEND_OPS.
Chao
On Fri, Dec 24, 2021 at 12:09:47AM +0100, Paolo Bonzini wrote:
> On 12/23/21 19:34, Sean Christopherson wrote:
> > > select HAVE_KVM_PM_NOTIFIER if PM
> > > + select MEMFD_OPS
> > MEMFD_OPS is a weird Kconfig name given that it's not just memfd() that can
> > implement the ops.
> >
>
> Or, it's kvm that implements them to talk to memfd?
The only thing is VFIO may also use the same set of callbacks, as
discussed in the v2. But I think that's fine.
Chao
>
> Paolo
On Fri, Dec 24, 2021 at 11:54:18AM +0800, Chao Peng wrote:
> On Thu, Dec 23, 2021 at 06:02:33PM +0000, Sean Christopherson wrote:
> > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > Similar to hva_tree for hva range, maintain interval tree ofs_tree for
> > > offset range of a fd-based memslot so the lookup by offset range can be
> > > faster when memslot count is high.
> >
> > This won't work. The hva_tree relies on there being exactly one virtual address
> > space, whereas with private memory, userspace can map multiple files into the
> > guest at different gfns, but with overlapping offsets.
>
> OK, that's the point.
>
> >
> > I also dislike hijacking __kvm_handle_hva_range() in patch 07.
> >
> > KVM also needs to disallow mapping the same file+offset into multiple gfns, which
> > I don't see anywhere in this series.
>
> This can be checked against file+offset overlapping with existing slots
> when register a new one.
>
> >
> > In other words, there needs to be a 1:1 gfn:file+offset mapping. Since userspace
> > likely wants to allocate a single file for guest private memory and map it into
> > multiple discontiguous slots, e.g. to skip the PCI hole, the best idea off the top
> > of my head would be to register the notifier on a per-slot basis, not a per-VM
> > basis. It would require a 'struct kvm *' in 'struct kvm_memory_slot', but that's
> > not a huge deal.
> >
> > That way, KVM's notifier callback already knows the memslot and can compute overlap
> > between the memslot and the range by reversing the math done by kvm_memfd_get_pfn().
> > Then, armed with the gfn and slot, invalidation is just a matter of constructing
> > a struct kvm_gfn_range and invoking kvm_unmap_gfn_range().
>
> KVM is easy but the kernel bits would be difficulty, it has to maintain
> fd+offset to memslot mapping because one fd can have multiple memslots,
> it need decide which memslot needs to be notified.
How about passing a "context" for the fd (e.g. the gfn/hva start point) when
registering the invalidation notifier to the fd? Then in the callback KVM can
convert the offset to an absolute hva/gfn with such "context" and do the
memslot invalidation.
>
> Thanks,
> Chao
On Fri, Dec 24, 2021, Chao Peng wrote:
> On Thu, Dec 23, 2021 at 06:02:33PM +0000, Sean Christopherson wrote:
> > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > Similar to hva_tree for hva range, maintain interval tree ofs_tree for
> > > offset range of a fd-based memslot so the lookup by offset range can be
> > > faster when memslot count is high.
> >
> > This won't work. The hva_tree relies on there being exactly one virtual address
> > space, whereas with private memory, userspace can map multiple files into the
> > guest at different gfns, but with overlapping offsets.
>
> OK, that's the point.
>
> >
> > I also dislike hijacking __kvm_handle_hva_range() in patch 07.
> >
> > KVM also needs to disallow mapping the same file+offset into multiple gfns, which
> > I don't see anywhere in this series.
>
> This can be checked against file+offset overlapping with existing slots
> when register a new one.
>
> >
> > In other words, there needs to be a 1:1 gfn:file+offset mapping. Since userspace
> > likely wants to allocate a single file for guest private memory and map it into
> > multiple discontiguous slots, e.g. to skip the PCI hole, the best idea off the top
> > of my head would be to register the notifier on a per-slot basis, not a per-VM
> > basis. It would require a 'struct kvm *' in 'struct kvm_memory_slot', but that's
> > not a huge deal.
> >
> > That way, KVM's notifier callback already knows the memslot and can compute overlap
> > between the memslot and the range by reversing the math done by kvm_memfd_get_pfn().
> > Then, armed with the gfn and slot, invalidation is just a matter of constructing
> > a struct kvm_gfn_range and invoking kvm_unmap_gfn_range().
>
> KVM is easy but the kernel bits would be difficulty, it has to maintain
> fd+offset to memslot mapping because one fd can have multiple memslots,
> it need decide which memslot needs to be notified.
No, the kernel side maintains an opaque pointer like it does today, KVM handles
reverse engineering the memslot to get the offset and whatever else it needs.
notify_fallocate() and other callbacks are unchanged, though they probably can
drop the inode.
E.g. likely with bad math and handwaving on the overlap detection:
int kvm_private_fd_fallocate_range(void *owner, pgoff_t start, pgoff_t end)
{
struct kvm_memory_slot *slot = owner;
struct kvm_gfn_range gfn_range = {
.slot = slot,
.start = (start - slot->private_offset) >> PAGE_SHIFT,
.end = (end - slot->private_offset) >> PAGE_SHIFT,
.may_block = true,
};
if (!has_overlap(slot, start, end))
return 0;
gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
kvm_unmap_gfn_range(slot->kvm, &gfn_range);
return 0;
}
On Fri, Dec 24, 2021, Chao Peng wrote:
> On Fri, Dec 24, 2021 at 12:09:47AM +0100, Paolo Bonzini wrote:
> > On 12/23/21 19:34, Sean Christopherson wrote:
> > > > select HAVE_KVM_PM_NOTIFIER if PM
> > > > + select MEMFD_OPS
> > > MEMFD_OPS is a weird Kconfig name given that it's not just memfd() that can
> > > implement the ops.
> > >
> >
> > Or, it's kvm that implements them to talk to memfd?
>
> The only thing is VFIO may also use the same set of callbacks, as
> discussed in the v2. But I think that's fine.
I'm objecting to assuming that KVM is talking to memfd. KVM shouldn't know or
care what is sitting behind the fd, KVM only cares whether or not the backing store
provides the necessary APIs.
I also think that the API as a whole should be abstracted from memfd. It's mostly
cosmetic, e.g. tweak the struct and Kconfig name. I don't really care if it's
initially dependent on MEMFD_CREATE, I just don't want to end up with an API and
KVM implementation that implies there's something fundamentally special about memfd.
On Tue, Dec 28, 2021 at 09:48:08PM +0000, Sean Christopherson wrote:
> On Fri, Dec 24, 2021, Chao Peng wrote:
> > On Thu, Dec 23, 2021 at 06:02:33PM +0000, Sean Christopherson wrote:
> > > On Thu, Dec 23, 2021, Chao Peng wrote:
> > >
> > > In other words, there needs to be a 1:1 gfn:file+offset mapping. Since userspace
> > > likely wants to allocate a single file for guest private memory and map it into
> > > multiple discontiguous slots, e.g. to skip the PCI hole, the best idea off the top
> > > of my head would be to register the notifier on a per-slot basis, not a per-VM
> > > basis. It would require a 'struct kvm *' in 'struct kvm_memory_slot', but that's
> > > not a huge deal.
> > >
> > > That way, KVM's notifier callback already knows the memslot and can compute overlap
> > > between the memslot and the range by reversing the math done by kvm_memfd_get_pfn().
> > > Then, armed with the gfn and slot, invalidation is just a matter of constructing
> > > a struct kvm_gfn_range and invoking kvm_unmap_gfn_range().
> >
> > KVM is easy but the kernel bits would be difficulty, it has to maintain
> > fd+offset to memslot mapping because one fd can have multiple memslots,
> > it need decide which memslot needs to be notified.
>
> No, the kernel side maintains an opaque pointer like it does today,
But the opaque pointer will now become the memslot, won't it? That said,
the kernel side should maintain a list of opaque pointers (memslots) instead
of one for each fd (inode) since the fd to memslot mapping is 1:M now.
>KVM handles
> reverse engineering the memslot to get the offset and whatever else it needs.
> notify_fallocate() and other callbacks are unchanged, though they probably can
> drop the inode.
>
> E.g. likely with bad math and handwaving on the overlap detection:
>
> int kvm_private_fd_fallocate_range(void *owner, pgoff_t start, pgoff_t end)
> {
> struct kvm_memory_slot *slot = owner;
> struct kvm_gfn_range gfn_range = {
> .slot = slot,
> .start = (start - slot->private_offset) >> PAGE_SHIFT,
> .end = (end - slot->private_offset) >> PAGE_SHIFT,
> .may_block = true,
> };
>
> if (!has_overlap(slot, start, end))
> return 0;
>
> gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
>
> kvm_unmap_gfn_range(slot->kvm, &gfn_range);
> return 0;
> }
I understand this KVM side handling, but again one fd can have multiple
memslots. How does shmem decide which memslot to notify from a list of
memslots when it invokes notify_fallocate()? Or should it just notify all
the possible memslots and then let KVM check?
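E.g. shmem could keep a list of registered owners per inode and notify
them all, letting each owner filter the range itself. A rough sketch,
with made-up names and the synchronization against (un)registration
elided:

struct memfd_notifier_node {
	struct list_head list;
	const struct memfd_falloc_notifier *notifier;
	void *owner;				/* e.g. one memslot */
};

static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
{
	struct shmem_inode_info *info = SHMEM_I(inode);
	struct memfd_notifier_node *node;

	/* walk every registration; each owner checks overlap itself */
	list_for_each_entry(node, &info->falloc_notifiers, list) {
		if (!node->notifier->get_owner(node->owner))
			continue;
		node->notifier->fallocate(inode, node->owner, start, end);
		node->notifier->put_owner(node->owner);
	}
}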
Thanks,
Chao
On Fri, Dec 24, 2021 at 12:13:51PM +0800, Chao Peng wrote:
> On Thu, Dec 23, 2021 at 06:06:19PM +0000, Sean Christopherson wrote:
> > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > This new function establishes the mapping in KVM page tables for a
> > > given gfn range. It can be used in the memory fallocate callback for
> > > memfd based memory to establish the mapping for KVM secondary MMU when
> > > the pages are allocated in the memory backend.
> >
> > NAK, under no circumstance should KVM install SPTEs in response to allocating
> > memory in a file. The correct thing to do is to invalidate the gfn range
> > associated with the newly mapped range, i.e. wipe out any shared SPTEs associated
> > with the memslot.
>
> Right, thanks.
BTW, I think the current fallocate() callback is just useless as long as
we don't want to install KVM SPTEs in response to allocating memory in a
file. The invalidation of the shared SPTEs should be notified through
mmu_notifier of the shared memory backend, not memfd_notifier of the
private memory backend.
Thanks,
Chao
On Fri, Dec 24, 2021 at 11:53:15AM +0800, Robert Hoo wrote:
> On Thu, 2021-12-23 at 20:29 +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > +static void notify_fallocate(struct inode *inode, pgoff_t start,
> > pgoff_t end)
> > +{
> > +#ifdef CONFIG_MEMFD_OPS
> > + struct shmem_inode_info *info = SHMEM_I(inode);
> > + const struct memfd_falloc_notifier *notifier;
> > + void *owner;
> > + bool ret;
> > +
> > + if (!info->falloc_notifier)
> > + return;
> > +
> > + spin_lock(&info->lock);
> > + notifier = info->falloc_notifier;
> > + if (!notifier) {
> > + spin_unlock(&info->lock);
> > + return;
> > + }
> > +
> > + owner = info->owner;
> > + ret = notifier->get_owner(owner);
> > + spin_unlock(&info->lock);
> > + if (!ret)
> > + return;
> > +
> > + notifier->fallocate(inode, owner, start, end);
>
> I see notifier->fallocate(), i.e. memfd_fallocate(), discards
> kvm_memfd_fallocate_range()'s return value. Should it be checked?
I think we can ignore it, just like the current mmu_notifier does:
the return value of __kvm_handle_hva_range is discarded in
kvm_mmu_notifier_invalidate_range_start(). Even when the KVM side fails,
it's not fatal and should not block the operation in the primary MMU.
Thanks,
Chao
>
> > + notifier->put_owner(owner);
> > +#endif
> > +}
> > +
On Thu, Dec 23, 2021 at 05:35:37PM +0000, Sean Christopherson wrote:
> On Thu, Dec 23, 2021, Chao Peng wrote:
>
> > + struct file *file;
>
> Please use more descriptive names, shaving characters is not at all priority.
>
> > + u64 ofs;
>
> I believe this should be loff_t.
>
> struct file *private_file;
> struct loff_t private_offset;
>
> > };
> >
> > +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> > +{
> > + if (slot && (slot->flags & KVM_MEM_PRIVATE))
> > + return true;
> > + return false;
>
> return slot && (slot->flags & KVM_MEM_PRIVATE);
>
> > +}
> > +
> > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > {
> > return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 1daa45268de2..41434322fa23 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -103,6 +103,17 @@ struct kvm_userspace_memory_region {
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> > };
> >
> > +struct kvm_userspace_memory_region_ext {
> > + __u32 slot;
> > + __u32 flags;
> > + __u64 guest_phys_addr;
> > + __u64 memory_size; /* bytes */
> > + __u64 userspace_addr; /* hva */
>
> Would it make sense to embed "struct kvm_userspace_memory_region"?
>
> > + __u64 ofs; /* offset into fd */
> > + __u32 fd;
>
> Again, use descriptive names, then comments like "offset into fd" are unnecessary.
>
> __u64 private_offset;
> __u32 private_fd;
My original thought is that the same fields might be used for a shared
memslot as well in future (e.g. there may be another KVM_MEM_* bit that
can reuse the same fields for a shared slot), so non-private-specific
names may sound better. But I definitely have no objection and can use
private_* names for the next version unless there is another objection.
Thanks,
Chao
>
> > + __u32 padding[5];
> > +};
> > +
> > /*
> > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > * other bits are reserved for kvm internal use which are defined in
> > @@ -110,6 +121,7 @@ struct kvm_userspace_memory_region {
> > */
> > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > #define KVM_MEM_READONLY (1UL << 1)
> > +#define KVM_MEM_PRIVATE (1UL << 2)
> >
> > /* for KVM_IRQ_LINE */
> > struct kvm_irq_level {
> > --
> > 2.17.1
> >
On Thu, Dec 23, 2021 at 08:30:09PM +0800, Chao Peng wrote:
> When a page fault from the secondary page table while the guest is
> running happens in a memslot with KVM_MEM_PRIVATE, we need go
> different paths for private access and shared access.
>
> - For private access, KVM checks if the page is already allocated in
> the memory backend, if yes KVM establishes the mapping, otherwise
> exits to userspace to convert a shared page to private one.
>
Will this conversion be atomic or not?
For example, after punching a hole in a private memory slot, will KVM
see two notifications: one for invalidation of the whole private memory
slot, and one for fallocate of the remaining ranges besides the hole?
Or will KVM only see one invalidation notification for the hole?
Could you please show the QEMU code for this conversion?
> - For shared access, KVM also checks if the page is already allocated
> in the memory backend, if yes then exit to userspace to convert a
> private page to shared one, otherwise it's treated as a traditional
> hva-based shared memory, KVM lets existing code to obtain a pfn with
> get_user_pages() and establish the mapping.
>
> The above code assume private memory is persistent and pre-allocated in
> the memory backend so KVM can use this information as an indicator for
> a page is private or shared. The above check is then performed by
> calling kvm_memfd_get_pfn() which currently is implemented as a
> pagecache search but in theory that can be implemented differently
> (i.e. when the page is even not mapped into host pagecache there should
> be some different implementation).
>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 73 ++++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
> 2 files changed, 77 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2856eb662a21..fbcdf62f8281 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2920,6 +2920,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> if (max_level == PG_LEVEL_4K)
> return PG_LEVEL_4K;
>
> + if (kvm_slot_is_private(slot))
> + return max_level;
> +
> host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> return min(host_level, max_level);
> }
> @@ -3950,7 +3953,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> }
>
> -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> +static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> + /*
> + * At this time private gfn has not been supported yet. Other patch
> + * that enables it should change this.
> + */
> + return false;
> +}
> +
> +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault,
> + bool *is_private_pfn, int *r)
> +{
> + int order;
> + int mem_convert_type;
> + struct kvm_memory_slot *slot = fault->slot;
> + long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
For private memory slots, it's possible to have pfns backed by
backends other than memfd, e.g. devicefd. So is it possible to let those
private memslots keep private and use the traditional hva-based way?
Reasons below:
1. only memfd is supported in this patch set.
2. qemu/host read/write to those private memslots backed by devicefd may
not cause a machine check.
Thanks
Yan
> +
> + if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) {
> + if (pfn < 0)
> + mem_convert_type = KVM_EXIT_MEM_MAP_PRIVATE;
> + else {
> + fault->pfn = pfn;
> + if (slot->flags & KVM_MEM_READONLY)
> + fault->map_writable = false;
> + else
> + fault->map_writable = true;
> +
> + if (order == 0)
> + fault->max_level = PG_LEVEL_4K;
> + *is_private_pfn = true;
> + *r = RET_PF_FIXED;
> + return true;
> + }
> + } else {
> + if (pfn < 0)
> + return false;
> +
> + kvm_memfd_put_pfn(pfn);
> + mem_convert_type = KVM_EXIT_MEM_MAP_SHARED;
> + }
> +
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
> + vcpu->run->mem.type = mem_convert_type;
> + vcpu->run->mem.u.map.gpa = fault->gfn << PAGE_SHIFT;
> + vcpu->run->mem.u.map.size = PAGE_SIZE;
> + fault->pfn = -1;
> + *r = -1;
> + return true;
> +}
> +
> +static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> + bool *is_private_pfn, int *r)
> {
> struct kvm_memory_slot *slot = fault->slot;
> bool async;
> @@ -3984,6 +4039,10 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> }
> }
>
> + if (kvm_slot_is_private(slot) &&
> + kvm_faultin_pfn_private(vcpu, fault, is_private_pfn, r))
> + return *r == RET_PF_FIXED ? false : true;
> +
> async = false;
> fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> fault->write, &fault->map_writable,
> @@ -4044,6 +4103,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
>
> unsigned long mmu_seq;
> + bool is_private_pfn = false;
> int r;
>
> fault->gfn = fault->addr >> PAGE_SHIFT;
> @@ -4063,7 +4123,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> mmu_seq = vcpu->kvm->mmu_notifier_seq;
> smp_rmb();
>
> - if (kvm_faultin_pfn(vcpu, fault, &r))
> + if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
> return r;
>
> if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r))
> @@ -4076,7 +4136,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> else
> write_lock(&vcpu->kvm->mmu_lock);
>
> - if (is_page_fault_stale(vcpu, fault, mmu_seq))
> + if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
> goto out_unlock;
>
> r = make_mmu_pages_available(vcpu);
> @@ -4093,7 +4153,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> read_unlock(&vcpu->kvm->mmu_lock);
> else
> write_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);
> +
> + if (is_private_pfn)
> + kvm_memfd_put_pfn(fault->pfn);
> + else
> + kvm_release_pfn_clean(fault->pfn);
> +
> return r;
> }
>
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 5b5bdac97c7b..640fd1e2fe4c 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -825,6 +825,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> int r;
> unsigned long mmu_seq;
> bool is_self_change_mapping;
> + bool is_private_pfn = false;
> +
>
> pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code);
> WARN_ON_ONCE(fault->is_tdp);
> @@ -873,7 +875,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> mmu_seq = vcpu->kvm->mmu_notifier_seq;
> smp_rmb();
>
> - if (kvm_faultin_pfn(vcpu, fault, &r))
> + if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
> return r;
>
> if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r))
> @@ -901,7 +903,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> r = RET_PF_RETRY;
> write_lock(&vcpu->kvm->mmu_lock);
>
> - if (is_page_fault_stale(vcpu, fault, mmu_seq))
> + if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
> goto out_unlock;
>
> kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
> @@ -913,7 +915,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>
> out_unlock:
> write_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);
> + if (is_private_pfn)
> + kvm_memfd_put_pfn(fault->pfn);
> + else
> + kvm_release_pfn_clean(fault->pfn);
> return r;
> }
>
> --
> 2.17.1
>
>
On Tue, Jan 04, 2022 at 09:46:35AM +0800, Yan Zhao wrote:
> On Thu, Dec 23, 2021 at 08:30:09PM +0800, Chao Peng wrote:
> > When a page fault from the secondary page table while the guest is
> > running happens in a memslot with KVM_MEM_PRIVATE, we need go
> > different paths for private access and shared access.
> >
> > - For private access, KVM checks if the page is already allocated in
> > the memory backend, if yes KVM establishes the mapping, otherwise
> > exits to userspace to convert a shared page to private one.
> >
> will this conversion be atomical or not?
> For example, after punching a hole in a private memory slot, will KVM
> see two notifications: one for invalidation of the whole private memory
> slot, and one for fallocate of the rest ranges besides the hole?
> Or, KVM only sees one invalidation notification for the hole?
Punching a hole doesn't need to invalidate the whole memory slot. It only
sends one invalidation notification to KVM for the 'hole' part.
Taking shared-to-private conversion as an example, it only invalidates the
'hole' part (which is usually only a portion of the whole memory) on the
shared fd, and then fallocates the private memory in the private fd at
the 'hole'. The KVM invalidation notification happens when the shared
hole gets invalidated. The establishment of the private mapping happens
in subsequent KVM page fault handlers.
> Could you please show QEMU code about this conversion?
See below for the QEMU side conversion code. The above described
invalidation and fallocation will be two steps in this conversion. If an
error happens in the middle then this error will be propagated to
kvm_run to do the proper action (e.g. maybe kill the guest?).
int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
bool shared_to_private)
{
int ret;
int fd_from, fd_to;
if (!rb || rb->private_fd <= 0) {
return -1;
}
if (!QEMU_PTR_IS_ALIGNED(start, rb->page_size) ||
!QEMU_PTR_IS_ALIGNED(length, rb->page_size)) {
return -1;
}
if (length > rb->max_length) {
return -1;
}
if (shared_to_private) {
fd_from = rb->fd;
fd_to = rb->private_fd;
} else {
fd_from = rb->private_fd;
fd_to = rb->fd;
}
ret = ram_block_discard_range_fd(rb, start, length, fd_from);
if (ret) {
return ret;
}
if (fd_to > 0) {
return fallocate(fd_to, 0, start, length);
}
return 0;
}
>
>
> > - For shared access, KVM also checks if the page is already allocated
> > in the memory backend, if yes then exit to userspace to convert a
> > private page to shared one, otherwise it's treated as a traditional
> > hva-based shared memory, KVM lets existing code to obtain a pfn with
> > get_user_pages() and establish the mapping.
> >
> > The above code assume private memory is persistent and pre-allocated in
> > the memory backend so KVM can use this information as an indicator for
> > a page is private or shared. The above check is then performed by
> > calling kvm_memfd_get_pfn() which currently is implemented as a
> > pagecache search but in theory that can be implemented differently
> > (i.e. when the page is even not mapped into host pagecache there should
> > be some different implementation).
> >
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 73 ++++++++++++++++++++++++++++++++--
> > arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
> > 2 files changed, 77 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2856eb662a21..fbcdf62f8281 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2920,6 +2920,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > if (max_level == PG_LEVEL_4K)
> > return PG_LEVEL_4K;
> >
> > + if (kvm_slot_is_private(slot))
> > + return max_level;
> > +
> > host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> > return min(host_level, max_level);
> > }
> > @@ -3950,7 +3953,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> > }
> >
> > -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> > +static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > + /*
> > + * At this time private gfn has not been supported yet. Other patch
> > + * that enables it should change this.
> > + */
> > + return false;
> > +}
> > +
> > +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault,
> > + bool *is_private_pfn, int *r)
> > +{
> > + int order;
> > + int mem_convert_type;
> > + struct kvm_memory_slot *slot = fault->slot;
> > + long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
> For private memory slots, it's possible to have pfns backed by
> backends other than memfd, e.g. devicefd.
Surely yes. Although this patch only supports memfd, it's designed
to be extensible to support memory backing stores other than memfd. There
is one assumption in this design however: one private memslot can be
backed by only one type of such memory backing store, e.g. if the
devicefd you mentioned can independently provide memory for a memslot
then that's no issue.
>So is it possible to let those
> private memslots keep private and use traditional hva-based way?
Typically this fd-based private memory uses the 'offset' as the
userspace address to get a pfn from the backing store fd. But I believe
the current code does not prevent you from using the hva as the
userspace address, as long as your memory backing store understands that
address and can provide the pfn based on it. But since you already have
the hva, you probably already mmap-ed the fd to userspace, so it seems
this private memory patch cannot protect you. Probably I didn't quite
understand the 'keep private' you mentioned here.
Thanks,
Chao
> Reasons below:
> 1. only memfd is supported in this patch set.
> 2. qemu/host read/write to those private memslots backing up by devicefd may
> not cause machine check.
>
> Thanks
> Yan
>
On Tue, Jan 04, 2022 at 05:10:08PM +0800, Chao Peng wrote:
> On Tue, Jan 04, 2022 at 09:46:35AM +0800, Yan Zhao wrote:
> > On Thu, Dec 23, 2021 at 08:30:09PM +0800, Chao Peng wrote:
> > > When a page fault from the secondary page table while the guest is
> > > running happens in a memslot with KVM_MEM_PRIVATE, we need go
> > > different paths for private access and shared access.
> > >
> > > - For private access, KVM checks if the page is already allocated in
> > > the memory backend, if yes KVM establishes the mapping, otherwise
> > > exits to userspace to convert a shared page to private one.
> > >
> > will this conversion be atomical or not?
> > For example, after punching a hole in a private memory slot, will KVM
> > see two notifications: one for invalidation of the whole private memory
> > slot, and one for fallocate of the rest ranges besides the hole?
> > Or, KVM only sees one invalidation notification for the hole?
>
> Punching hole doesn't need to invalidate the whole memory slot. It only
> send one invalidation notification to KVM for the 'hole' part.
good :)
>
> Taking shared-to-private conversion as example it only invalidates the
> 'hole' part (that usually only the portion of the whole memory) on the
> shared fd,, and then fallocate the private memory in the private fd at
> the 'hole'. The KVM invalidation notification happens when the shared
> hole gets invalidated. The establishment of the private mapping happens
> at subsequent KVM page fault handlers.
>
> > Could you please show QEMU code about this conversion?
>
> See below for the QEMU side conversion code. The above described
> invalidation and fallocation will be two steps in this conversion. If
> error happens in the middle then this error will be propagated to
> kvm_run to do the proper action (e.g. may kill the guest?).
>
> int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
> bool shared_to_private)
> {
> int ret;
> int fd_from, fd_to;
>
> if (!rb || rb->private_fd <= 0) {
> return -1;
> }
>
> if (!QEMU_PTR_IS_ALIGNED(start, rb->page_size) ||
> !QEMU_PTR_IS_ALIGNED(length, rb->page_size)) {
> return -1;
> }
>
> if (length > rb->max_length) {
> return -1;
> }
>
> if (shared_to_private) {
> fd_from = rb->fd;
> fd_to = rb->private_fd;
> } else {
> fd_from = rb->private_fd;
> fd_to = rb->fd;
> }
>
> ret = ram_block_discard_range_fd(rb, start, length, fd_from);
> if (ret) {
> return ret;
> }
>
> if (fd_to > 0) {
> return fallocate(fd_to, 0, start, length);
> }
>
> return 0;
> }
>
Thanks. So QEMU will re-generate memslots and set KVM_MEM_PRIVATE
accordingly? Will it involve slot deletion and creation?
> >
> >
> > > - For shared access, KVM also checks if the page is already allocated
> > > in the memory backend, if yes then exit to userspace to convert a
> > > private page to shared one, otherwise it's treated as a traditional
> > > hva-based shared memory, KVM lets existing code to obtain a pfn with
> > > get_user_pages() and establish the mapping.
> > >
> > > The above code assume private memory is persistent and pre-allocated in
> > > the memory backend so KVM can use this information as an indicator for
> > > a page is private or shared. The above check is then performed by
> > > calling kvm_memfd_get_pfn() which currently is implemented as a
> > > pagecache search but in theory that can be implemented differently
> > > (i.e. when the page is even not mapped into host pagecache there should
> > > be some different implementation).
> > >
> > > Signed-off-by: Yu Zhang <[email protected]>
> > > Signed-off-by: Chao Peng <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 73 ++++++++++++++++++++++++++++++++--
> > > arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
> > > 2 files changed, 77 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 2856eb662a21..fbcdf62f8281 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -2920,6 +2920,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > if (max_level == PG_LEVEL_4K)
> > > return PG_LEVEL_4K;
> > >
> > > + if (kvm_slot_is_private(slot))
> > > + return max_level;
> > > +
> > > host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> > > return min(host_level, max_level);
> > > }
> > > @@ -3950,7 +3953,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> > > }
> > >
> > > -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> > > +static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > > +{
> > > + /*
> > > + * At this time private gfn has not been supported yet. Other patch
> > > + * that enables it should change this.
> > > + */
> > > + return false;
> > > +}
> > > +
> > > +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > + struct kvm_page_fault *fault,
> > > + bool *is_private_pfn, int *r)
> > > +{
> > > + int order;
> > > + int mem_convert_type;
> > > + struct kvm_memory_slot *slot = fault->slot;
> > > + long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
> > For private memory slots, it's possible to have pfns backed by
> > backends other than memfd, e.g. devicefd.
>
> Surely yes, although this patch only supports memfd, but it's designed
> to be extensible to support other memory backing stores than memfd. There
> is one assumption in this design however: one private memslot can be
> backed by only one type of such memory backing store, e.g. if the
> devicefd you mentioned can independently provide memory for a memslot
> then that's no issue.
>
> >So is it possible to let those
> > private memslots keep private and use traditional hva-based way?
>
> Typically this fd-based private memory uses the 'offset' as the
> userspace address to get a pfn from the backing store fd. But I believe
> the current code does not prevent you from using the hva as the
By hva-based way, I mean mmap is required for this fd.
> userspace address, as long as your memory backing store understand that
> address and can provide the pfn basing on it. But since you already have
> the hva, you probably already mmap-ed the fd to userspace, that seems
> not this private memory patch can protect you. Probably I didn't quite
Yes, for this fd, though mapped in a private memslot, there's no need to
prevent QEMU/host from accessing it as it will not cause a severe machine
check.
> understand 'keep private' you mentioned here.
'keep private' means allowing this kind of private memslot, which does not
require protection from this private memory patch :)
Thanks
Yan
> > Reasons below:
> > 1. only memfd is supported in this patch set.
> > 2. qemu/host read/write to those private memslots backing up by devicefd may
> > not cause machine check.
> >
On 23.12.21 13:29, Chao Peng wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> the file is inaccessible from userspace in any possible ways like
> read(),write() or mmap() etc.
>
> It provides semantics required for KVM guest private memory support
> that a file descriptor with this seal set is going to be used as the
> source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV but may not be accessible from host userspace.
>
> At this time only shmem implements this seal.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/uapi/linux/fcntl.h | 1 +
> mm/shmem.c | 37 +++++++++++++++++++++++++++++++++++--
> 2 files changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 2f86b2ad6d7e..e2bad051936f 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -43,6 +43,7 @@
> #define F_SEAL_GROW 0x0004 /* prevent file from growing */
> #define F_SEAL_WRITE 0x0008 /* prevent writes */
> #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
> +#define F_SEAL_INACCESSIBLE 0x0020 /* prevent file from accessing */
I think this needs more clarification: the file content can still be
accessed using in-kernel mechanisms such as MEMFD_OPS for KVM. It
effectively disallows traditional access to a file (read/write/mmap)
that will result in ordinary MMU access to file content.
Not sure how to best clarify that: maybe, prevent ordinary MMU access
(e.g., read/write/mmap) to file content?
> /* (1U << 31) is reserved for signed error codes */
>
> /*
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 18f93c2d68f1..faa7e9b1b9bc 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1098,6 +1098,10 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
> (newsize > oldsize && (info->seals & F_SEAL_GROW)))
> return -EPERM;
>
> + if ((info->seals & F_SEAL_INACCESSIBLE) &&
> + (newsize & ~PAGE_MASK))
> + return -EINVAL;
> +
What happens when sealing and there are existing mmaps?
--
Thanks,
David / dhildenb
On Fri, Dec 31, 2021, Chao Peng wrote:
> On Fri, Dec 24, 2021 at 12:13:51PM +0800, Chao Peng wrote:
> > On Thu, Dec 23, 2021 at 06:06:19PM +0000, Sean Christopherson wrote:
> > > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > > This new function establishes the mapping in KVM page tables for a
> > > > given gfn range. It can be used in the memory fallocate callback for
> > > > memfd based memory to establish the mapping for KVM secondary MMU when
> > > > the pages are allocated in the memory backend.
> > >
> > > NAK, under no circumstance should KVM install SPTEs in response to allocating
> > > memory in a file. The correct thing to do is to invalidate the gfn range
> > > associated with the newly mapped range, i.e. wipe out any shared SPTEs associated
> > > with the memslot.
> >
> > Right, thanks.
>
> BTW, I think the current fallocate() callback is just useless as long as
> we don't want to install KVM SPTEs in response to allocating memory in a
> file. The invalidation of the shared SPTEs should be notified through
> mmu_notifier of the shared memory backend, not memfd_notifier of the
> private memory backend.
No, because the private fd is the final source of truth as to whether or not a
GPA is private, e.g. userspace may choose to not unmap the shared backing.
KVM's rule per Paolo's/this proposal is that a GPA is private if it has a private
memslot and is present in the private backing store. And the other core rule is
that KVM must never map both the private and shared variants of a GPA into the
guest.
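To illustrate those two rules, a minimal sketch (the wrapper name below is
invented; it only reuses kvm_slot_is_private() and
kvm_memfd_get_pfn()/kvm_memfd_put_pfn() from this series):

static bool kvm_gfn_is_private(struct kvm_memory_slot *slot, gfn_t gfn)
{
        int order;
        long pfn;

        /* Rule 1: only a KVM_MEM_PRIVATE memslot can hold private GPAs. */
        if (!kvm_slot_is_private(slot))
                return false;

        /*
         * Rule 2: within such a memslot, a GPA is private iff it is
         * currently allocated in the private backing store (private fd).
         */
        pfn = kvm_memfd_get_pfn(slot, gfn, &order);
        if (pfn < 0)
                return false;

        kvm_memfd_put_pfn(pfn);
        return true;
}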
On Fri, Dec 31, 2021, Chao Peng wrote:
> On Thu, Dec 23, 2021 at 05:35:37PM +0000, Sean Christopherson wrote:
> > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index 1daa45268de2..41434322fa23 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -103,6 +103,17 @@ struct kvm_userspace_memory_region {
> > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > };
> > >
> > > +struct kvm_userspace_memory_region_ext {
> > > + __u32 slot;
> > > + __u32 flags;
> > > + __u64 guest_phys_addr;
> > > + __u64 memory_size; /* bytes */
> > > + __u64 userspace_addr; /* hva */
> >
> > Would it make sense to embed "struct kvm_userspace_memory_region"?
> >
> > > + __u64 ofs; /* offset into fd */
> > > + __u32 fd;
> >
> > Again, use descriptive names, then comments like "offset into fd" are unnecessary.
> >
> > __u64 private_offset;
> > __u32 private_fd;
>
> My original thought is the same fields might be used for shared memslot
> as well in future (e.g. there may be another KVM_MEM_* bit can reuse the
> same fields for shared slot) so non private-specific name may sound
> better. But definitely I have no objection and can use private_* names
> for next version unless there is other objection.
If that does happen, it's easy enough to wrap them in a union.
On Fri, Dec 31, 2021, Chao Peng wrote:
> On Fri, Dec 24, 2021 at 11:53:15AM +0800, Robert Hoo wrote:
> > On Thu, 2021-12-23 at 20:29 +0800, Chao Peng wrote:
> > > From: "Kirill A. Shutemov" <[email protected]>
> > >
> > > +static void notify_fallocate(struct inode *inode, pgoff_t start,
> > > pgoff_t end)
> > > +{
> > > +#ifdef CONFIG_MEMFD_OPS
> > > + struct shmem_inode_info *info = SHMEM_I(inode);
> > > + const struct memfd_falloc_notifier *notifier;
> > > + void *owner;
> > > + bool ret;
> > > +
> > > + if (!info->falloc_notifier)
> > > + return;
> > > +
> > > + spin_lock(&info->lock);
> > > + notifier = info->falloc_notifier;
> > > + if (!notifier) {
> > > + spin_unlock(&info->lock);
> > > + return;
> > > + }
> > > +
> > > + owner = info->owner;
> > > + ret = notifier->get_owner(owner);
> > > + spin_unlock(&info->lock);
> > > + if (!ret)
> > > + return;
> > > +
> > > + notifier->fallocate(inode, owner, start, end);
> >
> > I see notifier->fallocate(), i.e. memfd_fallocate(), discards
> > kvm_memfd_fallocate_range()'s return value. Should it be checked?
>
> I think we can ignore it, just like how current mmu_notifier does,
> the return value of __kvm_handle_hva_range is discarded in
> kvm_mmu_notifier_invalidate_range_start(). Even when KVM side failed,
> it's not fatal, it should not block the operation in the primary MMU.
If the return value is ignored, it'd be better to have no return value at all so
that it's clear fallocate() will continue on regardless of whether or not the
secondary MMU callback succeeds. E.g. if KVM can't handle the fallocate() for
whatever reason, then knowing that fallocate() will continue on means KVM should
mark the VM as dead so that the broken setup cannot be abused by userspace.
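E.g. a sketch of what that could look like on the KVM side (the
kvm_memfd_fallocate_range() arguments are assumed here, and
kvm_vm_bugged()/KVM_BUG_ON() is the existing mechanism for marking a VM
dead):

/* The fallocate notifier callback on the KVM side, now returning void. */
static void memfd_fallocate(struct inode *inode, void *owner,
                            pgoff_t start, pgoff_t end)
{
        /* owner is whatever KVM registered; assumed to be the struct kvm. */
        struct kvm *kvm = owner;

        /*
         * shmem's fallocate() continues no matter what, so if KVM cannot
         * invalidate the affected range it should kill the VM rather than
         * let a broken setup keep running.
         */
        if (kvm_memfd_fallocate_range(kvm, inode, start, end)) /* signature assumed */
                kvm_vm_bugged(kvm);
}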
On Fri, Dec 31, 2021, Chao Peng wrote:
> On Tue, Dec 28, 2021 at 09:48:08PM +0000, Sean Christopherson wrote:
> >KVM handles
> > reverse engineering the memslot to get the offset and whatever else it needs.
> > notify_fallocate() and other callbacks are unchanged, though they probably can
> > drop the inode.
> >
> > E.g. likely with bad math and handwaving on the overlap detection:
> >
> > int kvm_private_fd_fallocate_range(void *owner, pgoff_t start, pgoff_t end)
> > {
> > struct kvm_memory_slot *slot = owner;
> > struct kvm_gfn_range gfn_range = {
> > .slot = slot,
> > .start = (start - slot->private_offset) >> PAGE_SHIFT,
> > .end = (end - slot->private_offset) >> PAGE_SHIFT,
> > .may_block = true,
> > };
> >
> > if (!has_overlap(slot, start, end))
> > return 0;
> >
> > gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> >
> > kvm_unmap_gfn_range(slot->kvm, &gfn_range);
> > return 0;
> > }
>
> I understand this KVM side handling, but again one fd can have multiple
> memslots. How shmem decides to notify which memslot from a list of
> memslots when it invokes the notify_fallocate()? Or just notify all
> the possible memslots then let KVM to check?
Heh, yeah, those are the two choices. :-)
Either the backing store needs to support registering callbacks for specific,
arbitrary ranges, or it needs to invoke all registered callbacks. Invoking all
callbacks has my vote; it's much simpler to implement and is unlikely to incur
meaningful overhead. _Something_ has to find the overlapping ranges, that cost
doesn't magically go away if it's pushed into the backing store.
Note, invoking all notifiers is also aligned with the mmu_notifier behavior.
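E.g. a rough sketch of the "invoke everything" option on the shmem side,
assuming the single info->falloc_notifier pointer becomes a list (names
invented, locking and the get_owner()/put_owner() refcounting elided);
each callback then does its own overlap filtering, just like the
kvm_private_fd_fallocate_range() sketch above:

struct memfd_notifier_entry {
        struct list_head list;
        const struct memfd_falloc_notifier *notifier;
        void *owner;                    /* e.g. the kvm_memory_slot */
};

static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
{
        struct shmem_inode_info *info = SHMEM_I(inode);
        struct memfd_notifier_entry *entry;

        /* Blindly notify every registered user; range checks are theirs. */
        list_for_each_entry(entry, &info->falloc_notifiers, list)
                entry->notifier->fallocate(inode, entry->owner, start, end);
}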
On Tue, Jan 04, 2022 at 05:38:38PM +0000, Sean Christopherson wrote:
> On Fri, Dec 31, 2021, Chao Peng wrote:
> > On Fri, Dec 24, 2021 at 11:53:15AM +0800, Robert Hoo wrote:
> > > On Thu, 2021-12-23 at 20:29 +0800, Chao Peng wrote:
> > > > From: "Kirill A. Shutemov" <[email protected]>
> > > >
> > > > +static void notify_fallocate(struct inode *inode, pgoff_t start,
> > > > pgoff_t end)
> > > > +{
> > > > +#ifdef CONFIG_MEMFD_OPS
> > > > + struct shmem_inode_info *info = SHMEM_I(inode);
> > > > + const struct memfd_falloc_notifier *notifier;
> > > > + void *owner;
> > > > + bool ret;
> > > > +
> > > > + if (!info->falloc_notifier)
> > > > + return;
> > > > +
> > > > + spin_lock(&info->lock);
> > > > + notifier = info->falloc_notifier;
> > > > + if (!notifier) {
> > > > + spin_unlock(&info->lock);
> > > > + return;
> > > > + }
> > > > +
> > > > + owner = info->owner;
> > > > + ret = notifier->get_owner(owner);
> > > > + spin_unlock(&info->lock);
> > > > + if (!ret)
> > > > + return;
> > > > +
> > > > + notifier->fallocate(inode, owner, start, end);
> > >
> > > I see notifier->fallocate(), i.e. memfd_fallocate(), discards
> > > kvm_memfd_fallocate_range()'s return value. Should it be checked?
> >
> > I think we can ignore it, just like how current mmu_notifier does,
> > the return value of __kvm_handle_hva_range is discarded in
> > kvm_mmu_notifier_invalidate_range_start(). Even when KVM side failed,
> > it's not fatal, it should not block the operation in the primary MMU.
>
> If the return value is ignored, it'd be better to have no return value at all so
> that it's clear fallocate() will continue on regardless of whether or not the
> secondary MMU callback succeeds. E.g. if KVM can't handle the fallocate() for
> whatever reason, then knowing that fallocate() will continue on means KVM should
> mark the VM as dead so that the broken setup cannot be abused by userspace.
After a close look, kvm_unmap_gfn_range() actually does not return an
error code, so it's safe for kvm_memfd_handle_range() to not return one.
Chao
On Tue, Jan 04, 2022 at 05:43:50PM +0000, Sean Christopherson wrote:
> On Fri, Dec 31, 2021, Chao Peng wrote:
> > On Tue, Dec 28, 2021 at 09:48:08PM +0000, Sean Christopherson wrote:
> > >KVM handles
> > > reverse engineering the memslot to get the offset and whatever else it needs.
> > > notify_fallocate() and other callbacks are unchanged, though they probably can
> > > drop the inode.
> > >
> > > E.g. likely with bad math and handwaving on the overlap detection:
> > >
> > > int kvm_private_fd_fallocate_range(void *owner, pgoff_t start, pgoff_t end)
> > > {
> > > struct kvm_memory_slot *slot = owner;
> > > struct kvm_gfn_range gfn_range = {
> > > .slot = slot,
> > > .start = (start - slot->private_offset) >> PAGE_SHIFT,
> > > .end = (end - slot->private_offset) >> PAGE_SHIFT,
> > > .may_block = true,
> > > };
> > >
> > > if (!has_overlap(slot, start, end))
> > > return 0;
> > >
> > > gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> > >
> > > kvm_unmap_gfn_range(slot->kvm, &gfn_range);
> > > return 0;
> > > }
> >
> > I understand this KVM side handling, but again one fd can have multiple
> > memslots. How shmem decides to notify which memslot from a list of
> > memslots when it invokes the notify_fallocate()? Or just notify all
> > the possible memslots then let KVM to check?
>
> Heh, yeah, those are the two choices. :-)
>
> Either the backing store needs to support registering callbacks for specific,
> arbitrary ranges, or it needs to invoke all registered callbacks. Invoking all
> callbacks has my vote; it's much simpler to implement and is unlikely to incur
> meaningful overhead. _Something_ has to find the overlapping ranges, that cost
> doesn't magically go away if it's pushed into the backing store.
>
> Note, invoking all notifiers is also aligned with the mmu_notifier behavior.
Sounds like a good reason. Then the shmem side only needs to maintain a
list of users.
Chao
On Tue, Jan 04, 2022 at 05:31:30PM +0000, Sean Christopherson wrote:
> On Fri, Dec 31, 2021, Chao Peng wrote:
> > On Fri, Dec 24, 2021 at 12:13:51PM +0800, Chao Peng wrote:
> > > On Thu, Dec 23, 2021 at 06:06:19PM +0000, Sean Christopherson wrote:
> > > > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > > > This new function establishes the mapping in KVM page tables for a
> > > > > given gfn range. It can be used in the memory fallocate callback for
> > > > > memfd based memory to establish the mapping for KVM secondary MMU when
> > > > > the pages are allocated in the memory backend.
> > > >
> > > > NAK, under no circumstance should KVM install SPTEs in response to allocating
> > > > memory in a file. The correct thing to do is to invalidate the gfn range
> > > > associated with the newly mapped range, i.e. wipe out any shared SPTEs associated
> > > > with the memslot.
> > >
> > > Right, thanks.
> >
> > BTW, I think the current fallocate() callback is just useless as long as
> > we don't want to install KVM SPTEs in response to allocating memory in a
> > file. The invalidation of the shared SPTEs should be notified through
> > mmu_notifier of the shared memory backend, not memfd_notifier of the
> > private memory backend.
>
> No, because the private fd is the final source of truth as to whether or not a
> GPA is private, e.g. userspace may choose to not unmap the shared backing.
> KVM's rule per Paolo's/this proposoal is that a GPA is private if it has a private
> memslot and is present in the private backing store. And the other core rule is
> that KVM must never map both the private and shared variants of a GPA into the
> guest.
That's true, but I'm wondering if zapping the shared variant can be
handled at the time when the private one gets mapped in the KVM page
fault, without bothering the backing store to provide a dedicated
callback to tell KVM.
Chao
On Tue, Jan 04, 2022 at 06:06:12PM +0800, Yan Zhao wrote:
> On Tue, Jan 04, 2022 at 05:10:08PM +0800, Chao Peng wrote:
> > On Tue, Jan 04, 2022 at 09:46:35AM +0800, Yan Zhao wrote:
> > > On Thu, Dec 23, 2021 at 08:30:09PM +0800, Chao Peng wrote:
> > > > When a page fault from the secondary page table while the guest is
> > > > running happens in a memslot with KVM_MEM_PRIVATE, we need go
> > > > different paths for private access and shared access.
> > > >
> > > > - For private access, KVM checks if the page is already allocated in
> > > > the memory backend, if yes KVM establishes the mapping, otherwise
> > > > exits to userspace to convert a shared page to private one.
> > > >
> > > will this conversion be atomical or not?
> > > For example, after punching a hole in a private memory slot, will KVM
> > > see two notifications: one for invalidation of the whole private memory
> > > slot, and one for fallocate of the rest ranges besides the hole?
> > > Or, KVM only sees one invalidation notification for the hole?
> >
> > Punching hole doesn't need to invalidate the whole memory slot. It only
> > send one invalidation notification to KVM for the 'hole' part.
> good :)
>
> >
> > Taking shared-to-private conversion as example it only invalidates the
> > 'hole' part (that usually only the portion of the whole memory) on the
> > shared fd,, and then fallocate the private memory in the private fd at
> > the 'hole'. The KVM invalidation notification happens when the shared
> > hole gets invalidated. The establishment of the private mapping happens
> > at subsequent KVM page fault handlers.
> >
> > > Could you please show QEMU code about this conversion?
> >
> > See below for the QEMU side conversion code. The above described
> > invalidation and fallocation will be two steps in this conversion. If
> > error happens in the middle then this error will be propagated to
> > kvm_run to do the proper action (e.g. may kill the guest?).
> >
> > int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
> > bool shared_to_private)
> > {
> > int ret;
> > int fd_from, fd_to;
> >
> > if (!rb || rb->private_fd <= 0) {
> > return -1;
> > }
> >
> > if (!QEMU_PTR_IS_ALIGNED(start, rb->page_size) ||
> > !QEMU_PTR_IS_ALIGNED(length, rb->page_size)) {
> > return -1;
> > }
> >
> > if (length > rb->max_length) {
> > return -1;
> > }
> >
> > if (shared_to_private) {
> > fd_from = rb->fd;
> > fd_to = rb->private_fd;
> > } else {
> > fd_from = rb->private_fd;
> > fd_to = rb->fd;
> > }
> >
> > ret = ram_block_discard_range_fd(rb, start, length, fd_from);
> > if (ret) {
> > return ret;
> > }
> >
> > if (fd_to > 0) {
> > return fallocate(fd_to, 0, start, length);
> > }
> >
> > return 0;
> > }
> >
> Thanks. So QEMU will re-generate memslots and set KVM_MEM_PRIVATE
> accordingly? Will it involve slot deletion and create?
KVM will not re-generate memslots when doing the conversion; instead, it
unmaps/maps a range on the same memslot. A memslot with the
KVM_MEM_PRIVATE flag always has two mappings (private/shared) but at a
time only one is effective. What the conversion does is turn off the
existing mapping and turn on the other mapping for the specified range
in that slot.
>
> > >
> > >
> > > > - For shared access, KVM also checks if the page is already allocated
> > > > in the memory backend, if yes then exit to userspace to convert a
> > > > private page to shared one, otherwise it's treated as a traditional
> > > > hva-based shared memory, KVM lets existing code to obtain a pfn with
> > > > get_user_pages() and establish the mapping.
> > > >
> > > > The above code assume private memory is persistent and pre-allocated in
> > > > the memory backend so KVM can use this information as an indicator for
> > > > a page is private or shared. The above check is then performed by
> > > > calling kvm_memfd_get_pfn() which currently is implemented as a
> > > > pagecache search but in theory that can be implemented differently
> > > > (i.e. when the page is even not mapped into host pagecache there should
> > > > be some different implementation).
> > > >
> > > > Signed-off-by: Yu Zhang <[email protected]>
> > > > Signed-off-by: Chao Peng <[email protected]>
> > > > ---
> > > > arch/x86/kvm/mmu/mmu.c | 73 ++++++++++++++++++++++++++++++++--
> > > > arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
> > > > 2 files changed, 77 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 2856eb662a21..fbcdf62f8281 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -2920,6 +2920,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > > if (max_level == PG_LEVEL_4K)
> > > > return PG_LEVEL_4K;
> > > >
> > > > + if (kvm_slot_is_private(slot))
> > > > + return max_level;
> > > > +
> > > > host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> > > > return min(host_level, max_level);
> > > > }
> > > > @@ -3950,7 +3953,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > > kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> > > > }
> > > >
> > > > -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> > > > +static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > > > +{
> > > > + /*
> > > > + * At this time private gfn has not been supported yet. Other patch
> > > > + * that enables it should change this.
> > > > + */
> > > > + return false;
> > > > +}
> > > > +
> > > > +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > > + struct kvm_page_fault *fault,
> > > > + bool *is_private_pfn, int *r)
> > > > +{
> > > > + int order;
> > > > + int mem_convert_type;
> > > > + struct kvm_memory_slot *slot = fault->slot;
> > > > + long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
> > > For private memory slots, it's possible to have pfns backed by
> > > backends other than memfd, e.g. devicefd.
> >
> > Surely yes, although this patch only supports memfd, but it's designed
> > to be extensible to support other memory backing stores than memfd. There
> > is one assumption in this design however: one private memslot can be
> > backed by only one type of such memory backing store, e.g. if the
> > devicefd you mentioned can independently provide memory for a memslot
> > then that's no issue.
> >
> > >So is it possible to let those
> > > private memslots keep private and use traditional hva-based way?
> >
> > Typically this fd-based private memory uses the 'offset' as the
> > userspace address to get a pfn from the backing store fd. But I believe
> > the current code does not prevent you from using the hva as the
> By hva-based way, I mean mmap is required for this fd.
>
> > userspace address, as long as your memory backing store understand that
> > address and can provide the pfn basing on it. But since you already have
> > the hva, you probably already mmap-ed the fd to userspace, that seems
> > not this private memory patch can protect you. Probably I didn't quite
> Yes, for this fd, though mapped in private memslot, there's no need to
> prevent QEMU/host from accessing it as it will not cause the severe machine
> check.
>
> > understand 'keep private' you mentioned here.
> 'keep private' means allow this kind of private memslot which does not
> require protection from this private memory patch :)
Then I think such memory can be the shared part of memory of the
KVM_MEM_PRIVATE memslot. As said above, this is supported from the start :)
Chao
>
>
> Thanks
> Yan
> > > Reasons below:
> > > 1. only memfd is supported in this patch set.
> > > 2. qemu/host read/write to those private memslots backing up by devicefd may
> > > not cause machine check.
> > >
On Wed, Jan 05, 2022 at 02:28:10PM +0800, Chao Peng wrote:
> On Tue, Jan 04, 2022 at 06:06:12PM +0800, Yan Zhao wrote:
> > On Tue, Jan 04, 2022 at 05:10:08PM +0800, Chao Peng wrote:
<...>
> > Thanks. So QEMU will re-generate memslots and set KVM_MEM_PRIVATE
> > accordingly? Will it involve slot deletion and create?
>
> KVM will not re-generate memslots when do the conversion, instead, it
> does unmap/map a range on the same memslot. For memslot with tag
> KVM_MEM_PRIVATE, it always have two mappings (private/shared) but at a
> time only one is effective. What conversion does is to turn off the
> existing mapping and turn on the other mapping for specified range in
> that slot.
>
got it. thanks!
<...>
> > > > > +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > > > + struct kvm_page_fault *fault,
> > > > > + bool *is_private_pfn, int *r)
> > > > > +{
> > > > > + int order;
> > > > > + int mem_convert_type;
> > > > > + struct kvm_memory_slot *slot = fault->slot;
> > > > > + long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
> > > > For private memory slots, it's possible to have pfns backed by
> > > > backends other than memfd, e.g. devicefd.
> > >
> > > Surely yes, although this patch only supports memfd, but it's designed
> > > to be extensible to support other memory backing stores than memfd. There
> > > is one assumption in this design however: one private memslot can be
> > > backed by only one type of such memory backing store, e.g. if the
> > > devicefd you mentioned can independently provide memory for a memslot
> > > then that's no issue.
> > >
> > > >So is it possible to let those
> > > > private memslots keep private and use traditional hva-based way?
> > >
> > > Typically this fd-based private memory uses the 'offset' as the
> > > userspace address to get a pfn from the backing store fd. But I believe
> > > the current code does not prevent you from using the hva as the
> > By hva-based way, I mean mmap is required for this fd.
> >
> > > userspace address, as long as your memory backing store understand that
> > > address and can provide the pfn basing on it. But since you already have
> > > the hva, you probably already mmap-ed the fd to userspace, that seems
> > > not this private memory patch can protect you. Probably I didn't quite
> > Yes, for this fd, though mapped in private memslot, there's no need to
> > prevent QEMU/host from accessing it as it will not cause the severe machine
> > check.
> >
> > > understand 'keep private' you mentioned here.
> > 'keep private' means allow this kind of private memslot which does not
> > require protection from this private memory patch :)
>
> Then I think such memory can be the shared part of memory of the
> KVM_MEM_PRIVATE memslot. As said above, this is initially supported :)
>
Sorry, maybe I didn't express it clearly.
As in kvm_faultin_pfn_private():
static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault,
bool *is_private_pfn, int *r)
{
int order;
int mem_convert_type;
struct kvm_memory_slot *slot = fault->slot;
long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
...
}
Currently, kvm_memfd_get_pfn() is called unconditionally.
However, if the backend of a private memslot is not memfd, and is a device
fd for example, a different xxx_get_pfn() is required here.
Further, though mapped to a private gfn, it might be OK for QEMU to
access the device fd in the hva-based way (or call it the MMU access way,
e.g. read/write/mmap); it's desired that KVM could use the traditional way
to get the pfn without converting the range to a shared one.
pfn = __gfn_to_pfn_memslot(slot, fault->gfn, ...)
|->addr = __gfn_to_hva_many (slot, gfn,...)
| pfn = hva_to_pfn (addr,...)
So, is it possible to recognize such kinds of backends in KVM, and to get
the pfn in the traditional way without converting them to shared?
e.g.
- specify KVM_MEM_PRIVATE_NONPROTECT to memory regions with such kinds
of backends, or
- detect the fd type and check if get_pfn is provided. If not, go the
traditional way (see the sketch below).
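A minimal sketch of that second option (purely illustrative; slot->file is
the private fd's file from this series, while the pfn_ops field and the
get_pfn member name are assumptions, not code from the patch):

static bool kvm_slot_can_get_private_pfn(const struct kvm_memory_slot *slot)
{
        /* pfn_ops would be looked up from the private fd at memslot setup. */
        return slot->file && slot->pfn_ops && slot->pfn_ops->get_pfn;
}

kvm_faultin_pfn() could then take the private path only when this returns
true, and otherwise fall back to the existing __gfn_to_pfn_memslot() path.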
Thanks
Yan
> > > > Reasons below:
> > > > 1. only memfd is supported in this patch set.
> > > > 2. qemu/host read/write to those private memslots backing up by devicefd may
> > > > not cause machine check.
On Wed, Jan 05, 2022, Chao Peng wrote:
> On Tue, Jan 04, 2022 at 05:31:30PM +0000, Sean Christopherson wrote:
> > On Fri, Dec 31, 2021, Chao Peng wrote:
> > > On Fri, Dec 24, 2021 at 12:13:51PM +0800, Chao Peng wrote:
> > > > On Thu, Dec 23, 2021 at 06:06:19PM +0000, Sean Christopherson wrote:
> > > > > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > > > > This new function establishes the mapping in KVM page tables for a
> > > > > > given gfn range. It can be used in the memory fallocate callback for
> > > > > > memfd based memory to establish the mapping for KVM secondary MMU when
> > > > > > the pages are allocated in the memory backend.
> > > > >
> > > > > NAK, under no circumstance should KVM install SPTEs in response to allocating
> > > > > memory in a file. The correct thing to do is to invalidate the gfn range
> > > > > associated with the newly mapped range, i.e. wipe out any shared SPTEs associated
> > > > > with the memslot.
> > > >
> > > > Right, thanks.
> > >
> > > BTW, I think the current fallocate() callback is just useless as long as
> > > we don't want to install KVM SPTEs in response to allocating memory in a
> > > file. The invalidation of the shared SPTEs should be notified through
> > > mmu_notifier of the shared memory backend, not memfd_notifier of the
> > > private memory backend.
> >
> > No, because the private fd is the final source of truth as to whether or not a
> > GPA is private, e.g. userspace may choose to not unmap the shared backing.
> > KVM's rule per Paolo's/this proposoal is that a GPA is private if it has a private
> > memslot and is present in the private backing store. And the other core rule is
> > that KVM must never map both the private and shared variants of a GPA into the
> > guest.
>
> That's true, but I'm wondering if zapping the shared variant can be
> handled at the time when the private one gets mapped in the KVM page
> fault. No bothering the backing store to dedicate a callback to tell
> KVM.
Hmm, I don't think that would work for the TDP MMU due to page faults taking
mmu_lock for read. E.g. if two vCPUs concurrently fault in both the shared and
private variants, a race could exist where the private page fault sees the gfn
as private and the shared page fault sees it as shared. In that case, both faults
will install a SPTE and KVM would end up running with both variants mapped into the
guest.
There's also a performance penalty, as KVM would need to walk the shared EPT tree
on every private page fault.
On Wed, Jan 05, 2022, Yan Zhao wrote:
> Sorry, maybe I didn't express it clearly.
>
> As in the kvm_faultin_pfn_private(),
> static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault,
> bool *is_private_pfn, int *r)
> {
> int order;
> int mem_convert_type;
> struct kvm_memory_slot *slot = fault->slot;
> long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
> ...
> }
> Currently, kvm_memfd_get_pfn() is called unconditionally.
> However, if the backend of a private memslot is not memfd, and is device
> fd for example, a different xxx_get_pfn() is required here.
Ya, I've complained about this in a different thread[*]. This should really be
something like kvm_private_fd_get_pfn(), where the underlying ops struct can point
at any compatible backing store.
https://lore.kernel.org/all/[email protected]/
> Further, though mapped to a private gfn, it might be ok for QEMU to
> access the device fd in hva-based way (or call it MMU access way, e.g.
> read/write/mmap), it's desired that it could use the traditional to get
> pfn without convert the range to a shared one.
No, this is expressly forbidden. The backing store for a private gfn must not
be accessible by userspace. It's possible a backing store could support both, but
not concurrently, and any conversion must be done without KVM being involved.
In other words, resolving a private gfn must either succeed or fail (exit to
userspace), KVM cannot initiate any conversions.
> pfn = __gfn_to_pfn_memslot(slot, fault->gfn, ...)
> |->addr = __gfn_to_hva_many (slot, gfn,...)
> | pfn = hva_to_pfn (addr,...)
>
>
> So, is it possible to recognize such kind of backends in KVM, and to get
> the pfn in traditional way without converting them to shared?
> e.g.
> - specify KVM_MEM_PRIVATE_NONPROTECT to memory regions with such kind
> of backends, or
> - detect the fd type and check if get_pfn is provided. if no, go the
> traditional way.
No, because the whole point of this is to make guest private memory inaccessible
to host userspace. Or did I misinterpret your questions?
On Wed, Jan 05, 2022 at 05:03:23PM +0000, Sean Christopherson wrote:
> On Wed, Jan 05, 2022, Chao Peng wrote:
> > On Tue, Jan 04, 2022 at 05:31:30PM +0000, Sean Christopherson wrote:
> > > On Fri, Dec 31, 2021, Chao Peng wrote:
> > > > On Fri, Dec 24, 2021 at 12:13:51PM +0800, Chao Peng wrote:
> > > > > On Thu, Dec 23, 2021 at 06:06:19PM +0000, Sean Christopherson wrote:
> > > > > > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > > > > > This new function establishes the mapping in KVM page tables for a
> > > > > > > given gfn range. It can be used in the memory fallocate callback for
> > > > > > > memfd based memory to establish the mapping for KVM secondary MMU when
> > > > > > > the pages are allocated in the memory backend.
> > > > > >
> > > > > > NAK, under no circumstance should KVM install SPTEs in response to allocating
> > > > > > memory in a file. The correct thing to do is to invalidate the gfn range
> > > > > > associated with the newly mapped range, i.e. wipe out any shared SPTEs associated
> > > > > > with the memslot.
> > > > >
> > > > > Right, thanks.
> > > >
> > > > BTW, I think the current fallocate() callback is just useless as long as
> > > > we don't want to install KVM SPTEs in response to allocating memory in a
> > > > file. The invalidation of the shared SPTEs should be notified through
> > > > mmu_notifier of the shared memory backend, not memfd_notifier of the
> > > > private memory backend.
> > >
> > > No, because the private fd is the final source of truth as to whether or not a
> > > GPA is private, e.g. userspace may choose to not unmap the shared backing.
> > > KVM's rule per Paolo's/this proposoal is that a GPA is private if it has a private
> > > memslot and is present in the private backing store. And the other core rule is
> > > that KVM must never map both the private and shared variants of a GPA into the
> > > guest.
> >
> > That's true, but I'm wondering if zapping the shared variant can be
> > handled at the time when the private one gets mapped in the KVM page
> > fault. No bothering the backing store to dedicate a callback to tell
> > KVM.
>
> Hmm, I don't think that would work for the TDP MMU due to page faults taking
> mmu_lock for read. E.g. if two vCPUs concurrently fault in both the shared and
> private variants, a race could exist where the private page fault sees the gfn
> as private and the shared page fault sees it as shared. In that case, both faults
> will install a SPTE and KVM would end up running with both variants mapped into the
> guest.
>
> There's also a performance penalty, as KVM would need to walk the shared EPT tree
> on every private page fault.
Makes sense.
Thanks,
Chao
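For reference, a minimal sketch of the behavior settled on above: the private
fd's fallocate() notification only zaps shared mappings for the affected
range and never installs SPTEs. The hook name below is assumed;
kvm_zap_gfn_range() is an existing x86 KVM helper.
static void kvm_private_mem_fallocated(struct kvm *kvm,
				       struct kvm_memory_slot *slot,
				       pgoff_t start, pgoff_t end)
{
	/* Assumes the memslot maps the private fd starting at offset 0. */
	gfn_t gfn_start = slot->base_gfn + start;
	gfn_t gfn_end = min(slot->base_gfn + end,
			    slot->base_gfn + slot->npages);
	/*
	 * Wipe any shared SPTEs covering the newly allocated private range;
	 * the private mapping is installed later, at page-fault time.
	 */
	kvm_zap_gfn_range(kvm, gfn_start, gfn_end);
}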
On Tue, Jan 04, 2022 at 03:22:07PM +0100, David Hildenbrand wrote:
> On 23.12.21 13:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> > the file is inaccessible from userspace in any possible ways like
> > read(),write() or mmap() etc.
> >
> > It provides semantics required for KVM guest private memory support
> > that a file descriptor with this seal set is going to be used as the
> > source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> >
> > At this time only shmem implements this seal.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > include/uapi/linux/fcntl.h | 1 +
> > mm/shmem.c | 37 +++++++++++++++++++++++++++++++++++--
> > 2 files changed, 36 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > index 2f86b2ad6d7e..e2bad051936f 100644
> > --- a/include/uapi/linux/fcntl.h
> > +++ b/include/uapi/linux/fcntl.h
> > @@ -43,6 +43,7 @@
> > #define F_SEAL_GROW 0x0004 /* prevent file from growing */
> > #define F_SEAL_WRITE 0x0008 /* prevent writes */
> > #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
> > +#define F_SEAL_INACCESSIBLE 0x0020 /* prevent file from accessing */
>
> I think this needs more clarification: the file content can still be
> accessed using in-kernel mechanisms such as MEMFD_OPS for KVM. It
> effectively disallows traditional access to a file (read/write/mmap)
> that will result in ordinary MMU access to file content.
>
> Not sure how to best clarify that: maybe, prevent ordinary MMU access
> (e.g., read/write/mmap) to file content?
Or: prevent userspace access (e.g., read/write/mmap) to file content?
>
> > /* (1U << 31) is reserved for signed error codes */
> >
> > /*
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 18f93c2d68f1..faa7e9b1b9bc 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1098,6 +1098,10 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
> > (newsize > oldsize && (info->seals & F_SEAL_GROW)))
> > return -EPERM;
> >
> > + if ((info->seals & F_SEAL_INACCESSIBLE) &&
> > + (newsize & ~PAGE_MASK))
> > + return -EINVAL;
> > +
>
> What happens when sealing and there are existing mmaps?
I think this is similar to ftruncate(); in either case we just allow that.
The existing mmaps will be unmapped and KVM will be notified to
invalidate the mapping in the secondary MMU as well. This assumes we
trust the userspace even though it cannot access the file content.
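Roughly, the allow-and-invalidate flow described here could look like the
sketch below; unmap_mapping_range() is an existing kernel helper, while the
notifier call is an assumed name standing in for this series' memfd hook.
static void memfd_shrink_inaccessible(struct inode *inode, loff_t newsize)
{
	/* Tear down any existing userspace mappings beyond the new size. */
	unmap_mapping_range(inode->i_mapping, newsize, 0, 1);
	/* Ask KVM to invalidate secondary-MMU mappings for the same range. */
	memfd_falloc_notifier_invalidate(inode->i_mapping,
					 newsize >> PAGE_SHIFT, ULONG_MAX);
}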
Thanks,
Chao
>
>
> --
> Thanks,
>
> David / dhildenb
On 06.01.22 14:06, Chao Peng wrote:
> On Tue, Jan 04, 2022 at 03:22:07PM +0100, David Hildenbrand wrote:
>> On 23.12.21 13:29, Chao Peng wrote:
>>> From: "Kirill A. Shutemov" <[email protected]>
>>>
>>> Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
>>> the file is inaccessible from userspace in any possible ways like
>>> read(),write() or mmap() etc.
>>>
>>> It provides semantics required for KVM guest private memory support
>>> that a file descriptor with this seal set is going to be used as the
>>> source of guest memory in confidential computing environments such
>>> as Intel TDX/AMD SEV but may not be accessible from host userspace.
>>>
>>> At this time only shmem implements this seal.
>>>
>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> Signed-off-by: Chao Peng <[email protected]>
>>> ---
>>> include/uapi/linux/fcntl.h | 1 +
>>> mm/shmem.c | 37 +++++++++++++++++++++++++++++++++++--
>>> 2 files changed, 36 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>>> index 2f86b2ad6d7e..e2bad051936f 100644
>>> --- a/include/uapi/linux/fcntl.h
>>> +++ b/include/uapi/linux/fcntl.h
>>> @@ -43,6 +43,7 @@
>>> #define F_SEAL_GROW 0x0004 /* prevent file from growing */
>>> #define F_SEAL_WRITE 0x0008 /* prevent writes */
>>> #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
>>> +#define F_SEAL_INACCESSIBLE 0x0020 /* prevent file from accessing */
>>
>> I think this needs more clarification: the file content can still be
>> accessed using in-kernel mechanisms such as MEMFD_OPS for KVM. It
>> effectively disallows traditional access to a file (read/write/mmap)
>> that will result in ordinary MMU access to file content.
>>
>> Not sure how to best clarify that: maybe, prevent ordinary MMU access
>> (e.g., read/write/mmap) to file content?
>
> Or: prevent userspace access (e.g., read/write/mmap) to file content?
The issue with that phrasing is that userspace will be able to access
that content, just via a different mechanism eventually ... e.g., via
the KVM MMU indirectly. If that makes it clearer what I mean :)
>>
>>> /* (1U << 31) is reserved for signed error codes */
>>>
>>> /*
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 18f93c2d68f1..faa7e9b1b9bc 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -1098,6 +1098,10 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
>>> (newsize > oldsize && (info->seals & F_SEAL_GROW)))
>>> return -EPERM;
>>>
>>> + if ((info->seals & F_SEAL_INACCESSIBLE) &&
>>> + (newsize & ~PAGE_MASK))
>>> + return -EINVAL;
>>> +
>>
>> What happens when sealing and there are existing mmaps?
>
> I think this is similar to ftruncate(); in either case we just allow that.
> The existing mmaps will be unmapped and KVM will be notified to
> invalidate the mapping in the secondary MMU as well. This assumes we
> trust the userspace even though it cannot access the file content.
Can't we simply check+forbid instead?
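A minimal sketch of what check+forbid could look like, assuming the seal is
added through memfd_add_seals(); the helper below is illustrative, not
existing code:
static int memfd_may_seal_inaccessible(struct file *file)
{
	struct address_space *mapping = file_inode(file)->i_mapping;
	/*
	 * Refuse to add F_SEAL_INACCESSIBLE while the file still has
	 * userspace mappings, rather than tearing them down afterwards.
	 * mapping_mapped() is true if any vma currently maps this file.
	 */
	if (mapping_mapped(mapping))
		return -EBUSY;
	return 0;
}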
--
Thanks,
David / dhildenb
Hi Sean,
Sorry for the late reply. I just saw this mail in my mailbox.
On Wed, Jan 05, 2022 at 08:52:39PM +0000, Sean Christopherson wrote:
> On Wed, Jan 05, 2022, Yan Zhao wrote:
> > Sorry, maybe I didn't express it clearly.
> >
> > As in the kvm_faultin_pfn_private(),
> > static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > struct kvm_page_fault *fault,
> > bool *is_private_pfn, int *r)
> > {
> > int order;
> > int mem_convert_type;
> > struct kvm_memory_slot *slot = fault->slot;
> > long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
> > ...
> > }
> > Currently, kvm_memfd_get_pfn() is called unconditionally.
> > However, if the backend of a private memslot is not memfd, and is device
> > fd for example, a different xxx_get_pfn() is required here.
>
> Ya, I've complained about this in a different thread[*]. This should really be
> something like kvm_private_fd_get_pfn(), where the underlying ops struct can point
> at any compatible backing store.
>
> https://lore.kernel.org/all/[email protected]/
>
ok.
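For what it's worth, a sketch of what such a backing-store-agnostic hook
could look like; every name below is an assumption for illustration, not
code from the posted series:
struct private_fd_ops {
	long (*get_pfn)(struct file *file, pgoff_t index, int *order);
	void (*put_pfn)(struct file *file, pgoff_t index);
};
static long kvm_private_fd_get_pfn(struct kvm_memory_slot *slot,
				   gfn_t gfn, int *order)
{
	/* private_file/private_ops/private_offset are assumed memslot fields. */
	pgoff_t index = (gfn - slot->base_gfn) +
			(slot->private_offset >> PAGE_SHIFT);
	return slot->private_ops->get_pfn(slot->private_file, index, order);
}
A memfd backend would supply shmem-based callbacks here, while a device fd
could supply its own, without KVM caring which backing store it is.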
> > Further, though mapped to a private gfn, it might be ok for QEMU to
> > access the device fd in an hva-based way (or call it the MMU access
> > way, e.g. read/write/mmap); it's desired that it could use the
> > traditional way to get the pfn without converting the range to a
> > shared one.
>
> No, this is expressly forbidden. The backing store for a private gfn must not
> be accessible by userspace. It's possible a backing store could support both, but
> not concurrently, and any conversion must be done without KVM being involved.
> In other words, resolving a private gfn must either succeed or fail (exit to
> userspace), KVM cannot initiate any conversions.
>
When it comes to device passthrough via VFIO, there might be more work
related to the device fd as a backend.
First, unlike memfd, which can allocate one private fd for a set of PFNs
and one shared fd for another set of PFNs, for a device fd the same
physical device needs to be opened twice, once for the shared fd and once
for the private fd.
Then, for the private device fd, its ramblock now has to be allocated with
qemu_ram_alloc_from_fd() instead of the current qemu_ram_alloc_from_ptr().
And as this private fd is shared by several ramblocks in VFIO (each
located at a different base offset), the base offsets also need to be kept
somewhere in order to call get_pfn successfully. (Previously this info was
kept in the vma through mmap(), so without mmap() a new interface might be
required.)
Also, for the shared device fd, mmap() is required in order to allocate
the ramblock with qemu_ram_alloc_from_ptr(), and more importantly to make
the future gfn_to_hva and hva_to_pfn possible.
But as the shared and private fds are backed by the same physical device,
the vfio driver needs to record which vma ranges are allowed to take the
actual mmap fault and which are not.
With the above changes, only host user space access to the device ranges
mapped to private GFNs is prevented.
For memory backends, host kernel space access is prevented via MKTME. For
a device, the device itself needs to do the work to disallow host kernel
space access.
However, unlike the memory side, the device side would not cause any MCE,
so host user space access to the device would not cause MCEs either.
So I'm not sure whether the above work is worthwhile for the device fd.
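Purely as an illustration of the mmap-fault bookkeeping mentioned above
(none of the structures or helpers below exist in vfio today):
static vm_fault_t device_shared_mmap_fault(struct vm_fault *vmf)
{
	struct device_region *region = vmf->vma->vm_private_data;
	/* Offsets currently assigned to private GFNs must not be mmap()ed. */
	if (device_region_is_private(region, vmf->pgoff))
		return VM_FAULT_SIGBUS;
	return vmf_insert_pfn(vmf->vma, vmf->address,
			      region->base_pfn + vmf->pgoff);
}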
> > pfn = __gfn_to_pfn_memslot(slot, fault->gfn, ...)
> > |->addr = __gfn_to_hva_many (slot, gfn,...)
> > | pfn = hva_to_pfn (addr,...)
> >
> >
> > So, is it possible to recognize such kinds of backends in KVM, and to get
> > the pfn in the traditional way without converting them to shared?
> > e.g.
> > - specify KVM_MEM_PRIVATE_NONPROTECT to memory regions with such kind
> > of backends, or
> > - detect the fd type and check if get_pfn is provided; if not, go the
> > traditional way.
>
> No, because the whole point of this is to make guest private memory inaccessible
> to host userspace. Or did I misinterpret your questions?
I think the host unmap series is based on the assumption that host user
space access to the memory backing private guest GFNs would cause fatal
MCEs.
So I hope backends that will not bring this fatal error can keep using
the traditional way to get the pfn while being mapped to private GFNs at
the same time.
Thanks
Yan