2021-11-11 14:14:53

by Chao Peng

Subject: [RFC PATCH 0/6] KVM: mm: fd-based approach for supporting KVM guest private memory

This RFC series tries to implement the fd-based KVM guest private memory
proposal described at [1].

We have already had some offline discussions on this series, which
resulted in a different design proposal from Paolo. This thread includes
both the original RFC patch series for proposal [1] and a summary of
Paolo's new proposal, so that we can continue the discussion.

To understand the patches and the new proposal, you are highly
recommended to read the original proposal [1] first.


Patch Description
=================
The series includes a private memory implementation in the memfd/shmem
backing store, KVM support for private memory slots, and their
counterpart in QEMU.

Patch 1: kernel part, shmem/memfd support
Patches 2-6: KVM part
Patches 7-13: QEMU part

QEMU Usage:
-machine private-memory-backend=ram1 \
-object memory-backend-memfd,id=ram1,size=5G,guest-private=on,seal=off


New Proposal
============
Below is a summary of the new proposal that was discussed in the
offline thread.

In general, the new proposal reuses the concept of an fd-based guest
memory backing store described in [1], but coordinates the private and
shared parts in one single memslot instead of introducing a dedicated
private memslot.

- memslot extension
The new proposal suggests adding the private fd and an offset into it
to the existing 'shared' memslot, so that both private and shared
memory can live in one single memslot. A page in the memslot is either
private or shared. A page is private only when it is allocated in the
private fd; in all other cases it is treated as shared, including pages
already mapped as shared as well as pages that have not been mapped
yet. A possible uAPI shape is sketched below.
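
One way the extended memslot could look (a sketch only; the
private_fd/private_offset field names are illustrative and not part of
any existing uAPI):

  struct kvm_userspace_memory_region_ext {
          __u32 slot;
          __u32 flags;
          __u64 guest_phys_addr;
          __u64 memory_size;     /* bytes */
          __u64 userspace_addr;  /* the shared part, as today */
          __u64 private_offset;  /* offset into private_fd */
          __u32 private_fd;      /* fd of the private backing store */
  };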

- private memory map/unmap
Userspace's map/unmap operations are done by calling fallocate() on the
private fd (see the sketch below):
- map: plain fallocate() with mode=0.
- unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
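
A minimal userspace sketch of the two operations (assuming a private_fd
obtained from the backing store; fallocate() needs _GNU_SOURCE and
<fcntl.h>, offset/size are page-aligned byte values, and error handling
is omitted):

  /* map: allocate pages in the private fd, i.e. convert to private */
  fallocate(private_fd, 0, offset, size);

  /*
   * unmap: punch a hole, i.e. convert back to shared; the kernel
   * requires FALLOC_FL_KEEP_SIZE together with FALLOC_FL_PUNCH_HOLE.
   */
  fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
            offset, size);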

There would be two new callbacks registered by KVM and called by the
memory backing store during the above map/unmap operations:
- map(inode, offset, size): the memory backing store tells the related
KVM memslot to do a shared->private conversion.
- unmap(inode, offset, size): the memory backing store tells the
related KVM memslot to do a private->shared conversion.

The memory backing store also needs to provide a new callback for KVM
to query whether a page is already allocated in the private fd, so KVM
can tell whether the page is private or not (see the sketch below):
- page_allocated(inode, offset): for shmem this would simply return the
result of pagecache_get_page().
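
Put together, the interface between KVM and the backing store might
look roughly like this (a sketch only; the struct and function names
are hypothetical and not part of this series' code):

  /* registered by KVM, called by the backing store on map/unmap */
  struct guest_memory_callbacks {
          void (*map)(struct inode *inode, pgoff_t offset, pgoff_t size);
          void (*unmap)(struct inode *inode, pgoff_t offset, pgoff_t size);
  };

  /*
   * provided by the backing store, called by KVM; for shmem this
   * would boil down to pagecache_get_page()
   */
  bool (*page_allocated)(struct inode *inode, pgoff_t offset);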

There are two places in KVM that can exit to userspace to trigger a
private/shared conversion:
- explicit conversion: happens when the guest calls into KVM to
explicitly map a range (as private or shared); KVM then exits to
userspace to do the above map/unmap operations.
- implicit conversion: happens in the KVM page fault handler (see the
pseudocode sketch below).
* If the fault is due to a private memory access, cause a userspace
exit for a shared->private conversion request when page_allocated()
returns false; otherwise map the page directly without a userspace exit.
* If the fault is due to a shared memory access, cause a userspace
exit for a private->shared conversion request when page_allocated()
returns true; otherwise map the page directly without a userspace exit.
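
In pseudocode, the implicit-conversion logic could look like this (a
sketch of the rules above, not actual KVM code; fault_is_private(),
gfn_to_offset(), exit_to_userspace() and map_page() are hypothetical
helpers):

  bool private = fault_is_private(vcpu, gfn);
  bool allocated = page_allocated(inode, gfn_to_offset(slot, gfn));

  if (private && !allocated)      /* request shared->private */
          return exit_to_userspace(vcpu, KVM_EXIT_MEM_MAP_PRIVATE, gfn);
  if (!private && allocated)      /* request private->shared */
          return exit_to_userspace(vcpu, KVM_EXIT_MEM_MAP_SHARE, gfn);

  return map_page(vcpu, gfn);     /* no conversion needed */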

An example flow:

guest                     Linux                     userspace
------------------------  ------------------------  ----------------------
                                                    ioctl(KVM_RUN)
access private memory
    '--- EPT violation --.
                         v
                   userspace exit
                         '------------------.
                                            v
                                            munmap shared memfd
                                            fallocate private memfd
                         .------------------'
                         v
                   fallocate()
                     call guest_ops
                       unmap shared PTE
                       map private PTE
                         ...
                                                    ioctl(KVM_RUN)

Compared to the original proposal, this:
- needs no KVM memslot hole-punching API,
- avoids a potential memslot performance/scalability/fragmentation
issue,
- may also reduce userspace complexity,
- but requires additional callbacks between KVM and the memory backing
store.

[1] https://lkml.kernel.org/kvm/[email protected]/t/

Thanks,
Chao
---
Chao Peng (6):
mm: Add F_SEAL_GUEST to shmem/memfd
kvm: x86: Introduce guest private memory address space to memslot
kvm: x86: add private_ops to memslot
kvm: x86: implement private_ops for memfd backing store
kvm: x86: add KVM_EXIT_MEMORY_ERROR exit
KVM: add KVM_SPLIT_MEMORY_REGION

Documentation/virt/kvm/api.rst | 1 +
arch/x86/include/asm/kvm_host.h | 5 +-
arch/x86/include/uapi/asm/kvm.h | 4 +
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/memfd.c | 63 +++++++++++
arch/x86/kvm/mmu/mmu.c | 69 ++++++++++--
arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
arch/x86/kvm/x86.c | 3 +-
include/linux/kvm_host.h | 41 ++++++-
include/linux/memfd.h | 22 ++++
include/linux/shmem_fs.h | 9 ++
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/kvm.h | 34 ++++++
mm/memfd.c | 34 +++++-
mm/shmem.c | 127 +++++++++++++++++++++-
virt/kvm/kvm_main.c | 185 +++++++++++++++++++++++++++++++-
16 files changed, 581 insertions(+), 22 deletions(-)
create mode 100644 arch/x86/kvm/memfd.c

--
2.17.1



2021-11-11 14:15:06

by Chao Peng

Subject: [RFC PATCH 1/6] mm: Add F_SEAL_GUEST to shmem/memfd

The new seal is only allowed if there are no pre-existing pages in the
fd and no existing mapping of the file. After the seal is set, no
read/write/mmap from userspace is allowed.
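
For illustration, the intended userspace sequence might look like this
(a sketch; per this patch the seal must be applied while the file is
still empty and unmapped, so it is set before ftruncate()):

  #define _GNU_SOURCE
  #include <sys/mman.h>       /* memfd_create() */
  #include <fcntl.h>          /* fcntl(), F_ADD_SEALS */
  #include <unistd.h>         /* ftruncate() */

  int fd = memfd_create("guest-private", MFD_ALLOW_SEALING);

  /* only allowed while i_size == 0 and the file has no mappings */
  fcntl(fd, F_ADD_SEALS, F_SEAL_GUEST);

  /* grow the file afterwards; read/write/mmap are now rejected */
  ftruncate(fd, mem_size);    /* mem_size must be page aligned */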

Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/memfd.h | 22 +++++++
include/linux/shmem_fs.h | 9 +++
include/uapi/linux/fcntl.h | 1 +
mm/memfd.c | 34 +++++++++-
mm/shmem.c | 127 ++++++++++++++++++++++++++++++++++++-
5 files changed, 189 insertions(+), 4 deletions(-)

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..ea213f5e3f95 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -4,13 +4,35 @@

#include <linux/file.h>

+struct guest_ops {
+ void (*invalidate_page_range)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+};
+
+struct guest_mem_ops {
+ unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset,
+ int *page_level);
+ void (*put_unlock_pfn)(unsigned long pfn);
+
+};
+
#ifdef CONFIG_MEMFD_CREATE
extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
+
+extern inline int memfd_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops);
#else
static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
{
return -EINVAL;
}
+static inline int memfd_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops)
+{
+ return -EINVAL;
+}
#endif

#endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d82b6f396588..1b4c032680d5 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,9 @@

/* inode in-kernel data */

+struct guest_ops;
+struct guest_mem_ops;
+
struct shmem_inode_info {
spinlock_t lock;
unsigned int seals; /* shmem seals */
@@ -24,6 +27,8 @@ struct shmem_inode_info {
struct simple_xattrs xattrs; /* list of xattrs */
atomic_t stop_eviction; /* hold when working on inode */
struct inode vfs_inode;
+ void *guest_owner;
+ const struct guest_ops *guest_ops;
};

struct shmem_sb_info {
@@ -90,6 +95,10 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
pgoff_t start, pgoff_t end);

+extern int shmem_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops);
+
/* Flag allocation requirements to shmem_getpage */
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..c79bc8572721 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
#define F_SEAL_GROW 0x0004 /* prevent file from growing */
#define F_SEAL_WRITE 0x0008 /* prevent writes */
#define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
+#define F_SEAL_GUEST 0x0020
/* (1U << 31) is reserved for signed error codes */

/*
diff --git a/mm/memfd.c b/mm/memfd.c
index 2647c898990c..5a34173f55f4 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -130,11 +130,26 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
return NULL;
}

+int memfd_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops)
+{
+ if (shmem_mapping(inode->i_mapping)) {
+ return shmem_register_guest(inode, owner,
+ guest_ops, guest_mem_ops);
+ }
+
+ return -EINVAL;
+}
+
+EXPORT_SYMBOL_GPL(memfd_register_guest);
+
#define F_ALL_SEALS (F_SEAL_SEAL | \
F_SEAL_SHRINK | \
F_SEAL_GROW | \
F_SEAL_WRITE | \
- F_SEAL_FUTURE_WRITE)
+ F_SEAL_FUTURE_WRITE | \
+ F_SEAL_GUEST)

static int memfd_add_seals(struct file *file, unsigned int seals)
{
@@ -203,10 +218,27 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
}
}

+ if (seals & F_SEAL_GUEST) {
+ i_mmap_lock_read(inode->i_mapping);
+
+ if (!RB_EMPTY_ROOT(&inode->i_mapping->i_mmap.rb_root)) {
+ error = -EBUSY;
+ goto unlock;
+ }
+
+ if (i_size_read(inode)) {
+ error = -EBUSY;
+ goto unlock;
+ }
+ }
+
*file_seals |= seals;
error = 0;

unlock:
+ if (seals & F_SEAL_GUEST)
+ i_mmap_unlock_read(inode->i_mapping);
+
inode_unlock(inode);
return error;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index b2db4ed0fbc7..978c841c42c4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -80,6 +80,7 @@ static struct vfsmount *shm_mnt;
#include <linux/userfaultfd_k.h>
#include <linux/rmap.h>
#include <linux/uuid.h>
+#include <linux/memfd.h>

#include <linux/uaccess.h>

@@ -883,6 +884,21 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
return split_huge_page(page) >= 0;
}

+static void guest_invalidate_page(struct inode *inode,
+ struct page *page, pgoff_t start, pgoff_t end)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (!info->guest_ops || !info->guest_ops->invalidate_page_range)
+ return;
+
+ start = max(start, page->index);
+ end = min(end, page->index + HPAGE_PMD_NR) - 1;
+
+ info->guest_ops->invalidate_page_range(inode, info->guest_owner,
+ start, end);
+}
+
/*
* Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -923,6 +939,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
index += thp_nr_pages(page) - 1;

+ guest_invalidate_page(inode, page, start, end);
+
if (!unfalloc || !PageUptodate(page))
truncate_inode_page(mapping, page);
unlock_page(page);
@@ -999,6 +1017,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index--;
break;
}
+
+ guest_invalidate_page(inode, page, start, end);
+
VM_BUG_ON_PAGE(PageWriteback(page), page);
if (shmem_punch_compound(page, start, end))
truncate_inode_page(mapping, page);
@@ -1074,6 +1095,9 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
(newsize > oldsize && (info->seals & F_SEAL_GROW)))
return -EPERM;

+ if ((info->seals & F_SEAL_GUEST) && (newsize & ~PAGE_MASK))
+ return -EINVAL;
+
if (newsize != oldsize) {
error = shmem_reacct_size(SHMEM_I(inode)->flags,
oldsize, newsize);
@@ -1348,6 +1372,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
goto redirty;
if (!total_swap_pages)
goto redirty;
+ if (info->seals & F_SEAL_GUEST)
+ goto redirty;

/*
* Our capabilities prevent regular writeback or sync from ever calling
@@ -2278,6 +2304,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
vma->vm_flags &= ~(VM_MAYWRITE);
}

+ if (info->seals & F_SEAL_GUEST)
+ return -EPERM;
+
/* arm64 - allow memory tagging on RAM-based files */
vma->vm_flags |= VM_MTE_ALLOWED;

@@ -2519,12 +2548,14 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
pgoff_t index = pos >> PAGE_SHIFT;

/* i_mutex is held by caller */
- if (unlikely(info->seals & (F_SEAL_GROW |
- F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+ if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE |
+ F_SEAL_FUTURE_WRITE | F_SEAL_GUEST))) {
if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
return -EPERM;
if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
return -EPERM;
+ if (info->seals & F_SEAL_GUEST)
+ return -EPERM;
}

return shmem_getpage(inode, index, pagep, SGP_WRITE);
@@ -2598,6 +2629,20 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
end_index = i_size >> PAGE_SHIFT;
if (index > end_index)
break;
+
+ /*
+ * inode_lock protects setting up seals as well as write to
+ * i_size. Setting F_SEAL_GUEST only allowed with i_size == 0.
+ *
+ * Check F_SEAL_GUEST after i_size. It effectively serialize
+ * read vs. setting F_SEAL_GUEST without taking inode_lock in
+ * read path.
+ */
+ if (SHMEM_I(inode)->seals & F_SEAL_GUEST) {
+ error = -EPERM;
+ break;
+ }
+
if (index == end_index) {
nr = i_size & ~PAGE_MASK;
if (nr <= offset)
@@ -2723,6 +2768,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
goto out;
}

+ if ((info->seals & F_SEAL_GUEST) &&
+ (offset & ~PAGE_MASK || len & ~PAGE_MASK)) {
+ error = -EINVAL;
+ goto out;
+ }
+
shmem_falloc.waitq = &shmem_falloc_waitq;
shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
@@ -3806,6 +3857,20 @@ static void shmem_destroy_inodecache(void)
kmem_cache_destroy(shmem_inode_cachep);
}

+#ifdef CONFIG_MIGRATION
+int shmem_migrate_page(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
+{
+ struct inode *inode = mapping->host;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (info->seals & F_SEAL_GUEST)
+ return -ENOTSUPP;
+ return migrate_page(mapping, newpage, page, mode);
+}
+#endif
+
const struct address_space_operations shmem_aops = {
.writepage = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
@@ -3814,12 +3879,68 @@ const struct address_space_operations shmem_aops = {
.write_end = shmem_write_end,
#endif
#ifdef CONFIG_MIGRATION
- .migratepage = migrate_page,
+ .migratepage = shmem_migrate_page,
#endif
.error_remove_page = generic_error_remove_page,
};
EXPORT_SYMBOL(shmem_aops);

+static unsigned long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset,
+ int *page_level)
+{
+ struct page *page;
+ int ret;
+
+ ret = shmem_getpage(inode, offset, &page, SGP_WRITE);
+ if (ret)
+ return ret;
+
+ if (is_transparent_hugepage(page))
+ *page_level = PG_LEVEL_2M;
+ else
+ *page_level = PG_LEVEL_4K;
+
+ return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+ set_page_dirty(page);
+ unlock_page(page);
+ put_page(page);
+}
+
+static const struct guest_mem_ops shmem_guest_ops = {
+ .get_lock_pfn = shmem_get_lock_pfn,
+ .put_unlock_pfn = shmem_put_unlock_pfn,
+};
+
+int shmem_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (!owner)
+ return -EINVAL;
+
+ if (info->guest_owner) {
+ if (info->guest_owner == owner)
+ return 0;
+ else
+ return -EPERM;
+ }
+
+ info->guest_owner = owner;
+ info->guest_ops = guest_ops;
+ *guest_mem_ops = &shmem_guest_ops;
+ return 0;
+}
+
static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
.get_unmapped_area = shmem_get_unmapped_area,
--
2.17.1


2021-11-11 14:15:34

by Chao Peng

Subject: [RFC PATCH 2/6] kvm: x86: Introduce guest private memory address space to memslot

Existing memslot functions are extended to take a bool 'private'
parameter indicating whether the operation is on the guest private
memory address space or not.
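
For reference, KVM encodes the address space id in the upper 16 bits of
the slot field of KVM_SET_USER_MEMORY_REGION, so userspace would select
the new private address space like this (a sketch):

  /* address spaces after this patch: 0 = default, 1 = SMM, 2 = private */
  region.slot = (KVM_PRIVATE_ADDRESS_SPACE << 16) | slot_id;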

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 +++--
arch/x86/include/uapi/asm/kvm.h | 4 ++++
arch/x86/kvm/mmu/mmu.c | 2 +-
include/linux/kvm_host.h | 23 ++++++++++++++++++++---
virt/kvm/kvm_main.c | 9 ++++++++-
5 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 20dfcdd20e81..048089883650 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1741,9 +1741,10 @@ enum {
#define HF_SMM_INSIDE_NMI_MASK (1 << 7)

#define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
-#define KVM_ADDRESS_SPACE_NUM 2
+#define KVM_ADDRESS_SPACE_NUM 3

-#define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
+#define kvm_arch_vcpu_memslots_id(vcpu, private) \
+ (((vcpu)->arch.hflags & HF_SMM_MASK) ? 1 : (!!private) << 1)
#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)

asmlinkage void kvm_spurious_fault(void);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 47bc1a0df5ee..65189cfd3837 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -53,6 +53,10 @@
/* Architectural interrupt line count. */
#define KVM_NR_INTERRUPTS 256

+#define KVM_DEFAULT_ADDRESS_SPACE 0
+#define KVM_SMM_ADDRESS_SPACE 1
+#define KVM_PRIVATE_ADDRESS_SPACE 2
+
struct kvm_memory_alias {
__u32 slot; /* this has a different namespace than memory slots */
__u32 flags;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 79d4ae465a96..8483c15eac6f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3938,7 +3938,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
return false;
}

- /* Don't expose private memslots to L2. */
+ /* Don't expose KVM's internal memslots to L2. */
if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
*pfn = KVM_PFN_NOSLOT;
*writable = false;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 597841fe3d7a..8e5b197230ed 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -442,7 +442,7 @@ struct kvm_irq_routing_table {
#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)

#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
-static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
+static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu, bool private)
{
return 0;
}
@@ -699,13 +699,19 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
return __kvm_memslots(kvm, 0);
}

-static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
+static inline struct kvm_memslots *__kvm_vcpu_memslots(struct kvm_vcpu *vcpu,
+ bool private)
{
- int as_id = kvm_arch_vcpu_memslots_id(vcpu);
+ int as_id = kvm_arch_vcpu_memslots_id(vcpu, private);

return __kvm_memslots(vcpu->kvm, as_id);
}

+static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
+{
+ return __kvm_vcpu_memslots(vcpu, false);
+}
+
static inline
struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
{
@@ -721,6 +727,15 @@ struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
return slot;
}

+static inline bool memslot_is_private(const struct kvm_memory_slot *slot)
+{
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+ return slot && slot->as_id == KVM_PRIVATE_ADDRESS_SPACE;
+#else
+ return false;
+#endif
+}
+
/*
* KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
* - create a new memory slot
@@ -860,6 +875,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot, g
void mark_page_dirty(struct kvm *kvm, gfn_t gfn);

struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
+struct kvm_memory_slot *__kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu,
+ gfn_t gfn, bool private);
struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8815218630dc..fe62df334054 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1721,9 +1721,16 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
}
EXPORT_SYMBOL_GPL(gfn_to_memslot);

+struct kvm_memory_slot *__kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu,
+ gfn_t gfn, bool private)
+{
+ return __gfn_to_memslot(__kvm_vcpu_memslots(vcpu, private), gfn);
+}
+EXPORT_SYMBOL_GPL(__kvm_vcpu_gfn_to_memslot);
+
struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
{
- return __gfn_to_memslot(kvm_vcpu_memslots(vcpu), gfn);
+ return __kvm_vcpu_gfn_to_memslot(vcpu, gfn, false);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);

--
2.17.1


2021-11-11 14:15:36

by Chao Peng

Subject: [RFC PATCH 3/6] kvm: x86: add private_ops to memslot

Guest memory for a guest private memslot is designed to be backed by an
"enlightened" file descriptor (fd). Callbacks operating on the fd are
implemented by other kernel subsystems that want to provide guest
private memory, to help KVM establish the memory mapping.
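
Combined with the address-space encoding from the previous patch,
creating a private memslot could then look roughly like this (a sketch;
note that, judging from patch 4's index calculation, userspace_addr
appears to be reused as the byte offset into the fd rather than as a
host virtual address):

  struct kvm_userspace_memory_region region = {
          .slot            = (KVM_PRIVATE_ADDRESS_SPACE << 16) | slot_id,
          .guest_phys_addr = gpa,
          .memory_size     = size,
          .userspace_addr  = 0,      /* offset into the fd, page aligned */
          .fd              = memfd,  /* F_SEAL_GUEST-sealed memfd */
  };

  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);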

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
Documentation/virt/kvm/api.rst | 1 +
arch/x86/kvm/mmu/mmu.c | 47 ++++++++++++++++++++++++++++++----
arch/x86/kvm/mmu/paging_tmpl.h | 3 ++-
include/linux/kvm_host.h | 8 ++++++
include/uapi/linux/kvm.h | 3 +++
5 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 47054a79d395..16c06bf10302 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1260,6 +1260,7 @@ yet and must be cleared on entry.
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
__u64 userspace_addr; /* start of the userspace allocated memory */
+ __u32 fd; /* memory fd that provides guest memory */
};

/* for kvm_memory_region::flags */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8483c15eac6f..af5ecf4ef62a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2932,6 +2932,19 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
return level;
}

+static int host_private_pfn_mapping_level(const struct kvm_memory_slot *slot,
+ gfn_t gfn)
+{
+ kvm_pfn_t pfn;
+ int page_level = PG_LEVEL_4K;
+
+ pfn = slot->private_ops->get_lock_pfn(slot, gfn, &page_level);
+ if (pfn >= 0)
+ slot->private_ops->put_unlock_pfn(pfn);
+
+ return page_level;
+}
+
int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t pfn, int max_level)
{
@@ -2947,6 +2960,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_memory_slot *slot,
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;

+ if (memslot_is_private(slot))
+ return host_private_pfn_mapping_level(slot, gfn);
+
return host_pfn_mapping_level(kvm, gfn, pfn, slot);
}

@@ -3926,10 +3942,13 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,

static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
- bool write, bool *writable)
+ bool write, bool *writable, bool private)
{
- struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+ struct kvm_memory_slot *slot;
bool async;
+ int page_level;
+
+ slot = __kvm_vcpu_gfn_to_memslot(vcpu, gfn, private);

/* Don't expose aliases for no slot GFNs or private memslots */
if ((cr2_or_gpa & vcpu_gpa_stolen_mask(vcpu)) &&
@@ -3945,6 +3964,17 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
return false;
}

+ if (private) {
+ *pfn = slot->private_ops->get_lock_pfn(slot, gfn, &page_level);
+ if (*pfn < 0)
+ *pfn = KVM_PFN_ERR_FAULT;
+ if (writable)
+ *writable = slot->flags & KVM_MEM_READONLY ?
+ false : true;
+
+ return false;
+ }
+
async = false;
*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async,
write, writable, hva);
@@ -3971,7 +4001,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
kvm_pfn_t *pfn)
{
bool write = error_code & PFERR_WRITE_MASK;
+ bool private = is_private_gfn(vcpu, gpa >> PAGE_SHIFT);
bool map_writable;
+ struct kvm_memory_slot *slot;

gfn_t gfn = vcpu_gpa_to_gfn_unalias(vcpu, gpa);
unsigned long mmu_seq;
@@ -3995,7 +4027,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
smp_rmb();

if (try_async_pf(vcpu, prefault, gfn, gpa, pfn, &hva,
- write, &map_writable))
+ write, &map_writable, private))
return RET_PF_RETRY;

if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, *pfn, ACC_ALL, &r))
@@ -4008,7 +4040,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
else
write_lock(&vcpu->kvm->mmu_lock);

- if (!is_noslot_pfn(*pfn) &&
+ if (!private && !is_noslot_pfn(*pfn) &&
mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva))
goto out_unlock;
r = make_mmu_pages_available(vcpu);
@@ -4027,7 +4059,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
read_unlock(&vcpu->kvm->mmu_lock);
else
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(*pfn);
+
+ if (!private)
+ kvm_release_pfn_clean(*pfn);
+ else
+ slot->private_ops->put_unlock_pfn(*pfn);
+
return r;
}

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 1fc3a0826072..5ffeb9c85fba 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -799,6 +799,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
{
bool write_fault = error_code & PFERR_WRITE_MASK;
bool user_fault = error_code & PFERR_USER_MASK;
+ bool private = is_private_gfn(vcpu, addr >> PAGE_SHIFT);
struct guest_walker walker;
int r;
kvm_pfn_t pfn;
@@ -854,7 +855,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
smp_rmb();

if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
- write_fault, &map_writable))
+ write_fault, &map_writable, private))
return RET_PF_RETRY;

if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8e5b197230ed..83345460c5f5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -347,6 +347,12 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
}

+struct kvm_private_memory_ops {
+ unsigned long (*get_lock_pfn)(const struct kvm_memory_slot *slot,
+ gfn_t gfn, int *page_level);
+ void (*put_unlock_pfn)(unsigned long pfn);
+};
+
/*
* Some of the bitops functions do not support too long bitmaps.
* This number must be determined not to exceed such limits.
@@ -362,6 +368,8 @@ struct kvm_memory_slot {
u32 flags;
short id;
u16 as_id;
+ struct file *file;
+ struct kvm_private_memory_ops *private_ops;
};

static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d3c9caf86d80..8d20caae9180 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -100,6 +100,9 @@ struct kvm_userspace_memory_region {
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
__u64 userspace_addr; /* start of the userspace allocated memory */
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+ __u32 fd; /* valid if memslot is guest private memory */
+#endif
};

/*
--
2.17.1


2021-11-11 14:15:40

by Chao Peng

Subject: [RFC PATCH 4/6] kvm: x86: implement private_ops for memfd backing store

Call the memfd_register_guest() module API to set up private_ops for a
given private memslot.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/memfd.c | 63 ++++++++++++++++++++++++++++++++++++++++
include/linux/kvm_host.h | 6 ++++
virt/kvm/kvm_main.c | 29 ++++++++++++++++--
4 files changed, 96 insertions(+), 4 deletions(-)
create mode 100644 arch/x86/kvm/memfd.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index e7ed25070206..72ad96c78bed 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -16,7 +16,7 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o

kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
- hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
+ hyperv.o debugfs.o memfd.o mmu/mmu.o mmu/page_track.o \
mmu/spte.o
kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
kvm-$(CONFIG_KVM_XEN) += xen.o
diff --git a/arch/x86/kvm/memfd.c b/arch/x86/kvm/memfd.c
new file mode 100644
index 000000000000..e08ab61d09f2
--- /dev/null
+++ b/arch/x86/kvm/memfd.c
@@ -0,0 +1,63 @@
+
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * memfd.c: routines for fd based memory backing store
+ * Copyright (c) 2021, Intel Corporation.
+ *
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/memfd.h>
+const static struct guest_mem_ops *memfd_ops;
+
+static void test_guest_invalidate_page_range(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end)
+{
+ //!!!We can get here after the owner no longer exists
+}
+
+static const struct guest_ops guest_ops = {
+ .invalidate_page_range = test_guest_invalidate_page_range,
+};
+
+static unsigned long memfd_get_lock_pfn(const struct kvm_memory_slot *slot,
+ gfn_t gfn, int *page_level)
+{
+ pgoff_t index = gfn - slot->base_gfn +
+ (slot->userspace_addr >> PAGE_SHIFT);
+
+ return memfd_ops->get_lock_pfn(slot->file->f_inode, index, page_level);
+}
+
+static void memfd_put_unlock_pfn(unsigned long pfn)
+{
+ memfd_ops->put_unlock_pfn(pfn);
+}
+
+static struct kvm_private_memory_ops memfd_private_ops = {
+ .get_lock_pfn = memfd_get_lock_pfn,
+ .put_unlock_pfn = memfd_put_unlock_pfn,
+};
+
+int kvm_register_private_memslot(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem,
+ struct kvm_memory_slot *slot)
+{
+ struct fd memfd = fdget(mem->fd);
+
+ if(!memfd.file)
+ return -EINVAL;
+
+ slot->file = memfd.file;
+ slot->private_ops = &memfd_private_ops;
+
+ memfd_register_guest(slot->file->f_inode, kvm, &guest_ops, &memfd_ops);
+ return 0;
+}
+
+void kvm_unregister_private_memslot(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem,
+ struct kvm_memory_slot *slot)
+{
+ fput(slot->file);
+}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 83345460c5f5..17fabb4f53bf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -777,6 +777,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
struct kvm_memory_slot *old,
const struct kvm_memory_slot *new,
enum kvm_mr_change change);
+int kvm_register_private_memslot(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem,
+ struct kvm_memory_slot *slot);
+void kvm_unregister_private_memslot(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem,
+ struct kvm_memory_slot *slot);
/* flush all memory translations */
void kvm_arch_flush_shadow_all(struct kvm *kvm);
/* flush memory translations pointing to 'slot' */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fe62df334054..e8e2c5b28aa4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1250,7 +1250,19 @@ static int kvm_set_memslot(struct kvm *kvm,
kvm_arch_flush_shadow_memslot(kvm, slot);
}

+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+ if (change == KVM_MR_CREATE && as_id == KVM_PRIVATE_ADDRESS_SPACE) {
+ r = kvm_register_private_memslot(kvm, mem, new);
+ if (r)
+ goto out_slots;
+ }
+#endif
+
r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+ if ((r || change == KVM_MR_DELETE) && as_id == KVM_PRIVATE_ADDRESS_SPACE)
+ kvm_unregister_private_memslot(kvm, mem, new);
+#endif
if (r)
goto out_slots;

@@ -1324,10 +1336,15 @@ int __kvm_set_memory_region(struct kvm *kvm,
return -EINVAL;
if (mem->guest_phys_addr & (PAGE_SIZE - 1))
return -EINVAL;
- /* We can read the guest memory with __xxx_user() later on. */
if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
- (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
- !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+ (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
+ return -EINVAL;
+ /* We can read the guest memory with __xxx_user() later on. */
+ if (
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+ as_id != KVM_PRIVATE_ADDRESS_SPACE &&
+#endif
+ !access_ok((void __user *)(unsigned long)mem->userspace_addr,
mem->memory_size))
return -EINVAL;
if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
@@ -1368,6 +1385,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
new.dirty_bitmap = NULL;
memset(&new.arch, 0, sizeof(new.arch));
} else { /* Modify an existing slot. */
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+ /* Private memslots are immutable, they can only be deleted. */
+ if (as_id == KVM_PRIVATE_ADDRESS_SPACE)
+ return -EINVAL;
+#endif
+
if ((new.userspace_addr != old.userspace_addr) ||
(new.npages != old.npages) ||
((new.flags ^ old.flags) & KVM_MEM_READONLY))
--
2.17.1


2021-11-11 14:15:51

by Chao Peng

Subject: [RFC PATCH 5/6] kvm: x86: add KVM_EXIT_MEMORY_ERROR exit

Currently this supports exiting to userspace for private/shared memory
conversion.
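
A userspace run loop could then react to the new exit like this (a
sketch; gpa_to_fd_offset() is a hypothetical helper mapping a GPA to an
offset in the private fd, and run is the mmap-ed vcpu run structure):

  ioctl(vcpu_fd, KVM_RUN, 0);

  if (run->exit_reason == KVM_EXIT_MEMORY_ERROR) {
          __u64 off = gpa_to_fd_offset(run->mem.u.map.gpa);

          if (run->mem.type == KVM_EXIT_MEM_MAP_PRIVATE)
                  /* shared->private: allocate in the private fd */
                  fallocate(private_fd, 0, off, run->mem.u.map.size);
          else    /* KVM_EXIT_MEM_MAP_SHARE: private->shared */
                  fallocate(private_fd,
                            FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            off, run->mem.u.map.size);
  }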

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 20 ++++++++++++++++++++
include/uapi/linux/kvm.h | 15 +++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index af5ecf4ef62a..780868888aa8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3950,6 +3950,17 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,

slot = __kvm_vcpu_gfn_to_memslot(vcpu, gfn, private);

+ /*
+ * Exit to userspace to map the requested private/shared memory region
+ * if there is no memslot and (a) the access is private or (b) there is
+ * an existing private memslot. Emulated MMIO must be accessed through
+ * shared GPAs, thus a memslot miss on a private GPA is always handled
+ * as an implicit conversion "request".
+ */
+ if (!slot &&
+ (private || __kvm_vcpu_gfn_to_memslot(vcpu, gfn, true)))
+ goto out_convert;
+
/* Don't expose aliases for no slot GFNs or private memslots */
if ((cr2_or_gpa & vcpu_gpa_stolen_mask(vcpu)) &&
!kvm_is_visible_memslot(slot)) {
@@ -3994,6 +4005,15 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
write, writable, hva);
return false;
+
+out_convert:
+ vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
+ vcpu->run->mem.type = private ? KVM_EXIT_MEM_MAP_PRIVATE
+ : KVM_EXIT_MEM_MAP_SHARE;
+ vcpu->run->mem.u.map.gpa = cr2_or_gpa;
+ vcpu->run->mem.u.map.size = PAGE_SIZE;
+ return true;
+
}

static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8d20caae9180..470c472a9451 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -233,6 +233,18 @@ struct kvm_xen_exit {
} u;
};

+struct kvm_memory_exit {
+#define KVM_EXIT_MEM_MAP_SHARE 1
+#define KVM_EXIT_MEM_MAP_PRIVATE 2
+ __u32 type;
+ union {
+ struct {
+ __u64 gpa;
+ __u64 size;
+ } map;
+ } u;
+};
+
#define KVM_S390_GET_SKEYS_NONE 1
#define KVM_S390_SKEYS_MAX 1048576

@@ -272,6 +284,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_X86_BUS_LOCK 33
#define KVM_EXIT_XEN 34
#define KVM_EXIT_TDVMCALL 35
+#define KVM_EXIT_MEMORY_ERROR 36

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -455,6 +468,8 @@ struct kvm_run {
__u64 subfunc;
__u64 param[4];
} tdvmcall;
+ /* KVM_EXIT_MEMORY_ERROR */
+ struct kvm_memory_exit mem;
/* Fix the size of the union. */
char padding[256];
};
--
2.17.1


2021-11-11 14:16:01

by Chao Peng

Subject: [RFC PATCH 6/6] KVM: add KVM_SPLIT_MEMORY_REGION

This new ioctl lets userspace split an existing memory region into two
parts. The first part reuses the existing memory region but with a
shrunk size. The second part is newly created.
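
Usage from userspace could look like this (a sketch; per the code below
both slots must be in the same address space, and the offset must be
page aligned and fall strictly inside slot1):

  struct kvm_split_memory_region_info info = {
          .slot1  = (as_id << 16) | old_slot_id,  /* slot to shrink */
          .slot2  = (as_id << 16) | new_slot_id,  /* newly created part */
          .offset = split_offset,                 /* bytes, page aligned */
  };

  ioctl(vm_fd, KVM_SPLIT_MEMORY_REGION, &info);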

Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/x86.c | 3 +-
include/linux/kvm_host.h | 4 ++
include/uapi/linux/kvm.h | 16 +++++
virt/kvm/kvm_main.c | 147 +++++++++++++++++++++++++++++++++++++++
4 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 98dbe602f47b..1d490c3d7766 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11020,7 +11020,8 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
const struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
- if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
+ if (change == KVM_MR_CREATE || change == KVM_MR_MOVE ||
+ change == KVM_MR_SHRINK)
return kvm_alloc_memslot_metadata(memslot,
mem->memory_size >> PAGE_SHIFT);
return 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 17fabb4f53bf..8b5a9217231b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -752,6 +752,9 @@ static inline bool memslot_is_private(const struct kvm_memory_slot *slot)
* -- move it in the guest physical memory space
* -- just change its flags
*
+ * KVM_SPLIT_MEMORY_REGION ioctl allows the following operation:
+ * - shrink an existing memory slot
+ *
* Since flags can be changed by some of these operations, the following
* differentiation is the best we can do for __kvm_set_memory_region():
*/
@@ -760,6 +763,7 @@ enum kvm_mr_change {
KVM_MR_DELETE,
KVM_MR_MOVE,
KVM_MR_FLAGS_ONLY,
+ KVM_MR_SHRINK,
};

int kvm_set_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 470c472a9451..e61c0eac91e7 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1108,6 +1108,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_DIRTY_LOG_RING 192
#define KVM_CAP_X86_BUS_LOCK_EXIT 193
#define KVM_CAP_PPC_DAWR1 194
+#define KVM_CAP_MEMORY_REGION_SPLIT 195

#define KVM_CAP_VM_TYPES 1000

@@ -1885,4 +1886,19 @@ struct kvm_dirty_gfn {
#define KVM_BUS_LOCK_DETECTION_OFF (1 << 0)
#define KVM_BUS_LOCK_DETECTION_EXIT (1 << 1)

+/**
+ * struct kvm_split_memory_region_info - Information for memory region split.
+ * @slot1: The slot to be split.
+ * @slot2: The slot for the newly split part.
+ * @offset: The offset (bytes) into @slot1 at which to split.
+ */
+struct kvm_split_memory_region_info {
+ __u32 slot1;
+ __u32 slot2;
+ __u64 offset;
+};
+
+#define KVM_SPLIT_MEMORY_REGION _IOW(KVMIO, 0xcf, \
+ struct kvm_split_memory_region_info)
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e8e2c5b28aa4..11b0f3d8b9ee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1467,6 +1467,140 @@ static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
return kvm_set_memory_region(kvm, mem);
}

+static void memslot_to_memory_region(struct kvm_userspace_memory_region *mem,
+ struct kvm_memory_slot *slot)
+{
+ mem->slot = (u32)slot->as_id << 16 | slot->id;
+ mem->flags = slot->flags;
+ mem->guest_phys_addr = slot->base_gfn << PAGE_SHIFT;
+ mem->memory_size = slot->npages << PAGE_SHIFT;
+ mem->userspace_addr = slot->userspace_addr;
+}
+
+static int kvm_split_memory_region(struct kvm *kvm, int as_id, int id1, int id2,
+ gfn_t offset)
+{
+ struct kvm_memory_slot *slot1;
+ struct kvm_memory_slot slot2, old;
+ struct kvm_userspace_memory_region mem;
+ unsigned long *dirty_bitmap_slot1 = NULL;
+ struct kvm_memslots *slots;
+ int r;
+
+ /* Make a full copy of the old memslot. */
+ slot1 = id_to_memslot(__kvm_memslots(kvm, as_id), id1);
+ if (!slot1)
+ return -EINVAL;
+ else
+ old = *slot1;
+
+ /* offset is in pages, relative to the start of slot1 */
+ if (offset == 0 || offset >= old.npages)
+ return -EINVAL;
+
+ /* Prepare the second half. */
+ slot2.as_id = as_id;
+ slot2.id = id2;
+ slot2.base_gfn = old.base_gfn + offset;
+ slot2.npages = old.npages - offset;
+ slot2.flags = old.flags;
+ slot2.userspace_addr = old.userspace_addr + (offset << PAGE_SHIFT);
+ slot2.file = old.file;
+ slot2.private_ops = old.private_ops;
+
+ if (!(old.flags & KVM_MEM_LOG_DIRTY_PAGES))
+ slot2.dirty_bitmap = NULL;
+ else if (!kvm->dirty_ring_size) {
+ slot1->npages = offset;
+ r = kvm_alloc_dirty_bitmap(slot1);
+ if (r)
+ return r;
+ else
+ dirty_bitmap_slot1 = slot1->dirty_bitmap;
+
+ r = kvm_alloc_dirty_bitmap(&slot2);
+ if (r)
+ goto out_bitmap;
+
+ //TODO: copy dirty_bitmap or return -EINVAL if logging is running
+ }
+
+// mutex_lock(&kvm->slots_arch_lock);
+
+ slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), KVM_MR_CREATE);
+ if (!slots) {
+// mutex_unlock(&kvm->slots_arch_lock);
+ r = -ENOMEM;
+ goto out_bitmap;
+ }
+
+ slot1 = id_to_memslot(slots, id1);
+ slot1->npages = offset;
+ slot1->dirty_bitmap = dirty_bitmap_slot1;
+
+ memslot_to_memory_region(&mem, slot1);
+ r = kvm_arch_prepare_memory_region(kvm, slot1, &mem, KVM_MR_SHRINK);
+ if (r)
+ goto out_slots;
+
+ memslot_to_memory_region(&mem, &slot2);
+ r = kvm_arch_prepare_memory_region(kvm, &slot2, &mem, KVM_MR_CREATE);
+ if (r)
+ goto out_slots;
+
+ update_memslots(slots, slot1, KVM_MR_SHRINK);
+ update_memslots(slots, &slot2, KVM_MR_CREATE);
+
+ slots = install_new_memslots(kvm, as_id, slots);
+
+ kvm_free_memslot(kvm, &old);
+
+ kvfree(slots);
+ return 0;
+
+out_slots:
+// mutex_unlock(&kvm->slots_arch_lock);
+ kvfree(slots);
+out_bitmap:
+ if (dirty_bitmap_slot1)
+ kvm_destroy_dirty_bitmap(slot1);
+ if (slot2.dirty_bitmap)
+ kvm_destroy_dirty_bitmap(&slot2);
+
+ return r;
+}
+
+static int kvm_vm_ioctl_split_memory_region(struct kvm *kvm,
+ struct kvm_split_memory_region_info *info)
+{
+ int as_id1, as_id2, id1, id2;
+ int r;
+
+ if ((u16)info->slot1 >= KVM_USER_MEM_SLOTS ||
+ (u16)info->slot2 >= KVM_USER_MEM_SLOTS)
+ return -EINVAL;
+ if (info->offset & (PAGE_SIZE - 1))
+ return -EINVAL;
+
+ as_id1 = info->slot1 >> 16;
+ as_id2 = info->slot2 >> 16;
+
+ if (as_id1 != as_id2 || as_id1 >= KVM_ADDRESS_SPACE_NUM)
+ return -EINVAL;
+
+ id1 = (u16)info->slot1;
+ id2 = (u16)info->slot2;
+ if (id1 == id2 || id1 >= KVM_MEM_SLOTS_NUM || id2 >= KVM_MEM_SLOTS_NUM)
+ return -EINVAL;
+
+ mutex_lock(&kvm->slots_lock);
+ r = kvm_split_memory_region(kvm, as_id1, id1, id2,
+ info->offset >> PAGE_SHIFT);
+ mutex_unlock(&kvm->slots_lock);
+
+ return r;
+}
+
#ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
/**
* kvm_get_dirty_log - get a snapshot of dirty pages
@@ -3765,6 +3899,8 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
#else
return 0;
#endif
+ case KVM_CAP_MEMORY_REGION_SPLIT:
+ return 1;
default:
break;
}
@@ -3901,6 +4037,17 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
break;
}
+ case KVM_SPLIT_MEMORY_REGION: {
+ struct kvm_split_memory_region_info info;
+
+ r = -EFAULT;
+ if (copy_from_user(&info, argp, sizeof(info)))
+ goto out;
+
+ r = kvm_vm_ioctl_split_memory_region(kvm, &info);
+ break;
+ }
+
case KVM_GET_DIRTY_LOG: {
struct kvm_dirty_log log;

--
2.17.1


2021-11-11 14:16:11

by Chao Peng

Subject: [RFC PATCH 07/13] linux-headers: Update

Signed-off-by: Chao Peng <[email protected]>
---
linux-headers/asm-x86/kvm.h | 5 +++++
linux-headers/linux/kvm.h | 29 +++++++++++++++++++++++++----
2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index a6c327f8ad..f9aadf0ebb 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -53,6 +53,10 @@
/* Architectural interrupt line count. */
#define KVM_NR_INTERRUPTS 256

+#define KVM_DEFAULT_ADDRESS_SPACE 0
+#define KVM_SMM_ADDRESS_SPACE 1
+#define KVM_PRIVATE_ADDRESS_SPACE 2
+
struct kvm_memory_alias {
__u32 slot; /* this has a different namespace than memory slots */
__u32 flags;
@@ -295,6 +299,7 @@ struct kvm_debug_exit_arch {
#define KVM_GUESTDBG_USE_HW_BP 0x00020000
#define KVM_GUESTDBG_INJECT_DB 0x00040000
#define KVM_GUESTDBG_INJECT_BP 0x00080000
+#define KVM_GUESTDBG_BLOCKIRQ 0x00100000

/* for KVM_SET_GUEST_DEBUG */
struct kvm_guest_debug_arch {
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index bcaf66cc4d..0a43202c04 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -101,6 +101,9 @@ struct kvm_userspace_memory_region {
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
__u64 userspace_addr; /* start of the userspace allocated memory */
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+ __u32 fd; /* valid if memslot is guest private memory */
+#endif
};

/*
@@ -231,6 +234,18 @@ struct kvm_xen_exit {
} u;
};

+struct kvm_memory_exit {
+#define KVM_EXIT_MEM_MAP_SHARE 1
+#define KVM_EXIT_MEM_MAP_PRIVATE 2
+ __u32 type;
+ union {
+ struct {
+ __u64 gpa;
+ __u64 size;
+ } map;
+ } u;
+};
+
#define KVM_S390_GET_SKEYS_NONE 1
#define KVM_S390_SKEYS_MAX 1048576

@@ -269,6 +284,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_AP_RESET_HOLD 32
#define KVM_EXIT_X86_BUS_LOCK 33
#define KVM_EXIT_XEN 34
+#define KVM_EXIT_MEMORY_ERROR 35

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -469,6 +485,8 @@ struct kvm_run {
} msr;
/* KVM_EXIT_XEN */
struct kvm_xen_exit xen;
+ /* KVM_EXIT_MEMORY_ERROR */
+ struct kvm_memory_exit mem;
/* Fix the size of the union. */
char padding[256];
};
@@ -1965,7 +1983,9 @@ struct kvm_stats_header {
#define KVM_STATS_TYPE_CUMULATIVE (0x0 << KVM_STATS_TYPE_SHIFT)
#define KVM_STATS_TYPE_INSTANT (0x1 << KVM_STATS_TYPE_SHIFT)
#define KVM_STATS_TYPE_PEAK (0x2 << KVM_STATS_TYPE_SHIFT)
-#define KVM_STATS_TYPE_MAX KVM_STATS_TYPE_PEAK
+#define KVM_STATS_TYPE_LINEAR_HIST (0x3 << KVM_STATS_TYPE_SHIFT)
+#define KVM_STATS_TYPE_LOG_HIST (0x4 << KVM_STATS_TYPE_SHIFT)
+#define KVM_STATS_TYPE_MAX KVM_STATS_TYPE_LOG_HIST

#define KVM_STATS_UNIT_SHIFT 4
#define KVM_STATS_UNIT_MASK (0xF << KVM_STATS_UNIT_SHIFT)
@@ -1988,8 +2008,9 @@ struct kvm_stats_header {
* @size: The number of data items for this stats.
* Every data item is of type __u64.
* @offset: The offset of the stats to the start of stat structure in
- * struture kvm or kvm_vcpu.
- * @unused: Unused field for future usage. Always 0 for now.
+ * structure kvm or kvm_vcpu.
+ * @bucket_size: A parameter value used for histogram stats. It is only used
+ * for linear histogram stats, specifying the size of the bucket;
* @name: The name string for the stats. Its size is indicated by the
* &kvm_stats_header->name_size.
*/
@@ -1998,7 +2019,7 @@ struct kvm_stats_desc {
__s16 exponent;
__u16 size;
__u32 offset;
- __u32 unused;
+ __u32 bucket_size;
char name[];
};

--
2.17.1


2021-11-11 14:16:27

by Chao Peng

Subject: [RFC PATCH 08/13] hostmem: Add guest private memory to memory backend

Currently only memfd is supported.
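
For example, a guest-private backend could then be created as follows
(a sketch matching the cover letter's usage; note the property is
registered here as "guest-private", and seal=off keeps the grow/shrink
seals off so the fd can still be resized after F_SEAL_GUEST):

  -object memory-backend-memfd,id=ram1,size=4G,guest-private=on,seal=off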

Signed-off-by: Chao Peng <[email protected]>
---
backends/hostmem-memfd.c | 12 +++++++++---
backends/hostmem.c | 24 ++++++++++++++++++++++++
include/exec/memory.h | 3 +++
include/exec/ram_addr.h | 3 ++-
include/qemu/memfd.h | 5 +++++
include/sysemu/hostmem.h | 1 +
softmmu/physmem.c | 33 +++++++++++++++++++--------------
util/memfd.c | 32 +++++++++++++++++++++++++-------
8 files changed, 88 insertions(+), 25 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 3fc85c3db8..ef057586a0 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -36,6 +36,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
{
HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
uint32_t ram_flags;
+ unsigned int seals;
char *name;
int fd;

@@ -44,10 +45,14 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
return;
}

+ seals = backend->guest_private ? F_SEAL_GUEST : 0;
+
+ if (m->seal) {
+ seals |= F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL;
+ }
+
fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
- m->hugetlb, m->hugetlbsize, m->seal ?
- F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
- errp);
+ m->hugetlb, m->hugetlbsize, seals, errp);
if (fd == -1) {
return;
}
@@ -55,6 +60,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
name = host_memory_backend_get_name(backend);
ram_flags = backend->share ? RAM_SHARED : 0;
ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+ ram_flags |= backend->guest_private ? RAM_GUEST_PRIVATE : 0;
memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
backend->size, ram_flags, fd, 0, errp);
g_free(name);
diff --git a/backends/hostmem.c b/backends/hostmem.c
index 4c05862ed5..a90d1be0a0 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -472,6 +472,23 @@ host_memory_backend_set_use_canonical_path(Object *obj, bool value,
backend->use_canonical_path = value;
}

+static bool
+host_memory_backend_get_guest_private(Object *obj, Error **errp)
+{
+ HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+ return backend->guest_private;
+
+}
+
+static void
+host_memory_backend_set_guest_private(Object *obj, bool value, Error **errp)
+{
+ HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+ backend->guest_private = value;
+}
+
static void
host_memory_backend_class_init(ObjectClass *oc, void *data)
{
@@ -542,6 +559,13 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
object_class_property_add_bool(oc, "x-use-canonical-path-for-ramblock-id",
host_memory_backend_get_use_canonical_path,
host_memory_backend_set_use_canonical_path);
+
+ object_class_property_add_bool(oc, "guest-private",
+ host_memory_backend_get_guest_private,
+ host_memory_backend_set_guest_private);
+ object_class_property_set_description(oc, "guest-private",
+ "Guest private memory");
+
}

static const TypeInfo host_memory_backend_info = {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c3d417d317..ae9d3bc574 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -190,6 +190,9 @@ typedef struct IOMMUTLBEvent {
*/
#define RAM_NORESERVE (1 << 7)

+/* RAM is guest private memory that can not be mmap-ed. */
+#define RAM_GUEST_PRIVATE (1 << 8)
+
static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
IOMMUNotifierFlag flags,
hwaddr start, hwaddr end,
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 551876bed0..32768291de 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -74,7 +74,8 @@ static inline bool clear_bmap_test_and_clear(RAMBlock *rb, uint64_t page)

static inline bool offset_in_ramblock(RAMBlock *b, ram_addr_t offset)
{
- return (b && b->host && offset < b->used_length) ? true : false;
+ return (b && (b->flags & RAM_GUEST_PRIVATE || b->host)
+ && offset < b->used_length) ? true : false;
}

static inline void *ramblock_ptr(RAMBlock *block, ram_addr_t offset)
diff --git a/include/qemu/memfd.h b/include/qemu/memfd.h
index 975b6bdb77..f021a0730a 100644
--- a/include/qemu/memfd.h
+++ b/include/qemu/memfd.h
@@ -14,6 +14,11 @@
#define F_SEAL_SHRINK 0x0002 /* prevent file from shrinking */
#define F_SEAL_GROW 0x0004 /* prevent file from growing */
#define F_SEAL_WRITE 0x0008 /* prevent writes */
+
+#endif
+
+#ifndef F_SEAL_GUEST
+#define F_SEAL_GUEST 0x0020 /* guest private memory */
#endif

#ifndef MFD_CLOEXEC
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 9ff5c16963..ddf742a69b 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -65,6 +65,7 @@ struct HostMemoryBackend {
uint64_t size;
bool merge, dump, use_canonical_path;
bool prealloc, is_mapped, share, reserve;
+ bool guest_private;
uint32_t prealloc_threads;
DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
HostMemPolicy policy;
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 23e77cb771..f4d6eeaa17 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1591,15 +1591,19 @@ static void *file_ram_alloc(RAMBlock *block,
perror("ftruncate");
}

- qemu_map_flags = readonly ? QEMU_MAP_READONLY : 0;
- qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
- qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
- qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
- area = qemu_ram_mmap(fd, memory, block->mr->align, qemu_map_flags, offset);
- if (area == MAP_FAILED) {
- error_setg_errno(errp, errno,
- "unable to map backing store for guest RAM");
- return NULL;
+ if (block->flags & RAM_GUEST_PRIVATE) {
+ area = (void*)offset;
+ } else {
+ qemu_map_flags = readonly ? QEMU_MAP_READONLY : 0;
+ qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0;
+ qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0;
+ qemu_map_flags |= (block->flags & RAM_NORESERVE) ? QEMU_MAP_NORESERVE : 0;
+ area = qemu_ram_mmap(fd, memory, block->mr->align, qemu_map_flags, offset);
+ if (area == MAP_FAILED) {
+ error_setg_errno(errp, errno,
+ "unable to map backing store for guest RAM");
+ return NULL;
+ }
}

block->fd = fd;
@@ -1971,7 +1975,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
qemu_mutex_lock_ramlist();
new_block->offset = find_ram_offset(new_block->max_length);

- if (!new_block->host) {
+ if (!new_block->host && !(new_block->flags & RAM_GUEST_PRIVATE)) {
if (xen_enabled()) {
xen_ram_alloc(new_block->offset, new_block->max_length,
new_block->mr, &err);
@@ -2028,7 +2032,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
new_block->used_length,
DIRTY_CLIENTS_ALL);

- if (new_block->host) {
+ if (new_block->host && !(new_block->flags & RAM_GUEST_PRIVATE)) {
qemu_ram_setup_dump(new_block->host, new_block->max_length);
qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_HUGEPAGE);
/*
@@ -2055,7 +2059,8 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
int64_t file_size, file_align;

/* Just support these ram flags by now. */
- assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE)) == 0);
+ assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
+ RAM_GUEST_PRIVATE)) == 0);

if (xen_enabled()) {
error_setg(errp, "-mem-path not supported with Xen");
@@ -2092,7 +2097,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
new_block->flags = ram_flags;
new_block->host = file_ram_alloc(new_block, size, fd, readonly,
!file_size, offset, errp);
- if (!new_block->host) {
+ if (!new_block->host && !(ram_flags & RAM_GUEST_PRIVATE)) {
g_free(new_block);
return NULL;
}
@@ -2392,7 +2397,7 @@ RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,

RAMBLOCK_FOREACH(block) {
/* This case append when the block is not mapped. */
- if (block->host == NULL) {
+ if (block->host == NULL && !(block->flags & RAM_GUEST_PRIVATE)) {
continue;
}
if (host - block->host < block->max_length) {
diff --git a/util/memfd.c b/util/memfd.c
index 4a3c07e0be..3b4b88d81e 100644
--- a/util/memfd.c
+++ b/util/memfd.c
@@ -76,14 +76,32 @@ int qemu_memfd_create(const char *name, size_t size, bool hugetlb,
goto err;
}

- if (ftruncate(mfd, size) == -1) {
- error_setg_errno(errp, errno, "failed to resize memfd to %zu", size);
- goto err;
- }

- if (seals && fcntl(mfd, F_ADD_SEALS, seals) == -1) {
- error_setg_errno(errp, errno, "failed to add seals 0x%x", seals);
- goto err;
+ /*
+ * The call sequence of F_ADD_SEALS and ftruncate matters here.
+ * For F_SEAL_GUEST, the size must be 0 at the time the seal is set.
+ * For F_SEAL_GROW/SHRINK, ftruncate() must be called before setting the seal.
+ */
+ if (seals & F_SEAL_GUEST) {
+ if (seals && fcntl(mfd, F_ADD_SEALS, seals) == -1) {
+ error_setg_errno(errp, errno, "failed to add seals 0x%x", seals);
+ goto err;
+ }
+
+ if (ftruncate(mfd, size) == -1) {
+ error_setg_errno(errp, errno, "failed to resize memfd to %zu", size);
+ goto err;
+ }
+ } else {
+ if (ftruncate(mfd, size) == -1) {
+ error_setg_errno(errp, errno, "failed to resize memfd to %zu", size);
+ goto err;
+ }
+
+ if (seals && fcntl(mfd, F_ADD_SEALS, seals) == -1) {
+ error_setg_errno(errp, errno, "failed to add seals 0x%x", seals);
+ goto err;
+ }
}

return mfd;
--
2.17.1


2021-11-11 14:16:37

by Chao Peng

Subject: [RFC PATCH 09/13] qmp: Include "guest-private" property for memory backends

Signed-off-by: Chao Peng <[email protected]>
---
hw/core/machine-hmp-cmds.c | 3 +++
hw/core/machine-qmp-cmds.c | 1 +
qapi/machine.json | 3 +++
qapi/qom.json | 3 +++
4 files changed, 10 insertions(+)

diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 76b22b00d6..6bd66c25b7 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -112,6 +112,9 @@ void hmp_info_memdev(Monitor *mon, const QDict *qdict)
m->value->prealloc ? "true" : "false");
monitor_printf(mon, " share: %s\n",
m->value->share ? "true" : "false");
+ monitor_printf(mon, " guest private: %s\n",
+ m->value->guest_private ? "true" : "false");
+
if (m->value->has_reserve) {
monitor_printf(mon, " reserve: %s\n",
m->value->reserve ? "true" : "false");
diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c
index 216fdfaf3a..2c1c1de73f 100644
--- a/hw/core/machine-qmp-cmds.c
+++ b/hw/core/machine-qmp-cmds.c
@@ -174,6 +174,7 @@ static int query_memdev(Object *obj, void *opaque)
m->dump = object_property_get_bool(obj, "dump", &error_abort);
m->prealloc = object_property_get_bool(obj, "prealloc", &error_abort);
m->share = object_property_get_bool(obj, "share", &error_abort);
+ m->guest_private = object_property_get_bool(obj, "guest-private", &error_abort);
m->reserve = object_property_get_bool(obj, "reserve", &err);
if (err) {
error_free_or_abort(&err);
diff --git a/qapi/machine.json b/qapi/machine.json
index 157712f006..f568a6a0bf 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -798,6 +798,8 @@
#
# @share: whether memory is private to QEMU or shared (since 6.1)
#
+# @guest-private: whether memory is private to guest (since X.X)
+#
# @reserve: whether swap space (or huge pages) was reserved if applicable.
# This corresponds to the user configuration and not the actual
# behavior implemented in the OS to perform the reservation.
@@ -818,6 +820,7 @@
'dump': 'bool',
'prealloc': 'bool',
'share': 'bool',
+ 'guest-private': 'bool',
'*reserve': 'bool',
'host-nodes': ['uint16'],
'policy': 'HostMemPolicy' }}
diff --git a/qapi/qom.json b/qapi/qom.json
index a25616bc7a..93af9b106e 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -550,6 +550,8 @@
# @share: if false, the memory is private to QEMU; if true, it is shared
# (default: false)
#
+# @guest-private: if true, the memory is guest private memory (default: false)
+#
# @reserve: if true, reserve swap space (or huge pages) if applicable
# (default: true) (since 6.1)
#
@@ -580,6 +582,7 @@
'*prealloc': 'bool',
'*prealloc-threads': 'uint32',
'*share': 'bool',
+ '*guest-private': 'bool',
'*reserve': 'bool',
'size': 'size',
'*x-use-canonical-path-for-ramblock-id': 'bool' } }
--
2.17.1
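
With the property exposed, the new field shows up in the monitor. A
hypothetical "info memdev" excerpt for a guest-private backend (backend
ID and surrounding fields are illustrative):

(qemu) info memdev
memory backend: ram1
  ...
  share: false
  guest private: true
  ...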


2021-11-11 14:16:44

by Chao Peng

[permalink] [raw]
Subject: [RFC PATCH 10/13] softmmu/physmem: Add private memory address space

Signed-off-by: Chao Peng <[email protected]>
---
include/exec/address-spaces.h | 2 ++
softmmu/physmem.c | 13 +++++++++++++
2 files changed, 15 insertions(+)

diff --git a/include/exec/address-spaces.h b/include/exec/address-spaces.h
index db8bfa9a92..b3f45001c0 100644
--- a/include/exec/address-spaces.h
+++ b/include/exec/address-spaces.h
@@ -27,6 +27,7 @@
* until a proper bus interface is available.
*/
MemoryRegion *get_system_memory(void);
+MemoryRegion *get_system_private_memory(void);

/* Get the root I/O port region. This interface should only be used
* temporarily until a proper bus interface is available.
@@ -34,6 +35,7 @@ MemoryRegion *get_system_memory(void);
MemoryRegion *get_system_io(void);

extern AddressSpace address_space_memory;
+extern AddressSpace address_space_private_memory;
extern AddressSpace address_space_io;

#endif
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index f4d6eeaa17..a2d339fd88 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -85,10 +85,13 @@
RAMList ram_list = { .blocks = QLIST_HEAD_INITIALIZER(ram_list.blocks) };

static MemoryRegion *system_memory;
+static MemoryRegion *system_private_memory;
static MemoryRegion *system_io;

AddressSpace address_space_io;
AddressSpace address_space_memory;
+AddressSpace address_space_private_memory;
+

static MemoryRegion io_mem_unassigned;

@@ -2669,6 +2672,11 @@ static void memory_map_init(void)
memory_region_init(system_memory, NULL, "system", UINT64_MAX);
address_space_init(&address_space_memory, system_memory, "memory");

+ system_private_memory = g_malloc(sizeof(*system_private_memory));
+
+ memory_region_init(system_private_memory, NULL, "system-private", UINT64_MAX);
+ address_space_init(&address_space_private_memory, system_private_memory, "private-memory");
+
system_io = g_malloc(sizeof(*system_io));
memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io",
65536);
@@ -2680,6 +2688,11 @@ MemoryRegion *get_system_memory(void)
return system_memory;
}

+MemoryRegion *get_system_private_memory(void)
+{
+ return system_private_memory;
+}
+
MemoryRegion *get_system_io(void)
{
return system_io;
--
2.17.1
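
The new address space is consumed the same way as address_space_memory.
As a usage sketch, a hypothetical helper that checks whether a
guest-physical range is currently backed by private memory (patch 12
uses this same pattern when punching holes; includes elided):

/* Sketch: true if [gpa, gpa + size) resolves in the private AS. */
static bool gpa_is_private(hwaddr gpa, uint64_t size)
{
    MemoryRegionSection section;

    section = memory_region_find(get_system_private_memory(), gpa, size);
    if (!section.mr) {
        return false;
    }
    /* memory_region_find() took a reference on a hit; drop it. */
    memory_region_unref(section.mr);
    return true;
}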


2021-11-11 14:17:05

by Chao Peng

[permalink] [raw]
Subject: [RFC PATCH 12/13] kvm: handle private to shared memory conversion

Signed-off-by: Chao Peng <[email protected]>
---
accel/kvm/kvm-all.c | 49 ++++++++++++++++++++++++++++++++++++++++++
include/sysemu/kvm.h | 1 +
target/arm/kvm.c | 5 +++++
target/i386/kvm/kvm.c | 27 +++++++++++++++++++++++
target/mips/kvm.c | 5 +++++
target/ppc/kvm.c | 5 +++++
target/s390x/kvm/kvm.c | 5 +++++
7 files changed, 97 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index d336458e9e..6feda9c89b 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1445,6 +1445,38 @@ out:
kvm_slots_unlock();
}

+static int kvm_map_private_memory(hwaddr start, hwaddr size)
+{
+ return 0;
+}
+
+static int kvm_map_shared_memory(hwaddr start, hwaddr size)
+{
+ MemoryRegionSection section;
+ void *addr;
+ RAMBlock *rb;
+ ram_addr_t offset;
+
+ /* Punch a hole in private memory. */
+ section = memory_region_find(get_system_private_memory(), start, size);
+ if (section.mr) {
+ addr = memory_region_get_ram_ptr(section.mr) +
+ section.offset_within_region;
+ rb = qemu_ram_block_from_host(addr, false, &offset);
+ ram_block_discard_range(rb, offset, size);
+ memory_region_unref(section.mr);
+ }
+
+ /* Create new shared memory. */
+ section = memory_region_find(get_system_memory(), start, size);
+ if (section.mr) {
+ memory_region_unref(section.mr);
+ return -1; /* Already exists. */
+ }
+
+ return kvm_arch_map_shared_memory(start, size);
+}
+
static void *kvm_dirty_ring_reaper_thread(void *data)
{
KVMState *s = data;
@@ -2957,6 +2989,23 @@ int kvm_cpu_exec(CPUState *cpu)
break;
}
break;
+ case KVM_EXIT_MEMORY_ERROR:
+ switch (run->mem.type) {
+ case KVM_EXIT_MEM_MAP_PRIVATE:
+ ret = kvm_map_private_memory(run->mem.u.map.gpa,
+ run->mem.u.map.size);
+ break;
+ case KVM_EXIT_MEM_MAP_SHARE:
+ ret = kvm_map_shared_memory(run->mem.u.map.gpa,
+ run->mem.u.map.size);
+ break;
+ default:
+ DPRINTF("kvm_arch_handle_exit\n");
+ ret = kvm_arch_handle_exit(cpu, run);
+ break;
+ }
+ break;
+
default:
DPRINTF("kvm_arch_handle_exit\n");
ret = kvm_arch_handle_exit(cpu, run);
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a1ab1ee12d..5f00aa0ee0 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -547,4 +547,5 @@ bool kvm_cpu_check_are_resettable(void);

bool kvm_arch_cpu_check_are_resettable(void);

+int kvm_arch_map_shared_memory(hwaddr start, hwaddr size);
#endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 5d55de1a49..97e51b8b88 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1051,3 +1051,8 @@ bool kvm_arch_cpu_check_are_resettable(void)
{
return true;
}
+
+int kvm_arch_map_shared_memory(hwaddr start, hwaddr size)
+{
+ return 0;
+}
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 500d2e0e68..b3209402bc 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -4925,3 +4925,30 @@ bool kvm_arch_cpu_check_are_resettable(void)
{
return !sev_es_enabled();
}
+
+int kvm_arch_map_shared_memory(hwaddr start, hwaddr size)
+{
+ MachineState *pcms = current_machine;
+ X86MachineState *x86ms = X86_MACHINE(pcms);
+ MemoryRegion *system_memory = get_system_memory();
+ MemoryRegion *region;
+ char name[134];
+ hwaddr offset;
+
+ if (start + size < x86ms->below_4g_mem_size) {
+ snprintf(name, sizeof(name), "0x%" HWADDR_PRIx "@0x%" HWADDR_PRIx, size, start);
+ region = g_malloc(sizeof(*region));
+ memory_region_init_alias(region, NULL, name, pcms->ram, start, size);
+ memory_region_add_subregion(system_memory, start, region);
+ return 0;
+ } else if (start > 0x100000000ULL) {
+ snprintf(name, sizeof(name), "0x%" HWADDR_PRIx "@0x%" HWADDR_PRIx, size, start);
+ offset = start - 0x100000000ULL + x86ms->below_4g_mem_size;
+ region = g_malloc(sizeof(*region));
+ memory_region_init_alias(region, NULL, name, pcms->ram, offset, size);
+ memory_region_add_subregion(system_memory, start, region);
+ return 0;
+ }
+
+ return -1;
+}
diff --git a/target/mips/kvm.c b/target/mips/kvm.c
index 086debd9f0..4aed54aa9f 100644
--- a/target/mips/kvm.c
+++ b/target/mips/kvm.c
@@ -1295,3 +1295,8 @@ bool kvm_arch_cpu_check_are_resettable(void)
{
return true;
}
+
+int kvm_arch_map_shared_memory(hwaddr start, hwaddr size)
+{
+ return 0;
+}
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index dc93b99189..cc31a7c38d 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -2959,3 +2959,8 @@ bool kvm_arch_cpu_check_are_resettable(void)
{
return true;
}
+
+int kvm_arch_map_shared_memory(hwaddr start, hwaddr size)
+{
+ return 0;
+}
diff --git a/target/s390x/kvm/kvm.c b/target/s390x/kvm/kvm.c
index 5b1fdb55c4..4a9161ba3a 100644
--- a/target/s390x/kvm/kvm.c
+++ b/target/s390x/kvm/kvm.c
@@ -2562,3 +2562,8 @@ bool kvm_arch_cpu_check_are_resettable(void)
{
return true;
}
+
+int kvm_arch_map_shared_memory(hwaddr start, hwaddr size)
+{
+ return 0;
+}
--
2.17.1
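
Putting this together with the address-space split from patch 10, a
shared access to a currently-private page flows roughly as follows (a
sketch of the sequence, not a definitive trace):

/*
 * Guest:  accesses gpa as shared memory.
 * KVM:    no shared mapping, page allocated in the private fd
 *         -> KVM_EXIT_MEMORY_ERROR with type = KVM_EXIT_MEM_MAP_SHARE,
 *            mem.u.map = { .gpa = gpa, .size = PAGE_SIZE }.
 * QEMU:   kvm_cpu_exec() dispatches to kvm_map_shared_memory():
 *           1. memory_region_find() on the private address space;
 *              ram_block_discard_range() punches a hole in the
 *              private fd;
 *           2. memory_region_find() on system memory confirms no
 *              shared mapping exists yet;
 *           3. kvm_arch_map_shared_memory() aliases the range into
 *              system memory, which the KVM memory listener turns
 *              into a new memslot.
 * KVM:    re-faults the access and maps the now-shared page.
 */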


2021-11-11 14:17:16

by Chao Peng

[permalink] [raw]
Subject: [RFC PATCH 11/13] kvm: register private memory slots

Signed-off-by: Chao Peng <[email protected]>
---
accel/kvm/kvm-all.c | 9 +++++++++
include/sysemu/kvm_int.h | 1 +
2 files changed, 10 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 0125c17edb..d336458e9e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -138,6 +138,7 @@ struct KVMState
QTAILQ_HEAD(, KVMMSIRoute) msi_hashtab[KVM_MSI_HASHTAB_SIZE];
#endif
KVMMemoryListener memory_listener;
+ KVMMemoryListener private_memory_listener;
QLIST_HEAD(, KVMParkedVcpu) kvm_parked_vcpus;

/* For "info mtree -f" to tell if an MR is registered in KVM */
@@ -359,6 +360,7 @@ static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, boo
mem.guest_phys_addr = slot->start_addr;
mem.userspace_addr = (unsigned long)slot->ram;
mem.flags = slot->flags;
+ mem.fd = slot->fd;

if (slot->memory_size && !new && (mem.flags ^ slot->old_flags) & KVM_MEM_READONLY) {
/* Set the slot size to 0 before setting the slot to the desired
@@ -1423,6 +1425,9 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
mem->ram_start_offset = ram_start_offset;
mem->ram = ram;
mem->flags = kvm_mem_flags(mr);
+ if (mr->ram_block) {
+ mem->fd = mr->ram_block->fd;
+ }
kvm_slot_init_dirty_bitmap(mem);
err = kvm_set_user_memory_region(kml, mem, true);
if (err) {
@@ -2580,6 +2585,9 @@ static int kvm_init(MachineState *ms)

kvm_memory_listener_register(s, &s->memory_listener,
&address_space_memory, 0);
+ kvm_memory_listener_register(s, &s->private_memory_listener,
+ &address_space_private_memory, 2);
+
if (kvm_eventfds_allowed) {
memory_listener_register(&kvm_io_listener,
&address_space_io);
@@ -2613,6 +2621,7 @@ err:
close(s->fd);
}
g_free(s->memory_listener.slots);
+ g_free(s->private_memory_listener.slots);

return ret;
}
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index c788452cd9..0c11c63263 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -28,6 +28,7 @@ typedef struct KVMSlot
int as_id;
/* Cache of the offset in ram address space */
ram_addr_t ram_start_offset;
+ int fd;
} KVMSlot;

typedef struct KVMMemoryListener {
--
2.17.1
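
For context, the kernel half of this series (patches 2-6) extends the
memslot ioctl with the backing fd consumed by
kvm_set_user_memory_region() above. A rough sketch of the assumed
layout (illustrative only, not the authoritative uapi):

/* Sketch of the extended KVM_SET_USER_MEMORY_REGION argument. */
struct kvm_userspace_memory_region {
    __u32 slot;
    __u32 flags;
    __u64 guest_phys_addr;
    __u64 memory_size;    /* bytes */
    __u64 userspace_addr; /* start of the userspace allocated memory */
    __u32 fd;             /* private memory backing fd (this series) */
};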


2021-11-11 14:17:29

by Chao Peng

[permalink] [raw]
Subject: [RFC PATCH 13/13] machine: Add 'private-memory-backend' property

Signed-off-by: Chao Peng <[email protected]>
---
hw/core/machine.c | 38 ++++++++++++++++++++++++++++++++++++++
hw/i386/pc.c | 22 ++++++++++++++++------
include/hw/boards.h | 2 ++
softmmu/vl.c | 16 ++++++++++------
4 files changed, 66 insertions(+), 12 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 067f42b528..d092bf400b 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -589,6 +589,22 @@ static void machine_set_memdev(Object *obj, const char *value, Error **errp)
ms->ram_memdev_id = g_strdup(value);
}

+static char *machine_get_private_memdev(Object *obj, Error **errp)
+{
+ MachineState *ms = MACHINE(obj);
+
+ return g_strdup(ms->private_ram_memdev_id);
+}
+
+static void machine_set_private_memdev(Object *obj, const char *value,
+ Error **errp)
+{
+ MachineState *ms = MACHINE(obj);
+
+ g_free(ms->private_ram_memdev_id);
+ ms->private_ram_memdev_id = g_strdup(value);
+}
+
static void machine_init_notify(Notifier *notifier, void *data)
{
MachineState *machine = MACHINE(qdev_get_machine());
@@ -962,6 +978,13 @@ static void machine_class_init(ObjectClass *oc, void *data)
object_class_property_set_description(oc, "memory-backend",
"Set RAM backend"
"Valid value is ID of hostmem based backend");
+
+ object_class_property_add_str(oc, "private-memory-backend",
+ machine_get_private_memdev,
+ machine_set_private_memdev);
+ object_class_property_set_description(oc, "private-memory-backend",
+ "Set guest private RAM backend"
+ "Valid value is ID of hostmem based backend");
}

static void machine_class_base_init(ObjectClass *oc, void *data)
@@ -1208,6 +1231,21 @@ void machine_run_board_init(MachineState *machine)
machine->ram = machine_consume_memdev(machine, MEMORY_BACKEND(o));
}

+ if (machine->private_ram_memdev_id) {
+ Object *o;
+ HostMemoryBackend *backend;
+ o = object_resolve_path_type(machine->private_ram_memdev_id,
+ TYPE_MEMORY_BACKEND, NULL);
+ backend = MEMORY_BACKEND(o);
+ if (backend->guest_private) {
+ machine->private_ram = machine_consume_memdev(machine, backend);
+ } else {
+ error_report("memorybaend %s is not guest private memory.",
+ object_get_canonical_path_component(OBJECT(backend)));
+ exit(EXIT_FAILURE);
+ }
+ }
+
if (machine->numa_state) {
numa_complete_configuration(machine);
if (machine->numa_state->num_nodes) {
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 1276bfeee4..e6209428c1 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -865,30 +865,40 @@ void pc_memory_init(PCMachineState *pcms,
MachineClass *mc = MACHINE_GET_CLASS(machine);
PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
X86MachineState *x86ms = X86_MACHINE(pcms);
+ MemoryRegion *ram, *root_region;

assert(machine->ram_size == x86ms->below_4g_mem_size +
x86ms->above_4g_mem_size);

linux_boot = (machine->kernel_filename != NULL);

+ *ram_memory = machine->ram;
+
+ /* Map private memory if set. Shared memory will be mapped per request. */
+ if (machine->private_ram) {
+ ram = machine->private_ram;
+ root_region = get_system_private_memory();
+ } else {
+ ram = machine->ram;
+ root_region = system_memory;
+ }
+
/*
* Split single memory region and use aliases to address portions of it,
* done for backwards compatibility with older qemus.
*/
- *ram_memory = machine->ram;
ram_below_4g = g_malloc(sizeof(*ram_below_4g));
- memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", machine->ram,
+ memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", ram,
0, x86ms->below_4g_mem_size);
- memory_region_add_subregion(system_memory, 0, ram_below_4g);
+ memory_region_add_subregion(root_region, 0, ram_below_4g);
e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
if (x86ms->above_4g_mem_size > 0) {
ram_above_4g = g_malloc(sizeof(*ram_above_4g));
memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
- machine->ram,
+ ram,
x86ms->below_4g_mem_size,
x86ms->above_4g_mem_size);
- memory_region_add_subregion(system_memory, 0x100000000ULL,
- ram_above_4g);
+ memory_region_add_subregion(root_region, 0x100000000ULL, ram_above_4g);
e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
}

diff --git a/include/hw/boards.h b/include/hw/boards.h
index 463a5514f9..dd6a3a3e03 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -313,11 +313,13 @@ struct MachineState {
bool enable_graphics;
ConfidentialGuestSupport *cgs;
char *ram_memdev_id;
+ char *private_ram_memdev_id;
/*
* convenience alias to ram_memdev_id backend memory region
* or to numa container memory region
*/
MemoryRegion *ram;
+ MemoryRegion *private_ram;
DeviceMemoryState *device_memory;

ram_addr_t ram_size;
diff --git a/softmmu/vl.c b/softmmu/vl.c
index ea05bb39c5..9665ccdb16 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -1985,17 +1985,15 @@ static bool have_custom_ram_size(void)
return !!qemu_opt_get_size(opts, "size", 0);
}

-static void qemu_resolve_machine_memdev(void)
+static void check_memdev(char *id)
{
- if (current_machine->ram_memdev_id) {
+ if (id) {
Object *backend;
ram_addr_t backend_size;

- backend = object_resolve_path_type(current_machine->ram_memdev_id,
- TYPE_MEMORY_BACKEND, NULL);
+ backend = object_resolve_path_type(id, TYPE_MEMORY_BACKEND, NULL);
if (!backend) {
- error_report("Memory backend '%s' not found",
- current_machine->ram_memdev_id);
+ error_report("Memory backend '%s' not found", id);
exit(EXIT_FAILURE);
}
backend_size = object_property_get_uint(backend, "size", &error_abort);
@@ -2011,6 +2009,12 @@ static void qemu_resolve_machine_memdev(void)
}
ram_size = backend_size;
}
+}
+
+static void qemu_resolve_machine_memdev(void)
+{
+ check_memdev(current_machine->ram_memdev_id);
+ check_memdev(current_machine->private_ram_memdev_id);

if (!xen_enabled()) {
/* On 32-bit hosts, QEMU is limited by virtual address space */
--
2.17.1
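
If the backend named by private-memory-backend was not created as
guest-private, startup aborts. An illustrative transcript (the backend
ID is an example):

qemu-system-x86_64: memory backend ram1 is not guest private memory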


2021-11-11 15:08:59

by Mika Penttilä

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] kvm: x86: add KVM_EXIT_MEMORY_ERROR exit



On 11.11.2021 16.13, Chao Peng wrote:
> Currently supports exiting to userspace for private/shared memory
> conversion.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 20 ++++++++++++++++++++
> include/uapi/linux/kvm.h | 15 +++++++++++++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index af5ecf4ef62a..780868888aa8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3950,6 +3950,17 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>
> slot = __kvm_vcpu_gfn_to_memslot(vcpu, gfn, private);
>
> + /*
> + * Exit to userspace to map the requested private/shared memory region
> + * if there is no memslot and (a) the access is private or (b) there is
> + * an existing private memslot. Emulated MMIO must be accessed through
> + * shared GPAs, thus a memslot miss on a private GPA is always handled
> + * as an implicit conversion "request".
> + */
> + if (!slot &&
> + (private || __kvm_vcpu_gfn_to_memslot(vcpu, gfn, true)))
> + goto out_convert;
> +
> /* Don't expose aliases for no slot GFNs or private memslots */
> if ((cr2_or_gpa & vcpu_gpa_stolen_mask(vcpu)) &&
> !kvm_is_visible_memslot(slot)) {
> @@ -3994,6 +4005,15 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
> write, writable, hva);
> return false;
> +
> +out_convert:
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
> + vcpu->run->mem.type = private ? KVM_EXIT_MEM_MAP_PRIVATE
> + : KVM_EXIT_MEM_MAP_SHARE;
> + vcpu->run->mem.u.map.gpa = cr2_or_gpa;
> + vcpu->run->mem.u.map.size = PAGE_SIZE;
> + return true;
> +
>
I think this just retries, with no exit to user space?




> }
>
> static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 8d20caae9180..470c472a9451 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -233,6 +233,18 @@ struct kvm_xen_exit {
> } u;
> };
>
> +struct kvm_memory_exit {
> +#define KVM_EXIT_MEM_MAP_SHARE 1
> +#define KVM_EXIT_MEM_MAP_PRIVATE 2
> + __u32 type;
> + union {
> + struct {
> + __u64 gpa;
> + __u64 size;
> + } map;
> + } u;
> +};
> +
> #define KVM_S390_GET_SKEYS_NONE 1
> #define KVM_S390_SKEYS_MAX 1048576
>
> @@ -272,6 +284,7 @@ struct kvm_xen_exit {
> #define KVM_EXIT_X86_BUS_LOCK 33
> #define KVM_EXIT_XEN 34
> #define KVM_EXIT_TDVMCALL 35
> +#define KVM_EXIT_MEMORY_ERROR 36
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -455,6 +468,8 @@ struct kvm_run {
> __u64 subfunc;
> __u64 param[4];
> } tdvmcall;
> + /* KVM_EXIT_MEMORY_ERROR */
> + struct kvm_memory_exit mem;
> /* Fix the size of the union. */
> char padding[256];
> };


2021-11-12 05:51:40

by Chao Peng

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] kvm: x86: add KVM_EXIT_MEMORY_ERROR exit

On Thu, Nov 11, 2021 at 05:08:47PM +0200, Mika Penttilä wrote:
>
>
> On 11.11.2021 16.13, Chao Peng wrote:
> > Currently supports exiting to userspace for private/shared memory
> > conversion.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 20 ++++++++++++++++++++
> > include/uapi/linux/kvm.h | 15 +++++++++++++++
> > 2 files changed, 35 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index af5ecf4ef62a..780868888aa8 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3950,6 +3950,17 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> > slot = __kvm_vcpu_gfn_to_memslot(vcpu, gfn, private);
> > + /*
> > + * Exit to userspace to map the requested private/shared memory region
> > + * if there is no memslot and (a) the access is private or (b) there is
> > + * an existing private memslot. Emulated MMIO must be accessed through
> > + * shared GPAs, thus a memslot miss on a private GPA is always handled
> > + * as an implicit conversion "request".
> > + */
> > + if (!slot &&
> > + (private || __kvm_vcpu_gfn_to_memslot(vcpu, gfn, true)))
> > + goto out_convert;
> > +
> > /* Don't expose aliases for no slot GFNs or private memslots */
> > if ((cr2_or_gpa & vcpu_gpa_stolen_mask(vcpu)) &&
> > !kvm_is_visible_memslot(slot)) {
> > @@ -3994,6 +4005,15 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> > *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
> > write, writable, hva);
> > return false;
> > +
> > +out_convert:
> > + vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
> > + vcpu->run->mem.type = private ? KVM_EXIT_MEM_MAP_PRIVATE
> > + : KVM_EXIT_MEM_MAP_SHARE;
> > + vcpu->run->mem.u.map.gpa = cr2_or_gpa;
> > + vcpu->run->mem.u.map.size = PAGE_SIZE;
> > + return true;
> > +
> I think this just retries, with no exit to user space?

Good catch, thanks.
Chao
>
>
>
>
> > }
> > static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 8d20caae9180..470c472a9451 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -233,6 +233,18 @@ struct kvm_xen_exit {
> > } u;
> > };
> > +struct kvm_memory_exit {
> > +#define KVM_EXIT_MEM_MAP_SHARE 1
> > +#define KVM_EXIT_MEM_MAP_PRIVATE 2
> > + __u32 type;
> > + union {
> > + struct {
> > + __u64 gpa;
> > + __u64 size;
> > + } map;
> > + } u;
> > +};
> > +
> > #define KVM_S390_GET_SKEYS_NONE 1
> > #define KVM_S390_SKEYS_MAX 1048576
> > @@ -272,6 +284,7 @@ struct kvm_xen_exit {
> > #define KVM_EXIT_X86_BUS_LOCK 33
> > #define KVM_EXIT_XEN 34
> > #define KVM_EXIT_TDVMCALL 35
> > +#define KVM_EXIT_MEMORY_ERROR 36
> > /* For KVM_EXIT_INTERNAL_ERROR */
> > /* Emulate instruction failed. */
> > @@ -455,6 +468,8 @@ struct kvm_run {
> > __u64 subfunc;
> > __u64 param[4];
> > } tdvmcall;
> > + /* KVM_EXIT_MEMORY_ERROR */
> > + struct kvm_memory_exit mem;
> > /* Fix the size of the union. */
> > char padding[256];
> > };
>

2021-11-12 19:28:22

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] mm: Add F_SEAL_GUEST to shmem/memfd

On Thu, Nov 11, 2021 at 10:13:40PM +0800, Chao Peng wrote:
> The new seal is only allowed if there are no pre-existing pages in the fd
> and no existing mapping of the file. After the seal is set, no
> read/write/mmap from userspace is allowed.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>

Below is a replacement patch with fallocate callback support.

I also replaced page_level with the order of the page, because
PG_LEVEL_2M/4K is x86-specific and cannot be used in generic code.

There's also a bugfix in guest_invalidate_page().


From 9419ccb4bc3c1df4cc88f6c8ba212f4b16955559 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Fri, 12 Nov 2021 21:27:40 +0300
Subject: [PATCH] mm/shmem: Introduce F_SEAL_GUEST

The new seal type provides the semantics required for KVM guest private
memory support. A file descriptor with the seal set is going to be used
as a source of guest memory in confidential computing environments such
as Intel TDX and AMD SEV.

F_SEAL_GUEST can only be set on an empty memfd. After the seal is set,
userspace cannot read, write or mmap the memfd.

Userspace is in charge of the guest memory lifecycle: it can allocate
memory with fallocate() or punch a hole to free memory from the guest.

The file descriptor is passed down to KVM as the guest memory backend.
KVM registers itself as the owner of the memfd via
memfd_register_guest().

KVM provides callbacks that need to be called on fallocate() and punch
hole.

memfd_register_guest() returns callbacks that need to be used for
requesting a new page from the memfd.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/memfd.h | 24 ++++++++
include/linux/shmem_fs.h | 9 +++
include/uapi/linux/fcntl.h | 1 +
mm/memfd.c | 32 +++++++++-
mm/shmem.c | 117 ++++++++++++++++++++++++++++++++++++-
5 files changed, 179 insertions(+), 4 deletions(-)

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..500dfe88043e 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -4,13 +4,37 @@

#include <linux/file.h>

+struct guest_ops {
+ void (*invalidate_page_range)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+ void (*fallocate)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+};
+
+struct guest_mem_ops {
+ unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset,
+ int *order);
+ void (*put_unlock_pfn)(unsigned long pfn);
+
+};
+
#ifdef CONFIG_MEMFD_CREATE
extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
+
+extern int memfd_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops);
#else
static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
{
return -EINVAL;
}
+static inline int memfd_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops)
+{
+ return -EINVAL;
+}
#endif

#endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 8e775ce517bb..265d0c13bc5e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,9 @@

/* inode in-kernel data */

+struct guest_ops;
+struct guest_mem_ops;
+
struct shmem_inode_info {
spinlock_t lock;
unsigned int seals; /* shmem seals */
@@ -24,6 +27,8 @@ struct shmem_inode_info {
struct simple_xattrs xattrs; /* list of xattrs */
atomic_t stop_eviction; /* hold when working on inode */
struct inode vfs_inode;
+ void *guest_owner;
+ const struct guest_ops *guest_ops;
};

struct shmem_sb_info {
@@ -90,6 +95,10 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
pgoff_t start, pgoff_t end);

+extern int shmem_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops);
+
/* Flag allocation requirements to shmem_getpage */
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..c79bc8572721 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
#define F_SEAL_GROW 0x0004 /* prevent file from growing */
#define F_SEAL_WRITE 0x0008 /* prevent writes */
#define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
+#define F_SEAL_GUEST 0x0020
/* (1U << 31) is reserved for signed error codes */

/*
diff --git a/mm/memfd.c b/mm/memfd.c
index 081dd33e6a61..ae43454789f4 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
return NULL;
}

+int memfd_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops)
+{
+ if (shmem_mapping(inode->i_mapping)) {
+ return shmem_register_guest(inode, owner,
+ guest_ops, guest_mem_ops);
+ }
+
+ return -EINVAL;
+}
+
#define F_ALL_SEALS (F_SEAL_SEAL | \
F_SEAL_SHRINK | \
F_SEAL_GROW | \
F_SEAL_WRITE | \
- F_SEAL_FUTURE_WRITE)
+ F_SEAL_FUTURE_WRITE | \
+ F_SEAL_GUEST)

static int memfd_add_seals(struct file *file, unsigned int seals)
{
@@ -203,10 +216,27 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
}
}

+ if (seals & F_SEAL_GUEST) {
+ i_mmap_lock_read(inode->i_mapping);
+
+ if (!RB_EMPTY_ROOT(&inode->i_mapping->i_mmap.rb_root)) {
+ error = -EBUSY;
+ goto unlock;
+ }
+
+ if (i_size_read(inode)) {
+ error = -EBUSY;
+ goto unlock;
+ }
+ }
+
*file_seals |= seals;
error = 0;

unlock:
+ if (seals & F_SEAL_GUEST)
+ i_mmap_unlock_read(inode->i_mapping);
+
inode_unlock(inode);
return error;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index dacda7463d54..5d8ea4f02a94 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -80,6 +80,7 @@ static struct vfsmount *shm_mnt;
#include <linux/userfaultfd_k.h>
#include <linux/rmap.h>
#include <linux/uuid.h>
+#include <linux/memfd.h>

#include <linux/uaccess.h>

@@ -883,6 +884,18 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
return split_huge_page(page) >= 0;
}

+static void guest_invalidate_page(struct inode *inode,
+ struct page *page, pgoff_t start, pgoff_t end)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (!info->guest_ops || !info->guest_ops->invalidate_page_range)
+ return;
+
+ start = max(start, page->index);
+ end = min(end, page->index + thp_nr_pages(page)) - 1;
+
+ info->guest_ops->invalidate_page_range(inode, info->guest_owner,
+ start, end);
+}
+
/*
* Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -923,6 +936,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
index += thp_nr_pages(page) - 1;

+ guest_invalidate_page(inode, page, start, end);
+
if (!unfalloc || !PageUptodate(page))
truncate_inode_page(mapping, page);
unlock_page(page);
@@ -999,6 +1014,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index--;
break;
}
+
+ guest_invalidate_page(inode, page, start, end);
+
VM_BUG_ON_PAGE(PageWriteback(page), page);
if (shmem_punch_compound(page, start, end))
truncate_inode_page(mapping, page);
@@ -1074,6 +1092,9 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
(newsize > oldsize && (info->seals & F_SEAL_GROW)))
return -EPERM;

+ if ((info->seals & F_SEAL_GUEST) && (newsize & ~PAGE_MASK))
+ return -EINVAL;
+
if (newsize != oldsize) {
error = shmem_reacct_size(SHMEM_I(inode)->flags,
oldsize, newsize);
@@ -1348,6 +1369,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
goto redirty;
if (!total_swap_pages)
goto redirty;
+ if (info->seals & F_SEAL_GUEST)
+ goto redirty;

/*
* Our capabilities prevent regular writeback or sync from ever calling
@@ -2274,6 +2297,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
if (ret)
return ret;

+ if (info->seals & F_SEAL_GUEST)
+ return -EPERM;
+
/* arm64 - allow memory tagging on RAM-based files */
vma->vm_flags |= VM_MTE_ALLOWED;

@@ -2471,12 +2497,14 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
pgoff_t index = pos >> PAGE_SHIFT;

/* i_mutex is held by caller */
- if (unlikely(info->seals & (F_SEAL_GROW |
- F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+ if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE |
+ F_SEAL_FUTURE_WRITE | F_SEAL_GUEST))) {
if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
return -EPERM;
if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
return -EPERM;
+ if (info->seals & F_SEAL_GUEST)
+ return -EPERM;
}

return shmem_getpage(inode, index, pagep, SGP_WRITE);
@@ -2550,6 +2578,20 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
end_index = i_size >> PAGE_SHIFT;
if (index > end_index)
break;
+
+ /*
+ * inode_lock protects setting up seals as well as writes to
+ * i_size. Setting F_SEAL_GUEST is only allowed with i_size == 0.
+ *
+ * Check F_SEAL_GUEST after i_size. It effectively serializes
+ * reads vs. setting F_SEAL_GUEST without taking inode_lock in
+ * the read path.
+ */
+ if (SHMEM_I(inode)->seals & F_SEAL_GUEST) {
+ error = -EPERM;
+ break;
+ }
+
if (index == end_index) {
nr = i_size & ~PAGE_MASK;
if (nr <= offset)
@@ -2675,6 +2717,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
goto out;
}

+ if ((info->seals & F_SEAL_GUEST) &&
+ (offset & ~PAGE_MASK || len & ~PAGE_MASK)) {
+ error = -EINVAL;
+ goto out;
+ }
+
shmem_falloc.waitq = &shmem_falloc_waitq;
shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
@@ -2771,6 +2819,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
i_size_write(inode, offset + len);
inode->i_ctime = current_time(inode);
+ if (info->guest_ops && info->guest_ops->fallocate)
+ info->guest_ops->fallocate(inode, info->guest_owner, start, end);
undone:
spin_lock(&inode->i_lock);
inode->i_private = NULL;
@@ -3761,6 +3810,20 @@ static void shmem_destroy_inodecache(void)
kmem_cache_destroy(shmem_inode_cachep);
}

+#ifdef CONFIG_MIGRATION
+int shmem_migrate_page(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
+{
+ struct inode *inode = mapping->host;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (info->seals & F_SEAL_GUEST)
+ return -ENOTSUPP;
+ return migrate_page(mapping, newpage, page, mode);
+}
+#endif
+
const struct address_space_operations shmem_aops = {
.writepage = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
@@ -3769,12 +3832,60 @@ const struct address_space_operations shmem_aops = {
.write_end = shmem_write_end,
#endif
#ifdef CONFIG_MIGRATION
- .migratepage = migrate_page,
+ .migratepage = shmem_migrate_page,
#endif
.error_remove_page = generic_error_remove_page,
};
EXPORT_SYMBOL(shmem_aops);

+static unsigned long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset,
+ int *order)
+{
+ struct page *page;
+ int ret;
+
+ ret = shmem_getpage(inode, offset, &page, SGP_WRITE);
+ if (ret)
+ return ret;
+
+ *order = thp_order(thp_head(page));
+
+ return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+ set_page_dirty(page);
+ unlock_page(page);
+ put_page(page);
+}
+
+static const struct guest_mem_ops shmem_guest_ops = {
+ .get_lock_pfn = shmem_get_lock_pfn,
+ .put_unlock_pfn = shmem_put_unlock_pfn,
+};
+
+int shmem_register_guest(struct inode *inode, void *owner,
+ const struct guest_ops *guest_ops,
+ const struct guest_mem_ops **guest_mem_ops)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (!owner)
+ return -EINVAL;
+
+ if (info->guest_owner && info->guest_owner != owner)
+ return -EPERM;
+
+ info->guest_owner = owner;
+ info->guest_ops = guest_ops;
+ *guest_mem_ops = &shmem_guest_ops;
+ return 0;
+}
+
static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
.get_unmapped_area = shmem_get_unmapped_area,
--
Kirill A. Shutemov
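
A sketch of how the KVM side is expected to consume this interface.
Only memfd_register_guest() and the two ops tables come from the patch;
the kvm_guest_* names and bodies below are hypothetical:

#include <linux/memfd.h>

static void kvm_guest_invalidate(struct inode *inode, void *owner,
				 pgoff_t start, pgoff_t end)
{
	/* Hypothetical: zap the GFN range backed by [start, end]. */
}

static void kvm_guest_fallocate(struct inode *inode, void *owner,
				pgoff_t start, pgoff_t end)
{
	/* Hypothetical: note that [start, end] became private. */
}

static const struct guest_ops kvm_guest_ops = {
	.invalidate_page_range = kvm_guest_invalidate,
	.fallocate = kvm_guest_fallocate,
};

/* Registration, e.g. when a private memslot is created. */
static int kvm_register_private_memfd(void *kvm, struct inode *inode,
				      const struct guest_mem_ops **mem_ops)
{
	return memfd_register_guest(inode, kvm, &kvm_guest_ops, mem_ops);
}

/* Faulting in a private page through the returned callbacks. */
static unsigned long kvm_private_pfn(const struct guest_mem_ops *mem_ops,
				     struct inode *inode, pgoff_t offset)
{
	int order;
	unsigned long pfn = mem_ops->get_lock_pfn(inode, offset, &order);

	/* ... install pfn (2^order pages) into guest page tables ... */
	mem_ops->put_unlock_pfn(pfn);
	return pfn;
}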

2022-01-20 01:10:28

by Philippe Mathieu-Daudé

[permalink] [raw]
Subject: Re: [RFC PATCH 10/13] softmmu/physmem: Add private memory address space

Hi,

On 11/11/21 15:13, Chao Peng wrote:
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/exec/address-spaces.h | 2 ++
> softmmu/physmem.c | 13 +++++++++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/include/exec/address-spaces.h b/include/exec/address-spaces.h
> index db8bfa9a92..b3f45001c0 100644
> --- a/include/exec/address-spaces.h
> +++ b/include/exec/address-spaces.h
> @@ -27,6 +27,7 @@
> * until a proper bus interface is available.
> */
> MemoryRegion *get_system_memory(void);
> +MemoryRegion *get_system_private_memory(void);
>
> /* Get the root I/O port region. This interface should only be used
> * temporarily until a proper bus interface is available.
> @@ -34,6 +35,7 @@ MemoryRegion *get_system_memory(void);
> MemoryRegion *get_system_io(void);
>
> extern AddressSpace address_space_memory;
> +extern AddressSpace address_space_private_memory;
> extern AddressSpace address_space_io;
>
> #endif
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index f4d6eeaa17..a2d339fd88 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -85,10 +85,13 @@
> RAMList ram_list = { .blocks = QLIST_HEAD_INITIALIZER(ram_list.blocks) };
>
> static MemoryRegion *system_memory;
> +static MemoryRegion *system_private_memory;
> static MemoryRegion *system_io;
>
> AddressSpace address_space_io;
> AddressSpace address_space_memory;
> +AddressSpace address_space_private_memory;
> +
>
> static MemoryRegion io_mem_unassigned;
>
> @@ -2669,6 +2672,11 @@ static void memory_map_init(void)
> memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> address_space_init(&address_space_memory, system_memory, "memory");
>
> + system_private_memory = g_malloc(sizeof(*system_private_memory));
> +
> + memory_region_init(system_private_memory, NULL, "system-private", UINT64_MAX);
> + address_space_init(&address_space_private_memory, system_private_memory, "private-memory");

Since the description is quite scarce, I don't understand why we need to
add this KVM-specific "system-private" MR/AS to all machines on all
architectures.

> system_io = g_malloc(sizeof(*system_io));
> memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io",
> 65536);

(We already want to get rid of the "io" MR/AS, which is specific to
x86-style machines).

> @@ -2680,6 +2688,11 @@ MemoryRegion *get_system_memory(void)
> return system_memory;
> }
>
> +MemoryRegion *get_system_private_memory(void)
> +{
> + return system_private_memory;
> +}
> +
> MemoryRegion *get_system_io(void)
> {
> return system_io;